duttashi / learnr

Exploratory, Inferential and Predictive data analysis. Feel free to show your :heart: by giving a star :star:

License: MIT License

R 100.00%

exploratory-data-analysis inferential-statistics predictive-modeling r

learnr's Introduction

Hi there, I'm Ashish 👋

⚡ I love applied maths, programming, data science, and books

🌱 I’m addicted to learning and growing every day
🌍 I am currently sharing a little of what I know with the world through my blog.
✏️ I am currently working on mixed data clustering
Connect with me on:
- 🏢 LinkedIn
📫 Learn more about me on:
- ✏️ Stories Data Speak
- 🎯 Projects
- 🔈 Research
- 💡 Reviewer


learnr's Issues

Understanding Histograms and Density Plots

A collection of self-curated notes for understanding data visualization techniques.

How to group factor levels?

This question was originally asked on Stack Overflow (SO). I'm reproducing it here for reference:

Suppose a dataset has a factor column with values like:

> mydata                    
   question id           value
1         1  1      not likely
2         2  1      not likely
3         3  1      not likely
4         4  1      not likely
5         5  1 slightly likely
6         1  2     very likely
7         2  2 slightly likely
8         3  2 slightly likely
9         4  2      not likely
10        5  2     very likely

So how do I group the factor levels of the variable value into, say, two levels?
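
One possible approach in base R is to assign a named list to `levels()`. The sketch below assumes that "slightly likely" and "very likely" should be merged into one level; the new level names `unlikely` and `likely` are my own choice for illustration and do not come from the original question.

```r
# Recreate the example data; `value` is stored as a factor.
mydata <- data.frame(
  question = rep(1:5, times = 2),
  id = rep(1:2, each = 5),
  value = factor(c("not likely", "not likely", "not likely", "not likely",
                   "slightly likely", "very likely", "slightly likely",
                   "slightly likely", "not likely", "very likely"))
)

# Assigning a named list to levels() merges old levels into new ones:
# each list name is a new level, each element lists the old levels it absorbs.
levels(mydata$value) <- list(
  unlikely = "not likely",
  likely   = c("slightly likely", "very likely")
)

# Count the observations in each of the two new levels.
table(mydata$value)
```

The same list form scales to any number of target levels; old levels that are not mentioned in the list become NA, so every level should be listed.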

How to replace multiple summarize statements by a custom function?

This question was originally asked on SO. It is reproduced here for reference only.

A minimal example:

library(tidyverse)
col1 <- c("UK", "US", "UK", "US")
col2 <- c("Tech", "Social", "Social", "Tech")
col3 <- c("0-5years", "6-10years", "0-5years", "0-5years")
col4 <- 1:4
col5 <- 5:8

df <- data.frame(col1, col2, col3, col4, col5)

result1 <- df %>% 
  group_by(col1, col2) %>% 
  summarize(sum1 = sum(col4, col5))

result2 <- df %>% 
  group_by(col2, col3) %>% 
  summarize(sum1 = sum(col4, col5))

result3 <- df %>% 
  group_by(col1, col3) %>% 
  summarize(sum1 = sum(col4, col5))
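
A minimal sketch of one way to remove the repetition: wrap the pipeline in a small function and forward the grouping columns through `...`. The helper name `sum_by` is hypothetical, and the `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

df <- data.frame(
  col1 = c("UK", "US", "UK", "US"),
  col2 = c("Tech", "Social", "Social", "Tech"),
  col3 = c("0-5years", "6-10years", "0-5years", "0-5years"),
  col4 = 1:4,
  col5 = 5:8
)

# sum_by() is a hypothetical helper: it forwards the grouping columns
# given in `...` to group_by() and applies the same summary each time.
sum_by <- function(data, ...) {
  data %>%
    group_by(...) %>%
    summarize(sum1 = sum(col4, col5), .groups = "drop")
}

result1 <- sum_by(df, col1, col2)
result2 <- sum_by(df, col2, col3)
result3 <- sum_by(df, col1, col3)
```

Each call replaces one of the three near-identical pipelines, so a change to the summary only has to be made in one place.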

Error: package or namespace load failed for ‘somepackagename’ in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]): there is no package called ‘somepackagename’

Question: At times, an error message like `Error: package or namespace load failed for ‘somepackagename’ in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]): there is no package called ‘somepackagename’` pops up. How can this problem be solved?
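
The message says that a package needed by the one being loaded is not installed in any library R searches. A hedged sketch of one way to check for this up front is shown below; `missing_packages` is a hypothetical helper, and the package names in `needed` are placeholders.

```r
# missing_packages() is a hypothetical helper that returns the names in
# `pkgs` whose namespace cannot be loaded from the libraries in .libPaths().
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("stats", "utils")   # placeholders: list the packages your script loads
to_install <- missing_packages(needed)

# Installing is left commented out so that the sketch has no side effects.
# if (length(to_install) > 0) install.packages(to_install)
```

Installing the package named in the error message with `install.packages()` is usually the direct fix; the helper only makes the check repeatable.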

Set up library path location for installed packages, remove installed packages (clean up) and install only required packages

Last evening I decided to update the installed packages. During the update process, a prompt came up: "Do you want to install from sources the packages which need compilation?" I chose yes, and boy, this messed up everything in RStudio.

Lesson learnt: if such a message pops up, choose No. See this post for reference.

Another problem was that I had never set up the library path for the installed packages, so I needed to set it. Finally, I needed to tweak the Rprofile.site file. On Windows it is located in C:\Program Files\R\R-3.5.0\etc
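
As a rough sketch of setting the library path from within R: the directory below is a hypothetical location inside `tempdir()` so that the example runs anywhere; in practice it would be a permanent folder of your choosing.

```r
# Hypothetical library location; a temporary directory keeps the sketch
# runnable anywhere. In practice this would be a permanent folder.
lib_dir <- file.path(tempdir(), "r-library")
dir.create(lib_dir, showWarnings = FALSE, recursive = TRUE)

# Put the new directory first so install.packages() uses it by default.
.libPaths(c(lib_dir, .libPaths()))

# The first entry is where new packages will be installed.
.libPaths()[1]
```

A call to `.libPaths()` made in a session lasts only for that session; placing the same call in the site profile file is one way to apply it at every startup.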

"warning message: position_dodge requires non-overlapping x intervals", when plotting a boxplot

> str(data_balanced)
'data.frame':	610 obs. of  10 variables:
 $ VisitRsrc  : int  11 22 91 90 41 64 25 61 25 80 ...
 $ raisedhands: int  2 20 90 80 27 62 8 7 15 20 ...

> ggplot(data = data_balanced, aes(x=VisitRsrc, y=raisedhands, fill=gender)) +
+   geom_boxplot()+
+   coord_flip()+
+   scale_fill_discrete(name="Gender")+
+   facet_grid(~Relation)
Warning messages:
1: position_dodge requires non-overlapping x intervals 
2: position_dodge requires non-overlapping x intervals

Issue: The warning "position_dodge requires non-overlapping x intervals" is generated when plotting a box plot. It appears when continuous variables are mapped to both the x and y axes.
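
One way to avoid the warning, sketched under the assumption that the goal is to compare raisedhands across genders, is to map a categorical variable to x so that each box gets its own interval. The data frame `d` below is a made-up stand-in for data_balanced with only the columns used here.

```r
library(ggplot2)

# Hypothetical stand-in for data_balanced, with just the columns needed here.
set.seed(1)
d <- data.frame(
  raisedhands = sample(0:100, 60, replace = TRUE),
  gender = rep(c("F", "M"), each = 30),
  Relation = rep(c("Father", "Mum"), times = 30)
)

# A categorical x gives each box its own non-overlapping interval.
p <- ggplot(d, aes(x = gender, y = raisedhands, fill = gender)) +
  geom_boxplot() +
  coord_flip() +
  scale_fill_discrete(name = "Gender") +
  facet_grid(~Relation)
```

If the relationship between two continuous variables is what matters, a scatter plot with `geom_point()` is usually a better fit than a box plot.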

How to replace values in single or multiple columns using either base R or dplyr?

A data analysis project often requires replacing values in one or more columns. In this question, I will address this from two perspectives:

- using base R
- using mutate from the dplyr package (part of the tidyverse)
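
Both perspectives can be sketched on a small made-up data frame. The data, the sentinel value -1, and the variable names are assumptions for illustration; `across()` assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical example data in which -1 marks a missing measurement.
df <- data.frame(a = c(1, -1, 3), b = c(-1, 5, 6), c = c("x", "y", "z"))

# Base R, single column: logical indexing on the column.
base_single <- df
base_single$a[base_single$a == -1] <- NA

# Base R, multiple columns: index the selected columns with a logical matrix.
base_multi <- df
cols <- c("a", "b")
base_multi[cols][base_multi[cols] == -1] <- NA

# dplyr: mutate() with across() applies the same replacement to each column.
dplyr_multi <- df %>%
  mutate(across(c(a, b), ~ replace(.x, .x == -1, NA)))
```

The base R and dplyr versions produce the same data frame here; which one reads better is mostly a matter of the surrounding code style.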

Summarize every column in a data frame and ignore the missing values

Suppose a given dataframe contains missing values.

# inject random NA values in the mtcars dataset
mtcarsNew <- as.data.frame(lapply(mtcars, function(cc) {
  cc[sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE)]
}))
# total missing values
R> sum(is.na(mtcarsNew))
[1] 44
R> colnames(is.na(mtcarsNew))
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"

How can every column in a dataframe be summarized so that the missing values are ignored?
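
A minimal base R sketch: most summary functions accept `na.rm = TRUE`, which can be passed through `sapply()`. To keep the example deterministic it uses `mtcars_na`, a hand-made stand-in with a few NAs, rather than the randomly generated data.

```r
# Deterministic stand-in for the data with missing values:
# mtcars with a few NAs added by hand.
mtcars_na <- mtcars
mtcars_na$mpg[c(1, 5)] <- NA
mtcars_na$hp[3] <- NA

# na.rm = TRUE is passed through sapply() to mean(), so NAs are dropped
# column by column before averaging.
col_means <- sapply(mtcars_na, mean, na.rm = TRUE)

# summary() also ignores NAs and reports their count per column.
summary(mtcars_na[c("mpg", "hp")])
```

Swapping `mean` for `median`, `sd`, or `sum` works the same way, since each of those accepts `na.rm`.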

How to rename single or multiple column names in a data frame?

A data analysis project often requires renaming one or more columns of a data frame. An example is given below:

# create a data frame
> df<- data.frame(sample(4,size = 4, replace = TRUE),
                sample(4,size = 4, replace = TRUE),
                sample(4,size = 4, replace = TRUE),
                sample(4,size = 4, replace = TRUE)
                )
# show the column names
> colnames(df)
[1] "sample.4..size...4..replace...TRUE."  
[2] "sample.4..size...4..replace...TRUE..1"
[3] "sample.4..size...4..replace...TRUE..2"
[4] "sample.4..size...4..replace...TRUE..3"

As you can see, the column names are unreadable and need to be made more meaningful. How can this be done?
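
A short base R sketch of renaming all columns at once and a single column by position or by name. The new names are arbitrary placeholders chosen for illustration.

```r
df <- data.frame(sample(4, size = 4, replace = TRUE),
                 sample(4, size = 4, replace = TRUE),
                 sample(4, size = 4, replace = TRUE),
                 sample(4, size = 4, replace = TRUE))

# Rename all columns at once (the new names are arbitrary placeholders).
colnames(df) <- c("var1", "var2", "var3", "var4")

# Rename a single column by position ...
colnames(df)[2] <- "second"

# ... or by its current name.
colnames(df)[colnames(df) == "var3"] <- "third"

colnames(df)
```

Renaming by current name is the safer of the two single-column forms, because it keeps working if the column order changes.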

System error Rterm: missing libatk-1.0-0.dll

Yesterday, when I launched RStudio on my computer, I was greeted with the error message below. To the best of my knowledge, I had not changed anything. I'm running RStudio version 1.0.136 and R version 3.3.3.

Rterm.exe - System Error. 
  The program can't start because libatk-1.0-0.dll is missing from your computer. 
  Try reinstalling the program to fix this problem.

Clicking the OK button on the error message does not close it, and RStudio no longer works.

Warning message: In plot.aggr(res, ...) : not enough horizontal space to display frequencies

When plotting missing data using the VIM package, I got the following warning message:
Warning message: In plot.aggr(res, ...) : not enough horizontal space to display frequencies

How to separate Date into year, month and date?

The variable in the dataframe looks like the following:

> str(dengue.train$week_start_date)
 Factor w/ 1049 levels "1990-04-30","1990-05-07",..: 1 2 3 4 5 6 7 8 9 10 ...

As we can see, the variable week_start_date is read in as a factor (that is, as strings). How do I convert it to a date and split it into numeric year, month, and day?
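
A minimal base R sketch, using a three-value stand-in for the factor column: convert the factor to character, parse it with `as.Date()`, then pull out each component with `format()`.

```r
# Small stand-in for dengue.train$week_start_date, stored as a factor.
week_start_date <- factor(c("1990-04-30", "1990-05-07", "1990-05-14"))

# Convert the factor to character first, then parse the text as dates.
dates <- as.Date(as.character(week_start_date), format = "%Y-%m-%d")

# format() extracts each component as text; as.integer() makes it numeric.
parts <- data.frame(
  year  = as.integer(format(dates, "%Y")),
  month = as.integer(format(dates, "%m")),
  day   = as.integer(format(dates, "%d"))
)
```

The resulting columns can be bound back onto the original data frame with `cbind()` if they are needed as features.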

How to collapse rows with same identifier and retain non-empty column values?

This question was originally asked on SO.

Question

How to collapse (or merge?) rows with the same identifier and retain the non-empty (here, any nonzero values) values in each column?

Data

df = data.frame(produce = c("apples","apples", "bananas","bananas"),
                grocery1=c(0,1,1,1),
                grocery2=c(1,0,1,1),
                grocery3=c(0,0,1,1))

Desired output

  produce grocery1 grocery2 grocery3
1  apples        1        1        0
2 bananas        1        1        1
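
One possible base R approach, assuming the grocery columns only hold 0 and 1 as in the example: take the column-wise maximum within each produce group, so a group keeps a 1 wherever any of its rows has one.

```r
df <- data.frame(produce = c("apples", "apples", "bananas", "bananas"),
                 grocery1 = c(0, 1, 1, 1),
                 grocery2 = c(1, 0, 1, 1),
                 grocery3 = c(0, 0, 1, 1))

# For every produce group, take the column-wise maximum: a group keeps a 1
# in a column if any of its rows has a 1 there.
collapsed <- aggregate(. ~ produce, data = df, FUN = max)

collapsed
```

If the columns can hold other nonzero values, `max` would keep the largest one, which may or may not be the intended rule for "retaining" a value.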

How to detect outliers in categorical data

This question was inspired by a StackExchange post that asked how to treat categorical data for outlier detection.

How to sort a dataframe by column(s)?

I want to sort a data.frame by multiple columns. For example, with the data.frame below I would like to sort by column z (descending) then by column b (ascending):

dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"), 
      levels = c("Low", "Med", "Hi"), ordered = TRUE),
      x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
      z = c(1, 1, 1, 2))
dd
    b x y z
1  Hi A 8 1
2 Med D 3 1
3  Hi A 9 1
4 Low C 9 2
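
A base R sketch of the requested sort using `order()`, which accepts several keys and uses later keys to break ties in earlier ones.

```r
dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"),
                            levels = c("Low", "Med", "Hi"), ordered = TRUE),
                 x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
                 z = c(1, 1, 1, 2))

# order() sorts by its first key and breaks ties with the next one.
# Negating the numeric column z gives descending order; b is sorted
# ascending according to its factor levels (Low < Med < Hi).
sorted <- dd[order(-dd$z, dd$b), ]

sorted
```

Negation only works for numeric keys; for a descending sort on a character or factor column, `order(..., decreasing = TRUE)` or negating `xtfrm()` of the column are the usual alternatives.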

Extracting multiple variables from multiple dataframes?

This question was originally asked on SO.

Question: Suppose there are n dataframes (in this case 3). How to extract variables which appear in all n dataframes?

Dataset

df1 <- structure(list(Variable = c("a", "g", "e"), Val = c(0.9, 0.3, 
0.1)), class = "data.frame", row.names = c(NA, -3L))

df2 <- structure(list(Variable = c("h", "a", "e"), Val = c(0.2, 0.7, 
0.9)), class = "data.frame", row.names = c(NA, -3L))

df3 <- structure(list(Variable = c("z", "a", "e"), Val = c(0.5, 0.7, 
0.9)), class = "data.frame", row.names = c(NA, -3L))
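
A minimal sketch of one way to find the shared variables: put the data frames in a list and fold `intersect()` over their Variable columns with `Reduce()`. The names `dfs`, `common`, and `shared` are introduced here for illustration.

```r
df1 <- data.frame(Variable = c("a", "g", "e"), Val = c(0.9, 0.3, 0.1))
df2 <- data.frame(Variable = c("h", "a", "e"), Val = c(0.2, 0.7, 0.9))
df3 <- data.frame(Variable = c("z", "a", "e"), Val = c(0.5, 0.7, 0.9))

dfs <- list(df1, df2, df3)

# Reduce() applies intersect() pairwise across the list, leaving only the
# values of Variable that occur in every data frame.
common <- Reduce(intersect, lapply(dfs, function(d) d$Variable))

# Keep only the rows for the shared variables in each data frame.
shared <- lapply(dfs, function(d) d[d$Variable %in% common, ])
```

Because the data frames live in a list, the same two lines work unchanged for any number of data frames.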

How to determine what plot to draw for a given variable(s)?

In data visualization, we often need to determine the appropriate plot type for a given variable or set of variables.