
moderndive R Package


Overview

The moderndive R package consists of datasets and functions for tidyverse-friendly introductory linear regression. These tools leverage the well-developed tidyverse and broom packages to facilitate

  1. Working with regression tables that include confidence intervals
  2. Accessing regression outputs on an observation level (e.g. fitted/predicted values and residuals)
  3. Inspecting scalar summaries of regression fit (e.g. R-squared, R-squared adjusted, and mean squared error)
  4. Visualizing parallel slopes regression models using ggplot2-like syntax.

This R package is designed to supplement the book “Statistical Inference via Data Science: A ModernDive into R and the Tidyverse” available at ModernDive.com. For more background, read our Journal of Open Source Education paper.

Installation

Get the released version from CRAN:

install.packages("moderndive")

Or the development version from GitHub:

# If you haven't installed remotes yet, do so:
# install.packages("remotes")
remotes::install_github("moderndive/moderndive")

Basic usage

library(moderndive)
score_model <- lm(score ~ age, data = evals)
  1. Get a tidy regression table with confidence intervals:

    get_regression_table(score_model)
    ## # A tibble: 2 × 7
    ##   term      estimate std_error statistic p_value lower_ci upper_ci
    ##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
    ## 1 intercept    4.46      0.127     35.2    0        4.21     4.71 
    ## 2 age         -0.006     0.003     -2.31   0.021   -0.011   -0.001
    
  2. Get information on each point/observation in your regression, including fitted/predicted values and residuals, in a single data frame:

    get_regression_points(score_model)
    ## # A tibble: 463 × 5
    ##       ID score   age score_hat residual
    ##    <int> <dbl> <int>     <dbl>    <dbl>
    ##  1     1   4.7    36      4.25    0.452
    ##  2     2   4.1    36      4.25   -0.148
    ##  3     3   3.9    36      4.25   -0.348
    ##  4     4   4.8    36      4.25    0.552
    ##  5     5   4.6    59      4.11    0.488
    ##  6     6   4.3    59      4.11    0.188
    ##  7     7   2.8    59      4.11   -1.31 
    ##  8     8   4.1    51      4.16   -0.059
    ##  9     9   3.4    51      4.16   -0.759
    ## 10    10   4.5    40      4.22    0.276
    ## # … with 453 more rows
    
  3. Get scalar summaries of a regression fit, including R-squared and adjusted R-squared, as well as the (root) mean squared error:

    get_regression_summaries(score_model)
    ## # A tibble: 1 × 9
    ##   r_squared adj_r_squared   mse  rmse sigma statistic p_value    df  nobs
    ##       <dbl>         <dbl> <dbl> <dbl> <dbl>     <dbl>   <dbl> <dbl> <dbl>
    ## 1     0.011         0.009 0.292 0.540 0.541      5.34   0.021     1   463
    
  4. Visualize parallel slopes models using the geom_parallel_slopes() custom ggplot2 geometry:

    library(ggplot2)
    ggplot(evals, aes(x = age, y = score, color = ethnicity)) +
      geom_point() +
      geom_parallel_slopes(se = FALSE) +
      labs(x = "Age", y = "Teaching score", color = "Ethnicity")

Statement of Need

Linear regression has long been a staple of introductory statistics courses. While the curricula of introductory statistics courses have evolved considerably in recent years, the overall importance of regression remains the same. Furthermore, while the use of the R statistical programming language for statistical analysis is not new, recent developments such as the tidyverse suite of packages have made statistical computation with R accessible to a broader audience. We go one step further by leveraging the tidyverse and broom packages to make linear regression accessible to students taking an introductory statistics course. Such students are likely to be new to statistical computation with R; we designed moderndive with these students in mind.

Contributor code of conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.


Six features

Why should you use the moderndive package for introductory linear regression? Here are six features:

  1. Focus less on p-value stars, more on confidence intervals
  2. Outputs as tibbles
  3. Produce residual analysis plots from scratch using ggplot2
  4. A quick-and-easy Kaggle predictive modeling competition submission!
  5. Visual model selection: plot parallel slopes & interaction regression models
  6. Produce metrics on the quality of regression model fits

Data background

We first discuss the model and data background. The data consists of end of semester student evaluations for a sample of 463 courses taught by 94 professors from the University of Texas at Austin. This data is included in the evals data frame from the moderndive package.

In the following table, we present a subset of 9 of the 14 variables included for a random sample of 5 courses[1]:

  1. ID uniquely identifies the course whereas prof_ID identifies the professor who taught this course. This distinction is important since many professors taught more than one course.
  2. score is the outcome variable of interest: average professor evaluation score out of 5 as given by the students in this course.
  3. The remaining variables are demographic variables describing that course’s instructor, including bty_avg (average “beauty” score) for that professor as given by a panel of 6 students.[2]
ID prof_ID score age bty_avg gender ethnicity language rank
355 71 4.9 50 3.333 male minority english teaching
262 49 4.3 52 3.167 male not minority english tenured
441 89 3.7 35 7.833 female minority english tenure track
51 10 4.3 47 5.500 male not minority english teaching
49 9 4.5 33 4.667 female not minority english tenure track
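
For instance, a random sample like the one above can be drawn with dplyr (a quick sketch; the particular courses returned will differ since the sample is random):

library(dplyr)
evals %>%
  select(ID, prof_ID, score, age, bty_avg, gender, ethnicity, language, rank) %>%
  sample_n(5)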

1. Focus less on p-value stars, more on confidence intervals

We argue that the summary.lm() output is deficient in an introductory statistics setting because:

  1. The Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 only encourage p-hacking. In case you have not yet been convinced of the perniciousness of p-hacking, perhaps comedian John Oliver can convince you.
  2. While not a silver bullet for eliminating misinterpretations of statistical inference, confidence intervals present students with a sense of the associated effect sizes of any explanatory variables. Thus, practical as well as statistical significance is emphasized. These are not included by default in the output of summary.lm().

Instead of summary(), let’s use the get_regression_table() function:

get_regression_table(score_model)
## # A tibble: 2 × 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept    4.46      0.127     35.2    0        4.21     4.71 
## 2 age         -0.006     0.003     -2.31   0.021   -0.011   -0.001

Observe how the p-value stars are omitted and confidence intervals for the point estimates of all regression parameters are included by default. By including them in the output, we can then emphasize to students that they “surround” the point estimates in the estimate column. Note that the confidence level defaults to 95%; this default can be changed using the conf.level argument:

get_regression_table(score_model, conf.level = 0.99)
## # A tibble: 2 × 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept    4.46      0.127     35.2    0        4.13     4.79 
## 2 age         -0.006     0.003     -2.31   0.021   -0.013    0.001

2. Outputs as tibbles

While one might argue that extracting the intercept and slope coefficients can be “simply” done using coefficients(score_model), what about the standard errors? For example, a Google query of “how do I extract standard errors from lm in R” yielded results from the R mailing list and from Cross Validated suggesting we run:

sqrt(diag(vcov(score_model)))
## (Intercept)         age 
## 0.126778499 0.002569157

We argue that this task shouldn’t be this hard, especially in an introductory statistics setting. To rectify this, the three get_regression_* functions all return data frames in the tidyverse-style tibble (tidy table) format. Therefore you can extract columns using the pull() function from the dplyr package:

library(dplyr)
get_regression_table(score_model) %>%
  pull(std_error)
## [1] 0.127 0.003

or equivalently you can use the $ sign operator from base R:

get_regression_table(score_model)$std_error
## [1] 0.127 0.003

Furthermore, by piping the above get_regression_table(score_model) output into the kable() function from the knitr package, you can obtain aesthetically pleasing regression tables in R Markdown documents, instead of tables written in jarring computer output font:

library(knitr)
get_regression_table(score_model) %>%
  kable()
term estimate std_error statistic p_value lower_ci upper_ci
intercept 4.462 0.127 35.195 0.000 4.213 4.711
age -0.006 0.003 -2.311 0.021 -0.011 -0.001

3. Produce residual analysis plots from scratch using ggplot2

How can we extract point-by-point information from a regression model, such as the fitted/predicted values and the residuals? (Note we only display the first 10 out of 463 of such values for brevity’s sake.)

fitted(score_model)
##        1        2        3        4        5        6        7        8 
## 4.248156 4.248156 4.248156 4.248156 4.111577 4.111577 4.111577 4.159083 
##        9       10 
## 4.159083 4.224403
residuals(score_model)
##           1           2           3           4           5           6 
##  0.45184376 -0.14815624 -0.34815624  0.55184376  0.48842294  0.18842294 
##           7           8           9          10 
## -1.31157706 -0.05908286 -0.75908286  0.27559666

But why have the original explanatory/predictor variable age and outcome variable score from evals, the fitted/predicted values score_hat, and the residuals floating around in separate vectors? Since each observation relates to the same course, we argue it makes more sense to organize them together in the same data frame using get_regression_points():

score_model_points <- get_regression_points(score_model)
score_model_points
## # A tibble: 463 × 5
##       ID score   age score_hat residual
##    <int> <dbl> <int>     <dbl>    <dbl>
##  1     1   4.7    36      4.25    0.452
##  2     2   4.1    36      4.25   -0.148
##  3     3   3.9    36      4.25   -0.348
##  4     4   4.8    36      4.25    0.552
##  5     5   4.6    59      4.11    0.488
##  6     6   4.3    59      4.11    0.188
##  7     7   2.8    59      4.11   -1.31 
##  8     8   4.1    51      4.16   -0.059
##  9     9   3.4    51      4.16   -0.759
## 10    10   4.5    40      4.22    0.276
## # … with 453 more rows

Observe that the original outcome variable score and explanatory/predictor variable age are now supplemented with the fitted/predicted values in score_hat and the residual columns. By putting the fitted/predicted values and residuals next to the original data, we argue that the computation of these values is less opaque. For example, instructors can emphasize how all values in the first row of output are computed, as sketched below.
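
For instance, using the more precise coefficient values reported by summary.lm() in feature 6 below (4.461932 for the intercept and -0.005938 for the slope), the first row's fitted value and residual can be recomputed by hand:

# Fitted value for row 1: intercept + slope * age of 36
4.461932 + (-0.005938) * 36  # roughly 4.248, i.e. the score_hat of 4.25
# Residual for row 1: observed score minus fitted value
4.7 - 4.248                  # roughly 0.452, i.e. the residual of 0.452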

Furthermore, recall that since all outputs in the moderndive package are tibble data frames, custom residual analysis plots can be created instead of relying on the default plots yielded by plot.lm(). For example, we can check the normality of the residuals using the histogram shown below.

# Code to visualize distribution of residuals:
ggplot(score_model_points, aes(x = residual)) +
  geom_histogram(bins = 20) +
  labs(x = "Residual", y = "Count")

Histogram visualizing distribution of residuals.

As another example, we can investigate potential relationships between the residuals and all explanatory/predictor variables, and check for the presence of heteroskedasticity, using partial residual plots such as the partial residual plot over age shown below. If the term “heteroskedasticity” is new to you, it refers to the variability of one variable being unequal across the range of values of another variable. The presence of heteroskedasticity violates one of the assumptions of inference for linear regression.

# Code to visualize partial residual plot over age:
ggplot(score_model_points, aes(x = age, y = residual)) +
  geom_point() +
  labs(x = "Age", y = "Residual")

Partial residual plot over age.

4. A quick-and-easy Kaggle predictive modeling competition submission!

With the fields of machine learning and artificial intelligence gaining prominence, the importance of predictive modeling cannot be overstated. Therefore, we've designed the get_regression_points() function to accept a newdata argument so that a previously fitted model can quickly be applied to new observations.

Let's create an artificial “new” dataset consisting of two instructors aged 39 and 42 and save it in a tibble data frame called new_prof. We then set the newdata argument of get_regression_points() to apply our previously fitted model score_model to this new data, where score_hat holds the corresponding fitted/predicted values.

new_prof <- tibble(age = c(39, 42))
get_regression_points(score_model, newdata = new_prof)
## # A tibble: 2 × 3
##      ID   age score_hat
##   <int> <dbl>     <dbl>
## 1     1    39      4.23
## 2     2    42      4.21

Let's do another example, this time using the Kaggle House Prices: Advanced Regression Techniques practice competition (its homepage is shown below).

House prices Kaggle competition homepage.


This Kaggle competition requires you to fit/train a model to the provided train.csv training set to make predictions of house prices in the provided test.csv test set. We present an application of the get_regression_points() function allowing students to participate in this Kaggle competition. It will:

  1. Read in the training and test data.
  2. Fit a naive model of house sale price as a function of year sold to the training data.
  3. Make predictions on the test data using get_regression_points() and write them to a submission.csv file that can be submitted to Kaggle. Note the use of the ID argument to use the Id variable in test to identify the rows (a requirement of Kaggle competition submissions).
library(readr)
library(dplyr)
library(moderndive)

# Load in training and test set
train <- read_csv("https://moderndive.com/data/train.csv")
test <- read_csv("https://moderndive.com/data/test.csv")

# Fit model:
house_model <- lm(SalePrice ~ YrSold, data = train)

# Make predictions and save in appropriate data frame format:
submission <- house_model %>%
  get_regression_points(newdata = test, ID = "Id") %>%
  select(Id, SalePrice = SalePrice_hat)

# Write predictions to csv:
write_csv(submission, "submission.csv")

After submitting submission.csv to the leaderboard for this Kaggle competition, we obtain a “root mean squared logarithmic error” (RMSLE) score of 0.42918, as shown below.

Resulting Kaggle RMSLE score.


5. Visual model selection: plot parallel slopes & interaction regression models

Recall the earlier visualizations of the interaction and parallel slopes models for teaching score as a function of age and ethnicity. Let's present both visualizations side-by-side below.

Interaction (left) and parallel slopes (right) models.

Students might be wondering “Why would you use the parallel slopes model on the right when the data clearly form an ‘X’ pattern, as seen in the interaction model on the left?” This is an excellent opportunity to gently introduce the notion of model selection and Occam’s Razor: an interaction model should only be used over a parallel slopes model if the additional complexity of the interaction model is warranted. Here, we define model “complexity/simplicity” in terms of the number of parameters in the corresponding regression tables:

# Regression table for interaction model:
interaction_evals <- lm(score ~ age * ethnicity, data = evals)
get_regression_table(interaction_evals)
## # A tibble: 4 × 7
##   term                    estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                      <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept                  2.61      0.518      5.04   0        1.59     3.63 
## 2 age                        0.032     0.011      2.84   0.005    0.01     0.054
## 3 ethnicity: not minority    2.00      0.534      3.74   0        0.945    3.04 
## 4 age:ethnicitynot minor…   -0.04      0.012     -3.51   0       -0.063   -0.018
# Regression table for parallel slopes model:
parallel_slopes_evals <- lm(score ~ age + ethnicity, data = evals)
get_regression_table(parallel_slopes_evals)
## # A tibble: 3 × 7
##   term                    estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                      <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept                  4.37      0.136     32.1    0        4.1      4.63 
## 2 age                       -0.006     0.003     -2.5    0.013   -0.012   -0.001
## 3 ethnicity: not minority    0.138     0.073      1.89   0.059   -0.005    0.282

The interaction model is “more complex” as evidenced by its regression table involving 4 rows of parameter estimates, whereas the parallel slopes model is “simpler” as evidenced by its regression table involving only 3 parameter estimates. It can be argued, however, that this additional complexity is warranted given the clearly different slopes in the left-hand interaction plot above.

We now present a contrasting example, this time from Subsection 6.3.1 of the online version of ModernDive, involving Massachusetts (USA) public high schools.[3] Let's plot both the interaction and parallel slopes models below.

# Code to plot interaction and parallel slopes models for MA_schools
ggplot(
  MA_schools,
  aes(x = perc_disadvan, y = average_sat_math, color = size)
) +
  geom_point(alpha = 0.25) +
  labs(
    x = "% economically disadvantaged",
    y = "Math SAT Score",
    color = "School size"
  ) +
  geom_smooth(method = "lm", se = FALSE)

ggplot(
  MA_schools,
  aes(x = perc_disadvan, y = average_sat_math, color = size)
) +
  geom_point(alpha = 0.25) +
  labs(
    x = "% economically disadvantaged",
    y = "Math SAT Score",
    color = "School size"
  ) +
  geom_parallel_slopes(se = FALSE)

Interaction (left) and parallel slopes (right) models.

In terms of the corresponding regression tables, observe that the table for the parallel slopes model has 4 rows as opposed to 6 for the interaction model, reflecting its higher degree of “model simplicity.”

# Regression table for interaction model:
interaction_MA <-
  lm(average_sat_math ~ perc_disadvan * size, data = MA_schools)
get_regression_table(interaction_MA)
## # A tibble: 6 × 7
##   term                    estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                      <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept                594.       13.3      44.7     0      568.     620.   
## 2 perc_disadvan             -2.93      0.294    -9.96    0       -3.51    -2.35 
## 3 size: medium             -17.8      15.8      -1.12    0.263  -48.9     13.4  
## 4 size: large              -13.3      13.8      -0.962   0.337  -40.5     13.9  
## 5 perc_disadvan:sizemedi…    0.146     0.371     0.393   0.694   -0.585    0.877
## 6 perc_disadvan:sizelarge    0.189     0.323     0.586   0.559   -0.446    0.824
# Regression table for parallel slopes model:
parallel_slopes_MA <-
  lm(average_sat_math ~ perc_disadvan + size, data = MA_schools)
get_regression_table(parallel_slopes_MA)
## # A tibble: 4 × 7
##   term          estimate std_error statistic p_value lower_ci upper_ci
##   <chr>            <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept       588.       7.61     77.3     0       573.     603.  
## 2 perc_disadvan    -2.78     0.106   -26.1     0        -2.99    -2.57
## 3 size: medium    -11.9      7.54     -1.58    0.115   -26.7      2.91
## 4 size: large      -6.36     6.92     -0.919   0.359   -20.0      7.26

Unlike our earlier comparison of interaction and parallel slopes models, in this case it could be argued that the additional complexity of the interaction model is not warranted since the three regression lines in the left-hand interaction plot are already somewhat parallel. Therefore the simpler parallel slopes model should be favored.

Going one step further, notice how the three regression lines in the visualization of the parallel slopes model in the right-hand plot above have similar intercepts. It can thus be argued that the additional model complexity induced by introducing the categorical variable school size is not warranted. Therefore, a simple linear regression model using only perc_disadvan, the percentage of the student body that is economically disadvantaged, should be favored, as sketched below.
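
As a sketch, such a simpler model could be fit and summarized in the same way as before (the object name simple_MA is arbitrary; output not shown):

# Regression table for simple regression model:
simple_MA <- lm(average_sat_math ~ perc_disadvan, data = MA_schools)
get_regression_table(simple_MA)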

While many students will inevitably find these results depressing, in our opinion, it is important to additionally emphasize that such regression analyses can be used as an empowering tool to bring to light inequities in access to education and inform policy decisions.

6. Produce metrics on the quality of regression model fits

Recall the output of the standard summary.lm() function:

## 
## Call:
## lm(formula = score ~ age, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9185 -0.3531  0.1172  0.4172  0.8825 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.461932   0.126778  35.195   <2e-16 ***
## age         -0.005938   0.002569  -2.311   0.0213 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5413 on 461 degrees of freedom
## Multiple R-squared:  0.01146,    Adjusted R-squared:  0.009311 
## F-statistic: 5.342 on 1 and 461 DF,  p-value: 0.02125

Say we wanted to extract the scalar model summaries at the bottom of this output, such as R-squared, R-squared adjusted, the F-statistic, and the degrees of freedom df. We can do so using the get_regression_summaries() function.

get_regression_summaries(score_model)
## # A tibble: 1 × 9
##   r_squared adj_r_squared   mse  rmse sigma statistic p_value    df  nobs
##       <dbl>         <dbl> <dbl> <dbl> <dbl>     <dbl>   <dbl> <dbl> <dbl>
## 1     0.011         0.009 0.292 0.540 0.541      5.34   0.021     1   463

We’ve supplemented the standard scalar summaries output yielded by summary() with the mean squared error mse and root mean squared error rmse given their popularity in machine learning settings.
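
These two additional columns relate to the residuals returned by get_regression_points() in a direct way; as a quick sketch, the following should reproduce the reported mse of 0.292 and rmse of 0.540 up to rounding:

score_model_points <- get_regression_points(score_model)
mean(score_model_points$residual^2)        # mse: mean of the squared residuals
sqrt(mean(score_model_points$residual^2))  # rmse: square root of the mse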

The inner workings

How does this all work? Let’s open the hood of the moderndive package.

Three wrappers to broom functions

As we mentioned earlier, the three get_regression_* functions are wrappers around functions from the broom package that convert statistical analysis objects into tidy tibbles, with a few added tweaks made with the introductory statistics student in mind:

  1. get_regression_table() is a wrapper for broom::tidy().
  2. get_regression_points() is a wrapper for broom::augment().
  3. get_regression_summaries() is a wrapper for broom::glance().

Why did we take this approach to addressing common student questions?

  1. By writing wrappers to pre-existing functions instead of creating new custom functions, there is minimal re-inventing of the wheel.
  2. In our experience, novice R users have a hard time understanding the broom function names tidy(), augment(), and glance(). To make them more user-friendly, the moderndive wrappers have the more intuitive names get_regression_table(), get_regression_points(), and get_regression_summaries().
  3. The variables included in the outputs of the above 3 broom functions are not all applicable to introductory statistics students, and of those that are, we found some were not intuitively named for new R users. We therefore cut some of the variables from the output and renamed some of the remaining ones. For example, compare the outputs of the get_regression_points() wrapper function and the parent broom::augment() function.
get_regression_points(score_model)
broom::augment(score_model)

The source code for these three get_regression_* functions can be found here.
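
To give a flavor of what such a wrapper involves, here is a minimal sketch of the idea behind get_regression_table() (illustrative only; the actual function also relabels the intercept term, rounds values, and handles other details):

get_regression_table_sketch <- function(model, conf.level = 0.95) {
  # Tidy the model with confidence intervals, then rename columns for students
  broom::tidy(model, conf.int = TRUE, conf.level = conf.level) %>%
    dplyr::rename(
      std_error = std.error, p_value = p.value,
      lower_ci = conf.low, upper_ci = conf.high
    )
}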

Custom geometries

geom_parallel_slopes() is a custom-built geom extension to the ggplot2 package. The ggplot2 webpage gives instructions on how to create such extensions. The source code for geom_parallel_slopes(), written by Evgeni Chasnovski, can be found on GitHub.

  1. For details on the remaining 5 variables, see the help file by running ?evals.

  2. Note that gender was collected as a binary variable at the time of the study (2005).

  3. For more details on this dataset, see the help file by running ?MA_schools.

moderndive's People

Contributors

ajhaller, alejanmg, arrismo, beanumber, caroline-mckenna, catherinepeppers, echasnovski, evejcik, georgiagans, hartlegr, heschmidt, hongtonglin, i-m-foster, ismayc, kaceyjj, katelyndiaz, kbruncati, lwjohnst86, mflesaker, nikkischuldt, q-w-a, rporta23, rudeboybert, silasweden, statsmed-sheep, swaha294, tessgold, wjhopper, wndlovu, zyang2k


moderndive's Issues

Add ORCIDs

Hey @andrewpbray @DelaneyMoran @echasnovski @wjhopper, I'm going to add @ismayc's and my ORCIDs to this package's DESCRIPTION file as seen here. That way they'll show up as clickable green icons on the package's CRAN page. Look at the broom package's CRAN page as an example.

In case you don't know what ORCIDs are, think of them as a social security number for researchers. For example, here is mine. If you don't have one, you can create yours at https://orcid.org/.

If you would like yours to be added to the package, by Thu 7/16 either:

  • Make a pull request to the DESCRIPTION file on master, mimicking this format
  • Or just paste your ORCID below and I'll add it myself.

thanks!

(infrastructure, optional comment) Consider moving dataset documentation into one file

Part of JOSE review openjournals/jose-reviews#115

This is more for an infrastructural/developer point of view, but it is a bit confusing having the dataset documentation all over the place. There is the datasets.R file which intuitively means that the datasets are documented there, but then there are pennies.R etc. This is entirely optional and organizational, but maybe consider moving the dataset documentation into one file rather than multiple.

Add ID variable specification to get_regression_points

Right now, the get_regression_points() function just creates a column ID of 1:nrow(data) to ID the rows. Modify this to allow for get_regression_points(model, data, ID = ID_VARIABLE_NAMES) specification of ID variable to be included in output.

For example, in nycflights13::weather, the combination of year, month, day, hour, origin is an ID variable that uniquely identifies each observational unit in each row; it would be great to have a way for these five variables to be included in the output of get_regression_points().

Change format of house_prices$date to date

Currently date is a dttm (POSIXct) column; convert it to Date with as.Date().

suppressPackageStartupMessages(library(tidyverse))
#> Warning: package 'dplyr' was built under R version 3.5.1
library(moderndive)
house_prices <- house_prices %>% 
  select(date)

# As is:
glimpse(house_prices)
#> Observations: 21,613
#> Variables: 1
#> $ date <dttm> 2014-10-13, 2014-12-09, 2015-02-25, 2014-12-09, 2015-02-...

# Desired:
house_prices <- house_prices %>% 
  mutate(date = as.Date(date))
glimpse(house_prices)
#> Observations: 21,613
#> Variables: 1
#> $ date <date> 2014-10-13, 2014-12-09, 2015-02-25, 2014-12-09, 2015-02-...

Created on 2018-07-08 by the reprex package (v0.2.0.9000).

Remove examples from dataset documentations

Part of the JOSE review openjournals/jose-reviews#115

The examples in the roxygen docs for the datasets aren't really necessary and generally increase the build time of the website/package checks without adding much value. It also isn't a common part of package docs anyway. I'd suggest removing them. The added benefit of removing them is that the ragg package dependency in #105 can be removed (which is needed to generate those documentation pages).

Larger comments on paper.md

Part of JOSE review: openjournals/jose-reviews#115

Overall, really nice! I like the reference to John Oliver! 😛 Here are some broader comments on it.

  • Similar to @lisamr's comments, the paper is a bit too long and much of the "meat" could be linked back to the website.
  • One of the biggest pieces of feedback is to limit duplication of content between the README, vignette, and paper. Keep the content separate.
    • For instance, I see you have _build.sh, which builds the paper from the vignettes/why-moderndive.Rmd. There's no need for duplication. If most of the information is available in the vignette, link back to the vignette from the paper. I'd keep the paper as minimal as possible and refer to the vignette mostly.
    • There are small differences that should be found between the vignette and the paper (e.g. Statement of Need and Contributions sections in paper but not vignette).
    • You don't really need code output in the paper since it is already in the vignette.
    • By splitting out the vignette text from the paper text, then you can delete the vignette/README.md and the vignette/_build.sh files.
  • I wouldn't recommend saving the JOSE pdf in your repo, since it is already available on GitHub anyway.

A big plus is with emphasizing CI and reducing the p-value interpretation! Very nice 😁 (I personally would remove the p-value entirely, but that's a different debate).

Fix get_regression_points() to reflect update of broom to v0.7.0

A new warning message is output, most likely due to broom updates. This is causing this testthat line to fail.

library(tidyverse)
library(moderndive)

# Clean data
mtcars <- mtcars %>%
  rownames_to_column(var = "automobile") %>%
  mutate(cyl = as.factor(cyl))

# Fit model
mpg_mlr_model2 <- lm(mpg ~ hp + cyl, data = mtcars)

# newdata with no outcome variable, hence no residuals should be returned
newcars <- mtcars %>% 
  slice(1:3) %>%
  select(-mpg)

# New warning message
get_regression_points(mpg_mlr_model2, newdata = newcars)
#> Warning: 'newdata' had 3 rows but variables found have 234 rows
#> # A tibble: 3 x 4
#>      ID    hp cyl   mpg_hat
#>   <int> <dbl> <fct>   <dbl>
#> 1     1   110 6        20.0
#> 2     2   110 6        20.0
#> 3     3    93 4        26.4

Created on 2020-07-13 by the reprex package (v0.3.0)

Separate model fitting step from get_regression_x() functions

library(moderndive)
library(dplyr)
mtcars <- mtcars %>% 
  mutate(cyl = as.factor(cyl))

# instead of doing this
get_regression_table(mpg ~ hp, data = mtcars, digits = 3, print = FALSE)
get_regression_points(mpg ~ hp + cyl, data = mtcars)
get_regression_summaries(mpg ~ hp, data = mtcars)

# do this
mpg_model <- lm(mpg ~ hp, data = mtcars)
get_regression_table(mpg_model, digits = 3, print = FALSE)
get_regression_points(mpg_model)
get_regression_summaries(mpg_model)

Function for 3D scatterplots with regression planes

While I'm not a fan of creating new wrapper/helper functions rather than using raw dplyr and ggplot2 code, in cases like the get_regression_X() and get_correlation() functions the bloat of a wrapper function is worth the simplicity gained. Another case where wrapper-function bloat might be warranted is a function wrapping plotly code to create interactive 3D scatterplots with regression planes, like the one previewed below and whose interactive version can be seen here.

03-01-slides-3d_scatterplot_regression_plane

Leave variable names intact in output table + need for ID variable

library(gapminder)
library(tidyverse)
library(moderndive)
gapminder2007 <- gapminder %>% filter(year == 2007) %>% select(country, continent, 
  lifeExp, gdpPercap)

# country ID's the observational units
gapminder2007
#> # A tibble: 142 x 4
#>        country continent lifeExp  gdpPercap
#>         <fctr>    <fctr>   <dbl>      <dbl>
#>  1 Afghanistan      Asia  43.828   974.5803
#>  2     Albania    Europe  76.423  5937.0295
#>  3     Algeria    Africa  72.301  6223.3675
#>  4      Angola    Africa  42.731  4797.2313
#>  5   Argentina  Americas  75.320 12779.3796
#>  6   Australia   Oceania  81.235 34435.3674
#>  7     Austria    Europe  79.829 36126.4927
#>  8     Bahrain      Asia  75.635 29796.0483
#>  9  Bangladesh      Asia  64.062  1391.2538
#> 10     Belgium    Europe  79.441 33692.6051
#> # ... with 132 more rows

get_regression_points(lifeExp ~ continent, data = gapminder2007)
#> # A tibble: 142 x 4
#>    lifeexp continent lifeexp_hat residual
#>      <dbl>    <fctr>       <dbl>    <dbl>
#>  1  43.828      Asia      70.728  -26.900
#>  2  76.423    Europe      77.649   -1.226
#>  3  72.301    Africa      54.806   17.495
#>  4  42.731    Africa      54.806  -12.075
#>  5  75.320  Americas      73.608    1.712
#>  6  81.235   Oceania      80.719    0.515
#>  7  79.829    Europe      77.649    2.180
#>  8  75.635      Asia      70.728    4.907
#>  9  64.062      Asia      70.728   -6.666
#> 10  79.441    Europe      77.649    1.792
#> # ... with 132 more rows

# 1. Variable names should be left intact: lifeExp and lifeExp_hat
# 2. ID variable is lost. Need something like:
# get_regression_points(lifeExp ~ continent, data = gapminder2007, id_var = country)

Change conf_low and conf_high

conf_low and conf_high seem like strange names from {broom}. Renaming the columns to be lower_ci and upper_ci seems more consistent with the literature.

Allow for get_correlation() to use group_by() metadata

Update get_correlation() to allow for grouped correlations. Consider the following reprex of computing correlation between median income of census tract and number of Dunkin Donuts and Starbucks (currently in help file for DD_vs_SB):

library(moderndive)
library(dplyr)

# This works:
DD_vs_SB %>% 
  mutate(shops_per_1000 = 1000 * shops/population) %>% 
  filter(!is.na(median_income)) %>% 
  group_by(shop_type) %>% 
  summarize(cor = cor(median_income, shops_per_1000))
#> # A tibble: 2 x 2
#>   shop_type         cor
#>   <chr>           <dbl>
#> 1 dunkin_donuts -0.0575
#> 2 starbucks      0.0980

# This doesn't: 
DD_vs_SB %>% 
  mutate(shops_per_1000 = 1000 * shops/population) %>% 
  filter(!is.na(median_income)) %>% 
  group_by(shop_type) %>% 
  get_correlation(shops_per_1000 ~ median_income)
#> # A tibble: 1 x 1
#>   correlation
#>         <dbl>
#> 1   0.0000712

Created on 2018-12-29 by the reprex package (v0.2.0.9000).

get_regression_points() errors when variable is defined on-the-fly with lm()

Thanks to @jamesonwatts for bringing this forward.

get_regression_points() produces an error in this situation.

library(moderndive)
iris_model <- lm(formula = Sepal.Width ~ Sepal.Length + (Petal.Length > 1.5),
                 data = iris)
get_regression_points(iris_model)
#> Error: Unknown column `Petal.Length`

Digging into the {moderndive} code for the function, it looks like broom::augment() is doing something unexpected with the column names, converting all "non-standard" characters to periods instead.

iris_model %>%
   augment() %>%
   names()
#> [1] "Sepal.Width" "Sepal.Length" "Petal.Length...1.5"
#> [4] ".fitted" ".se.fit" ".resid"
#> [7] ".hat" ".sigma" ".cooksd"
#> [10] ".std.resid"

It might be best to switch the explanatory_variable definition here to be explanatory_variable <- labels(terms(model)) instead as well since it doesn't catch Petal.Length > 1.5 here but only grabs Petal.Length in its current implementation.
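
For comparison, labels(terms(iris_model)) keeps the full term intact:

labels(terms(iris_model))
# returns the term labels, including "Petal.Length > 1.5" as a single entry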

rhub checks returning error

When running devtools::check_rhub(), the check returns the error: there is no package called 'utf8'.

As suggested here, the workaround is to check via:

devtools::check_rhub(env_vars=c(R_COMPILE_AND_INSTALL_PACKAGES = "always"))

Add sampling distribution functionality

library(statsr)
library(tidyverse)
N <- 2400
tub <- data_frame(
  ball_ID = 1:N,
  color = c(rep("red", 900), rep("white", N-900))
)

p_hats <- tub %>%
  statsr::rep_sample_n(size=50, reps=10) %>% 
  group_by(replicate) %>% 
  summarize(prop_red = mean(color == "red"))
  
ggplot(p_hats, aes(x = prop_red)) + 
  geom_histogram(binwidth = 0.05)

Getting residuals for predictions when we have outcome variable

Update get_regression_points() so that when it detects the outcome variable is present in newdata, the data frame output format is identical to when newdata is not specified, i.e. it has a residual column. For example:

suppressPackageStartupMessages(library(tidyverse))
#> Warning: package 'tidyr' was built under R version 3.4.4
#> Warning: package 'purrr' was built under R version 3.4.4
#> Warning: package 'dplyr' was built under R version 3.4.4
library(moderndive)
house_prices <- house_prices %>%
  mutate(
    log10_price = log10(price),
    log10_sqft_living = log10(sqft_living)
  )
#> Warning: package 'bindrcpp' was built under R version 3.4.4

# New data
new_houses <- data_frame(
  log10_sqft_living = c(2.9, 3.6),
  condition = factor(c(3, 4))
)

# Train/test split
train <- house_prices %>%
  slice(1:10000)
test <- house_prices %>%
  slice(10001:21613)

# Fit models to training
model_price_3 <- lm(log10_price ~ log10_sqft_living + condition,
                    data = train)

# Three different outputs. Make second look like first since it has outcome
# variable, but third doesn't
get_regression_points(model_price_3)
#> # A tibble: 10,000 x 6
#>       ID log10_price log10_sqft_living condition log10_price_hat residual
#>    <int>       <dbl>             <dbl> <fct>               <dbl>    <dbl>
#>  1     1        5.35              3.07 3                    5.48  -0.136 
#>  2     2        5.73              3.41 3                    5.76  -0.0330
#>  3     3        5.26              2.89 3                    5.33  -0.0720
#>  4     4        5.78              3.29 5                    5.73   0.0470
#>  5     5        5.71              3.22 3                    5.61   0.0970
#>  6     6        6.09              3.73 3                    6.04   0.0530
#>  7     7        5.41              3.23 3                    5.62  -0.207 
#>  8     8        5.46              3.02 3                    5.44   0.0220
#>  9     9        5.36              3.25 3                    5.63  -0.270 
#> 10    10        5.51              3.28 3                    5.65  -0.144 
#> # ... with 9,990 more rows
get_regression_points(model_price_3, newdata = test)
#> # A tibble: 11,613 x 3
#>    log10_sqft_living condition log10_price_hat
#>                <dbl> <fct>               <dbl>
#>  1              3.28 3                    5.66
#>  2              3.50 5                    5.91
#>  3              3.30 4                    5.69
#>  4              3.51 4                    5.86
#>  5              3.37 3                    5.73
#>  6              3.28 3                    5.66
#>  7              2.92 4                    5.38
#>  8              3.42 3                    5.77
#>  9              3.09 3                    5.49
#> 10              3.02 4                    5.46
#> # ... with 11,603 more rows
get_regression_points(model_price_3, newdata = new_houses)
#> # A tibble: 2 x 3
#>   log10_sqft_living condition log10_price_hat
#>               <dbl> <fct>               <dbl>
#> 1              2.90 3                    5.34
#> 2              3.60 4                    5.94

Created on 2018-06-03 by the reprex package (v0.2.0).

Add newdata argument to get_regression_points()

Since broom::augment(lm_object, newdata = df) returns fitted/predicted y's for a new data frame df, so also should moderndive::get_regression_points() allow for a newdata = df argument.

Add RMSE to get_regression_summary()

suppressPackageStartupMessages(library(dplyr))
library(moderndive)
load(url("http://www.openintro.org/stat/data/evals.RData"))

# Fit model
score_age_model <- lm(score ~ age, data = evals)

# Wrappers for broom tidy, augment, and glance:
# get_regression_table(score_age_model)
# get_regression_points(score_age_model)
get_regression_summaries(score_age_model)
#> # A tibble: 1 x 6
#>   r_squared adj_r_squared sigma statistic p_value    df
#>       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>
#> 1    0.0110       0.00900 0.541      5.34  0.0210  2.00

Add RMSE here.

Created on 2018-03-02 by the reprex package (v0.2.0).

Suggestion: Include CI for get_correlation function

@oscci: For some reason, perhaps historical, it is unusual for people to include 95% CI with correlations. Instead they just discuss whether 'significant' or not.
If the get_correlation function automatically gave you the 95% CI, it would help educate people to stop interpreting correlation coefficients as precise estimates, particularly with small N.

@rudeboybert: It is very true people often misinterpret sample correlation with population correlation. We'll keep your suggestion in mind when we get around to improving get_correlation(), as there are a host of other improvements needed #29 #32 #38

Update get_regression_points() to accept all broom::augment() arguments

After we update get_regression_points() to accept all broom::augment() arguments, we can easy make predictions on a new dataset. Or perhaps define a new 4th moderndive::get_regression_predictions() function

suppressPackageStartupMessages(library(tidyverse))
library(moderndive)
house_prices <- house_prices %>%
  mutate(
    log10_price = log10(price),
    log10_sqft_living = log10(sqft_living)
  )

# Create training and test sets
train_set <- house_prices %>%
  sample_frac(0.7)
test_set <- house_prices %>%
  anti_join(train_set, by = "id")

# Fit on training data
price_model <- lm(log10_price ~ log10_sqft_living + condition, data = train_set)

# Predict test data using broom
predicted_points <- price_model %>%
  broom::augment(newdata = test_set) %>%
  as_tibble() %>%
  select(log10_price, .fitted)

# Should be able to accept all arguments to broom::augment()
get_regression_points(price_model, newdata = test_set)
#> Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 6400, 15129

Created on 2018-04-20 by the reprex package (v0.2.0).

Missing values in get_correlation

Right now, if there is missing data in either the RHS or LHS of the formula in get_correlation(), R returns NA.

library(tidyverse)
library(moderndive)

df <- data_frame(thing1 = c(1:9, NA),
                 thing2 = rnorm(10))

df %>% 
  get_correlation(thing2 ~ thing1)
#> # A tibble: 1 x 1
#>   correlation
#>         <dbl>
#> 1          NA

This is because within get_correlation(), cor() is called without additional arguments like use = "complete.obs"

A few of my students have been confused by this when they use the function on their own data (fortunately all the data in ModernDive is NA-free). I've made them filter out missing values first:

df %>% 
  filter(!is.na(thing1), !is.na(thing2)) %>% 
  get_correlation(thing2 ~ thing1)
#> # A tibble: 1 x 1
#>   correlation
#>         <dbl>
#> 1       0.156

It would be fairly easy to include a ... argument to get_correlation() that would feed extra arguments to the main cor() function inside. That way it'd be possible to do something like df %>% get_correlation(y ~ x, use = "complete.obs")
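
A rough sketch of how such a dots pass-through might look (hypothetical code, not the actual get_correlation() implementation):

get_correlation_with_dots <- function(data, formula, ...) {
  lhs <- all.vars(formula)[1]  # outcome variable name
  rhs <- all.vars(formula)[2]  # explanatory variable name
  data %>%
    summarize(correlation = cor(.data[[lhs]], .data[[rhs]], ...))
}

df %>%
  get_correlation_with_dots(thing2 ~ thing1, use = "complete.obs")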

However, this adds additional complexity to what was probably designed to be a simple wrapper function. I'm not sure what the best solution is for this though. I see at least four possible options, each with pros and cons:

  • Add a dots argument to get_correlation()
    • Pros: gives get_correlation() the full power of cor()
    • Cons: too much complexity
  • Add a new argument to get_correlation() called na.rm or something. If it's set to true, run cor() with use = "complete.obs"; if it's false, run cor() like normal
    • Pros: is simpler
    • Cons: uses nonstandard syntax
  • Make students filter out missing data first with filter(!is.na(variable))
    • Pros: students should already be familiar with the syntax from earlier MD chapters + it makes the idea of removing missing values explicit
    • Cons: if #29 is resolved and multiple RHS variables are allowed, thus returning a correlation matrix, students will need to filter each of the variables individually first (i.e. filter(!is.na(var1), !is.na(var2), !is.na(varn)))
  • Make students not use get_correlation() when there's missing data and instead use standard dplyr syntax like df %>% summarize(correlation = cor(y, x, use = "complete.obs"))
    • Pros: fits within the dplyr paradigm and shows how to calculate correlation without the MD package
    • Cons: doesn't use the cool wrapper function anymore

Allow for more than one RHS variable in get_correlation()

suppressPackageStartupMessages(library(tidyverse))
library(ISLR)
library(moderndive)
library(corrr)

# One RHS var:
Credit %>% 
  get_correlation(Balance ~ Limit)
#> # A tibble: 1 x 1
#>   correlation
#>         <dbl>
#> 1       0.862

# Doesn't allow for two+:
Credit %>% 
  get_correlation(Balance ~ Limit + Credit)
#> Error in check_formula_args(data, formula, outcome_variable, explanatory_variable): The left hand side of the `formula` should only have one variable name

# Perhaps something like this:
Credit %>% 
  select(Balance, Limit, Income) %>% 
  corrr::correlate()
#> # A tibble: 3 x 4
#>   rowname Balance  Limit Income
#>   <chr>     <dbl>  <dbl>  <dbl>
#> 1 Balance  NA      0.862  0.464
#> 2 Limit     0.862 NA      0.792
#> 3 Income    0.464  0.792 NA

Created on 2018-07-13 by the reprex package (v0.2.0.9000).

Fix digits rounding

library(moderndive)
library(tidyverse)
house_prices <- house_prices %>%
  mutate(
    log10_price = log10(price),
    log10_sqft_living = log10(sqft_living)
    )
model_price_1 <- lm(log10_price ~ log10_sqft_living + yr_built,
                    data = house_prices)
get_regression_table(model_price_1, digits = 7)

geom_parallel_slopes but for one categorical explanatory variable models?

One visualization tool I thought could enhance the package/book would be a function to visualize the fitted values of a regression model with one categorical explanatory variable. Even though the visualization would just be horizontal line segments, I think providing this visualization would reinforce the similarities between regression using a numeric explanatory variable and a categorical explanatory variable.

I think the final visualization would look something like this (but without the lines extending across the entire x axis).

categorical_regression_lines

The function I'm proposing would just add the horizontal lines to the existing ggplot, the same way geom_smooth and geom_parallel_slopes do.

I originally thought of this as an extension of functionality for geom_parallel_slopes but I realized it might be confusing to have the parallel slopes idea come in before the multiple regression situation. Perhaps this type of categorical "smoothing" should be its own function, like geom_categorical_model?

If this would be considered a useful enhancement, I would be happy to contribute it.

Have `View()` function that provides a more informative error

I've talked to a couple of professors who are still struggling with students adding View() to R Markdown documents and getting errors that are hard for the students to decipher. One way to "fix" this would be to include a wrapper function in this package that provides a more informative error if the environment being used is not interactive (as is the case when knitting R Markdown). Otherwise, it behaves just like View() does usually.

Thoughts welcome.
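
A hypothetical sketch of such a wrapper (the exact wording of the message is illustrative only):

View <- function(x, title = deparse(substitute(x))) {
  if (!interactive()) {
    stop(
      "View() only works in an interactive session. Remove the View() call before knitting your R Markdown document.",
      call. = FALSE
    )
  }
  utils::View(x, title)
}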

(Infrastructural, optional) Consider using markdown to write Roxygen docs

Part of JOSE review openjournals/jose-reviews#115

This is optional and largely infrastructure/developer focused. Right now docs are written using the Rd format of writing. However, if you switch to using Markdown for writing the docs, it makes it easier for e.g. pkgdown to link functions to other help sites. For instance, on the website, the get_regression_points() reference page has a "See also" section, with augment listed but no link to what augment is. If written in Markdown (starting with use_roxygen_md()), you get pkgdown to link to augment's help doc with [broom::augment()].

Update broken url

In the evals dataset, the link to the original data source is broken due to reformatting of the openintro.org webpage.

Allow user specified ID variable for get_regression_points()

Assuming you've downloaded the data for House Prices: Advanced Regression Techniques, a 9-line "baby's first Kaggle competition" code snippet would be:

library(tidyverse)
library(moderndive)
train <- read_csv("train.csv")
test <- read_csv("test.csv")
house_model <- lm(SalePrice ~ YrSold, data = train)
submission <- get_regression_points(house_model, newdata = test, ID = "ID") %>% 
  select(ID, SalePrice_hat) %>% 
  rename(SalePrice = SalePrice_hat)
write_csv(submission, "submission.csv")

However, if you look at the sample_submission.csv file on Kaggle, the ID variable must be named Id and valued 1461:2919, not the ID valued 1:1459 that our output has.

Strongly encourage using "fitted" or "predicted" compared to "*_hat"

Part of the JOSE review openjournals/jose-reviews#115

This was partly brought up by @lisamr in her review. The language of y_hat is very much linked to the math of statistics. However, considering this is a package that could have (potentially) broader usage outside of the ModernDive book, most people new to doing statistics will have no idea what that means. There's a reason broom::augment() uses .fitted instead of y_hat or something similar. I'd suggest renaming the column from get_regression_points() to be *_fitted or *_predicted, in line with better communication to non-statisticians, to make the package more learner friendly, and to increase adoption/use of this package. From personal experience, I never understood what y hat was until I learned that it is the "predicted" value based on the regression line. y hat was too tied to the equation and I couldn't relate it to the real world.

Plotting parallel slopes model

When considering multiple regression with one numerical and one categorical predictor, the ggplot code to plot the interaction model is easy while the code to plot the "parallel slopes" model is too hard. Consider adding a geom_parallel_slopes() ggplot2 geometry extension to the moderndive package that behaves like geom_smooth(method = "lm", ...).

Example: Seattle house prices

suppressPackageStartupMessages(library(tidyverse))
library(knitr)
library(moderndive)

# log10() transformations
house_prices <- house_prices %>%
  mutate(
    log10_price = log10(price),
    log10_size = log10(sqft_living)
  )

Default plot is based on interaction model

Default plot when using col = condition is interaction model: different slopes and intercepts

ggplot(house_prices, aes(x = log10_size, y = log10_price, col = condition)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm", se = FALSE, size = 1) +
  labs(x = "log10 square feet living space", y = "log10 price in USD", title = "House prices in Seattle: Interaction model")

Corresponding slopes and intercepts

lm(log10_price ~ log10_size * condition, data = house_prices) %>% 
  get_regression_table()
#> # A tibble: 10 x 7
#>    term             estimate std_error statistic p_value lower_ci upper_ci
#>    <chr>               <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
#>  1 intercept           3.33      0.451     7.38    0        2.45     4.22 
#>  2 log10_size          0.69      0.148     4.65    0        0.399    0.98 
#>  3 condition2          0.047     0.498     0.094   0.925   -0.93     1.02 
#>  4 condition3         -0.367     0.452    -0.812   0.417   -1.25     0.519
#>  5 condition4         -0.398     0.453    -0.879   0.38    -1.29     0.49 
#>  6 condition5         -0.883     0.457    -1.93    0.053   -1.78     0.013
#>  7 log10_size:cond…   -0.024     0.163    -0.148   0.882   -0.344    0.295
#>  8 log10_size:cond…    0.133     0.148     0.893   0.372   -0.158    0.424
#>  9 log10_size:cond…    0.146     0.149     0.979   0.328   -0.146    0.437
#> 10 log10_size:cond…    0.31      0.15      2.07    0.039    0.016    0.604

Parallel slopes is too hard

Compute all regression points first

model_price_3_points <-
  house_prices %>%
  lm(log10_price ~ log10_size + condition, data = .) %>%
  get_regression_points()

Display a random sample of these regression points

model_price_3_points %>% 
  group_by(condition) %>% 
  sample_n(2)
#> # A tibble: 10 x 6
#> # Groups:   condition [5]
#>       ID log10_price log10_size condition log10_price_hat residual
#>    <int>       <dbl>      <dbl> <fct>               <dbl>    <dbl>
#>  1  7377        5.47       2.86 1                    5.27    0.196
#>  2 15372        5.82       3.00 1                    5.40    0.421
#>  3  9277        5.52       3.00 2                    5.35    0.166
#>  4  5859        5.34       2.96 2                    5.32    0.022
#>  5 19748        5.80       3.38 3                    5.74    0.057
#>  6 13959        5.74       3.28 3                    5.66    0.077
#>  7 17844        5.48       3.02 4                    5.45    0.024
#>  8  5380        5.89       3.36 4                    5.74    0.148
#>  9 18642        5.63       3.54 5                    5.94   -0.305
#> 10  7422        5.69       3.01 5                    5.50    0.189

Very cumbersome code to plot

ggplot(house_prices, aes(x = log10_size, y = log10_price, col = condition)) +
  geom_point(alpha = 0.1) +
  labs(y = "log10 price", x = "log10 square footage", title = "House prices in Seattle: Parallel slopes model") +
  geom_line(data = model_price_3_points, aes(y = log10_price_hat), size = 1.5, show.legend = FALSE) +
  guides(colour = guide_legend(override.aes = list(alpha = 1)))

Corresponding slope and intercepts

lm(log10_price ~ log10_size + condition, data = house_prices) %>% 
  get_regression_table()
#> # A tibble: 6 x 7
#>   term       estimate std_error statistic p_value lower_ci upper_ci
#>   <chr>         <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
#> 1 intercept     2.88      0.036     80.0    0        2.81     2.95 
#> 2 log10_size    0.837     0.006    134.     0        0.825    0.85 
#> 3 condition2   -0.039     0.033     -1.16   0.246   -0.104    0.027
#> 4 condition3    0.032     0.031      1.04   0.3     -0.028    0.092
#> 5 condition4    0.044     0.031      1.42   0.155   -0.017    0.104
#> 6 condition5    0.096     0.031      3.09   0.002    0.035    0.156

Created on 2018-07-08 by the reprex package (v0.2.0.9000).
