
infer's Introduction

infer R Package

[Hex logo: a silhouette of a fir tree atop green text reading 'infer', white background, green border]


The objective of this package is to perform statistical inference using an expressive statistical grammar that coheres with the tidyverse design framework. The package is centered around 4 main verbs, supplemented with many utilities to visualize and extract value from their outputs.

  • specify() allows you to specify the variable, or relationship between variables, that you’re interested in.
  • hypothesize() allows you to declare the null hypothesis.
  • generate() allows you to generate data reflecting the null hypothesis.
  • calculate() allows you to calculate a distribution of statistics from the generated data to form the null distribution.
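
Chained together, the four verbs read as a pipeline. Here is a minimal sketch using the gss dataset that ships with the package (a full worked example appears in the Examples section below):

library(infer)

gss %>%
  specify(response = hours) %>%                  # variable of interest
  hypothesize(null = "point", mu = 40) %>%       # null: mean hours worked is 40
  generate(reps = 1000, type = "bootstrap") %>%  # resample under the null
  calculate(stat = "mean")                       # one statistic per replicate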

To learn more about the principles underlying the package design, see vignette("infer").

A diagram showing four steps to carry out randomization-based inference: specify hypothesis, generate data, calculate statistic, and visualize. From left to right, each step is connected by an arrow, while the diagram indicates that generating data and calculating statistics can happen iteratively.

If you’re interested in learning more about randomization-based statistical inference generally, including applied examples of this package, we recommend checking out Statistical Inference Via Data Science: A ModernDive Into R and the Tidyverse and Introduction to Modern Statistics.

Installation


To install the current stable version of infer from CRAN:

install.packages("infer")

To install the development version of infer from GitHub, make sure to install pak first. The pkgdown website for this version is at infer.tidymodels.org.

# install.packages("pak")
pak::pak("tidymodels/infer")

Contributing


We welcome others helping us make this package as user-friendly and efficient as possible. Please review our contributing and conduct guidelines. By participating in this project you agree to abide by its terms.

For questions and discussions about tidymodels packages, modeling, and machine learning, please post on Posit Community. If you think you have encountered a bug, please submit an issue. Either way, learn how to create and share a reprex (a minimal, reproducible example), to clearly communicate about your code. Check out further details on contributing guidelines for tidymodels packages and how to get help.

Examples


These examples are pulled from the “Full infer Pipeline Examples” vignette, accessible by calling vignette("observed_stat_examples"). They make use of the gss dataset supplied by the package, providing a sample of data from the General Social Survey. The data looks like this:

# load in the dataset
data(gss)

# take a glimpse at it
str(gss)
## tibble [500 × 11] (S3: tbl_df/tbl/data.frame)
##  $ year   : num [1:500] 2014 1994 1998 1996 1994 ...
##  $ age    : num [1:500] 36 34 24 42 31 32 48 36 30 33 ...
##  $ sex    : Factor w/ 2 levels "male","female": 1 2 1 1 1 2 2 2 2 2 ...
##  $ college: Factor w/ 2 levels "no degree","degree": 2 1 2 1 2 1 1 2 2 1 ...
##  $ partyid: Factor w/ 5 levels "dem","ind","rep",..: 2 3 2 2 3 3 1 2 3 1 ...
##  $ hompop : num [1:500] 3 4 1 4 2 4 2 1 5 2 ...
##  $ hours  : num [1:500] 50 31 40 40 40 53 32 20 40 40 ...
##  $ income : Ord.factor w/ 12 levels "lt $1000"<"$1000 to 2999"<..: 12 11 12 12 12 12 12 12 12 10 ...
##  $ class  : Factor w/ 6 levels "lower class",..: 3 2 2 2 3 3 2 3 3 2 ...
##  $ finrela: Factor w/ 6 levels "far below average",..: 2 2 2 4 4 3 2 4 3 1 ...
##  $ weight : num [1:500] 0.896 1.083 0.55 1.086 1.083 ...

As an example, we’ll run an analysis of variance on age and partyid, testing whether the age of a respondent is independent of their political party affiliation.

Calculating the observed statistic,

F_hat <- gss %>% 
  specify(age ~ partyid) %>%
  calculate(stat = "F")

Then, generating the null distribution,

null_dist <- gss %>%
   specify(age ~ partyid) %>%
   hypothesize(null = "independence") %>%
   generate(reps = 1000, type = "permute") %>%
   calculate(stat = "F")

Visualizing the observed statistic alongside the null distribution,

visualize(null_dist) +
  shade_p_value(obs_stat = F_hat, direction = "greater")
A histogram showing a distribution of F statistics, right-tailed and centered around one. The x axis ranges from zero to five. The region of the histogram to the right of the observed statistic, just above two, is shaded red to represent the p-value.

Calculating the p-value from the null distribution and observed statistic,

null_dist %>%
  get_p_value(obs_stat = F_hat, direction = "greater")
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1    0.06

Note that the formula and non-formula interfaces (i.e. age ~ partyid vs. response = age, explanatory = partyid) work for all implemented inference procedures in infer. Use whatever is more natural for you. If you will be doing modeling using functions like lm() and glm(), though, we recommend you begin to use the formula y ~ x notation as soon as possible.
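
For instance, with the example above, these two specifications are interchangeable:

gss %>% specify(age ~ partyid)
gss %>% specify(response = age, explanatory = partyid)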

Other resources are available in the package vignettes! See vignette("observed_stat_examples") for more examples like the one above, and vignette("infer") for discussion of the underlying principles of the package design.

infer's People

Contributors

alexpghayes, andrewpbray, beanumber, brendanhcullen, carlganz, corinne-riddell, echasnovski, emilhvitfeldt, hfrick, ismayc, jimrothstein, juliasilge, mine-cetinkaya-rundel, nfultz, nicksolomon, pirategrunt, richierocks, rudeboybert, simonpcouch, topepo, torockel, wmorgan485, xiaochi-liu


infer's Issues

assertion to add

More informative error message needed out of specify():

load(url("http://s3.amazonaws.com/assets.datacamp.com/production/course_5694/datasets/gss_sampled.RData"))
gss2016 <- filter(gss_sampled, year == 2016)
ggplot(gss2016, aes(x = cappun)) +
  geom_bar()
gss2016 %>%
  specify(response = cappun, success = "FAVOR")

Adding missing grouping variables: year

Have any summary function work in calculate

I really like using The Lady Tasting Tea to motivate hypothesis testing. The following code does the trick:

library(tidyverse)
library(infer)
lady_tasting_tea <- data_frame(
  first = as.factor(c(rep("milk",4), rep("tea",4)))
  )

lady_tasting_tea %>%
  specify(response = first) %>% # alt: am ~ NULL (or am ~ 1)
  hypothesize(null = "point", p = c("milk" = .5, "tea" = .5)) %>% 
  generate(reps = 1000, type = "simulate") %>% 
  calculate(stat = "prop")

However, it would be great if the final stat could be "sum". That way there is one less layer of abstraction between the experiment and the null distribution (students can read the count off directly, instead of reading off proportions).

In a more general setting, any many-to-one summary function would be great, for example all of those that work with dplyr::summarize().
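
The desired call would look something like this (hypothetical: stat = "sum" is the proposal here, not an implemented option):

lady_tasting_tea %>%
  specify(response = first) %>%
  hypothesize(null = "point", p = c("milk" = .5, "tea" = .5)) %>%
  generate(reps = 1000, type = "simulate") %>%
  calculate(stat = "sum")   # hypothetical: the count of "milk" cups per replicate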

Hypothesis testing diagram

The following diagram was inspired by Allen Downey's "There is only one test" and can be found in keynote form in the figs directory.

[Diagram: ht-diagram, the hypothesis testing workflow]

Some possible desiderata of this diagram:

  • sufficiently general to describe most/all classical hypothesis tests
  • specific enough to deconstruct the inferential black box into the key conceptual steps
  • the pipeline should be able to be broken at any step and the output should make sense
  • adaptive to both computational (permutation/simulation) and mathematical approximation approaches.

Any suggestions at this most general level?

query about cool figure

[Screenshot of the hypothesis testing diagram, dated 2017-07-14]

I had a question about the cool figure (above). Shouldn't the approximation arrow point to "calculate statistic" since that's where the approximation picks up?

Just my $0.02, Nick

bug for one prop tests

It appears that if you're working with a two-level categorical variable, it's totally fine to assign just one null value of a proportion.

mtcars %>%
  specify(response = am) %>% # alt: am ~ NULL (or am ~ 1)
  hypothesize(null = "point", p = c("1" = .25)) %>% 
  generate(reps = 100, type = "simulate") %>% 
  calculate(stat = "prop")

We're currently throwing an error message, however, when you try to do a one-prop test on a variable that has many levels. If you convert the GoF example:

> mtcars %>%
+   specify(cyl ~ NULL) %>% # alt: response = cyl
+   hypothesize(null = "point", p = c("4" = .5)) %>%
+   generate(reps = 100, type = "simulate") %>%
+   calculate(stat = "prop", success = "4")
Error in parse_params(dots, x) : 
  The factor variable that you have specified has 3 levels and you've only assigned null probabilities to 1 levels.

Do NAs get resampled?

The calculate function takes the na.rm argument for the statistic it's calculating. But does this mean that, at the generate stage (say, if we're doing a bootstrap interval), NAs from the original sample get resampled? If so, I believe this is not a good idea: chances are those NAs don't factor into the sample size we report for the original sample, and a bootstrap sample should be the same size as the original sample. If instead the NAs from the original sample don't make it into the bootstrap samples, then isn't the na.rm argument in calculate unnecessary?

Some feedback

  • ran goodpractice::gp(); the only thing of note for later is that you may want to stick to an 80-character line width across all code, doc, and test files
  • I believe the Title field in the DESCRIPTION file needs to be sentence case
  • license: CC0 is a valid license on CRAN now, but perhaps consider a more software-specific license like MIT or GPL, for example
  • change http://github.com/andrewpbray/infer to https
  • add a code of conduct?
  • did you check build_win() yet?
  • some more parameter input checking would be useful, e.g., generate(reps = "foobar") doesn't error well
  • lots of warnings in the mtcars_examples vignette; maybe suppress those
  • vignettes do make sense to me, though I don't do stats much anymore, sooooooo
  • the exported functions (particularly calculate) lean towards being kinda big for my taste; they could be DRYed out a bit if possible
  • no examples in the specify man file?
  • as a newb, I wasn't really sure whether there's an order in which you're meant to use the functions. Some guidance on that would be good: the functions error nicely if specify isn't used first, but it seems like the docs should also say that. The package-level man file would be a good place for it.
  • error behavior for going from a data.frame straight to generate (skipping specify) wasn't too good, but maybe users won't do that

Hypothesize function

Proposed input: a dataframe/tibble that has already been select()ed down to the variables of interest.
Proposed output: a tibble with a flag for the null hypothesis selected, similar to group_by().

I like the idea of having students specify a null argument here, probably in the form of a character string. Deciding how to describe each of the nulls we want will be difficult. I took a stab at them in the examples in the README, but I have lots of questions:

  1. Should parameters show up in these?
  2. Should there be any reference to the columns in the data frame that comes in or should we be doing checks of types within the function?
  3. Will we run into problems when going back and forth between the computational and approximation versions of a particular test? That is, are there cases where the null hypotheses aren't exactly the same?
  4. Should we add an argument for the alternative? My sense right now is no: the goal of this is the visualize()ation of the sampling distribution. We could always write a p-value() function that has an alternative argument.

S3 classes?

Instead of setting an attribute, should hypothesize() return an object of class proptest or whatever? That way calculate() will already know what to do.
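
A minimal sketch of the S3 idea, with hypothetical names throughout:

# hypothetical sketch: tag the data with a class in hypothesize(),
# then let calculate() dispatch on it
hypothesize2 <- function(x, null) {
  class(x) <- c(null, class(x))
  x
}
calculate2 <- function(x, ...) UseMethod("calculate2")
calculate2.proptest <- function(x, ...) message("one-proportion logic runs here")
calculate2.default <- function(x, ...) stop("no method for this hypothesis class")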

CI for slope

@andrewpbray

I'm doing CIs for the linear model case. Can that option also be implemented in infer?

Also, I don't know if you were asking for advice for the linear model hypothesize function, but I prefer "slope=0". That way, the ideas are easier to generalize to multivariate models. I'm also not opposed to having both options be possible as hypothesize arguments.

Calculate should not use `mean` as the var name in output

This is in the summarise step of calculate. Since mean is a function, it's not good practice to also use it as a variable name; the same goes for median, etc. Alternatives:

  1. Generically call all of them stat, so the output looks the same regardless of which statistic is used. This means the user has to remember which it was, which is OK I think, since they specify it in the function call.
    • If we go with this, we can add a note to the output saying "stat is mean" or "stat is diff in means", something like that; at least the variable name stays the same, which is nice for subsequent calculations or visualizations.
  2. Use names like "boot_mean", "boot_median", etc. Or some other prefix.

My vote is for Option 1.

How do you plan to support tests in the context of lm() and glm()?

As @nicholasjhorton suggested, it will be important to be able to move easily beyond the traditional Intro Stats tests to this more flexible space -- else you are just a dead end.

Perhaps it would be better to start from that end than from the 1- and 2-sample end. Most of the 1- and 2-sample things can be thought of as special cases of this more general setting anyway.

Calculate currently doesn't return anything

Is this right? That's my experience with it.

What should it return, as in, what should the object be called? Options:

  • dist
  • sim_dist or perm_dist or boot_dist depending on generate type
  • boot_mean or perm_mean or perm_median etc. depending on both generate type and stat?
  • ???

assertive package is not auto-installed

When trying to run

library(infer)
library(tidyverse)
mtcars <- as.data.frame(mtcars) %>%
  mutate(cyl = factor(cyl),
         vs = factor(vs),
         am = factor(am),
         gear = factor(gear),
         carb = factor(carb))
mtcars %>%
  specify(am ~ vs) %>% # alt: response = am, explanatory = vs
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "permute") %>%
  calculate(stat = "diff in props")

I get Error in loadNamespace(name) : there is no package called ‘assertive’

Here is my sessionInfo

R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] infer_0.1.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.14     knitr_1.17       bindr_0.1        magrittr_1.5     munsell_0.4.3    colorspace_1.3-2
 [7] R6_2.2.2         rlang_0.1.6      plyr_1.8.4       stringr_1.2.0    dplyr_0.7.4      tools_3.4.1     
[13] grid_3.4.1       gtable_0.2.0     htmltools_0.3.6  lazyeval_0.2.1   yaml_2.1.14      rprojroot_1.2   
[19] digest_0.6.13    assertthat_0.2.0 tibble_1.3.4     bookdown_0.5     bindrcpp_0.2     ggplot2_2.2.1   
[25] glue_1.2.0       evaluate_0.10.1  rmarkdown_1.8    stringi_1.1.6    compiler_3.4.1   scales_0.5.0    
[31] backports_1.1.1  pkgconfig_2.0.1 
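
The immediate workaround, until the dependency is declared properly, is to install the missing package manually:

install.packages("assertive")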

support for approximation methods

Did we close the door on this? It seems like

mtcars %>%
  specify(mpg ~ hp) %>%
  hypothesize(null = "independence") %>% 
  calculate(stat = "slope") %>%
  visualize()

would still make sense. It would return something like:

mod <- lm(mpg ~ hp, data = mtcars)
ggplot(data = filter(broom::tidy(mod), term == "hp"), aes(x = estimate)) + 
  stat_function(fun = dt, args = list(df = mod$df.residual)) + 
  scale_x_continuous(limits = c(-4, 4)) + 
  geom_vline(aes(xintercept = estimate), color = "red")

naming of functions

Instead of generate() with different types, would it be better to have simulate(), bootstrap() and permute() just be separate functions? They are "verbs" after all... Wouldn't this make it even clearer what is happening?

calculating statistic from *random samples* drawn from a population

@andrewpbray @mine-cetinkaya-rundel @beanumber @ismayc @nicksolomon

Another thing I do in my course is to create a sampling distribution of slopes taken from a huge population. I do this in chapter 1 before I've done any inference. The idea is just to visualize how the slopes change from sample to sample outside the context of making specific hypotheses.

Again, would appreciate your vote on which of these options seems best / most consistent with what we are doing.

First:

explanatory <- rnorm(1000, 0, 5)
response <- 40 + explanatory * 2 + rnorm(1000, 0, 10)
popdata <- data.frame(explanatory, response)

(A)
Uses rep_sample_n to get many samples. Two plots here: one is superimposed lines on a scatterplot, one is a histogram.

manysamples <- popdata %>%
  rep_sample_n(size = 50, reps = 100)

ggplot(manysamples, aes(x = explanatory, y = response, group = replicate)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

manylms <- manysamples %>%
  group_by(replicate) %>%
  do(tidy(lm(response ~ explanatory, data = .))) %>%
  filter(term == "explanatory")

ggplot(manylms, aes(x = estimate)) + geom_histogram()

(B)
This code doesn't exist yet... we could write an additional option to generate in infer... maybe use the argument sample?

manylms <- popdata %>%
  specify(response ~ explanatory) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "sample", size = 50) %>%
  calculate(stat = "slope") 

# I don't know how to use the above to make the slope plot in ggplot (which I believe to be very powerful).  
# I'd need both the slope and the intercept to draw the lines... so maybe this is a moot point??

ggplot(manylms, aes(x=estimate)) + geom_histogram()

Error in sample.int(length(x), size, replace, prob) : NA in probability vector

It's late...what am I doing wrong?

outcomes <- data_frame(candidate = c("clinton", "trump"))
outcomes %>%
  specify(response = candidate) %>%
  hypothesize(null = "point", p = c(0.5, 0.5)) %>%
  generate(reps = 1000, type = "simulate") %>%
  calculate(stat = "prop") %>%
  visualize()

Error in sample.int(length(x), size, replace, prob) : 
  NA in probability vector
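
One guess at the culprit (a sketch, untested): the null probabilities are unnamed, so they can't be matched to the levels of candidate. Naming them, as in the other point-null examples on this page, may clear the NA:

outcomes %>%
  specify(response = candidate) %>%
  hypothesize(null = "point", p = c("clinton" = 0.5, "trump" = 0.5)) %>%
  generate(reps = 1000, type = "simulate") %>%
  calculate(stat = "prop") %>%
  visualize()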

Student/teaching experience with generate type

I'm using infer in my in-class teaching this semester, and I can't yet definitively say how learning is going, but in general based on informal observation, seems ok.

However, one type of question I've been getting repeatedly is "How do I know when to use which type for generating samples?". This is a good question of course, and the answer leads us to discussing differences. But what I'm observing is that students are trying to make "if this, then that"-type lists (like "if I have a categorical variable and a point null, then simulate"). And what's worrisome about it is that it feels a bit like the cheatsheets students put together in courses that teach theoretical inference, with types of tests and SE formulas. One of the main motivations for coming up with a consistent syntax was to move away from such thinking.

Have others using this in in-class teaching experienced this? Any thoughts? @andrewpbray @beanumber @hardin47

Note: The one alternative I can think of is to have the package decide the type of generate based on the info that comes before it. This is similar to how it worked in my inference function. But it hides what I think is an important detail. Unfortunately my limited experience in showing that detail hasn't been super positive, but I'm willing to accept that's the way to go, just wanted to hear from others what they think after having students use these functions.

Diff in means/props/medians

Currently I believe the subtraction is being done in the alphabetical order of the levels if the categorical variable is a character, and in the levels order if the variable is a factor. I think we need an order argument in calculate for the "diff in ..." situations. This is how I had solved this problem in the inference function. Happy to do a PR, but wanted to check: (1) am I interpreting the current status correctly, and (2) is everyone on board with a new order argument, something like order = c("level1", "level2"), which translates to mean(y[x == "level1"]) - mean(y[x == "level2"])?
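
For concreteness, that proposal matches the form the argument takes in the ?calculate example quoted later on this page:

mtcars %>%
  dplyr::mutate(am = factor(am), vs = factor(vs)) %>%
  specify(am ~ vs, success = "1") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "permute") %>%
  calculate(stat = "diff in props", order = c("1", "0"))  # subtracts level "0" from level "1"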

failed to import pipe?

> library(infer)
> mtcars %>% hypothesize(null = "independence")
Error in mtcars %>% hypothesize(null = "independence") : 
  could not find function "%>%"

Should we just move dplyr to Depends?

Handling of NAs

Currently, if the input data has an NA, no warnings are shown, but the resulting simulated statistics vector can have some NAs, since calculate doesn't specify how to handle NAs when taking the mean, median, etc. One option is to specify that in calculate. The other option is to make the user get rid of NAs first in the exploratory stage, and then feed the data into the infer steps. The former is a bit more user friendly, but hides an important consideration. The latter can be tricky for students when doing inference for two variables, since they would need to remove pairs of NAs across both variables. Thoughts?
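
A sketch of the latter option, using drop_na() from tidyr to handle the pairwise case:

library(infer)   # for the gss dataset
library(tidyr)

gss_complete <- gss %>%
  drop_na(age, partyid)   # drops rows with an NA in either variable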

Why is Med capitalized?

Was there a specific purpose for this? If not, I suggest using lowercase, as needing to capitalize might trip up students. Happy to do a small PR for this but didn't want to touch it before checking to make sure the capitalization isn't serving some other purpose.

Success should always be defined

In working on a solution for #53 I realized that if success isn't defined by the user we make an assumption and use the first level as success. I don't think this is a good approach because it makes the function "just work" without the student properly formulating the inference chain. I propose removing https://github.com/andrewpbray/infer/blob/f034a8629dd42c83d4fbb33dd1b80324262dbb12/R/calculate.R#L60 and https://github.com/andrewpbray/infer/blob/f034a8629dd42c83d4fbb33dd1b80324262dbb12/R/calculate.R#L61 and instead throwing an error saying success must be defined when the response variable is categorical. I can put that in the same PR as the order solution.

Also, as far as I can tell currently the "diff in props" example isn't working since success functionality isn't built into the difference in proportions case (only the single proportion).
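
A sketch of the error-throwing check proposed above (the helper and its signature are hypothetical):

# hypothetical helper: x is the piped data frame, which carries the
# success attribute set by specify()
check_success <- function(x, response) {
  if (is.factor(x[[response]]) && is.null(attr(x, "success"))) {
    stop("`success` must be defined when the response variable is categorical.")
  }
  invisible(x)
}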

open or closed system?

I've gone at least quickly through all the issues now. It seems like you are trying to create a very closed system -- users will only be able to conduct hypothesis tests that you have imagined in advance they will want to do and for which they know the magic character strings that name the test. This is only marginally different from having a separate function for each test.

A true grammar of hypothesis testing would be a more open system where a new hypothesis test could be conducted without needing to add to the package.

For starters, in calculate(), why isn't stat a function (or a string that names a function)? That would allow users to use any test statistic by providing a function of the proper arity for the data. It would also allow users to directly apply the stat function to the data set in question to get the observed value of the test statistic.

Here are some test cases for your "grammar":

  • how do I test that two medians are the same using the difference in medians as a test statistic?
  • ditto for other quantiles
  • how do I test that 2 groups have the same variance using the difference in standard deviations as a test statistic?
  • how do I test that 3 groups have the same variance using the variance of the variances as a test statistic?
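
For illustration, a hypothetical sketch of the open interface being proposed, with stat as a user-supplied function of one replicate (none of this exists in the package):

gss %>%
  specify(hours ~ college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = function(d) diff(tapply(d$hours, d$college, median)))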

Generate function

Proposed input: a tibble with some meta-info from hypothesize()
Proposed output: a much taller tibble that is the input repeat()ed n times along with an indexing column.

This function is pretty much the same as Mine's rep_sample_n() and it encapsulates the idea that the null hypothesis should encode some mechanism by which you can generate additional data sets under that null hypothesis. Some thoughts/questions:

  • Right now, the only argument I'm thinking of is repeat() for the number of iterations. The meta-info that's appended with hypothesize() should be sufficient to know whether this is a permutation or simulation situation.
  • This function could also be called simulate().
  • If we want this package to be scalable to larger data sets and still run fairly fast, we might want to write some backup functionality here. As in: if the data set is really big, hold off on spooling out 1000 copies of it in memory. We can probably be more efficient about that if we don't need to show this repeated data set as output here.
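
In data terms, the proposed output shape is just the input stacked reps times with an index column; a minimal sketch:

library(dplyr)

input <- tibble(x = c(1, 2))
bind_rows(rep(list(input), 3), .id = "replicate")
# 6 rows: the 2-row input repeated 3 times, tagged by `replicate`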

Where is the grammar?

Just looking over this briefly, I don't see the "grammar". Do you have a grammar in mind? What are its elements (nouns and verbs)? How are they defined? This seems to be motivated primarily by the desire to use %>% and not by a well-defined grammar. Am I missing something?

Perhaps a wiki page could be created to describe the grammar.

error messages

The following cases have arisen that lead to very unhelpful error messages.

  1. Problem: beach_mountain has more than two levels.
library(tidyverse)
# library(devtools)
# install_github("andrewpbray/oilabs")
library(oilabs)
library(infer)
data(survey141)

survey141 %>%
  specify(response = beach_mountain) %>%
  hypothesize(null = "point", p = c("Beach" = .5, "Mountains" = .5)) %>%
  generate(reps = 300, type = "simulate")
  2. Problem: didn't specify an explanatory variable in a permutation test.

[Screenshot of the resulting error message, dated 2017-10-13]

generate() chokes when factor levels are integers?

mtcars %>%
   specify(response = am) %>%
   hypothesize(null = "point", p = c("0" = 0.25, "1" = 0.75)) %>%
   generate(reps = 100, type = "simulate")

 Error in sample.int(length(x), size, replace, prob) : 
  invalid first argument 

lazy eval issue?

* checking R code for possible problems ... NOTE
calculate: no visible binding for global variable ‘xbar’
calculate: no visible binding for global variable ‘xtilde’
calculate: no visible binding for global variable ‘prop’
visualize: no visible binding for global variable ‘stat’

This used to be fixed by using mutate_() with a formula, but now we are supposed to use rlang?? I think @ismayc knows how to fix this...
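
One standard fix for the NOTE (a sketch): declare the non-standard-evaluation column names so R CMD check stops treating them as globals:

# e.g., in a file like R/globals.R
utils::globalVariables(c("xbar", "xtilde", "prop", "stat"))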

logic in hypothesize()

Note:

> mtcars %>% 
    select(cyl) %>%
    hypothesize(null = "point", p1 = .25, p2 = .25, p3 = .50)

Error in hypothesize(., null = "point", p1 = 0.25, p2 = 0.25, p3 = 0.5) : 
  The number of parameters exceeds the number of columns. 

The code is checking to see whether the number of parameters is greater than the number of columns in the matrix. But here we want to check whether it is the same as the number of factor levels in the one variable. Does it create a problem anywhere else if I change this?
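
A sketch of the intended check, assuming the selected column is a factor:

cyl <- factor(mtcars$cyl)
p <- c("4" = .25, "6" = .25, "8" = .50)
stopifnot(length(p) == nlevels(cyl))   # compare against factor levels, not columns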

Update keynote diagrams

We will need to update the keynote diagram that sits in the README for hypothesis tests and also add the one for confidence intervals. I rarely use Keynote so I suspect one of you can do this a lot faster and nicer than my attempts.

Use tidyeval functionality

This is relevant to calculate but maybe others too, if any of those use underscore verbs from dplyr, which is now deprecated functionality.

Prototype for calculating when stat == "mean"

  if (stat == "mean") {
    col <- setdiff(names(x), "replicate")
    x %>%
      dplyr::group_by(replicate) %>%
      dplyr::summarize(stat = mean(!!sym(col)))
  }

Note that this requires importing the sym function from the rlang package, which dplyr currently does not import automatically.

Is someone up for updating the calculate function with this, or should I? We also need to resolve #15 and #16 too, and then all of the updates can be done at once.

Including transformed statistics like t and z

It seems that if we are planning to also include approximation methods that use the t and standard normal distributions, we should also include something like

Two categorical (2 level) variables

mtcars %>%
  specify(am ~ vs) %>% # alt: response = am, explanatory = vs
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "permute") %>%
  calculate(stat = "z")

or

One numerical (one mean)

mtcars %>%
  specify(response = mpg) %>%
  generate(reps = 100, type = "bootstrap") %>%
  calculate(stat = "t")

I think this is especially needed if we'd like to plot both the computational and approximation methods on the same plot. Would something like this make more sense in that case for displaying both?

mtcars %>%
  specify(am ~ vs) %>% # alt: response = am, explanatory = vs
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "permute") %>%
  calculate(stat = "z") %>%
  visualize(method = "both", tail = "left", extreme_as = test_stat)

But this brings up the issue of the student needing to calculate the observed (transformed) test statistic by entering the formula for z and also doing a group_by. Hmm... What do you all think?

Discrepancy between README and calculate() examples of diff in proportion

Running ?calculate, the example works:

library(infer)
library(tidyverse)
# Example
  mtcars %>%
    dplyr::mutate(am = factor(am), vs = factor(vs)) %>%
    specify(am ~ vs, success = "1") %>%
    hypothesize(null = "independence") %>%
    generate(reps = 100, type = "permute") %>%
    calculate(stat = "diff in props", order = c("1", "0"))

However the README example doesn't

mtcars %>%
  specify(am ~ vs) %>% # alt: response = am, explanatory = vs
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "permute") %>%
  calculate(stat = "diff in props")

as we need specify(am ~ vs, success = "1") and calculate(stat = "diff in props", order = c("1", "0")). Perhaps set default values for specify(success) and calculate(order)?

add test statistics ref line to visualize()?

visualize() shows the sampling distribution. Should we add a call to + geom_vline() that illustrates where in the sampling distribution the test statistic is? Do we still have that information by the time we get to visualize()?
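
For what it's worth, visualize() returns a ggplot, so a reference line can be layered on; a sketch reusing the F_hat and null_dist objects from the Examples section above:

library(ggplot2)

visualize(null_dist) +
  geom_vline(xintercept = F_hat$stat, color = "red")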

calculating the observed statistic

@mine-cetinkaya-rundel @andrewpbray @beanumber @nicksolomon

As I finish up the permutation test, I realize that calculating the linear model test statistic (slope) is not consistent with the structure of the infer package. Please vote on (A), (B), (C), or (D) and let @ismayc (or @andrewpbray ??) know if / how the infer package should be updated.

First:

library(infer)
twins <- read.csv("http://andrewpbray.github.io/data/twins.csv")
twin.slps <- twins %>%
  specify(Foster ~ Biological) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "permute") %>%
  calculate(stat = "slope") 

(A)
Very tidy, but I would have to explain do.

obs.slp <- twins %>% do(tidy(lm(Foster ~ Biological, data = .))) %>%
    filter(term == "Biological") %>%
    dplyr::select(estimate)

sum(abs(obs.slp) >= abs(twin.slps)) / 100

(B)
Mostly tidy, but I'm not piping in the twins dataset.

obs.slp <- tidy(lm(Foster ~ Biological, data = twins)) %>% 
     filter(term == "Biological") %>%
     dplyr::select(estimate)

sum(abs(obs.slp) >= abs(twin.slps)) / 100

(C)
Not tidy.

obs.slp <- lm(Foster ~ Biological, data = twins)$coef[2]

sum(abs(obs.slp) >= abs(twin.slps)) / 100

(D)
Create a type = "identity" option in generate. This doesn't work yet; someone would have to write it. And is identity the right word? What about observed?

obs.slp <- twins %>%
  specify(Foster ~ Biological) %>%
  hypothesize(null = "independence") %>%
  generate(type = "identity") %>%
  calculate(stat = "slope") 

sum(abs(obs.slp) >= abs(twin.slps)) / 100
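
For reference, the README example at the top of this page settles on something close to (D), minus the generate() step: piping specify() straight into calculate() yields the observed statistic. In terms of the twins data, that sketch would be:

obs.slp <- twins %>%
  specify(Foster ~ Biological) %>%
  calculate(stat = "slope")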

Print methods

I'd like people to be able to break the inferential pipeline at any point and have the output make sense. The default tibble/data.frame printing that we're working with is pretty good, but there are other attributes related to hypothesize() that would be nice to print (similar to how "Groups: cyl" is printed at the top of the output of group_by(mtcars, cyl)).

Seems like all we'd need to do is add an infer class to the data frame, then write a print method that will supersede those of tbl and data.frame. I was hoping to just tweak the existing print methods for that, but... I've been having a tough time tracking them down. They appear to be in tibble, but I haven't gotten much further than learning about format() and trunc_mat().

Has anyone been able to find them?

Where should the group_by go?

Currently we have the df coming out of generate() getting grouped by replicate. In calculate, we'll need to redo the grouping for the difference case, but for the single-variable case we don't need those group_by()s in there, so do we need them at all?

More generally, does it make more sense to group by in generate() or in calculate()? I kinda like having it in generate() because that output doesn't really make any sense in an ungrouped setting.

Calculate function

Proposed input: dataframe/tibble that has generated data in a form to calculate statistics across replicates
Proposed output: tibble containing the replicated statistics

Only have monochromatic bins in null distribution + p-value histograms?

In the Randomization approach to t-statistic example:

library(nycflights13)
library(dplyr)
library(stringr)
library(infer)
set.seed(2017)

# Create data frame
fli_small <- flights %>% 
  sample_n(size = 500) %>% 
  mutate(half_year = case_when(
    between(month, 1, 6) ~ "h1",
    between(month, 7, 12) ~ "h2"
  )) %>% 
  mutate(day_hour = case_when(
    between(hour, 1, 12) ~ "morning",
    between(hour, 13, 24) ~ "not morning"
  )) %>% 
  select(arr_delay, dep_delay, half_year, 
         day_hour, origin, carrier)

# Null
obs_t <- fli_small %>% 
  t_stat(formula = arr_delay ~ half_year)
t_null_distn <- fli_small %>%
  specify(arr_delay ~ half_year) %>% # alt: response = arr_delay, explanatory = half_year
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "t")
t_null_distn %>% 
  visualize(obs_stat = obs_t, direction = "two_sided")

I think having the bin that straddles the test-statistic value (and its negative) come in two colors will definitely trip novices up. Is there any way to have two bin boundaries be the test statistic value (and its negative) by default, so that we have monochromatic bins, and only resort to this two-color binning scheme if people specify a particular bin count or binwidth?
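
A sketch of the idea, assuming obs_t is a single number and t_null_distn has a stat column as above: choose a binwidth that divides 2 * obs_t evenly, then extend the edges outward so that bin boundaries land exactly on the test statistic and its negative:

library(ggplot2)

b <- abs(obs_t)
w <- 2 * b / 8                                                # 8 bins between -obs_t and obs_t
k <- max(1, ceiling((max(abs(t_null_distn$stat)) - b) / w))   # extra bins to cover the tails
edges <- seq(-b - k * w, b + k * w, by = w)                   # boundaries hit ±obs_t exactly
ggplot(t_null_distn, aes(x = stat)) +
  geom_histogram(breaks = edges)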

Fix how specify() deals with multi-level factors

> library(infer)
> library(tidyverse)
> load(url("http://s3.amazonaws.com/assets.datacamp.com/production/course_5694/datasets/gss_sampled.RData"))
> gss2016 <- dplyr::filter(gss_sampled, year == 2016)
> table(gss2016$cappun)

   IAP  FAVOR OPPOSE     DK     NA 
     0     85     65      0      0 
> gss2016 %>%
+   specify(response = cappun, success = "FAVOR") %>%
+   hypothesize(null = "point", p = .5)
Adding missing grouping variables: `year`
Error in names(dots$p) <- c(attr(x, "success"), missing_lev) : 
  'names' attribute [5] must be the same length as the vector [2]

I'd like to make this code work in a situation like this when there are old levels with no observations. I'll add re-releveling within hypothesize().
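
Until then, a user-side workaround (sketch) is to drop the empty levels before specify():

gss2016 %>%
  dplyr::mutate(cappun = droplevels(cappun)) %>%
  specify(response = cappun, success = "FAVOR") %>%
  hypothesize(null = "point", p = .5)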
