projectmosaic / ggformula

This project forked from rpruim/statisticalmodeling

Provides a formula interface to 'ggplot2' graphics.

License: Other

R 50.09% HTML 49.91%

ggformula's Introduction

mosaic-web

Welcome to the Project MOSAIC website

ggformula's Issues

support things like scales = "free" in the main gf_ calls?

We support simple faceting in the formula, but currently you need to use gf_facet* to use the additional arguments beyond the formula.

default for gf_histogram()

Can the default for gf_histogram() address the "Pick better value" message? (Or is this fixed in a beta version of ggplot2?)

Column names in df_stats() output

Currently, df_stats() gets its names in one of three ways. First applicable method wins.

If the user provides a name, use that, enumerating when there is more than one component.
If the stat function names its output, use that.
Else a name is created from the name of the function and the variable it is applied to. The long_names argument controls the format used.

foo <- function(x) c(the_mean = mean(x), the_median = median(x))

df_stats(age ~ sex, data = HELPrct, foo)
##      sex the_mean the_median
## 1 female 36.25234         35
## 2   male 35.46821         35

df_stats(age ~ sex, data = HELPrct, center = foo)
##      sex  center1 center2
## 1 female 36.25234      35
## 2   male 35.46821      35

df_stats(age ~ sex, data = HELPrct, mean)
##      sex mean_age
## 1 female 36.25234
## 2   male 35.46821

df_stats(age ~ sex, data = HELPrct, mean, long_names = FALSE)
##      sex     mean
## 1 female 36.25234
## 2   male 35.46821

Decisions to make:

do we like this naming convention?
should the default be long_names = TRUE or long_names = FALSE?

Note: there is also a nice_names = TRUE option that will force the names to be syntactically OK.
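The three-step precedence described above can be sketched in base R. The helper stat_names() is hypothetical and is not ggformula's implementation. The mean_age pattern for long names is inferred from the output shown above, and enumerating unnamed multi-component results in rule 3 is an assumption.

```r
# Hypothetical helper: choose column names for the output of one stat function.
#   user_name: name supplied by the user ("" if none)
#   values:    vector returned by the stat function for one group
#   fun_name, var_name: names of the function and of the variable it is applied to
stat_names <- function(user_name, values, fun_name, var_name, long_names = TRUE) {
  k <- length(values)
  if (nzchar(user_name)) {
    # Rule 1: a user-supplied name wins; enumerate when there are several components.
    if (k > 1) paste0(user_name, seq_len(k)) else user_name
  } else if (!is.null(names(values))) {
    # Rule 2: use the names attached by the stat function itself.
    names(values)
  } else {
    # Rule 3: build a name from the function and (for long names) the variable.
    base <- if (long_names) paste(fun_name, var_name, sep = "_") else fun_name
    if (k > 1) paste0(base, seq_len(k)) else base  # enumeration here is an assumption
  }
}

foo <- function(x) c(the_mean = mean(x), the_median = median(x))
ages <- c(30, 35, 40)

stat_names("", foo(ages), "foo", "age")                        # "the_mean" "the_median"
stat_names("center", foo(ages), "foo", "age")                  # "center1" "center2"
stat_names("", mean(ages), "mean", "age")                      # "mean_age"
stat_names("", mean(ages), "mean", "age", long_names = FALSE)  # "mean"
```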

It would be nice to be able to set aesthetics based on values used in mapping other layers

When adding to a plot in ggformula, we can set aesthetics like color to specific values, but it would be nice to set the color to an appropriate color from the palette being used by the color scale.

Possible syntax:

model <- 
  lm(length ~ width + sex, data = KidsFeet)
l <- makeFun(model)
l(width = 8.25, sex = "B")
gf_point(length ~ width, data = KidsFeet,
         color = ~ sex) %>%
  gf_fun(l(w, sex = "B") ~ w, 
         color = ~"B") %>%
  gf_fun(l(w, sex = "G") ~ w, 
         color = ~"G")

Support more inheritance?

Currently, something like

gf_point(y ~ x, data = mydata) %>%
  gf_smooth()

doesn't work because gf_smooth() will complain about the missing formula. We could allow this by allowing the formula to be NULL when object is a gg object.

I'm not sure how easy it will be to do error checking to make sure all the required aesthetics are defined in the ggplot() call. But if we can live with ggplot2 errors coming through, this is probably pretty easy to implement.

This can be turned on/off with inherit = TRUE/inherit = FALSE for cases when this is not what we want, and we can choose the default value of inherit function by function.

best interface for gf_abline() and friends?

I'm trying to decide how/whether we want to support plotting multiple lines in one call of gf_abline(), gf_hline(), and gf_vline(). These functions don't fit our usual pattern exactly because the formula isn't really used, at least not in the most common case: adding a single line.

It might be worth writing a separate factory function for these and tuning it for their use. (If you look at geom_abline() you will see that it does some preprocessing to create a data frame used for the plot and that the data frame passed in as data is typically ignored. We might want to do something similar.)

A cost of supporting attribute:value syntax in the main formula

I think this pretty much interferes with the following reasonable syntax:

gf_point(power(1:100) ~ 1:100)

at least as currently implemented.

Is it worth removing attribute:value support to handle examples like this?
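The conflict is visible at the syntax level, whatever the parser does internally. The base R sketch below (the helper colon_calls() is hypothetical) counts calls to `:` on the right-hand side of a formula; an attribute:value term and an ordinary sequence such as 1:100 look the same to it.

```r
# Formulas are not evaluated when created, so power(), color and red
# do not need to exist for this illustration.
f1 <- y ~ x + color:red      # attribute:value style
f2 <- power(1:100) ~ 1:100   # ordinary sequence on the right-hand side

# Hypothetical helper: count calls to `:` anywhere inside an expression.
colon_calls <- function(expr) {
  if (!is.call(expr)) return(0L)
  own <- as.integer(identical(expr[[1]], as.name(":")))
  own + sum(vapply(as.list(expr)[-1], colon_calls, integer(1)))
}

colon_calls(f1[[3]])  # 1
colon_calls(f2[[3]])  # 1
```

Any rule that treats a top-level `:` as attribute:value therefore has to guess, for example by checking whether the left operand is a known attribute name.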

better show.help message when formula = NULL

> gf_vline(show.help = TRUE)
gf_vline uses a formula with shape NULL.
See ?geom_vline for additional information.

It would be better to detect the NULL and emit some other message.

support aesthetic inheritance?

Do we want to support something like

gf_point(y ~ x, data = mydata) %>%
  gf_smooth()

Currently the formula is required in each layer.

Create gf_ functions to support choropleth maps?

There are a number of issues here, including the fact that ggplot2 is not really ideal for creating maps and that there are multiple approaches to mapping. We have to think about what sorts of things are reasonable to support.

add usage help for gf_functions

I'm not sure what the argument should be called, but here's what it does:

gf_pointrange(show.help = TRUE)
## gf_pointrange uses a formula with shape y + ymin + ymax ~ x.
## See ?geom_pointrange for additional information.

create gf_emoji using emojifont package?

Have to look into this to see how easy/hard/useful/silly this is.

Put ggformula on CRAN

Danny's plan of action:

The vignette needs much work. I can do this, but not for two weeks. How about this plan:

Put it on CRAN now.
Danny fixes the vignette.
We write an rbloggers post.

Improve df_stats() on 1-sided formulas

Actually y ~ 1 isn't perfect either. Extra column and a warning message.

> df_stats(age ~ 1, data = HELPrct)
     x min Q1 median Q3 max     mean       sd   n missing
min 19  19 30     35 40  60 35.65342 7.710266 453       0
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
  corrupt data frame: columns will be truncated or padded with NAs

Proposed changes to df_stats

I'd like to modify df_stats() so that it always returns a dataframe-like object. Currently, when the formula is one-sided, df_stats() returns a vector, not a dataframe.

Not about modifying df_stats ...

Following the use-your-own-stat-function style of df_stats, I'm planning to add some functions that calculate confidence intervals on simple stats like mean, median, sd. Some questions:

Should I put these in the ggformula package? Seems odd to do so, but that's where df_stats() lives.
It would be nice to have a level = argument for these functions. I can't simply put this in fargs in df_stats, since that will barf for functions that don't take a level argument. I'm open to other ideas. For now, my plan is to write two versions of each function, one like ci_mean which directly does the calculation, and another like ci_mean_level(0.95) which takes the level as an argument and returns the proper form of ci_mean for a calculation at that level.
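A minimal sketch of the two-version idea, written as a function factory in base R. The names ci_mean and ci_mean_level come from the plan above; the t-based interval and the output names lower and upper are assumptions, not a settled design.

```r
# ci_mean_level(level) returns a one-argument stat function, so the level never
# has to be passed through df_stats() itself.
ci_mean_level <- function(level = 0.95) {
  force(level)
  function(x) {
    x <- x[!is.na(x)]
    n <- length(x)
    # t-based interval for the mean (assumed method)
    half_width <- qt(1 - (1 - level) / 2, df = n - 1) * sd(x) / sqrt(n)
    c(lower = mean(x) - half_width, upper = mean(x) + half_width)
  }
}

# The "direct" version is just the factory's result at the default level.
ci_mean <- ci_mean_level(0.95)

ci_mean(mtcars$mpg)
ci_mean_level(0.90)(mtcars$mpg)
```

Because both forms take only x, either could be handed to a use-your-own-stat-function interface without that interface knowing about level.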

@rpruim I love how you used model.frame() with the side effect of causing na.omit() to run on the data being processed. This makes it completely consistent with the NA behavior of lm(), etc.
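A small base R illustration of that side effect, assuming the factory-fresh setting options(na.action = "na.omit"): model.frame() drops every row with an NA in a variable used by the formula, and lm() ends up fitting on the same rows.

```r
d <- data.frame(y = c(1, 2, NA, 4), x = c(10, NA, 30, 40))

# Rows 2 and 3 each contain an NA, so only rows 1 and 4 survive.
mf <- model.frame(y ~ x, data = d)
nrow(mf)   # 2

# lm() builds its model frame the same way, so it uses the same rows.
fit <- lm(y ~ x, data = d)
nobs(fit)  # 2
```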

documentation for gf_labs()

An example or two would be helpful: at present the user needs to spelunk in ggplot2::labs() to get a sense for what is feasible.

Add gf_predict()

I've created a prototype, but I'm still figuring out the best interface. The idea is to be able to do something like

model <- lm( ... )
gf_point( ...) %>%
  gf_predict(model, interval = "prediction", level = 0.9)

and get a ribbon showing the prediction or confidence bands based on predict() and model. It will be a bit like gf_lm(), but allow for either type of interval and work with any model that has a predict method that creates a data frame with fit, lwr and upr.
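For reference, this is the shape of data such a layer would work from, sketched with base R only. The model and the grid of wt values are illustrative choices; gf_predict() itself is still a prototype, so nothing here reflects its actual code.

```r
model <- lm(mpg ~ wt, data = mtcars)

# Evaluate the model on a grid covering the observed range of the predictor.
grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 50))

# predict() for lm objects returns a matrix with columns fit, lwr and upr
# when an interval is requested.
bands <- as.data.frame(
  predict(model, newdata = grid, interval = "prediction", level = 0.9)
)
bands$wt <- grid$wt
head(bands)
```

In a ribbon layer, lwr and upr would play the role of ymin and ymax, with fit available for the line.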

allow for more general expressions in mapped aesthetics

The ability to do on-the-fly calculations in mapped aesthetics is convenient. Here is a way (heuristic) to determine which aesthetics are mapped and which are set:

Anything in aes_form is considered mapped and appears in aes()
attribute:var is mapped if var is a variable in the data
attribute::expr is mapped -- this allows on the fly calculations
formula parsing stops at (, so expressions can be surrounded by parens

This allows things like

gf_dens(~ (disp/cyl) + color::factor(gear), data = mtcars, verbose = TRUE)
## ggplot(data = mtcars) + 
##   geom_line(aes(x = (disp/cyl), color = factor(gear)), stat = "density")

add learnr::run_tutorial("introduction", package = "ggformula") to the README file?

how to add position = position_dodge(width = 0.5) to a gf_pointrange()?

I'm sure the issue is the same for other gf_ functions and other arguments, but I just came across this one.

gf_pointrange(mean + lo + hi ~ k + col:dev_type  | ~ n, data = Sims2,
              position = position_dodge(width = 0.5), verbose = TRUE)
## ggplot(data = Sims2) + 
##  geom_pointrange(aes(y = mean, ymin = lo, ymax = hi, x = k, col = dev_type), fatten = 2, 
##    position = <environment>) + facet_grid(~n)

Possible improvements to gf_function()

I added gf_function(). It works great as an add-on layer.

If we want to support plotting a function and nothing else, we need to determine how to set the plotting window. For now, you can do anything that creates an empty or invisible plot and then add the function on top.

Also, gf_function() doesn't currently support verbose = TRUE and there is no error checking regarding object. Here's the whole function:

function(object, fun, ...) {
  object + stat_function(fun = fun, ...)
}

Add support for x ~ 1 where ~ x is currently required.

This was suggested by @nicholasjhorton.

I think it should be possible to essentially convert x ~ 1 into ~ x before processing the formula. I'm imagining that a literal 1 will be required here and not something that evaluates to a single unique value. Would that suffice?
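A minimal sketch of that conversion in base R. The helper as_one_sided() is hypothetical; it only rewrites formulas whose right-hand side is the literal 1 and keeps the formula's environment.

```r
# Hypothetical helper, not part of ggformula.
as_one_sided <- function(formula) {
  stopifnot(inherits(formula, "formula"))
  # Only a literal 1 on the right-hand side qualifies (not 1L, not a variable).
  has_literal_one <- length(formula) == 3 && identical(formula[[3]], 1)
  if (!has_literal_one) return(formula)
  # Rebuild as a one-sided formula, keeping the original environment.
  as.formula(call("~", formula[[2]]), env = environment(formula))
}

as_one_sided(age ~ 1)    # ~age
as_one_sided(age ~ sex)  # returned unchanged
```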

change name of argument `formula`

this conflicts with the aesthetic named formula in geom_smooth().

add gf_qqline (and the underlying geoms/stats to support it)?

This is a missing feature in ggplot2 that users of mosaic may be familiar with from xqqmath(). Possible resources:

handle data = argument better in the plotting functions

Current processing doesn't allow for data transformations within the call to the plotting function.

It would be good to support things like

gf_point( length ~ width, data = KidsFeet %>% filter(sex == "G"))

 Error in parse(text = gg_string) : <text>:1:15: unexpected SPECIAL
1: ggplot(data = %>%
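One plausible way to arrive at this symptom (an assumption; the internals are not shown here) is converting the captured data expression to text with as.character(), which splits a call into pieces and puts the operator first. The base R sketch below shows the difference from deparse(); capture_data() is a hypothetical stand-in for how a plotting function might capture its data argument.

```r
# Hypothetical stand-in: capture the unevaluated data argument.
capture_data <- function(data) substitute(data)

# Nothing is evaluated, so KidsFeet, filter() and %>% need not be loaded.
expr <- capture_data(KidsFeet %>% filter(sex == "G"))

# as.character() on a call yields one string per component, operator first.
as.character(expr)[1]   # "%>%"

# deparse() keeps the whole expression, so the generated code still parses.
data_text <- paste(deparse(expr), collapse = " ")
gg_string <- paste0("ggplot(data = ", data_text, ")")
parsed <- parse(text = gg_string)
```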

Redesign gf_lm()

I think I'd rather redesign gf_lm() than continue with gf_predict() as in #45.

I'm imagining the following interface:

gf_lm(interval = "none") -- just the regression line
gf_lm(interval = "confidence") -- line and confidence bands
gf_lm(interval = "prediction") -- line and prediction bands
gf_lm(interval = "confidence", geom = "ribbon") -- bands without line
gf_lm(geom = "line") -- line using different aesthetics as per geom_line()

Given this interface and the similarity to predict(), I suggest that the default should be interval = "none". This is not quite backward compatible with the existing gf_lm(), but I think I want to force users to declare which type of band they want.

add gf_dhistogram()?

We can get density histograms using

gf_histogram( ..density.. ~ x, data = ...)

But it might be nice to get these with

gf_dhistogram( ~ x, data = ...)
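As a reminder of what the density scale means, here is the same idea with base R's hist() rather than ggplot2: on the density scale the bar areas sum to 1, while the counts sum to the number of observations. The variable is an arbitrary illustrative choice.

```r
x <- mtcars$mpg
h <- hist(x, plot = FALSE)

# Bar areas (height times width) on the density scale sum to 1.
sum(h$density * diff(h$breaks))  # 1

# Counts sum to the number of observations.
sum(h$counts)                    # 32
```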

split documentation into more smaller files

The number of functions currently documented together is probably a bit overwhelming.

allow user to override defaults from extras

This allows us to set aesthetic defaults (like alpha = 0.3 for gf_ribbon()) while still allowing the user to adjust.

support data %>% gf_() syntax

We can detect when the placeholder object is a data frame to handle this. Verbose output won't include the data object left of %>%, but this doesn't seem too bad. After all, the verbose output is telling what the gf_ function needs to be replaced with.

KidsFeet %>% gf_point(length ~ width, verbose = TRUE)
## ggplot(data = .) + 
##   geom_point(aes(y = length, x = width))

Add gf_scales()

Have to think a bit about the API for this if we want to avoid having a function for each of the many scales functions in ggplot2.

Error in how gf_dist() plots distributions with a grouping indicator

To get all the features of mosaic::plotDist(), I'd need to figure out how to get at the x-range of the plotting windows.

create minimal R sheet for ggformula

Improve examples throughout documentation (and remove dependencies?)

The examples have served in part to check that features work as advertised, but some of them are not very interesting and could be improved.

Also, several of them rely on external packages (like mosaicData), which we might prefer to avoid. (But using external packages can contribute to more interesting examples.)

Perhaps we should put a couple data sets into ggformula and use those for examples. Anyone have a new data set to suggest?

Remove attribute:value and attribute::value from documentation, vignettes, etc.

As per #41, these are no longer supported.

set show.help = TRUE when object and gformula are both NULL

This allows users to get help more easily.

Example:

> gf_abline()
gf_abline does not require a formula.
See ?geom_abline for additional information.

allow data = NULL to produce a plot

Currently we throw an error (unless we are adding an additional layer). But things work just fine, and it can be handy to plot things like residuals without placing them into a data frame first:

gf_dens( ~ resid(model))

eval()uate ggplot2 expressions in the environment of the formula

This should handle most use cases well and provides a mechanism to override the default for those who want to do so. Seems to handle gf_ function called from within other functions reasonably.
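The lookup order can be sketched in base R: variables are sought first in data (which may be NULL) and then in the environment where the formula was created. The helper eval_rhs() is hypothetical and only illustrates the evaluation rule, not how ggformula builds its ggplot2 calls.

```r
# Hypothetical helper: evaluate the right-hand side of a formula,
# looking in `data` first and then in the formula's own environment.
eval_rhs <- function(formula, data = NULL) {
  rhs <- formula[[length(formula)]]
  eval(rhs, envir = data, enclos = environment(formula))
}

make_formula <- function() {
  local_resid <- c(-1, 0, 1)  # visible only inside this function
  ~ local_resid               # the formula remembers this environment
}

eval_rhs(make_formula())                 # found via the formula's environment
head(eval_rhs(~ mpg / cyl, data = mtcars), 3)  # found in the data frame
```

The first call is the data = NULL case: it works even though local_resid is not visible from where eval_rhs() is called.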

Provide alternative syntax for mapping non-positional attributes/aesthetics

I'm thinking of something like this:

gf_point( y ~ x, color = ~ a, data = mydata)

Advantages:

should be easier to do on the fly computations (fewer parens)
less text parsing of formula
formulas can get pretty big when there are lots of attributes to set
easier to convert between mapping and setting when attribute = value is used for setting.
avoid needing :: for some cases.

Eventually, we can decide if we want to support a one true way or continue to support both ways of specifying attribute-mapping.
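With this syntax, telling mapped from set attributes no longer needs text parsing: a formula means "map", anything else means "set". A minimal base R sketch, where split_aesthetics() is a hypothetical helper and not ggformula code:

```r
# Hypothetical helper: separate mapped (formula) from set (plain value) arguments.
split_aesthetics <- function(...) {
  args <- list(...)
  is_mapped <- vapply(args, inherits, logical(1), what = "formula")
  list(
    mapped = lapply(args[is_mapped], function(f) f[[2]]),  # expression after ~
    set    = args[!is_mapped]
  )
}

res <- split_aesthetics(color = ~ sex, size = ~ log(age), alpha = 0.5)
res$mapped  # color -> sex, size -> log(age)
res$set     # alpha -> 0.5
```

On-the-fly computations such as log(age) come through as ordinary expressions, with no extra parentheses or :: needed.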

add a version of statisticalModeling::gmodel() as gf_model?

I've not given gmodel() much of a test drive, but it might be useful to have something like it or like mosaic::plotModel() in the package.

Add formula-based faceting?

This would make it even easier to migrate from lattice.

I'm thinking to support

plot_points(cesd ~ i1 | sex, data = HELPrct)       # facet_wrap
plot_points(cesd ~ i1 | sex ~ . , data = HELPrct)  # facet_grid
plot_points(cesd ~ i1 | . ~ sex , data = HELPrct)  # facet_grid
plot_points(cesd ~ i1 | sex ~ substance, data = HELPrct)  # facet_grid
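One way to split such formulas, sketched in base R. Because `|` binds tighter than `~` and `~` is left-associative, y ~ x | a ~ b parses as (y ~ (x | a)) ~ b, so the facet part has to be reassembled from two places. The helper split_facets() is hypothetical and handles only the four shapes listed above.

```r
is_call_to <- function(x, name) is.call(x) && identical(x[[1]], as.name(name))

# Hypothetical helper: separate the plotting formula from the faceting formula.
split_facets <- function(formula) {
  stopifnot(inherits(formula, "formula"), length(formula) == 3)
  lhs <- formula[[2]]
  rhs <- formula[[3]]
  if (is_call_to(lhs, "~") && length(lhs) == 3 && is_call_to(lhs[[3]], "|")) {
    # y ~ x | a ~ b: the outer right-hand side completes the facet formula.
    bar <- lhs[[3]]
    list(main = call("~", lhs[[2]], bar[[2]]),
         facet = call("~", bar[[3]], rhs),
         type = "facet_grid")
  } else if (is_call_to(rhs, "|")) {
    # y ~ x | a: one-sided facet formula.
    list(main = call("~", lhs, rhs[[2]]),
         facet = call("~", rhs[[3]]),
         type = "facet_wrap")
  } else {
    list(main = formula, facet = NULL, type = "none")
  }
}

str(lapply(split_facets(cesd ~ i1 | sex)[c("main", "facet")], deparse))
str(lapply(split_facets(cesd ~ i1 | sex ~ substance)[c("main", "facet")], deparse))
```

A formula that uses `|` for something other than faceting would be misread by this approach.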

Friends for `df_stats()`

I've written a wrapper around mosaic::tally() which works with piped-in data (or data = ) and returns a data frame as output. This has two names: df_props() and df_counts(). Currently, these are in the mosaicModel package, but I don't think they should live there. Among the options ...

Put them in ggformula
Create a mosaicdf (mosaicDf ?) which contains all the df_ functions.

I considered integrating them with df_stats(), but decided not to. Giving them separate names is (1) likely easier for students to assimilate, (2) keeps df_stats() cleaner, and (3) leaves df_stats() as explicitly about statistics of a single variable.

improve parsing of formula-based faceting

See issue #9 for the current strategy, which is relatively easy and should work for the intended use cases, but may give unexpected results in some other cases.

gf_rug() wants a two-sided formula.

Switch it to work with one-sided formulas. The y-axis coordinate is meaningless for a rug plot.

Name for df_stats?

This is similar to @dtkaplan's qstats() (but reimplemented using aggregate() to allow for functions that produce multiple values), but I don't see a compelling reason for the letter q. Some options:

df_stats(): because it produces a data frame from a formula rather than a graphic from a formula.
gf_stats(): because it is in the ggformula package
stats(): because it is shorter

Also, we could keep or drop the final s.

Other options? Preferences?

Fix df_stats()

The version of df_stats() on CRAN produces a data frame, but it is an odd data frame that doesn't behave as one would expect (despite displaying nicely on the screen).

Issue: aggregate() produces a matrix that is really a list with dimension rather than a vector with dimensions, so the conversion to a data frame doesn't work as we would like.

Fix: This lovely bit of code, with two calls to lapply(), two calls to data.frame() and an unlist() converts things to a more standard data frame:

  res <- lapply(res, function(x) data.frame(lapply(data.frame(x$x), unlist)))
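The general problem is easy to reproduce in base R. In the sketch below, a stat function that returns two values makes aggregate() produce a single matrix-valued column, so the result has fewer columns than it appears to when printed. The flattening step uses do.call(data.frame, ...), a common idiom swapped in here for illustration; it is not the line above and does not cover the list-with-dimensions case described in this issue.

```r
two_stats <- function(x) c(mean = mean(x), sd = sd(x))

res <- aggregate(mpg ~ cyl, data = mtcars, FUN = two_stats)
ncol(res)           # 2, although three columns are printed
is.matrix(res$mpg)  # TRUE: the stats live in one matrix column

# Expand the matrix column into ordinary columns.
flat <- do.call(data.frame, res)
names(flat)         # "cyl" "mpg.mean" "mpg.sd"
```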

Makes me wonder about this claim from aggregate()'s documentation (emphasis mine):

Splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form.

gf_boxplot(age ~ 1, data=HELPrct)

is suboptimal.

While I'm not a fan of a single boxplot, it is still taught (and shows up in SDM4 in R chapter 3).