ggformula's Introduction

mosaic-web

Welcome to the Project MOSAIC website

ggformula's People

Contributors

dtkaplan, ian-curtis, klaassenj, larmarange, ncarchedi, nicholasjhorton, rpruim

ggformula's Issues

default for gf_histogram()

Can the default for gf_histogram() address the "Pick better value" message? (Or is this fixed in a beta version of ggplot2?)

Column names in df_stats() output

Currently, df_stats() gets its names in one of three ways. First applicable method wins.

  1. If the user provides a name, use that, enumerating when there is more than one component.

  2. If the stat function names its output, use that.

  3. Else a name is created from the name of the function and the variable it is applied to. The long_names argument controls the format used.

foo <- function(x) c(the_mean = mean(x), the_median = median(x))

df_stats(age ~ sex, data = HELPrct, foo)
##      sex the_mean the_median
## 1 female 36.25234         35
## 2   male 35.46821         35

df_stats(age ~ sex, data = HELPrct, center = foo)
##      sex  center1 center2
## 1 female 36.25234      35
## 2   male 35.46821      35

df_stats(age ~ sex, data = HELPrct, mean)
##      sex mean_age
## 1 female 36.25234
## 2   male 35.46821

df_stats(age ~ sex, data = HELPrct, mean, long_names = FALSE)
##      sex     mean
## 1 female 36.25234
## 2   male 35.46821
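
The three rules could be sketched in base R roughly as follows. This only illustrates the precedence and is not the code inside df_stats(); resolve_names() and all of its arguments are hypothetical, and enumerating an unnamed multi-value result under rule 3 is an assumption.

```r
# Hypothetical helper: choose column names for one stat function's output.
# n          = number of values the stat function returns
# user_name  = name supplied by the user (rule 1), or NULL
# stat_names = names the stat function gave its output (rule 2), or NULL
resolve_names <- function(n, fun_name, var_name,
                          user_name = NULL, stat_names = NULL,
                          long_names = TRUE) {
  if (!is.null(user_name)) {
    base <- user_name                      # rule 1: user-provided name wins
  } else if (!is.null(stat_names)) {
    return(stat_names)                     # rule 2: names from the stat function
  } else if (long_names) {
    base <- paste(fun_name, var_name, sep = "_")  # rule 3, long form
  } else {
    base <- fun_name                       # rule 3, short form
  }
  # Enumerate only when there is more than one component.
  if (n > 1) paste0(base, seq_len(n)) else base
}

resolve_names(2, "foo", "age", user_name = "center")
resolve_names(2, "foo", "age", stat_names = c("the_mean", "the_median"))
resolve_names(1, "mean", "age")
resolve_names(1, "mean", "age", long_names = FALSE)
```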

Decisions to make:

  • do we like this naming convention
  • should the default be long_names = TRUE or long_names = FALSE?

Note: there is also a nice_names = TRUE option that will force the names to be syntactically OK.

It would be nice to be able to set aesthetics based on values used in mapping other layers

When adding to a plot in ggformula, we can set aesthetics like color to specific values, but it would be nice to set the color to an appropriate color from the palette being used by the color scale.

Possible syntax:

model <- 
  lm(length ~ width + sex, data = KidsFeet)
l <- makeFun(model)
l(width = 8.25, sex = "B")
gf_point(length ~ width, data = KidsFeet,
         color = ~ sex) %>%
  gf_fun(l(w, sex = "B") ~ w, 
         color = ~"B") %>%
  gf_fun(l(w, sex = "G") ~ w, 
         color = ~"G")

Support more inheritance?

Currently, something like

gf_point(y ~ x, data = mydata) %>%
  gf_smooth()

doesn't work because gf_smooth() will complain about the missing formula. We could allow this by allowing the formula to be NULL when object is a gg object.

I'm not sure how easy it will be to do error checking to make sure all the required aesthetics are defined in the ggplot() call. But if we can live with ggplot2 errors coming through, this is probably pretty easy to implement.

This can be turned on/off with inherit = TRUE/inherit = FALSE for cases when this is not what we want, and we can choose the default value of inherit function by function.
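
A minimal sketch of the decision logic, assuming the only test needed is whether object is a gg object. resolve_formula() is a hypothetical name, and the stand-in plot below is a bare list carrying the "gg" class so the example runs without ggplot2.

```r
# Hypothetical helper: decide where a layer gets its aesthetics from.
resolve_formula <- function(object = NULL, formula = NULL, inherit = TRUE) {
  if (!is.null(formula)) {
    return("own")        # the layer supplied its own formula
  }
  if (inherit && inherits(object, "gg")) {
    return("inherited")  # reuse whatever the existing plot defines
  }
  stop("a formula is required unless adding to a plot with inherit = TRUE",
       call. = FALSE)
}

# Stand-in for a real ggplot object (assumption: it carries class "gg").
fake_plot <- structure(list(), class = c("gg", "ggplot"))

resolve_formula(fake_plot)                    # formula omitted, plot present
resolve_formula(fake_plot, formula = y ~ x)   # explicit formula
```

With inherit = FALSE the same call without a formula stops with an error, which is the current behavior.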

best interface for gf_abline() and friends?

I'm trying to decide how/whether we want to support plotting multiple lines in one call of gf_abline(), gf_hline(), and gf_vline(). These functions don't fit our usual pattern exactly because the formula isn't really used, at least not in the most common case: adding a single line.

It might be worth writing a separate factory function for these and tuning it for their use. (If you look at geom_abline() you will see that it does some preprocessing to create a data frame used for the plot and that the data frame passed in as data is typically ignored. We might want to do something similar.)

better show.help message when formula = NULL

> gf_vline(show.help = TRUE)
gf_vline uses a formula with shape NULL.
See ?geom_vline for additional information. 

It would be better to detect the NULL and emit some other message.
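
One way this might look, as a sketch only: help_message() is a hypothetical helper, and the wording used for the NULL case is a placeholder rather than a proposal for the final text.

```r
# Hypothetical helper: build the show.help text for a gf_ function.
# shape is the formula shape as a string, or NULL when no formula is used.
help_message <- function(fun_name, geom_name, shape = NULL) {
  first <- if (is.null(shape)) {
    # Placeholder wording for functions that take no formula.
    paste0(fun_name, " does not use a formula.")
  } else {
    paste0(fun_name, " uses a formula with shape ", shape, ".")
  }
  paste0(first, "\nSee ?", geom_name, " for additional information.")
}

cat(help_message("gf_vline", "geom_vline"), "\n")
cat(help_message("gf_pointrange", "geom_pointrange",
                 shape = "y + ymin + ymax ~ x"), "\n")
```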

support aesthetic inheritance?

Do we want to support something like

gf_point(y ~ x, data = mydata) %>%
  gf_smooth()

Currently the formula is required in each layer.

Create gf_ functions to support choropleth maps?

There are a number of issues here, including the fact that ggplot2 is not really ideal for creating maps and that there are multiple approaches to mapping. We have to think about what sorts of things are reasonable to support.

add usage help for gf_functions

I'm not sure what the argument should be called, but here's what it does:

gf_pointrange(show.help = TRUE)
## gf_pointrange uses a formula with shape y + ymin + ymax ~ x.
## See ?geom_pointrange for additional information.

Put ggformula on CRAN

Danny's plan of action:

The vignette needs much work. I can do this, but not for two weeks. How
about this plan:

  1. Put it on CRAN now.
  2. Danny fixes the vignette.
  3. We write an rbloggers post.

Improve df_stats() on 1-sided formulas

Actually y ~ 1 isn't perfect either. Extra column and a warning message.

> df_stats(age ~ 1, data = HELPrct)
     x min Q1 median Q3 max     mean       sd   n missing
min 19  19 30     35 40  60 35.65342 7.710266 453       0
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
  corrupt data frame: columns will be truncated or padded with NAs

Proposed changes to df_stats

I'd like to modify df_stats() so that it always returns a dataframe-like object. Currently, when the formula is one-sided, df_stats() returns a vector, not a data frame.

Not about modifying df_stats ...

Following the use-your-own-stat-function style of df_stats, I'm planning to add some functions that calculate confidence intervals on simple stats like mean, median, sd. Some questions:

  1. Should I put these in the ggformula package? Seems odd to do so, but that's where df_stats() lives.
  2. It would be nice to have a level = argument for these functions. I can't simply put this in fargs in df_stats, since that will barf for functions that don't take a level argument. I'm open to other ideas. For now, my plan is to write two versions of each function, one like ci_mean which directly does the calculation, and another like ci_mean_level(0.95) which takes the level as an argument and returns the proper form of ci_mean for a calculation at that level.
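
The two-version plan in item 2 could look something like this in base R. It is a sketch under assumptions: a t-based interval, NAs dropped silently, and a lower/upper return format, none of which is settled above.

```r
# Direct version: t-based confidence interval for a mean.
# Assumptions: NAs are dropped; result is a named vector (lower, upper).
ci_mean <- function(x, level = 0.95) {
  x <- x[!is.na(x)]
  n <- length(x)
  margin <- qt(1 - (1 - level) / 2, df = n - 1) * sd(x) / sqrt(n)
  c(lower = mean(x) - margin, upper = mean(x) + margin)
}

# Factory version: fix the level, get back a one-argument stat function
# that can be handed to something like df_stats().
ci_mean_level <- function(level) {
  force(level)  # evaluate now so later changes to the caller's variable don't leak in
  function(x) ci_mean(x, level = level)
}

ci_mean_90 <- ci_mean_level(0.90)
ci_mean_90(mtcars$mpg)
```

Because the returned function takes only x, nothing extra has to be passed through fargs.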

@rpruim I love how you used model.frame() with the side effect of causing na.omit() to run on the data being processed. This makes it completely consistent with the NA behavior of lm(), etc.

documentation for gf_labs()

An example or two would be helpful: at present the user needs to spelunk in ggplot2::labs() to get a sense for what is feasible.

Add gf_predict()

I've created a prototype, but I'm still figuring out the best interface. The idea is to be able to do something like

model <- lm( ... )
gf_point( ...) %>%
  gf_predict(model, interval = "prediction", level = 0.9)

and get a ribbon showing the prediction or confidence bands based on predict() and model. It will be a bit like gf_lm(), but allow for either type of interval and work with any model that has a predict method that creates a data frame with fit, lwr and upr.
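
For reference, here is the kind of data frame such a layer would need, built with predict() on an lm fit. This is not the prototype; mtcars and the 50-point grid are arbitrary choices for illustration.

```r
model <- lm(mpg ~ wt, data = mtcars)

# Evenly spaced x values across the observed range of the predictor.
grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 50))

# For lm, predict() with an interval returns a matrix with columns
# fit, lwr and upr; cbind() with a data frame gives a data frame.
bands <- cbind(
  grid,
  predict(model, newdata = grid, interval = "prediction", level = 0.9)
)
head(bands)
```

A ribbon would then use lwr and upr as its lower and upper edges and fit as the line.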

allow for more general expressions in mapped aesthetics

The ability to do on-the-fly calculations in mapped aesthetics is convenient. Here is a way (heuristic) to determine which aesthetics are mapped and which are set:

  • Anything in aes_form is considered mapped and appears in aes()
  • attribute:var is mapped if var is a variable in the data
  • attribute::expr is mapped -- this allows on the fly calculations
  • formula parsing stops at (, so expressions can be surrounded by parens

This allows things like

gf_dens(~ (disp/cyl) + color::factor(gear), data = mtcars, verbose = TRUE)
## ggplot(data = mtcars) + 
##   geom_line(aes(x = (disp/cyl), color = factor(gear)), stat = "density") 
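
The attribute:var and attribute::expr rules could be sketched like this. classify_term() is a hypothetical helper that works on one already-split term as a string; it does not cover the aes_form or parenthesis rules.

```r
# Hypothetical helper: is a single "attribute:value" term mapped or set?
classify_term <- function(term, data) {
  pos <- as.integer(regexpr("::", term, fixed = TRUE))
  if (pos > 0) {
    # attribute::expr is always mapped, allowing on-the-fly calculations.
    return(list(
      aesthetic = trimws(substr(term, 1, pos - 1)),
      value     = trimws(substr(term, pos + 2, nchar(term))),
      mapped    = TRUE
    ))
  }
  pos <- as.integer(regexpr(":", term, fixed = TRUE))
  value <- trimws(substr(term, pos + 1, nchar(term)))
  list(
    aesthetic = trimws(substr(term, 1, pos - 1)),
    value     = value,
    # attribute:var is mapped only if var is a variable in the data.
    mapped    = value %in% names(data)
  )
}

classify_term("color:gear", mtcars)           # gear is a column: mapped
classify_term("color:red", mtcars)            # red is not a column: set
classify_term("color::factor(gear)", mtcars)  # expression: mapped
```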

how to add position = position_dodge(width = 0.5) to a gf_pointrange()?

I'm sure the issue is the same for other gf_ functions and other arguments, but I just came across this one.

gf_pointrange(mean + lo + hi ~ k + col:dev_type  | ~ n, data = Sims2,
              position = position_dodge(width = 0.5), verbose = TRUE)
## ggplot(data = Sims2) + 
##  geom_pointrange(aes(y = mean, ymin = lo, ymax = hi, x = k, col = dev_type), fatten = 2, 
##    position = <environment>) + facet_grid(~n) 

Possible improvements to gf_function()

I added gf_function(). It works great as an add-on layer.

If we want to support plotting a function and nothing else, we need to determine how to set the plotting window. For now, you can do anything that creates an empty or invisible plot and then add the function on top.

Also, gf_function() doesn't currently support verbose = TRUE and there is no error checking regarding object. Here's the whole function:

function(object, fun, ...) {
  object + stat_function(fun = fun, ...)
}

handle data = argument better in the plotting functions

Current processing doesn't allow for data transformations within the call to the plotting function.

It would be good to support things like

gf_point( length ~ width, data = KidsFeet %>% filter(sex == "G"))

 Error in parse(text = gg_string) : <text>:1:15: unexpected SPECIAL
1: ggplot(data = %>%
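
The error text is consistent with the data expression being turned into a string with something like as.character(), which for a call returns the function name first. That is a guess about the cause, not something checked against the ggformula source. The base R sketch below shows the difference from deparse(); nothing is evaluated, so neither KidsFeet nor %>% needs to exist.

```r
# The unevaluated data argument, as substitute(data) would capture it.
e <- quote(KidsFeet %>% filter(sex == "G"))

as.character(e)[1]  # only the operator name survives
deparse(e)          # the whole expression as source text

# Building the code string from the deparsed expression parses cleanly.
gg_string <- paste0("ggplot(data = ", paste(deparse(e), collapse = " "), ")")
parsed <- parse(text = gg_string)
gg_string
```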

Redesign gf_lm()

I think I'd rather redesign gf_lm() than continue with gf_predict() as in #45.

I'm imagining the following interface:

  • gf_lm(interval = "none") -- just the regression line
  • gf_lm(interval = "confidence") -- line and confidence bands
  • gf_lm(interval = "prediction") -- line and prediction bands
  • gf_lm(interval = "confidence", geom = "ribbon") -- bands without line
  • gf_lm(geom = "line") -- line using different aesthetics as per geom_line()

Given this interface and the similarity to predict(), I suggest that the default should be interval = "none". This is not quite backward compatible with the existing gf_lm(), but I think I want to force users to declare which type of band they want.
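
The argument handling for this interface might be sketched as below. lm_layers() is a hypothetical helper that only decides what would be drawn; treating geom = "ribbon" with interval = "none" as an error is an assumption.

```r
# Hypothetical helper: resolve interval/geom into the pieces to draw.
lm_layers <- function(interval = c("none", "confidence", "prediction"),
                      geom = NULL) {
  interval <- match.arg(interval)  # default is the first choice, "none"
  if (is.null(geom)) {
    # No geom given: line only, or line plus bands when an interval is requested.
    geom <- if (interval == "none") "line" else c("line", "ribbon")
  }
  geom <- match.arg(geom, c("line", "ribbon"), several.ok = TRUE)
  if ("ribbon" %in% geom && interval == "none") {
    stop("geom = \"ribbon\" needs interval = \"confidence\" or \"prediction\"",
         call. = FALSE)
  }
  list(interval = interval, geom = geom)
}

lm_layers()                                          # just the regression line
lm_layers(interval = "prediction")                   # line and prediction bands
lm_layers(interval = "confidence", geom = "ribbon")  # bands without line
```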

add gf_dhistogram()?

We can get density histograms using

gf_histogram( ..density.. ~ x, data = ...)

But it might be nice to get these with

gf_dhistogram( ~ x, data = ...)
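
As a reminder of what the density scale means, independent of ggformula: each bar's height is its count divided by (n times the bin width), so the bar areas sum to 1. A quick base R check, with mtcars$mpg and the breaks chosen arbitrarily:

```r
x <- mtcars$mpg
h <- hist(x, breaks = seq(10, 35, by = 5), plot = FALSE)

# Density height of each bar, computed by hand.
widths <- diff(h$breaks)
dens <- h$counts / (length(x) * widths)

all.equal(dens, h$density)  # agrees with what hist() reports
sum(dens * widths)          # total area of the bars
```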

support data %>% gf_() syntax

We can detect when the placeholder object is a data frame to handle this. Verbose output won't include the data object left of %>%, but this doesn't seem too bad. After all, the verbose output is telling what the gf_ function needs to be replaced with.

KidsFeet %>% gf_point(length ~ width, verbose = TRUE)
## ggplot(data = .) + 
##   geom_point(aes(y = length, x = width)) 
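
The detection step might be sketched as follows. gf_dispatch() is a hypothetical stand-in for the start of a gf_ function; since x %>% f(...) is just f(x, ...), the example calls it directly and needs no pipe.

```r
# Hypothetical front end: the first argument is either an existing plot
# or, when data is piped in, a data frame.
gf_dispatch <- function(object = NULL, ..., data = NULL) {
  if (is.data.frame(object)) {
    data <- object   # piped-in data frame becomes the data argument
    object <- NULL   # and there is no plot to add to
  }
  list(object = object, data = data)
}

piped <- gf_dispatch(mtcars)            # same as mtcars %>% gf_dispatch()
named <- gf_dispatch(data = mtcars)     # usual data = form
identical(piped$data, named$data)
```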

Add gf_scales()

Have to think a bit about the API for this if we want to avoid having a function for each of the many scales functions in ggplot2.

Improve examples throughout documentation (and remove dependencies?)

The examples have served in part to check that features work as advertised, but some of them are not very interesting and could be improved.

Also, several of them rely on external packages (like mosaicData), which we might prefer to avoid. (But using external packages can contribute to more interesting examples.)

Perhaps we should put a couple data sets into ggformula and use those for examples. Anyone have a new data set to suggest?

allow data = NULL to produce a plot

Currently we throw an error (unless we are adding an additional layer). But things work just fine, and it can be handy to plot things like residuals without placing them into a data frame first:

gf_dens( ~ resid(model))

Provide alternative syntax for mapping non-positional attributes/aesthetics

I'm thinking of something like this:

gf_point( y ~ x, color = ~ a, data = mydata)

Advantages:

  • should be easier to do on the fly computations (fewer parens)
  • less text parsing of formula
  • formulas can get pretty big when there are lots of attributes to set
  • easier to convert between mapping and setting when attribute = value is used for setting.
  • avoid needing :: for some cases.

Eventually, we can decide if we want to support a one true way or continue to support both ways of specifying attribute-mapping.
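
With this syntax, telling mapping from setting reduces to checking whether an argument is a formula. A minimal base R sketch, where split_aesthetics() is a hypothetical helper and only one-sided formulas are considered:

```r
# Hypothetical helper: separate mapped aesthetics (one-sided formulas)
# from set aesthetics (plain values).
split_aesthetics <- function(...) {
  args <- list(...)
  is_mapped <- vapply(args, inherits, logical(1), what = "formula")
  list(
    # For a one-sided formula, element 2 is the right-hand side.
    mapped = lapply(args[is_mapped], function(f) deparse(f[[2]])),
    set    = args[!is_mapped]
  )
}

res <- split_aesthetics(color = ~ a, size = 3, alpha = ~ log(b))
res$mapped  # expressions that would go inside aes()
res$set     # constants that would go outside aes()
```

Switching an attribute between mapped and set is then just a matter of adding or removing the ~.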

Add formula-based faceting?

This would make it even easier to migrate from lattice.

I'm thinking to support

plot_points(cesd ~ i1 | sex, data = HELPrct)       # facet_wrap
plot_points(cesd ~ i1 | sex ~ . , data = HELPrct)  # facet_grid
plot_points(cesd ~ i1 | . ~ sex , data = HELPrct)  # facet_grid
plot_points(cesd ~ i1 | sex ~ substance, data = HELPrct)  # facet_grid
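
Pulling the pieces out of such formulas is mostly a matter of knowing how R parses them: | binds tighter than ~, and ~ associates left to right, so cesd ~ i1 | sex ~ . is (cesd ~ (i1 | sex)) ~ . . A sketch, with parse_facets() as a hypothetical helper returning strings:

```r
# Hypothetical helper: split y ~ x | facets into its parts.
parse_facets <- function(f) {
  stopifnot(inherits(f, "formula"), length(f) == 3)
  is_op <- function(e, op) is.call(e) && identical(e[[1]], as.name(op))
  lhs <- f[[2]]
  rhs <- f[[3]]
  if (is_op(lhs, "~") && length(lhs) == 3 && is_op(lhs[[3]], "|")) {
    # y ~ x | rows ~ cols: the left side is itself y ~ (x | rows).
    bar <- lhs[[3]]
    list(y = deparse(lhs[[2]]), x = deparse(bar[[2]]), type = "grid",
         rows = deparse(bar[[3]]), cols = deparse(rhs))
  } else if (is_op(rhs, "|")) {
    # y ~ x | var: a single conditioning variable.
    list(y = deparse(lhs), x = deparse(rhs[[2]]), type = "wrap",
         rows = deparse(rhs[[3]]), cols = NA_character_)
  } else {
    list(y = deparse(lhs), x = deparse(rhs), type = "none",
         rows = NA_character_, cols = NA_character_)
  }
}

parse_facets(cesd ~ i1 | sex)              # facet_wrap case
parse_facets(cesd ~ i1 | sex ~ substance)  # facet_grid case
```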

Friends for `df_stats()`

I've written a wrapper around mosaic::tally() which works with piped-in data (or data = ) and returns a data frame as output. This has two names: df_props() and df_counts(). Currently, these are in the mosaicModel package, but I don't think they should live there. Among the options ...

  1. Put them in ggformula
  2. Create a mosaicdf (mosaicDf?) package which contains all the df_ functions.

I considered integrating them with df_stats(), but decided not to. Giving them separate names is (1) likely easier for students to assimilate, (2) keeps df_stats() cleaner, and (3) leaves df_stats() as explicitly about statistics of a single variable.

Name for df_stats?

This is similar to @dtkaplan's qstats() (but reimplemented using aggregate() to allow for functions that produce multiple values), but I don't see a compelling reason for the letter q. Some options:

  • df_stats(): because it produces a data frame from a formula rather than a graphic from a formula.
  • gf_stats(): because it is in the ggformula package
  • stats(): because it is shorter

Also, we could keep or drop the final s.

Other options? Preferences?

Fix df_stats()

The version of df_stats() on CRAN produces a data frame, but it is an odd data frame that doesn't behave as one would expect (despite displaying nicely on the screen).

Issue: aggregate() produces a matrix that is really a list with dimensions rather than a vector with dimensions, so the conversion to a data frame doesn't work as we would like.

Fix: This lovely bit of code, with two calls to lapply(), two calls to data.frame() and an unlist() converts things to a more standard data frame:

  res <- lapply(res, function(x) data.frame(lapply(data.frame(x$x), unlist)))

Makes me wonder about this claim from aggregate()'s documentation (emphasis mine):

Splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form.
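
A small base R reproduction of the general symptom, for anyone who hasn't run into it. In this simple case the statistics come back as a plain numeric matrix packed into one column, which may differ from the list-based matrix described above; do.call(data.frame, ...) is a common idiom for flattening it and is not the fix used in df_stats().

```r
# A stat function returning two values per group.
res <- aggregate(mpg ~ cyl, data = mtcars,
                 FUN = function(x) c(mean = mean(x), median = median(x)))

ncol(res)           # two columns, although three are printed
is.matrix(res$mpg)  # the statistics live in a single matrix column

# Flatten the matrix column into ordinary columns.
flat <- do.call(data.frame, res)
names(flat)
```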

gf_boxplot support for 1 variable

gf_boxplot(~ age, data=HELPrct)

isn't supported, while

gf_boxplot(age ~ 1, data=HELPrct)

is suboptimal.

While I'm not a fan of a single boxplot, it is still taught (and shows up in SDM4 in R, chapter 3).
