Welcome to the Project MOSAIC website
projectmosaic / ggformula
This project forked from rpruim/statisticalmodeling
Provides a formula interface to 'ggplot2' graphics.
License: Other
We support simple faceting in the formula, but currently you need to use gf_facet* to pass additional arguments beyond the formula.
Currently, df_stats() gets its names in one of three ways. The first applicable method wins.
If the user provides a name, use that, enumerating when there is more than one component.
If the stat function names its output, use that.
Otherwise, a name is created from the name of the function and the variable it is applied to. The long_names argument controls the format used.
foo <- function(x) c(the_mean = mean(x), the_median = median(x))
df_stats(age ~ sex, data = HELPrct, foo)
## sex the_mean the_median
## 1 female 36.25234 35
## 2 male 35.46821 35
df_stats(age ~ sex, data = HELPrct, center = foo)
## sex center1 center2
## 1 female 36.25234 35
## 2 male 35.46821 35
df_stats(age ~ sex, data = HELPrct, mean)
## sex mean_age
## 1 female 36.25234
## 2 male 35.46821
df_stats(age ~ sex, data = HELPrct, mean, long_names = FALSE)
## sex mean
## 1 female 36.25234
## 2 male 35.46821
Decisions to make: long_names = TRUE or long_names = FALSE?
Note: there is also a nice_names = TRUE option that will force the names to be syntactically valid.
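The three naming rules can be sketched in base R. This is only an illustration, not the code inside df_stats(): the helper stat_names() and its arguments are hypothetical, and enumerating an unnamed multi-value result under the third rule is an assumption.

```r
# Hypothetical helper illustrating the naming precedence described above.
# It is NOT ggformula's implementation.
stat_names <- function(values, fun_name, var_name,
                       user_name = NULL, long_names = TRUE) {
  k <- length(values)
  # Append 1, 2, ... only when there is more than one component.
  enumerate <- function(base) if (k > 1) paste0(base, seq_len(k)) else base

  # Rule 1: a user-supplied name wins.
  if (!is.null(user_name)) return(enumerate(user_name))
  # Rule 2: otherwise use names supplied by the stat function itself.
  if (!is.null(names(values))) return(names(values))
  # Rule 3: build a name from the function and (optionally) the variable.
  base <- if (long_names) paste(fun_name, var_name, sep = "_") else fun_name
  enumerate(base)  # assumption: enumerate here too when k > 1
}

named_result <- c(the_mean = 36.25, the_median = 35)
stat_names(named_result, "foo", "age", user_name = "center")
stat_names(named_result, "foo", "age")
stat_names(36.25, "mean", "age")
stat_names(36.25, "mean", "age", long_names = FALSE)
```

The four calls mirror the four df_stats() examples above, in a different order.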
When adding to a plot in ggformula, we can set aesthetics like color to specific values, but it would be nice to set the color to an appropriate color from the palette being used by the color scale.
Possible syntax:
model <- lm(length ~ width + sex, data = KidsFeet)
l <- makeFun(model)
l(width = 8.25, sex = "B")
gf_point(length ~ width, data = KidsFeet, color = ~ sex) %>%
  gf_fun(l(w, sex = "B") ~ w, color = ~"B") %>%
  gf_fun(l(w, sex = "G") ~ w, color = ~"G")
Currently, something like
gf_point(y ~ x, data = mydata) %>%
gf_smooth()
doesn't work because gf_smooth() will complain about the missing formula. We could allow this by permitting the formula to be NULL when object is a gg object.
I'm not sure how easy it will be to do error checking to make sure all the required aesthetics are defined in the ggplot() call. But if we can live with ggplot2 errors coming through, this is probably pretty easy to implement.
This can be turned on/off with inherit = TRUE/inherit = FALSE for cases when this is not what we want, and we can choose the default value of inherit function by function.
I'm trying to decide how/whether we want to support plotting multiple lines in one call of gf_abline(), gf_hline(), and gf_vline(). These functions don't fit our usual pattern exactly because the formula isn't really used, at least not in the most common case: adding a single line.
It might be worth writing a separate factory function for these and tuning it for their use. (If you look at geom_abline() you will see that it does some preprocessing to create a data frame used for the plot and that the data frame passed in as data is typically ignored. We might want to do something similar.)
I think this pretty much interferes with the following reasonable syntax:
gf_point(power(1:100) ~ 1:100)
at least as currently implemented.
Is it worth removing attribute:value support to handle examples like this?
> gf_vline(show.help = TRUE)
gf_vline uses a formula with shape NULL.
See ?geom_vline for additional information.
It would be better to detect the NULL and emit some other message.
Do we want to support something like
gf_point(y ~ x, data = mydata) %>%
gf_smooth()
Currently the formula is required in each layer.
There are a number of issues here, including the fact that ggplot2 is not really ideal for creating maps and that there are multiple approaches to mapping. We have to think about what sorts of things are reasonable to support.
I'm not sure what the argument should be called, but here's what it does:
gf_pointrange(show.help = TRUE)
## gf_pointrange uses a formula with shape y + ymin + ymax ~ x.
## See ?geom_pointrange for additional information.
Have to look into this to see how easy/hard/useful/silly this is.
Danny's plan of action:
The vignette needs much work. I can do this, but not for two weeks. How
about this plan:
Actually y ~ 1 isn't perfect either. Extra column and a warning message.
> df_stats(age ~ 1, data = HELPrct)
x min Q1 median Q3 max mean sd n missing
min 19 19 30 35 40 60 35.65342 7.710266 453 0
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
corrupt data frame: columns will be truncated or padded with NAs
I'd like to modify df_stats() to always return a data-frame-like object. Currently, when the formula is one-sided, df_stats() returns a vector, not a data frame.
Not about modifying df_stats
...
Following the use-your-own-stat-function style of df_stats, I'm planning to add some functions that calculate confidence intervals on simple stats like mean, median, sd. Some questions:
ggformula package? Seems odd to do so, but that's where df_stats() lives.
level = argument for these functions. I can't simply put this in fargs in df_stats, since that will barf for functions that don't take a level argument. I'm open to other ideas. For now, my plan is to write two versions of each function, one like ci_mean which directly does the calculation, and another like ci_mean_level(0.95) which takes the level as an argument and returns the proper form of ci_mean for a calculation at that level.
@rpruim I love how you used model.frame() with the side effect of causing na.omit() to run on the data being processed. This makes it completely consistent with the NA behavior of lm(), etc.
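The two-version plan can be sketched with a function factory in base R. The names ci_mean and ci_mean_level come from the discussion above; everything else (the t-based interval, dropping NAs, the output names lower and upper) is an assumption made for illustration.

```r
# Sketch only: a factory that fixes the confidence level and returns a
# one-argument stat function. Interval type and output names are assumptions.
ci_mean_level <- function(level = 0.95) {
  function(x) {
    x <- x[!is.na(x)]                       # assumption: silently drop NAs
    n <- length(x)
    t_crit <- qt(1 - (1 - level) / 2, df = n - 1)
    half_width <- t_crit * sd(x) / sqrt(n)
    c(lower = mean(x) - half_width, upper = mean(x) + half_width)
  }
}

# The "direct" version is just the factory applied at the default level.
ci_mean <- ci_mean_level(0.95)

x <- c(4.1, 5.3, 6.0, 5.5, 4.8)
ci_mean(x)              # 95% interval
ci_mean_level(0.90)(x)  # 90% interval, necessarily narrower
```

Because the level is captured in a closure, the returned function takes only the data vector, which sidesteps the problem of passing level through fargs to functions that do not accept it.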
An example or two would be helpful: at present the user needs to spelunk in ggplot2::labs() to get a sense for what is feasible.
I've created a prototype, but I'm still figuring out the best interface. The idea is to be able to do something like
model <- lm( ... )
gf_point( ...) %>%
gf_predict(model, interval = "prediction", level = 0.9)
and get a ribbon showing the prediction or confidence bands based on predict() and model. It will be a bit like gf_lm(), but allow for either type of interval and work with any model that has a predict method that creates a data frame with fit, lwr and upr.
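As a rough illustration of the data such a layer would need, here is a base R sketch using the built-in cars data. It does not use gf_predict() (a prototype whose interface is still undecided). Note that predict() for an lm returns a matrix with columns fit, lwr and upr, so it is bound to the grid to get a data frame.

```r
# Sketch under assumptions: a linear model on the built-in `cars` data.
model <- lm(dist ~ speed, data = cars)

# Evenly spaced x values across the observed range.
grid <- data.frame(
  speed = seq(min(cars$speed), max(cars$speed), length.out = 50)
)

# predict.lm() returns a matrix with columns fit, lwr, upr;
# cbind() with a data frame yields a data frame suitable for a ribbon.
bands <- cbind(
  grid,
  predict(model, newdata = grid, interval = "prediction", level = 0.9)
)

head(bands)
```

Swapping interval = "prediction" for interval = "confidence" gives the narrower band for the mean response.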
The ability to do on-the-fly calculations in mapped aesthetics is convenient. Here is a way (heuristic) to determine which aesthetics are mapped and which are set:
aes_form is considered mapped and appears in aes()
attribute:var is mapped if var is a variable in the data
attribute::expr is mapped -- this allows on-the-fly calculations
(, so expressions can be surrounded by parens
This allows things like
gf_dens(~ (disp/cyl) + color::factor(gear), data = mtcars, verbose = TRUE)
## ggplot(data = mtcars) +
## geom_line(aes(x = (disp/cyl), color = factor(gear)), stat = "density")
I'm sure the issue is the same for other gf_ functions and other arguments, but I just came across this one.
gf_pointrange(mean + lo + hi ~ k + col:dev_type | ~ n, data = Sims2,
position = position_dodge(width = 0.5), verbose = TRUE)
## ggplot(data = Sims2) +
## geom_pointrange(aes(y = mean, ymin = lo, ymax = hi, x = k, col = dev_type), fatten = 2,
## position = <environment>) + facet_grid(~n)
I added gf_function(). It works great as an add-on layer.
If we want to support plotting a function and nothing else, we need to determine how to set the plotting window. For now, you can do anything that creates an empty or invisible plot and then add the function on top.
Also, gf_function() doesn't currently support verbose = TRUE and there is no error checking regarding object. Here's the whole function:
function(object, fun, ...) {
object + stat_function(fun = fun, ...)
}
This was suggested by @nicholasjhorton.
I think it should be possible to essentially convert x ~ 1 into ~ x before processing the formula. I'm imagining that a literal 1 will be required here and not something that evaluates to a single unique value. Would that suffice?
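A minimal base R sketch of that conversion follows. The helper drop_rhs_one() is hypothetical (not part of ggformula); it rewrites the formula only when the right-hand side is the literal 1 and leaves every other formula untouched.

```r
# Hypothetical helper: turn `x ~ 1` into `~ x`, requiring a literal 1.
drop_rhs_one <- function(formula) {
  # A two-sided formula has three parts: `~`, the lhs, and the rhs.
  if (length(formula) == 3 && identical(formula[[3]], 1)) {
    lhs <- formula[[2]]
    # Build `~ lhs` and evaluate it in the original formula's environment
    # so the new formula keeps that environment.
    return(eval(call("~", lhs), envir = environment(formula)))
  }
  formula
}

drop_rhs_one(age ~ 1)     # rewritten as a one-sided formula
drop_rhs_one(age ~ sex)   # unchanged

one <- 1
drop_rhs_one(age ~ one)   # unchanged: `one` is a symbol, not a literal 1
```

Checking the unevaluated right-hand side with identical() is what enforces the "literal 1" requirement; nothing on the right-hand side is ever evaluated.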
This conflicts with the aesthetic named formula in geom_smooth().
This is a missing feature in ggplot2 that users of mosaic may be familiar with from xqqmath(). Possible resources:
Current processing doesn't allow for data transformations within the call to the plotting function.
It would be good to support things like
gf_point( length ~ width, data = KidsFeet %>% filter(sex == "G"))
Error in parse(text = gg_string) : <text>:1:15: unexpected SPECIAL
1: ggplot(data = %>%
I think I'd rather redesign gf_lm() than continue with gf_predict() as in #45.
I'm imagining the following interface:
gf_lm(interval = "none") -- just the regression line
gf_lm(geom = "line") -- line using different aesthetics as per geom_line()
Given this interface and the similarity to predict(), I suggest that the default should be interval = "none". This is not quite backward compatible with the existing gf_lm(), but I think I want to force users to declare which type of band they want.
We can get density histograms using
gf_histogram( ..density.. ~ x, data = ...)
But it might be nice to get these with
gf_dhistogram( ~ x, data = ...)
The number of functions currently documented together is probably a bit overwhelming.
This allows us to set aesthetic defaults (like alpha = 0.3 for gf_ribbon()) while still allowing the user to adjust.
We can detect when the placeholder object is a data frame to handle this. Verbose output won't include the data object left of %>%, but this doesn't seem too bad. After all, the verbose output is telling what the gf_ function needs to be replaced with.
KidsFeet %>% gf_point(length ~ width, verbose = TRUE)
## ggplot(data = .) +
## geom_point(aes(y = length, x = width))
Have to think a bit about the API for this if we want to avoid having a function for each of the many scales functions in ggplot2.
To get all the features of mosaic::plotDist(), I'd need to figure out how to get at the x-range of the plotting windows.
The examples have served in part to check that features work as advertised, but some of them are not very interesting and could be improved.
Also, several of them rely on external packages (like mosaicData), which we might prefer to avoid. (But using external packages can contribute to more interesting examples.)
Perhaps we should put a couple of data sets into ggformula and use those for examples. Anyone have a new data set to suggest?
As per #41, these are no longer supported.
This allows users to get help more easily.
Example:
> gf_abline()
gf_abline does not require a formula.
See ?geom_abline for additional information.
Currently we throw an error (unless we are adding an additional layer). But things work just fine, and it can be handy to plot things like residuals without placing them into a data frame first:
gf_dens( ~ resid(model))
This should handle most use cases well and provides a mechanism to override the default for those who want to do so. It seems to handle gf_ functions called from within other functions reasonably well.
I'm thinking of something like this:
gf_point( y ~ x, color = ~ a, data = mydata)
Advantages:
attribute = value is used for setting.
:: for some cases.
Eventually, we can decide if we want to support one true way or continue to support both ways of specifying attribute-mapping.
I've not given gmodel() much of a test drive, but it might be useful to have something like it or like mosaic::plotModel() in the package.
This would make it even easier to migrate from lattice.
I'm thinking to support
plot_points(cesd ~ i1 | sex, data = HELPrct) # facet_wrap
plot_points(cesd ~ i1 | sex ~ . , data = HELPrct) # facet_grid
plot_points(cesd ~ i1 | . ~ sex , data = HELPrct) # facet_grid
plot_points(cesd ~ i1 | sex ~ substance, data = HELPrct) # facet_grid
I've written a wrapper around mosaic::tally() which works with piped-in data (or data = ) and returns a data frame as output. This has two names: df_props() and df_counts(). Currently, these are in the mosaicModel package, but I don't think they should live there. Among the options ...
ggformula
mosaicdf (mosaicDf?) which contains all the df_ functions.
I considered integrating them with df_stats(), but decided not to. Giving them separate names (1) is likely easier for students to assimilate, (2) keeps df_stats() cleaner, and (3) leaves df_stats() as explicitly about statistics of a single variable.
See issue #9 for the current strategy, which is relatively easy and should work for the intended use cases, but may give unexpected results in some other cases.
Switch it to work with one-sided formulas. The y-axis coordinate is meaningless for a rug plot.
This is similar to @dtkaplan's qstats() (but reimplemented using aggregate() to allow for functions that produce multiple values), but I don't see a compelling reason for the letter q. Some options:
df_stats(): because it produces a data frame from a formula rather than a graphic from a formula.
gf_stats(): because it is in the ggformula package
stats(): because it is shorter
Also, we could keep or drop the final s.
Other options? Preferences?
The version of df_stats() on CRAN produces a data frame, but it is an odd data frame that doesn't behave as one would expect (despite displaying nicely on the screen).
Issue: aggregate() produces a matrix that is really a list with dimensions rather than a vector with dimensions, so the conversion to a data frame doesn't work as we would like.
Fix: This lovely bit of code, with two calls to lapply(), two calls to data.frame() and an unlist(), converts things to a more standard data frame:
res <- lapply(res, function(x) data.frame(lapply(data.frame(x$x), unlist)))
Makes me wonder about this claim from aggregate()'s documentation (emphasis mine):
Splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form.
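A simpler, related case of the same oddity can be reproduced in base R with the built-in mtcars data. This sketch is not the df_stats() code path: it only shows that a multi-value stat function makes aggregate() return a matrix-valued column, and it uses the common do.call(data.frame, ...) idiom (a different technique from the fix quoted above) to flatten it.

```r
# A stat function that returns two named values.
foo <- function(x) c(the_mean = mean(x), the_median = median(x))

res <- aggregate(mpg ~ cyl, data = mtcars, FUN = foo)

# Prints like a 3-column table, but there are really only 2 columns:
# `cyl` and a matrix-valued `mpg`.
ncol(res)
is.matrix(res$mpg)

# Flatten the matrix column into ordinary columns.
flat <- do.call(data.frame, res)
names(flat)
```

After flattening, each statistic is an ordinary numeric column, so the result behaves like a standard data frame in downstream code.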
Things like stat = stat_summary(fun.y = mean) don't currently work.
to get the word out.
Geoms exist in other packages and people might like to use them. We could get in the business of monitoring this and creating gf_ functions, or we could export gf_factory() and let users do it themselves.
gf_boxplot(~ age, data=HELPrct)
isn't supported, while
gf_boxplot(age ~ 1, data=HELPrct)
is suboptimal.
While I'm not a fan of a single boxplot, it is still taught (and shows up in SDM4 in R chapter 3).