Comments (10)
Actually, one of the reasons why I developed this functionality is to systematize the routine data analysis that I end up doing all the time (a lot of select + split/apply/combine). This way it's easier for inexperienced users to make somewhat complicated plots with simpler commands. I am also working on a GUI to simplify the process even more (you can load a csv, select, split and plot with a few clicks: see https://github.com/piever/PlugAndPlot.jl/). Ideally it has other applications that can be generalized to plots other than `groupapply`, but all of this doesn't need to live in StatPlots: it's actually easier for me to develop it if it doesn't.
Then maybe GroupedErrors could take care of all the data manipulation dependencies (thanks to the macro, we no longer need DataFrames on StatPlots) and take care of the data shuffling while the plots are actually drawn in StatPlots. The recipe for `groupederror` is something that I'm reimplementing to use IndexedTables, so in terms of dependencies it's probably better if it lives in GroupedErrors (and I don't need a StatPlots dependency to have it, I'd just need RecipesBase). Also the Loess dependency can be dropped, as it's only really used by `groupapply`.
I'll try to develop this separately and when it's mature enough we can add a link to it (as well as perhaps to the GUI if that also develops into a mature package). I'm unsure why the thought hasn't occurred to me before you mentioned it, it really makes a lot of sense...
from statsplots.jl.
It would certainly be great if we could integrate this with the Query.jl story! I recently published a new version that brings a pipe syntax to Query.jl, so this seems like good timing to figure out how we can make these things match and work together.
I have to admit I don't fully understand what is happening in the graph at the top. I think we could express the beginning of the data analysis in Query with
school |> @where(_.Minrty=="Yes") |> @groupby(_.Sx)
This would give you two groups. Or should things be split by `Sx` and `School` at this stage? That would be
school |> @where(_.Minrty=="Yes") |> @groupby({_.Sx, _.School})
But then I don't understand what the next step in the original analysis is.
from statsplots.jl.
I really like the new Query syntax and I'm actually trying to implement something similar. The way `groupapply` works is actually a bit convoluted. The idea is that there is a first phase of selecting and grouping, which really is just:
school |> @where(_.Minrty=="Yes") |> @groupby(_.Sx)
Then `compute_error(:across, _.School)` means that the following happens:
- a common discretization of the x axis is chosen
- the data is further split across `:School` and the desired curve is drawn for each school (in this case the `:density` of the `:MAch` column, whatever that is). However, given that there are many curves like that, only the mean and standard error across all of those curves are shown, to give an idea of what the average shape of the `:density` of `:MAch` is and how much variability we have from one school to another.
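To make these steps concrete, here is a minimal sketch in plain Julia (only the `Statistics` standard library). It is not the `groupapply` implementation: the toy data, the helper `binned_density` and all variable names are made up for illustration, and a simple histogram stands in for whatever density estimate is actually used.

```julia
using Statistics

# Hypothetical toy data: one vector of values per school
# (a stand-in for :MAch split across :School).
scores_by_school = Dict(
    "A" => [1.0, 2.0, 2.5, 3.0],
    "B" => [2.0, 2.2, 3.5, 4.0],
    "C" => [0.5, 1.5, 2.5, 3.5],
)
groups = collect(values(scores_by_school))

# Step 1: choose a common discretization of the x axis from the pooled data.
pooled = reduce(vcat, groups)
edges = range(minimum(pooled), maximum(pooled); length = 5)  # 4 bins

# Histogram-based density on the shared bins (assumption: a crude
# replacement for a proper kernel density estimate).
function binned_density(x, edges)
    nbins = length(edges) - 1
    counts = zeros(nbins)
    for v in x
        # searchsortedlast finds the bin; clamp keeps edge values in range
        i = clamp(searchsortedlast(edges, v), 1, nbins)
        counts[i] += 1
    end
    return counts ./ (length(x) * step(edges))
end

# Step 2: one curve per school, all evaluated on the same axis.
curves = [binned_density(x, edges) for x in groups]
curve_matrix = reduce(hcat, curves)  # nbins × nschools

# Step 3: summarize across schools with the mean and the standard error.
mean_curve = vec(mean(curve_matrix; dims = 2))
sem_curve = vec(std(curve_matrix; dims = 2)) ./ sqrt(length(curves))
```

Because every school is evaluated on the same bins, the per-school curves can be stacked into a matrix and reduced along the school dimension, which is what makes the mean and the error band well defined.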
There are some examples at the end of the README in the `groupapply` section; maybe they are helpful.
In pure Query.jl terms I actually thought that this `compute_error` part was hard to express. The whole thing would probably look something like this (I may have got some of Query's syntax wrong, but I hope you get the idea):
s = @from i in school begin
    @where i.Minrty == "Yes"
    @groupby i by i.Sx into j
    @select {j..MAch, j..School, } into k
    @let shared_axis = compute_axis(k..MAch)
    @group k.MAch by k.School into l
    @select {axis = shared_axis, density = compute_density(l, shared_axis) } into m
    @group m by m.axis into n
    @select {mean(n..density), sem(n..density)}
    @collect DataFrame
end
I've started a pipeline based implementation in https://github.com/piever/GroupedErrors.jl but it's still not finalized/documented: I'll start adding docs and clarify syntax and open an issue to discuss Query integration as soon as I'm back from holidays (in a week or so).
What I think is a better way to implement this stuff, both in terms of code clarity and performance (and what I have implemented so far) is to use IterableTables (or maybe even Query - so far I have reimplemented the bits of Query that I needed but hopefully there is an approach with less code duplication) to:
- filter the data
- extract columns corresponding to grouping variables (`:Sx`, but I accept an arbitrary number of grouping variables), compute_error variables (`:School`) and data variables (`:MAch`)
Then I create an IndexedTable with grouping and compute_error variables as index columns and data variables as data column(s). Once the data is in that format, all the manipulations I need seem easy enough (`mapslices` and `reducedim`, essentially). Still, I feel we should move the discussion to that package as soon as the code is clean/documented enough that it makes sense for you to look into it.
from statsplots.jl.
I'm also taking the liberty to ping @davidanthoff to get a more experienced opinion as to what a query-like syntax for `groupapply` should look like (and whether it makes sense in the first place), or more generally what would be the preferred way of combining some basic data manipulation with a plot in terms of syntax.
from statsplots.jl.
I can't help thinking that this is starting to look like a whole analytical framework in itself: `groupapply` would have a bespoke LINQ-like syntax that isn't used anywhere else in StatPlots, and it doesn't actually draw very much on StatPlots either, except for the plotting. It also sounds like you may develop this into a whole data-analysis framework focused on splitting, calculating errors and combining.
Could it be an idea to make a package "GroupedErrors.jl" instead? It sounds to me as if all it would need from StatPlots is the user recipe on `groupederror`. That would give you complete freedom to develop it in any direction you wanted, and maybe create an even more Query-compatible data analysis framework. It sounds like there are good arguments for making it a separate package, and no strong need to keep it together.
I can understand if you're concerned about discoverability - many people know StatPlots (though my impression is the vast majority just use Plots). But we could keep a referral in the readme "the awesome groupapply recipes have been moved to a standalone package".
Let me know your thoughts.
from statsplots.jl.
On second thought, there may be technical reasons why some `groupederror`-like recipe should live in StatPlots, but I think we can worry about that when it's time to register GroupedErrors.jl.
from statsplots.jl.
Great, I think that will be really useful to people, and your PlugAndPlot package also looks pretty sweet. BTW, do you know CrossfilterCharts.jl ? http://nbviewer.jupyter.org/github/tawheeler/CrossfilterCharts.jl/blob/master/docs/CrossfilterCharts.ipynb
from statsplots.jl.
Looks interesting, didn't know about that one!
from statsplots.jl.
Actually, after looking more closely at the new "pipeline style" Query (once again, I think the new syntax really is a big improvement) I should be able to use it to a large extent. There is only one last thing that is bugging me: I tend to need arbitrary sets of columns (maybe found programmatically), which is really not compatible with NamedTuples, at least for now. It'd be hugely helpful to also be able to `@collect` an iterator that returns tuples (and not named tuples). For example, I believe the following could be made to work pretty easily:
using DataFrames, Query, IndexedTables
df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
x = df |>
    @where(_.age>40) |>
    @select(( _.name, _.children)) |>
    Columns
This seems feasible because `Columns` has both a named and an unnamed version. This way I would solve all the issues related to working with NamedTuples and programmatically found sets of columns.
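To illustrate what I mean by programmatically found columns, here is a rough sketch in plain Julia. It does not use Query.jl or IndexedTables at all: `rows`, `wanted`, `selected` and `columns` are made-up names, and the final vector of vectors is only a stand-in for an unnamed column container.

```julia
# Hypothetical rows, standing in for the DataFrame in the example above.
rows = [
    (name = "John", age = 23.0, children = 3),
    (name = "Sally", age = 42.0, children = 5),
    (name = "Kirk", age = 59.0, children = 2),
]

# Column names that are only known at run time.
wanted = [:name, :children]

# Filter, then project each row onto a plain (unnamed) tuple.
selected = [Tuple(getproperty(r, c) for c in wanted) for r in rows if r.age > 40]

# Turn the vector of tuples into one vector per requested column.
columns = [[t[i] for t in selected] for i in eachindex(wanted)]
```

With plain tuples the set of columns can be any runtime value of `wanted`, whereas a named tuple needs its field names to be part of the type.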
from statsplots.jl.
This can be closed now that the whole `groupapply` functionality has been moved to GroupedErrors.
from statsplots.jl.