Comments (10)

piever commented on August 12, 2024

Actually, one of the reasons I developed this functionality is to systematize the routine data analysis that I end up doing all the time (a lot of select + split/apply/combine). This makes it easier for inexperienced users to produce somewhat complicated plots with simpler commands. I am also working on a GUI to simplify the process even more (you can load a csv, select, split and plot with a few clicks: see https://github.com/piever/PlugAndPlot.jl/). Ideally it has other applications that can be generalized to plots other than groupapply, but none of this needs to live in StatPlots: it's actually easier for me to develop it if it doesn't.

Then maybe GroupedErrors could take care of all the data manipulation dependencies (thanks to the macro, we no longer need DataFrames on StatPlots) and take care of the data shuffling while the plots are actually drawn in StatPlots. The recipe for the groupederror is something that I'm reimplementing to use IndexedTables, so in terms of dependencies it's probably better if it lives in GroupedErrors (and I don't need a StatPlots dependency to have it, I'd just need RecipesBase). Also the Loess dependency can be dropped, as it's only really used by groupapply.

I'll try to develop this separately, and when it's mature enough we can add a link to it (and perhaps to the GUI, if that also develops into a mature package). I'm not sure why this didn't occur to me before you mentioned it; it really makes a lot of sense.

from statsplots.jl.

davidanthoff commented on August 12, 2024

It would certainly be great if we could integrate this with the Query.jl story! I recently published a new version that brings a pipe syntax to Query.jl, so this seems like good timing to figure out how we can make these things match and work together.

I have to admit I don't fully understand what is originally happening in the graph at the top. I think we could write the beginning of the whole data analysis in Query as

school |> @where(_.Minrty=="Yes") |> @groupby(_.Sx)

this would give you two groups. Or should things be split by Sx and School in this stage? That would be

school |> @where(_.Minrty=="Yes") |> @groupby({_.Sx, _.School})

But then I don't understand what the next step in the original analysis is.

piever commented on August 12, 2024

I really like the new Query syntax and I'm actually trying to implement something similar. The way groupapply works is actually a bit convoluted. The idea is that there is a first phase of selecting and grouping, which really is just:

school |> @where(_.Minrty=="Yes") |> @groupby(_.Sx)

Then the compute_error(:across, _.School) means that the following happens:

  • a common discretization of the x axis is chosen
  • the data is further split across :School and the desired curve is drawn for each school (in this case :density of :MAch column, whatever that is). However, given that there are many curves like that, only the mean and standard error across all of those curves are shown, to give an idea of what is the average shape of :density of :MAch and how much variability we have from one school to another.
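
The two steps above can be sketched in plain Julia. This is only an illustration under stated assumptions, not the groupapply implementation: it uses a normalized histogram as a stand-in for the :density estimate, assumes Julia 1.x with the Statistics standard library, and the names density_on_axis and across_mean_sem are made up for this sketch.

```julia
using Statistics

# Hypothetical helper: normalized histogram of `values` on the shared bin `edges`.
function density_on_axis(values, edges)
    counts = zeros(length(edges) - 1)
    for v in values
        # Index of the last edge <= v; the maximum value is folded into the last bin.
        bin = min(searchsortedlast(edges, v), length(counts))
        bin >= 1 && (counts[bin] += 1)
    end
    return counts ./ (sum(counts) .* diff(edges))
end

# Hypothetical helper: one curve per level of `across`, then mean and SEM per bin.
function across_mean_sem(x, across; nbins = 4)
    # Step 1: a common discretization of the x axis, shared by every curve.
    edges = collect(range(minimum(x), stop = maximum(x), length = nbins + 1))
    # Step 2: split by the `across` variable and compute one curve per level.
    curves = [density_on_axis(x[across .== g], edges) for g in unique(across)]
    m = hcat(curves...)                               # bins × levels
    avg = vec(mean(m, dims = 2))                      # mean curve
    err = vec(std(m, dims = 2)) ./ sqrt(size(m, 2))   # standard error across levels
    centers = (edges[1:end-1] .+ edges[2:end]) ./ 2
    return centers, avg, err
end

# Toy data standing in for the :MAch and :School columns.
mach   = [1.0, 2.0, 3.0, 4.0, 1.5, 2.5, 3.5, 4.5]
school = ["a", "a", "a", "a", "b", "b", "b", "b"]
centers, avg, err = across_mean_sem(mach, school)
```

Plotting avg against centers with err as a ribbon would give the kind of mean-plus-variability curve described above; the outer grouping (by :Sx) would simply repeat this once per group.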

There are some examples at the end of the README in the groupapply section; maybe they are helpful.

In pure Query.jl terms I actually thought that this compute_error part was hard to express. The whole thing would probably look something like this (I may have got some of Query's syntax wrong but I hope you get the idea):

s = @from i in school begin
    @where i.Minrty == "Yes"
    @group i by i.Sx into j
    @select {j..MAch, j..School} into k
    @let shared_axis = compute_axis(k..MAch)
    @group k.MAch by k.School into l
    @select {axis = shared_axis, density = compute_density(l, shared_axis) } into m
    @group m by m.axis into n
    @select {mean(n..density), sem(n..density)}
    @collect DataFrame
end

I've started a pipeline based implementation in https://github.com/piever/GroupedErrors.jl but it's still not finalized/documented: I'll start adding docs and clarify syntax and open an issue to discuss Query integration as soon as I'm back from holidays (in a week or so).

What I think is a better way to implement this stuff, both in terms of code clarity and performance (and what I have implemented so far) is to use IterableTables (or maybe even Query - so far I have reimplemented the bits of Query that I needed but hopefully there is an approach with less code duplication) to:

  • filter the data
  • extract columns corresponding to grouping variables (:Sx, but I accept an arbitrary number of grouping variables), compute_error variables (:School) and data variables (:MAch)

Then I create an IndexedTable with grouping and compute_error variables as index columns and data variables as data column(s). Once the data is in that format, all the manipulations I need seem easy enough (mapslices and reducedim, essentially). Still, I feel we should move the discussion to that package as soon as the code is clean/documented enough that it makes sense for you to look into it.

piever commented on August 12, 2024

I'm also taking the liberty to ping @davidanthoff to get a more experienced opinion as to what a query-like syntax for groupapply should look like (and whether it makes sense in the first place), or more generally what would be the preferred way of combining some basic data manipulation with a plot in terms of syntax.

mkborregaard commented on August 12, 2024

I can't help thinking that this is starting to look like a whole analytical framework in itself - if the groupapply would have a special bespoke linq-like syntax that wouldn't be used anywhere else in StatPlots, and actually doesn't draw very much on StatPlots either - except for the plotting. Also it sounds like you may develop this into a whole data-analytical framework focusing on splitting, calculating error and combining.

Could it be an idea to make a package "GroupedErrors.jl" instead? It sounds to me as if all it would need from StatPlots was the user recipe on groupederror. That would give you complete freedom to develop it in any direction you wanted, and maybe create an even more Query-compatible data analysis framework. It sounds like there are good arguments for making it a separate package, and no strong need to keep it together.

I can understand if you're concerned about discoverability - many people know StatPlots (though my impression is the vast majority just use Plots). But we could keep a referral in the readme "the awesome groupapply recipes have been moved to a standalone package".

Let me know your thoughts.

piever commented on August 12, 2024

On second thought, there may be technical reasons why some groupederror-like recipe should live in StatPlots, but I think we can worry about that when it's time to register GroupedErrors.jl.

mkborregaard commented on August 12, 2024

Great, I think that will be really useful to people, and your PlugAndPlot package also looks pretty sweet. BTW, do you know CrossfilterCharts.jl ? http://nbviewer.jupyter.org/github/tawheeler/CrossfilterCharts.jl/blob/master/docs/CrossfilterCharts.ipynb

piever commented on August 12, 2024

Looks interesting, didn't know about that one!

piever commented on August 12, 2024

Actually, after looking more closely at the new "pipeline style" Query (once again, I think the new syntax really is a big improvement), I should be able to use it to a large extent. There is only one last thing that is bugging me: I tend to need arbitrary sets of columns (maybe found programmatically), which is really not compatible with NamedTuples, at least for now. It'd be hugely helpful to be able to @collect an iterator that returns tuples (and not named tuples) as well. For example, I believe the following could be made to work pretty easily:

using DataFrames, Query, IndexedTables
df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
x = df |>
    @where(_.age>40) |>
    @select((  _.name,  _.children)) |>
    Columns

This should be feasible because Columns has both a named and an unnamed version. This way I would solve all the issues related to working with NamedTuples and programmatically found sets of columns.
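
To make the request concrete, here is a minimal sketch in plain Julia 1.x (no Query or IndexedTables involved) of what selecting a run-time list of columns as unnamed tuples amounts to. The variable names rows, wanted, selected and columns are made up for this illustration, and the named-tuple rows merely stand in for whatever a query would yield.

```julia
# Rows as named tuples, standing in for what a query iterates over.
rows = [(name = "John",  age = 23.0, children = 3),
        (name = "Sally", age = 42.0, children = 5),
        (name = "Kirk",  age = 59.0, children = 2)]

# Column names that are only known at run time.
wanted = [:name, :children]

# Filter, then project each row onto a plain (unnamed) tuple.
selected = [Tuple(getproperty(r, c) for c in wanted) for r in rows if r.age > 40]

# Column-wise view of the same data: one vector per requested column.
columns = Tuple([row[i] for row in selected] for i in 1:length(wanted))
```

Because the projection never needs the field names as part of a type, the same code works for any list in wanted, which is the flexibility the unnamed Columns route would provide.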

piever commented on August 12, 2024

This can be closed now that the whole groupapply functionality has been moved to GroupedErrors
