mechanicalrabbit / dataknots.jl

an extensible, practical and coherent algebra of query combinators

Home Page: https://mechanicalrabbit.github.io/DataKnots.jl

License: Other

Julia 100.00%

dsl julia query query-algebra

dataknots.jl's Issues

Explicit handling of missing values

Hello DataKnots team,

I'm looking into the DataKnots project and I'm excited about what I see. It looks like a very powerful tool.

I do have an issue that I'd like to discuss. In my use cases, values are "missing not at random", and I need to treat them with caution. For example, it might be that the lowest true values are always unobserved. Naive behavior when filtering, joining, or aggregating on missing values will lead me to incorrect conclusions.

In base Julia, filter lets me be confident I'm not accidentally dropping significant missing values.

julia> using CSV

# Note the missing salary in the last row.
julia> employee_csv = """
                  name,department,position,salary
                  "ANTHONY A","POLICE","POLICE OFFICER",72510
                  "DANIEL A","FIRE","FIRE FIGHTER-EMT",95484
                  "JAMES A","FIRE","FIRE ENGINEER-EMT",103350
                  "JEFFERY A","POLICE","SERGEANT",101442
                  "NANCY A","POLICE","POLICE OFFICER",80016
                  "ROBERT K","FIRE","FIRE FIGHTER-EMT",
                  """ |> IOBuffer |> CSV.File
6-element CSV.File{false}:
 CSV.Row: (name = "ANTHONY A", department = "POLICE", position = "POLICE OFFICER", salary = 72510)
 CSV.Row: (name = "DANIEL A", department = "FIRE", position = "FIRE FIGHTER-EMT", salary = 95484)
 CSV.Row: (name = "JAMES A", department = "FIRE", position = "FIRE ENGINEER-EMT", salary = 103350)
 CSV.Row: (name = "JEFFERY A", department = "POLICE", position = "SERGEANT", salary = 101442)
 CSV.Row: (name = "NANCY A", department = "POLICE", position = "POLICE OFFICER", salary = 80016)
 CSV.Row: (name = "ROBERT K", department = "FIRE", position = "FIRE FIGHTER-EMT", salary = missing)

julia> filter(x->x.salary < 100_000, employee_csv)
ERROR: TypeError: non-boolean (Missing) used in boolean context
Stacktrace:
 [1] filter(f::var"#11#12", a::CSV.File{false})
   @ Base ./array.jl:2522
 [2] top-level scope
   @ REPL[29]:1

julia> filter(x-> coalesce(x.salary < 100_000, false), employee_csv)
3-element Vector{CSV.Row}:
 CSV.Row: (name = "ANTHONY A", department = "POLICE", position = "POLICE OFFICER", salary = 72510)
 CSV.Row: (name = "DANIEL A", department = "FIRE", position = "FIRE FIGHTER-EMT", salary = 95484)
 CSV.Row: (name = "NANCY A", department = "POLICE", position = "POLICE OFFICER", salary = 80016)
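The behavior above follows from three-valued logic: a comparison against missing yields missing rather than a Bool, so each row falls into one of three groups. A minimal sketch in base Julia, assuming a small hypothetical vector of named tuples in place of the CSV.File:

```julia
# Hypothetical in-memory rows standing in for the CSV.File above.
rows = [
    (name = "ANTHONY A", salary = 72510),
    (name = "JAMES A", salary = 103350),
    (name = "ROBERT K", salary = missing),
]

# A comparison with `missing` propagates `missing` instead of returning a Bool.
below(r) = r.salary < 100_000

# `isequal` always returns a Bool, so it is safe inside `filter`.
matched      = filter(r -> isequal(below(r), true), rows)
unmatched    = filter(r -> isequal(below(r), false), rows)
undetermined = filter(r -> ismissing(below(r)), rows)

# Every row lands in exactly one group, so nothing is dropped unnoticed.
@assert length(matched) + length(unmatched) + length(undetermined) == length(rows)
```

Keeping the undetermined group as its own collection makes the decision about those rows explicit instead of folding them into the "false" case.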

DataKnots.jl, on the other hand, currently drops rows silently when the filter condition evaluates to missing.

julia> using DataKnots

julia> chicago = DataKnot(:employee => employee_csv);

julia> @query chicago begin
        employee
        filter(salary < 100000)
        end
  │ employee                                        │
  │ name       department  position          salary │
──┼─────────────────────────────────────────────────┼
1 │ ANTHONY A  POLICE      POLICE OFFICER     72510 │
2 │ DANIEL A   FIRE        FIRE FIGHTER-EMT   95484 │
3 │ NANCY A    POLICE      POLICE OFFICER     80016 │

Using tools that require me to mentally track missingness and ensure rows aren't silently dropped takes effort I'd rather spend on other parts of my analysis. Tools like passmissing(f)(x) from Missings.jl and f(skipmissing(xs)) from Base make it easier to handle missing values explicitly.
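For readers unfamiliar with these two helpers, here is a small sketch of how they behave. The salaries vector and the band function are hypothetical and exist only for illustration; passmissing comes from Missings.jl, while skipmissing is available in Base.

```julia
using Missings  # provides `passmissing`; `skipmissing` is part of Base

# Hypothetical data: two observed salaries and one missing value.
salaries = [72510, 95484, missing]

# Hypothetical helper that only accepts integers; calling band(missing)
# directly would throw a MethodError.
band(s::Int) = s < 100_000 ? "low" : "high"

# passmissing(f) returns `missing` whenever an argument is missing,
# so the missing entry is carried through rather than dropped.
bands = passmissing(band).(salaries)

# skipmissing makes the decision to ignore missing values visible in the code.
total = sum(skipmissing(salaries))

# Counting the missing entries alongside the total keeps the omission on record.
n_missing = count(ismissing, salaries)
```

In both cases the treatment of missing values is spelled out at the call site, which is the property the issue asks for.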

For more discussion, see JuliaData/DataFrames.jl#2499 about joining tables on missing values.

Differences with other query packages

Hello,

Having seen JuliaDatabases/DBAPI.jl#17 (comment), I wonder what the differences are between DataKnots.jl from @rbt-lang (@xitology ...), which is based on the query combinators described in https://github.com/rbt-lang/rbt-paper/blob/master/pdf/rbt-paper-2016-12-14.pdf,

and other packages such as:

Query.jl from @davidanthoff
DataFramesMeta.jl from @JuliaStats (@tshort @bramtayl ...)
StructuredQueries.jl from @davidagold
LazyQuery.jl from @bramtayl
SplitApplyCombine.jl from @JuliaData (@andyferris ...)

What features are possible with DataKnots.jl that aren't possible with other libraries?

Kind regards

Properly convert a Dict into a DataKnot

Hi there! I'm excited by the potential of this library for my day-to-day work: I appreciate the elegant, coherent and highly composable nature of this approach.

I would expect the following two commands to result in the same DataKnot:

convert(DataKnot,(joe = (bob = [1,2,3], bill=[3,4,5]),))
convert(DataKnot,Dict("joe" => Dict("bob" => [1,2,3], "bill"=>[3,4,5])))

I ask because the latter format is how data is parsed when using JSON.

Doesn't seem like it would be hard to implement. I'm happy to submit a pull request if there's interest.
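One possible shape for such a conversion, as a sketch only: recursively turn a Dict into a NamedTuple and then hand the result to the existing convert method. The function name to_namedtuple is hypothetical and not part of DataKnots.jl. Because Dict iteration order is unspecified, this sketch sorts the keys, so the field order may differ from a hand-written named tuple.

```julia
# Hypothetical helper, not part of DataKnots.jl.
# Values that are not dictionaries are passed through unchanged.
to_namedtuple(x) = x

function to_namedtuple(d::AbstractDict)
    # Dict iteration order is unspecified, so sort keys for a stable field order.
    ks = sort!(collect(keys(d)); by = string)
    names = Tuple(Symbol.(ks))
    values = Tuple(to_namedtuple(d[k]) for k in ks)
    return NamedTuple{names}(values)
end

nested = Dict("joe" => Dict("bob" => [1, 2, 3], "bill" => [3, 4, 5]))
nt = to_namedtuple(nested)

# The result could then be passed on as in the first command of the issue:
# convert(DataKnot, nt)
```

Sorting the keys is a design choice made here for reproducibility; an implementation inside the library might instead preserve insertion order when given an ordered dictionary type.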
