jsonsql's People

Contributors

needles-tim, tim-patterson

Forkers

shortysarah

jsonsql's Issues

Change to using Apache Arrow as tuples

The reasons behind this are to enable:

  1. Vectorisation in the future, at the operation level as well as at the expression/function and objectinspector levels.
  2. An efficient serialisable format that is packed in memory and fast on disk, enabling larger in-memory sorts and making it easy to spill to disk in the future if needed.
  3. The ability to pass buffers between "languages" in the same process, i.e. some functions or operators could be implemented in Rust and called via JNI.

At the same time this opportunity should be used to remove:

  1. The compiled function nonsense.
  2. The __all__ thing, replacing it with a true * expansion.

Investigate better CSV parsers

The current CSV parser pulls in quite a few dependencies and doesn't seem to be very configurable around what to treat as null, how separators are escaped, etc.
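
As a rough illustration of the kind of configurability meant here, the sketch below uses Python's standard csv module rather than any parser jsonsql actually uses. The function name read_csv and the null_tokens parameter are hypothetical, chosen only for this example.

```python
import csv
import io


def read_csv(text, delimiter=",", escapechar="\\", null_tokens=frozenset({"", "NULL"})):
    """Parse CSV text into rows, mapping the configured null tokens to None.

    null_tokens is a hypothetical knob: the set of raw field values
    that should be read as SQL NULL.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, escapechar=escapechar)
    return [
        [None if field in null_tokens else field for field in row]
        for row in reader
    ]


# "a\,b" contains an escaped separator and must stay a single field.
rows = read_csv("id,name\n1,NULL\n2,a\\,b\n")
```

A replacement parser would ideally expose both knobs (null tokens and the escape character) per table rather than globally.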

Add support for more natural select *

The * could be treated the same as __all__ is currently, with support for expanding it in the UI, in the file sinks, etc.

Edge cases

Select f.foo from (
  select * from ...
) f

For any field like foo that we can't find during semantic validation, we attempt to pull it out of *:

Select foo from (
  select `*`["foo"] as foo from ...
)

In fact we could even push the * all the way down from the top, i.e. if it's not in the top level select we'd end up writing it out.

The only trouble here is if we had something like:

Select foo from (
  select * from ...
) a
join (
  select * from ...
) b on ...

Here we could generate a coalesce, but it may be more correct to just throw a semantic error.
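
The resolution rule discussed above can be sketched as follows. This is only an illustration under assumed data structures: resolve_field, SemanticError and the (alias, known_fields, has_star) tuples are hypothetical names, and the join case takes the "throw a semantic error" option rather than generating a coalesce.

```python
class SemanticError(Exception):
    """Raised when a field reference cannot be resolved unambiguously."""


def resolve_field(name, sources):
    """Resolve a field name against the sources in scope.

    sources is a list of (alias, known_fields, has_star) tuples, a
    hypothetical stand-in for whatever the semantic validator tracks.
    """
    exact = [alias for alias, fields, _ in sources if name in fields]
    if len(exact) == 1:
        return f"{exact[0]}.{name}"
    if len(exact) > 1:
        raise SemanticError(f"ambiguous field {name!r}: {exact}")

    # Not found by name: fall back to pulling the field out of a * source.
    star = [alias for alias, _, has_star in sources if has_star]
    if len(star) == 1:
        return f'{star[0]}.`*`["{name}"]'
    if not star:
        raise SemanticError(f"unknown field {name!r}")
    raise SemanticError(f"field {name!r} could come from any of {star}")


single = resolve_field("foo", [("f", set(), True)])
```

With two star sources, as in the join above, resolve_field raises instead of guessing which side foo belongs to.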

Change gather to use the same physical children with multiple calls to data()

Currently gather's branches are calculated at query compile time, with a whole bunch of weird hacks to create children with different file sources etc.
Now that we have the data() method, this can be done at runtime.
It might also be better to override the paths via some context object that we pass into the data() method.
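
A minimal sketch of the idea, assuming a data() method that takes a context object. Context, path_override, TableScan and Gather are hypothetical names for this example, and the files dict stands in for real file IO.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Context:
    # Hypothetical per-call override of the path the scan reads from.
    path_override: Optional[str] = None


class TableScan:
    def __init__(self, path, files):
        self.path = path
        self.files = files  # dict of path -> rows, standing in for real files

    def data(self, ctx):
        path = ctx.path_override or self.path
        yield from self.files[path]


class Gather:
    """Reuses ONE physical child, calling data() once per path at runtime."""

    def __init__(self, child, paths):
        self.child = child
        self.paths = paths

    def data(self, ctx):
        for path in self.paths:
            yield from self.child.data(replace(ctx, path_override=path))


files = {"a.json": [{"x": 1}], "b.json": [{"x": 2}]}
gather = Gather(TableScan("a.json", files), ["a.json", "b.json"])
gathered = list(gather.data(Context()))
```

The sketch runs the branches sequentially; a real gather would presumably run the data() calls concurrently.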

Create aliasing operator

To tidy up the code and remove the need for tableAlias: String? in GroupByOperator, ProjectOperator and TableScanOperator.
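
One possible shape for such an operator, sketched in Python rather than the project's own language. ListScan, AliasOperator and the columns()/data() methods are hypothetical, used only to show the alias living in a single wrapper instead of on every operator.

```python
class ListScan:
    """Stand-in child operator that knows nothing about aliases."""

    def __init__(self, column_names, rows):
        self.column_names = column_names
        self.rows = rows

    def columns(self):
        return list(self.column_names)

    def data(self):
        return iter(self.rows)


class AliasOperator:
    """Wraps any child and qualifies its columns with a table alias."""

    def __init__(self, child, table_alias):
        self.child = child
        self.table_alias = table_alias

    def columns(self):
        return [f"{self.table_alias}.{name}" for name in self.child.columns()]

    def data(self):
        # Rows pass through untouched; only the column naming changes.
        return self.child.data()


aliased = AliasOperator(ListScan(["id", "name"], [(1, "a")]), "t")
```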

S3 caching

When exploring S3 datasets it might be useful to implement an S3 caching strategy.
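
One simple strategy would be a local read-through file cache keyed by object URI, sketched below. FileCache and the injected fetch function are hypothetical, no real S3 client is called, and the bucket name is made up. The sketch also ignores invalidation, which a real cache would need to handle if objects change.

```python
import hashlib
import os
import tempfile


class FileCache:
    """Read-through cache: fetch each URI once, then serve it from local disk."""

    def __init__(self, cache_dir, fetch):
        self.cache_dir = cache_dir
        self.fetch = fetch  # hypothetical callable: uri -> bytes
        os.makedirs(cache_dir, exist_ok=True)

    def local_path(self, uri):
        key = hashlib.sha256(uri.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, key)

    def read(self, uri):
        path = self.local_path(uri)
        if not os.path.exists(path):
            # Write to a temp file first so a failed download never
            # leaves a partial file under the final cache key.
            tmp_path = path + ".tmp"
            with open(tmp_path, "wb") as f:
                f.write(self.fetch(uri))
            os.replace(tmp_path, path)
        with open(path, "rb") as f:
            return f.read()


calls = []


def fake_fetch(uri):
    calls.append(uri)
    return b'{"a": 1}\n'


with tempfile.TemporaryDirectory() as cache_dir:
    cache = FileCache(cache_dir, fake_fetch)
    first = cache.read("s3://example-bucket/data.json")
    second = cache.read("s3://example-bucket/data.json")
```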

Add support for explain logical

We should support printing the logical operator tree as SQL after any optimisations and query rewrites (select distinct, from ( select * ...), for debugging purposes.

Dir table type

If we add a table type of dir alongside csv and json we could do stuff like:

select filename, size from dir '/some/directory' order by size desc limit 10

or even

-- top directories ordered by volume of json files
select parent_dir, sum(size) as total_size
from dir '/some/directory'
where extension = 'json'
group by parent_dir
order by total_size desc limit 10
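
To make the proposal concrete, here is a sketch of the rows such a dir table might produce, using the column names from the queries above (filename, size, parent_dir, extension). The function names are hypothetical, and the aggregation only mimics the second query.

```python
import os
import tempfile
from collections import defaultdict


def dir_rows(root):
    """Yield one row per file under root, recursively."""
    for parent_dir, _subdirs, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(parent_dir, filename)
            yield {
                "filename": filename,
                "parent_dir": parent_dir,
                "extension": os.path.splitext(filename)[1].lstrip("."),
                "size": os.path.getsize(path),
            }


def total_size_by_parent(rows, extension):
    """Equivalent of: where extension = ... group by parent_dir order by total_size desc."""
    totals = defaultdict(int)
    for row in rows:
        if row["extension"] == extension:
            totals[row["parent_dir"]] += row["size"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "a.json"), "w") as f:
        f.write("{}")
    with open(os.path.join(root, "b.csv"), "w") as f:
        f.write("x,y,z")
    top = total_size_by_parent(dir_rows(root), "json")[:10]
    expected = [(root, 2)]
```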

Allow describe, explain etc. to be used inside of selects

The reasoning for this is to allow stuff like:

INSERT INTO csv 'create_table.sql'
DESCRIBE json 's3://...'

It could also eventually be used for stuff like:

SELECT function_name, description FROM (
  SHOW functions 
) WHERE function_name like '%str%'

Auto alias dot notation

When using dot notation to access nested fields, i.e. field.subfield, the expression just gets a default alias of _col1 etc.
It should be easy enough to capture this info when building the AST, to at least have a saner default.
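
A sketch of how the default alias could be derived while building the AST. The node classes here are hypothetical, and using the last path segment (subfield) as the alias is only one possible choice.

```python
from dataclasses import dataclass


@dataclass
class Identifier:
    name: str


@dataclass
class FieldAccess:
    parent: object  # Identifier or another FieldAccess
    name: str


@dataclass
class Literal:
    value: object


def default_alias(expr, position):
    """Pick a default alias for a select expression at the given position."""
    if isinstance(expr, (Identifier, FieldAccess)):
        # For field.subfield this yields "subfield" instead of _colN.
        return expr.name
    return f"_col{position}"


nested = FieldAccess(Identifier("field"), "subfield")
alias = default_alias(nested, 1)
```

Anything that is not a plain field reference still falls back to the existing _colN scheme.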
