Comments (15)
- are the indices related from one stream to another?
- do we still assume we are doing SELECT *-like queries?
- are these going to be loaded at the same time? (related to 2)
from macrobase.
Why not transform the JSON into relational tables and load them into Postgres? Just wondering how complex the transformation is.
The goal here is to do this in a somewhat generic way. One thought is to have a JSONLoader class that allows you to declare "virtual tables" over each file (e.g., comments: userComments*.json) and then select metrics and attributes of the form table.feature.
There are a few related projects:
- JaQL looks like it got swallowed by IBM: https://code.google.com/archive/p/jaql/
- JsonPath: XPath is not my favorite: http://goessner.net/articles/JsonPath/
- JsonIQ: http://jsoniq.org/
Jackson gives us cross-language support: https://github.com/FasterXML/jackson
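A minimal sketch of that virtual-table idea (Python for brevity; MacroBase itself is Java, and the JSONLoader class name, its methods, and the one-JSON-array-per-file assumption are all hypothetical illustration, not the actual design):

```python
import glob
import json


class JSONLoader:
    """Hypothetical sketch: declare "virtual tables" over JSON files,
    then select metrics/attributes of the form table.feature."""

    def __init__(self):
        self.tables = {}  # table name -> list of row dicts

    def declare_table(self, name, pattern):
        # Assumes each matching file holds one JSON array of records.
        rows = []
        for path in glob.glob(pattern):
            with open(path) as f:
                rows.extend(json.load(f))
        self.tables[name] = rows

    def select(self, qualified_field):
        # "comments.userId" -> all userId values from the comments table.
        table, _, field = qualified_field.partition(".")
        return [row.get(field) for row in self.tables[table]]
```

Usage would look like `loader.declare_table("comments", "userComments*.json")` followed by `loader.select("comments.userId")`.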
> are the indices related from one stream to another?

Likely, yes; you'll want to do joins.

> do we still assume we are doing SELECT *-like queries?

Ideally. Otherwise, how else do we want to do this?

> are these going to be loaded at the same time? (related to 2)

Depends. What's most expedient?
> Why not transform json into relational tables, and load them into postgres? Just wondering how complex the transformation is.

May be doable. Postgres is very slow compared to reading from disk. I am curious -- is there an easy way to take JSON and put it into an in-memory JDBC-like DB instead of Postgres?
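On the in-memory question: Python's stdlib sqlite3 shows the shape of the idea (from Java, an embedded JDBC engine such as H2 or sqlite-jdbc would play the same role). The table and column names below are made up for illustration:

```python
import json
import sqlite3

# Flatten JSON records into an in-memory SQL table. sqlite3 is used here;
# the point is only that an embedded engine avoids a round trip to Postgres.
records = json.loads(
    '[{"dataset_id": 1, "score": 0.5},'
    ' {"dataset_id": 2, "score": 0.9}]'
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mapmatch_history (dataset_id INTEGER, score REAL)")
conn.executemany(
    "INSERT INTO mapmatch_history VALUES (?, ?)",
    [(r["dataset_id"], r["score"]) for r in records],
)

(count,) = conn.execute(
    "SELECT COUNT(dataset_id) FROM mapmatch_history"
).fetchone()
```

Whether this beats reading a disk cache directly would need measuring; it mainly removes the client/server and durability layers from the picture.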
We could load one of each type of file at a time, and consider that to be a "streaming" query, maybe?
@deepakn94 The question is how to do this if we want to join, say, clicks with comments.
Per @raininthesun's suggestion, it should be possible to use Postgres's built-in JSON support.
We could also use Postgres's foreign data wrapper support; someone already has one for JSON: http://pgxn.org/dist/json_fdw/
I believe DeepDive uses Greenplum, and we could too. (However, Greenplum doesn't have JSON support yet.)
Postgres JSON loader: https://github.com/lukasmartinelli/pgfutter
@pbailis I think this leads to a broader question of how we want to handle multiple incoming streams in general (especially if you could possibly join them), right?
This is perhaps an aside, but have you tried to look into why Postgres is so slow? I would expect SELECT * queries, especially without joins, to run at a speed similar to our own disk cache. What's the approximate slowdown with Postgres? Is it because SELECT * is trying to load everything into memory and swapping, and we should be batching reads instead?
The machine we're on has 250 GB of RAM, so I would be surprised if it's memory pressure plus swapping. I don't think Postgres is optimized for full range queries (it performs much better when index lookups are feasible): I suspect the layers of abstraction that keep things well organized completely kill the performance, but I could be wrong here.
I wonder if our use of nested queries also kills performance.
@viveksjain: I believe Postgres is smart enough to push the column selections into the subquery. My hypothesis for why disk caching is fast is that we're only scanning over the columns that we want. In contrast, when we select a particular set of columns from Postgres, it has to scan over all of the data in each row due to its row-oriented layout. With wide rows (i.e., many columns, as in CMT), this is expensive.
A column-oriented storage engine should help here.
postgres=# \timing
Timing is on.
postgres=# SELECT COUNT(dataset_id) FROM (SELECT * from mapmatch_history) AS bq;
count
----------
ZZZ
(1 row)
Time: 16401.287 ms
postgres=# SELECT COUNT(dataset_id) FROM mapmatch_history;
count
----------
ZZZ
(1 row)
Time: 16096.268 ms
postgres=# EXPLAIN SELECT dataset_id FROM (SELECT * from mapmatch_history) AS bq;
QUERY PLAN
-----------------------------------------------------------------------------
Seq Scan on mapmatch_history (cost=0.00..ZZZ rows=ZZZ width=4)
(1 row)
Time: 25.282 ms
postgres=#
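The row- vs. column-layout hypothesis above can be sketched in miniature (Python, purely illustrative; the column names are made up):

```python
# Row-oriented layout (as in Postgres): selecting one column still walks
# every row record, so all of its fields are resident in the scan.
rows = [{"dataset_id": i, "a": 2 * i, "b": 3 * i} for i in range(5)]
from_rows = [r["dataset_id"] for r in rows]

# Column-oriented layout: the same selection reads one contiguous array
# and never touches the other columns at all.
columns = {
    "dataset_id": [r["dataset_id"] for r in rows],
    "a": [r["a"] for r in rows],
    "b": [r["b"] for r in rows],
}
from_columns = columns["dataset_id"]
```

Both layouts yield the same answer; the difference is how many bytes the scan has to pass over per selected column, which grows with row width in the row-oriented case.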
Going to experiment with Postgres this week.