
Comments (15)

mamikonyana commented on September 18, 2024
  1. Are the indices related from one stream to another?
  2. Do we still assume we are doing SELECT *-like queries?
  3. Are these going to be loaded at the same time? (related to 2)

from macrobase.

raininthesun commented on September 18, 2024

Why not transform the JSON into relational tables and load them into Postgres? Just wondering how complex the transformation is.


pbailis commented on September 18, 2024

The goal here is to do this in a somewhat generic way.

One thought is to have a JSONLoader class that allows you to declare "virtual tables" over each file (e.g., comments: userComments*.json) and then select metrics and attributes of the form table.feature.
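A minimal sketch of what such a loader might look like (in Python for brevity; the class name, glob-pattern registration, newline-delimited JSON assumption, and `table.feature` path syntax are all assumptions here, not an existing API):

```python
import glob
import json
import os
import tempfile


class JSONLoader:
    """Hypothetical loader exposing 'virtual tables' over JSON files."""

    def __init__(self):
        self.tables = {}  # table name -> list of parsed records

    def declare(self, table, pattern):
        """Register a virtual table over files matching a glob pattern,
        e.g. declare("comments", "userComments*.json")."""
        records = []
        for path in sorted(glob.glob(pattern)):
            with open(path) as f:
                for line in f:  # assume newline-delimited JSON records
                    records.append(json.loads(line))
        self.tables[table] = records

    def select(self, *features):
        """Project features written as "table.field", e.g. "comments.score"."""
        table = features[0].split(".")[0]
        fields = [f.split(".", 1)[1] for f in features]
        return [tuple(rec[f] for f in fields) for rec in self.tables[table]]


# Tiny demo: write one NDJSON file, declare a virtual table, project from it.
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "userComments1.json"), "w") as f:
    f.write(json.dumps({"user": "a", "score": 3}) + "\n")

loader = JSONLoader()
loader.declare("comments", os.path.join(tmp, "userComments*.json"))
demo = loader.select("comments.user", "comments.score")
```

A real version would need schema inference for nested fields, but the registration-then-project shape is the idea.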

There are a few related projects:

Jackson gives us cross-language support: https://github.com/FasterXML/jackson


pbailis commented on September 18, 2024

@mamikonyana

Are the indices related from one stream to another?

Likely, yes, you'll want to do joins.

Do we still assume we are doing SELECT *-like queries?

Ideally. Otherwise, how else do we want to do this?

Are these going to be loaded at the same time? (related to 2)

Depends. What's most expedient?
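To make the join point concrete, a minimal sketch (with hypothetical field names; `clicks` and `comments` sharing a `user_id` is an assumption) of a hash join over two record streams, where one side fits in memory:

```python
def hash_join(left, right, key):
    """Join two iterables of dict records on a shared key field.

    Builds a hash table over `left` (assumed to fit in memory),
    then streams over `right` once.
    """
    index = {}
    for rec in left:
        index.setdefault(rec[key], []).append(rec)
    for rec in right:
        for match in index.get(rec[key], []):
            # Merge the two records; right-side fields win on collision.
            yield {**match, **rec}


# Hypothetical streams sharing a user_id key.
clicks = [{"user_id": 1, "url": "/a"}, {"user_id": 2, "url": "/b"}]
comments = [{"user_id": 1, "text": "hi"}]
joined = list(hash_join(clicks, comments, "user_id"))
```

If neither side fits in memory, this degenerates into the usual sort-merge or partitioned-hash variants, which is part of why pushing joins into a database is tempting.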


pbailis commented on September 18, 2024

@raininthesun

Why not transform the JSON into relational tables and load them into Postgres? Just wondering how complex the transformation is.

May be doable. Postgres is very slow compared to reading directly from disk. I am curious -- is there an easy way to take JSON and put it into an in-memory, JDBC-accessible database instead of Postgres?
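One low-friction possibility (sketched here with Python's stdlib `sqlite3` rather than a JDBC driver, purely to illustrate the shape; SQLite does have JDBC drivers on the Java side): load flat JSON records into an in-memory SQLite table and query it with SQL.

```python
import json
import sqlite3


def load_json_into_sqlite(records):
    """Load flat JSON records into an in-memory SQLite table.

    Assumes every record shares the same flat schema; nested JSON
    would need flattening first.
    """
    conn = sqlite3.connect(":memory:")
    cols = list(records[0].keys())
    conn.execute(f"CREATE TABLE data ({', '.join(cols)})")
    placeholders = ", ".join("?" for _ in cols)
    conn.executemany(
        f"INSERT INTO data VALUES ({placeholders})",
        [tuple(r[c] for c in cols) for r in records],
    )
    return conn


# Hypothetical records resembling the CMT-style data discussed above.
records = [json.loads(s) for s in (
    '{"dataset_id": 1, "score": 0.9}',
    '{"dataset_id": 2, "score": 0.4}',
)]
conn = load_json_into_sqlite(records)
```

Whether an embedded engine like this is actually faster than Postgres for full scans is an open question; it mainly removes the client/server round trips.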


deepakn94 commented on September 18, 2024

We could load one file of each type at a time and treat that as a "streaming" query, maybe?


pbailis commented on September 18, 2024

@deepakn94 The question is how to do this if we want to join, say, clicks with comments.


pbailis commented on September 18, 2024

Per @raininthesun's suggestion, it should be possible to use Postgres's built-in JSON support.

We could also use Postgres's foreign data wrapper support; someone has already built one for JSON: http://pgxn.org/dist/json_fdw/

I believe DeepDive uses Greenplum, and we could too. (However, Greenplum doesn't have JSON support yet.)


pbailis commented on September 18, 2024

Postgres JSON loader: https://github.com/lukasmartinelli/pgfutter


deepakn94 commented on September 18, 2024

@pbailis I think this leads to a broader question of how we want to handle multiple incoming streams in general (especially if you could possibly join them), right?


viveksjain commented on September 18, 2024

This is perhaps an aside, but have you looked into why Postgres is so slow? I would expect SELECT * queries, especially without joins, to run at a similar speed to our own disk cache. What's the approximate slowdown with Postgres? Is it because SELECT * tries to load everything into memory and swaps, and we should be batching reads instead?


deepakn94 commented on September 18, 2024

The machine we're on has 250 GB of RAM, so I would be surprised if it's memory pressure plus swapping. I don't think Postgres is optimized for full range queries (it performs much better when index lookups are feasible): I think the layers of abstraction that keep things well organized completely kill the performance, but I could be wrong here.


viveksjain commented on September 18, 2024

I wonder if our use of nested queries also kills performance.


pbailis commented on September 18, 2024

@viveksjain: I believe Postgres is smart enough to push the column selections into the subquery. My hypothesis for why disk caching is fast is that we scan only the columns we want. In contrast, when we select a particular set of columns from Postgres, it has to scan all of the data in each row due to its row-oriented layout. With wide rows (i.e., many columns, as in CMT), this is expensive.

A column-oriented storage engine should help here.

postgres=# \timing
Timing is on.
postgres=# SELECT COUNT(dataset_id) FROM (SELECT * from mapmatch_history) AS bq;
  count   
----------
 ZZZ
(1 row)

Time: 16401.287 ms
postgres=# SELECT COUNT(dataset_id) FROM mapmatch_history;
  count   
----------
 ZZZ
(1 row)

Time: 16096.268 ms
postgres=# EXPLAIN SELECT dataset_id FROM (SELECT * from mapmatch_history) AS bq;
                                 QUERY PLAN                                  
-----------------------------------------------------------------------------
 Seq Scan on mapmatch_history  (cost=0.00..ZZZ rows=ZZZ width=4)
(1 row)

Time: 25.282 ms
postgres=# 
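A toy illustration of the row- vs. column-layout point (pure Python, not a Postgres measurement; the field names are made up): projecting one field from a row store touches every wide row, while a column store hands back the needed array directly.

```python
# Toy row store: each record is a wide row with a bulky payload field.
rows = [{"dataset_id": i, "payload": "x" * 100, "score": i * 0.5}
        for i in range(1000)]

# Row layout: projecting one column still visits every wide row,
# which is the scan Postgres does on SELECT score FROM ...
row_scan = [r["score"] for r in rows]

# Column layout: each column lives in its own contiguous list, so a
# projection just picks the one list -- the payload bytes are never touched.
columns = {
    "dataset_id": [r["dataset_id"] for r in rows],
    "payload": [r["payload"] for r in rows],
    "score": [r["score"] for r in rows],
}
col_scan = columns["score"]
```

Both scans produce identical results; the difference is how many bytes each layout forces the scan to walk past.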


pbailis commented on September 18, 2024

Going to experiment with Postgres this week.

