
Comments (15)

mamikonyana commented on September 18, 2024
  1. Are the indices related from one stream to another?
  2. Do we still assume we are doing SELECT *-like queries?
  3. Are these going to be loaded at the same time? (related to 2)

from macrobase.

raininthesun commented on September 18, 2024

Why not transform the JSON into relational tables and load them into Postgres? Just wondering how complex the transformation is.


pbailis commented on September 18, 2024

The goal here is to do this in a somewhat generic way.

One thought is to have a JSONLoader class that allows you to declare "virtual tables" over each file (e.g., comments: userComments*.json) and then select metrics and attributes of the form table.feature.
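A minimal sketch of what such a loader might look like (in Python for brevity; the class name, glob-pattern registration, newline-delimited JSON assumption, and `table.feature` path syntax are all assumptions here, not an existing API):

```python
import glob
import json
import os
import tempfile


class JSONLoader:
    """Hypothetical loader exposing 'virtual tables' over JSON files."""

    def __init__(self):
        self.tables = {}  # table name -> list of parsed records

    def declare(self, table, pattern):
        """Register a virtual table over files matching a glob pattern,
        e.g. declare("comments", "userComments*.json")."""
        records = []
        for path in sorted(glob.glob(pattern)):
            with open(path) as f:
                for line in f:  # assume newline-delimited JSON records
                    records.append(json.loads(line))
        self.tables[table] = records

    def select(self, *features):
        """Project features written as "table.field", e.g. "comments.score"."""
        table = features[0].split(".")[0]
        fields = [f.split(".", 1)[1] for f in features]
        return [tuple(rec[f] for f in fields) for rec in self.tables[table]]


# Tiny demo: write one NDJSON file, declare a virtual table, project from it.
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "userComments1.json"), "w") as f:
    f.write(json.dumps({"user": "a", "score": 3}) + "\n")

loader = JSONLoader()
loader.declare("comments", os.path.join(tmp, "userComments*.json"))
demo = loader.select("comments.user", "comments.score")
```

A real version would need schema inference for nested fields, but the registration-then-project shape is the idea.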

There are a few related projects:

Jackson gives us cross-language support: https://github.com/FasterXML/jackson


pbailis commented on September 18, 2024

@mamikonyana

Are the indices related from one stream to another?

Likely, yes, you'll want to do joins.

Do we still assume we are doing SELECT *-like queries?

Ideally. Otherwise, how else do we want to do this?

Are these going to be loaded at the same time? (related to 2)

Depends. What's most expedient?
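To make the join point concrete, a minimal sketch (with hypothetical field names; `clicks` and `comments` sharing a `user_id` is an assumption) of a hash join over two record streams, where one side fits in memory:

```python
def hash_join(left, right, key):
    """Join two iterables of dict records on a shared key field.

    Builds a hash table over `left` (assumed to fit in memory),
    then streams over `right` once.
    """
    index = {}
    for rec in left:
        index.setdefault(rec[key], []).append(rec)
    for rec in right:
        for match in index.get(rec[key], []):
            # Merge the two records; right-side fields win on collision.
            yield {**match, **rec}


# Hypothetical streams sharing a user_id key.
clicks = [{"user_id": 1, "url": "/a"}, {"user_id": 2, "url": "/b"}]
comments = [{"user_id": 1, "text": "hi"}]
joined = list(hash_join(clicks, comments, "user_id"))
```

If neither side fits in memory, this degenerates into the usual sort-merge or partitioned-hash variants, which is part of why pushing joins into a database is tempting.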


pbailis commented on September 18, 2024

@raininthesun

Why not transform the JSON into relational tables and load them into Postgres? Just wondering how complex the transformation is.

May be doable. Postgres is very slow compared to reading directly from disk. I am curious -- is there an easy way to take JSON and put it into an in-memory, JDBC-accessible database instead of Postgres?
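One low-friction possibility (sketched here with Python's stdlib `sqlite3` rather than a JDBC driver, purely to illustrate the shape; SQLite does have JDBC drivers on the Java side): load flat JSON records into an in-memory SQLite table and query it with SQL.

```python
import json
import sqlite3


def load_json_into_sqlite(records):
    """Load flat JSON records into an in-memory SQLite table.

    Assumes every record shares the same flat schema; nested JSON
    would need flattening first.
    """
    conn = sqlite3.connect(":memory:")
    cols = list(records[0].keys())
    conn.execute(f"CREATE TABLE data ({', '.join(cols)})")
    placeholders = ", ".join("?" for _ in cols)
    conn.executemany(
        f"INSERT INTO data VALUES ({placeholders})",
        [tuple(r[c] for c in cols) for r in records],
    )
    return conn


# Hypothetical records resembling the CMT-style data discussed above.
records = [json.loads(s) for s in (
    '{"dataset_id": 1, "score": 0.9}',
    '{"dataset_id": 2, "score": 0.4}',
)]
conn = load_json_into_sqlite(records)
```

Whether an embedded engine like this is actually faster than Postgres for full scans is an open question; it mainly removes the client/server round trips.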


deepakn94 commented on September 18, 2024

We could load one file of each type at a time and treat that as a "streaming" query, maybe?


pbailis commented on September 18, 2024

@deepakn94 The question is how to do this if we want to join, say, clicks with comments.


pbailis commented on September 18, 2024

Per @raininthesun's suggestion, it should be possible to use Postgres's built-in JSON support.

We could also use Postgres's foreign data wrapper support; someone has already built one for JSON: http://pgxn.org/dist/json_fdw/

I believe DeepDive uses Greenplum, and we could too. (However, Greenplum doesn't have JSON support yet.)


pbailis commented on September 18, 2024

Postgres JSON loader: https://github.com/lukasmartinelli/pgfutter


deepakn94 commented on September 18, 2024

@pbailis I think this leads to a broader question of how we want to handle multiple incoming streams in general (especially if you could possibly join them), right?


viveksjain commented on September 18, 2024

This is perhaps an aside, but have you looked into why Postgres is so slow? I would expect SELECT * queries, especially without joins, to run at a similar speed to our own disk cache. What's the approximate slowdown with Postgres? Is it because SELECT * tries to load everything into memory and swaps, and we should be batching reads instead?


deepakn94 commented on September 18, 2024

The machine we're on has 250 GB of RAM, so I would be surprised if it's memory pressure plus swapping. I don't think Postgres is optimized for full range queries (it performs much better when index lookups are feasible): I think the layers of abstraction that keep things well organized completely kill the performance, but I could be wrong here.


viveksjain commented on September 18, 2024

I wonder if our use of nested queries also kills performance.


pbailis commented on September 18, 2024

@viveksjain: I believe Postgres is smart enough to push the column selections into the subquery. My hypothesis for why disk caching is fast is that we scan only the columns we want. In contrast, when we select a particular set of columns from Postgres, it has to scan all of the data in each row due to its row-oriented layout. With wide rows (i.e., many columns, as in CMT), this is expensive.

A column-oriented storage engine should help here.

postgres=# \timing
Timing is on.
postgres=# SELECT COUNT(dataset_id) FROM (SELECT * from mapmatch_history) AS bq;
  count   
----------
 ZZZ
(1 row)

Time: 16401.287 ms
postgres=# SELECT COUNT(dataset_id) FROM mapmatch_history;
  count   
----------
 ZZZ
(1 row)

Time: 16096.268 ms
postgres=# EXPLAIN SELECT dataset_id FROM (SELECT * from mapmatch_history) AS bq;
                                 QUERY PLAN                                  
-----------------------------------------------------------------------------
 Seq Scan on mapmatch_history  (cost=0.00..ZZZ rows=ZZZ width=4)
(1 row)

Time: 25.282 ms
postgres=# 
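A toy illustration of the row- vs. column-layout point (pure Python, not a Postgres measurement; the field names are made up): projecting one field from a row store touches every wide row, while a column store hands back the needed array directly.

```python
# Toy row store: each record is a wide row with a bulky payload field.
rows = [{"dataset_id": i, "payload": "x" * 100, "score": i * 0.5}
        for i in range(1000)]

# Row layout: projecting one column still visits every wide row,
# which is the scan Postgres does on SELECT score FROM ...
row_scan = [r["score"] for r in rows]

# Column layout: each column lives in its own contiguous list, so a
# projection just picks the one list -- the payload bytes are never touched.
columns = {
    "dataset_id": [r["dataset_id"] for r in rows],
    "payload": [r["payload"] for r in rows],
    "score": [r["score"] for r in rows],
}
col_scan = columns["score"]
```

Both scans produce identical results; the difference is how many bytes each layout forces the scan to walk past.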


pbailis commented on September 18, 2024

Going to experiment with Postgres this week.

