If someone does a write/read in a map/reduce operation, the flowDef will be null (beca

Check for null flowDef (scalding issue, closed)

twitter commented on May 14, 2024

Check for null flowDef
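A minimal sketch of what such a check could look like, in plain Scala. The `FlowDef` class here is a hypothetical stand-in and `checked` is an illustrative name; neither is taken from Scalding or Cascading. The idea is only to fail with a descriptive message rather than a bare NullPointerException.

```scala
object FlowDefGuard {
  // Hypothetical stand-in for the job's flow definition; not the Cascading class.
  final class FlowDef

  // Returns the flowDef unchanged when it is set, and otherwise fails fast
  // with a message that names the likely cause.
  def checked(flowDef: FlowDef): FlowDef = {
    if (flowDef == null)
      throw new IllegalStateException(
        "flowDef is null: read/write was called inside a map/reduce operation")
    flowDef
  }
}
```

A read or write path would call `FlowDefGuard.checked(flowDef)` before touching the flow definition.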

Comments (2)

krishnanraman commented on May 14, 2024

I tried the following for the problem I'm facing.
I have a large pipe with 2 columns, say 'advertiser_id and 'bids.
Say I limit it to 100 rows.
Now I have a data file with 100 rows and 2 columns.
I want each row to go into its own Tsv, named after that row's 'advertiser_id entry.
I tried writing to 100 files inside a map, and it fails with an NPE because flowDef is null.

But what if I put the data file in a matrix and transpose it to get a 2-row, 100-column matrix?
Then I run a map and filter on the column name, and return a pipe that contains the bids for that advertiser.
Since there are exactly 100 pipes, I just iterate over the pipes in a regular for loop and write to the sinks.
This works locally for a small number of columns, without needing PailSource or TemplateTap.

Is this a scalable solution?
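The shape of this workaround (one filter and one sink per key, driven by an ordinary loop) can be sketched in plain Scala. This uses no Scalding API: an in-memory `Seq` stands in for the pipe, and the `Row` type, the `Long` bids column, and the `<id>.tsv` file naming are all assumptions made for illustration. It also assumes advertiser ids are safe to use as file names.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

object PerAdvertiserSinks {
  // Hypothetical row type: one ('advertiser_id, 'bids) pair per row.
  final case class Row(advertiserId: String, bids: Long)

  // Writes one TSV file per distinct advertiser id under outDir and
  // returns the paths written, keyed by advertiser id.
  def writePerAdvertiser(rows: Seq[Row], outDir: Path): Map[String, Path] = {
    Files.createDirectories(outDir)
    // The key set is small and known up front (100 rows in the example above).
    val ids = rows.map(_.advertiserId).distinct
    // Ordinary loop outside any map/reduce function: one filter and one sink per key.
    ids.map { id =>
      val lines = rows
        .filter(_.advertiserId == id)
        .map(r => s"${r.advertiserId}\t${r.bids}")
      val target = outDir.resolve(s"$id.tsv")
      Files.write(target, lines.mkString("", "\n", "\n").getBytes(StandardCharsets.UTF_8))
      id -> target
    }.toMap
  }
}
```

In this sketch every key triggers its own full pass over the rows, so the work grows with the number of distinct advertiser ids times the number of rows.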

johnynek commented on May 14, 2024

I think this is going to be much more computationally expensive, but I don't really know.

This is exactly the problem that Pail is supposed to solve, so I really think we should investigate it.

By the way, next year, we will start sharding our processed logs more by column data, so we have to start using Pail more, as far as I can see. I think now is the time to adopt it more widely.

PS: Sam Ritchie can help you get started on your first Pail source.
