
Comments (10)

sadikovi commented on May 29, 2024

That is a really good idea. Please do submit a PR.

from parquet-index.

sadikovi commented on May 29, 2024

First of all, the feature you are trying to add is similar to bucketing, so it might be worth researching that a little. And yes, bucketing is supported (with managed tables).
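For reference, Spark's built-in bucketing only works through the catalog. A minimal sketch (the DataFrame, table, and column names here are invented for illustration):

```scala
import org.apache.spark.sql.functions.col

// Hypothetical sketch: bucketBy is only accepted together with saveAsTable
// (a managed table); writing bucketed data to a plain path is rejected.
df.write
  .bucketBy(8, "id")              // hash rows into 8 buckets by "id"
  .sortBy("id")                   // optional: keep each bucket sorted
  .saveAsTable("events_bucketed")

// An equality filter on the bucket column can then prune to a single bucket:
spark.table("events_bucketed").filter(col("id") === 42)
```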

  • Yes, you do need to keep track of the columns the dataset is repartitioned by. No one is going to provide that column for you, much less remember which column was used to repartition a dataset created months ago. So you would have to record the column(s) somehow and perform validation to tell the user whether their filter column is the correct one (or fall back to normal filtering; either will do).

  • Probably the name is too specific, but I don't see us adding anything besides Parquet, considering the number of features other data sources have.

  • Not necessarily; it depends on what the predicate is. You can repartition by multiple columns, but then the bucketing filter should only be triggered when the predicate contains all of the required columns.

  • The GroupBy example is a variation of repartition: keys have to be grouped together in order to perform the aggregation. If such a DataFrame is saved, we can also look at the plan and decide to store an index for the groupBy columns.

  • We should support bucketing, IMHO.

  • What about non-equality filters? I did not see any assertions or validation for those. It looks like you can only apply equality filters on a bucketed file layout.

  • IMHO, this feature should be automated, and users should not provide any columns to filter, since we can manage all of that ourselves.

  • I am curious: does the current version of the index show the same performance on a repartitioned dataset? Have you run any benchmarks? If there is no performance improvement, I don't see the point of investing more effort into it.

Anyway, it is a good start, but this feature needs some design work, IMHO.


Aaaaaaron commented on May 29, 2024

Hi @sadikovi, I submitted an initial commit; there is some remaining work to be done. Any review comments are welcome. Thanks a lot!

  • basic logic

  • a strategy for how the "parquet index" and "partition index" work together

  • unit tests


sadikovi commented on May 29, 2024

Great! Do you record partition columns in the index?


sadikovi commented on May 29, 2024

It looks like this is similar to bucketing, which index does not support yet. Maybe we can fix that as well...


sadikovi commented on May 29, 2024

Thanks for the work you have done so far!

I would like to see this feature integrated into the parquet file index, instead of having it as a separate index data source. It could be applied after partition pruning to determine whether the set of files can be reduced even further. This feature is similar to bucketing, and I think we should try making it automatic: the user would just specify filters and we would figure out what actions to take.

Several things to consider:

  • What happens when there are multiple repartitions, e.g. df.repartition(...).repartition(...)? Can we leverage this somehow?
  • What happens when someone provides more than one column in repartition? How does filter work in this case, if at all?
  • What happens when there is an aggregate, e.g. df.groupBy("id").count(). Can we also leverage this information somehow?
  • What happens when someone saves it like df.write.partitionBy().bucketBy().parquet(...)? Table will still be partitioned by the set of columns.
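For concreteness, the four cases above can be written out (a sketch; the DataFrame and column names are invented):

```scala
import org.apache.spark.sql.functions.col

// Illustrative DataFrame with a low-cardinality key column.
val df = spark.range(100).select((col("id") % 10).as("key"), col("id").as("value"))

df.repartition(col("key")).repartition(col("value")) // case 1: each repartition is a full shuffle,
                                                     //         so only the last one fixes the layout
df.repartition(col("key"), col("value"))             // case 2: multi-column repartition
df.groupBy("key").count()                            // case 3: the aggregate shuffles by "key"
// case 4: partitionBy + bucketBy is only accepted with saveAsTable, not .parquet(path)
```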

I think answering these questions would be a very good start. Let's plan this feature a little. We also need to make sure it does not break previous versions (not required, but desirable :)).

Let me know if you disagree or have corrections/suggestions. Thanks.


Aaaaaaron commented on May 29, 2024

Still a draft.

Q: Do you record partition columns in the index?

  • No, it doesn't need to be recorded when the index is created; the partition column is recorded at query time.

Q: Should this feature be integrated into the parquet file index, instead of being a separate index data source?

  • At first I did integrate it into the parquet file index, but then I found it is a universal optimization (note that it is not coupled to ParquetFileFormat). Integrating it into the parquet file index would couple the two tightly, and in my opinion users should be able to use only the partition index (maybe they don't want to build a full index).

  • In my opinion, this project could be renamed to "Spark Index" because of its architectural design... (to be done.)

Q: What happens when there are multiple repartition, e.g. df.repartition(...).repartition(...). Can we leverage this somehow?

  • Good idea, I'll think about it.

Q: What happens when someone provides more than one column in repartition? How does filter work in this case, if at all?

  • It seems the filter only makes sense when it covers all partition columns. If you repartition("a", "b", "c"), files can be pruned only when the filter set is exactly ("a", "b", "c"); a filter on a subset of the columns won't use the index.
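That rule can be sketched in plain Scala (illustrative only: Spark's real bucketing hashes with Murmur3, not hashCode, and the function name here is invented):

```scala
// Prune to a single file only when the equality predicate covers ALL
// repartition columns; otherwise the index does not apply.
def pruneFiles(repartitionCols: Seq[String],
               numPartitions: Int,
               eqFilters: Map[String, Any]): Option[Int] = {
  if (repartitionCols.nonEmpty && repartitionCols.forall(eqFilters.contains)) {
    // Combine the key values in column order, as the writer would have.
    val key = repartitionCols.map(eqFilters).hashCode
    Some(math.floorMod(key, numPartitions)) // index of the single file to read
  } else {
    None // filter on a subset of the columns: fall back to a full scan
  }
}
```

With this sketch, `pruneFiles(Seq("a", "b"), 8, Map("a" -> 1))` returns `None`, while supplying equality filters on both `a` and `b` yields exactly one file index in `[0, 8)`.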

Q: What happens when there is an aggregate, e.g. df.groupBy("id").count(). Can we also leverage this information somehow?

  • Excuse me, I didn't get this. groupBy("id") uses all of id's values, so it still needs to read all files.

  • If you want to optimize aggregations, you can take a look at Apache Kylin. I'm familiar with that project and we are now integrating it with Spark; maybe we can discuss it later, since I think pre-calculation is another kind of "index".

Q: What happens when someone saves it like df.write.partitionBy().bucketBy().parquet(...)? Table will still be partitioned by the set of columns.

  • Good idea, but I'm not familiar with bucketBy.
  • It seems a DataFrame currently cannot be saved to a path with bucketBy(); see org.apache.spark.sql.DataFrameWriter#assertNotBucketed.
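For context, a sketch of the behavior assertNotBucketed enforces (Spark 2.x; the path and table name are invented):

```scala
// Path-based writes reject bucketing via DataFrameWriter#assertNotBucketed:
df.write.bucketBy(8, "id").parquet("/tmp/out")       // throws AnalysisException

// Only catalog (managed) tables accept bucketing:
df.write.bucketBy(8, "id").saveAsTable("t_bucketed") // works
```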


Aaaaaaron commented on May 29, 2024

@sadikovi
Thanks for your reply! I have been very busy recently; sorry for the incomplete comments.

  • Yes, I agree with you: we do need to keep track of the columns the dataset is repartitioned by; that was a lack of consideration on my part. We can integrate this with the original parquet index, since it needs a build step (to record the repartition columns) anyway. IMHO, we should give users an option to build just the "repartition index", since it's very lightweight. The strategy for what to do when a user doesn't have the index still needs to be designed.

  • Although it is integrated with the parquet index, it's still a universal optimization; we should do some decoupling so that if we add other indexes in the future (ORC, etc.), we can just apply it to them (and also for unit tests :)).

  • It will take me a while to understand bucketing, and now may not be a good time; can we move this and the "GroupBy" scenario to the next milestone?

  • The partition index has limited ability and can only handle equality filters; it's essentially just a hash that chooses which file to read.

  • Benchmarks are indeed required; they are missing now and I'll add them later (I may need some of your help). But in principle the overhead is very small, and when it works, an equality ("=") filter selects a single file and prunes all the others.


Q: IMHO, this feature should be automated, and users should not provide any columns to filter, since we can manage all of that ourselves.

  • I didn't quite get this; can you expand on it a little?

Thanks again for your patience.


Aaaaaaron commented on May 29, 2024

Hi @sadikovi
After reading the bucketing code, I found that what I want to do is the same as bucketing (we could use the code in FileSourceStrategy#getExpressionBuckets directly).

And in my opinion, Spark's bucketing has to be used with managed tables. If we keep track of the columns/partition number a dataset is repartitioned by, we can do this in a lighter way (yes, as you mentioned before). But I'm hesitating, because it seems to be exactly "bucketing", except for the way we keep the metadata (partition columns/partition number). Shall we support bucketing, or go our own way?
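The idea behind getExpressionBuckets can be sketched as follows (an illustration of bucket-set pruning with an invented helper, not Spark's actual implementation, which matches predicate expressions against the table's HashPartitioning):

```scala
// Given the candidate values from equality/IN predicates on the bucket
// column, compute the set of buckets that could contain matching rows.
def matchingBuckets(values: Seq[Any], numBuckets: Int): Set[Int] =
  values.map(v => math.floorMod(v.hashCode, numBuckets)).toSet

// e.g. WHERE id = 1 OR id = 2 needs to scan at most two of the buckets
```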


sadikovi commented on May 29, 2024

