
Comments (10)

sadikovi commented on May 29, 2024

That is a really good idea. Please do submit a PR.

from parquet-index.

sadikovi commented on May 29, 2024

First of all, the feature you are trying to add is similar to bucketing, so it might be worth researching that a little. And yes, bucketing is supported (with managed tables).
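For reference, Spark's built-in bucketing only works through the catalog. A minimal sketch (the DataFrame, table, and column names here are invented for illustration):

```scala
import org.apache.spark.sql.functions.col

// Hypothetical sketch: bucketBy is only accepted together with saveAsTable
// (a managed table); writing bucketed data to a plain path is rejected.
df.write
  .bucketBy(8, "id")              // hash rows into 8 buckets by "id"
  .sortBy("id")                   // optional: keep each bucket sorted
  .saveAsTable("events_bucketed")

// An equality filter on the bucket column can then prune to a single bucket:
spark.table("events_bucketed").filter(col("id") === 42)
```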

  • Yes, you do need to keep track of the columns the dataset is repartitioned by. No one is going to provide that column for you, much less remember which column was used to repartition a dataset created months ago. So you would have to record the column(s) somehow and perform validation to tell the user whether their filter column is the correct one (or fall back to normal filtering; either will do).

  • Probably the name is too specific, but I don't see us adding anything besides Parquet, considering the number of features other data sources have.

  • Not necessarily; it depends on what the predicate is. You can repartition by multiple columns, but then the bucketing filter should only be triggered when the predicate contains all of the required columns.

  • The GroupBy example is a variation of repartition: keys have to be grouped together in order to perform the aggregation. If such a DataFrame is saved, we can also look at the plan and decide to store an index for the groupBy columns.

  • We should support bucketing, IMHO.

  • What about non-equality filters? I did not see any assertions or validation for those. It looks like you can only apply equality filters on a bucketed file layout.

  • IMHO, this feature should be automated, and users should not provide any columns to filter, since we can manage all of that ourselves.

  • I am curious: does the current version of the index show the same performance on a repartitioned dataset? Have you run any benchmarks? If there is no performance improvement, I don't see the point of investing more effort into it.

Anyway, it is a good start, but this feature needs some design work, IMHO.


Aaaaaaron commented on May 29, 2024

Hi @sadikovi, I submitted an initial commit; there is some remaining work to be done. Any review comments are welcome. Thanks a lot!

  • basic logic

  • a strategy for how the "parquet index" and "partition index" work together

  • unit tests


sadikovi commented on May 29, 2024

Great! Do you record partition columns in the index?


sadikovi commented on May 29, 2024

It looks like this is similar to bucketing, which index does not support yet. Maybe we can fix that as well...


sadikovi commented on May 29, 2024

Thanks for the work you have done so far!

I would like to see this feature integrated into the parquet file index, instead of having it as a separate index data source. It could be applied after partition pruning to determine whether the set of files can be reduced even further. This feature is similar to bucketing, and I think we should try making it automatic: the user would just specify filters and we would figure out what actions to take.

Several things to consider:

  • What happens when there are multiple repartitions, e.g. df.repartition(...).repartition(...)? Can we leverage this somehow?
  • What happens when someone provides more than one column in repartition? How does filter work in this case, if at all?
  • What happens when there is an aggregate, e.g. df.groupBy("id").count(). Can we also leverage this information somehow?
  • What happens when someone saves it like df.write.partitionBy().bucketBy().parquet(...)? Table will still be partitioned by the set of columns.
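For concreteness, the four cases above can be written out (a sketch; the DataFrame and column names are invented):

```scala
import org.apache.spark.sql.functions.col

// Illustrative DataFrame with a low-cardinality key column.
val df = spark.range(100).select((col("id") % 10).as("key"), col("id").as("value"))

df.repartition(col("key")).repartition(col("value")) // case 1: each repartition is a full shuffle,
                                                     //         so only the last one fixes the layout
df.repartition(col("key"), col("value"))             // case 2: multi-column repartition
df.groupBy("key").count()                            // case 3: the aggregate shuffles by "key"
// case 4: partitionBy + bucketBy is only accepted with saveAsTable, not .parquet(path)
```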

I think answering these questions would be a very good start. Let's plan this feature a little. We also need to make sure it does not break previous versions (not required, but desirable :)).

Let me know if you disagree or have corrections/suggestions. Thanks.


Aaaaaaron commented on May 29, 2024

Still a draft.

Q: Do you record partition columns in the index?

  • No, it doesn't need to be recorded when the index is created; the partition column is recorded at query time.

Q: Should this feature be integrated into the parquet file index, instead of being a separate index data source?

  • At first I did integrate it into the parquet file index, but then I found it is a universal optimization (note that it is not coupled to ParquetFileFormat). Integrating it into the parquet file index would couple the two tightly, and in my opinion users should be able to use only the partition index (maybe they don't want to build a full index).

  • In my opinion, this project could be renamed to "Spark Index" because of its architectural design... (to be done.)

Q: What happens when there are multiple repartition, e.g. df.repartition(...).repartition(...). Can we leverage this somehow?

  • Good idea, I'll think about it.

Q: What happens when someone provides more than one column in repartition? How does filter work in this case, if at all?

  • It seems the filter only makes sense when it covers all partition columns. If you repartition("a", "b", "c"), files can be pruned only when the filter set is exactly ("a", "b", "c"); a filter on a subset of the columns won't use the index.
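That rule can be sketched in plain Scala (illustrative only: Spark's real bucketing hashes with Murmur3, not hashCode, and the function name here is invented):

```scala
// Prune to a single file only when the equality predicate covers ALL
// repartition columns; otherwise the index does not apply.
def pruneFiles(repartitionCols: Seq[String],
               numPartitions: Int,
               eqFilters: Map[String, Any]): Option[Int] = {
  if (repartitionCols.nonEmpty && repartitionCols.forall(eqFilters.contains)) {
    // Combine the key values in column order, as the writer would have.
    val key = repartitionCols.map(eqFilters).hashCode
    Some(math.floorMod(key, numPartitions)) // index of the single file to read
  } else {
    None // filter on a subset of the columns: fall back to a full scan
  }
}
```

With this sketch, `pruneFiles(Seq("a", "b"), 8, Map("a" -> 1))` returns `None`, while supplying equality filters on both `a` and `b` yields exactly one file index in `[0, 8)`.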

Q: What happens when there is an aggregate, e.g. df.groupBy("id").count(). Can we also leverage this information somehow?

  • Excuse me, I didn't get this. groupBy("id") uses all of id's values, so it still needs to read all files.

  • If you want to optimize aggregations, you can take a look at Apache Kylin. I'm familiar with that project and we are now integrating it with Spark; maybe we can discuss it later, since I think pre-calculation is another kind of "index".

Q: What happens when someone saves it like df.write.partitionBy().bucketBy().parquet(...)? Table will still be partitioned by the set of columns.

  • Good idea, but I'm not familiar with bucketBy.
  • It seems a DataFrame currently cannot be saved to a path with bucketBy(); see org.apache.spark.sql.DataFrameWriter#assertNotBucketed.
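For context, a sketch of the behavior assertNotBucketed enforces (Spark 2.x; the path and table name are invented):

```scala
// Path-based writes reject bucketing via DataFrameWriter#assertNotBucketed:
df.write.bucketBy(8, "id").parquet("/tmp/out")       // throws AnalysisException

// Only catalog (managed) tables accept bucketing:
df.write.bucketBy(8, "id").saveAsTable("t_bucketed") // works
```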


Aaaaaaron commented on May 29, 2024

@sadikovi
Thanks for your reply! I have been very busy recently; sorry for the incomplete comments.

  • Yes, I agree with you: we do need to keep track of the columns the dataset is repartitioned by; that was a lack of consideration on my part. We can integrate this with the original parquet index, since it needs a build step (to record the repartition columns) anyway. IMHO, we should give users an option to build just the "repartition index", since it's very lightweight. The strategy for what to do when a user doesn't have the index still needs to be designed.

  • Although it is integrated with the parquet index, it's still a universal optimization; we should do some decoupling so that if we add other indexes in the future (ORC, etc.), we can just apply it to them (and also for unit tests :)).

  • It will take me a while to understand bucketing, and now may not be a good time; can we move this and the "GroupBy" scenario to the next milestone?

  • The partition index has limited ability and can only handle equality filters; it's essentially just a hash that chooses which file to read.

  • Benchmarks are indeed required; they are missing now and I'll add them later (I may need some of your help). But in principle the overhead is very small, and when it works, an equality ("=") filter selects a single file and prunes all the others.


Q: IMHO, this feature should be automated, and users should not provide any columns to filter, since we can manage all of that ourselves.

  • I didn't quite get this; can you expand on it a little?

Thanks again for your patience.


Aaaaaaron commented on May 29, 2024

Hi @sadikovi
After reading the bucketing code, I found that what I want to do is the same as bucketing (we could use the code in FileSourceStrategy#getExpressionBuckets directly).

And in my opinion, Spark's bucketing has to be used with managed tables. If we keep track of the columns/partition number a dataset is repartitioned by, we can do this in a lighter way (yes, as you mentioned before). But I'm hesitating, because it seems to be exactly "bucketing", except for the way we keep the metadata (partition columns/partition number). Shall we support bucketing, or go our own way?
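The idea behind getExpressionBuckets can be sketched as follows (an illustration of bucket-set pruning with an invented helper, not Spark's actual implementation, which matches predicate expressions against the table's HashPartitioning):

```scala
// Given the candidate values from equality/IN predicates on the bucket
// column, compute the set of buckets that could contain matching rows.
def matchingBuckets(values: Seq[Any], numBuckets: Int): Set[Int] =
  values.map(v => math.floorMod(v.hashCode, numBuckets)).toSet

// e.g. WHERE id = 1 OR id = 2 needs to scan at most two of the buckets
```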


sadikovi commented on May 29, 2024

