
Comments (8)

alamb commented on July 3, 2024

Thanks @marvinlanhenke πŸ™

To write the relevant structures into Parquet, the statistics_enabled field needs to be set to Page

To read them back, the reader needs to be configured with with_page_index, I think
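As a concrete sketch of those two settings (assuming the Rust parquet crate's WriterProperties and ArrowReaderOptions APIs, and given some already-open file):

```rust
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};
use parquet::file::properties::{EnabledStatistics, WriterProperties};

// Writer side: enable page-level statistics so the ColumnIndex /
// OffsetIndex structures are written into the file
let props = WriterProperties::builder()
    .set_statistics_enabled(EnabledStatistics::Page)
    .build();

// Reader side: ask the reader to also decode the page index
let options = ArrowReaderOptions::new().with_page_index(true);
let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?;
```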

Also I have a proposed change to the Statistics code in #10802

If that gets merged, then the API for extracting the mins from data pages might look like

        // get relevant index statistics somehow
        let data_page_statistics: Vec<&Statistics> = todo!();
        // build a converter for the column of interest
        let converter = StatisticsConverter::try_new(
            column_name,
            reader.schema(),
            reader.parquet_schema(),
        ).unwrap();
        // get mins from the ColumnIndex
        let mins = converter.column_index_mins(data_page_statistics).unwrap();

from arrow-datafusion.

marvinlanhenke commented on July 3, 2024

The proposed API looks nice πŸ‘Œ Until the merge, I can use the time to explore and prototype. Thanks for the pointers


alamb commented on July 3, 2024

@marvinlanhenke -- I whipped up something (actually I had been playing with it yesterday) #10843


marvinlanhenke commented on July 3, 2024

@alamb
I was briefly looking at this, trying to understand what's needed here.

Do we already have a helper fn in place to write a Parquet file with page index statistics? While I was "prototyping" I tried to get the metadata.column_index() using the existing make_test_file_rg - but it seems page index stats are not written (None)?

I'll keep looking - but perhaps you have a quick pointer on where to look?


marvinlanhenke commented on July 3, 2024

@alamb
I'm currently thinking about how to integrate StatisticsConverter with the existing code prune_pages_in_one_row_group.

This is what I originally had in mind for the converter method:

    pub fn column_index_mins(&self, metadata: &ParquetMetaData) -> Result<ArrayRef> {
        let data_type = self.arrow_field.data_type();

        // no page index was written: return all-null arrays
        let Some(parquet_column_index) = metadata.column_index() else {
            return Ok(self.make_null_array(data_type, metadata.row_groups()));
        };

        // the column is not present in the parquet file: return all-null arrays
        let Some(parquet_index) = self.parquet_index else {
            return Ok(self.make_null_array(data_type, metadata.row_groups()));
        };

        // extract this column's Index from each row group's column index
        let row_group_page_indices = parquet_column_index
            .into_iter()
            .map(|x| x.get(parquet_index));
        min_page_statistics(Some(data_type), row_group_page_indices)
    }

So we would simply create an iterator over all row groups' column indices, match the index, and apply the stats function. This works - all tests are passing.

However, the API, or the integration with prune_pages_in_one_row_group feels kind of strange:

  • a lot of work the StatisticConverter does is already done here
  • we already iterate over each row group individually, extracting a single Option<&Index> here and passing it into prune_pages_in_one_row_group

Now, my API has to change. I'm wondering how specific it should be?
If we pass &Index as a parameter, I can match the index and extract the statistics as done here. However, I'm not sure this is the way to go. We'd simply move the get_min_max_values_for_page_index method and basically have no need for the StatisticsConverter?

Maybe I'm missing something, but I think it would help to outline the scope of the refactor you had in mind.


alamb commented on July 3, 2024

Thank you @marvinlanhenke -- excellent analysis.

  • a lot of work the StatisticConverter does is already done here

Yes. My eventual goal is for all of the code that converts Index to ArrayRef in page_filter.rs to be gone, with page_filter.rs only calling StatisticsConverter.

To avoid a massive PR, however, I think it makes sense to add new code to StatisticsConverter for extracting page values, and then, when it is complete enough, switch page_filter.rs to use StatisticsConverter.

  • we already iterate over each row group individually, extracting a single Option<&Index> here and passing it into prune_pages_in_one_row_group

Indeed, that is how it works today (one row group at a time). I eventually hope/plan to apply the same treatment to data page filtering as I did to row group filtering in #10802 (that is, make a single call to PruningPredicate::prune for all the remaining row groups).

Now, my API has to change. I'm wondering how specific it should be? If we pass &Index as a parameter, I can match the index and extract the statistics as done here. However, I'm not sure this is the way to go. We'd simply move the get_min_max_values_for_page_index method and basically have no need for the StatisticsConverter?

Let me play around with some options and get back to you.


marvinlanhenke commented on July 3, 2024

Thank you so much - I quickly skimmed the draft you uploaded (I will take a closer look tomorrow). My main question is answered - for now we are iterating over each row group one by one using a row group index.

I also agree about the scope for now.
However, now I can see the overall picture / direction somewhat more clearly - thanks for explaining that.

I'll try to incorporate your suggestions and upload a draft myself, so we have something more concrete to reason about.


alamb commented on July 3, 2024

Follow on work tracked in #10922

