Comments (8)
Thanks @marvinlanhenke
To write the relevant structures into Parquet, the statistics_enabled field needs to be set to Page.
To read them back, the reader needs to be configured with with_page_index, I think.
Also I have a proposed change to the Statistics code in #10802
If that gets merged, then the API for extracting the mins from data pages might look like
// get relevant index statistics somehow
let data_page_statistics: Vec<&Statistics> = todo!();
let converter = StatisticsConverter::try_new(
    column_name,
    reader.schema(),
    reader.parquet_schema(),
)
.unwrap();
// get mins from the ColumnIndex
let mins = converter.column_index_mins(data_page_statistics).unwrap();
from arrow-datafusion.
The proposed API looks nice. Until the merge I can use the time to explore and prototype. Thanks for the pointers.
@marvinlanhenke -- I whipped up something (actually I had been playing with it yesterday) #10843
@alamb
I was briefly looking at this, trying to understand what's needed here.
Do we already have a helper fn in place to write a parquet file with Page Index statistics? While I was "prototyping" I tried to get the metadata.column_index() by using the existing make_test_file_rg - but it seems page index stats are not written (None)?
I'll keep looking - but perhaps you have a quick pointer here, where to look?
@alamb
I'm currently thinking about how to integrate StatisticsConverter with the existing code in prune_pages_in_one_row_group.
This is what I originally had in mind for the converter method:
pub fn column_index_mins(&self, metadata: &ParquetMetaData) -> Result<ArrayRef> {
    let data_type = self.arrow_field.data_type();
    let Some(parquet_column_index) = metadata.column_index() else {
        return Ok(self.make_null_array(data_type, metadata.row_groups()));
    };
    let Some(parquet_index) = self.parquet_index else {
        return Ok(self.make_null_array(data_type, metadata.row_groups()));
    };
    let row_group_page_indices = parquet_column_index
        .into_iter()
        .map(|x| x.get(parquet_index));
    min_page_statistics(Some(data_type), row_group_page_indices)
}
So we would simply create an iterator over all row groups' column indices, match the index, and apply the stats function. Which works - all tests are passing.
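As a shape check, the indexing pattern above can be sketched with plain vectors standing in for the parquet crate's ParquetColumnIndex (outer Vec = row groups, inner Vec = columns); the i32 values here are placeholders for Index, not real statistics:

```rust
fn main() {
    // Stand-in for ParquetMetaData::column_index(): outer Vec = row groups,
    // inner Vec = columns; plain i32 values stand in for `Index`.
    let parquet_column_index: Vec<Vec<i32>> = vec![
        vec![10, 20], // row group 0, columns 0 and 1
        vec![11, 21], // row group 1, columns 0 and 1
    ];
    let parquet_index = 1; // the column the converter resolved

    // Same shape as the converter: one Option<&Index> per row group
    let per_row_group: Vec<Option<&i32>> = parquet_column_index
        .iter()
        .map(|rg| rg.get(parquet_index))
        .collect();
    assert_eq!(per_row_group, vec![Some(&20), Some(&21)]);
}
```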
However, the API, or the integration with prune_pages_in_one_row_group, feels kind of strange:
- a lot of the work the StatisticsConverter does is already done here
- we already iterate over each row group individually, extracting a single Option<&Index> here and passing it into prune_pages_in_one_row_group
Now, my API has to change. I'm wondering how specific it should be?
If we pass &Index as a parameter, I can match the index and extract the statistic as done here. However, I'm not sure this is the way to go. We'd simply move the get_min_max_values_for_page_index method, and basically have no need for the StatisticsConverter?
Maybe I'm missing something, but I think it would help to outline the scope of the refactor you had in mind.
Thank you @marvinlanhenke -- excellent analysis.
- a lot of the work the StatisticsConverter does is already done here
Yes. My eventual goal is for all of the code that converts Index to ArrayRef in page_filter.rs to be gone, so that page_filter.rs only calls StatisticsConverter.
To avoid a massive PR, however, I think it makes sense to add new code to StatisticsConverter for extracting page values, and then, when it is complete enough, switch page_filter.rs to use StatisticsConverter.
- we already iterate over each row group individually, extracting a single Option<&Index> here and passing it into prune_pages_in_one_row_group
Indeed that is how it works today (one row group at a time). I eventually hope/plan to apply the same treatment to data page filtering as I did to row group filtering in #10802 (that is, make a single call to PruningPredicate::prune for all the remaining row groups).
Now, my API has to change. I'm wondering how specific it should be? If we pass &Index as a parameter, I can match the index and extract the statistic as done here. However, I'm not sure this is the way to go. We'd simply move the get_min_max_values_for_page_index method, and basically have no need for the StatisticsConverter?
Let me play around with some options and get back to you.
Thank you so much - I quickly skimmed the draft you uploaded (will take a closer look tomorrow). My main question should be answered - for now we are iterating over each row group one by one using a row group index.
I also agree about the scope for now.
However, now I can see the overall picture / direction somewhat clearer, thanks for explaining that.
I'll try to incorporate your suggestions and upload a draft myself, so we have something more concrete to reason about.
Follow on work tracked in #10922