Comments (8)
Thanks @marvinlanhenke
To write the relevant structures into Parquet, the statistics_enabled field needs to be set to Page.
To read them back, the reader needs to be configured with with_page_index, I think.
Also I have a proposed change to the Statistics code in #10802
If that gets merged, then the API for extracting the mins from data pages might look like
// get relevant index statistics somehow
let data_page_statistics: Vec<&Statistics> = todo!();
let converter = StatisticsConverter::try_new(
    column_name,
    reader.schema(),
    reader.parquet_schema(),
)
.unwrap();
// get mins from the ColumnIndex
let mins = converter.column_index_mins(data_page_statistics).unwrap();
from arrow-datafusion.
The proposed API looks nice. Until the merge I can use the time to explore and prototype. Thanks for the pointers.
@marvinlanhenke -- I whipped up something (actually I had been playing with it yesterday) #10843
@alamb
I was briefly looking at this, trying to understand what's needed here.
Do we already have a helper fn in place to write a parquet file with Page Index statistics? While I was "prototyping" I tried to get the metadata.column_index() by using the existing make_test_file_rg - but it seems page index stats are not written (None)?
I'll keep looking - but perhaps you have a quick pointer here, where to look?
@alamb
I'm currently thinking about how to integrate StatisticsConverter with the existing code in prune_pages_in_one_row_group.
This is what I originally had in mind for the converter method:
pub fn column_index_mins(&self, metadata: &ParquetMetaData) -> Result<ArrayRef> {
    let data_type = self.arrow_field.data_type();
    let Some(parquet_column_index) = metadata.column_index() else {
        return Ok(self.make_null_array(data_type, metadata.row_groups()));
    };
    let Some(parquet_index) = self.parquet_index else {
        return Ok(self.make_null_array(data_type, metadata.row_groups()));
    };
    let row_group_page_indices = parquet_column_index
        .into_iter()
        .map(|x| x.get(parquet_index));
    min_page_statistics(Some(data_type), row_group_page_indices)
}
So we would simply create an iterator over all row groups' column indices, match the index, and apply the stats function. Which works - all tests are passing.
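As a shape check, the indexing pattern above can be sketched with plain vectors standing in for the parquet crate's ParquetColumnIndex (outer Vec = row groups, inner Vec = columns); the i32 values here are placeholders for Index, not real statistics:

```rust
fn main() {
    // Stand-in for ParquetMetaData::column_index(): outer Vec = row groups,
    // inner Vec = columns; plain i32 values stand in for `Index`.
    let parquet_column_index: Vec<Vec<i32>> = vec![
        vec![10, 20], // row group 0, columns 0 and 1
        vec![11, 21], // row group 1, columns 0 and 1
    ];
    let parquet_index = 1; // the column the converter resolved

    // Same shape as the converter: one Option<&Index> per row group
    let per_row_group: Vec<Option<&i32>> = parquet_column_index
        .iter()
        .map(|rg| rg.get(parquet_index))
        .collect();
    assert_eq!(per_row_group, vec![Some(&20), Some(&21)]);
}
```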
However, the API, or the integration with prune_pages_in_one_row_group, feels kind of strange:
- a lot of the work the StatisticsConverter does is already done here
- we already iterate over each row group individually, extracting a single Option<&Index> here and passing it into prune_pages_in_one_row_group
Now, my API has to change. I'm wondering how specific it should be?
If we pass &Index as a parameter, I can match the index and extract the statistic as done here. However, I'm not sure this is the way to go. We'd simply move the get_min_max_values_for_page_index method, and basically have no need for the StatisticsConverter?
Maybe I'm missing something, but I think it would help to outline the scope of the refactor you had in mind.
Thank you @marvinlanhenke -- excellent analysis.
- a lot of the work the StatisticsConverter does is already done here
Yes. My eventual goal is for all of the code that converts Index to ArrayRef in page_filter.rs to be gone, so that page_filter.rs only calls StatisticsConverter.
To avoid a massive PR, however, I think it makes sense to add new code to StatisticsConverter for extracting page values, and then, when it is complete enough, switch page_filter.rs to use StatisticsConverter.
- we already iterate over each row group individually, extracting a single Option<&Index> here and passing it into prune_pages_in_one_row_group
Indeed that is how it works today (one row group at a time). I eventually hope/plan to apply the same treatment to data page filtering as I did to row group filtering in #10802 (that is, make a single call to PruningPredicate::prune for all the remaining row groups).
Now, my API has to change. I'm wondering how specific it should be? If we pass &Index as a parameter, I can match the index and extract the statistic as done here. However, I'm not sure this is the way to go. We'd simply move the get_min_max_values_for_page_index method, and basically have no need for the StatisticsConverter?
Let me play around with some options and get back to you.
Thank you so much - I quickly skimmed the draft you uploaded (will take a closer look tomorrow). My main question should be answered - for now we are iterating over each row group one by one using a row group index.
I also agree about the scope for now.
However, now I can see the overall picture / direction somewhat clearer, thanks for explaining that.
I'll try to incorporate your suggestions and upload a draft myself, so we have something more concrete to reason about.
Follow on work tracked in #10922