Comments (10)
I started working on the first PR.
from arrow-datafusion.
@alamb just a hint: #10605 is also closed.
FYI I have a proposed API change in #10806
In terms of sequencing this feature, here is what I would recommend:
First PR
Purpose: sketch out the API and the test framework
- Create a test framework for this
- Create the basic API and extract min/max values for Int64 columns
Second PR (draft)
Purpose: demonstrate the API can be used in DataFusion, and ensure test coverage is adequate
- Update one of the uses of parquet statistics (like ListingTable) to use the new API. I (@alamb) would like to do this if I have time.
Third + Fourth + ... PRs
- Add support for the remaining data types, along with tests
- This part can be parallelized across multiple PRs
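To make the first PR's scope concrete, here is a minimal std-only sketch of the core extraction idea for Int64 columns. `MockRowGroup` and `extract_min` are hypothetical stand-ins for illustration; the real PR would read statistics out of the parquet crate's `ParquetMetaData`:

```rust
// Hypothetical mock standing in for parquet row-group metadata.
struct MockRowGroup {
    /// (min, max) for one Int64 column, or None when the row group
    /// carries no statistics for that column.
    int64_stats: Option<(i64, i64)>,
}

/// Extract per-row-group min values, one entry per row group in file
/// order; None marks "statistics unavailable" for that row group.
fn extract_min(row_groups: &[MockRowGroup]) -> Vec<Option<i64>> {
    row_groups
        .iter()
        .map(|rg| rg.int64_stats.map(|(min, _max)| min))
        .collect()
}

fn main() {
    let row_groups = vec![
        MockRowGroup { int64_stats: Some((1, 10)) },
        MockRowGroup { int64_stats: None },
        MockRowGroup { int64_stats: Some((-5, 3)) },
    ];
    // One entry per row group, in order.
    println!("{:?}", extract_min(&row_groups)); // [Some(1), None, Some(-5)]
}
```

A test framework for this would then compare such extracted arrays against statistics computed directly from the data.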
After working through an actual example in #10549 I have a new API proposal: NGA-TRAN#118
Here is what the API looks like:
/// What type of statistics should be extracted?
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub enum RequestedStatistics {
    /// Minimum Value
    Min,
    /// Maximum Value
    Max,
    /// Null Count, returned as a [`UInt64Array`]
    NullCount,
}
/// Extracts Parquet statistics as Arrow arrays
///
/// This is used to convert Parquet statistics to Arrow arrays, with proper type
/// conversions. This information can be used for pruning parquet files or row
/// groups based on the statistics embedded in parquet files
///
/// # Schemas
///
/// The schema of the parquet file and the arrow schema are used to convert the
/// underlying statistics value (stored as a parquet value) into the
/// corresponding Arrow value. For example, Decimals are stored as binary in
/// parquet files.
///
/// The parquet_schema and arrow_schema do not have to be identical (for
/// example, the columns may be in different orders and one or the other schemas
/// may have additional columns). The function [`parquet_column`] is used to
/// match the column in the parquet file to the column in the arrow schema.
///
/// # Multiple parquet files
///
/// This API is designed to support efficiently extracting statistics from
/// multiple parquet files (hence why the parquet schema is passed in as an
/// argument). This is useful when building an index for a directory of parquet
/// files.
///
#[derive(Debug)]
pub struct StatisticsConverter<'a> {
    /// The name of the column to extract statistics for
    column_name: &'a str,
    /// The type of statistics to extract
    statistics_type: RequestedStatistics,
    /// The arrow schema of the query
    arrow_schema: &'a Schema,
    /// The field (with data type) of the column in the arrow schema
    arrow_field: &'a Field,
}
impl<'a> StatisticsConverter<'a> {
    /// Returns a [`UInt64Array`] with counts for each row group
    ///
    /// The returned array has no nulls, and has one value for each row group.
    /// Each value is the number of rows in the row group.
    pub fn row_counts(metadata: &ParquetMetaData) -> Result<UInt64Array> {
        ...
    }

    /// Create a new statistics converter
    pub fn try_new(
        column_name: &'a str,
        statistics_type: RequestedStatistics,
        arrow_schema: &'a Schema,
    ) -> Result<Self> {
        ...
    }

    /// Extract the statistics from a parquet file, given the parquet file's metadata
    ///
    /// The returned array contains 1 value for each row group in the parquet
    /// file, in order
    ///
    /// Each value is either
    /// * the requested statistics type for the column
    /// * a null value, if the statistics can not be extracted
    ///
    /// Note that a null value does NOT mean the min or max value was actually
    /// `null`; it means the requested statistic is unknown
    ///
    /// Reasons for not being able to extract the statistics include:
    /// * the column is not present in the parquet file
    /// * statistics for the column are not present in the row group
    /// * the stored statistic value can not be converted to the requested type
    pub fn extract(&self, metadata: &ParquetMetaData) -> Result<ArrayRef> {
        ...
    }
}
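To make the intended `extract` semantics concrete, here is a std-only sketch; every type here is an illustrative mock, not the proposed API itself. It shows the key point that a `None`/null result means "statistic unknown", not that the data contained nulls:

```rust
// Illustrative mock of the requested-statistics dispatch.
#[derive(Clone, Copy)]
enum RequestedStatistics { Min, Max, NullCount }

/// Mock per-row-group statistics (the real converter reads these from
/// `ParquetMetaData` and returns an Arrow `ArrayRef`).
struct MockStats { min: i64, max: i64, null_count: u64 }

/// One value per row group, in order. None means the statistic could not
/// be extracted (e.g. the row group carries no statistics), NOT that the
/// min/max value was itself null.
fn extract(stat: RequestedStatistics, row_groups: &[Option<MockStats>]) -> Vec<Option<i64>> {
    row_groups
        .iter()
        .map(|rg| {
            rg.as_ref().map(|s| match stat {
                RequestedStatistics::Min => s.min,
                RequestedStatistics::Max => s.max,
                RequestedStatistics::NullCount => s.null_count as i64,
            })
        })
        .collect()
}

fn main() {
    let rgs = vec![
        Some(MockStats { min: 1, max: 9, null_count: 0 }),
        None, // row group without statistics: extract yields None (unknown)
    ];
    assert_eq!(extract(RequestedStatistics::Min, &rgs), vec![Some(1), None]);
    assert_eq!(extract(RequestedStatistics::NullCount, &rgs), vec![Some(0), None]);
}
```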
I am envisioning this API could also easily support:

Extracting from multiple files in one go:
impl<'a> StatisticsConverter<'a> {
    ..

    /// Extract metadata from multiple parquet files into a single arrow array,
    /// one element per row group per file
    fn extract_multi(&self, metadata: impl IntoIterator<Item = &'a ParquetMetaData>) -> Result<ArrayRef> {
        ...
    }
}
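The multi-file shape can be sketched with plain slices standing in for each file's per-row-group statistics; `extract_multi` below is a hypothetical std-only mock of the concatenation behavior, not the proposed method:

```rust
/// Flatten per-row-group values from many files into one array,
/// one element per row group per file, preserving file order.
fn extract_multi<'a>(files: impl IntoIterator<Item = &'a [Option<i64>]>) -> Vec<Option<i64>> {
    files
        .into_iter()
        .flat_map(|file_stats| file_stats.iter().copied())
        .collect()
}

fn main() {
    let file_a: &[Option<i64>] = &[Some(1), None]; // 2 row groups
    let file_b: &[Option<i64>] = &[Some(7)];       // 1 row group
    // 3 elements total: file_a's row groups, then file_b's, in order.
    assert_eq!(extract_multi([file_a, file_b]), vec![Some(1), None, Some(7)]);
}
```

This is why the parquet schema is an argument: the same converter can be reused across every file in a directory when building an index.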
Extracting information from the page index as well:
impl<'a> StatisticsConverter<'a> {
    ..

    /// Extract metadata from page indexes across all row groups. The returned
    /// array has one element per page across all row groups
    fn extract_page(&self, metadata: impl IntoIterator<Item = &'a ParquetMetaData>) -> Result<ArrayRef> {
        ...
    }
}
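The page-index shape differs from the row-group shape only in granularity: one element per page across all row groups. Here is a std-only sketch of that flattening; `RowGroupPages` and `extract_page` are hypothetical mocks (a real implementation would read the parquet ColumnIndex structures):

```rust
/// Mock of the page-level statistics stored in one row group's page index.
struct RowGroupPages {
    /// Per-page min values; None marks a page without usable statistics.
    page_mins: Vec<Option<i64>>,
}

/// One element per page, across all row groups, in order.
fn extract_page(row_groups: &[RowGroupPages]) -> Vec<Option<i64>> {
    row_groups
        .iter()
        .flat_map(|rg| rg.page_mins.iter().copied())
        .collect()
}

fn main() {
    let rgs = vec![
        RowGroupPages { page_mins: vec![Some(1), Some(4)] }, // 2 pages
        RowGroupPages { page_mins: vec![None] },             // 1 page, no stats
    ];
    assert_eq!(extract_page(&rgs), vec![Some(1), Some(4), None]);
}
```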
@alamb I have created 2 more bug tickets but I cannot edit the description to add them in the subtasks. Can you help with that?
Done
Given how far we have come with this ticket, I plan to close it and organize the remaining tasks as follow-on tickets / epics.
This issue is done enough -- I am consolidating the remaining todo items under #10922