
Comments (10)

NGA-TRAN avatar NGA-TRAN commented on July 24, 2024 1

I started working on the first PR.

from arrow-datafusion.

NGA-TRAN avatar NGA-TRAN commented on July 24, 2024 1

@alamb Another bug: #10609


xinlifoobar avatar xinlifoobar commented on July 24, 2024 1

@alamb just a hint: #10605 is also closed.


alamb avatar alamb commented on July 24, 2024 1

FYI I have a proposed API change in #10806


alamb avatar alamb commented on July 24, 2024

In terms of sequencing this feature, here is what I would recommend:

First PR

Purpose: Sketch out the API, and test framework

  1. Create a test framework for this
  2. Create the basic API and extract min/max values for Int64 columns

Second PR (draft)

Purpose: demonstrate the API can be used in DataFusion, and ensure test coverage is adequate

  1. Update one of the uses of parquet statistics (like ListingTable) to use the new API. @alamb would like to do this if I have time

Third+Fourth+... PRs

  1. Add support for the remaining datatypes, along with tests

This part can be parallelized into multiple PRs


alamb avatar alamb commented on July 24, 2024

After working through an actual example in #10549 I have a new API proposal: NGA-TRAN#118

Here is what the API looks like

/// What type of statistics should be extracted?
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub enum RequestedStatistics {
    /// Minimum Value
    Min,
    /// Maximum Value
    Max,
    /// Null Count, returned as a [`UInt64Array`]
    NullCount,
}

/// Extracts Parquet statistics as Arrow arrays
///
/// This is used to convert Parquet statistics to Arrow arrays, with proper type
/// conversions. This information can be used for pruning parquet files or row
/// groups based on the statistics embedded in parquet files
///
/// # Schemas
///
/// The schema of the parquet file and the arrow schema are used to convert the
/// underlying statistics value (stored as a parquet value) into the
/// corresponding Arrow value. For example, Decimals are stored as binary in
/// parquet files.
///
/// The parquet_schema and arrow_schema do not have to be identical (for
/// example, the columns may be in different orders and one or the other schemas
/// may have additional columns). The function [`parquet_column`] is used to
/// match the column in the parquet file to the column in the arrow schema.
///
/// # Multiple parquet files
///
/// This API is designed to support efficiently extracting statistics from
/// multiple parquet files (hence why the parquet schema is passed in as an
/// argument). This is useful when building an index for a directory of parquet
/// files.
///
#[derive(Debug)]
pub struct StatisticsConverter<'a> {
    /// The name of the column to extract statistics for
    column_name: &'a str,
    /// The type of statistics to extract
    statistics_type: RequestedStatistics,
    /// The arrow schema of the query
    arrow_schema: &'a Schema,
    /// The field (with data type) of the column in the arrow schema
    arrow_field: &'a Field,
}

impl<'a> StatisticsConverter<'a> {
    /// Returns a [`UInt64Array`] with counts for each row group
    ///
    /// The returned array has no nulls, and has one value for each row group.
    /// Each value is the number of rows in the row group.
    pub fn row_counts(metadata: &ParquetMetaData) -> Result<UInt64Array> {
...
    }

    /// Create a new statistics converter
    pub fn try_new(
        column_name: &'a str,
        statistics_type: RequestedStatistics,
        arrow_schema: &'a Schema,
    ) -> Result<Self> {
...
    }

    /// Extract the statistics from a parquet file, given the parquet file's metadata
    ///
    /// The returned array contains 1 value for each row group in the parquet
    /// file in order
    ///
    /// Each value is either
    /// * the requested statistics type for the column
    /// * a null value, if the statistics can not be extracted
    ///
    /// Note that a null value does NOT mean the min or max value was actually
    /// `null`; it means the requested statistic is unknown
    ///
    /// Reasons for not being able to extract the statistics include:
    /// * the column is not present in the parquet file
    /// * statistics for the column are not present in the row group
    /// * the stored statistic value can not be converted to the requested type
    pub fn extract(&self, metadata: &ParquetMetaData) -> Result<ArrayRef> {
...
    }
}
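As a toy illustration of the `extract` contract above (one output element per row group, in order, where a null means "statistic unknown" rather than a null data value), here is a self-contained sketch. `RowGroupStats` and `extract_mins` are hypothetical stand-ins for illustration, not types from the parquet crate:

```rust
/// Hypothetical stand-in for per-row-group metadata: each row group may
/// or may not carry a min value for the column of interest.
struct RowGroupStats {
    min: Option<i64>,
}

/// Mirrors the shape of `extract`'s output: one element per row group,
/// in order, with `None` meaning "statistic unknown" (NOT a null value
/// stored in the data itself).
fn extract_mins(row_groups: &[RowGroupStats]) -> Vec<Option<i64>> {
    row_groups.iter().map(|rg| rg.min).collect()
}

fn main() {
    let groups = [
        RowGroupStats { min: Some(1) },
        RowGroupStats { min: None }, // statistics missing for this row group
        RowGroupStats { min: Some(-5) },
    ];
    assert_eq!(extract_mins(&groups), vec![Some(1), None, Some(-5)]);
}
```

In the real API the result would be a typed Arrow array (e.g. `Int64Array`) rather than a `Vec`, which is why the converter needs the arrow schema to pick the output type.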

I am envisioning this API could also easily support

Extract from multiple files in one go

impl<'a> StatisticsConverter<'a> {
    ...
    /// Extract metadata from multiple parquet files into a single arrow array,
    /// one element per row group per file
    fn extract_multi(&self, metadata: impl IntoIterator<Item = &'a ParquetMetaData>) -> Result<ArrayRef> {
    ...
    }
}

Extract information from the page index as well

impl<'a> StatisticsConverter<'a> {
    ...
    /// Extract metadata from the page indexes across all row groups. The
    /// returned array has one element per page across all row groups
    fn extract_page(&self, metadata: impl IntoIterator<Item = &'a ParquetMetaData>) -> Result<ArrayRef> {
    ...
    }
}
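The "one element per row group per file" flattening that `extract_multi` describes can be sketched with plain vectors (the per-file `Option<i64>` minimums here are hypothetical stand-ins for the real metadata and Arrow arrays):

```rust
/// Hypothetical stand-in: `per_file` holds, for each file, the per-row-group
/// minimums (`None` = statistic unknown for that row group). The output has
/// one element per row group per file, in file order then row-group order,
/// matching the contract described for `extract_multi`.
fn flatten_file_mins(per_file: &[Vec<Option<i64>>]) -> Vec<Option<i64>> {
    per_file.iter().flatten().copied().collect()
}

fn main() {
    let file_a = vec![Some(1), None]; // two row groups, one missing stats
    let file_b = vec![Some(5)];       // one row group
    let all = flatten_file_mins(&[file_a, file_b]);
    assert_eq!(all, vec![Some(1), None, Some(5)]);
}
```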


NGA-TRAN avatar NGA-TRAN commented on July 24, 2024

@alamb I have created 2 more bug tickets but I cannot edit the description to add them in the subtasks. Can you help with that?

  1. #10604
  2. #10605


alamb avatar alamb commented on July 24, 2024

@alamb I have created 2 more bug tickets but I cannot edit the description to add them in the subtasks. Can you help with that?

Done


alamb avatar alamb commented on July 24, 2024

Given how far we have come with this ticket, I plan to close it and do some organizing of the remaining tasks as follow-on tickets / epics


alamb avatar alamb commented on July 24, 2024

This issue is done enough -- I am consolidating the remaining todo items under #10922

