Comments (4)
Thanks for sharing and helping resolve this issue so quickly 😁
from arrow-datafusion.
Thank you for the report and the reproducer ❤️
read row groups in order they were written
This is not my expectation.
DataFusion reads row groups in parallel, potentially out of order, with multiple threads as an optimization. To preserve the order of the data you can either set the configuration datafusion.optimizer.repartition_file_scans
to false
or else communicate the order of the data in the files using the CREATE EXTERNAL TABLE .. WITH ORDER
clause and then explicitly ask for that order in your query.
read the same values for the same row group even when the file increases in size
read the same values as the python pyarrow parquet reader
Yes I agree these are also my expectation
Maybe you can try setting datafusion.optimizer.repartition_file_scans
to false
and see if that makes the data consistent
from arrow-datafusion.
Setting datafusion.optimizer.repartition_file_scans
to false
like this fixes things. ✔️
let session_cfg =
SessionConfig::new().set_str("datafusion.optimizer.repartition_file_scans", "false");
let session_ctx = SessionContext::new_with_config(session_cfg);
However, it's unclear how it interacts with other options and affects memory and performance. So here's what I have -
It is a given that the data will be sorted based on timestamp like this
let parquet_options = ParquetReadOptions::<'_> {
skip_metadata: Some(false),
file_sort_order: vec![vec![Expr::Sort(Sort {
expr: Box::new(col("ts_init")),
asc: true,
nulls_first: true,
})]],
..Default::default()
};
Then there are two approaches to get row groups/data in order -
-
Using an order by clause in the query
session_ctx.sql("SELECT * FROM data ORDER BY ts_init")
. From our previous discussion, doing an order by on an already sorted column does not incur an additional overhead. -
Setting
datafusion.optimizer.repartition_file_scans
tofalse
ensures that the data is read in sequential order of row groups.
It's not clear to me how each option affects the performance and memory usage. Do you have any guidance around it?
from arrow-datafusion.
Setting datafusion.optimizer.repartition_file_scans to false like this fixes things. ✔️
That is great news
Using an order by clause in the query session_ctx.sql("SELECT * FROM data ORDER BY ts_init"). From our #7675 (comment), doing an order by on an already sorted column does not incur an additional overhead.
This is the approach I would recommend (we do it in InfluxDB 3.0).
You can verify that there are no additional sorts, etc in the plan by using EXPLAIN
to look at the query plan
If you use this approach, you should also be able to avoid setting datafusion.optimizer.repartition_file_scans
to false
as the optimizer will take care of it automatically
from arrow-datafusion.
Related Issues (20)
- DataFusion ignores "column order" parquet statistics specification
- DataFusion reads Date32 and Date64 parquet statistics in as Int32Array HOT 2
- Pass per-field BigQuery `OPTIONS` values to the LogicalPlan's Arrow Schema
- Expand Test Coverage for ScalarUDF's
- Make the configuration for `StreamTable` more generic to support more stream sources
- Support `date_bin` on timestamps with timezone, properly accounting for Daylight Savings Time HOT 26
- Incorrect statistics read for unsigned integer columns in parquet HOT 1
- Incorrect statistics read for binary columns in parquet
- Implement a benchmark for extracting arrow statistics from parquet HOT 1
- Incorrect statistics read for struct array in parquet HOT 3
- PlaceholderRowExec shown when select from union results. HOT 2
- Make it easier to register object stores HOT 2
- MySQL doesn't support the `NULLS FIRST/LAST` clause in `ORDER BY` statements
- Improve performance of extracting statistics from parquet files HOT 1
- Examples of using `TreeNode` APIs to walk and manipulate LogicalPlans HOT 6
- The `limit` info lost in the AggregateExec when ser/deser the physical plan HOT 9
- Make TaskContext wrap SessionState HOT 3
- Make SQL strings generated from Exprs even "prettier" HOT 2
- Consolidate tests for unparser / plan to sql to make them easier to find
- Implement protobuf serialization for LogicalPlan::Unnest HOT 4
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from arrow-datafusion.