Comments (6)
The issue is with the way DataFusion breaks the file into pieces.
let session_config = SessionConfig::new().with_repartition_file_scans(false);
let ctx = SessionContext::new_with_config(session_config);
I've found that if repartitioning is disabled, it works flawlessly.
So I suspect something goes wrong here in the ZStd case.
After splitting the file into 10 slices, it decodes some of them but fails on the others.
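To illustrate why slicing might hurt here, below is a minimal std-only sketch. It is not DataFusion's actual splitting code: `split_into_ranges` and `starts_with_zstd_magic` are hypothetical helpers, and the even byte-range split is an assumption about how the 10 slices are formed. The only hard fact used is the Zstandard frame magic number (0xFD2FB528, stored little-endian). The point is that a slice cut at an arbitrary byte offset of a compressed stream generally does not begin at a frame header, so a decoder pointed at it has nothing valid to start from.

```rust
/// Zstandard frame magic number 0xFD2FB528 as it appears on disk (little-endian).
const ZSTD_MAGIC: [u8; 4] = [0x28, 0xB5, 0x2F, 0xFD];

/// Hypothetical helper (not a DataFusion API): split `file_len` bytes into at
/// most `n` contiguous half-open byte ranges `(start, end)`.
fn split_into_ranges(file_len: u64, n: u64) -> Vec<(u64, u64)> {
    if n == 0 || file_len == 0 {
        return Vec::new();
    }
    // Ceiling division so that the ranges cover the whole file.
    let chunk = (file_len + n - 1) / n;
    let mut ranges = Vec::new();
    for i in 0..n {
        let start = i * chunk;
        let end = ((i + 1) * chunk).min(file_len);
        // Skip empty trailing ranges when the file is shorter than n chunks.
        if start < end {
            ranges.push((start, end));
        }
    }
    ranges
}

/// Returns true if `bytes` begins with a Zstandard frame header magic number.
fn starts_with_zstd_magic(bytes: &[u8]) -> bool {
    bytes.starts_with(&ZSTD_MAGIC)
}
```

Calling `split_into_ranges(file_len, 10)` on a single-frame `.zst` file and checking each slice with `starts_with_zstd_magic` would be expected to report a header only for the first slice, which fits the observation that reading works once repartitioning is turned off.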
from arrow-datafusion.
Nice find!
It seems like the current code disables repartitioning for gzip:
datafusion/datafusion/core/src/datasource/physical_plan/json.rs
Lines 157 to 159 in 9f0e016
Maybe we have to do something similar for zstd and other compression types.
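A rough sketch of what "something similar" could look like, written with std only. This is not the code at the lines referenced above: the `Compression` enum is a hypothetical stand-in for DataFusion's `FileCompressionType`, and `can_repartition` is an invented name. The idea is simply to allow byte-range repartitioning only for uncompressed files instead of special-casing gzip.

```rust
/// Hypothetical stand-in for DataFusion's `FileCompressionType`
/// (the variant names here are assumptions for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Compression {
    Uncompressed,
    Gzip,
    Bzip2,
    Xz,
    Zstd,
}

/// Hypothetical guard: a file scan may be split into byte ranges only when the
/// file is not compressed, because a compressed stream generally cannot be
/// decoded starting from an arbitrary byte offset.
fn can_repartition(compression: Compression) -> bool {
    matches!(compression, Compression::Uncompressed)
}
```

With a guard like this, a zstd file would fall back to a single partition, which should behave like `with_repartition_file_scans(false)` but only for the files that need it.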
from arrow-datafusion.
Thanks for the report -- can you possibly share an example of such a file (or instructions for how to create one)?
from arrow-datafusion.
Here is an example file: data.zst.json
And here is the code, which shows that the file can be decoded perfectly with async_compression,
which is used in DataFusion. However, the same file cannot be read as a DataFrame.
use arrow::datatypes::{Field, Schema};
use datafusion::common::arrow::datatypes::{DataType, TimeUnit};
use datafusion::datasource::file_format::options::NdJsonReadOptions;
use datafusion::datasource::file_format::file_compression_type::FileCompressionType;
use datafusion::prelude::*;
use std::io::Error;
use datafusion::error::Result;
use async_compression::tokio::bufread::ZstdDecoder;
use tokio::io::AsyncReadExt;

const FILE_PATH: &str = "data.zst";

#[tokio::main]
async fn main() -> Result<(), Error> {
    // Read the file with tokio and decode it directly with ZstdDecoder
    let file = tokio::fs::File::open(FILE_PATH).await?;
    let mut reader = ZstdDecoder::new(tokio::io::BufReader::new(file));
    let mut buf = vec![];
    reader.read_to_end(&mut buf).await?;
    println!("Read {} bytes", buf.len());

    let schema = Schema::new(vec![
        Field::new("OriginalRequest", DataType::Utf8, false),
        Field::new(
            "RequestStarted",
            DataType::Timestamp(TimeUnit::Millisecond, None),
            false,
        ),
    ]);

    // Create context
    let ctx = SessionContext::new();

    // Read the same file as newline-delimited JSON through DataFusion
    let json_options = NdJsonReadOptions::default()
        .file_extension("zst")
        .file_compression_type(FileCompressionType::ZSTD)
        .schema(&schema);
    let df = ctx.read_json(FILE_PATH, json_options).await?;

    println!("🤨 Hello, ZStd issue!");
    df.show_limit(10).await?;
    Ok(())
}
from arrow-datafusion.
Thank you @Smotrov
from arrow-datafusion.
Given that we now have a good reproducer for this issue, I think it is ready for someone to take a look if they have time.
from arrow-datafusion.