Comments (12)
Thanks for the pointer @ion-elgreco, the following config worked:
SparkSession.config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
from delta-rs.
This actually had breaking effects since the prior behavior was incorrect. I suggest you rewrite your tables so that the parquet timestamp is properly encoded.
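A minimal sketch of such a rewrite, assuming a Spark environment with the Delta connector (the table path is a hypothetical placeholder):

```python
# Sketch: re-encode a table's timestamps, assuming Spark with the Delta
# connector. `table_path` is a hypothetical placeholder.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Write timestamps as TIMESTAMP_MICROS instead of the legacy INT96
    .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
    .getOrCreate()
)

table_path = "s3://my-bucket/my-table"  # hypothetical

# Reading back and overwriting rewrites every file with the new encoding
df = spark.read.format("delta").load(table_path)
df.write.format("delta").mode("overwrite").save(table_path)
```

This is a configuration-dependent sketch, not a tested recipe; the key piece is setting spark.sql.parquet.outputTimestampType before any data is written.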
It's also strange that the parquet encoded type takes precedence over the arrow schema that's coming from the delta schema. This doesn't seem right.
@qinix I assume you have this issue only on tables written with older versions and then read with the latest main?
Yes, it is.
I have a similar issue when trying to delete records via a UTC timestamp field. I was not able to create a UTC timestamp for the right side of the predicate:
pa = duckdb.sql("select now()::timestamptz at time zone 'CET' at time zone 'UTC' as ts").to_arrow_table()
write_deltalake(target_table, pa, engine="rust", storage_options=storage_options, mode="overwrite")
dt = DeltaTable(target_table, storage_options=storage_options)
dt.delete("ts >= to_timestamp_micros('2024-03-27 00:00:00Z','%Y-%m-%d %H:%M:%S%#z')")
fails with:
ValueError: Invalid comparison operation: Timestamp(Microsecond, Some("UTC")) >= Timestamp(Microsecond, None)
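The error reflects the general rule that timezone-aware and naive timestamps are not comparable. Python's stdlib enforces the same rule, which makes the shape of the failure easy to reproduce; the fix on the predicate side is to make the right-hand literal UTC-aware rather than naive:

```python
from datetime import datetime, timezone

aware = datetime(2024, 3, 27, tzinfo=timezone.utc)  # like Timestamp(Microsecond, Some("UTC"))
naive = datetime(2024, 3, 27)                       # like Timestamp(Microsecond, None)

try:
    aware >= naive
except TypeError as err:
    # can't compare offset-naive and offset-aware datetimes
    print("comparison failed:", err)

# Making both sides timezone-aware restores the comparison
assert aware >= naive.replace(tzinfo=timezone.utc)
```

The delete predicate above hits the same mismatch: the column is Timestamp(Microsecond, Some("UTC")) while the parsed literal is naive.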
I am facing a similar issue when trying to write to a delta table previously written to by a Glue PySpark job. I chose to use Glue for a one-time full load of the source data and then use Lambda with the deltalake python package to load ongoing streaming data. The Glue job loads the data and then Lambda fails with the following error:
"Schema error: Fail to merge schema because the from data_type = Timestamp(Microsecond, Some(\"UTC\")) does not equal Timestamp(Nanosecond, None)"
I have tried various options like:
- setting SparkSession.config("spark.sql.timestampType", "TIMESTAMP_LTZ")
- using to_utc_timestamp to convert the incoming value to UTC
- reading the schema of the delta table loaded by Lambda first and then passing it to spark.write.option("schema", myschema)
but I am unable to make these two schemas agree on a common timestamp format!
@ravid08 I am quite sure you wrote with spark without changing the default spark parquet timestamp type from int96 to timestamp_micros.
Try restoring the table to the version prior to the full load, then do the load with spark again with timestamp_micros as the default parquet timestamp type.
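A sketch of that recovery path, again assuming Spark with the Delta connector; the table path, version number, and source location are hypothetical placeholders:

```python
# Sketch: restore the table to its pre-load version, then redo the full
# load with TIMESTAMP_MICROS encoding. Path, version, and source are
# hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
    .getOrCreate()
)

# Roll back to the version before the Glue full load
spark.sql("RESTORE TABLE delta.`s3://my-bucket/my-table` TO VERSION AS OF 0")

# Re-run the full load; new files now carry TIMESTAMP_MICROS timestamps
source_df = spark.read.parquet("s3://my-bucket/source-data")
source_df.write.format("delta").mode("append").save("s3://my-bucket/my-table")
```

This is an environment-dependent sketch, not a tested recipe; the essential point is that the conf must be set before the load runs, since it only affects newly written files.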
@ion-elgreco can you please clarify: I am using Glue 4.0, which supports Spark 3.3. From what I see, pyspark.sql.functions.timestamp_micros is a new feature in Spark 3.5.
Also, I am not using a delta table; all I am doing is using write_deltalake from writer.py, because the table will be created in an external hive metastore.
@ravid08 you mentioned you did a full load with spark, so I assume it wrote some parquet files. Those parquet files, written without the timestamp_micros setting, will have timestamps in INT96. At the moment the parquet crate interprets this as timestamp nanoseconds.
@ion-elgreco
The original value from the source is the string 2024-08-06T16:34:16.000574Z.
I did a full load of that data using Spark to S3 in Delta format:
df = df.withColumn(col, fn.to_timestamp(fn.col(col)))
I created a table in hive metastore on these delta files and I could see the datatype is timestamp, not int.
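A quick stdlib check confirms that the source string parses to a UTC, microsecond-precision value, i.e. it fits TIMESTAMP_MICROS without loss; the encoding problem is entirely on the write side, not in the value itself:

```python
from datetime import datetime, timedelta

raw = "2024-08-06T16:34:16.000574Z"

# %z accepts a literal "Z" suffix (Python 3.7+)
ts = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%f%z")

print(ts.microsecond)                 # 574 -> microsecond precision survives
print(ts.utcoffset() == timedelta(0)) # True -> explicit UTC offset
```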
@ravid08 I mean that the timestamp type in the parquet file is represented as INT96 when you don't set this spark conf setting.
@ion-elgreco OK, sounds like Spark 3.5 is required, which doesn't exist in Glue (yet).