
Comments (12)

ravid08 commented on June 12, 2024

Thanks for the pointer @ion-elgreco, the following config worked:
SparkSession.builder.config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
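
For context, a minimal sketch of how that setting can be applied end to end (the paths are placeholders, and the Delta Lake Spark connector is assumed to be configured):

from pyspark.sql import SparkSession

# Sketch only: paths are placeholders and the Delta Lake connector
# is assumed to be on the Spark classpath.
spark = (
    SparkSession.builder
    .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
    .getOrCreate()
)
df = spark.read.json("s3://source-bucket/events/")
# With the setting above, timestamp columns are written as
# TIMESTAMP(MICROS) instead of Spark's legacy INT96 encoding.
df.write.format("delta").mode("overwrite").save("s3://bucket/table/")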


ion-elgreco commented on June 12, 2024

This actually had breaking effects since the prior behavior was incorrect. I suggest you rewrite your tables so that the parquet timestamps are properly encoded.

It's also strange that the parquet-encoded type takes precedence over the arrow schema that comes from the delta schema. This doesn't seem right.
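
If it helps, a rough sketch of such a rewrite with the Python deltalake package (this reads the whole table into memory and overwrites it in place, so it only suits small tables; the table URI is a placeholder):

from deltalake import DeltaTable, write_deltalake

table_uri = "s3://bucket/table/"  # placeholder
dt = DeltaTable(table_uri)
# Reading decodes the old parquet files into arrow with proper types...
data = dt.to_pyarrow_table()
# ...and overwriting re-encodes the timestamps with the current writer.
write_deltalake(table_uri, data, mode="overwrite")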


ion-elgreco commented on June 12, 2024

@qinix I assume you have this issue only on tables written with older versions and then read with the latest main?


qinix commented on June 12, 2024

@qinix I assume you have this issue only on tables written with older versions and then read with the latest main?

Yes, it is


cmettler commented on June 12, 2024

I have a similar issue when trying to delete records via a UTC timestamp field. I was not able to create a UTC timestamp for the right side of the predicate:

import duckdb
from deltalake import DeltaTable, write_deltalake

pa = duckdb.sql("select now()::timestamptz at time zone 'CET' at time zone 'UTC' as ts").to_arrow_table()
write_deltalake(target_table, pa, engine="rust", storage_options=storage_options, mode="overwrite")
dt = DeltaTable(target_table, storage_options=storage_options)
dt.delete("ts >= to_timestamp_micros('2024-03-27 00:00:00Z', '%Y-%m-%d %H:%M:%S%#z')")

fails with:
ValueError: Invalid comparison operation: Timestamp(Microsecond, Some("UTC")) >= Timestamp(Microsecond, None)
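
Since delta-rs parses delete predicates with DataFusion, one possible workaround, offered as an unverified sketch, is to cast the literal to a UTC-aware timestamp with DataFusion's arrow_cast so both sides of the comparison have the same type:

# Unverified sketch: make the literal tz-aware so it matches the
# column's Timestamp(Microsecond, Some("UTC")) type.
dt.delete("ts >= arrow_cast('2024-03-27T00:00:00', 'Timestamp(Microsecond, Some(\"UTC\"))')")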


ravid08 commented on June 12, 2024

I am facing a similar issue when trying to write to a delta table previously written to by a Glue PySpark job. I chose to use Glue for a one-time full load of the source data, and then use Lambda with the deltalake Python package to load ongoing streaming data. The Glue job loads the data, and then Lambda fails with the following error:

Schema error: Fail to merge schema because the from data_type = Timestamp(Microsecond, Some("UTC")) does not equal Timestamp(Nanosecond, None)

I have tried various options, like:

  • setting SparkSession.config("spark.sql.timestampType", "TIMESTAMP_LTZ")
  • using to_utc_timestamp to convert the incoming value to UTC
  • reading the schema of the delta table loaded by Lambda first and then passing it to spark.write.option("schema", myschema)

but I am unable to make these two schemas agree on a common format for timestamps.


ion-elgreco commented on June 12, 2024

@ravid08 I am quite sure you wrote with Spark without changing Spark's default parquet timestamp type from INT96 to TIMESTAMP_MICROS.

Try restoring the table to the version prior to the full load, then do the load with Spark again, with TIMESTAMP_MICROS set as the parquet timestamp type.
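
A rough sketch of that flow with the delta-spark package (the path, version number, and df are placeholders; check the table history for the version that precedes the full load):

from delta.tables import DeltaTable as SparkDeltaTable

spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
tbl = SparkDeltaTable.forPath(spark, "s3://bucket/table/")  # placeholder path
tbl.restoreToVersion(0)  # placeholder: the version before the full load
# Re-run the full load (df stands for the source dataframe from that job);
# new parquet files will now use TIMESTAMP(MICROS).
df.write.format("delta").mode("overwrite").save("s3://bucket/table/")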


ravid08 commented on June 12, 2024

@ion-elgreco can you please clarify? I am using Glue 4.0, which supports Spark 3.3. From what I see, pyspark.sql.functions.timestamp_micros is a new feature in Spark 3.5.
Also, I am not using a delta table; all I am doing is using write_deltalake from writer.py, because the table will be created in an external hive metastore.
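
For what it's worth, the suggestion above refers to the value of the spark.sql.parquet.outputTimestampType config, which predates Spark 3.3, not to the pyspark.sql.functions.timestamp_micros function added in Spark 3.5, so it should be usable from Glue 4.0:

# Config value, not the pyspark function; works on Spark 3.3 / Glue 4.0.
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")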


ion-elgreco commented on June 12, 2024

@ravid08 you mentioned you did a full load with Spark, so I assume it wrote some parquet files. Parquet files written without the timestamp_micros setting will have timestamps stored as INT96. At the moment the parquet crate interprets this as timestamp nanoseconds.
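
One way to check this (a sketch using pyarrow; the file name is a placeholder for one of the table's data files, downloaded locally) is to print the parquet schema and look at the timestamp column's physical type:

import pyarrow.parquet as pq

pf = pq.ParquetFile("part-00000.snappy.parquet")  # placeholder file
# The printed schema shows each column's physical type: INT96 means Spark's
# legacy timestamp encoding; INT64 with a TIMESTAMP(MICROS) logical type
# means the corrected encoding.
print(pf.schema)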


ravid08 commented on June 12, 2024

@ion-elgreco
The original value from the source is a string: 2024-08-06T16:34:16.000574Z
I did a full load of that data using Spark to S3 in delta format:
df = df.withColumn(col, fn.to_timestamp(fn.col(col)))
I created a table in the hive metastore on these delta files, and I could see the datatype is timestamp, not int.


ion-elgreco commented on June 12, 2024

@ravid08 I mean that the timestamp in the parquet file is physically represented as INT96 when you don't set this Spark conf setting.


ravid08 commented on June 12, 2024

@ion-elgreco OK, sounds like Spark 3.5 is required, which doesn't exist in Glue (yet).

