Comments (4)
Hi!
Parquet4s treats timestamps as INT96 (stored as binary), because a Timestamp is a date (Int) plus the time of day in nanoseconds (Long). Datetime64 is less precise. However, that doesn't mean Parquet4s shouldn't be able to read such data.
Can you share a link to a sample file with a datetime64 column, including BC & AD dates, with and without millis? It will help to implement the improvement faster :)
Thanks
from parquet4s.
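For readers unfamiliar with the INT96 layout mentioned above, here is a minimal sketch of how such a value splits into a date part and a time-of-day part. It assumes the layout commonly written by Spark, Hive and Impala: 12 bytes, little-endian, with 8 bytes of nanoseconds within the day followed by 4 bytes holding the Julian day number. The function names are hypothetical and this is not Parquet4s code. Python's `datetime` keeps only microseconds and cannot represent BC dates, so this illustrates the layout rather than a full decoder.

```python
import struct
from datetime import datetime, timedelta, timezone

# Julian day number of 1970-01-01, used to shift Julian days to Unix days.
JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def encode_int96_timestamp(julian_day: int, nanos_of_day: int) -> bytes:
    """Pack (nanos of day, Julian day) into the assumed 12-byte layout."""
    return struct.pack("<qi", nanos_of_day, julian_day)


def decode_int96_timestamp(raw: bytes) -> datetime:
    """Decode a 12-byte INT96 timestamp into a UTC datetime.

    Sub-microsecond digits are truncated because datetime cannot hold them.
    """
    if len(raw) != 12:
        raise ValueError(f"expected 12 bytes, got {len(raw)}")
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return EPOCH + timedelta(days=days_since_epoch,
                             microseconds=nanos_of_day // 1_000)


if __name__ == "__main__":
    one_hour_in_nanos = 3_600 * 1_000_000_000
    raw = encode_int96_timestamp(2_440_589, one_hour_in_nanos)
    print(decode_int96_timestamp(raw))
```

The date and the time of day are stored as two separate integers, which is why INT96 can carry nanosecond precision over a very wide date range.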
INT96 has been deprecated (https://issues.apache.org/jira/browse/PARQUET-323), so many tools are now using INT64 to represent timestamps.
I ran into the same issue because my Parquet file was generated by PyArrow, but I was able to work around it by forcing the INT96 type during Parquet file generation.
That is a fair point about INT64. Parquet4s should support it as a timestamp format in both reads and writes (as an option). Spark, Hive and Impala heavily influenced the library, and even now they all use INT96 for timestamps, at least by default.
I am going to prioritise the work on this issue. @Yanikovic @mbykovskyy If you do not want to wait, you can write a custom decoder typeclass and transform LongValue to a timestamp.
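The arithmetic such a decoder would need can be sketched as follows. This is not the Parquet4s decoder API, which the thread does not show. It only illustrates turning an INT64 value into a timestamp, assuming the value counts milliseconds, microseconds or nanoseconds since the Unix epoch in UTC. The unit has to be taken from the column's logical type in the file schema. The name `int64_to_datetime` and the unit labels are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# How many ticks of each assumed unit make up one second.
UNITS_PER_SECOND = {
    "millis": 1_000,
    "micros": 1_000_000,
    "nanos": 1_000_000_000,
}


def int64_to_datetime(value: int, unit: str = "micros") -> datetime:
    """Convert an INT64 count of `unit` since the epoch to a UTC datetime.

    Nanosecond inputs lose their last three digits, because datetime
    only stores microseconds.
    """
    if unit not in UNITS_PER_SECOND:
        raise ValueError(f"unsupported unit: {unit}")
    per_second = UNITS_PER_SECOND[unit]
    # divmod floors, so values before the epoch get a non-negative remainder.
    seconds, remainder = divmod(value, per_second)
    micros = remainder * 1_000_000 // per_second
    return EPOCH + timedelta(seconds=seconds, microseconds=micros)


if __name__ == "__main__":
    print(int64_to_datetime(1_500_000, "micros"))
    print(int64_to_datetime(-1, "millis"))
```

In a custom decoder, the same conversion would be applied to the long carried by LongValue, with the unit fixed to whatever the writing tool used.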
Out-of-the-box typeclasses were released in https://github.com/mjakubowski84/parquet4s/releases/tag/v2.7.0