Comments (6)
@moonkev The linked PR should solve the issue. Unfortunately, I didn't find a simple solution using implicits that would also be easy for library users, so I decided to extend the existing interfaces with a ValueCodecConfiguration parameter. Both the reader and the writer now take an optional Options parameter where, amongst other settings, a developer can set the time zone.
Please check whether the solution satisfies your needs. I am open to your suggestions regarding both the code and the functionality.
Hint, in case you need it: just run sbt +publishLocal to get the distribution on your local machine.
from parquet4s.
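To make the effect of the new time-zone option concrete, here is a small, self-contained java.time sketch (plain standard library, not parquet4s code) showing why the chosen time zone changes the epoch value that an epoch-based timestamp codec would store. `TimeZoneEffect` and `epochMillisIn` are hypothetical names used only for illustration:

```scala
import java.time.LocalDateTime
import java.util.TimeZone

object TimeZoneEffect {
  // Hypothetical helper (not part of parquet4s): the epoch millis of a
  // LocalDateTime interpreted in the given time zone, the way an
  // epoch-based timestamp codec would see it.
  def epochMillisIn(local: LocalDateTime, zone: TimeZone): Long =
    local.atZone(zone.toZoneId).toInstant.toEpochMilli

  def main(args: Array[String]): Unit = {
    val midnight = LocalDateTime.of(2019, 2, 21, 0, 0)
    val utc      = epochMillisIn(midnight, TimeZone.getTimeZone("UTC"))
    val central  = epochMillisIn(midnight, TimeZone.getTimeZone("US/Central"))
    // The same local midnight maps to instants six hours apart (CST = UTC-6):
    println(central - utc) // 21600000
  }
}
```

Unless both sides agree on the zone, the reader reconstructs a different local date-time than the one that was written, which is exactly the rendering problem reported in #68.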
@mjakubowski84 Thank you very much for this addition. I think your implementation is great from a client perspective. I tested a few different aspects, and this definitely addresses the issue I raised in #68, where midnight UTC rendered as the previous day with a time of 24:00 in Impala. With the time zone now specified as UTC, it renders as the proper date with a time of 00:00.
I did a few other random tests, and mostly everything was good; however, I ran into one issue. It arises when you read a date-time from a time zone where the date is already the next day compared to your local time zone. For instance, right now I am in the US and it is still Feb 20 here. If I write and then read back a time in any time zone where it is still Feb 20, everything is fine. However, with a time zone where it is already Feb 21 (UTC, for instance), reading the value fails because the time underflows.
You can reproduce the issue with the following code (you just need to change the time zone to one that happens to have rolled over to the next day relative to your own):
```scala
package com.github.mjakubowski84.parquet4s

import java.time.{LocalDateTime, ZonedDateTime}
import java.util.TimeZone

import org.apache.parquet.hadoop.ParquetFileWriter

object TimezoneVerify extends App {
  case class TimeContainer(zoneId: String, time: LocalDateTime)

  val timeZone = TimeZone.getTimeZone("US/Central")
  val dateTime = ZonedDateTime.now(timeZone.toZoneId).toLocalDateTime

  val writeOptions = ParquetWriter.Options(writeMode = ParquetFileWriter.Mode.OVERWRITE, timeZone = timeZone)
  ParquetWriter.write("target/out.parq", Seq(TimeContainer(timeZone.toZoneId.toString, dateTime)), writeOptions)

  // Pass the read options so the reader uses the same time zone as the writer.
  val readOptions = ParquetReader.Options(timeZone = timeZone)
  val record = ParquetReader.read[TimeContainer]("target/out.parq", readOptions).toSeq.head
  println(record)
}
```
I think the following addition to the overflow check at the end of the decodeLocalDateTime function in the TimeValueCodecs companion object should fix the issue (I tested with this change and was then able to successfully read times from a time zone that is already on the next day relative to mine):
```scala
if (timeInNanos >= NanosPerDay) { // overflow: fixes issue with Spark when the number of nanos >= 1 day
  val time = LocalTime.ofNanoOfDay(timeInNanos - NanosPerDay)
  LocalDateTime.of(date.plusDays(1), time)
} else if (timeInNanos < 0) { // underflow: nanos went negative after the time-zone adjustment
  val time = LocalTime.ofNanoOfDay(timeInNanos + NanosPerDay)
  LocalDateTime.of(date.plusDays(-1), time)
} else {
  val time = LocalTime.ofNanoOfDay(timeInNanos)
  LocalDateTime.of(date, time)
}
```
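The proposed fix can be exercised in isolation with a standalone sketch. `NanosNormalization` and `normalize` below are hypothetical names that mirror the over/underflow handling above; this is not the actual parquet4s codec:

```scala
import java.time.{LocalDate, LocalDateTime, LocalTime}

object NanosNormalization {
  val NanosPerDay: Long = 24L * 60L * 60L * 1000000000L

  // Hypothetical helper mirroring the suggested over/underflow handling:
  // roll the date forward or backward when the nanos-of-day fall outside [0, NanosPerDay).
  def normalize(date: LocalDate, timeInNanos: Long): LocalDateTime =
    if (timeInNanos >= NanosPerDay)
      LocalDateTime.of(date.plusDays(1), LocalTime.ofNanoOfDay(timeInNanos - NanosPerDay))
    else if (timeInNanos < 0)
      LocalDateTime.of(date.plusDays(-1), LocalTime.ofNanoOfDay(timeInNanos + NanosPerDay))
    else
      LocalDateTime.of(date, LocalTime.ofNanoOfDay(timeInNanos))

  def main(args: Array[String]): Unit = {
    // Underflow: reading 2019-02-21T00:00 UTC from a UTC-6 zone shifts the nanos below zero.
    println(normalize(LocalDate.of(2019, 2, 21), -6L * 3600L * 1000000000L)) // 2019-02-20T18:00
    // Overflow: 25 hours' worth of nanos rolls over into the next day.
    println(normalize(LocalDate.of(2019, 2, 20), 25L * 3600L * 1000000000L)) // 2019-02-21T01:00
  }
}
```

Without the `timeInNanos < 0` branch, `LocalTime.ofNanoOfDay` throws for the underflow case, which matches the failure described above.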
How good it is to have a fellow programmer in a different time zone! ;) I will add more tests and a fix!
@moonkev I pushed the commit that introduces the fix you suggested. I also added tests that cover the edge cases. If you think everything is fine, then we can merge, close the issue, and release a new version. What do you think?
@mjakubowski84 I ran through all the same tests again, and everything looks great to me. I would say with confidence it is ready for merge. Thank you again for your excellent work on this project! I will definitely be promoting this to fellow developers whenever I get a chance, and will look for opportunities to contribute in the future. It is a very underrated project, and it would allow many projects that use large frameworks such as Spark Streaming, Flume, or NiFi to transition to a much more lightweight solution.
@moonkev Thank you for your kind words!
I hadn't thought much about applying this project to streaming (I mean infinite streams). I have only tried it in lightweight batch jobs, where it already serves well. However, it looks like even the existing SequentialFileSplittingParquetSink could take over where Spark Streaming is usually used. Still, it would need some improvements.
Thanks again for your contribution, I hope to see more of them!