Comments (6)

mjakubowski84 commented on June 6, 2024

@moonkev The linked PR should solve the issue. Unfortunately, I didn't find a simple solution using implicits that would also be easy for library users. Therefore, I decided to extend the existing interfaces with a parameter: ValueCodecConfiguration. Both the reader and the writer now have an optional Options parameter where, amongst other settings, the developer can set the time zone.
Please check if the solution satisfies your needs. I am open to your suggestions regarding both code and the functionality.
Hint, in case you need it: Just use sbt +publishLocal to have the distribution on your local machine.

from parquet4s.

moonkev commented on June 6, 2024

@mjakubowski84 Thank you very much for this addition. I think your implementation is great from a client perspective. I was able to perform testing on a few different aspects. I will say this definitely addresses the issue I raised in #68 where I was seeing midnight UTC rendering as the previous day with a time of 24:00 in Impala. With me specifying the timezone now as UTC, it renders as the proper date, with a time of 00:00.

I did a few other random tests, and almost everything was good. However, I did run into one issue. It arises if you try to read a date time from a time zone where the date is already the next day compared to your local time zone. For instance, right now I am in the US and it is still Feb 20 here. If I write and then read back a time in any time zone where it is still Feb 20, everything is fine. However, if I try with a time zone where it is now Feb 21, I get a failure when trying to read that value. For instance, if I try with the UTC time zone it fails, as the time is underflowing.
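To illustrate the situation described above, here is a minimal sketch using only java.time. The instant is an arbitrary value chosen for illustration, and the object name DateRolloverDemo is hypothetical. It only shows that the same instant can fall on different calendar dates in two time zones; it does not touch parquet4s.

```scala
import java.time.{Instant, LocalDateTime, ZoneId, ZoneOffset}

object DateRolloverDemo {
  // Arbitrary instant chosen for illustration: 02:00 UTC on Feb 21.
  val instant: Instant = Instant.parse("2020-02-21T02:00:00Z")

  // The same instant seen as a local date time in US Central time and in UTC.
  val inCentral: LocalDateTime = LocalDateTime.ofInstant(instant, ZoneId.of("America/Chicago"))
  val inUtc: LocalDateTime = LocalDateTime.ofInstant(instant, ZoneOffset.UTC)

  def main(args: Array[String]): Unit = {
    println(inCentral) // 2020-02-20T20:00 (still Feb 20)
    println(inUtc)     // 2020-02-21T02:00 (already Feb 21)
  }
}
```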

You can reproduce the issue with the following code. You just need to change timeZone to a time zone that has already rolled over to the next day relative to your own.

package com.github.mjakubowski84.parquet4s

import java.time.{LocalDateTime, ZonedDateTime}
import java.util.TimeZone

import org.apache.parquet.hadoop.ParquetFileWriter

object TimezoneVerify extends App {

  case class TimeContainer(zoneId: String, time: LocalDateTime)

  // Change this to a time zone that is already on the next calendar day relative to yours.
  val timeZone = TimeZone.getTimeZone("US/Central")

  // Current wall-clock time in the chosen time zone.
  val dateTime = ZonedDateTime.now(timeZone.toZoneId).toLocalDateTime

  val writeOptions = ParquetWriter.Options(writeMode = ParquetFileWriter.Mode.OVERWRITE, timeZone = timeZone)
  ParquetWriter.write("target/out.parq", Seq(TimeContainer(timeZone.toZoneId.toString, dateTime)), writeOptions)

  // Read back with the same time zone that was used for writing.
  val readOptions = ParquetReader.Options(timeZone = timeZone)
  val record = ParquetReader.read[TimeContainer]("target/out.parq", readOptions).toSeq.head
  println(record)
}

I think the following addition to the overflow check at the end of the decodeLocalDateTime function in the TimeValueCodecs companion object should fix the issue. I tested with this change and was then able to successfully read times from a time zone that is already on the next day relative to mine:

        if (timeInNanos >= NanosPerDay) { // fixes issue with Spark when the number of nanos >= 1 day
          val time = LocalTime.ofNanoOfDay(timeInNanos - NanosPerDay)
          LocalDateTime.of(date.plusDays(1), time)
        } else if (timeInNanos < 0) { // underflow: the time belongs to the previous day
          val time = LocalTime.ofNanoOfDay(timeInNanos + NanosPerDay)
          LocalDateTime.of(date.plusDays(-1), time)
        } else {
          val time = LocalTime.ofNanoOfDay(timeInNanos)
          LocalDateTime.of(date, time)
        }
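For readers who want to try the branch logic in isolation, here is a self-contained sketch of the same idea. The object NanosOfDayRollover and the helper toLocalDateTime are hypothetical names introduced for illustration and are not part of the parquet4s API; NanosPerDay mirrors the constant name in the snippet above. The sketch assumes the shift is smaller than one day in magnitude, which is the case for time zone offsets.

```scala
import java.time.{LocalDate, LocalDateTime, LocalTime}
import scala.util.Try

object NanosOfDayRollover {
  // Number of nanoseconds in one day.
  val NanosPerDay: Long = 24L * 60 * 60 * 1000000000L

  // Hypothetical helper: maps a date plus a possibly out-of-range
  // nano-of-day value onto a valid LocalDateTime.
  // Assumes -NanosPerDay <= timeInNanos < 2 * NanosPerDay.
  def toLocalDateTime(date: LocalDate, timeInNanos: Long): LocalDateTime =
    if (timeInNanos >= NanosPerDay) {
      // Overflow: the time belongs to the next day.
      LocalDateTime.of(date.plusDays(1), LocalTime.ofNanoOfDay(timeInNanos - NanosPerDay))
    } else if (timeInNanos < 0) {
      // Underflow: the time belongs to the previous day.
      LocalDateTime.of(date.minusDays(1), LocalTime.ofNanoOfDay(timeInNanos + NanosPerDay))
    } else {
      LocalDateTime.of(date, LocalTime.ofNanoOfDay(timeInNanos))
    }

  def main(args: Array[String]): Unit = {
    val date = LocalDate.of(2020, 2, 21)
    val sixHours = 6L * 60 * 60 * 1000000000L

    // Without the extra branch, a negative value is rejected by java.time.
    println(Try(LocalTime.ofNanoOfDay(-sixHours)).isFailure) // true

    println(toLocalDateTime(date, -sixHours))   // 2020-02-20T18:00
    println(toLocalDateTime(date, NanosPerDay)) // 2020-02-22T00:00
  }
}
```

If shifts of more than one day ever had to be supported, a variant based on Math.floorDiv and Math.floorMod could replace the three branches, but that goes beyond what the fix above needs.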

mjakubowski84 commented on June 6, 2024

How good it is to have a fellow programmer in a different time zone! ;) I will add more tests and a fix!

mjakubowski84 commented on June 6, 2024

@moonkev I pushed the commit that introduces the fix suggested by you. I also added tests that cover the edge cases. If you think that everything is fine then we can merge, close the issue and release a new version. What do you think?

moonkev commented on June 6, 2024

@mjakubowski84 I ran through all the same tests again, and everything looks great to me. I would say with confidence it is ready for merge. Thank you again for your excellent work on this project! I will definitely be promoting this to fellow developers whenever I get a chance, and will look for opportunities to contribute in the future. It is a very underrated project, and it would allow many projects that are using large frameworks such as Spark Streaming, Flume or NiFi to transition to a much more lightweight solution.

mjakubowski84 commented on June 6, 2024

@moonkev Thank you for your encouraging words!
I didn't think much about applying this project to streaming (I mean infinite streams). I have tried the project only in lightweight batch jobs, and it already serves well there. However, it looks like even the existing SequentialFileSplittingParquetSink can take over where Spark Streaming is usually used. Still, it would need some improvements.
Thanks again for your contribution, I hope to see more of them!
