Code Monkey home page Code Monkey logo

Comments (14)

appletreeisyellow avatar appletreeisyellow commented on July 3, 2024 1

I'd like to work on this issue 🙋‍♀️

from arrow-datafusion.

tustvold avatar tustvold commented on July 3, 2024 1

The challenge with an approach that relies on stripping the timezone, is you end up back with the ambiguous timestamp issue. The UX of relying on users to strip the timezone is also rather unfortunate. I'm curious as to why we wouldn't want to just make date_bin timezone aware? This would appear to have a lot less potential edge cases

from arrow-datafusion.

alamb avatar alamb commented on July 3, 2024

The way you can perform this binning in postgres is somewhat paradoxically to convert a timestamp with a timezone back to a timestamp without timezone and then apply date_bin.

The syntax to convert a timestamp to something without timezone is 🤯 : you apply AT TIME ZONE to a timestamp with a timezone and then it has no timezone) as @mhilton notes in #10368 (comment)

postgres=# select pg_typeof('2024-05-21T12:00:00'::timestamp AT TIME ZONE 'America/New_York');
        pg_typeof
--------------------------
 timestamp with time zone
(1 row)

postgres=# select pg_typeof('2024-05-21T12:00:00'::timestamp AT TIME ZONE 'America/New_York' AT TIME ZONE 'Europe/Brussels'
);
          pg_typeof
-----------------------------
 timestamp without time zone. <-- WTF no timezone as the result of AT TIME ZONE
(1 row)

from arrow-datafusion.

alamb avatar alamb commented on July 3, 2024

If we cast using arrow_cast back to Timestamp(Nanosecond, None) the binning does appear to work correctly

> create or replace view t_roundtrip as select arrow_cast(column1, 'Timestamp(Nanosecond, None)') as "column1" from t;
0 row(s) fetched.
Elapsed 0.002 seconds.

> select * from t_roundtrip;
+---------------------+
| column1             |
+---------------------+
| 2024-01-01T00:00:01 | <--- timestamps now have no timezone
| 2024-02-01T00:00:01 |
| 2024-03-01T00:00:01 |
| 2024-04-01T00:00:01 |
| 2024-05-01T00:00:01 |
| 2024-06-01T00:00:01 |
| 2024-07-01T00:00:01 |
| 2024-08-01T00:00:01 |
| 2024-09-01T00:00:01 |
| 2024-10-01T00:00:01 |
| 2024-11-01T00:00:01 |
| 2024-12-01T00:00:01 |
+---------------------+
12 row(s) fetched.
Elapsed 0.001 seconds.

> select arrow_typeof(column1) as arrow_type, column1, date_bin(interval '1 day', column1) as "date_bin" from t_roundtrip;
+-----------------------------+---------------------+---------------------+
| arrow_type                  | column1             | date_bin            |
+-----------------------------+---------------------+---------------------+
| Timestamp(Nanosecond, None) | 2024-01-01T00:00:01 | 2024-01-01T00:00:00 | <-- bins are as desired
| Timestamp(Nanosecond, None) | 2024-02-01T00:00:01 | 2024-02-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-03-01T00:00:01 | 2024-03-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-04-01T00:00:01 | 2024-04-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-05-01T00:00:01 | 2024-05-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-06-01T00:00:01 | 2024-06-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-07-01T00:00:01 | 2024-07-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-08-01T00:00:01 | 2024-08-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-09-01T00:00:01 | 2024-09-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-10-01T00:00:01 | 2024-10-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-11-01T00:00:01 | 2024-11-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-12-01T00:00:01 | 2024-12-01T00:00:00 |
+-----------------------------+---------------------+---------------------+
12 row(s) fetched.
Elapsed 0.003 seconds.

from arrow-datafusion.

alamb avatar alamb commented on July 3, 2024

Given the statement in the description, here is the best I can come up with using arrow_cast

-- Times in brussels
WITH t_brussels
AS (
  SELECT
    column1 AT TIME ZONE 'Europe/Brussels' as ts -- timestamp in specified timezone
  FROM t_utc
)
SELECT
  ts as "time in Brussels",
  date_bin(
    interval '1 day',
    arrow_cast(ts, 'Timestamp(Nanosecond, None)'),
    '2020-01-01T00:00:00Z'::timestamp
  ) as date_bin
FROM
  t_brussels;

+---------------------------+---------------------+
| time in Brussels          | date_bin            |
+---------------------------+---------------------+
| 2024-04-30T23:30:00+02:00 | 2024-04-30T00:00:00 |
| 2024-05-01T00:30:00+02:00 | 2024-04-30T00:00:00 | <-- this is in the wrong bin
| 2024-05-01T01:30:00+02:00 | 2024-04-30T00:00:00 | <-- this in in the wrong bin 
| 2024-05-01T02:00:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T02:30:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T12:30:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T22:30:00+02:00 | 2024-05-01T00:00:00 |
+---------------------------+---------------------+
7 row(s) fetched.
Elapsed 0.004 seconds.

from arrow-datafusion.

alamb avatar alamb commented on July 3, 2024

@mhilton and I agree that if we had the functionality suggested by @Abdullahsab3's on #10368 (comment)

given a UTC timestamp, I would like to have that timestamp in local time of a given timezone

We think we could get the right value out of date_bin.

For example, If we had a function like remove_timezone that behaved like

select remove_timezone("time in Brussels")

+---------------------------+---------------------+
| time in Brussels          | remove_timezone            |
+---------------------------+---------------------+
| 2024-04-30T23:30:00+02:00 | 2024-04-30T23:30:00 |
| 2024-05-01T00:30:00+02:00 | 2024-05-01T00:30:00 | <-- timezone removed (no +02:00, but value not adjusted)
| 2024-05-01T01:30:00+02:00 | 2024-05-01T01:30:00 |
| 2024-05-01T02:00:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T02:30:00+02:00 | 2024-05-01T02:30:00 |
| 2024-05-01T12:30:00+02:00 | 2024-05-01T12:30:00 |
| 2024-05-01T22:30:00+02:00 | 2024-05-01T22:30:00 |
+---------------------------+---------------------+

Note that this is different than arrow_cast("time in Brussels", 'Timestamp(Nanosecond, None)') because arrow_cast will adjust the timestamp values

-- Times in brussels
WITH t_brussels
AS (
  SELECT
    column1 AT TIME ZONE 'Europe/Brussels' as ts -- timestamp in specified timezone
  FROM t_utc
)
SELECT
  ts as "time in Brussels",
  arrow_cast(ts, 'Timestamp(Nanosecond, None)') as arrow_cast
FROM
  t_brussels;

+---------------------------+---------------------+
| time in Brussels          | arrow_cast          |
+---------------------------+---------------------+
| 2024-04-30T23:30:00+02:00 | 2024-04-30T21:30:00 |
| 2024-05-01T00:30:00+02:00 | 2024-04-30T22:30:00 | <-- note this is now in 2024-04-30 not 2024-05-01
| 2024-05-01T01:30:00+02:00 | 2024-04-30T23:30:00 |
| 2024-05-01T02:00:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T02:30:00+02:00 | 2024-05-01T00:30:00 |
| 2024-05-01T12:30:00+02:00 | 2024-05-01T10:30:00 |
| 2024-05-01T22:30:00+02:00 | 2024-05-01T20:30:00 |
+---------------------------+---------------------+
7 row(s) fetched.
Elapsed 0.003 seconds.

from arrow-datafusion.

alamb avatar alamb commented on July 3, 2024

My suggested next steps for this ticket:

  1. Someone prototype the "strip_timezone" function as a ScalarUDF and verify that we can in fact we can achieve the expected result from date_bin
  2. If we can achieve the expected results, then file another ticket / proposal with how to integrate this into DataFusion (e.g. strip_timezone function, and/or update the arrow_cast kernel, and/or add syntax like WITHOUT TIME ZONE)

from arrow-datafusion.

alamb avatar alamb commented on July 3, 2024

I'm curious as to why we wouldn't want to just make date_bin timezone aware? This would appear to have a lot less potential edge cases

I believe @mhilton had thought about this and maybe has some thoughts

from arrow-datafusion.

mhilton avatar mhilton commented on July 3, 2024

I'm curious as to why we wouldn't want to just make date_bin timezone aware?

I'm not sure that it's obvious how a time-zone aware date_bin would behave. For example On the day in october when many time zones repeat an hour would the two hours count into a single bin? What about if you had hour long bins that start 30 minutes into the hour? Given that the two hours are easily distinguishable the answer is probably not, and if you really want that behviour then stripping time zone information would be a reasonable way to get it.

If we restricted timezone awareness to intervals of unit days then that would be much more reasonable. We could make it so that date_bin understands the number of hours in any given day in a time zone and bins a time accordingly. Anyone needing a strict 24 hour binning could specify the interval in hour units.

from arrow-datafusion.

tustvold avatar tustvold commented on July 3, 2024

zones repeat an hour would the two hours count into a single bin

This would be the intuitive thing for it to do IMO, if people don't want this, they should cast to a timezone that doesn't have DST

from arrow-datafusion.

appletreeisyellow avatar appletreeisyellow commented on July 3, 2024

To make date_bin timezone aware, there are some edge cases we need to consider when design it:

  1. Daylight Saving Time (DST) Transitions:
    • Spring Forward: When the clocks move forward, there is a "missing" hour.
    • Fall Back: When the clocks move backward, there is an "extra" hour. For example, in US central time zone, when DST ends at 2:00 AM, the clocks are set back to 1:00 AM. This means that there are two 1:20 AM in a day. If a user do a date_bin with 10 min interval, how to handle the returned data? Aggregate them? Return two sets of data?
  2. Crossing Midnight Boundaries:
    • Ensure that timestamps are correctly binned at the right local day and hour boundaries, especially when converting from UTC to a local timezone. This edge case is what this issue is trying to solve.
  3. Timezone Offsets:
    • Different timezones have different offsets from UTC, and these can change over time (e.g., due to DST).
    • For certain timezones, ensure the offset works correctly at the 30-min mark when DST happens. e.g.:
      • Lord Howe Island Time Zone (LHST) in Australia shifts time by 30 min during DST
      • Nepal (NPT) shifts by 45 min
      • Chatham Islands, New Zealand (CHAST) shifts by 12 hr and 45 min.
    • Ensure the function correctly applies the current offset for the given timestamp, considering historical changes in timezone rules. e.g.:
      • America/Sao_Paulo stoped doing DST in 2019
      • Russia set the clocks ahead permanently in 2011, then was reversed in 2014
      • Iceland, Turkey, Egypt, India, and China
  4. Leap Years and Leap Seconds:
    • Leap Years: February 29th is handled correctly in leap years.
    • Leap Seconds: this is less common. e.g. the most recent one was on Dec. 31, 2016, a leap second was introduced at 23:59:60 UTC
  5. Handling Null and Invalid Timestamps:
    • Ensure that null or invalid timestamps are handled gracefully, either by ignoring them or providing a default bin. e.g. '2021-03-28T02:30:00' AT TIME ZONE 'Europe/Brussels' does not exist
  6. Time Precision:
    • Ensure that the function handles different precisions of timestamps correctly (e.g., seconds, milliseconds, nanoseconds).

from arrow-datafusion.

appletreeisyellow avatar appletreeisyellow commented on July 3, 2024
  • Fall Back: When the clocks move backward, there is an "extra" hour. For example, in US central time zone, when DST ends at 2:00 AM, the clocks are set back to 1:00 AM. This means that there are two 1:20 AM in a day. If a user do a date_bin with 10 min interval, how to handle the returned data? Aggregate them? Return two sets of data?

I want to highlight this open-ended question and hear how others think

from arrow-datafusion.

Abdullahsab3 avatar Abdullahsab3 commented on July 3, 2024

Thanks for filing the ticket and for all the detailed explanations! very enriching

I wonder whether the Postgres behavior is actually that bad. Though it looks weird, it still is generic enough to make it widely applicable. The issue with making date_bin specifically timezone-aware in my opinion is the fact that future or similar grouping functions (or rolling windows functions) will also have to be implemented in a timezone-aware way, accounting for similar pitfalls and edge cases. The same problem will also occur in other functionalities; for example if you would like to return local time. I personally think that the problem might be better tackled as an additional time functionality, which may be the same way that Postgres does it.

The postgres way of converting UTC to local time of a given timezone is:

select '2024-05-21T12:00:00Z'::timestamp AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels';

timezone
--
2024-05-21 14:00:00

As already mentioned by Andrew, the AT TIME ZONE operator in postgres converts a timezone-aware timestamp to local time (with no offset or timezone information), and local time to a timezone-aware timestamp. Though the overloaded functionalities of the AT TIME ZONE operator in Postgres are weird, they are definitely usable in my opinion. I also like the idea of having a separate to_local_time/strip_timezone function, though this would break Datafusions compatibility with Postgres.

from arrow-datafusion.

alamb avatar alamb commented on July 3, 2024

Thank you @tustvold and @Abdullahsab3 and @mhilton and @appletreeisyellow for the thoughts.

From my perspective, the current (non timezone aware) date_bin function has the benefit of being (relatively) simple to implement and fast (as it is some arithmetic calculation on the underlying integer value without having to do DST conversions per row)

Given the differences in underlying timestamp representation between arrow and postgres I do think some difference is likely inevitable and thus likely not a deal breaker.

Here are my suggested next steps @appletreeisyellow tries to prototype one or both proposals and see if we can get it to produce the desired results:

  1. create a to_local_time function
  2. Modify to the date_bin function to make it timezone aware

I think the to_local_time might be the simplest approach.

  • A ScalarUDF
  • a Signature that takes Timestamp(.., *)
  • produces a Timestamp(.., None).
  • The invoke function would do the oppositte of whatever cast(Timestamp(.., None) --> Timestamp(.., TZ)) does

from arrow-datafusion.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.