
Comments (36)

namnguyenbk commented on September 26, 2024

@plamut
Thanks for helping me. Just wait a minute, I will get back to you soon.

tswast commented on September 26, 2024

@jgadbois I suggest the following two workarounds:

  1. Create the table ahead of time, and append to it using the streaming API client.insert_rows_from_dataframe (see the sketch after this list).
  2. Use pandas-gbq, which currently serializes to CSV, not Parquet: https://pandas-gbq.readthedocs.io/en/latest/writing.html
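
A minimal sketch of the first workaround; the project/dataset/table names here are hypothetical, and 'dataframe' stands in for an existing pandas DataFrame:

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table name and schema for illustration.
table_id = "my-project.my_dataset.my_table"
schema = [bigquery.SchemaField("created", "DATETIME")]

# Create the table ahead of time, then stream the DataFrame rows into it.
table = client.create_table(bigquery.Table(table_id, schema=schema))
errors = client.insert_rows_from_dataframe(table, dataframe)
if any(errors):
    print("Some rows were not inserted:", errors)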

tswast commented on September 26, 2024

This library recently added the ability to serialize a DataFrame to CSV for upload. With that format, DATETIME columns are supported.

from google.cloud import bigquery
from google.cloud.bigquery import SourceFormat

# Assumes bigquery_client, table_schema, dataframe, and table_id are defined.
job_config = bigquery.LoadJobConfig(
    schema=table_schema, source_format=SourceFormat.CSV
)
load_job = bigquery_client.load_table_from_dataframe(
    dataframe, table_id, job_config=job_config
)
load_job.result()  # Wait for the load job to finish.

namnguyenbk commented on September 26, 2024

@plamut I saw that @tswast's PR to fix this is not a correct solution:
he changed "datetime64[ns]": "DATETIME" to "datetime64[ns]": "TIMESTAMP". I think this is the wrong solution because BigQuery's TIMESTAMP forces datetimes into UTC, so the result is incorrect.

namnguyenbk commented on September 26, 2024

@plamut @tswast
I just updated my problem in #56 (comment); please take a look.

namnguyenbk commented on September 26, 2024

Thank you, guys.
So you mean I still can't insert DATETIME into BigQuery and must wait for an update. @plamut @tswast

plamut commented on September 26, 2024

insert_rows_from_dataframe() uses insert_rows_json() under the hood; it just converts dataframe rows to JSON data and uploads them through the streaming API:

insert_rows_from_dataframe() --> insert_rows() (for each rows chunk) --> insert_rows_json()

If your data is already in a DataFrame and you do not mind the conversion overhead, then yes, go ahead with insert_rows_from_dataframe(), especially if your use case is not negatively affected by a few trade-offs the streaming API brings (documentation).

namnguyenbk commented on September 26, 2024

@plamut many thanks for your kind help

plamut commented on September 26, 2024

@namnguyenbk Since this is about a DATE type, I suggest opening a separate issue for it; that will make tracking both of them easier.

I don't know the answer off the top of my head, unfortunately, but I can have a look at it next week when I'm back from a short absence.

plamut commented on September 26, 2024

@namnguyenbk Good to hear that, and thanks for posting the link. No need to open a separate issue then.

tswast commented on September 26, 2024

The internal bug 147108331 to add DATETIME (microsecond-precision) support was closed as fixed on March 16, 2020. It's probably worth reverting part of this change https://github.com/googleapis/google-cloud-python/pull/10028/files#diff-f7cb34ad7828ff0648d57694b2fc2aa4L55 or at least adding a test to check that DATETIME values can be uploaded if explicitly set in the schema.

plamut commented on September 26, 2024

I'll try to have a look at this again at the end of the week, maybe. In the best case, removing the hack that we had to add back then would be enough, meaning that we would only have to (re)add a test covering this specific case.

tswast commented on September 26, 2024

The backend bug with Parquet DATETIME loading (147108331) is reported to be fixed on the development server. We should be able to remove our client-side workarounds once the fix rolls out to production.

tswast commented on September 26, 2024

@jimfulton Could you add system tests to https://github.com/googleapis/python-bigquery/blob/master/tests/system/test_pandas.py for uploading DATETIME columns?

While you're at it, could you add similar tests for TIME columns? I believe this will let us close https://issuetracker.google.com/169230812

In both cases, there can be a difference between millisecond-precision values and microsecond-precision values, so we should test for both.
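
A hypothetical sketch of what such a test might look like (the fixture names bigquery_client and table_id are made up, not the actual fixtures in test_pandas.py):

import datetime

import pandas as pd
from google.cloud import bigquery


def test_load_table_from_dataframe_w_datetime_precision(bigquery_client, table_id):
    df = pd.DataFrame({
        "dt_col": [
            datetime.datetime(2020, 1, 8, 8, 0, 0, 123000),  # millisecond precision
            datetime.datetime(2020, 1, 8, 8, 0, 0, 123456),  # microsecond precision
        ],
    })
    job_config = bigquery.LoadJobConfig(
        schema=[bigquery.SchemaField("dt_col", "DATETIME")],
    )
    bigquery_client.load_table_from_dataframe(
        df, table_id, job_config=job_config
    ).result()

    # Both values should round-trip unchanged.
    rows = sorted(row["dt_col"] for row in bigquery_client.list_rows(table_id))
    assert rows == sorted(df["dt_col"])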

Re: #56 (comment), let's revert the change to the default dtype mapping in a PR to the v3 branch, since that could be considered a breaking change.

namnguyenbk commented on September 26, 2024

I tried changing DATETIME to TIMESTAMP in the schema, but the values always have UTC at the end. How should I fix this, please?

namnguyenbk commented on September 26, 2024

Also, I tried to change this column (df['created']) to millisecond format, and it didn't work. Please help me. Is that a bug?

plamut commented on September 26, 2024

@namnguyenbk Will investigate this shortly.

In the meantime, would it be possible to get a minimal reproducible code sample, along with the versions of the libraries used? That would make the investigation easier, thanks!

namnguyenbk commented on September 26, 2024

googleapis/google-cloud-python#9996
tswast said to change DATETIME -> TIMESTAMP to fix this, but I don't think we should do that. TIMESTAMP in BigQuery has a different format from DATETIME, and my client doesn't accept that format, because the values are forced into UTC when they are not UTC.
This is a bug, and you should fix it so that DATETIME values can be inserted into BigQuery with load_table_from_dataframe(). I have used insert_rows_from_dataframe(), but it is a streaming buffer API.

plamut commented on September 26, 2024

@tswast Bringing this to your attention.

You mentioned back then that the backend lacked support for DATETIME when using Parquet uploads. Are there any updates to your backend feature request?

namnguyenbk commented on September 26, 2024

import pandas as pd

my_df = get_sessions()  # returns a DataFrame whose 'created' column is datetime64[ns], e.g. "2020-01-08 08:00:00"
my_df['created'] = pd.to_datetime(my_df['created'], format='%Y-%m-%d %H:%M:%S').astype('datetime64[ns]')
res = bigquery_client.client.load_table_from_dataframe(my_df, table_id)
res.result()

For example, my value "2020-01-08 08:00:00" is being changed to INVALID, or to a value like "0013-03-01T03:05:00". @plamut please help

namnguyenbk commented on September 26, 2024

I'm using the newest version of google-cloud-python, 1.24.0. @plamut

plamut commented on September 26, 2024

@namnguyenbk Noted, thank you.

tswast commented on September 26, 2024

"You mentioned back then that the backend lacked support for DATETIME when using Parquet uploads. Are there any updates to your backend feature request?"

There have been some updates on the internal issue 147108331 that I filed. It looks like DATETIME support might actually be added soon, but it needs some time to roll out to production. Hopefully @shollyman can let you know when this changes, and then we can revert my "fix" to always use TIMESTAMP.

plamut commented on September 26, 2024

Thanks for the update!

I marked the issue as external, and also classified it as P2 - it's already been around for quite a while now with a relatively low bug report count, and there is also a schema workaround (albeit not a perfect one).

plamut commented on September 26, 2024

Not through the dataframe, unfortunately, at least not yet. Although, IIRC, it might be possible to get around this by using load_table_from_json() instead (with an explicit schema), though the latter is slower than using a dataframe (I would have to re-check).
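
A minimal sketch of that load_table_from_json() workaround, with hypothetical table and column names:

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical names for illustration.
table_id = "my-project.my_dataset.my_table"
json_rows = [{"created": "2020-01-08 08:00:00"}]

# An explicit schema keeps the column as DATETIME instead of relying on autodetection.
job_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField("created", "DATETIME")],
)
load_job = client.load_table_from_json(json_rows, table_id, job_config=job_config)
load_job.result()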

namnguyenbk commented on September 26, 2024

@plamut how about insert_rows_from_dataframe()? It worked fine for my problem, but it's a streaming buffer API, while I want to insert all rows in a short time, like load_table_from_dataframe() does.

namnguyenbk commented on September 26, 2024

@plamut How do I insert a null DATE into BigQuery using load_table_from_dataframe()? I tried, but it was forced to "0001-01-01".

namnguyenbk commented on September 26, 2024

@plamut I fixed it anyway. I think the conversion in load_table_from_dataframe() might be annoying sometimes, btw.
This one saved my life: https://stackoverflow.com/questions/42818262/pandas-dataframe-replace-nat-with-none
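
The trick from that answer, sketched with a hypothetical column name: cast the column to object dtype and replace NaT with None, so nulls survive the upload as proper NULLs:

import pandas as pd

# 'df' is an existing DataFrame; 'date_col' is a hypothetical DATE column.
# NaT entries become None, which BigQuery stores as NULL.
df["date_col"] = df["date_col"].astype(object).where(df["date_col"].notnull(), None)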

PPan1215 commented on September 26, 2024

Hi @plamut, any plan for the production rollout?
We are having the same issue with the conversion from '%Y-%m-%d %H:%M:%S' values in a dataframe to DATETIME in BigQuery.
We are using Parquet to speed up our import task with Airflow, and the datetime conversion ends up being the blocker.
For the sake of backwards compatibility, converting the existing schema type from DATETIME to TIMESTAMP in BigQuery is, unfortunately, not an option.

plamut commented on September 26, 2024

@tswast I tried uploading a DATETIME field, but no luck, I'm afraid.

google.api_core.exceptions.BadRequest: 400 Error while reading data, error message: Invalid datetime value 1578470400000000 for field 'datetime_col' of type 'INT64' (logical type 'TIMESTAMP_MICROS'): generic::out_of_range: Cannot return an invalid datetime value of 1578470400000000 microseconds relative to the Unix epoch. The range of valid datetime values is [0001-01-1 00:00:00, 9999-12-31 23:59:59.999999]

The error message is a bit surprising, as 1578470400000000 epoch microseconds is actually 2020-01-08 08:00:00, which falls in the range of valid values.
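
Double-checking that arithmetic in plain Python:

import datetime

# 1578470400000000 microseconds after the Unix epoch:
epoch = datetime.datetime(1970, 1, 1)
print(epoch + datetime.timedelta(microseconds=1578470400000000))
# -> 2020-01-08 08:00:00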


Posting more details below for a sanity check.

Modified type map to use DATETIME again:

diff --git google/cloud/bigquery/_pandas_helpers.py google/cloud/bigquery/_pandas_helpers.py
index 953b7d0..df66e76 100644
--- google/cloud/bigquery/_pandas_helpers.py
+++ google/cloud/bigquery/_pandas_helpers.py
@@ -55,7 +55,7 @@ _PANDAS_DTYPE_TO_BQ = {
     "datetime64[ns, UTC]": "TIMESTAMP",
     # BigQuery does not support uploading DATETIME values from Parquet files.
     # See: https://github.com/googleapis/google-cloud-python/issues/9996
-    "datetime64[ns]": "TIMESTAMP",
+    "datetime64[ns]": "DATETIME",
     "float32": "FLOAT",
     "float64": "FLOAT",
     "int8": "INTEGER",

The script used to test the behavior:

import pandas as pd
from google.cloud import bigquery


PROJECT = "..."
DATASET = "..."
TABLE = "..."


def main():
    bigquery_client = bigquery.Client()

    table_name = f"{PROJECT}.{DATASET}.{TABLE}"

    df = pd.DataFrame({
        "float_col": [0.255, 0.55],
        "datetime_col": ["2020-01-08 08:00:00", "2112-07-22 15:56:00"],
    })
    df = df.astype(dtype={"datetime_col": "datetime64[ns]"})

    job_config = bigquery.LoadJobConfig(
        schema=[
            bigquery.SchemaField("float_col", "FLOAT"),
            bigquery.SchemaField("datetime_col", "DATETIME"),
        ],
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    load_job = bigquery_client.load_table_from_dataframe(
        df, table_name, job_config=job_config
    )
    load_job.result()


if __name__ == "__main__":
    main()

The relevant package versions were up to date:

pandas==1.1.1
pyarrow==1.0.1
fastparquet==0.4.1

The parquet file generated and uploaded in the process:
aaa.parquet.gz

tswast commented on September 26, 2024

Thanks. I've re-opened the internal bug 147108331. I can replicate this error with the bq command-line tool as well:

$ bq load --source_format=PARQUET swast-scratch:my_dataset.aaa ./aaa.parquet ./aaa.schema
Upload complete.
Waiting on bqjob_rcdae8f70e185b28_000001742b104318_1 ... (5s) Current status: DONE
BigQuery error in load operation: Error processing job 'swast-scratch:bqjob_rcdae8f70e185b28_000001742b104318_1': Error while
reading data, error message: Invalid datetime value 1578470400000000 for field 'datetime_col' of type 'INT64' (logical type
'TIMESTAMP_MICROS'): generic::out_of_range: Cannot return an invalid datetime value of 1578470400000000 microseconds relative to
the Unix epoch. The range of valid datetime values is [0001-01-1 00:00:00, 9999-12-31 23:59:59.999999]
Failure details:
- prod-scotty-183204df-1614-4661-92c4-727e36ddf960: Error while
reading data, error message: Invalid datetime value
1578470400000000 for field 'datetime_col' of type 'INT64' (logical
type 'TIMESTAMP_MICROS'): generic::out_of_range: Cannot return an
invalid datetime value of 1578470400000000 microseconds relative to
the Unix epoch. The range of valid datetime values is [0001-01-1
00:00:00, 9999-12-31 23:59:59.999999]

aaa.schema.zip

tswast commented on September 26, 2024

Can we confirm that this is converting to microseconds in the Parquet file? Nanoseconds for DATETIME are not supported and eng thinks that may be the cause.
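
One way to check, assuming the aaa.parquet file attached above is at hand: read the Parquet schema with pyarrow and inspect the column type.

import pyarrow.parquet as pq

schema = pq.read_schema("aaa.parquet")
# timestamp[us] would mean microseconds; timestamp[ns] would mean nanoseconds.
print(schema.field("datetime_col").type)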

tswast commented on September 26, 2024

Sorry, I misunderstood the engineers in internal bug 147108331. It was marked fixed because BigQuery no longer loads nonsense DATETIME values when such a file is encountered. Now it raises a validation error.

I've filed 166476249 as a feature request to properly add support for DATETIME values.

jgadbois commented on September 26, 2024

Given the above, what is the recommended way to insert a datetime from pandas into a bigquery DATETIME column?

VinceCabs commented on September 26, 2024

That was a great workaround, thanks!

Lingesh7 commented on September 26, 2024

@tswast your workaround works.
