Comments (36)
@plamut
thanks for helping me; just give me a minute, I will get back to you soon
from python-bigquery.
@jgadbois I suggest the following two workarounds:
- Create the table ahead of time, and append to it using the streaming API client.insert_rows_from_dataframe
- Use pandas-gbq, which currently serializes to CSV, not Parquet. https://pandas-gbq.readthedocs.io/en/latest/writing.html
This library recently added the ability to serialize a DataFrame to CSV for upload. With that format, DATETIME columns are supported.
job_config = bigquery.LoadJobConfig(
    schema=table_schema, source_format=bigquery.SourceFormat.CSV
)
load_job = bigquery_client.load_table_from_dataframe(
    dataframe, table_id, job_config=job_config
)
@plamut I saw @tswast's PR to fix this, but I don't think it is a correct solution:
he changed "datetime64[ns]": "DATETIME" to "datetime64[ns]": "TIMESTAMP". I think this is the wrong fix, because BigQuery's TIMESTAMP forces datetime values into UTC, so the result is incorrect.
@plamut @tswast
I just updated my problem #56 (comment) please take a look
thank you guys.
So you mean I still can't insert DATETIME values into BigQuery and must wait for an update. @plamut @tswast
insert_rows_from_dataframe() uses insert_rows_json() under the hood; it just converts dataframe rows to JSON data and uploads them through the streaming API:
insert_rows_from_dataframe() --> insert_rows() (for each chunk of rows) --> insert_rows_json()
If your data is already in a DataFrame and you do not mind the conversion overhead, then yes, go ahead with insert_rows_from_dataframe(), especially if your use case is not negatively affected by the few trade-offs the streaming API brings (documentation).
@plamut many thanks for your kind help
@namnguyenbk Since this is about a DATE type, I suggest opening a separate issue for it; that will make tracking both of them easier.
I don't know the answer off the top of my head, unfortunately, but I can have a look at it next week when I'm back from a short absence.
@namnguyenbk Good to hear that, and thanks for posting the link. No need to open a separate issue then.
The internal bug 147108331 to add DATETIME (microsecond-precision) support was closed as fixed on March 16, 2020. It's probably worth reverting part of this change https://github.com/googleapis/google-cloud-python/pull/10028/files#diff-f7cb34ad7828ff0648d57694b2fc2aa4L55 or at least adding a test to check that DATETIME values can be uploaded if explicitly set in the schema.
I'll try to have a look at this again at the end of the week, maybe. In the best case, removing the hack that we had to add back then would be enough, meaning that we would only have to (re)add a test covering this specific case.
The backend bug with Parquet DATETIME loading (147108331) is reported to be fixed on the development server. We should be able to remove our client-side workarounds once the fix rolls out to production.
@jimfulton Could you add system tests to https://github.com/googleapis/python-bigquery/blob/master/tests/system/test_pandas.py for uploading DATETIME columns?
While you're at it, could you add similar tests for TIME columns? I believe this will let us close https://issuetracker.google.com/169230812
In both cases, there can be a difference between millisecond-precision values and microsecond-precision values, so we should test for both.
Re: #56 (comment), let's revert the change to the default dtype mapping in a PR to the v3 branch, since that could be considered a breaking change.
I tried changing DATETIME to TIMESTAMP in the schema, but the values always have "UTC" appended at the end. How should I fix them? Please help!
Also, I tried converting this column (df['created']) to milliseconds format, but it didn't work. Please help me. Is that a bug?
@namnguyenbk Will investigate this shortly.
In the meantime, would it be possible to get a minimal reproducible code sample, and the versions of the libraries used? That would make the investigation easier, thanks!
googleapis/google-cloud-python#9996
@tswast said to change DATETIME -> TIMESTAMP to fix this, but I don't think we should do that: TIMESTAMP in BigQuery has a different format from DATETIME, and my client does not accept that format, because the values are forced into UTC when they are not UTC.
This is a bug, and it should be fixed so that DATETIME values can be inserted into BigQuery with load_table_from_dataframe(). I have used insert_rows_from_dataframe(), but that is a streaming API.
@tswast Bringing this to your attention.
You mentioned back then that the backend lacked support for DATETIME when using Parquet uploads. Are there any updates to your backend feature request?
my_df = get_sessions()  # returns a dataframe with a "created" column of datetime64[ns] type, e.g. "2020-01-08 08:00:00"
my_df['created'] = pd.to_datetime(my_df['created'], format='%Y-%m-%d %H:%M:%S').astype('datetime64[ns]')
res = bigquery_client.client.load_table_from_dataframe(my_df, table_id)
res.result()
For example, my value "2020-01-08 08:00:00" ends up as INVALID, or as a value like "0013-03-01T03:05:00". @plamut please help
@plamut I'm using the newest version of google-cloud-python, 1.24.0.
@namnguyenbk Noted, thank you.
You mentioned back then that the backend lacked support for DATETIME when using Parquet uploads. Are there any updates to your backend feature request?
There have been some updates on the internal issue 147108331 that I filed. It looks like DATETIME support might actually be added soon, but needs some time to rollout to production. Hopefully @shollyman can let you know when this changes and we can revert my "fix" to always use TIMESTAMP.
Thanks for the update!
I marked the issue as external, and also classified it as P2: it has already been around for quite a while with a relatively low bug report count, and there also exists a schema workaround (albeit not a perfect one).
Not through the dataframe, unfortunately, at least not yet. Although, IIRC, it might be possible to get around this by using load_table_from_json() instead (with an explicit schema), although the latter is slower than using a dataframe (I would have to re-check).
@plamut How about insert_rows_from_dataframe()? It worked fine for my problem, but it's a streaming API, while I want to insert all rows in a short time, like load_table_from_dataframe() does.
@plamut How do I insert a null DATE into BigQuery using load_table_from_dataframe()? I tried, but it was forced to "0001-01-01".
@plamut I fixed it anyway. I think the conversion in load_table_from_dataframe can be annoying sometimes, btw.
This one saved my life: https://stackoverflow.com/questions/42818262/pandas-dataframe-replace-nat-with-none
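The trick from that answer, replacing NaT with a true None so the streaming API sends a JSON null instead of a bogus date, can be sketched as:

```python
import pandas as pd

df = pd.DataFrame({"created": pd.to_datetime(["2020-01-08 08:00:00", None])})

# Cast to object first so NaT can be replaced by a real Python None.
df = df.astype(object).where(df.notnull(), None)

print(df["created"].tolist())
```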
Hi @plamut Any plan for production rollout?
We are having the same issue with the conversion from "%Y-%m-%d %H:%M:%S" values in a dataframe to DATETIME in BQ.
We are using parquet to speed up our importing task with airflow.
The datetime conversion ends up becoming the blocker.
For the sake of backwards compatibility, converting existing schema type
from DATETIME in bq to TIMESTAMP, unfortunately, is not an option.
@tswast I tried uploading a DATETIME field, but no luck, I'm afraid.
google.api_core.exceptions.BadRequest: 400 Error while reading data, error message: Invalid datetime value 1578470400000000 for field 'datetime_col' of type 'INT64' (logical type 'TIMESTAMP_MICROS'): generic::out_of_range: Cannot return an invalid datetime value of 1578470400000000 microseconds relative to the Unix epoch. The range of valid datetime values is [0001-01-1 00:00:00, 9999-12-31 23:59:59.999999]
The error message is a bit surprising, as 1578470400000000 epoch microseconds is actually 2020-01-08 08:00:00, which falls in the range of valid values.
Posting more details below for a sanity check.
Modified type map to use DATETIME again:
diff --git google/cloud/bigquery/_pandas_helpers.py google/cloud/bigquery/_pandas_helpers.py
index 953b7d0..df66e76 100644
--- google/cloud/bigquery/_pandas_helpers.py
+++ google/cloud/bigquery/_pandas_helpers.py
@@ -55,7 +55,7 @@ _PANDAS_DTYPE_TO_BQ = {
"datetime64[ns, UTC]": "TIMESTAMP",
# BigQuery does not support uploading DATETIME values from Parquet files.
# See: https://github.com/googleapis/google-cloud-python/issues/9996
- "datetime64[ns]": "TIMESTAMP",
+ "datetime64[ns]": "DATETIME",
"float32": "FLOAT",
"float64": "FLOAT",
"int8": "INTEGER",
The script used to test the behavior:
import pandas as pd
from google.cloud import bigquery

PROJECT = "..."
DATASET = "..."
TABLE = "..."


def main():
    bigquery_client = bigquery.Client()
    table_name = f"{PROJECT}.{DATASET}.{TABLE}"

    df = pd.DataFrame({
        "float_col": [0.255, 0.55],
        "datetime_col": ["2020-01-08 08:00:00", "2112-07-22 15:56:00"],
    })
    df = df.astype(dtype={"datetime_col": "datetime64[ns]"})

    job_config = bigquery.LoadJobConfig(
        schema=[
            bigquery.SchemaField("float_col", "FLOAT"),
            bigquery.SchemaField("datetime_col", "DATETIME"),
        ],
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )

    query_job = bigquery_client.load_table_from_dataframe(
        df, table_name, job_config=job_config
    )
    query_job.result()


if __name__ == "__main__":
    main()
The relevant package versions were up to date:
pandas==1.1.1
pyarrow==1.0.1
fastparquet==0.4.1
The parquet file generated and uploaded in the process:
aaa.parquet.gz
Thanks. I've re-opened the internal bug 147108331. I can replicate this error with the bq command as well:
$ bq load --source_format=PARQUET swast-scratch:my_dataset.aaa ./aaa.parquet ./aaa.schema
Upload complete.
Waiting on bqjob_rcdae8f70e185b28_000001742b104318_1 ... (5s) Current status: DONE
BigQuery error in load operation: Error processing job 'swast-scratch:bqjob_rcdae8f70e185b28_000001742b104318_1': Error while
reading data, error message: Invalid datetime value 1578470400000000 for field 'datetime_col' of type 'INT64' (logical type
'TIMESTAMP_MICROS'): generic::out_of_range: Cannot return an invalid datetime value of 1578470400000000 microseconds relative to
the Unix epoch. The range of valid datetime values is [0001-01-1 00:00:00, 9999-12-31 23:59:59.999999]
Failure details:
- prod-scotty-183204df-1614-4661-92c4-727e36ddf960: Error while
reading data, error message: Invalid datetime value
1578470400000000 for field 'datetime_col' of type 'INT64' (logical
type 'TIMESTAMP_MICROS'): generic::out_of_range: Cannot return an
invalid datetime value of 1578470400000000 microseconds relative to
the Unix epoch. The range of valid datetime values is [0001-01-1
00:00:00, 9999-12-31 23:59:59.999999]
Can we confirm that this is converting to microseconds in the Parquet file? Nanoseconds for DATETIME are not supported and eng thinks that may be the cause.
Sorry, I misunderstood the engineers in internal bug 147108331. It was marked fixed because BigQuery no longer loads nonsense DATETIME values when such a file is encountered. Now it raises a validation error.
I've filed 166476249 as a feature request to properly add support for DATETIME values.
Given the above, what is the recommended way to insert a datetime from pandas into a bigquery DATETIME column?
that was a great workaround, thanks!
@tswast your workaround works.