Comments (13)

QuinRiva commented on September 26, 2024

I'm also seeing a problem using load_table_from_dataframe with datetime columns. If I specify the schema as part of the LoadJobConfig (specifying the column as DATE), it works fine.

However, if I fail to specify the schema, it identifies the column as DATETIME and creates invalid dates:

Row | startDate
1   | 0021-04-25T22:13:48.360448
2   | 0021-04-25T22:13:48.360448
3   | 0021-08-13T11:42:29.163840

As opposed to the expected:

Row | startDate
1   | 2017-07-01
2   | 2017-07-01
3   | 2018-01-01

The pandas dtype is datetime64[ns].
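
A minimal sketch of the explicit-schema workaround described above, assuming a DataFrame df with a single startDate column; the table ID is a placeholder:

    from google.cloud import bigquery

    client = bigquery.Client()
    # Declare startDate explicitly as DATE so the client does not
    # auto-detect the datetime64[ns] column as DATETIME.
    job_config = bigquery.LoadJobConfig(
        schema=[bigquery.SchemaField("startDate", "DATE")]
    )
    job = client.load_table_from_dataframe(
        df, "your-project.your_dataset.your_table", job_config=job_config
    )
    job.result()  # wait for the load job to finish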

PaddyAlton commented on September 26, 2024

I've been having the same issue with load_table_from_dataframe since updating dependencies. I have uncovered the real problem. Some of my recent timestamps throw errors like this one:

Cannot return an invalid datetime value of 1575982560000000 microseconds relative to the Unix epoch

What's actually going on here is that I have a pandas DataFrame with values like this:

'2019-12-10 12:56:00', dtype='datetime64[ns]'

which, in ns, is 1575982560000000000.

In other words, somewhere between being a dataframe and ending up in BigQuery, the underlying value stored for this date is being interpreted as microseconds since the Unix epoch rather than nanoseconds!
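
The mismatch is easy to check with plain pandas; this sketch just reproduces the numbers quoted above:

    import pandas as pd

    ts = pd.Timestamp("2019-12-10 12:56:00")
    print(ts.value)          # 1575982560000000000: nanoseconds since the Unix epoch
    print(ts.value // 1000)  # 1575982560000000: the microseconds value from the error

    # Reading the nanosecond integer as if it were microseconds lands roughly
    # 50,000 years in the future, far outside the valid timestamp range.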

I recently had an issue with the same code throwing errors after a pyarrow update, which might be related - I ended up fixing that one by rounding my timestamps to ms-precision before uploading to BigQuery (see https://stackoverflow.com/questions/53333742/schema-conflict-when-storing-dataframes-with-datetime-objects-using-load-table-f for an example).

EDIT: it appears that something has changed in terms of the BigQuery type being inferred from pandas. Previously, the type used was TIMESTAMP. Now, it appears to be DATETIME. I am trying to fix my specific issue by using a partial schema - in case it helps anyone else:

        from google.cloud.bigquery import LoadJobConfig, SchemaField

        job_config = LoadJobConfig()

        # force every datetime64[ns] column to load as TIMESTAMP
        dt_columns = df.columns[df.dtypes == "datetime64[ns]"]

        job_config.schema = [SchemaField(col, "TIMESTAMP") for col in dt_columns]
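
A possible usage sketch follows, assuming an existing bigquery.Client named client; the table ID is a placeholder:

        job = client.load_table_from_dataframe(
            df, "your-project.your_dataset.your_table", job_config=job_config
        )
        job.result()  # wait for the load job to finish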

HemangChothani commented on September 26, 2024

@QuinRiva I have tried to reproduce the issue related to load_table_from_dataframe but was unable to; here are my sample dataframe and code:

contractId | contractTypeId |  startDate    |   endDate   |   created
    2      |    2           |  2019-12-06   | 2019-12-06  | 2019-12-06 17:09:11.317689+00:00
    1      |    1           |  2019-12-06   | 2019-12-06  | 2019-12-06 17:09:11.317689+00:00

df.dtypes

contractId                      int64
contractTypeId                  int64
startDate                      object
endDate                        object
created           datetime64[ns, UTC]

Sample code:

from google.cloud import bigquery

client = bigquery.Client()
destination_table = bigquery.TableReference.from_string("your-project.your_dataset.your_table")
job = client.load_table_from_dataframe(df, destination_table)
job.result()  # wait for the load job to complete

It creates the table with valid dates and datetimes.
If this doesn't help, could you please share reproducible code?

HemangChothani commented on September 26, 2024

@tswast, @plamut could you please comment? Is this related to pyarrow?

tswast commented on September 26, 2024

Actually, let's open a second issue for load_table_from_dataframe: googleapis/google-cloud-python#9996

tswast commented on September 26, 2024

@QuinRiva is the behavior of to_dataframe any different when using the BigQuery Storage API?

You might also want to try passing the dtypes argument to to_dataframe. We don't have any logic to convert from the API response types to pandas dtypes unless the dtypes argument is specified.
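
For reference, a hedged sketch of that suggestion; the table ID and the startDate column name are placeholders:

    from google.cloud import bigquery

    client = bigquery.Client()
    rows = client.list_rows("your-project.your_dataset.your_table")
    # Without the dtypes argument the DATE column stays dtype "object";
    # here we ask for datetime64[ns] explicitly.
    df = rows.to_dataframe(dtypes={"startDate": "datetime64[ns]"})
    print(df.dtypes)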

emkornfield commented on September 26, 2024

@tswast I expect that with the BigQuery Storage API the dtype will still be "object", but at least with Arrow (and I assume Avro) the actual objects will be Python datetime.date objects instead of strings.

@QuinRiva could you specify which dtype/object type you are expecting for the date columns?

HemangChothani commented on September 26, 2024

@emkornfield Another user has filed issue #136, which is a duplicate of this one.

As per the user's comment on that issue, manually setting the dtype argument works, but they would like it automated.

emkornfield commented on September 26, 2024

@HemangChothani I think this might be fixed after e75ff82 (is this in the latest release?). Otherwise, it is possible this is going through a different code path that does not use pyarrow through the Storage API, and, not using Arrow data, might still be keeping the string. pyarrow converts date types to datetime.date objects without intervention when using RecordBatch.to_pandas().

HemangChothani commented on September 26, 2024

@emkornfield I have tried with the master branch, as the feature has not been released, but to_dataframe is still the culprit.

pyarrow works fine, as to_arrow returns a date32 datatype (pyarrow.RecordBatch dt: date32[day]),
but the line given below converts to a DataFrame, and df.dtypes['column'] comes back as object:

df = record_batch.to_pandas()

emkornfield commented on September 26, 2024

@HemangChothani I expect it to be an object (but the object should be a datetime.date). If conversion to dtype=datetime64 is the desired behavior, I think you need to call record_batch.to_pandas(date_as_object=False)
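
A minimal standalone illustration of that parameter, using a hand-built pyarrow RecordBatch with a date32 column:

    import datetime

    import pyarrow as pa

    batch = pa.record_batch(
        [pa.array([datetime.date(2017, 7, 1)], type=pa.date32())], names=["dt"]
    )
    # Default: the column comes back as dtype "object" holding datetime.date values.
    print(batch.to_pandas()["dt"].dtype)
    # With date_as_object=False, pandas receives a native datetime64 column instead.
    print(batch.to_pandas(date_as_object=False)["dt"].dtype)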

HemangChothani commented on September 26, 2024

@emkornfield I had completely overlooked that and expected it to be datetime64; thanks for the clarification. Is there any plan to expose this parameter to users? If not, can we close this issue as resolved?

emkornfield commented on September 26, 2024

> Is there any plan to expose this parameter to users? If not, can we close this issue as resolved?

The parameter is public in pyarrow as far as I can tell. It seems that at least some BigQuery users expect DATE to become datetime64, so I think it would be a good idea to expose this as a parameter in the BigQuery client library, but I don't know the exact policies that are followed.
