Comments (13)

QuinRiva commented on September 26, 2024

I'm also seeing a problem using load_table_from_dataframe with datetime columns. If I specify the schema as part of the LoadJobConfig (specifying the column as DATE), it works fine.

However, if I fail to specify the schema, it identifies the column as DATETIME and creates invalid dates:

Row | startDate
1   | 0021-04-25T22:13:48.360448
2   | 0021-04-25T22:13:48.360448
3   | 0021-08-13T11:42:29.163840

As opposed to the expected:

Row | startDate
1   | 2017-07-01
2   | 2017-07-01
3   | 2018-01-01

The pandas dtype is datetime64[ns].
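
A minimal sketch of the explicit-schema workaround described above, assuming a DataFrame df with a single startDate column; the table ID is a placeholder:

    from google.cloud import bigquery

    client = bigquery.Client()
    # Declare startDate explicitly as DATE so the client does not
    # auto-detect the datetime64[ns] column as DATETIME.
    job_config = bigquery.LoadJobConfig(
        schema=[bigquery.SchemaField("startDate", "DATE")]
    )
    job = client.load_table_from_dataframe(
        df, "your-project.your_dataset.your_table", job_config=job_config
    )
    job.result()  # wait for the load job to finish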

PaddyAlton commented on September 26, 2024

I've been having the same issue with load_table_from_dataframe since updating dependencies. I have uncovered the real problem. Some of my recent timestamps throw errors like this one:

Cannot return an invalid datetime value of 1575982560000000 microseconds relative to the Unix epoch

What's actually going on here is that I have a pandas DataFrame with values like this:

'2019-12-10 12:56:00', dtype='datetime64[ns]'

which, in ns, is 1575982560000000000.

In other words, somewhere between being a dataframe and ending up in BigQuery, the underlying value stored for this date is being interpreted as microseconds since the Unix epoch rather than nanoseconds!
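
The mismatch is easy to check with plain pandas; this sketch just reproduces the numbers quoted above:

    import pandas as pd

    ts = pd.Timestamp("2019-12-10 12:56:00")
    print(ts.value)          # 1575982560000000000: nanoseconds since the Unix epoch
    print(ts.value // 1000)  # 1575982560000000: the microseconds value from the error

    # Reading the nanosecond integer as if it were microseconds lands roughly
    # 50,000 years in the future, far outside the valid timestamp range.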

I recently had an issue with the same code throwing errors after a pyarrow update, which might be related - I ended up fixing that one by rounding my timestamps to ms-precision before uploading to BigQuery (see https://stackoverflow.com/questions/53333742/schema-conflict-when-storing-dataframes-with-datetime-objects-using-load-table-f for an example).

EDIT: it appears that something has changed in terms of the BigQuery type being inferred from pandas. Previously, the type used was TIMESTAMP. Now, it appears to be DATETIME. I am trying to fix my specific issue by using a partial schema - in case it helps anyone else:

        from google.cloud.bigquery import LoadJobConfig, SchemaField

        job_config = LoadJobConfig()

        # force every datetime64[ns] column to load as TIMESTAMP
        dt_columns = df.columns[df.dtypes == "datetime64[ns]"]

        job_config.schema = [SchemaField(col, "TIMESTAMP") for col in dt_columns]
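
A possible usage sketch follows, assuming an existing bigquery.Client named client; the table ID is a placeholder:

        job = client.load_table_from_dataframe(
            df, "your-project.your_dataset.your_table", job_config=job_config
        )
        job.result()  # wait for the load job to finish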

HemangChothani commented on September 26, 2024

@QuinRiva I have tried to reproduce the issue related to load_table_from_dataframe but was unable to; here are my sample dataframe and code:

contractId | contractTypeId |  startDate    |   endDate   |   created
    2      |    2           |  2019-12-06   | 2019-12-06  | 2019-12-06 17:09:11.317689+00:00
    1      |    1           |  2019-12-06   | 2019-12-06  | 2019-12-06 17:09:11.317689+00:00

df.dtypes

contractId                      int64
contractTypeId                  int64
startDate                      object
endDate                        object
created           datetime64[ns, UTC]

Sample code:

from google.cloud import bigquery

client = bigquery.Client()
destination_table = bigquery.TableReference.from_string("your-project.your_dataset.your_table")
job = client.load_table_from_dataframe(df, destination_table)
job.result()  # wait for the load job to complete

It creates the table with valid dates and datetimes.
If this doesn't help, could you please share reproducible code?

HemangChothani commented on September 26, 2024

@tswast, @plamut could you please comment? Is this related to pyarrow?

tswast commented on September 26, 2024

Actually, let's open a second issue for load_table_from_dataframe: googleapis/google-cloud-python#9996

tswast commented on September 26, 2024

@QuinRiva is the behavior of to_dataframe any different when using the BigQuery Storage API?

You might also want to try passing the dtypes argument to to_dataframe. We don't have any logic to convert from the API response types to pandas dtypes unless the dtypes argument is specified.
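
For reference, a hedged sketch of that suggestion; the table ID and the startDate column name are placeholders:

    from google.cloud import bigquery

    client = bigquery.Client()
    rows = client.list_rows("your-project.your_dataset.your_table")
    # Without the dtypes argument the DATE column stays dtype "object";
    # here we ask for datetime64[ns] explicitly.
    df = rows.to_dataframe(dtypes={"startDate": "datetime64[ns]"})
    print(df.dtypes)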

emkornfield commented on September 26, 2024

@tswast I expect that with the BigQuery Storage API the dtype will still be "object", but at least with Arrow (and I assume Avro) the actual objects will be Python datetime.date objects instead of strings.

@QuinRiva could you specify which dtype/object type you are expecting for the date columns?

HemangChothani commented on September 26, 2024

@emkornfield Another user has filed issue #136, which is a duplicate of this one.

As per the user's comment on that issue, manually setting the dtype argument works, but they would like it automated.

emkornfield commented on September 26, 2024

@HemangChothani I think this might be fixed after e75ff82 (is this in the latest release?). Otherwise, it is possible this is going through a different code path that does not use pyarrow through the Storage API, and, not using Arrow data, might still be keeping the string. pyarrow converts date types to datetime.date objects without intervention when using RecordBatch.to_pandas().

HemangChothani commented on September 26, 2024

@emkornfield I have tried with the master branch, as the feature has not been released, but to_dataframe is still the culprit.

pyarrow works fine, as to_arrow returns a date32 datatype (pyarrow.RecordBatch dt: date32[day]),
but the line given below converts to a DataFrame, and df.dtypes['column'] comes back as object:

df = record_batch.to_pandas()

emkornfield commented on September 26, 2024

@HemangChothani I expect it to be an object (but the object should be a datetime.date). If conversion to dtype=datetime64 is the desired behavior, I think you need to call record_batch.to_pandas(date_as_object=False)
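
A minimal standalone illustration of that parameter, using a hand-built pyarrow RecordBatch with a date32 column:

    import datetime

    import pyarrow as pa

    batch = pa.record_batch(
        [pa.array([datetime.date(2017, 7, 1)], type=pa.date32())], names=["dt"]
    )
    # Default: the column comes back as dtype "object" holding datetime.date values.
    print(batch.to_pandas()["dt"].dtype)
    # With date_as_object=False, pandas receives a native datetime64 column instead.
    print(batch.to_pandas(date_as_object=False)["dt"].dtype)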

HemangChothani commented on September 26, 2024

@emkornfield I had completely overlooked that and expected it to be datetime64; thanks for the clarification. Is there any plan to expose this parameter to users? If not, can we close this issue as resolved?

emkornfield commented on September 26, 2024

> Is there any plan to expose this parameter to users? If not, can we close this issue as resolved?

The parameter is public in pyarrow as far as I can tell. It seems that at least some BigQuery users expect DATE to become datetime64, so I think it would be a good idea to expose this as a parameter in the BigQuery client library, but I don't know the exact policies that are followed.
