Code Monkey home page Code Monkey logo

Comments (6)

gjoseph92 avatar gjoseph92 commented on July 28, 2024 4

@clausmichele is right; a workaround for now is to drop the missing item. Or you could manually fix the input datetime properties to all be in the same format.

However, I consider this a stackstac bug. The problem indeed is with that one datetime near the end that doesn't have fractional seconds:

 '2022-09-03T10:32:02.517000Z',
 '2022-09-03T10:32:05Z',
 '2022-09-05T10:22:19.490000Z',
 '2022-09-05T10:22:20.519000Z']

The STAC spec just says that datetimes should be formatted according to RFC 3339, section 5.6. We can see there that time-secfrac is an optional part of partial-time, so this is still a valid datetime and should be parsed correctly.

I found that the issue seems to be related to the errors="coerce" argument to pd.to_datetime here, along with infer_datetime_format:

times = pd.to_datetime(
[item["properties"]["datetime"] for item in items],
infer_datetime_format=True,
errors="coerce",
)

Weirdly, with pandas 1.3.5, using errors='raise' doesn't raise any error, but seems to parse correctly:

>>> ts = sorted(item.properties["datetime"] for item in items)

>>> pd.to_datetime(ts, infer_datetime_format=True, errors='raise')
DatetimeIndex(['2022-08-04 10:32:05.634000+00:00',
               ...
               '2022-09-03 10:32:02.517000+00:00',
                      '2022-09-03 10:32:05+00:00',
               '2022-09-05 10:22:19.490000+00:00',
               '2022-09-05 10:22:20.519000+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)

>>> pd.to_datetime(ts, infer_datetime_format=True, errors='coerce')
DatetimeIndex(['2022-08-04 10:32:05.634000', '2022-08-04 10:32:08.099000',
               '2022-08-06 10:22:16.627000', '2022-08-06 10:22:18.455000',
               '2022-08-09 10:32:12.888000', '2022-08-09 10:32:15.351000',
               '2022-08-11 10:22:08.959000', '2022-08-11 10:22:10.785000',
               '2022-08-14 10:32:04.138000', '2022-08-14 10:32:06.617000',
               '2022-08-16 10:22:17.889000', '2022-08-16 10:22:19.720000',
               '2022-08-19 10:32:13.767000', '2022-08-19 10:32:16.217000',
               '2022-08-21 10:22:06.588000', '2022-08-21 10:22:08.412000',
               '2022-08-24 10:32:01.411000', '2022-08-24 10:32:03.910000',
               '2022-08-26 10:22:20.430000', '2022-08-26 10:22:21.453000',
               '2022-08-29 10:32:13.577000', '2022-08-29 10:32:16.024000',
               '2022-08-31 10:22:06.328000', '2022-08-31 10:22:08.153000',
               '2022-09-03 10:32:02.517000',                        'NaT',
               '2022-09-05 10:22:19.490000', '2022-09-05 10:22:20.519000'],
              dtype='datetime64[ns]', freq=None)

However, after switching to pandas 2, we do get a helpful and informative error:

>>> pd.__version__
2.0.3

>>> In [7]: pd.to_datetime(ts, errors='raise')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-0ac50347e0b9> in <cell line: 1>()
----> 1 pd.to_datetime(ts, errors='raise')
...
ValueError: time data "2022-09-03T10:32:05Z" doesn't match format "%Y-%m-%dT%H:%M:%S.%f%z", at position 25. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.

This makes sense: there are multiple datetime formats present, so infer_datetime_format (which became True by default in pandas 2) rightfully complains about this. The fact that it didn't raise an error in older versions was probably a bug.

Given that the STAC spec indicates all datetimes should be a subset of ISO 8601, format='ISO8601' seems like just what we want:

>>> pd.to_datetime(ts, errors='raise', format='ISO8601')
DatetimeIndex(['2022-08-04 10:32:05.634000+00:00',
               ...
               '2022-08-31 10:22:08.153000+00:00',
               '2022-09-03 10:32:02.517000+00:00',
                      '2022-09-03 10:32:05+00:00',
               '2022-09-05 10:22:19.490000+00:00',
               '2022-09-05 10:22:20.519000+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)

format='ISO8601' is only available in pandas 2, so that's a compelling reason to require pandas 2, xref #213. (pandas-dev/pandas#41047 is also fixed in newer versions of pandas, allowing us to remove some old code for #2).

I think I'll probably fix this by using format='ISO8601' and requiring pandas 2.

from stackstac.

clausmichele avatar clausmichele commented on July 28, 2024 1

You can drop them before trying to crop to the AOI like this:

https://github.com/Open-EO/openeo-processes-dask/blob/8d9e3fb009676a39c3ce5009bc1c09909e5bbfb3/openeo_processes_dask/process_implementations/cubes/_filter.py#L82

from stackstac.

J6767 avatar J6767 commented on July 28, 2024

Thanks guys, that worked great. stack = stack.where(~np.isnat(stack.time), drop=True)

from stackstac.

gjoseph92 avatar gjoseph92 commented on July 28, 2024

Glad that works for you, but dropping data doesn't seem like an ideal workaround to me, so I'll reopen this until the underlying problem is fixed.

from stackstac.

ZZMitch avatar ZZMitch commented on July 28, 2024

To provide a bit more context to this bug, I have been using stackstac to document HLS (Landsat 8/9 + Sentinel-2) clear observation availability over Canada and came across these NaT time-steps.

My "fix" has been to drop the NaT time-steps in a similar manner as demonstrated here, and I have documenting how prevalent they are across the HLS archive. Originally, I thought this was a metadata issue within the HLS archive, and only recently realized this was a stackstac bug, otherwise I would have fixed the missing dates, rather than removing them.

For HLS, this impacts Sentinel-2 more than Landsat. Across the couple years of imagery I have documented so far (2018 - 2020), it impacts <0.1% of Landsat images and 0.2 - 1% of Sentinel-2 images depending on the year. It seems to be more prevalent in 2018 for S30 (1%), dropping to 0.2% by 2020.

Most of the time, the NaT time-steps are a couple images here and there (e.g., <5 in a year looking at all available imagery). However, I have noticed a few cases where almost every time-step in a year is NaT.

Here is a basic reproducible example of this case:

lon, lat = -91.4888563, 48.8885808 # All but first S30 will be NaT datetime
#lon, lat = -89.4888563, 48.8885808 # No NaT datetime
 
start = '2019-01-01' # In 2020 or starting on 01-03, no issue
end = '2019-12-31'

import os
import pystac_client as pc

gdalEnv = ss.DEFAULT_GDAL_ENV.updated(dict(
    GDAL_DISABLE_READDIR_ON_OPEN = 'TRUE',
    GDAL_HTTP_COOKIEFILE = os.path.expanduser('~/cookies.txt'),
    GDAL_HTTP_COOKIEJAR = os.path.expanduser('~/cookies.txt'),
    GDAL_HTTP_MAX_RETRY = 10,
    GDAL_HTTP_RETRY_DELAY = 15))

catalog = pc.Client.open('https://cmr.earthdata.nasa.gov/stac/LPCLOUD')

itemsS30 = catalog.search(intersects = dict(type = 'Point', coordinates = [lon, lat]), datetime = f'{start}/{end}', collections = ['HLSS30.v2.0'], limit = 100).item_collection()
#itemsS30 # Can see datetime values in all images

import stackstac as ss

s30 = ss.stack(itemsS30, assets = ['Fmask'], epsg = 32615, resolution = 30, gdal_env = gdalEnv)
#s30 # Will see #NaT date-times here

s30.time.values # All but first are NaT

In this case, the first datetime in the stack is '2019-01-02T17:10:02.000000000'. It appears that since this date time does not have fractional seconds, and it is the first datetime in the stack, all the other datetimes are considered invalid and become NaT. Obviously a much more problematic result than having just one time here and there becoming NaT! These situations are pretty rare (I have noticed it only 3-4 times across the few years of work I have done over Canada so far) - but wanted to mention here now that I know it is a stackstac issue rather than an HLS metadata issue.

When doing NRT work or phenology, for-example, every image counts! So good to document and hopefully fix these edge cases.

from stackstac.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.