pystac's People

Contributors

bobinmathew, chelm, chuckwondo, dependabot[bot], duckontheweb, emmanuelmathot, fnattino, gadomski, hectcastro, ircwaves, john-dupuy, jpolchlo, jsignell, kylebarron, l0b0, lossyrob, m-mohr, martinfleis, matthewhanson, philvarner, pjhartzell, richardscottoz, schwehr, sgillies, simonkassel, thomas-maschler, tschaub, tyler-c2s, volaya, whatnick


pystac's Issues

Infinite recursion with `get_all_items`

When reading s3://rasterfoundry-development-data-us-east-1/berlin-catalog/catalog.json, initially, the reported links are correct:

>>> from pystac import Catalog
>>> catalog = Catalog.from_file("s3://rasterfoundry-development-data-us-east-1/berlin-catalog/catalog.json")
>>> catalog.links

[<Link rel=child target=./collection.json>,
 <Link rel=self target=s3://rasterfoundry-development-data-us-east-1/berlin-catalog/catalog.json>,
 <Link rel=root target=<Catalog id=berlin>>]

However, iterating over `get_all_items` raises `RecursionError: maximum recursion depth exceeded in comparison`, and inspecting the links after the fact reveals that the catalog has picked up a child link to itself:

>>> catalog.links
[<Link rel=child target=<Catalog id=berlin>>,
 <Link rel=self target=s3://rasterfoundry-development-data-us-east-1/berlin-catalog/catalog.json>,
 <Link rel=parent target=<Catalog id=berlin>>,
 <Link rel=root target=<Catalog id=berlin>>]

Discussion: is PyStac useful for creating offline STAC Catalogs?

Hi, I was looking for a support forum but couldn't find one. I hope this is an okay place to ask for support. If not, could someone direct me to the appropriate place?

I've got a set of GeoTIFFs in a Google bucket that I will be adding to over time. I'd like to use STAC to organize them in a standard way for listing and searching. What I'd like to do is create a local catalog and then upload it to a publicly accessible bucket. I followed the tutorial here: https://pystac.readthedocs.io/en/latest/tutorials/how-to-create-stac-catalogs.html and got what I could from the docs.

I made a tracer script to see if this would do what I wanted: https://gist.github.com/richpsharp/77e390901c6c266c00f5c2333a835e48, but when I examine my catalog.json there are a lot of Windows backslashes in the paths. Here's a snippet:

{
    "id": "mygsstac",
    "stac_version": "0.8.1",
    "description": "mygsstac",
    "links": [
        {
            "rel": "root",
            "href": ".\\catalog.json",
            "type": "application/json"
        },
        {
            "rel": "item",
            "href": ".\\raster_a\\raster_a.json",
            "type": "application/json"
        },
...

I was hoping to see paths that would be compatible with URLs, or perhaps to set a "root" from which all the other hrefs would be built. I'm worried that I don't understand how to use STAC or PySTAC and what I'm doing is a little skewed. Either that or I'm missing a big "set base url" switch. Thank you for any help!
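
For what it's worth, normalize_hrefs looks like that "set base url" switch — a minimal sketch, assuming a made-up bucket URL (since the base is a URL, the rewritten hrefs use forward slashes even on Windows):

from pystac import Catalog, CatalogType

catalog = Catalog.from_file("catalog.json")

# Rewrite every href in the tree against a single base URL.
# The bucket URL below is a placeholder, not a real location.
catalog.normalize_hrefs("https://storage.googleapis.com/my-bucket/stac")

# SELF_CONTAINED writes relative links, so the tree can be uploaded anywhere.
catalog.save(catalog_type=CatalogType.SELF_CONTAINED)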

Update PySTAC to use STAC 0.9.0

  • Update schemas to 0.9.0 versions and fix so that tests pass.

Changes to account for (from CHANGELOG)

Crossed out items did not require changes in PySTAC.

Added

  • data role, as a suggestion for a common role for data files to be used in case data providers don't come up with their own names and semantics

  • ItemCollection requires stac_version field, stac_extensions has also been added

  • A description field has been added to Item assets (also Asset definitions extension)

  • Field mission to Common Metadata fields

  • Extensions:

  • STAC API:

    • Added the Item and Collection API Version extension to support versioning in the API specification
    • Run npm run serve or npm run serve-ext to quickly render development versions of the OpenAPI spec in the browser
  • Basics added to Common Metadata definitions with new description field for
    Item properties

  • New fields to the link object to facilitate pagination support for POST requests (STAC API)

  • Clarification text on HTTP verbs in STAC API (STAC API)

Changed

  • Collection field `properties` and the merge ability moved to a new extension 'Commons'

  • Collection summaries merge array fields now

  • Moved angle definitions from the `eo` extension to the new `view` extension

    • eo:off_nadir -> view:off_nadir
    • eo:azimuth -> view:azimuth
    • eo:incidence_angle -> view:incidence_angle
    • eo:sun_azimuth -> view:sun_azimuth
    • eo:sun_elevation -> view:sun_elevation
  • Support for CommonMark 0.29 instead of CommonMark 0.28

  • Added attribute roles to Item assets (also Asset definitions extension), to be used similarly to Link rel

  • Updated API yaml to clarify bbox filter should be implemented without brackets. Example: bbox=160.6,-55.95,-170,-25.89

  • Several fields have been moved from extensions or item fields to the Common Metadata fields:

    • eo:platform / sar:platform => platform
    • eo:instrument / sar:instrument => instruments, also changed from string to array of strings
    • eo:constellation / sar:constellation => constellation
    • dtr:start_datetime => start_datetime
    • dtr:end_datetime => end_datetime
  • Extensions:

    • Data Cube extension: Changed allowed formats (removed PROJ string, added PROJJSON / WKT2) for reference systems
    • Checksum extension is now using self-identifiable hashes (Multihash)
    • Changed sar:type to sar:product_type and sar:polarization to sar:polarizations in the SAR extension
  • STAC API:

    • The endpoint /stac has been merged with /
    • The endpoint /stac/search is now called /search
    • Sort Extension - added non-JSON query/form parameter format
    • Fields extension has a simplified format for GET parameters
    • search extension renamed to context extension. JSON object renamed from search:metadata to context
    • Removed "next" from the search metadata and query parameter, added POST body and headers to the links for paging support
    • Query Extension - type restrictions on query predicates are more accurate, which may require additional implementation support
  • Item title definition moved from core Item fields to Common Metadata Basics
    fields. No change is required for STAC Items.

  • putFeature can return a PreconditionFailed to provide more explicit information when the resource has changed in the server

  • Sort extension now uses "+" and "-" prefixes for GET requests to denote sort order.

  • Clarified how /search links must be added to / and changed that links to both GET and POST must be provided now that the method can be specified in links

Removed

  • version field in STAC Collections. Use Version Extension instead
  • summaries field from Catalogs. Use Collections instead
  • Asset Types (pre-defined values for the keys of individual assets, not media types) in Items. Use the asset's roles instead
  • license field doesn't allow SPDX expressions any longer. Use various and links instead
  • Extensions:
    • eo:platform, eo:instrument, eo:constellation from EO extension, and sar:platform, sar:instrument, sar:constellation from the SAR extension
    • Removed from EO extension field eo:epsg in favor of proj:epsg
    • gsd and accuracy from eo:bands in the EO extension
    • sar:absolute_orbit and sar:center_wavelength fields from the SAR extension
    • data_type and unit from the sar:bands object in the SAR extension
    • Datetime Range (dtr) extension. Use the Common Metadata fields instead
  • STAC API:
    • next from the search metadata and query parameter
  • In API, removed any mention of using media type multipart/form-data and x-www-form-urlencoded

Fixed

  • The license field in Item and Collection spec explicitly mentions that the value proprietary without a link means that the data is private
  • Clarified how to fill stac_extensions
  • More clarifications; typos fixed
  • Fixed Item JSON Schema now allOf optional Common Metadata properties are evaluated
  • Clarified usage of optional Common Metadata fields for STAC Items
  • Clarified usage of paging options, especially in relation to what OGC API - Features offers
  • Allow Commonmark in asset description, as it's allowed everywhere else
  • Put asset description in the API
  • Fixed API spec regarding license expressions
  • Added missing schema in the API Version extension
  • Fixed links in the Landsat example in the collection-spec
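
For the field renames listed above, a deserialization shim along these lines could work — a sketch only, not PySTAC's actual implementation, and it covers just the renames in this changelog:

# Sketch of a 0.8.x -> 0.9.0 property rename shim; illustrative only.
RENAMES = {
    'eo:off_nadir': 'view:off_nadir',
    'eo:azimuth': 'view:azimuth',
    'eo:incidence_angle': 'view:incidence_angle',
    'eo:sun_azimuth': 'view:sun_azimuth',
    'eo:sun_elevation': 'view:sun_elevation',
    'eo:platform': 'platform',
    'sar:platform': 'platform',
    'eo:constellation': 'constellation',
    'sar:constellation': 'constellation',
    'dtr:start_datetime': 'start_datetime',
    'dtr:end_datetime': 'end_datetime',
}

def upgrade_item_properties(properties):
    """Return a copy of an Item's properties using 0.9.0 field names."""
    upgraded = {}
    for key, value in properties.items():
        if key in ('eo:instrument', 'sar:instrument'):
            # instruments also changed from a string to a list of strings.
            upgraded['instruments'] = [value]
        else:
            upgraded[RENAMES.get(key, key)] = value
    return upgraded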

Improve Collection <-> Item property inheritance

Currently, reading from a Collection with common Item properties will work: e.g. if an Item lists `eo` in its `stac_extensions` but all `eo:*` properties are on the Collection, PySTAC will read it fine and produce EOItems.

However, on the write side, there's not a good mechanism to merge common Item properties into the collection.

This issue is for figuring out the best approach and API for collapsing Item properties for writing.
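
One possible shape for that collapse — a sketch with a hypothetical helper name, not a settled API:

def collapse_common_properties(collection, items):
    """Hypothetical sketch: hoist properties that every Item shares
    (with equal values) into the Collection's Commons-style properties,
    and drop them from the Items."""
    if not items:
        return
    common = dict(items[0].properties)
    for item in items[1:]:
        common = {k: v for k, v in common.items()
                  if item.properties.get(k) == v}
    common.pop('datetime', None)  # per-Item fields must stay on the Item
    collection.properties = {**(getattr(collection, 'properties', None) or {}),
                             **common}
    for item in items:
        for key in common:
            del item.properties[key]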

Wrong absolute path while listing catalog

Hi,

I have noticed a problem in absolute HREF construction. When I try to read a catalog from file using the relative path, the error occurs:

[Errno 22] Invalid argument: 'c://C:\Users\*****'

I checked the source files and it seems that the problem is in the make_absolute_href() function in utils.py.

abs_path = os.path.abspath(os.path.join(start_dir, parsed_source.path))

abs_path already includes C:\ , so there is no need to build:
'{}://{}{}'.format(parsed_start.scheme, parsed_start.netloc, abs_path)

If I comment out the following:

if parsed_start.scheme != '':
    return '{}://{}{}'.format(parsed_start.scheme,
                              parsed_start.netloc, abs_path)

everything works fine for me!

Has anybody faced a similar problem?
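
For reference, a sketch of the guard the commented-out branch implies — assuming the only schemes worth special-casing are single-letter ones, which on Windows are drive letters rather than real URL schemes:

import os
from urllib.parse import urlparse

def make_absolute_href(source_href, start_href):
    """Simplified sketch of the fix; the real function handles more cases."""
    parsed_source = urlparse(source_href)
    parsed_start = urlparse(start_href)
    start_dir = os.path.dirname(parsed_start.path)
    abs_path = os.path.abspath(os.path.join(start_dir, parsed_source.path))
    # A single-letter "scheme" (e.g. 'c') is a Windows drive letter, not a
    # URL scheme, and abs_path already carries the full drive-qualified path.
    if parsed_start.scheme != '' and len(parsed_start.scheme) > 1:
        return '{}://{}{}'.format(parsed_start.scheme,
                                  parsed_start.netloc, abs_path)
    return abs_path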

Cloning catalog duplicates root link

When cloning a catalog, you end up with two root links:

In [5]: catalog = Catalog.from_file("/opt/data/scene-catalog/catalog.json")

In [6]: cloned = catalog.clone()

In [7]: cloned.links
Out[7]: 
[<Link rel=root target=<Catalog id=ABSOLUTE_PUBLISHED>>,
 <Link rel=self target=/opt/data/scene-catalog/catalog.json>,
 <Link rel=child target=/opt/data/scene-catalog/c8d68bc1-a862-4d7f-96d3-5af2436695a4/collection.json>,
 <Link rel=root target=<Catalog id=ABSOLUTE_PUBLISHED>>]

This is bad because it breaks assumptions about what `get_root_link` should return -- or at least it appears to, if I'm reading the `next()` call correctly as expecting only one result. I think this is also responsible for some surprising behaviors with normalization when building up a catalog, but I can't prove that part.
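
Until the clone logic is fixed, one workaround is to dedupe root links after cloning — a sketch that keeps only the first:

# Workaround sketch: keep the first root link, drop any duplicates.
seen_root = False
deduped = []
for link in cloned.links:
    if link.rel == 'root':
        if seen_root:
            continue
        seen_root = True
    deduped.append(link)
cloned.links = deduped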

Datetime format doesn't work well with libraries expecting ISO8601 formatted strings

While the STAC spec specifies that datetimes should be formatted according to RFC 3339, section 5.6, the space separator between the date and time causes some issues with libraries expecting an ISO8601 formatted string. Changing the separator to 'T' would make the datetime compliant with both RFC 3339, section 5.6 as well as ISO8601.

For example, currently a datetime is formatted as `2019-01-01 00:00:00Z`, and the updated datetime string would be formatted as `2019-01-01T00:00:00Z`.
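
For illustration, Python's own datetime shows both forms: str() uses the space separator, while isoformat() already produces the 'T'-separated form:

from datetime import datetime, timezone

dt = datetime(2019, 1, 1, tzinfo=timezone.utc)
print(str(dt))         # 2019-01-01 00:00:00+00:00  (space separator)
print(dt.isoformat())  # 2019-01-01T00:00:00+00:00  ('T' separator)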

CI badge shows status of any CI build instead of just current develop CI status

The CI workflow doesn't differentiate between builds of a PR or the develop branch. This causes the CI badge in the README to show failures when the job has failed on PR builds, which gives the appearance that the develop branch has broken the build.

The documentation for workflows says that a branch parameter to the badge might work, e.g.

https://github.com/azavea/pystac/workflows/CI/badge.svg?branch=develop

though currently that results in a no-status badge.

Store items in non-canonical path

I want to create a self-contained catalog that uses relative links, but override the default layout in which each item is written to its own subdirectory of its parent catalog. E.g., I want

/tmp/test-catalog/test_dataset_1/BD44_500_031096.json

rather than

/tmp/test-catalog/test_dataset_1/BD44_500_031096/BD44_500_031096.json

Is this possible? I tried setting the href paths manually with the set_self_href method on all STAC objects and not calling normalize_hrefs, but that didn't work. See example here: https://gist.github.com/palmerj/a78dc0b99da0720266ac19c736785802

Develop method for identifying STAC object type and version number

Develop a method that takes in a dict, and returns a tuple (object_type, version, [extensions]).

This will be used to identify the STAC objects that are contained in dictionaries. This will be useful for two things:

  • Validation: stac-validator and other validation tools - including one we are building to validate all examples in the stac-spec as part of radiantearth/stac-spec#623 - can use this method to determine what schema(s) to apply to the validation process.

  • Deserialization in PySTAC for reading older versions of STAC - a proposed solution to #36 could include a JSON transformation from older versions of STAC objects to the version that to_dict on PySTAC objects requires. In order to write this in a sane way, it would be good to know a priori what STAC object type, version, and extensions we are working with.

This method is probably not going to be very pretty - it's hard to imagine a version that isn't a set of conditionals that looks like headphones pulled out of a messy drawer. The idea is, if we put this method in PySTAC, we can centralize that messy logic in one place so that no duplicates have to exist. This follows what I call the One Monster Rule: if you're going to create a monster, make sure it's only created once.

Another approach we tried was to use jsonschema with a large combined oneOf reference that included all schemas from the STAC repo - jsonschema will try to validate against each of the schemas to determine which one works. However, if multiple schemas pass validation, this throws an error, which makes a lot of sense.

Any suggestions on how to better solve this are welcome!
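
For concreteness, here is a rough sketch of the shape such a method could take — the heuristics below are illustrative, not a proposed final logic:

def identify_stac_object(d):
    """Sketch: guess (object_type, version, extensions) from a dict.
    The checks are illustrative heuristics only."""
    version = d.get('stac_version')
    extensions = d.get('stac_extensions', [])
    if d.get('type') == 'Feature':
        object_type = 'ITEM'
    elif 'extent' in d:
        object_type = 'COLLECTION'
    elif 'description' in d and 'links' in d:
        object_type = 'CATALOG'
    else:
        object_type = 'UNKNOWN'
    return (object_type, version, extensions)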

/cc Team Validation @ the Arlington STAC Sprint @jbants @anayeaye

Catalog from_file not working

Source code:

from pystac import Catalog
cat = Catalog.from_file('https://sentinel-stac.s3.amazonaws.com/catalog.json')

Error:

Traceback (most recent call last):
  File "source/main.py", line 5, in <module>
    cat = Catalog.from_file('https://sentinel-stac.s3.amazonaws.com/catalog.json')
  File "/home/nawaz/Documents/disaster_managment/venv/lib/python3.6/site-packages/pystac/stac_object.py", line 408, in from_file
    d = STAC_IO.read_json(href)
  File "/home/nawaz/Documents/disaster_managment/venv/lib/python3.6/site-packages/pystac/stac_io.py", line 99, in read_json
    return json.loads(STAC_IO.read_text(uri))
  File "/home/nawaz/Documents/disaster_managment/venv/lib/python3.6/site-packages/pystac/stac_io.py", line 64, in read_text
    return cls.read_text_method(uri)
  File "/home/nawaz/Documents/disaster_managment/venv/lib/python3.6/site-packages/pystac/stac_io.py", line 14, in default_read_text_method
    with open(uri) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'https://sentinel-stac.s3.amazonaws.com/catalog.json'
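
The traceback shows that default_read_text_method just calls open(), which only handles local paths in this version of PySTAC; overriding STAC_IO.read_text_method (the same pattern the tutorials use for S3) works for HTTP. A minimal sketch using urllib:

from urllib.parse import urlparse
from urllib.request import urlopen

from pystac import STAC_IO, Catalog

def http_read_method(uri):
    # Fetch http(s) URIs over the network; defer everything else
    # to the default local-file reader.
    if urlparse(uri).scheme in ('http', 'https'):
        with urlopen(uri) as response:
            return response.read().decode('utf-8')
    return STAC_IO.default_read_text_method(uri)

STAC_IO.read_text_method = http_read_method
cat = Catalog.from_file('https://sentinel-stac.s3.amazonaws.com/catalog.json')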

LabelItems require too much state tracking from users

Problem Description

As currently implemented, it's challenging to find a place to put a LabelItem's asset. The workflow this prevents is creating a bunch of LabelItems in memory, creating a Collection in memory, then adding that Collection and those LabelItems later to a Catalog, which is only at the very end written to disk. Having nowhere to store the data while building the object up in memory makes fulfilling the following requirement from the STAC spec difficult:

The Label Extension requires at least one asset that uses the key "labels". The asset will contain a link to the actual label data.

I don't know where it's going to go right now! 🤔 And I don't think it's great ergonomically to require a bunch of filesystem manipulation/writing to disk to build up a STAC. Any failures midway through the process then leave halfway-complete catalogs lying around, which adds to my manual cleanup burden.

Proposed solution

It would be nice to shift IO to the end of the world to the extent possible. One way to do this would be to initialize LabelItems with a FeatureCollection that gets written out adjacent to the label item (and assets get updated) whenever the LabelItem is saved.
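
A rough sketch of that idea — the class below is hypothetical; nothing like it exists in PySTAC today:

import json
import os

from pystac import Asset

class InMemoryLabelSource:
    """Hypothetical sketch: hold label GeoJSON in memory and only write
    it next to the LabelItem when the catalog is finally saved."""

    def __init__(self, label_item, feature_collection):
        self.label_item = label_item
        self.feature_collection = feature_collection

    def flush(self):
        # Derive the asset location from wherever the item ended up.
        item_href = self.label_item.get_self_href()
        label_href = os.path.join(os.path.dirname(item_href), 'labels.geojson')
        with open(label_href, 'w') as f:
            json.dump(self.feature_collection, f)
        self.label_item.add_asset(
            'labels', Asset(href=label_href, media_type='application/geo+json'))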

Add mechanism to allow for some backwards compatibility

As STAC is still in flux, property names, etc might shift. Until things stabilize, it would be nice to have a consistent way to read STAC that is a little bit off, i.e. a version or so back.

One option is to have to_dict methods simply push the dict through a transformation that accounts for any old versions. We're already doing some of that in the codebase, e.g. https://github.com/azavea/pystac/blob/v0.3.0/pystac/collection.py#L206 . It would be good to formalize this and have a place to collect these types of updating transformations.
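
One possible shape — a sketch of a per-version transformation registry (the names and the example transform are hypothetical):

# Hypothetical sketch: collect per-version upgrade functions in one place
# and push every incoming dict through them before deserialization.
def upgrade_from_0_8(d):
    # Illustrative only: 0.9.0 removed the Collection 'version' field
    # in favor of the Version extension (see the 0.9.0 changelog).
    d.pop('version', None)
    return d

MIGRATIONS = {
    '0.8.1': upgrade_from_0_8,
}

def upgrade(d):
    fn = MIGRATIONS.get(d.get('stac_version'))
    if fn is not None:
        d = fn(d)
        d['stac_version'] = '0.9.0'  # illustrative target version
    return d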

Proprietary collections can be written without license link

Problem Description

See catalog in s3://demo-mlhub-earth/panama-water-features-jan-apr-2019/ -- the collections have a proprietary license, but there's no license link in links.

Expected Behavior or Output

pystac should get mad at me when I try to write an invalid catalog
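
A check along these lines at write time would do it — a sketch:

def check_license_link(collection):
    """Sketch: refuse to write a proprietary Collection with no license link."""
    if collection.license == 'proprietary':
        if not any(link.rel == 'license' for link in collection.links):
            raise ValueError(
                'Collection {} has a proprietary license but no '
                '"license" link'.format(collection.id))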

Improve documentation on STAC_IO usage

Pydocs for STAC_IO should make it clearer how to override read_text_method etc. in order to work with user-defined IO methods. Also, add sphinx docs on the subject.

Improve API around licenses and providers

As of 0.8.0, Items can have licenses and providers in their properties. Currently we only have Collections associated with Provider objects. We should develop API to handle these optional fields.

Also, there should be some API to handle license links on Collections and Items, so that if these objects do point to a license, it can be handled via the API.

Allow setting relative paths with links and assets.

Currently, we can read STACs that have relative links, but can only write STACs with absolute links.

Implement a method on Catalog to set URIs to relative. Default behavior should follow the best-practices guidelines (i.e. self-contained catalogs with relative paths should not contain 'self' links).
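
The core of the write side is computing each href relative to its parent's location — a simplified sketch of such a helper (it skips the scheme and netloc handling a real version would need):

import os

def make_relative_href(source_href, start_href):
    """Simplified sketch: express source_href relative to the directory
    containing start_href; real code must also handle URL schemes."""
    start_dir = os.path.dirname(start_href)
    relpath = os.path.relpath(source_href, start=start_dir)
    if not relpath.startswith('.'):
        relpath = './' + relpath
    return relpath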

Saving part of a catalog

I've created an application that loads and saves a static STAC catalog to S3 using something like:

import logging
from urllib.parse import urlparse

from botocore.exceptions import ClientError
import pystac as stac

logger = logging.getLogger(__name__)

# _get_s3_resource() and StacCatalogError are application-level helpers.

def my_read_method(uri):
    parsed = urlparse(uri)
    if parsed.scheme == "s3":
        bucket = parsed.netloc
        key = parsed.path[1:]
        s3 = _get_s3_resource()
        try:
            logger.debug("Reading %s", key)
            obj = s3.Object(bucket, key)
            txt = obj.get()["Body"].read().decode("utf-8")
        except ClientError as e:
            raise StacCatalogError(str(e) + " " + key)
        return txt
    else:
        return stac.STAC_IO.default_read_text_method(uri)

def my_write_method(uri, txt):
    parsed = urlparse(uri)
    if parsed.scheme == "s3":
        bucket = parsed.netloc
        key = parsed.path[1:]
        s3 = _get_s3_resource()
        try:
            logger.debug("Writing %s", key)
            s3.Object(bucket, key).put(Body=txt)
        except ClientError as e:
            raise StacCatalogError(str(e) + " " + key)
    else:
        stac.STAC_IO.default_write_text_method(uri, txt)

stac.STAC_IO.read_text_method = my_read_method
stac.STAC_IO.write_text_method = my_write_method

I've noticed that once I start adding multiple collections to the catalog, PySTAC will load all catalog collections and their child items even if I'm not iterating through everything. This is slow when saving objects to AWS S3. To make matters worse, when I save a new collection (calling sub_catalogue.save on the parent non-root catalog), it will re-save all collection siblings and their child items. Some of my collections will end up having 1000s of items. I've read the docs and can't see how to use the API to save part of the catalog or avoid saving catalog items that have not changed. Is this possible?

Thanks heaps in advance.

Discussion: what should happen when self-contained STACs are copied?

I have a STAC that includes a bunch of chips of tifs. When I clone that STAC, I keep all of the items, and the assets point to the old tifs. I don't think this is necessarily wrong. I'm curious whether it's a deliberate choice to leave the references to the old tifs and not copy the tifs into the new stac or whether that's something that happened incidentally. I can see arguments for both ways --

In favor of not copying the data:

  • since the tifs are part of a different catalog, there's no relative path from the new catalog to the old data, so path construction requires some assumptions on PySTAC's part
  • presumably if I'm building stacs from other stacs i have access to the data in both places, so why copy?

In favor of being able within PySTAC to copy the data (obviously I can do whatever I want outside of PySTAC):

  • self-contained catalogs are nice, and there's currently no way to tell PySTAC to make a new self-contained catalog from an existing one as far as I can tell (it won't infer the copy behavior)
  • in multi-step pipelines for STAC production, I might want to delete everything but the output of the last step (i.e. only keep the "complete" catalog, where "complete" means "has had everything I want to do to it done), which means at the end my references to assets from previous stages will be invalid

Error when saving and then loading Catalog

Using PySTAC 0.3.3, the following code executed as a script works as expected (it saves a STAC and then opens it and prints the items). But if you uncomment the code in Block A, Line B will fail with the following stack trace. This is especially strange because all that Block A does is read from the catalog before saving it. This came up in the context of trying to modify some links to GeoTIFF files in an existing STAC and then saving the modified copy. I found that the following was the minimal example that replicates the error.

from urllib.parse import urlparse
from os.path import join

import boto3
from pystac import STAC_IO
from pystac import Catalog, LabelItem, CatalogType

# Copied from PyStac tutorial.
def setup_stac_s3():
    def my_read_method(uri):
        parsed = urlparse(uri)
        if parsed.scheme == 's3':
            bucket = parsed.netloc
            key = parsed.path[1:]
            s3 = boto3.resource('s3')
            obj = s3.Object(bucket, key)
            return obj.get()['Body'].read().decode('utf-8')
        else:
            return STAC_IO.default_read_text_method(uri)

    def my_write_method(uri, txt):
        parsed = urlparse(uri)
        if parsed.scheme == 's3':
            bucket = parsed.netloc
            key = parsed.path[1:]
            s3 = boto3.resource("s3")
            s3.Object(bucket, key).put(Body=txt)
        else:
            STAC_IO.default_write_text_method(uri, txt)

    STAC_IO.read_text_method = my_read_method
    STAC_IO.write_text_method = my_write_method

# Open a catalog and then save a copy locally.
setup_stac_s3()
stac_uri = ('s3://rasterfoundry-production-data-us-east-1/stac-exports/'
            'fd478c2b-3f71-41e4-a87b-e97a8a0d0afa/catalog.json')
cat = Catalog.from_file(stac_uri)
'''
# Block A. If this is uncommented, then Line B will fail.
for item in cat.get_all_items():
    print(item)
'''
new_stac_uri = '/opt/data/foo/test-stac-catalog'
cat.normalize_hrefs(new_stac_uri)
cat.save(catalog_type=CatalogType.SELF_CONTAINED)

# Open the local copy and iterate over it.
new_stac_uri = '/opt/data/foo/test-stac-catalog'
cat = Catalog.from_file(join(new_stac_uri, 'catalog.json'))
# Line B
for item in cat.get_all_items():
    print(item)

Stack trace:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/src/foo/pystac_test.py", line 51, in <module>
    for item in cat.get_all_items():
  File "/opt/conda/lib/python3.6/site-packages/pystac/catalog.py", line 281, in get_all_items
    yield from child.get_all_items()
  File "/opt/conda/lib/python3.6/site-packages/pystac/catalog.py", line 281, in get_all_items
    yield from child.get_all_items()
  File "/opt/conda/lib/python3.6/site-packages/pystac/catalog.py", line 279, in get_all_items
    yield from self.get_items()
  File "/opt/conda/lib/python3.6/site-packages/pystac/stac_object.py", line 260, in get_stac_objects
    link.resolve_stac_object(root=self.get_root())
  File "/opt/conda/lib/python3.6/site-packages/pystac/link.py", line 160, in resolve_stac_object
    obj = STAC_IO.read_stac_object(target_href, root=root)
  File "/opt/conda/lib/python3.6/site-packages/pystac/stac_io.py", line 119, in read_stac_object
    return cls.stac_object_from_dict(d, href=uri, root=root)
  File "/opt/conda/lib/python3.6/site-packages/pystac/serialization/__init__.py", line 31, in stac_object_from_dict
    merge_common_properties(d, json_href=href, collection_cache=collection_cache)
  File "/opt/conda/lib/python3.6/site-packages/pystac/serialization/common_properties.py", line 47, in merge_common_properties
    collection = STAC_IO.read_json(collection_href)
  File "/opt/conda/lib/python3.6/site-packages/pystac/stac_io.py", line 96, in read_json
    return json.loads(STAC_IO.read_text(uri))
  File "/opt/conda/lib/python3.6/site-packages/pystac/stac_io.py", line 61, in read_text
    return cls.read_text_method(uri)
  File "/opt/src/foo/pystac_test.py", line 19, in my_read_method
    return STAC_IO.default_read_text_method(uri)
  File "/opt/conda/lib/python3.6/site-packages/pystac/stac_io.py", line 12, in default_read_text_method
    with open(uri) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/data/foo/test-stac-catalog/12cb17a8-ae08-469c-a2be-4d0619240014/400c22e3-5b54-438b-b600-5e9bd6d0a498/3c67b59c-2e6f-47fb-ba3c-0dd106941096/collection.json'

PySTAC reading collection redundantly

Currently PySTAC reads the collection of an item over and over, even if it's already read the collection - there is no cache support that gets triggered. This can slow reading STACs from cloud providers down considerably.

Perhaps implement a caching strategy that caches on HREF instead of just ID.
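
A minimal sketch of HREF-keyed caching layered over the read path (note this memoizes every document read, not just collections, so it only suits read-only sessions; a real fix would live in PySTAC's own resolution cache):

from pystac import STAC_IO

_cache = {}
_inner_read = STAC_IO.read_text_method

def caching_read_method(uri):
    # Cache on the HREF itself, so repeated reads of the same
    # collection.json hit memory instead of the network.
    if uri not in _cache:
        _cache[uri] = _inner_read(uri)
    return _cache[uri]

STAC_IO.read_text_method = caching_read_method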

Efficiently fetching a specific child object

I'm having trouble reducing the number of S3 calls in my application with a catalog containing many 1000's of child objects. If I call catalog.get_child(id, recursive=False), it will iterate through the child objects in sequence, resolving each of them until it finds the object it needs. Given that the child links don't contain the ID of the referenced object, this is understandable. However, the child URLs from a STAC catalog published using best practices do contain the ID. Would it be possible to somehow add an optimisation to pystac to use the URL to short-circuit the lookup process? I guess it could fall back to traversing the child objects if extracting the ID from the URL doesn't work or the returned object's ID doesn't match. I'm also wondering if this is an issue with the STAC specification.
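
A sketch of that short circuit — guess the child's href from the best-practices layout, try it directly, and fall back to the existing linear scan (get_child_fast is a hypothetical helper, not a PySTAC method):

import os

from pystac import STAC_IO

def get_child_fast(catalog, child_id):
    """Hypothetical sketch: try the best-practices href
    (<catalog dir>/<id>/collection.json) before scanning every child."""
    root_dir = os.path.dirname(catalog.get_self_href())
    guess = os.path.join(root_dir, child_id, 'collection.json')
    try:
        child = STAC_IO.read_stac_object(guess)
        if child.id == child_id:
            return child
    except Exception:
        pass
    # Fall back to the existing linear resolution.
    return catalog.get_child(child_id, recursive=False)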
