
aiocogeo's Issues

Make header size configurable with environment variable

https://github.com/geospatial-jeff/async-cog-reader/blob/e3b613717291be7d247359480bd8e2f2cd2fe60a/async_cog_reader/constants.py#L3

GDAL docs:

Partial downloads (requires the HTTP server to support random reading) are done with a 16 KB granularity by default. Starting with GDAL 2.3, the chunk size can be configured with the CPL_VSIL_CURL_CHUNK_SIZE configuration option, with a value in bytes. If the driver detects sequential reading it will progressively increase the chunk size up to 2 MB to improve download performance. Starting with GDAL 2.3, the GDAL_INGESTED_BYTES_AT_OPEN configuration option can be set to impose the number of bytes read in one GET call at file opening (can help performance to read Cloud optimized geotiff with a large header).

Ref: https://gdal.org/user/virtual_file_systems.html#virtual-file-systems
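A minimal sketch of what this could look like in constants.py, assuming a hypothetical AIOCOGEO_HEADER_CHUNK_SIZE variable name (the actual name is up for discussion):

import os

# Default to GDAL's 16 KB granularity; the env var name here is an assumption.
HEADER_CHUNK_SIZE = int(os.environ.get("AIOCOGEO_HEADER_CHUNK_SIZE", 16 * 1024))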

Reduce memory usage

Aiocogeo uses ~4x more memory than rio-tiler when reading a single tile:

Line #    Mem usage    Increment   Line Contents
================================================
    44    115.3 MiB    115.3 MiB   @profile
    45                             def main():
    46    125.7 MiB     10.4 MiB       asyncio.run(_aiocogeo())
    47    128.5 MiB      2.8 MiB       rio_tile()

The culprit is the call to skimage.transform.resize when resampling the image:

Line #    Mem usage    Increment   Line Contents
================================================
   292    118.9 MiB    118.9 MiB       @profile
   293                                 def _postprocess(
   294                                     self, arr: NpArrayType, img_tiles: TileMetadata, out_shape: Tuple[int, int]
   295                                 ) -> NpArrayType:
   296                                     """Wrapper around ``_clip_array`` and ``_resample`` to postprocess the partial read"""
   297    118.9 MiB      0.0 MiB           return self._resample(
   298    126.5 MiB      7.6 MiB               self._clip_array(arr, img_tiles), img_tiles=img_tiles, out_shape=out_shape
   299                                     )

Read COG tile

Once #2 is ready to go, we need a method that uses IFD/tag metadata to read a given tile. COGDumper uses cogdumper.cog_tiles.COGTiff.read_tile, which looks to just fetch a single XYZ tile from the appropriate overview, based on the tile's coordinates relative to the (top left?) of the image.

As @vincentsarago pointed out, we could use pyproj to:

  • get geospatial info from the COG
  • fetch only the internal tile (and overview tiles) for a specific .read request.

I think it would be nice to implement something similar to rasterio.windows, where we can use pyproj to map a particular bounding box to the corresponding XYZ tiles in the COG, but I'm definitely open to other ideas. This raises some questions.

  • How will this work with rio-tiler v2, if at all? If the COGTiff class implements an interface similar to rasterio.io.DatasetReader, it could be passed as the src_dst to rio_tiler.reader._read, but I'm not sure whether that is feasible.
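For illustration, a rough sketch of the bounding-box-to-internal-tile mapping, using mercantile for the tile bounds (the geotransform layout and helper names are assumptions, not existing code):

import mercantile

def internal_tiles_for_xyz(x, y, z, transform, tile_width, tile_height):
    """Hypothetical helper: which internal COG tiles cover mercator tile (x, y, z).

    ``transform`` is a GDAL-style geotransform for the chosen overview:
    (origin_x, pixel_w, 0, origin_y, 0, -pixel_h), in web mercator.
    """
    bounds = mercantile.xy_bounds(x, y, z)  # tile bounds in web mercator meters
    # Pixel window of the requested bounds within the overview.
    col_min = int((bounds.left - transform[0]) / transform[1])
    col_max = int((bounds.right - transform[0]) / transform[1])
    row_min = int((transform[3] - bounds.top) / -transform[5])
    row_max = int((transform[3] - bounds.bottom) / -transform[5])
    # Internal tile indices spanned by the pixel window.
    tx0, tx1 = col_min // tile_width, col_max // tile_width
    ty0, ty1 = row_min // tile_height, row_max // tile_height
    return [(tx, ty) for ty in range(ty0, ty1 + 1) for tx in range(tx0, tx1 + 1)]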

Add STAC filesystem

The STAC filesystem would search the item's assets for COGs and return potentially several http or s3 readers.
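A rough sketch of the idea, assuming aiohttp for fetching the item and matching assets on the COG media type (the helper name and media type set are assumptions):

import aiohttp

COG_MEDIA_TYPES = {
    "image/tiff; application=geotiff; profile=cloud-optimized",
    "image/vnd.stac.geotiff; cloud-optimized=true",  # older variant
}

async def cog_hrefs_from_item(item_url: str) -> list:
    """Fetch a STAC item and collect the hrefs of its COG assets."""
    async with aiohttp.ClientSession() as session:
        async with session.get(item_url) as resp:
            item = await resp.json()
    return [
        asset["href"]
        for asset in item.get("assets", {}).values()
        if asset.get("type") in COG_MEDIA_TYPES
    ]

Each returned href could then be opened with the existing http or s3 filesystem.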

Add cog validator

aiocogeo supports a much smaller subset of COG types than GDAL, so it would be good to have a way to validate whether an image can be read.
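Something as simple as collecting reasons for failure might be enough; a sketch, where the attribute names on the IFD and the supported-compression set are assumptions:

SUPPORTED_COMPRESSIONS = {"deflate", "jpeg", "webp"}  # assumption: current set

def validate(cog) -> list:
    """Return a list of reasons the image can't be read (empty list == valid)."""
    errors = []
    ifd = cog.ifds[0]  # full resolution image
    if not getattr(ifd, "is_tiled", False):  # hypothetical attribute
        errors.append("image is not internally tiled")
    if ifd.compression not in SUPPORTED_COMPRESSIONS:  # hypothetical attribute
        errors.append(f"unsupported compression: {ifd.compression}")
    return errors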

fix tag value type divergence

Tag values are currently typed as Union[Any, Tuple[Any]]. This causes lots of downstream issues because the type is unclear. It would make the code much cleaner if we removed the Union and only used a single type for tag values. This would also let us add mypy to pre-commit.
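One option is to normalize everything to a tuple at parse time so downstream code only ever sees one type; a sketch:

from typing import Any, Tuple

def normalize_tag_value(value: Any) -> Tuple[Any, ...]:
    """Always store tag values as tuples so the type is Tuple[Any, ...] everywhere."""
    return value if isinstance(value, tuple) else (value,)

Single-count tags would then be accessed as value[0], which is slightly uglier, but mypy can finally reason about it.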

rio-tiler integration

Work began in #68 to support tiling with aiocogeo. The next step is to extend rio-tiler's BaseReader instead of defining our own class so aiocogeo can be (kind of) compatible with applications that already use rio-tiler.

Fix reading of byte formatted tags with text

Example:
Tag(code=305, name='Software', tag_type=TagType(format='c', size=1), count=21, length=21, value=(b'T', b'r', b'i', b'm', b'b', b'l', b'e', b' ', b'G', b'e', b'r', b'm', b'a', b'n', b'y', b' ', b'G', b'm', b'b', b'H', b'\x00'))
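Presumably the fix is to join the bytes and strip the trailing NUL before decoding; a quick sketch against the example above:

value = (b'T', b'r', b'i', b'm', b'b', b'l', b'e', b' ', b'G', b'e', b'r',
         b'm', b'a', b'n', b'y', b' ', b'G', b'm', b'b', b'H', b'\x00')

# ASCII TIFF tag values are NUL-terminated; join and strip before decoding.
text = b"".join(value).rstrip(b"\x00").decode("ascii")
assert text == "Trimble Germany GmbH"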

Define explicit IFD attributes for supported tags

When the interface is finished we should define explicit IFD attributes for the supported TIFF tags, for a few reasons:

  • Having a huge LUT containing a bunch of tags suggests that the library supports all of those tags, when we really only want to support the small subset of tags defined in the TIFF spec that is necessary for partial reads.
  • As currently written, it's not explicitly defined which TIFF tags are attached to an IFD. This makes the code much harder to understand and maintain. A user/developer should be able to look at the IFD class definition and know exactly which TIFF tags it supports and how they are accessed.
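A rough sketch of the shape this could take, assuming the existing Tag class (the attribute set below is illustrative, not a final list):

from dataclasses import dataclass


@dataclass
class IFD:
    """Only the TIFF tags needed for partial reads, as explicit attributes."""
    ImageWidth: "Tag"
    ImageHeight: "Tag"
    TileWidth: "Tag"
    TileHeight: "Tag"
    TileOffsets: "Tag"
    TileByteCounts: "Tag"
    Compression: "Tag"
    BitsPerSample: "Tag"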

remove `run_in_background`

Decompression and postprocessing will never really block the main thread, so it's causing more harm than good.

Improve IFD/tag composition

Tags are really just metadata about the IFD, and it's annoying to access them like:

ifd.tag['TagName'].value

Would be easier to do:

ifd.TagName.value
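A minimal way to get there without restructuring the parser is attribute delegation; a sketch:

class IFD:
    def __init__(self, tags):
        self.tags = tags  # Dict[str, Tag], as currently parsed

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails; fall back to the
        # tag dict so ``ifd.TagName.value`` works.
        try:
            return self.tags[name]
        except KeyError:
            raise AttributeError(name) from None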

Checkout COGDumper

👋 @geospatial-jeff
The subject looks interesting 😄

Not sure what your idea is, but if we want to go full async maybe we can use some of the code from https://github.com/mapbox/COGDumper to go GDAL-free...

COGDumper is not smart and doesn't do any spatial stuff, but if we add pyproj we might be able to:

  • get geospatial info from the COG
  • fetch only the internal tile (and overview tiles) for a specific .read request.

Refactor partial read for internal masks

Doing a partial read when an internal mask is present is different enough from the no-mask case to warrant refactoring the partial read into two methods. This should also make it easier to support internal masks when merging range requests (#29).

Cache only COG header

https://cogeotiff.slack.com/archives/C01DE57GLHE/p1603130953009500

Summary

Consider a case where N unique tile requests are made to a single COG. Even with the ENABLE_CACHE environment variable enabled, all requests would be cache misses, so at least 2 * N range requests would need to be made to the COG. But if the COG header were cached separately, only 1 + N range requests would be needed.

Details

I plan to incorporate aiocogeo within a traditional tile server middleware that handles regular z/x/y.png requests. The middleware currently reads PNG tiles stored as a tile pyramid in bucket storage. This dated architecture is space inefficient but very performant. I'm hoping to achieve the space savings of COGs (via YCbCr JPEG compression + GDAL mask bands) without a meaningful increase in latency. One way to eliminate that latency is by caching the header in redis or another fast cache available to many servers. For example:

  • Client requests Mercator tile (z, x, y) for cog.tif in cloud bucket storage. Networking layer routes it to COG server 1 (one of many COG servers).
    • COG server 1 checks redis (or another cache) for cog.tif header, but it's not found. CACHE MISS.
    • COG server 1 makes range request to cog.tif header.
    • COG server 1 caches cog.tif header in redis.
    • COG server 1 makes range request for (z, x, y) tile data.
    • COG server 1 performs postprocessing and returns (z, x, y) tile data to client.
  • Client requests Mercator tile (z, x + 1, y) for cog.tif in cloud bucket storage. Networking layer routes it to COG server 2.
    • COG server 2 checks redis for cog.tif header and it's found. CACHE HIT.
    • COG server 2 makes range request for (z, x + 1, y) tile data.
    • COG server 2 performs postprocessing and returns (z, x + 1, y) tile data to client.

The two header operations in the first request (the header range request and caching the header) are not necessary for the second request.
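A minimal sketch of the shared header cache, assuming redis-py's asyncio client and a hypothetical range_request helper on the filesystem:

import redis.asyncio as redis

HEADER_SIZE = 16 * 1024  # assumption: bytes ingested at open

async def get_header(cache: "redis.Redis", filesystem, url: str) -> bytes:
    """Serve the COG header from a shared cache so only the first request
    for a given file pays the header round trip."""
    key = f"cog-header:{url}"
    header = await cache.get(key)
    if header is None:  # CACHE MISS: fetch the header, then populate the cache
        header = await filesystem.range_request(url, 0, HEADER_SIZE - 1)  # hypothetical
        await cache.set(key, header)
    return header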

Run cpu bound code in background

There is a lot of CPU-bound code that blocks the main thread at higher concurrencies: decompression, resampling, and numpy operations. We should look at using something like starlette.concurrency.run_in_threadpool or aiofiles.os.wrap, which both use asyncio.loop.run_in_executor to run code in the background without blocking the main thread.

It would be worth benchmarking the difference between a ProcessPoolExecutor and ThreadPoolExecutor (process would definitely be faster but by how much?).
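A sketch of the wrapper, parameterized on the executor so both can be benchmarked:

import asyncio
import functools

async def run_in_background(func, *args, executor=None, **kwargs):
    """Run CPU-bound work (decompression, resampling, ...) off the event
    loop. ``executor=None`` uses the loop's default ThreadPoolExecutor;
    pass a ProcessPoolExecutor to compare."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, functools.partial(func, *args, **kwargs))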

Add config option to enable/disable block cache

Add a config option to enable/disable the block cache (GDAL has this as well). The cache is also causing tests to fail, so it would be nice to disable caching during tests: the failing test case works by itself but fails during the build because the same tiles are requested and cached in a different test case, which throws off the number of requests.

Aiocache supports this through cache_read and cache_write kwargs injected to the cache key generator (https://aiocache.readthedocs.io/en/latest/decorators.html#cached)
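A sketch of how the toggle could thread through, per the aiocache docs linked above (the env var name and read_block function are assumptions):

import os

from aiocache import cached

ENABLE_BLOCK_CACHE = os.environ.get("ENABLE_BLOCK_CACHE", "TRUE") == "TRUE"

@cached()
async def read_block(url: str, start: int, end: int) -> bytes:
    ...  # range request goes here

async def get_block(url: str, start: int, end: int) -> bytes:
    # cache_read/cache_write are consumed by the decorator, per the docs above,
    # so the cache can be bypassed per call (e.g. during tests).
    return await read_block(
        url, start, end, cache_read=ENABLE_BLOCK_CACHE, cache_write=ENABLE_BLOCK_CACHE
    )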

Large tag value offsets

The first 16KB of the header should contain all IFDs, but large tag values which don't fit in the 4-byte value/offset field of each 12-byte tag entry may be stored anywhere in the file (even after image data), in which case we'll need to do another range request to fetch the tag value.
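A sketch of the resolution logic, assuming little-endian classic TIFF and a hypothetical range_request helper:

import struct

async def resolve_tag_value(filesystem, url, header, value_field, size):
    """Resolve a tag value that may live outside the ingested header.

    ``size`` is tag_type.size * count; ``value_field`` is the raw 4-byte
    value/offset field from the tag entry.
    """
    if size <= 4:  # classic TIFF: the value fits inline in the tag entry
        return value_field[:size]
    offset = struct.unpack("<I", value_field)[0]
    if offset + size <= len(header):
        return header[offset : offset + size]
    # Value lives past the ingested header: one extra range request.
    return await filesystem.range_request(url, offset, offset + size - 1)  # hypothetical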

Merge consecutive Requests

Linked to #21: GDAL merges consecutive requests (horizontal tiles, when band interleaved I think) up to 2 MB (configurable).

Partial downloads (requires the HTTP server to support random reading) are done with a 16 KB granularity by default. Starting with GDAL 2.3, the chunk size can be configured with the CPL_VSIL_CURL_CHUNK_SIZE configuration option, with a value in bytes. If the driver detects sequential reading it will progressively increase the chunk size up to 2 MB to improve download performance. Starting with GDAL 2.3, the GDAL_INGESTED_BYTES_AT_OPEN configuration option can be set to impose the number of bytes read in one GET call at file opening (can help performance to read Cloud optimized geotiff with a large header).

Ref: https://gdal.org/user/virtual_file_systems.html#virtual-file-systems
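A sketch of the coalescing step, capped at 2 MB to mirror the GDAL behavior quoted above:

def merge_ranges(ranges, max_size=2 * 1024 * 1024):
    """Coalesce sorted, adjacent/overlapping (start, end) byte ranges into
    bigger requests, never exceeding ``max_size`` bytes per merged request."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1 and end - merged[-1][0] + 1 <= max_size:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]

Tolerating small gaps between ranges (fetching a few throwaway bytes to avoid an extra request) could be added later with a max_gap parameter.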

add logging

It would be really useful for debugging purposes to have more verbosity on reads.

I often use CPL_DEBUG and CPL_CURL_VERBOSE within GDAL to see how much data and how many GET/LIST/HEAD requests GDAL is doing.

Side note: maybe having an internal variable to host this could be cool:

async with COGReader("https://async-cog-reader-test-data.s3.amazonaws.com/webp_cog.tif") as cog:
    x = y = z = 0
    tile = await cog.get_tile(x, y, z)

print(cog.requests)
{
    "count": 3,
    "size": TotalSizeOfRequest,
    "get": [
        ("offset1-offset2", sizeOfRequest1),
        ("offset3-offset4", sizeOfRequest2),
        ("offset5-offset6", sizeOfRequest3),
    ]
}

Support more compressions

It would be great to add support for other compressions. Cross-referencing the compressions supported by imagecodecs against rio-cogeo profiles, we should support:

  • lzma
  • packbits
  • lerc

We should also support no compression, although I don't think this is very common.
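A sketch of the dispatch, assuming imagecodecs exposes the usual *_decode functions for these codecs (compression codes per the TIFF spec):

import imagecodecs

DECODERS = {
    1: lambda data: data,                 # uncompressed
    32773: imagecodecs.packbits_decode,   # PackBits
    34887: imagecodecs.lerc_decode,       # LERC
    34925: imagecodecs.lzma_decode,       # LZMA
}

def decompress(data: bytes, compression_code: int) -> bytes:
    try:
        return DECODERS[compression_code](data)
    except KeyError:
        raise NotImplementedError(f"compression {compression_code} is not supported")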

Caching merged range requests

Ref #23

I think there are a few options which could work:

  • Cache individual tiles after the range request. This has the benefit of caching the tile regardless of how it was requested (merged vs. unmerged), but adds complexity: we need to check whether all of the tiles encapsulated by a specific merged request are cached before doing the request, skipping the merged request and pulling tiles directly from the cache if so.
  • Cache the range request itself, using start/end as the cache key. This is easier to implement but won't cache the same tile across merged and unmerged requests. Another downside is that we will only get a cache hit if the exact same range request is performed (ex. if you have two ranges A->D and B->D, there will not be a cache hit even though 75% of the imagery is the same between the two requests).
  • Another solution is to cache with some sort of range key so we never request the same byte from the image more than once (see the sketch below). This would of course be useful for every range request we perform and would be implemented on the lower-level Filesystem, which is a nice design pattern, but I don't think aiocache has support for this.

There is also an argument against choosing a caching strategy that works across both merged and unmerged requests, since (I think?) most users would exclusively use either merged or unmerged range requests.
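For the third option, a block-aligned cache sketch (one instance per file; Filesystem.range_request is hypothetical):

class BlockCache:
    """Cache fixed-size blocks so no byte of the file is requested twice,
    regardless of how tile ranges are merged."""

    BLOCK = 16 * 1024  # assumption: cache granularity

    def __init__(self, filesystem, url):
        self.filesystem = filesystem
        self.url = url
        self.blocks = {}  # block index -> bytes

    async def read(self, start, end):
        first, last = start // self.BLOCK, end // self.BLOCK
        for i in range(first, last + 1):
            if i not in self.blocks:  # fetch only blocks we've never seen
                self.blocks[i] = await self.filesystem.range_request(
                    self.url, i * self.BLOCK, (i + 1) * self.BLOCK - 1
                )
        data = b"".join(self.blocks[i] for i in range(first, last + 1))
        offset = start - first * self.BLOCK
        return data[offset : offset + (end - start + 1)]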

Add rasterio/tiler extra

Add an extra which includes code to do dynamic tiling with aiocogeo (ex. pip install aiocogeo[tiler])

  • This would be an extra because rasterio is required for coordinate system logic, and I don't want to include it as a core dependency.
  • We should aim to implement a similar interface to rio_tiler.io.base.BaseReader.

Boundless reads

Confirm that boundless reads work (reading a map tile which isn't fully covered by image tiles).

Should just have to add exception handling here to catch TileNotFoundError, and create a mask for the missing portion of the map tile.

Also let the user define what value is used to fill empty pixels.
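A sketch of the fallback (the helper and attribute names on the reader are assumptions):

import numpy as np

async def read_internal_tile(reader, x: int, y: int, fill_value: int = 0):
    """Boundless fallback: return (tile, mask), substituting a filled tile
    and an all-masked mask when the requested internal tile doesn't exist."""
    try:
        return await reader.get_internal_tile(x, y)  # hypothetical helper
    except TileNotFoundError:
        bands, height, width = reader.tile_shape  # hypothetical (bands, tile_h, tile_w)
        tile = np.full((bands, height, width), fill_value, dtype=reader.dtype)
        mask = np.zeros((height, width), dtype=np.uint8)  # everything masked
        return tile, mask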

cc @vincentsarago

Read COG metadata (IFD/tags)

I'm thinking something like:

from typing import Dict, List


class Tag:
    ...


class IFD:
    tags: Dict[str, Tag]


class COGTiff:
    fpath: str
    ifds: List[IFD]

    async def __aenter__(self):
        # Request the first 16kb, parse the IFDs and their tags.
        self.ifds = ...  # <ifds with tags>
        return self

Usage would be:

async with COGTiff('https://coolsat.com/cog.tif') as cog:
    await cog.read_tile()

We can reuse most of the code from COGDumper, but I'd like to make it more object oriented to make the interface a little easier to use.

I like the AbstractReader used by COGDumper; for now let's focus on making it work for http files and then introduce pluggable readers.

Define tags as semi-private

Made some improvements with #4 and #9 but I'm still not a huge fan.

I think it would be best to switch tags to semi-private attributes and expose important metadata through properties like here. At a minimum we should have properties for the rasterio profile. A few reasons:

  1. I don't think most users care about the metadata contained on each Tag object (or even care about all of the defined tags).
  2. Keeping Tag defined on the IFD still resolves #9.
  3. Properties will of course be more user-friendly (ifd.width instead of ifd.ImageWidth.value).
  4. Making Tag semi-private avoids ambiguity (ex. ifd.Compression vs ifd.compression is confusing).
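A sketch of the property layer over semi-private tags:

class IFD:
    def __init__(self, tags):
        self._tags = tags  # semi-private: raw Tag objects

    @property
    def width(self) -> int:
        return self._tags["ImageWidth"].value

    @property
    def height(self) -> int:
        return self._tags["ImageHeight"].value

    @property
    def compression(self) -> int:
        return self._tags["Compression"].value  # numeric TIFF code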
