
gfw_pixetl's Introduction

GFW pixETL


PixETL reads raster and vector source data and converts it into Cloud Optimized GeoTIFFs (without overviews), clipped to the specified grid size.

It will upload all tiles to the GFW data lake following the GFW naming convention.

Installation

./scripts/setup

Dependencies

  • GDAL 2.4.x or 3.x
  • libpq-dev

Usage

 ██████╗ ███████╗██╗    ██╗    ██████╗ ██╗██╗  ██╗███████╗████████╗██╗
██╔════╝ ██╔════╝██║    ██║    ██╔══██╗██║╚██╗██╔╝██╔════╝╚══██╔══╝██║
██║  ███╗█████╗  ██║ █╗ ██║    ██████╔╝██║ ╚███╔╝ █████╗     ██║   ██║
██║   ██║██╔══╝  ██║███╗██║    ██╔═══╝ ██║ ██╔██╗ ██╔══╝     ██║   ██║
╚██████╔╝██║     ╚███╔███╔╝    ██║     ██║██╔╝ ██╗███████╗   ██║   ███████╗
 ╚═════╝ ╚═╝      ╚══╝╚══╝     ╚═╝     ╚═╝╚═╝  ╚═╝╚══════╝   ╚═╝   ╚══════╝

Usage: pixetl [OPTIONS] LAYER_JSON

  LAYER_JSON: Layer specification in JSON

Options:
  -d, --dataset TEXT                        Name of dataset to process  [required]
  -v, --version TEXT                        Version of dataset to process  [required]
  --subset TEXT                             Subset of tiles to process
  -o, --overwrite                           Overwrite existing tile in output location
  --help                                    Show this message and exit.

Example

pixetl -d umd_tree_cover_density_2000 -v v1.6 '{"source_type": "raster", "pixel_meaning": "percent", "data_type": "uint8", "nbits": 7, "grid": "10/40000", "source_uri": "s3://gfw-files/2018_update/tcd_2000/tiles.geojson", "resampling": "average"}'

Layer JSON

You define the layer source in JSON as the one required argument.

Layer source definitions follow this pattern:

{
    "option": "value",
    ...
}

Supported Options:

Raster Sources:

| Option | Mandatory | Description |
|---|---|---|
| source_type | yes | Always "raster" |
| pixel_meaning | yes | A string indicating the value represented by each pixel. This can be either a field name or a unit. Always use lower case, unless specifying a unit that uses capital letters |
| data_type | yes | Data type of output file (boolean, uint8, int8, uint16, int16, uint32, int32, float32, float64) |
| grid | yes | Grid size of the output dataset |
| no_data | no | Integer value; for float data types use NAN. If left out or set to null, the output dataset will have no no_data value |
| nbits | no | Max number of bits used for the given data type |
| source_uri | yes | List of URIs of source folders or tiles.geojson file(s) |
| resampling | no | Resampling method (nearest, mod, avg, etc.), default nearest |
| calc | no | NumPy expression to transform the array. Use the namespace np, not numpy, when using NumPy functions. When using multiple input bands, reference each band with an uppercase letter in alphabetic order (A, B, C, ...). To output a multiband raster, wrap the list of bands in a masked array, i.e. np.ma.array([A, B, C]). See the sketch below the table |
| symbology | no | Add optional symbology to the output raster |
| compute_stats | no | Compute band statistics and add them to tiles.geojson |
| compute_histogram | no | Compute band histograms and add them to tiles.geojson |
| process_locally | no | When set to True, forces PixETL to download all source files prior to processing. Default False |
| photometric | no | Color interpretation of bands |
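For illustration, here is a hedged sketch of a raster layer definition that uses calc to threshold a single input band A (the dataset name, bucket path and expression are hypothetical; only the option names come from the table above; for multiple bands you would reference A, B, C, ... and wrap the output in np.ma.array as described above):

{
     "source_type": "raster",
     "pixel_meaning": "is_above_threshold",
     "data_type": "uint8",
     "no_data": 0,
     "grid": "10/40000",
     "source_uri": "s3://example-bucket/tree_cover/tiles.geojson",
     "calc": "np.where(A > 30, 1, 0)",
     "resampling": "nearest"
 }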

NOTE:

Files listed in source_uri must be stored on S3 and accessible to PixETL. The file path must use the s3 protocol (s3://). The file content must be GeoJSON: it must contain a FeatureCollection in which each feature represents one GeoTIFF file. The feature geometry describes the extent of the GeoTIFF, and the property name contains the path to the GeoTIFF in GDAL vsi notation. You can reference files hosted on S3 (/vsis3/), GCS (/vsigs/), or anywhere else accessible over HTTP (/vsicurl/). You can use the pixetl_prep script to generate the tiles.geojson file.
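A minimal sketch of such a tiles.geojson (the bucket, file name and coordinates below are placeholders):

{
     "type": "FeatureCollection",
     "features": [
         {
             "type": "Feature",
             "properties": {"name": "/vsis3/example-bucket/tcd_2000/10N_020E.tif"},
             "geometry": {
                 "type": "Polygon",
                 "coordinates": [[[20, 0], [30, 0], [30, 10], [20, 10], [20, 0]]]
             }
         }
     ]
 }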

GeoTIFFs hosted on S3 must be accessible to the AWS profile used by PixETL. When referencing GeoTIFFs hosted on GCS, you must set the ENV variable GOOGLE_APPLICATION_CREDENTIALS, which points to a JSON file on the file system that holds the GCS private key of the Google service account you will use to access the data.

You can store the private key in AWS Secrets Manager. In that case, set AWS_GCS_KEY_SECRET_ARN to the secret ID together with GOOGLE_APPLICATION_CREDENTIALS. PixETL will then attempt to download the private key from AWS Secrets Manager and store it in the JSON file specified.
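For example (the key path and secret ARN below are placeholders):

export GOOGLE_APPLICATION_CREDENTIALS=/root/.gcs/private_key.json
export AWS_GCS_KEY_SECRET_ARN=arn:aws:secretsmanager:us-east-1:123456789012:secret:gcs-service-account-key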

Google Cloud Storage support is experimental. It should work as documented, but we don't have the tools in place to fully test this feature locally. The only way we can test right now is with integration tests after deploying code to staging. For local tests to pass, AWS_GCS_KEY_SECRET_ARN must NOT be set, as we currently have issues running tests with a second moto server on GitHub Actions.

For example, here is a pretty-printed version of the raster layer definition used in the example command above:

{
     "source_type": "raster",
     "pixel_meaning": "percent",
     "data_type": "uint8",
     "nbits": 7,
     "grid": "10/40000",
     "source_uri": "s3://gfw-files/2018_update/tcd_2000/tiles.geojson",
     "resampling": "average"
 }

Vector Sources

| Option | Mandatory | Description |
|---|---|---|
| source_type | yes | Always "vector" |
| pixel_meaning | yes | Field in the source table used for the pixel value |
| data_type | yes | Data type of output file (boolean, uint, int, uint16, int16, uint32, int32, float32, float64) |
| grid | yes | Grid size of the output dataset |
| no_data | no | Integer value to use as the no data value |
| nbits | no | Max number of bits used for the given data type |
| order | no | How to order field values of the source table (asc, desc) |
| rasterize_method | no | How to rasterize the tile (value or count). value uses the value from the table, count counts the number of features intersecting with a pixel |
| calc | no | PostgreSQL expression (e.g. a CASE statement) used to reformat input values. See the sketch below the table |
| symbology | no | Add optional symbology to the output raster |
| compute_stats | no | Compute band statistics and add them to tiles.geojson |
| compute_histogram | no | Compute band histograms and add them to tiles.geojson |
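As a hedged sketch (the field names and value mapping below are made up, and the exact interplay between calc and pixel_meaning may differ), a vector layer definition could look like this:

{
     "source_type": "vector",
     "pixel_meaning": "risk_level",
     "data_type": "uint8",
     "no_data": 0,
     "grid": "10/40000",
     "order": "desc",
     "rasterize_method": "value",
     "calc": "CASE WHEN risk = 'high' THEN 3 WHEN risk = 'medium' THEN 2 ELSE 1 END"
 }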

NOTE:

Source files must be loaded into a PostgreSQL database prior to running this pipeline. PixETL will look for a PostgreSQL schema named after the dataset and a table named after the version. Make sure geometries are valid and of type Polygon or MultiPolygon before running PixETL.

PixETL will look for a field named after the pixel_meaning parameter. This field must be of type integer. If you need to reference a non-integer field, make use of the calc parameter: use a PostgreSQL CASE expression to map the desired field values to integer values.

When using vector sources, PixETL will need access to the PostgreSQL database. Use the standard PostgreSQL environment variables to configure the connection.
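A minimal sketch, assuming the standard libpq variable names are what PixETL reads (host, database and credentials below are placeholders):

export PGHOST=my-postgres-host.example.com
export PGPORT=5432
export PGDATABASE=gis
export PGUSER=gfw
export PGPASSWORD=change-me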

Run with Docker

This is probably the easiest way to run PixETL locally, since you won't need to install any of the required dependencies. The master branch of this repo is linked to the Docker Hub image globalforestwatch/pixetl:latest. You can either pull from there, or build your own local image using the provided Dockerfile.

docker build . -t globalforestwatch/pixetl

Make sure you map a local directory to the container's /tmp directory if you want to monitor the temporary files created. Make sure you set all required ENV vars (see above). Also make sure that your Docker container has all the required AWS privileges.

docker run -it -v /tmp:/tmp -v $HOME/.aws:/root/.aws:ro -e AWS_PROFILE=gfw-dev globalforestwatch/pixetl [OPTIONS] NAME  # pragma: allowlist secret

RUN in AWS Batch

The Terraform module in this repo will add a PixETL-owned compute environment to AWS Batch. The compute environment makes use of EC2 instances that come with ephemeral storage (i.e. instances of the r5d, r5ad and r5dn families). A bootstrap script on the instance mounts one of the ephemeral storage devices as the /tmp folder. A second ephemeral storage device (if available) is mounted as swap space. Any other ephemeral storage devices on the instance are ignored.

The swap space is only a safety net. AWS Batch kills a task without further notice if it uses more than the allocated memory. The swap space allows the batch job to use more memory than allocated; however, the job will become VERY slow. It is hence always the best strategy to keep the memory/CPU ratio high, in particular when working with float data types.

Terraform will also create a PixETL job definition, which references the Docker image (hosted on AWS ECR) and sets the required ENV variables for the container. In case you want to run jobs using a vector source, you will have to set the PostgreSQL ENV vars manually.

The job definition also maps the /tmp volume of the host EC2 instance to the /tmp folder of the Docker container. Hence, any data written to /tmp inside the Docker container will persist on the EC2 instance.

PixETL will create a subfolder in /tmp named after the Batch job ID to namespace the data of a given job.
PixETL will write all temporary data for a given job into that subfolder. It will also clean up temporary data during runtime to avoid filling up the disk. This strategy should prevent running out of disk space. However, in some scenarios you might still experience issues, for example if multiple jobs were killed by Batch (due to using too much memory) before PixETL was able to clean up. The EC2 instance stays available for other scheduled jobs, and if this happens multiple times the disk fills up and you eventually run out of space.

The AWS IAM role used for the docker container should have all the required permissions to run PixETL.

When creating a new PixETL job, it is easiest to specify the job parameters using the JSON format rather than the default space-separated format. The entrypoint of the Docker image is pixetl, so you only need to specify the CLI options and arguments, not the binary itself. When using the JSON format, you will have to escape the quotes inside the LAYER_JSON object:

["-d", "umd_tree_cover_density_2000", "-v", "v1.6", "{\"source_type\": \"raster\", \"pixel_meaning\": \"percent\", \"data_type\": \"uint8\", \"nbits\": 7, \"grid\": \"10/40000\", \"source_uri\": \"s3://gfw-files/2018_update/tcd_2000/tiles.geojson\", \"resampling\": \"average\"}"]
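For instance, assuming a job queue and job definition created by the Terraform module (the names below are placeholders), a job could be submitted with the AWS CLI like this:

aws batch submit-job \
    --job-name umd_tree_cover_density_2000 \
    --job-queue pixetl-job-queue \
    --job-definition pixetl-job-definition \
    --container-overrides '{"command": ["-d", "umd_tree_cover_density_2000", "-v", "v1.6", "{\"source_type\": \"raster\", \"pixel_meaning\": \"percent\", \"data_type\": \"uint8\", \"nbits\": 7, \"grid\": \"10/40000\", \"source_uri\": \"s3://gfw-files/2018_update/tcd_2000/tiles.geojson\", \"resampling\": \"average\"}"]}'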

pixetl_prep

Use the pixetl_prep script to generate the tiles.geojson file referenced in source_uri (see the raster source note above).

gfw_pixetl's People

Contributors

dagibbs22, dmannarino, jterry64


gfw_pixetl's Issues

Set up testing environment for API migration on AWS: Add Terraform files

Is your feature request related to a problem? Please describe.
We currently configure the launch template, compute environment, job queue and job definition for this pipeline manually. We will need Terraform files to describe the desired state for easier deployment.

Describe the solution you'd like
Write Terraform files and deployment scripts for required AWS resources

Make sure resampling works correctly

Is your feature request related to a problem? Please describe.
Make sure resampling from one grid size to another works correctly. Add test cases where needed.

Assure tool supports different grid sizes

Is your feature request related to a problem? Please describe.
We currently only support 10x10 degree grids. Make sure we can support any kind of grid size and any kind of pixel resolution

Increase test coverage

Is your feature request related to a problem? Please describe.
Unit tests currently only cover validation of yaml files. Expand test coverage to cover all stages.

Describe the solution you'd like
Expand test coverage to cover all stages plus e2e tests which run the entire pipeline using different input test data

Log level is not applied

Subject of the issue

When setting the log level to debug, it is not applied to other modules. Only if I set the level directly in __init__.py do all modules change level.

Steps to reproduce

add --debug flag to script

Expected behaviour

logs all debug messages

Actual behaviour

only logs info and above

Allow passing layer config via CLI argument

Is your feature request related to a problem? Please describe.
We currently need to first write to the layer config file in order to define the tile pipeline. It would be more flexible to allow users to pass config specifications via CLI arguments

Describe the solution you'd like
Allow passing the layer config via CLI arguments. This could be either a dict or a link to a YAML file containing the layer specifications.

Add retry logic in case translate fails with a NoneType error

Subject of the issue

Sometimes gdal_translate fails with a NoneType error. This is likely caused by network issues. We need to add some sort of retry logic to ensure the tile gets processed.

Your environment

  • AWS Batch with Docker

Steps to reproduce

treecover_density_2010 -v v1.6 -s raster -f threshold
vCPU: 16
Memory 65536

Expected behaviour

A Tile that was found in the previous stage should also be processed in the translate stage

Actual behaviour

Fails with NoneType Error

Add support for projection transformations

Is your feature request related to a problem? Please describe.
If the input data source is in a different projection than the output tiles, we need to ensure that this is handled correctly in the transform stage.

Describe the solution you'd like
Add gdalwarp stage

Link remote Aurora Serverless database to vector pipeline

Is your feature request related to a problem? Please describe.
In the current setup, we connect to a local Postgres instance. Instead, this should be an Aurora Serverless Postgres instance on AWS.

Describe the solution you'd like
Set up an Aurora Serverless instance and update the connection credentials. Store the password in AWS Secrets Manager.

Create VRT at the end of pipeline

Is your feature request related to a problem? Please describe.
We will need VRTs for all datasets, to be able to resample.
At the end of the pipeline, we should collect all process tiles and write them to a VRT, which sits in the same folder as the tiles themselves.
Call VRT all.vrt
