
dea-cogger's Introduction

NetCDF to COG Conversion Tool

Convert Open Datacube NetCDF files to Cloud Optimised GeoTIFFs.

It runs prior to uploading from the NCI file systems to AWS S3.

Use dea-cogger to convert NetCDF files to COG format.

Installation

conda create -n cogger datacube=1.8.3 python=3.8 gdal boto3 boltons structlog mpi4py colorama requests chardet
conda activate cogger
pip install requests-aws4auth
pip install -e .

Usage

Example commands for working out which datasets require conversion to COG, by downloading a subset of an S3 Inventory and comparing it with an ODC index:

dea-cogger save-s3-inventory -p ls7_fc_albers -o tmp/ 
dea-cogger generate-work-list --product-name ls7_fc_albers -o tmp

Running Conversion in parallel on Gadi

Example PBS submission script to run in parallel on Gadi.

#!/bin/bash
#PBS -l wd,walltime=5:00:00,mem=190GB,ncpus=48,jobfs=1GB
#PBS -P v10
#PBS -q normal
#PBS -l storage=gdata/v10+gdata/fk4+gdata/rs0+gdata/if87
#PBS -W umask=33
#PBS -N cog_ls8_nbar_albers 


module load dea
module load openmpi/3.1.4

mpirun --tag-output dea-cogger mpi-convert --product-name "{{params.product}}" --output-dir "{{work_dir}}/out/" ls8_nbar_albers_file_list.txt
                

Example configuration file

    products:
      product_name:
        prefix: WOfS/WOFLs/v2.1.5/combined
        name_template: x_{x}/y_{y}/{time:%Y}/{time:%m}/{time:%d}/LS_WATER_3577_{x}_{y}_{time:%Y%m%d%H%M%S%f}
        predictor: 2
        no_overviews: ["source", "observed"]
        default_resampling: average
        white_list: None
        black_list: None
  • product_name: A unique, user-defined string (required)
  • prefix: Defines the folder structure and naming of the output COGs (required)
  • name_template: Defines how to parse the input file names (required)
  • default_resampling: Resampling method used for the overview (pyramid) levels (default: average)
  • predictor: Predictor used during COG conversion (default: 2)
  • no_overviews: A list of keywords of bands which don't require overviews (optional)
  • white_list: A list of keywords of bands to be converted (optional)
  • black_list: A list of keywords of bands to be excluded from COG conversion (optional)

Note: no_overviews lists the keywords of band names for which overviews should not be generated. Be careful copying it between products, because a keyword such as 'source' can unintentionally match bands in other products. For most products this element is not needed; so far only the fractional cover percentile product uses it.

Command: save-s3-inventory

Scans the S3 bucket for the specified product, collects the paths of the uploaded files, and saves them into a pickle file for further processing. Uses a configuration file to define the file-naming schema.
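For illustration only, a minimal sketch of what the inventory step amounts to, assuming boto3 and a hypothetical bucket, prefix and output filename (the real command takes the prefix and naming schema from the configuration file and works from an S3 Inventory rather than a live listing):

import pickle
import boto3

# Hypothetical bucket and prefix; the real tool takes these from the
# product's configuration entry.
BUCKET = "example-dea-public-data"
PREFIX = "fractional-cover/fc/v2.2.1/ls7/"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Collect the key of every object already uploaded under the prefix.
uploaded = []
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        uploaded.append(obj["Key"])

# Persist the listing for the generate-work-list step.
with open("ls7_fc_albers_s3_inv_list.pickle", "wb") as f:
    pickle.dump(uploaded, f)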

Command: generate-work-list

Compares ODC URIs against an S3 bucket listing and writes the list of datasets requiring COG conversion to a file.

Uses a configuration file to define the file naming schema.
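Conceptually this is a set difference between what the ODC index knows about and what is already on S3. A minimal sketch under that assumption, using the pickle from the previous step and a hypothetical text file of indexed URIs (the real tool also applies the name_template mapping between NetCDF paths and S3 keys):

import pickle

# Pickled S3 listing produced by save-s3-inventory (hypothetical filename).
with open("ls7_fc_albers_s3_inv_list.pickle", "rb") as f:
    on_s3 = set(pickle.load(f))

# In the real tool the dataset URIs come from the ODC index;
# here they are read from a hypothetical text file.
with open("odc_uris.txt") as f:
    indexed = {line.strip() for line in f if line.strip()}

# Datasets indexed in ODC but not yet on S3 still need COG conversion.
to_convert = sorted(indexed - on_s3)

with open("ls7_fc_albers_file_list.txt", "w") as f:
    f.write("\n".join(to_convert))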

Command: mpi-convert

Bulk-converts NetCDF files to COG format using MPI. Iterates over the file list and assigns each entry to an MPI worker: the input list is split across the number of workers, with each worker completing every nth task. It also detects, and fails early, if an MPI job is not using its full allocation of resources.

Reads the file naming schema from the configuration file.
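A minimal sketch of the every-nth-task split with mpi4py (the conversion itself is stubbed out as a hypothetical function):

from mpi4py import MPI

def convert_to_cog(path, output_dir):
    # Placeholder for the actual NetCDF-to-COG conversion of one file.
    print(f"converting {path} -> {output_dir}")

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Every rank reads the same work list in the same order.
with open("ls8_nbar_albers_file_list.txt") as f:
    tasks = [line.strip() for line in f if line.strip()]

# Strided split: worker `rank` handles tasks rank, rank + size, rank + 2*size, ...
for path in tasks[rank::size]:
    convert_to_cog(path, output_dir="out/")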

Command: verify

Verifies that converted GeoTIFF files are (Geo)TIFFs with a Cloud Optimised-compatible structure. Requires GDAL's validate_cloud_optimized_geotiff.py script.
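The validator's Python API has changed across GDAL versions, so an illustrative sketch simply shells out to the script (the GeoTIFF name is a hypothetical example):

import subprocess
import sys

def is_valid_cog(path):
    # validate_cloud_optimized_geotiff.py exits non-zero when the file is not
    # a valid cloud optimised GeoTIFF.
    result = subprocess.run(
        [sys.executable, "validate_cloud_optimized_geotiff.py", path],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr

ok, report = is_valid_cog("LS_WATER_3577_9_-49_20180506102018.tif")
print("valid COG" if ok else report)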

COG Creation Settings (What to set for predictor and resampling)

Predictor

(1 = default, 2 = horizontal differencing, 3 = floating point prediction)

  • Horizontal differencing

    Particularly useful for 16-bit data when the high-order and low-order bytes are changing at different frequencies.

    The PREDICTOR=2 option should improve compression on any file with more than 8 bits per resel.

  • Floating point prediction

    The floating point predictor (PREDICTOR=3) results in significantly better compression ratios for floating-point data.

    There doesn't seem to be a performance penalty either for writing data with the floating point predictor, so it's a pretty safe bet for any Float32 data.
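As an illustration of where the predictor is applied (not the tool's internal call), a GDAL Python sketch that writes a tiled, DEFLATE-compressed GeoTIFF with PREDICTOR=2 from a hypothetical NetCDF band:

from osgeo import gdal

# Hypothetical input subdataset; integer bands suit PREDICTOR=2,
# Float32 bands would use PREDICTOR=3.
src = "NETCDF:LS_WATER_3577_9_-49_20180506102018.nc:water"

gdal.Translate(
    "water.tif",
    src,
    format="GTiff",
    creationOptions=[
        "COMPRESS=DEFLATE",
        "PREDICTOR=2",   # horizontal differencing
        "TILED=YES",
        "BLOCKXSIZE=512",
        "BLOCKYSIZE=512",
    ],
)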

Raster Resampling

default_resampling (average, nearest, mode)

  • nearest: Nearest-neighbor resampling tends to leave artifacts such as stair-stepping and periodic striping in the data, which may not be apparent when viewing elevation data but can affect derivative products. It is not suitable for continuous data.

  • average: Computes the average of all non-NODATA contributing pixels.

  • mode: Selects the value which appears most often among the sampled points.
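A sketch of building averaged overview levels with the GDAL Python bindings, assuming the GeoTIFF produced in the previous sketch (illustrative, not the tool's exact call):

from osgeo import gdal

# Open in update mode so the overviews are written into the file itself.
ds = gdal.Open("water.tif", gdal.GA_Update)

# "AVERAGE" suits continuous data; "NEAREST" or "MODE" suit categorical bands.
ds.BuildOverviews("AVERAGE", [2, 4, 8, 16, 32])

ds = None  # flush and close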

dea-cogger's People

Contributors

ashoka1234, emmaai, harshurampur, kieranricardo, kirill888, omad, santoshamohan, uchchwhash


dea-cogger's Issues

GDAL NetCDF Subdatasets

I'm Andrew from TERN. This is a cool tool. We have NetCDFs and Geotiffs that we want to convert to COG.

I've got a question regarding NetCDF Subdatasets. From my understanding, GDAL thinks that a NetCDF file has Subdatasets if there is more than one data variable, like temperature and rainfall in one file. If a file just has temperature, then it will not have Subdatasets; there will just be Bands.

Based on this line in the conversion code, does this mean that all of your NetCDF files have Subdatasets?

Some of our files just have one data variable, so GDAL doesn't detect them as Subdatasets. I tried converting them with this tool but it didn't do anything. If we end up using the tool I'm happy to do a PR, but I just wanted to get in touch first to check that I'm not misunderstanding Subdatasets.
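For reference, a minimal sketch of how one might check whether GDAL sees subdatasets in a given file, using the GDAL Python bindings and the file dumped below:

from osgeo import gdal

ds = gdal.Open("FPARchl_EVI.A2014091.aust.v01.5600m.terra.nc")

subdatasets = ds.GetSubDatasets()
if subdatasets:
    # Several data variables: GDAL exposes each as a NETCDF:"file":variable subdataset.
    for name, description in subdatasets:
        print(name, "->", description)
else:
    # A single data variable: GDAL exposes it directly as bands on the top-level dataset.
    print("no subdatasets;", ds.RasterCount, "band(s)")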

See below for ncdump and gdalinfo from one of our NetCDF files:

ncdump Output:

netcdf FPARchl_EVI.A2014091.aust.v01.5600m.terra {
dimensions:
	time = UNLIMITED ; // (1 currently)
	latitude = 670 ;
	longitude = 813 ;
variables:
	double time(time) ;
		time:long_name = "time" ;
		time:standard_name = "time" ;
		time:units = "days since 1800-01-01 00:00:00.0" ;
		time:calendar = "gregorian" ;
		time:axis = "T" ;
	double latitude(latitude) ;
		latitude:long_name = "latitude" ;
		latitude:standard_name = "latitude" ;
		latitude:axis = "Y" ;
		latitude:units = "degrees_north" ;
	double longitude(longitude) ;
		longitude:long_name = "longitude" ;
		longitude:standard_name = "longitude" ;
		longitude:axis = "X" ;
		longitude:units = "degrees_east" ;
	float FPARchl(time, latitude, longitude) ;
		FPARchl:_FillValue = -3000.f ;
		FPARchl:long_name = "Ecosystem fraction of photosynthetic radiation absorbed by chlorophyll (fPARchl)" ;
		FPARchl:units = "unitless" ;
		FPARchl:grid_mapping = "crs" ;
	int crs ;
		crs:grid_mapping_name = "latitude_longitude" ;
		crs:longitude_of_prime_meridian = 0. ;
		crs:semi_major_axis = 6378137. ;
		crs:inverse_flattening = 298.257223563 ;
}

gdalinfo output:

Driver: netCDF/Network Common Data Format
Files: /input_rasters/chla_fraction/FPARchl_EVI.A2014091.aust.v01.5600m.terra.nc
Size is 813, 670
Coordinate System is:
GEOGCRS["unknown",
    DATUM["unnamed",
        ELLIPSOID["Spheroid",6378137,298.257223563,
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433,
            ID["EPSG",9122]]],
    CS[ellipsoidal,2],
        AXIS["latitude",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433,
                ID["EPSG",9122]]],
        AXIS["longitude",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433,
                ID["EPSG",9122]]]]
Data axis to CRS axis mapping: 2,1
Origin = (112.924999999999997,-10.074999999999999)
Pixel Size = (0.050000000000000,-0.050000000000000)
Metadata:
  crs#grid_mapping_name=latitude_longitude
  crs#inverse_flattening=298.257223563
  crs#longitude_of_prime_meridian=0
  crs#semi_major_axis=6378137
  time#axis=T
  time#calendar=gregorian
  time#long_name=time
  time#standard_name=time
  time#units=days since 1800-01-01 00:00:00.0
Corner Coordinates:
Upper Left  ( 112.9250000, -10.0750000) (112d55'30.00"E, 10d 4'30.00"S)
Lower Left  ( 112.9250000, -43.5750000) (112d55'30.00"E, 43d34'30.00"S)
Upper Right ( 153.5750000, -10.0750000) (153d34'30.00"E, 10d 4'30.00"S)
Lower Right ( 153.5750000, -43.5750000) (153d34'30.00"E, 43d34'30.00"S)
Center      ( 133.2500000, -26.8250000) (133d15' 0.00"E, 26d49'30.00"S)
Band 1 Block=813x670 Type=Float32, ColorInterp=Undefined
  NoData Value=-3000
  Unit Type: unitless
  Metadata:
    grid_mapping=crs
    long_name=Ecosystem fraction of photosynthetic radiation absorbed by chlorophyll (fPARchl)
    NETCDF_DIM_time=78252
    NETCDF_VARNAME=FPARchl
    units=unitless
    _FillValue=-3000

Work list generation is slower

The work list generation stage of COG conversion is slower than expected: comparing the S3 listing with the Datacube listing for one month of wofs_albers product data took more than 7 hours of walltime in a qsub job.

Grab bag of outstanding issues

Work Generation

  • Incremental. Need a way to compare what's on S3 to what's not.
    • Our current unit of work for COG conversion is a NetCDF file, either stacked or unstacked. It will be awkward to compare stacked NetCDF files to data existing on S3, since they represent different things: one dataset vs many.

COG Converter

  • Configurable COG parameters when generating overviews.
    • Resampling method for overlays for different products
    • Number of overview levels
    • Compression/chunk size (maybe, deflate/512 is good, but...)
  • Is it faster/easier/more configurable to use rio cogeo than raw GDAL?
  • Review/test the parameters used

Uploader

  • Specify bucket instead of having COG-Conversion define it.
  • Give uploader an option to move files to a COMPLETE directory instead of deleting them. Will let us test upload to a dev bucket, and then run again against the prod bucket.
  • SPEED How fast can we upload in a single thread, do we need parallel upload processes?
  • MAYBE Ability to watch multiple directories?

usage confusing?

Hi.
I'm probably being slow, but I don't understand where to launch the converter from - do I clone it into a directory? I have module use dea etc...

  • may need to update readme

python3: can't open file 'converter/cog_conv_app.py': [Errno 2] No such file or directory
