
joerd's People

Contributors

iandees, kevinkreiser, louh, meetar, migurski, nvkelso, rmarianski, rmglennon, zerebubuth


joerd's Issues

Expand bounding box to include all regions.

At the moment, we download the files which intersect the region(s) of interest, and we generate the tiles which intersect the region(s) of interest. However, since a tile may overhang the edge of a region, the files we download should instead encompass every tile which might be generated.

Loop over the tiles first, and union their bounding boxes to provide the regions input for the download stage.
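
A minimal sketch of that ordering, assuming each tile object can report its lat/lng bounds as a (left, bottom, right, top) tuple; the helper names are illustrative, not Joerd's actual API:

def union_bboxes(bboxes):
    # Union a list of (left, bottom, right, top) boxes into one box.
    lefts, bottoms, rights, tops = zip(*bboxes)
    return (min(lefts), min(bottoms), max(rights), max(tops))

def download_region(tiles):
    # The download stage should cover every tile we will generate, not
    # just the original region(s) of interest.
    return union_bboxes([t.latlon_bbox() for t in tiles])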

Add NED 1m data, where available.

From the USGS NED site:

1-meter – This dataset was introduced in 2015 with limited coverage of the U.S.,
but will be expanding as new DEMs from 3DEP quality level 2 or better lidar data
are acquired. Horizontal coordinates are referenced to the Universal Transverse
Mercator projection.

Note the different projection.

Determine bounding boxes from data

The current design assumes that bounding boxes for each raster tile can be determined without downloading the tile - usually from coordinates embedded in the tile name. This is mostly useful because there are many very large datasets (e.g: SRTM, NED) and it would be onerous to have to download all of them if you only wanted to render a small part of the world.

However, this assumption means that it's more difficult to write data sources which don't adhere to this pattern (e.g: Great Lakes).

A possible fix is to introduce a new phase which enumerates / indexes all the tiles within a source; this could involve downloading the source itself, or a metadata file, to provide the bounds.
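
A sketch of what that index phase might record per file once it (or a small metadata file describing it) has been fetched: read the georeferencing with GDAL instead of parsing coordinates out of the file name. The function name is illustrative.

from osgeo import gdal

def raster_bbox(path):
    # Bounding box straight from the raster's geotransform, rather than
    # inferred from the tile name.
    ds = gdal.Open(path)
    ulx, xres, _, uly, _, yres = ds.GetGeoTransform()
    lrx = ulx + ds.RasterXSize * xres
    lry = uly + ds.RasterYSize * yres
    return (min(ulx, lrx), min(uly, lry), max(ulx, lrx), max(uly, lry))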

Clusterise

To run this across multiple processes we will need:

  • A long-running server process,
  • Which listens for jobs on a queue (e.g: SQS),
  • Which stores tiles on a remote filesystem (e.g: S3).

Upon receiving a job from the queue (which should be a large section of tiles, to amortize download costs), the server needs to run the download, buildvrt and generate jobs with the region configured as per the incoming request.
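
A rough sketch of that server process using boto3 and SQS long polling; the queue URL is a placeholder, and download / buildvrt / generate stand in for Joerd's existing job steps:

import json
import boto3

QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/joerd-jobs'  # placeholder
sqs = boto3.client('sqs')

def run_forever():
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)
        for msg in resp.get('Messages', []):
            job = json.loads(msg['Body'])   # e.g. {"region": ..., "zooms": ...}
            # Run the existing pipeline with the region from the message;
            # generate() is assumed to upload its outputs to S3.
            download(job['region'])
            buildvrt(job['region'])
            generate(job['region'])
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg['ReceiptHandle'])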

Make downloads more robust

An issue that seems to keep happening with GMTED data is that the server will drop the connection mid-file and subsequent requests will get an "empty response" for some period of time. This appears to be some sort of TCP-level rate-limiting, but is currently undetectable to us until we try to build the VRT.

What should happen:

  1. Use a more reliable download library (e.g: urlgrabber?)
  2. Integrate retries with back-off
  3. Check download integrity before calling it done (e.g: run gdalinfo on it)
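
A sketch of points 2 and 3 together, assuming plain requests rather than urlgrabber, with gdal.Open standing in for "run gdalinfo on it":

import time
import requests
from osgeo import gdal

def download_with_retries(url, dest, attempts=5):
    for i in range(attempts):
        try:
            with requests.get(url, stream=True, timeout=60) as r:
                r.raise_for_status()
                with open(dest, 'wb') as fh:
                    for chunk in r.iter_content(chunk_size=1 << 20):
                        fh.write(chunk)
            # Integrity check: GDAL must at least be able to open the file.
            if gdal.Open(dest) is not None:
                return True
        except requests.RequestException:
            pass
        time.sleep(2 ** i)  # exponential back-off before the next attempt
    return False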

NED has overlapping parts

NED contains many files which overlap each other, some even in the same year. For example, overlapping chunks near SF:

  • ned19_n38x00_w122x50_ca_sanfrancisco_topobathy_2010
  • ned19_n38x00_w122x50_ca_sanfrancisocoast_2010
  • ned19_n38x00_w122x50_ca_goldengate_2010
  • ned19_n38x00_w122x50_ca_alamedaco_2007
  • ned19_n38x00_w122x50_ca_contracostaco_2008

All of the above have the same bounding box, although the first one is the only one which covers the whole bounding box without large regions of NODATA. It would be good to have some way of detecting this, so that the other files aren't downloaded and don't have the opportunity to make stitching the data back up any harder than it has to be.
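
One way of detecting this might be to measure how much of each candidate file is NODATA over the shared bounding box and keep only the most complete one; a sketch with GDAL and numpy:

import numpy as np
from osgeo import gdal

def nodata_fraction(path):
    # Fraction of band 1 pixels that are NODATA.
    ds = gdal.Open(path)
    band = ds.GetRasterBand(1)
    nodata = band.GetNoDataValue()
    if nodata is None:
        return 0.0
    arr = band.ReadAsArray()
    return float(np.count_nonzero(arr == nodata)) / arr.size

def best_candidate(paths):
    # Prefer the file with the least NODATA; the others could be skipped.
    return min(paths, key=nodata_fraction)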

Coastal resampling

We have the following problem near the coast:

[figure: interpolation badness near the coast]

In this example, the red data is something like GMTED or SRTM, for which we have no bathymetry, and the blue data is something of lower resolution but with bathymetry, such as ETOPO1. The problem occurs because the NODATA (represented by the empty circle) can't interpolate into the sea - and we wouldn't want it to, as we have no data there. But the lower resolution dataset fills in with something that will probably be a shallower gradient, given the wider spacing between its data points. This leads to a "halo" around the land, and it's particularly bad near cliffs. It gets worse still when there's an offset mismatch between the data sets and the coastline sits at different positions in each.

There are many ways to fix this. Some random thoughts:

  1. Interpolate bathymetry and height data separately, and composite them. For example: first interpolate all the land sources with NODATA set to 0 and composite them together; then interpolate all the water sources with NODATA set to 0 and composite those together; then use a water/land mask to choose between the two (see the sketch after this list). This would require a high resolution coastline shapefile.
  2. Backfill at the native resolution. At the moment, we mask each data source with NODATA at its native resolution, then warp / interpolate it to the target projection and composite. The issue arises because we are interpolating a source with NODATA, so we could start by warping ETOPO1 to GMTED resolution, compositing there, then warping to SRTM resolution and compositing there, and so on. The downside is that all the re-warping is likely to blur / produce artefacts in the earlier data sources.
  3. Use a cleverer algorithm for interpolation. See "Improving quality of public domain digital elevation models through data fusion" by Karkee, Steward and Abd Aziz for one way of doing this. Another would be to interpolate over a locally-reconstructed TIN of the "best" data points which aren't NODATA. Either way, this approach would likely produce the best results, but be the most complex to implement.
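
A very rough sketch of the compositing step in option 1, assuming the land and water composites have already been warped onto the same grid and a land mask has been rasterised from a high resolution coastline; all names here are illustrative:

import numpy as np

def composite_land_water(land, water, land_mask):
    # land, water: elevation arrays interpolated separately, with NODATA
    # filled with 0 before interpolation.
    # land_mask: boolean array, True where the coastline says "land".
    return np.where(land_mask, land, water)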

There will be other ways that I've missed. Thoughts?

Logger path should be relative to config

The logger path at the moment is relative to $CWD, but it should be relative to the location of the config file, as this means fewer surprises when running Joerd from arbitrary locations.

Add normal maps to support hillshading

From the Tangram team: a request to add normal map files for hillshading. This would be more efficient to compute once on the server than to re-compute over and over in the client.

A 32-bit PNG: 3 channels of normal map data, plus 1 channel of 8-bit quantized elevation (not for showing actual height, but as a rough indicator of where the pixel falls relative to sea level, for styling).
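
A sketch of how those four channels could be packed, using simple finite differences over the heightmap for the normals; the elevation quantisation range below (-11,000 m to +9,000 m) is an assumption, not a spec:

import numpy as np

def normal_tile(height, x_spacing, y_spacing):
    # Surface normals from finite differences of the heightmap (metres).
    dz_dx = np.gradient(height, axis=1) / x_spacing
    dz_dy = np.gradient(height, axis=0) / y_spacing
    nx, ny, nz = -dz_dx, -dz_dy, np.ones_like(height)
    length = np.sqrt(nx * nx + ny * ny + nz * nz)
    # Pack unit normals into 0..255, where 128 means "no slope".
    r = ((nx / length) * 0.5 + 0.5) * 255
    g = ((ny / length) * 0.5 + 0.5) * 255
    b = ((nz / length) * 0.5 + 0.5) * 255
    # Fourth channel: coarse elevation indicator, not true height.
    a = np.clip((height + 11000.0) / 20000.0 * 255.0, 0, 255)
    return np.dstack([r, g, b, a]).astype(np.uint8)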

Add memcached cache.

Larger machines will have spare RAM (looks like at least 4GB spare on the dev cluster machines). Using that for cache can only help.

Fix Lanczos artefacts

Using the Lanczos filter everywhere not only takes a lot of time, but also leads to artefacts at or near the edge of datasets at lower zooms. Instead, the filter function to use should be a function of the scale. To do this, we need to pass in the scale of the current tile (min scale, x & y?) to the function which returns the filter.
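
Something along these lines is what's being suggested: choose the filter from the tile's scale (metres per pixel), with the actual cut-offs to be tuned. The thresholds below are placeholders; the names are GDAL resampling algorithms.

def filter_for_scale(metres_per_pixel):
    # Cheap filters at coarse scales (low zooms), where Lanczos both costs
    # the most and produces edge artefacts; Lanczos only at fine scales.
    if metres_per_pixel > 1000:
        return 'average'
    elif metres_per_pixel > 100:
        return 'cubic'
    else:
        return 'lanczos'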

TIFs should be 512x512

Instead of following the standard, well-understood and expected scheme used by almost all tiled mercator sources in existence, our TIFs should:

  • Be 512x512.
  • Have internal 256x256 blocks.
  • Be named as if they were "even" tiles at zoom N rather than retina tiles at N-1.

Handle out of disk space gracefully

At the moment, an out-of-space error is fatal - an exception will be thrown, and nothing will be generated.

Instead, after downloading each file, the download code should ensure there's enough space to move the file into place by deleting files which aren't needed any more. The main work here is to figure out which files aren't needed any more (or, conversely, which files are needed), but we already know some of this as we build the download list before running.
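
A sketch of that check, assuming Python 3's shutil.disk_usage and a caller-supplied list of files already known to be expendable:

import os
import shutil

def ensure_space(dest_dir, needed_bytes, expendable_files):
    # Delete files that are no longer needed until there is room to move
    # the downloaded file into place, instead of failing with an exception.
    for path in expendable_files:
        if shutil.disk_usage(dest_dir).free >= needed_bytes:
            return True
        try:
            os.remove(path)
        except OSError:
            pass
    return shutil.disk_usage(dest_dir).free >= needed_bytes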

Move hard-coded URLs to config file

For example, NED_FTP_SERVER and NED_BASE_PATH in the NED source.

This allows them to be reconfigured if stuff changes (or we discover a mirror!), and also should allow us to start adding tests for these sorts of things.

Group similar jobs together.

At the moment 1 job = 1 message in the queue. This leads to situations where jobs can download a lot of data just to do a single output tile, and then download a completely different set of data for the next job. This makes caching less effective and reduces throughput.

When render jobs are added to the queue, the enqueueing process knows which source files they depend upon, so jobs can be grouped together by the set of dependencies they have and batched into arrays of jobs in each message. SQS has a limit on the size of each message, which will curtail the benefits of grouping everything together, but also allow a certain degree of granularity.
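
A sketch of the grouping, assuming each render job can report the set of source files it depends on; the SQS message body limit (256 KB) is what caps each batch:

import json
from collections import defaultdict

SQS_MAX_BYTES = 256 * 1024  # SQS message size limit

def batch_jobs(jobs):
    # Group jobs by their set of source file dependencies, then split each
    # group into batches that stay under the SQS message size limit.
    groups = defaultdict(list)
    for job in jobs:
        groups[frozenset(job['sources'])].append(job)
    for group in groups.values():
        batch = []
        for job in group:
            if batch and len(json.dumps(batch + [job])) > SQS_MAX_BYTES:
                yield batch
                batch = []
            batch.append(job)
        if batch:
            yield batch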

High-Resolution California Bathymetry - Multibeam + LIDAR

http://www.cencoos.org/data/parameters/bathymetry
ftp://coast.noaa.gov/pub/DigitalCoast/lidar1_z/geoid12a/data/2612/

This project merged recently collected topographic, bathymetric, and acoustic elevation data along the entire California coastline, from approximately the 10 meter elevation contour out to California's 3-mile state waters boundary.

The topographic lidar data came from the 2009-2011 CA Coastal Conservancy Lidar Project (Oct 2009 - Aug 2011); coverage extends landward 500 m from the shoreline along the entire California coastline.
The bathymetric lidar data is the 2009-2010 U.S. Army Corps of Engineers JALBTCX lidar collected for the California Coastal Mapping Project.
The multibeam acoustic data are high resolution CSMP survey data combined with NGDC data.

Give each data source a valid zoom range

Data sources such as NED are far too detailed to use at low zooms. Further, having a "whole USA" tile would require downloading all of NED, which is a non-trivial amount of data, most of which would be averaged away.

Each source should have a "valid zoom range" (or scale might be more appropriate), against which requests are intersected before downloading.

NED file fails to download

This N38.0 x W122.5 tile (ftp://rockyftp.cr.usgs.gov/vdelivery/Datasets/Staged/NED/19/IMG/ned19_n38x00_w122x50_ca_sanfrancisco_topobathy_2010.zip) fails to download. I don't think this is a code problem in Joerd, as it also failed to download with wget. Either it's a temporary server failure, or something weirder is going on. Needs investigation.

Don't shell out to gdalbuildvrt

We're using the Python bindings to GDAL, so it should be possible to build the VRT directly in Python without shelling out to gdalbuildvrt. The question is how much of gdalbuildvrt's logic we would need to re-implement.
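
Newer GDAL Python bindings expose this directly as gdal.BuildVRT, so a first pass could be as small as the sketch below; whether it covers all the gdalbuildvrt behaviour we rely on (resolution choice, nodata handling, and so on) is the open question:

from osgeo import gdal

def build_vrt(vrt_path, source_paths, nodata=None):
    # Rough equivalent of shelling out to gdalbuildvrt for the simple case.
    options = gdal.BuildVRTOptions(resolution='highest',
                                   srcNodata=nodata,
                                   VRTNodata=nodata)
    ds = gdal.BuildVRT(vrt_path, source_paths, options=options)
    ds.FlushCache()  # make sure the .vrt file is written out
    return vrt_path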

Open lidar-based DSM and DTM dataset for ~72% of UK

I'm currently using Valhalla as a fallback for working out the ground height below buildings in Polygon City (another Mapzen project) and having more accurate height data in urban environments would be very, very useful for us (as it would with my own project, ViziCities) – SRTM just doesn't cut it for the accuracy we need, both horizontally and vertically.

Let me know if this isn't helpful here – I saw that you wanted suggestions for open elevation datasets to use and assumed this was the best place. I have plenty more to suggest if it's helpful.

The Environment Agency in the UK recently released the first tranche of their LIDAR elevation data for around 72% of the UK – including DSM and DTM ASCII files at horizontal resolutions of 25cm, 50cm, 1m, and 2m. This data has been released under an Open Government License, which is compatible with the CC-BY 4.0 license.

There's a lot of data and currently the only way to download it is manually through a portal. The data is also more verbose than it needs to be due to unnecessary decimal accuracy and the EA are looking to reduce this overhead (and the file-size) in the near future.

There are 2 potential approaches to automate the download of the data:

  • The download links are easy to automate (based on the Ordnance Survey National Grid) and so it would be possible to batch them all up and slowly work through them
  • I have contacts at the Environment Agency who may be able to help us get access to a bulk download or some other bulk export

Add NED data

NED is a dataset at various resolutions in different parts of the USA: 1/9th arc-second, 1/3rd arc-second, or 1 arc-second. We should pull in this data and apply it so that we get the best available resolution in the areas which have it.

Support PNG outputs.

In the terrarium output, in addition to supporting GeoTIFFs, we should also output 16-bit greyscale PNGs. These aren't widely supported, but should be readable with pngtoy. Note that PNGs only support unsigned channels, so we'll need to add an offset (32,768?) to handle the bathymetry depths.
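
A sketch of the offset-and-write step using GDAL's PNG driver, which only supports CreateCopy, so the data goes through an in-memory dataset first. The 32,768 offset is the one suggested above.

import numpy as np
from osgeo import gdal

OFFSET = 32768  # shift signed elevations into the unsigned 16-bit range

def write_png16(elevation, png_path):
    # elevation: 2D array of metres, with bathymetry as negative values.
    data = np.clip(elevation.astype(np.int32) + OFFSET, 0, 65535)
    data = data.astype(np.uint16)
    rows, cols = data.shape
    mem = gdal.GetDriverByName('MEM').Create('', cols, rows, 1,
                                             gdal.GDT_UInt16)
    mem.GetRasterBand(1).WriteArray(data)
    # The PNG driver can't Create() directly, only CreateCopy().
    gdal.GetDriverByName('PNG').CreateCopy(png_path, mem)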

Investigate whether we can stream zip files

Context is this comment from @rmarianski.

Possible issues: we sometimes need to get 2 files out of the zip - is there a callback per file that we can use? Also, (IIRC - this might be bad information) zip files have the directory block at the end of the file - does this mean we'd need to buffer it all in RAM? If so, how large is the largest file we're likely to download?

Use 1/3rd arcsecond NED as well.

Joerd pulls in the 1/9th arcsecond NED, but this doesn't have full coverage of the USA (yet?). The 1/3rd arcsecond has some issues around the coastline, but is probably still better than SRTM.

Batch downloading.

At the moment, each job will download the data it needs individually. This is great for simplicity, but at smaller job sizes it means duplicating a lot of downloads. Instead, it would be better to download (or mirror) the set of source files to S3 in one batch step, then process them from/to S3 in a second batch step. This would avoid stress on the source servers as well as providing higher bandwidth access and lower latency job processing from within EC2.

Missing prod tile at 8/41/98 in all formats

<Error>
<Code>AccessDenied</Code>
<Message>Access Denied</Message>
<RequestId>30A9CFAE12A5F5FE</RequestId>
<HostId>
XHlEHd5bfdzTKIg50O1jeS1skJ2hZlt5mhLlqus3ltuCse7dAEm39icGGl//eZoMJnEBKn9/vug=
</HostId>
</Error>

TIFF: Consider Float32 as the data type

GeoTIFFs are currently Int16, which means that the limit on vertical resolution is 1 m. Higher resolution datasets (e.g. lidar) effectively lose resolution when converted to Int16. (NED is distributed as Float32, so this may already be occurring.)

Glitch in Lake Superior

[screenshot: rendering glitch in Lake Superior]

That's tile terrarium/7/33/44; the effect is visible at zooms 6 and 7, but fixed at 8. This seems to suggest a problem with one of the data sources which is overlaid at zooms 8+. Possibly a piece of data is being included which shouldn't be (a bbox parsing error?), or possibly it's an error in the upstream data.

Needs investigation.

Compress source files at rest

Files at rest in the store can/should be compressed. At the moment, the SRTM and NED files are large and not compressed, which means they can be slow to download. Another option, and one which would be automatically supported by GDAL, would be converting these files from a non-compressed format such as HGT or HFA/IMG to one which supports built-in compression such as GeoTIFF.
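
For the conversion route, gdal.Translate (available in the newer GDAL Python bindings) can rewrite an HGT or HFA/IMG file as a compressed, tiled GeoTIFF in one call; DEFLATE is an arbitrary choice here:

from osgeo import gdal

def compress_to_geotiff(src_path, dst_path):
    # Rewrite an uncompressed source (HGT, HFA/IMG, ...) as a GeoTIFF with
    # built-in compression, which GDAL then reads transparently.
    gdal.Translate(dst_path, src_path, format='GTiff',
                   creationOptions=['COMPRESS=DEFLATE', 'TILED=YES',
                                    'PREDICTOR=2'])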

TIFF: Set AREA_OR_POINT=Point

Currently "Area":

$ gdalinfo 0.tif
Driver: GTiff/GeoTIFF
Files: 0.tif
Size is 512, 512
Coordinate System is:
PROJCS["WGS 84 / Pseudo-Mercator",
    GEOGCS["WGS 84",
        DATUM["WGS_1984",
            SPHEROID["WGS 84",6378137,298.257223563,
                AUTHORITY["EPSG","7030"]],
            AUTHORITY["EPSG","6326"]],
        PRIMEM["Greenwich",0],
        UNIT["degree",0.0174532925199433],
        AUTHORITY["EPSG","4326"]],
    PROJECTION["Mercator_1SP"],
    PARAMETER["central_meridian",0],
    PARAMETER["scale_factor",1],
    PARAMETER["false_easting",0],
    PARAMETER["false_northing",0],
    UNIT["metre",1,
        AUTHORITY["EPSG","9001"]],
    EXTENSION["PROJ4","+proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0 +k=1.0 +units=m +nadgrids=@null +wktext  +no_defs"],
    AUTHORITY["EPSG","3857"]]
Origin = (-20037508.339999999850988,20037508.339999999850988)
Pixel Size = (78271.516953124999418,-78271.516953124999418)
Metadata:
  AREA_OR_POINT=Area
Image Structure Metadata:
  COMPRESSION=LZW
  INTERLEAVE=BAND
Corner Coordinates:
Upper Left  (-20037508.340,20037508.340) (180d 0' 0.00"W, 85d 3' 4.06"N)
Lower Left  (-20037508.340,-20037508.340) (180d 0' 0.00"W, 85d 3' 4.06"S)
Upper Right (20037508.340,20037508.340) (180d 0' 0.00"E, 85d 3' 4.06"N)
Lower Right (20037508.340,-20037508.340) (180d 0' 0.00"E, 85d 3' 4.06"S)
Center      (   0.0000000,   0.0000000) (  0d 0' 0.01"E,  0d 0' 0.01"N)
Band 1 Block=256x256 Type=Int16, ColorInterp=Gray
  NoData Value=-32768

(This is purely a metadata thing; it doesn't require changes to the underlying data.)
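
Setting it should be a one-liner with the Python bindings, either when the tile is created or after the fact with the file opened in update mode; a sketch:

from osgeo import gdal

def mark_as_point(tif_path):
    # Flip the AREA_OR_POINT metadata item; pixel values are untouched.
    ds = gdal.Open(tif_path, gdal.GA_Update)
    ds.SetMetadataItem('AREA_OR_POINT', 'Point')
    ds.FlushCache()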

Variable max zooms

Each job sent to Joerd has spatial extent and zoom extent already.

We can send one job to the queue for the low zooms; then each zoom 9 tile that isn't empty needs to be specified and enqueued individually.

Whether a tile is empty can be decided based on Natural Earth land + minor islands.

Generate tile indexes (covering shapes) per dataset

When generating derived datasets, I'd like to be able to restrict the areas being processed to those where data is available.

In GDAL-land, I would run gdaltindex against each of the source files to generate a Shapefile (or whatever) with features covering the extent of each source file. This collection of features can then be run through something (e.g. mercantile) to generate a list of tile candidates for processing (rather than running over the whole world).
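
A sketch of that flow in Python, reading each source's bounds with GDAL and feeding them to mercantile to enumerate candidate tiles (skipping the intermediate Shapefile); this assumes the sources are in, or have been converted to, WGS84 lon/lat:

import mercantile
from osgeo import gdal

def candidate_tiles(source_paths, zoom):
    # Tiles which intersect at least one source file, rather than the
    # whole world.
    tiles = set()
    for path in source_paths:
        ds = gdal.Open(path)
        ulx, xres, _, uly, _, yres = ds.GetGeoTransform()
        lrx = ulx + ds.RasterXSize * xres
        lry = uly + ds.RasterYSize * yres
        west, east = min(ulx, lrx), max(ulx, lrx)
        south, north = min(uly, lry), max(uly, lry)
        tiles.update(mercantile.tiles(west, south, east, north, [zoom]))
    return tiles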

As a work-around, I'm going to try to use one of the Natural Earth coastline shapes as the input to mercantile to restrict the input dataset.

Add Canadian DEM data

From @nvkelso:

Canadian Digital Elevation Model Mosaic (CDEM)
The Canadian Digital Elevation Model (CDEM) is part of Natural Resources Canada's altimetry system designed to better meet the users' needs for elevation data and products. The CDEM stems from the existing Canadian Digital Elevation Data (CDED). In these data, elevations can be either ground or reflective surface elevations. A CDEM mosaic can be obtained for a pre-defined or user-defined extent. The coverage and resolution of a mosaic varies according to latitude and to the extent of the requested area. Derived products such as slope, shaded relief and colour shaded relief maps can also be generated on demand.

Additional Information
  • Date published: November 6, 2012
  • Series: Canadian Digital Elevation Model Mosaic (CDEM)
  • Issue: 3.0
  • Purpose: The CDEM plays the same role as contours and relief shading on conventional paper maps. The CDEM serves as key primary data in a range of applications critical to achieving sustainable development. These applications include environmental and ecological impact assessments, water flow and water quality analysis, climate change studies, forest regeneration planning and wildlife habitats. In addition, the CDEM can be used in the generation of three-dimensional graphics displaying terrain slope, profiles and lines of sight. Non-graphic applications include geoid calculations, terrain modelling, flood simulations and telecommunication studies.
  • Author: Government of Canada, Natural Resources Canada, Earth Sciences Sector, Mapping Information Branch, GeoAccess Division
  • Language: This product is available in English and French
  • Product type: Elevation

Migrated from: https://github.com/valhalla/skadi/issues/41.

Fix SRTM coastal blockiness issues

SRTM has artifacts near the coast: blocks of data which are neither zero nor NODATA, but which aren't valid data either. It seems like we need to bring in another source of data to clip against - an accurate coastline, preferably OSM or NE.
