gcdl's Issues
non-temporal datasets
We need to implement full support for non-temporal spatial datasets (e.g., DEMs).
TileSet performance optimization
When testing with SRTM DEM tiles, requests that require ~14 or fewer tiles return quickly, but requests that require many more than that just churn at the mosaic-building stage. The process does not appear to be either CPU- or memory-bound, so it's not clear to me what is going on. Regardless, there is clearly optimization work needed.
locally stored tiles
We need to support datasets that are stored locally as tiles (e.g., GTOPO30).
improvement to grid size metadata
Grid size metadata can't assume the units are meters. Alternative units need to be supported and the unit needs to be included with the metadata.
subset geometry uploads
It would be useful if users could upload point-level datasets for point-level processing on the server and then download the results.
streaming from OPeNDAP?
I've been looking into options for streaming MODIS, particularly focusing on methods that could also be useful for other products. OPeNDAP seems like a potential option for NASA Earth data. I've been looking at LP DAAC Hyrax, but there seem to be other servers/products of potential interest for our group. There is a Python package, pydap, for accessing these kinds of servers. One general issue may be the requirement to authenticate with an EarthData login (we would need to review the user agreement, and decide what account would make the requests?), but pydap supports the authentication process if that is acceptable.
Spatial and temporal subsets can be made, and metadata queried, but formats might vary a bit depending on which server you are talking to.
A warning, though: I've only dabbled in reading about this option and have not yet successfully executed my test of extracting a few time steps of NDVI covering the Jornada. I can use basic curl commands to handle authentication and request the intended spatial extent from the intended NDVI product, but I'm not sure about the temporal part (I am pulling a few time steps, but I simply picked the first 4), and I haven't successfully read in the result. I was trying in R since that's what I know, but I can't tell whether or not it's an R netcdf4 issue: I can view the layers that I downloaded, but the geographic info is missing. I tried the pydap package and can make some example code download data from another server, but not for my MODIS/Hyrax test.
At this point, I'd need other people's input on whether this is worth continuing to explore as we try to include streaming of popular datasets that are too big to store on SCINet.
I found this LP DAAC webinar to be a helpful starting point.
Add additional dataset support for pilot project
Based on my notes from the summer, we were thinking the pilot project would use these datasets:
- climate: PRISM or daymet, which we have covered
- elevation: USGS DEM, can assume it will be stored locally
- NDVI: to be streamed. Did we decide on product and source for streaming?
output format options
The GCDL needs to support specification of alternative output formats.
point data formatting
We need consistent input (x,y) point formatting for endpoints.
different resampling methods
Different use cases will need different resampling methods used in reprojection. We are calling reproject from the rioxarray package, which appears to default to the nearest neighbor method. That's the fastest, but it won't be advisable for most use cases.
It would be nice to have an option for the user to choose from the available resampling methods. To start, we could do something like defaulting to nearest neighbor for categorical variables (when we support them) and bilinear for continuous variables?
Note: rioxarray uses rasterio.warp.reproject, which in turn uses GDAL. The available methods will depend on the GDAL version installed.
rasterio supported methods
GDAL methods doc
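The proposed default-plus-override behavior could be sketched as below. This is illustrative only: the VarType enum, mapping, and choose_resampling helper are hypothetical names, and the method strings simply mirror rasterio's Resampling enum member names (nearest, bilinear, cubic, etc.).

```python
from enum import Enum

class VarType(Enum):
    CATEGORICAL = "categorical"
    CONTINUOUS = "continuous"

# Hypothetical default mapping; the string values correspond to
# rasterio Resampling enum member names.
DEFAULT_RESAMPLING = {
    VarType.CATEGORICAL: "nearest",
    VarType.CONTINUOUS: "bilinear",
}

def choose_resampling(var_type, user_choice=None):
    """Return the user's requested method if given, else a per-type default."""
    return user_choice if user_choice is not None else DEFAULT_RESAMPLING[var_type]
```

The returned string would then be converted to the corresponding rasterio Resampling member before calling reproject.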
daymet dataset currently only supporting annual data
Only the annual files are supported at the moment.
Unlike PRISM, Daymet combines all 12 months into the same file. The file can be identified by the string 'mon' instead of 'ann', along with the year. getMonthlySubset could test whether the file has 1 or 12 layers and determine how to read the monthly data from there (1 = the current way of looping over monthly files, 12 = looping over annual files and processing monthly layers).
How should the output file be structured? Always per month, or match the input data structure? This would also depend on the eventual output formats we support - netcdf would be more likely to combine months/years into one file.
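The 1-vs-12-layer dispatch described above could look something like the sketch below. The function name, the (file_label, band) tuple shape, and the date-string formats are all hypothetical, just to make the two read strategies concrete.

```python
def monthly_read_plan(n_layers, year, months):
    """Map requested months to (file_label, band) reads.

    Hypothetical helper: 1-layer files mean one file per month
    (PRISM-style); 12-layer files mean one annual file with one band
    per month (Daymet-style).
    """
    if n_layers == 1:
        # Current approach: loop over per-month files, reading band 1.
        return [(f"{year}-{m:02d}", 1) for m in months]
    elif n_layers == 12:
        # Daymet-style: one annual file, month number selects the band.
        return [(str(year), m) for m in months]
    raise ValueError(f"unexpected layer count: {n_layers}")
```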
output file structure and format
Right now, we output the results as one file per variable per time increment. That file is a GeoTIFF if it's raster output or a CSV if it's point extraction output. It would be good if we could have the option to combine variables and/or time increments in the output. This is particularly true for point output, but it would also be a handy option for when we support NetCDF output for rasters as well.
Issue #31 is closely related: if we end up processing stacks of rasters, that makes combining the results more straightforward.
stacking instead of explicitly looping over layers in the same dataset?
In R, I'm used to stacking multiple raster files from the same dataset; many operations on the stack, e.g. reprojecting or clipping, then apply to each layer. Currently, extractData is being called in a loop iterating over variables, years, and months (if applicable). Is looping over individual files the most efficient method? If not, are alternative, faster methods amenable to the move to HPC?
This question came up in #30 implementing monthly data support when the input files are 12-band rasters per year. Do we still loop over months by reading different bands from the same file? Or is there the option to use a kind of stacking? Or something else?
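The stacked-versus-looped contrast can be shown with plain NumPy (the same idea applies to an xarray DataArray with a time dimension). A sketch, with made-up 4x4 layers standing in for monthly rasters:

```python
import numpy as np

# Three hypothetical 4x4 monthly layers of the same variable.
layers = [np.full((4, 4), float(m)) for m in (1, 2, 3)]

# Looping approach: clip each layer separately.
clipped_loop = [layer[1:3, 1:3] for layer in layers]

# Stacked approach: one 3-D array, one slicing operation for all layers.
stack = np.stack(layers)            # shape (3, 4, 4), axis 0 = time
clipped_stack = stack[:, 1:3, 1:3]  # shape (3, 2, 2)

assert all(np.array_equal(a, b) for a, b in zip(clipped_loop, clipped_stack))
```

Whether stacking is actually faster here depends on I/O patterns and memory, which is exactly the question this issue raises.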
building bounding box geometry should include crs data
Currently, the user specifies the bounding box coordinates. I assume the intent is for these to be lon/lat (which should be stated somewhere if that is indeed the case). The clip function assumes the geometry has the same CRS as the raster if one isn't provided, but that won't always be the case for us.
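One way to make the lon/lat assumption explicit rather than implicit is to carry the CRS with the bounding box itself. A minimal sketch (the BBox class and its field names are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class BBox:
    """Hypothetical bounding-box container that always carries a CRS.

    Defaulting to EPSG:4326 makes the lon/lat assumption explicit
    instead of silently inheriting the target raster's CRS.
    """
    xmin: float
    ymin: float
    xmax: float
    ymax: float
    crs: str = "EPSG:4326"

# Jornada-area box in lon/lat; a user could instead pass crs="EPSG:32613".
box = BBox(-107.0, 32.5, -106.5, 33.0)
```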
metadata delivery
We need to implement metadata delivery with every data download, probably as a text file in some format (JSON?) that includes (at least):
- Which datasets and variables are included.
- Temporal range and granularity.
- Spatial boundaries.
- Grid size (if relevant).
- Reprojection method(s) (if relevant).
- Complete API call URL.
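If JSON is the choice, the metadata file could take roughly the shape below. All field names and the example values are illustrative, not a settled schema:

```python
import json

# Hypothetical per-download metadata document covering the items listed above.
metadata = {
    "datasets": [{"id": "PRISM", "variables": ["ppt", "tmax"]}],
    "temporal": {"start": "2000-01", "end": "2000-12", "grain": "monthly"},
    "spatial": {"bbox": [-107.0, 32.5, -106.5, 33.0], "crs": "EPSG:4326"},
    "grid_size": {"value": 4000, "units": "meters"},
    "resampling": "bilinear",
    "request_url": "https://example.org/subset?datasets=PRISM...",
}

# Serialize for delivery alongside the data files.
text = json.dumps(metadata, indent=2)
```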
sub-monthly but non-daily datasets
Satellite mission data don't always fit into the daily or monthly temporal increment categories. The MODIS NDVI product that is currently the choice for the pilot is 8-day, so I'm including this issue in the pilot milestone.
Could the daily data implementation (#13) be general enough to recognize individual dates but not require every date within the requested date range?
I think this is also related to #31: instead of looping over every time increment in the requested date range (as is done in the current monthly implementation), we can apply geospatial operations on a DataArray or Dataset whose filtered/concatenated time dimension matches the relevant date range.
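The "recognize individual dates without requiring every date" idea can be sketched with the standard library: enumerate the product's composite start dates, then keep only those inside the requested range. The helper names are hypothetical, and the Jan-1 restart is the MODIS-style convention (composites restart on Jan 1 each year).

```python
from datetime import date, timedelta

def composite_dates(year, step_days=8):
    """All composite start dates in a year for an n-day product
    (MODIS-style: composites restart on Jan 1 of each year)."""
    d, out = date(year, 1, 1), []
    while d.year == year:
        out.append(d)
        d += timedelta(days=step_days)
    return out

def dates_in_range(dates, start, end):
    """Keep only the composite dates inside the requested range."""
    return [d for d in dates if start <= d <= end]

all_2020 = composite_dates(2020)  # 46 composites in a year
subset = dates_in_range(all_2020, date(2020, 3, 1), date(2020, 3, 31))
```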
dataset build file management
We need to implement a system for managing and cleaning up dataset build artifacts on the server. This will eventually intersect with #5.
dataset checkout
The library implementation needs to be modified so that datasets are "checked out" per request, meaning that each request has its own dataset instance(s). This will be important for efficient parallel processing.
configuration system
The GeoCDL needs a configuration system to make it easy to customize things like local data store location, scratch/build space location, cache location, etc.
dataset variable representation
How should we handle dataset variable/layer names for datasets that have different sets of variables depending on the temporal resolution (DAYMET, e.g.)? As currently implemented, all possible variables are reported for a dataset. An alternative would be to store variable sets per temporal resolution, but this would potentially make usage/implementation more complicated for relatively little gain.
automated tests
As this project expands in complexity and scope, we really need to implement an automated testing framework for QA/QC as new code is developed.
background dataset builds
We need to implement a system for building datasets in the background (i.e., the API call returns immediately) and notifying the user (via email, presumably) when the dataset is ready. This could be used for every dataset download or we could implement rules for deciding when a dataset build is likely to return quickly.
vector datasets
Some datasets will be vector instead of raster, e.g. county-level data. Temporal characteristics will be the same as raster data, but not all spatial characteristics, e.g. grid size. So the catalog should be flexible enough to support that.
We will need to handle processing both directions for users if they want to merge the two types: rasterizing polygons or summarizing raster cells to polygons.
Request logging
We will need to implement a request logging system.
Add subset geometries to metadata output
We need to include subset geometries as part of the metadata output, but considering that we want to eventually support uploading large/complex subset geometries, shoving them into the metadata JSON is probably not a good solution.
porting dataset classes to new core architecture
Currently, only PRISM is done.
reproject clipping polygons to dataset CRS
Right now, datasets are always transformed to the target CRS prior to clipping, but for many datasets it will be more efficient to transform the polygon instead.
factor in-memory caching out of Daymet dataset
In-memory data caching is currently implemented directly in the Daymet dataset class. This should be factored out of the class into a separate component so it is easily re-usable elsewhere and also easy to test.
epsg codes assumed to be 4 digits
Currently, getSubset() throws an error if the provided crs EPSG code isn't length 4, but EPSG codes can actually be 4 or 5 digits. For example, all Jornada spatial files are UTM 13N WGS84, or EPSG code 32613. I confirmed that CRS() from pyproj accepts this code.
>>> from pyproj import CRS
>>> CRS(32613)
<Projected CRS: EPSG:32613>
Name: WGS 84 / UTM zone 13N
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: Between 108°W and 102°W, northern hemisphere between equator and 84°N, onshore and offshore. Canada - Northwest Territories (NWT); Nunavut; Saskatchewan. Mexico. United States (USA).
- bounds: (-108.0, 0.0, -102.0, 84.0)
Coordinate Operation:
- name: UTM zone 13N
- method: Transverse Mercator
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
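A sketch of a less brittle check than string length (validate_epsg is a hypothetical name). Range-checking the integer covers both 4- and 5-digit codes; the EPSG registry assigns codes in the 1024-32767 range, though delegating validation entirely to pyproj's CRS() would be even more robust.

```python
def validate_epsg(code):
    """Accept 4- or 5-digit EPSG codes instead of assuming 4 digits.

    Sketch only: checks the EPSG registry's documented code range
    (1024-32767) rather than string length.
    """
    code = int(code)
    if not 1024 <= code <= 32767:
        raise ValueError(f"{code} is not a valid EPSG code")
    return code
```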
new dataset requests
Based on GeoCDL monthly meeting notes from June 2, 2021: we could have a request form, similar to the software installation request form, for taking a user's requested data product and moving it to a derived-product storage space for others to access.
daily dataset processing
Need to implement framework for processing daily-scale data requests.
abstractions and implementation for remote dataset support
Currently, dataset abstractions (e.g., dataset abstract base classes) are geared toward local datasets. A similar framework needs to be designed and implemented for remote datasets and unified with the local dataset abstractions. Thus, this issue is both a major software engineering task and implementation task.
arbitrary clipping regions
We need to support specification of arbitrary clipping polygons while retaining the ease-of-use of bounding box corner specification for the simple case.
catalog describing file structure/dimensions
Can we have something in the catalog to describe how the dataset stores its variables and temporal increments? For example, we currently expect PRISM to have one file per variable per year or per month. DaymetV4 also splits variables and years into different files, but stores monthly data in a 12-band file. Other datasets will store multiple variables in a file. I think having some indication of this in the catalog would be better than determining it in the subsetting/extracting functions, since the method for determining the structure might differ among datasets. Then we can have different subset/extract approaches per general structure type as needed.
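A possible shape for such a catalog field, using the PRISM and DaymetV4 examples from above. The field names, structure labels, and lookup helper are all hypothetical:

```python
# Hypothetical catalog entries describing on-disk layout per dataset.
CATALOG = {
    "PRISM": {
        # One file per variable per time increment, single band.
        "file_structure": "one_file_per_variable_per_increment",
        "bands_per_file": 1,
    },
    "DaymetV4": {
        # One file per variable per year; monthly files hold 12 bands.
        "file_structure": "one_file_per_variable_per_year",
        "bands_per_file": {"annual": 1, "monthly": 12},
    },
}

def bands_for(dataset, grain):
    """Look up how many bands a file holds for the requested date grain."""
    bands = CATALOG[dataset]["bands_per_file"]
    return bands[grain] if isinstance(bands, dict) else bands
```

The subset/extract code could then branch on file_structure instead of probing each file.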
built-in polygons for subsetting
It would be useful to have commonly used boundary polygons (e.g., states, counties) built into the system so that they could be requested directly (e.g., by name) without explicitly providing the clipping polygons.
new dataset ingestion process
It would be nice to have either a (partially) automated new-dataset creation process or at least some checks on a dataset catalog entry. First, it could save some time/work in expanding the library (plus it would be nice to stream metadata along with data when we access remote data). It would also reduce errors: while testing #14, I requested PRISM at 8000 m resolution. I'm used to PRISM being described as 4 km, and I double-checked that it was entered into the catalog as 4000 m, but the .bil in my testing data is actually at 0.04166667-degree resolution, which explains why I got back one cell in my test. I may or may not have the intended version of PRISM in my local data for testing (we will likely standardize this in #29), but similar errors could happen down the line as we expand the library.
temporal summaries
We should support temporal summaries like calculating growing-season PPT sums from monthly/daily PPT layers. It could reduce the amount of data returned to the user if their analyses use seasonal data, plus it could be more appropriate to do these temporal summaries before any reprojections, especially if there is significant spatial up-scaling.
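The growing-season sum is a one-line reduction once the monthly layers are stacked along a time axis. A NumPy sketch with made-up 3x3 layers (the April-September window is just an example season):

```python
import numpy as np

# Twelve hypothetical monthly PPT layers (index 0 = January);
# each cell in month m holds the value m + 1 for illustration.
monthly_ppt = np.stack([np.full((3, 3), float(m + 1)) for m in range(12)])

# Growing-season (April-September, indices 3..8) PPT sum, computed on
# the native grid before any reprojection/up-scaling.
growing_season_sum = monthly_ppt[3:9].sum(axis=0)
```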
add SRTM DEM dataset
The title says it all.
Mixed date grain requests
What if the user wants one dataset that is annual only (e.g. when we add landcover) and another that uses some other date grain (e.g. the MODIS NDVI remote data I'm trying to add right now is only sub-monthly)? Only one format is currently accepted for start and end dates. Do we require separate requests, or have default ways to handle it, as for non-temporal data (#17)?
metadata standards
There are metadata standards for geospatial data like ISO 19115-2. At a minimum, we could peruse this for ideas about what goes in the catalog and what metadata need to be passed along to the user.
eventual GUI?
Copying recent SCINet newsletter about Open OnDemand:
Open OnDemand for Virtual Desktops and Web Apps on Ceres and Atlas
Open OnDemand is now available. This software provides web-browser access to high-performance computing resources on Ceres and Atlas, including virtual desktop environments and scientific web applications. Available apps include:
- A collection of Open OnDemand core apps, including a File Manager for browser-based file system access, a lightweight File Editor, a Shell App for in-browser command-line access, and a Job Composer App for creation and management of Slurm batch jobs
- Virtual Desktop Environment (CentOS Linux)
- JupyterLab and RStudio Server (with more web apps coming soon)
Open OnDemand allows user App Development, enabling SCINet users to develop and deploy private custom web apps on Ceres or Atlas from their home directory. Custom Interactive Apps launch apps as Slurm jobs on compute nodes, run with the SCINet user’s privileges, and allow access to Ceres/Atlas parallel file systems (see the Open OnDemand Interactive Apps development documentation for more details). Open OnDemand app development is currently opt-in; to enable it for your account, please contact the SCINet VRSC ([email protected]), specify the system (Ceres or Atlas), and provide a brief justification.
To get started with Open OnDemand on Atlas, see the Atlas Documentation. Preliminary documentation for accessing the Ceres Open OnDemand is available in the SCINet RStudio Server guide, with additional documentation updates to follow.
R package
We need to scope and implement a high-level R package interface to the REST API.
better implementation for request parameter passing
Request parameters are currently being passed around the framework as direct function parameters, but that approach is brittle and obscures responsibility for parameter validation. A more robust solution is needed.
Python package
We need to scope and implement a high-level Python package interface to the REST API.
point-level data extraction
We need to implement the ability for users to provide geographic points (e.g., lat/long coordinates) and extract either values for dataset grid cells containing those points or interpolated dataset values at those points. For the initial implementation, points can be provided directly as part of the GET parameters.
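The grid-cell-containing-the-point variant amounts to mapping coordinates to row/column indices. A NumPy sketch for a north-up grid (the function and its parameters are hypothetical; it assumes points and grid share a CRS, and interpolation would be a separate step):

```python
import numpy as np

def extract_at_points(grid, x0, y0, cell_size, points):
    """Nearest-cell extraction: map (x, y) coordinates to row/column
    indices of a north-up grid whose top-left corner is (x0, y0)."""
    values = []
    for x, y in points:
        col = int((x - x0) // cell_size)
        row = int((y0 - y) // cell_size)  # rows count down from the top
        values.append(grid[row, col])
    return values

grid = np.arange(16).reshape(4, 4)  # toy 4x4 grid, rows top-to-bottom
vals = extract_at_points(grid, x0=0.0, y0=4.0, cell_size=1.0,
                         points=[(0.5, 3.5), (3.5, 0.5)])
```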
caching
When a user requests a custom dataset from the server, the dataset should be cached for some period of time. This will require:
- cache expiration and cleanup
- system for linking API requests to unique dataset IDs
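The request-to-dataset-ID link could be a content hash of the canonicalized request parameters, so that identical requests (regardless of parameter order) map to the same cached dataset. A stdlib sketch (cache_key is a hypothetical name):

```python
import hashlib
import json

def cache_key(params):
    """Derive a stable dataset ID from request parameters.

    Canonicalizing with sorted keys makes the ID independent of
    parameter order, so repeat requests hit the same cache entry.
    """
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

a = cache_key({"dataset": "PRISM", "years": "2000:2010"})
b = cache_key({"years": "2000:2010", "dataset": "PRISM"})
```

Expiration could then be a timestamp stored next to each keyed dataset directory.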
create GeoCDL Docker container
We need a Docker container for running the GeoCDL on a cluster.
adjusting spatial resolution
A key missing feature is adjusting the spatial resolution (i.e., grid size) of datasets.
intra-dataset variation in temporal coverage
Some datasets have different temporal coverage depending on the geographic area. E.g., DAYMET has more historical data for Puerto Rico than the continental U.S. What is the best way to handle this in the dataset metadata data structures?
initial user documentation
We need minimal user documentation for using the web API.