gcdl's Issues
non-temporal datasets
We need to implement full support for non-temporal spatial datasets (e.g., DEMs).
TileSet performance optimization
When testing with SRTM DEM tiles, requests that require ~14 or fewer tiles return quickly, but requests that require many more than that just churn at the mosaic-building stage. The process does not appear to be either CPU- or memory-bound, so it's not clear to me what is going on. Regardless, there is clearly optimization work needed.
locally stored tiles
We need to support datasets that are stored locally as tiles (e.g., GTOPO30).
improvement to grid size metadata
Grid size metadata can't assume the units are meters. Alternative units need to be supported and the unit needs to be included with the metadata.
subset geometry uploads
It would be useful if users could upload point-level datasets for point-level processing on the server and then download the results.
streaming from OPeNDAP?
I've been looking into options for streaming MODIS, particularly focusing on methods that could also be useful for other products. OPeNDAP seems like a potential option for NASA Earth data. I've been looking at LP DAAC Hyrax, but there seem to be other servers/products of potential interest for our group. There is a Python package, pydap, for accessing these kinds of servers. One general issue may be the requirement to authenticate with an EarthData login (we would need to review the user agreement, and decide what account would make the requests?), but pydap supports the authentication process if that is acceptable.
Spatial and temporal subsets can be made, and metadata queried, but formats might vary a bit depending on which server you are talking to.
A warning, though: I've only dabbled in reading about this option and have not yet successfully executed my test of extracting a few time steps of NDVI covering the Jornada. I can use basic curl commands to handle authentication and request the intended spatial extent from the intended NDVI product, but I'm not sure about the temporal part (I am pulling a few time steps, but I simply picked the first 4), and I haven't successfully read in the result. I was trying in R since that's what I know, but I can't tell whether or not it's an R netcdf4 issue: I can view the layers that I downloaded, but the geographic info is missing. I tried the pydap package and can make some example code download data from another server, but not for my MODIS/Hyrax test.
At this point, I'd need other people's input on whether this is worth continuing to explore as we try to include streaming of popular datasets that are too big to store on SCINet.
I found this LP DAAC webinar to be a helpful starting point.
Add additional dataset support for pilot project
Based on my notes from the summer, we were thinking the pilot project would use these datasets:
- climate: PRISM or daymet, which we have covered
- elevation: USGS DEM, can assume it will be stored locally
- NDVI: to be streamed. Did we decide on product and source for streaming?
output format options
The GCDL needs to support specification of alternative output formats.
point data formatting
We need consistent input (x,y) point formatting for endpoints.
different resampling methods
Different use cases will need different resampling methods used in reprojection. We are calling reproject from the rioxarray package, which appears to default to the nearest neighbor method. That's the fastest, but it won't be advisable for most use cases.
It would be nice to have an option for the user to choose from the available resampling methods. To start, we could do something like defaulting to nearest neighbor for categorical variables (when we support them) and bilinear for continuous variables?
Note: rioxarray uses rasterio.warp.reproject, which in turn uses GDAL. The available methods will depend on the GDAL version installed.
rasterio supported methods
GDAL methods doc
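The proposed default-plus-override behavior could be sketched as below. This is illustrative only: the VarType enum, mapping, and choose_resampling helper are hypothetical names, and the method strings simply mirror rasterio's Resampling enum member names (nearest, bilinear, cubic, etc.).

```python
from enum import Enum

class VarType(Enum):
    CATEGORICAL = "categorical"
    CONTINUOUS = "continuous"

# Hypothetical default mapping; the string values correspond to
# rasterio Resampling enum member names.
DEFAULT_RESAMPLING = {
    VarType.CATEGORICAL: "nearest",
    VarType.CONTINUOUS: "bilinear",
}

def choose_resampling(var_type, user_choice=None):
    """Return the user's requested method if given, else a per-type default."""
    return user_choice if user_choice is not None else DEFAULT_RESAMPLING[var_type]
```

The returned string would then be converted to the corresponding rasterio Resampling member before calling reproject.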
daymet dataset currently only supporting annual data
Only the annual files are supported at the moment.
Unlike PRISM, Daymet combines all 12 months into the same file. The file can be identified by the string 'mon' instead of 'ann', along with the year. getMonthlySubset could test whether the file has 1 or 12 layers and determine how to read the monthly data from there (1 = the current way of looping over monthly files, 12 = looping over annual files and processing monthly layers).
How should the output file be structured? Always per month, or match the input data structure? This would also depend on the eventual output formats we support - netcdf would be more likely to combine months/years into one file.
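The 1-vs-12-layer dispatch described above could look something like the sketch below. The function name, the (file_label, band) tuple shape, and the date-string formats are all hypothetical, just to make the two read strategies concrete.

```python
def monthly_read_plan(n_layers, year, months):
    """Map requested months to (file_label, band) reads.

    Hypothetical helper: 1-layer files mean one file per month
    (PRISM-style); 12-layer files mean one annual file with one band
    per month (Daymet-style).
    """
    if n_layers == 1:
        # Current approach: loop over per-month files, reading band 1.
        return [(f"{year}-{m:02d}", 1) for m in months]
    elif n_layers == 12:
        # Daymet-style: one annual file, month number selects the band.
        return [(str(year), m) for m in months]
    raise ValueError(f"unexpected layer count: {n_layers}")
```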
output file structure and format
Right now, we output the results as one file per variable per time increment. That file is a GeoTIFF if it's raster output or a CSV if it's point extraction output. It would be good if we could have the option to combine variables and/or time increments in the output. This is particularly true for point output, but it would also be a handy option for when we support NetCDF output for rasters as well.
Issue #31 is closely related: if we end up processing stacks of rasters, that makes combining the results more straightforward.
stacking instead of explicitly looping over layers in the same dataset?
In R, I'm used to stacking multiple raster files from the same dataset; many operations on the stack, e.g. reprojecting or clipping, then apply to each layer. Currently, extractData is being called in a loop iterating over variables, years, and months (if applicable). Is looping over individual files the most efficient method? If not, are alternative, faster methods amenable to the move to HPC?
This question came up in #30 implementing monthly data support when the input files are 12-band rasters per year. Do we still loop over months by reading different bands from the same file? Or is there the option to use a kind of stacking? Or something else?
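The stacked-versus-looped contrast can be shown with plain NumPy (the same idea applies to an xarray DataArray with a time dimension). A sketch, with made-up 4x4 layers standing in for monthly rasters:

```python
import numpy as np

# Three hypothetical 4x4 monthly layers of the same variable.
layers = [np.full((4, 4), float(m)) for m in (1, 2, 3)]

# Looping approach: clip each layer separately.
clipped_loop = [layer[1:3, 1:3] for layer in layers]

# Stacked approach: one 3-D array, one slicing operation for all layers.
stack = np.stack(layers)            # shape (3, 4, 4), axis 0 = time
clipped_stack = stack[:, 1:3, 1:3]  # shape (3, 2, 2)

assert all(np.array_equal(a, b) for a, b in zip(clipped_loop, clipped_stack))
```

Whether stacking is actually faster here depends on I/O patterns and memory, which is exactly the question this issue raises.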
building bounding box geometry should include crs data
Currently, the user specifies the bounding box coordinates. I assume the intent is for these to be lon/lat (which should be stated somewhere if that is indeed the case). The clip function assumes the geometry has the same CRS as the raster if one isn't provided, but that won't always be the case for us.
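One way to make the lon/lat assumption explicit rather than implicit is to carry the CRS with the bounding box itself. A minimal sketch (the BBox class and its field names are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class BBox:
    """Hypothetical bounding-box container that always carries a CRS.

    Defaulting to EPSG:4326 makes the lon/lat assumption explicit
    instead of silently inheriting the target raster's CRS.
    """
    xmin: float
    ymin: float
    xmax: float
    ymax: float
    crs: str = "EPSG:4326"

# Jornada-area box in lon/lat; a user could instead pass crs="EPSG:32613".
box = BBox(-107.0, 32.5, -106.5, 33.0)
```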
metadata delivery
We need to implement metadata delivery with every data download, probably as a text file in some format (JSON?) that includes (at least):
- Which datasets and variables are included.
- Temporal range and granularity.
- Spatial boundaries.
- Grid size (if relevant).
- Reprojection method(s) (if relevant).
- Complete API call URL.
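If JSON is the choice, the metadata file could take roughly the shape below. All field names and the example values are illustrative, not a settled schema:

```python
import json

# Hypothetical per-download metadata document covering the items listed above.
metadata = {
    "datasets": [{"id": "PRISM", "variables": ["ppt", "tmax"]}],
    "temporal": {"start": "2000-01", "end": "2000-12", "grain": "monthly"},
    "spatial": {"bbox": [-107.0, 32.5, -106.5, 33.0], "crs": "EPSG:4326"},
    "grid_size": {"value": 4000, "units": "meters"},
    "resampling": "bilinear",
    "request_url": "https://example.org/subset?datasets=PRISM...",
}

# Serialize for delivery alongside the data files.
text = json.dumps(metadata, indent=2)
```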
sub-monthly but non-daily datasets
Satellite mission data don't always fit into the daily or monthly temporal increment categories. The MODIS NDVI product that is currently the choice for the pilot is 8-day, so I'm including this issue in the pilot milestone.
Could the daily data implementation (#13) be general enough to recognize individual dates but not require every date within the requested date range?
I think this is also related to #31: instead of looping over every time increment in the requested date range (as is done in the current monthly implementation), we can apply geospatial operations on a DataArray or Dataset whose filtered/concatenated time dimension matches the relevant date range.
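The "recognize individual dates without requiring every date" idea can be sketched with the standard library: enumerate the product's composite start dates, then keep only those inside the requested range. The helper names are hypothetical, and the Jan-1 restart is the MODIS-style convention (composites restart on Jan 1 each year).

```python
from datetime import date, timedelta

def composite_dates(year, step_days=8):
    """All composite start dates in a year for an n-day product
    (MODIS-style: composites restart on Jan 1 of each year)."""
    d, out = date(year, 1, 1), []
    while d.year == year:
        out.append(d)
        d += timedelta(days=step_days)
    return out

def dates_in_range(dates, start, end):
    """Keep only the composite dates inside the requested range."""
    return [d for d in dates if start <= d <= end]

all_2020 = composite_dates(2020)  # 46 composites in a year
subset = dates_in_range(all_2020, date(2020, 3, 1), date(2020, 3, 31))
```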
dataset build file management
We need to implement a system for managing and cleaning up dataset build artifacts on the server. This will eventually intersect with #5.
dataset checkout
The library implementation needs to be modified so that datasets are "checked out" per request, meaning that each request has its own dataset instance(s). This will be important for efficient parallel processing.
configuration system
The GeoCDL needs a configuration system to make it easy to customize things like local data store location, scratch/build space location, cache location, etc.
dataset variable representation
How should we handle dataset variable/layer names for datasets that have different sets of variables depending on the temporal resolution (DAYMET, e.g.)? As currently implemented, all possible variables are reported for a dataset. An alternative would be to store variable sets per temporal resolution, but this would potentially make usage/implementation more complicated for relatively little gain.
automated tests
As this project expands in complexity and scope, we really need to implement an automated testing framework for QA/QC as new code is developed.
background dataset builds
We need to implement a system for building datasets in the background (i.e., the API call returns immediately) and notifying the user (via email, presumably) when the dataset is ready. This could be used for every dataset download or we could implement rules for deciding when a dataset build is likely to return quickly.
vector datasets
Some datasets will be vector instead of raster, e.g. county-level data. Temporal characteristics will be the same as raster data, but not all spatial characteristics, e.g. grid size. So the catalog should be flexible enough to support that.
We will need to handle processing both directions for users if they want to merge the two types: rasterizing polygons or summarizing raster cells to polygons.
Request logging
We will need to implement a request logging system.
Add subset geometries to metadata output
We need to include subset geometries as part of the metadata output, but considering that we want to eventually support uploading large/complex subset geometries, shoving them into the metadata JSON is probably not a good solution.
porting dataset classes to new core architecture
Currently, only PRISM is done.
reproject clipping polygons to dataset CRS
Right now, datasets are always transformed to the target CRS prior to clipping, but for many datasets it will be more efficient to transform the polygon instead.
factor in-memory caching out of Daymet dataset
In-memory data caching is currently implemented directly in the Daymet dataset class. This should be factored out of the class into a separate component so it is easily re-usable elsewhere and also easy to test.
epsg codes assumed to be 4 digits
Currently, getSubset() throws an error if the provided crs EPSG code isn't length 4, but EPSG codes can actually be 4 or 5 digits. For example, all Jornada spatial files are UTM 13N WGS84, or EPSG code 32613. I confirmed that CRS() from pyproj accepts this code.
>>> from pyproj import CRS
>>> CRS(32613)
<Projected CRS: EPSG:32613>
Name: WGS 84 / UTM zone 13N
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: Between 108°W and 102°W, northern hemisphere between equator and 84°N, onshore and offshore. Canada - Northwest Territories (NWT); Nunavut; Saskatchewan. Mexico. United States (USA).
- bounds: (-108.0, 0.0, -102.0, 84.0)
Coordinate Operation:
- name: UTM zone 13N
- method: Transverse Mercator
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
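A sketch of a less brittle check than string length (validate_epsg is a hypothetical name). Range-checking the integer covers both 4- and 5-digit codes; the EPSG registry assigns codes in the 1024-32767 range, though delegating validation entirely to pyproj's CRS() would be even more robust.

```python
def validate_epsg(code):
    """Accept 4- or 5-digit EPSG codes instead of assuming 4 digits.

    Sketch only: checks the EPSG registry's documented code range
    (1024-32767) rather than string length.
    """
    code = int(code)
    if not 1024 <= code <= 32767:
        raise ValueError(f"{code} is not a valid EPSG code")
    return code
```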
new dataset requests
Based on GeoCDL monthly meeting notes from June 2, 2021: we could have a request form, similar to the software installation request form, for taking a user's requested data product and moving it to a derived-product storage space for others to access.
daily dataset processing
Need to implement framework for processing daily-scale data requests.
abstractions and implementation for remote dataset support
Currently, dataset abstractions (e.g., dataset abstract base classes) are geared toward local datasets. A similar framework needs to be designed and implemented for remote datasets and unified with the local dataset abstractions. Thus, this issue is both a major software engineering task and implementation task.
arbitrary clipping regions
We need to support specification of arbitrary clipping polygons while retaining the ease-of-use of bounding box corner specification for the simple case.
catalog describing file structure/dimensions
Can we have something in the catalog to describe how the dataset stores its variables and temporal increments? For example, we currently expect PRISM to have one file per variable per year or per month. DaymetV4 also splits variables and years into different files, but stores monthly data in a 12-band file. Other datasets will store multiple variables in a file. I think having some indication of this in the catalog would be better than determining it in the subsetting/extracting functions, since the method for determining the structure might differ among datasets. Then we can have different subset/extract approaches per general structure type as needed.
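A possible shape for such a catalog field, using the PRISM and DaymetV4 examples from above. The field names, structure labels, and lookup helper are all hypothetical:

```python
# Hypothetical catalog entries describing on-disk layout per dataset.
CATALOG = {
    "PRISM": {
        # One file per variable per time increment, single band.
        "file_structure": "one_file_per_variable_per_increment",
        "bands_per_file": 1,
    },
    "DaymetV4": {
        # One file per variable per year; monthly files hold 12 bands.
        "file_structure": "one_file_per_variable_per_year",
        "bands_per_file": {"annual": 1, "monthly": 12},
    },
}

def bands_for(dataset, grain):
    """Look up how many bands a file holds for the requested date grain."""
    bands = CATALOG[dataset]["bands_per_file"]
    return bands[grain] if isinstance(bands, dict) else bands
```

The subset/extract code could then branch on file_structure instead of probing each file.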
built-in polygons for subsetting
It would be useful to have commonly used boundary polygons (e.g., states, counties) built into the system so that they could be requested directly (e.g., by name) without explicitly providing the clipping polygons.
new dataset ingestion process
It would be nice to have either a (partially) automated new-dataset creation process or at least some checks on a dataset catalog entry. First, it could save some time/work in expanding the library (plus it would be nice to stream metadata along with data when we access remote data). It would also reduce errors: while testing #14, I requested PRISM at 8000 m resolution. I'm used to PRISM being described as 4 km, and I double-checked that it was entered into the catalog as 4000 m, but the .bil in my testing data is actually at 0.04166667-degree resolution, which explains why I got back one cell in my test. I may or may not have the intended version of PRISM in my local data for testing (we will likely standardize this in #29), but similar errors could happen down the line as we expand the library.
temporal summaries
We should support temporal summaries like calculating growing-season PPT sums from monthly/daily PPT layers. It could reduce the amount of data returned to the user if their analyses use seasonal data, plus it could be more appropriate to do these temporal summaries before any reprojections, especially if there is significant spatial up-scaling.
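The growing-season sum is a one-line reduction once the monthly layers are stacked along a time axis. A NumPy sketch with made-up 3x3 layers (the April-September window is just an example season):

```python
import numpy as np

# Twelve hypothetical monthly PPT layers (index 0 = January);
# each cell in month m holds the value m + 1 for illustration.
monthly_ppt = np.stack([np.full((3, 3), float(m + 1)) for m in range(12)])

# Growing-season (April-September, indices 3..8) PPT sum, computed on
# the native grid before any reprojection/up-scaling.
growing_season_sum = monthly_ppt[3:9].sum(axis=0)
```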
add SRTM DEM dataset
The title says it all.
Mixed date grain requests
What if the user wants one dataset that is annual only (e.g. when we add landcover) and another that uses some other date grain (e.g. the MODIS NDVI remote data I'm trying to add right now is only sub-monthly)? Only one format is currently accepted for start and end dates. Do we require separate requests, or have default ways to handle it, as for non-temporal data (#17)?
metadata standards
There are metadata standards for geospatial data like ISO 19115-2. At a minimum, we could peruse this for ideas about what goes in the catalog and what metadata need to be passed along to the user.
eventual GUI?
Copying recent SCINet newsletter about Open OnDemand:
Open OnDemand for Virtual Desktops and Web Apps on Ceres and Atlas
Open OnDemand is now available. This software provides web-browser access to high-performance computing resources on Ceres and Atlas, including virtual desktop environments and scientific web applications. Available apps include:
- A collection of Open OnDemand core apps, including a File Manager for browser-based file system access, a lightweight File Editor, a Shell App for in-browser command-line access, and a Job Composer App for creation and management of Slurm batch jobs
- Virtual Desktop Environment (CentOS Linux)
- JupyterLab and RStudio Server (with more web apps coming soon)
Open OnDemand allows user App Development, enabling SCINet users to develop and deploy private custom web apps on Ceres or Atlas from their home directory. Custom Interactive Apps launch apps as Slurm jobs on compute nodes, run with the SCINet user’s privileges, and allow access to Ceres/Atlas parallel file systems (see the Open OnDemand Interactive Apps development documentation for more details). Open OnDemand app development is currently opt-in; to enable it for your account, please contact the SCINet VRSC ([email protected]), specify the system (Ceres or Atlas), and provide a brief justification.
To get started with Open OnDemand on Atlas, see the Atlas Documentation. Preliminary documentation for accessing the Ceres Open OnDemand is available in the SCINet RStudio Server guide, with additional documentation updates to follow.
R package
We need to scope and implement a high-level R package interface to the REST API.
better implementation for request parameter passing
Request parameters are currently being passed around the framework as direct function parameters, but that approach is brittle and obscures responsibility for parameter validation. A more robust solution is needed.
Python package
We need to scope and implement a high-level Python package interface to the REST API.
point-level data extraction
We need to implement the ability for users to provide geographic points (e.g., lat/long coordinates) and extract either values for dataset grid cells containing those points or interpolated dataset values at those points. For the initial implementation, points can be provided directly as part of the GET parameters.
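The grid-cell-containing-the-point variant amounts to mapping coordinates to row/column indices. A NumPy sketch for a north-up grid (the function and its parameters are hypothetical; it assumes points and grid share a CRS, and interpolation would be a separate step):

```python
import numpy as np

def extract_at_points(grid, x0, y0, cell_size, points):
    """Nearest-cell extraction: map (x, y) coordinates to row/column
    indices of a north-up grid whose top-left corner is (x0, y0)."""
    values = []
    for x, y in points:
        col = int((x - x0) // cell_size)
        row = int((y0 - y) // cell_size)  # rows count down from the top
        values.append(grid[row, col])
    return values

grid = np.arange(16).reshape(4, 4)  # toy 4x4 grid, rows top-to-bottom
vals = extract_at_points(grid, x0=0.0, y0=4.0, cell_size=1.0,
                         points=[(0.5, 3.5), (3.5, 0.5)])
```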
caching
When a user requests a custom dataset from the server, the dataset should be cached for some period of time. This will require:
- cache expiration and cleanup
- system for linking API requests to unique dataset IDs
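The request-to-dataset-ID link could be a content hash of the canonicalized request parameters, so that identical requests (regardless of parameter order) map to the same cached dataset. A stdlib sketch (cache_key is a hypothetical name):

```python
import hashlib
import json

def cache_key(params):
    """Derive a stable dataset ID from request parameters.

    Canonicalizing with sorted keys makes the ID independent of
    parameter order, so repeat requests hit the same cache entry.
    """
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

a = cache_key({"dataset": "PRISM", "years": "2000:2010"})
b = cache_key({"years": "2000:2010", "dataset": "PRISM"})
```

Expiration could then be a timestamp stored next to each keyed dataset directory.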
create GeoCDL Docker container
We need a Docker container for running the GeoCDL on a cluster.
adjusting spatial resolution
A key missing feature is adjusting the spatial resolution (i.e., grid size) of datasets.
intra-dataset variation in temporal coverage
Some datasets have different temporal coverage depending on the geographic area. E.g., DAYMET has more historical data for Puerto Rico than the continental U.S. What is the best way to handle this in the dataset metadata data structures?
initial user documentation
We need minimal user documentation for using the web API.