veinsoftheearth / rabpro Goto Github PK

View Code? Open in Web Editor NEW

35.0 4.0 8.0 23.96 MB

Delineating watershed basins and computing attribute statistics using Google Earth Engine

Home Page: https://VeinsOfTheEarth.github.io/rabpro

License: BSD 3-Clause "New" or "Revised" License

Python 81.26% R 0.62% Dockerfile 0.22% Shell 0.03% TeX 17.87%

python manuscript

rabpro's Introduction

Package to delineate watershed basins and compute attribute statistics using Google Earth Engine.

Setup

Software installation	Data configuration	Software configuration

Usage

See Example notebooks:

Data configuration	Basic workflow	Multiple basins workflow	Basin stats examples

Citation

The following text is the current citation for rabpro:

Schwenk, J., T. Zussman, J. Stachelek, and J. Rowland. (2022). rabpro: global watershed boundaries, river elevation profiles, and catchment statistics. Journal of Open Source Software, 7(73), 4237, https://doi.org/10.21105/joss.04237.

If you delineate watersheds, you should cite either or both (depending on your method) of HydroBasins:

Lehner, B., Grill G. (2013). Global river hydrography and network routing: baseline data and new approaches to study the world’s large river systems. Hydrological Processes, 27(15): 2171–2186. https://doi.org/10.1002/hyp.9740

or MERIT-Hydro:

Yamazaki, D., Ikeshima, D., Sosa, J., Bates, P. D., Allen, G. H., & Pavelsky, T. M. (2019). MERIT Hydro: A high‐resolution global hydrography map based on latest topography dataset. Water Resources Research, 55(6), 5053-5073. https://doi.org/10.1029/2019WR024873

Development

Testing

python -m pytest
python -m pytest -k "test_img"

Local docs build

cd docs && make html

Contributing

We welcome all forms of user contributions including feature requests, bug reports, code, and documentation requests - simply open an issue.

Note that rabpro adheres to Black code style and NumPy-style docstrings for documentation. We ask that contributions adhere to these standards as much as possible. For code development contributions, please contact us via email (rabpro at lanl [dot] gov) to be added to our slack channel where we can hash out a plan for your contribution.

rabpro's People

Contributors

Stargazers

Watchers

Forkers

jsta tzussman cyber-hydrology elbeejay davidchoi76 kkyong77 trellixvulnteam florinzai jzphlp

rabpro's Issues

Remove OpenCV dependency

See #48

Only one OpenCV function is used (findContours) in regionprops() in utils.py. If the skimage equivalent has the same behavior, use that instead. I think the function is currently optimized for OpenCV, so something like this could work, but long-term it'd be better to rewrite this for skimage.

Profiler takes redundant 'quiet' option

I notice that there is both a verbose and a quiet kwarg when instantiating the profiler. I think if verbose is False, quiet should be False.

Add function to build a GEE vector asset

Something like:

temp_dir = Path("temp")
basins.to_file(filename="temp/" + out_path + ".shp", driver='ESRI Shapefile')

with zipfile.ZipFile(out_path + ".zip", 'w') as zipf:
    for f in temp_dir.glob("*"):
        zipf.write(f, arcname=f.name)

Demo workflow for user contributed GEE assets

Currently the create_datapaths function looks for a file called "user_gee_datasets.json". How can one make such a file?

possible typo in basin_stats

I think "image" on subbasin_stats.py#L90 should be "Image". Can you confirm @tzussman?

get_merit_dem script error

Running the example in the docs I get the following error:

merit_dem(args.target, args.username, args.password)
  File "Data/scripts/get_merit_dem.py", line 32, in merit_dem
    url = [x["href"][2:] for x in soup.findAll("a", text=re.compile(filename), href=True)][0]
IndexError: list index out of range
make: *** [Makefile:2: merit_dem] Error 1

Add option to override appdirs folders

The entire merit dataset is huge. I can't store it on my C drive.

I favor checking for an environment variable like $rabpro_data:

os.environ['rabpro_data']

Functions for interacting with gee_datasets.json

Maybe you all are more adept at quick inspection of json objects but I could use some python functions to list available datasets ids/bands/(start/end)dates.

Add argument to get_datapaths to force rebuilding of VRTs

Does subbasin_stats work if the input is a MULTIPOLYGON?

Allow subbasin_stats to take a GEE featurecollection asset path instead of a geodataframe

Linting/formatting

Per Slack conversion, going through and linting/formatting py files using black.

Add proxy argument to get_merit_dem.py

Include a multibasin test dataset

Currently out tests run on only a single coordinate pair (see tests/data/test_coords.shp). Visualizing rabpro output on a single subbasin is kind of underwhelming.

Design method for tracking and including user-added datasets

When users add their own datasets (images/imagecollections) to GEE, they will also need to incorporate the metadata information somehow so rabpro has what it needs to compute statistics. It's not clear to me how this is supposed to happen (perhaps it's already designed?), but we should have a clear procedure. For datasets that we (rabpro developers) make public, these could be included in the fetched metadata file so that all users would have access to them (i.e. not just on our local machines).

Add an example of a custom subbasin_stats reducer function to docs or tests

One use-case might be returning land cover category stats using a "grouped" reducer:
https://developers.google.com/earth-engine/guides/reducers_grouping

Mismatched coords and da in test and test reference

I can't get tests/test.py to pass. One thing I see is that the coords and da are mismatched between the code used to generate the reference file (tests/basic_no_subbasin_stats.py) and the test file (tests/test.py). The former has:

coords = (32.97287, -88.15829)
da = 18680

while the latter has:

coords = (56.22659, -130.87974)
da = 1994

My investigation shows that the test object matches the info specified in test.py so the basic_no_subbasin_stats.py info should probably be changed to match.

Data folder should be renamed from DEM to MERIT_Hydro

This will be more informative and facilitate reuse with companion code packages.

Unable to collapse subbasin_stats results to take time averages

Right now the code maps over features to create a features * stat * time matrix.

You can add custom functions but I don't think they are properly nested to get time averages. They seem to just calculate additional stats in the features * stat * time matrix.

See: https://gis.stackexchange.com/a/365123/32531

basin_stats trouble

I'm having trouble running basin_stats:

rpo.basin_stats([Dataset("JRC/GSW1_3/GlobalSurfaceWater", "occurrence")])

Computing subbasin stats for JRC/GSW1_3/GlobalSurfaceWater...
Traceback (most recent call last):
File "", line 1, in
File "/home/jemma/Documents/Science/LosAlamos/Projects/rabpro/rabpro/core.py", line 296, in basin_stats
self.stats = ss.main(self.basins, datasets, verbose=self.verbose)
File "/home/jemma/Documents/Science/LosAlamos/Projects/rabpro/rabpro/subbasin_stats.py", line 155, in main
for f, header in reducer_funcs:
TypeError: 'NoneType' object is not iterable

On L296 of core.py the subbasin_stats.main function is run with 3 arguments (reducer_funcs is unspecified) but it seems that subbasin_stats.main expects the reducer_funcs argument to be specified (see subbasin_stats.py#L139).

elev_profile error

Running this formerly working snippet:

import geopandas as gpd
import rabpro
from rabpro import utils
from rabpro.subbasin_stats import Dataset

coords_file = gpd.read_file(r"tests/data/Big Blue River.geojson")
rpo = rabpro.profiler(coords_file)
rpo.delineate_basins()
rpo.elev_profile()

Gives an error:

UnboundLocalError: local variable 'flowpath' referenced before assignment

Remove 'jschwenk' from paths

We should replace 'jschwenk' in utils.py.get_datapaths() with 'rabpro' or something more generic. I would rather just eliminate the extra embedded directly entirely, but if I recall correctly appdirs requires it...

Excessive null values in subbasin stats

We seem to be having the problem described at: https://gis.stackexchange.com/q/407965/32531

Reduce gee metadata pulling to weekly

This would mean replacing the last star in L5 with a number for day of the week (0 3 * * 1).

Handling cases where the underlying raster resolution is ~ or > than the size of the polygon feature

We haven't thought about this much. Sometimes the watershed polygons are much smaller than a single pixel of the requested raster (e.g. GLDAS at 0.25 degrees). When the polygon overlaps only 3-4 of these pixels, originally I had code that would compute an areal-weighted average by intersecting the polygon with all the nearby pixels. I don't know how this could/should be handled in GEE, but it could be important.

See the comment here: https://groups.google.com/g/google-earth-engine-developers/c/2VG0uEFmKcU/m/PH-n8csCAwAJ which suggests weighted reducers. Actually the weighted reducers help page might offer a quick and easy solution: https://developers.google.com/earth-engine/guides/reducers_weighting

Turn off Windows in build workflow

Something is off with the file hash testing on the Windows build workflow causing unittest to fail. I feel like this is unrelated to actual package operation so we can turn it off.

Statistics of categorical GEE datasets

Looks like ee.Reducer.frequencyHistogram is the way to do this. Perhaps it could be added to the built-in list.

Remove RG dependency

Involves copying rivgraph.im_utils.regionprops() and .crop_binary_coords(). The OpenCV dependence might be removable if https://scikit-image.org/docs/dev/api/skimage.measure.html#skimage.measure.find_contours provides the exact functionality of OpenCV's findContours. However, this is a heavily-used operation so lots of testing would need to be done to ensure that it's a 1:1 swap. No need to remove OpenCV, just throwing that bit in.

Unpin geopandas dependency

See VeinsOfTheEarth/RivGraph#60

Don't date-filter image collection bands that are not time indexed

subbasin_stats is working fine for me using the example "image" Dataset ("JRC/GSW1_3/GlobalSurfaceWater", "occurrence")

However, "image_collection" entries like Dataset("JRC/GSW1_3/MonthlyRecurrence", "monthly_recurrence") give me either an empty file on my GDrive or an error:

TypeError: cannot unpack non-iterable NoneType object

depending on if test is True.

Add proxy argument to all requests.get calls

This would mean utils.get_datapaths and all functions that call get_datapaths.

cli downloading outside of the rabpro package

Inside the rabpro package I can download data with:

./rabpro/cli/rabpro download merit n30w090 <username> <password>

However, with rabpro installed in a different environment it seems like I should be able to run:

rabro download merit n30e150 <username> <password>

However, I get an error:

Command 'rabro' not found, did you mean:...

Create github releases

Running get_datapaths requires that gee_datasets.json is downloadable

Running get_datapaths requires that gee_datasets.json is downloadable. However, gee_datasets.json is in a private repo. Perhaps we could adjust get_datapaths to read from disk as a fall-back if the url is unreachable?

Can gee_datasets.json store extent information?

Change default export directory

When I exported the basin shapefiles (self.export('all')), it defaulted to c:\users\jon\results\name_of_run. I think the exports should probably be in the appdirs folder as well, no? Either that or force the user to supply a directory to result outputs...

Commit data directory structure

My preference would be to commit the Data directory structure expected by rabpro. The data itself would be added to the .gitignore and the git tracked folders would only contain .gitkeep files. I plan to open a PR showing this for us to discuss.

merit_dem and merit_hydro have different usernames + passwords

The get_merit_dem script assumes they are identical.

Allow for GEE datasets without bands

For example https://samapriya.github.io/awesome-gee-community-datasets/projects/geomorpho90/ has no bands just a collection name.

Update expected location of gee_catalog.json

When I try to run the following code post-merge of the gee conversion PR I get an error that doesn't seem correct.

rabpro.profiler(coords_file)

https://raw.githubusercontent.com/tzussman/rabpro/gee-conversion/Data/gee_catalog.json returned error status code 404. Download manually into /home/jemma/.local/share/rabpro/gee_datasets.json

The way I see it, gee_catalog.json is in the Data folder of this repo not in Tal's gee-conversion repo/fork.

Don't use dashes in subbasin_stats filenames

Can't use dashed strings as postgres column names

Including a geopackage layer of the MERIT grid

Then we could intersect a given "coords file" to figure out what data needs to be downloaded for a job.

The MERIT data is prepared as 5 degree x 5 degree tiles (6000 pixel x 6000 pixel) but it's packaged as 30 degree x 30 degree "megatiles". These megatile codes are the important piece of information needed to point to a specific data download.

Allow argument passing to build_vrt via get_datapaths

I'm specifically interested in passing an extent argument.

Hoping this will allow consistency in vrt coordinates between systems with the full MERIT available versus only partial MERIT.

Return None data if subbasin_stats output is sent to GDrive

This would make the return syntax identical whether or not test was activated.

Right now there's only a return statement if test is activated:

if test:
            return data, task

I propose adding:

else:
            return None, task

Allow data pathing to work without internet

I propose running this code to check for an internet connection before running du.download_gee_metadata in utils.get_datapaths:

https://stackoverflow.com/a/29854274/3362993

get_rabpropath converts paths to lowercase

The convention on my local machine is to use Title case for high level directories. As written, the rabpro create_folder function is cluttering my file system. Is there a reason for this lower case conversion? See https://github.com/jonschwenk/rabpro/blob/main/rabpro/utils.py#L31

Make data directory structure by default

When a user instantiates the profiler(), if they have not downloaded any data, they will get an error message stating that the ../rabpro/jschwenk data directory doesn't exist. I think we should create the data folders where they belong so that the structure is there if they download e.g. MERIT tiles via web browser. They'll know where to put them instead of trying to figure it out.

So rabpro should just try to create the empty folders and print a message that says "No DEM/HydroBasins data were found. Empty directories have been created at {} to store them. You can download the MERIT data with [name of merit downloading script]".

Or something like that.

Consider returning a getdownloadurl path from subbasin_stats if test is False

Return subbasin_stats files to GDrive without basin geometry information

This seems to be causing an error on large pulls:

ee.ee_exception.EEException: Request payload size exceeds the limit: 10485760 bytes.

Add ability to append a string to the subbasin_stats output file

Would be good to be able to append a "tag" like an id_outlet or id_basin to the file name that gets created on GDrive.