nasaharvest / openmapflow
Rapid map creation with machine learning and earth observation data.
License: Apache License 2.0
Context
For each coordinate a time series is generated starting from start_date
and ending at end_date
. To obtain each time step, the current strategy increments the date
by 31 days. A time series from Jan 2021 to Jan 2023 then looks like this:
Timestep 0: 2021-01-01 -> 2021-02-01
Timestep 1: 2021-02-01 -> 2021-03-04
Timestep 2: 2021-03-04 -> 2021-04-04
Timestep 3: 2021-04-04 -> 2021-05-05
...
Timestep 23: 2022-12-15 -> 2023-01-15
Issue
A time series starting at a different time does not align with this one, e.g. one starting Jan 2022:
Timestep 0: 2022-01-01 -> 2022-02-01
...
Timestep 11: 2022-12-08 -> 2023-01-08
Potential Solution
This can be addressed by incrementing by one calendar month instead of by 31 days.
Relevant code here
openmapflow/openmapflow/ee_exporter.py
Line 117 in d96c9f1
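A minimal sketch of the month-based increment (pure stdlib, not the actual ee_exporter.py code) shows how two series with different start dates stay aligned:

```python
from datetime import date

def add_months(d: date, n: int) -> date:
    """Advance a first-of-month date by n calendar months."""
    months = d.month - 1 + n
    return date(d.year + months // 12, months % 12 + 1, 1)

def monthly_timesteps(start: date, num_steps: int):
    """Return (start, end) pairs that advance by calendar month, not 31 days."""
    return [(add_months(start, i), add_months(start, i + 1)) for i in range(num_steps)]

# Series starting Jan 2021 and Jan 2022 now share identical boundaries
# where they overlap, so their timesteps stay aligned:
s21 = monthly_timesteps(date(2021, 1, 1), 24)
s22 = monthly_timesteps(date(2022, 1, 1), 12)
assert s21[12:] == s22
```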
Context: The train.py
example script contains all the code to train any PyTorch model on available data.
Issue: The current data normalization done inside the model is not valid for the ERA5 bands of the data, whose value ranges differ greatly from those of the optical bands.
Potential solution: Use per band scaling values.
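A sketch of what per-band scaling could look like; the band names and statistics below are hypothetical placeholders, not values from the project:

```python
import numpy as np

# Hypothetical per-band statistics (mean, std); in practice these would be
# computed from the training data. ERA5 bands such as 2m temperature
# (Kelvin) sit on a completely different scale than optical reflectance.
BAND_STATS = {
    "B4": (0.12, 0.05),
    "temperature_2m": (285.0, 10.0),
    "total_precipitation": (0.002, 0.004),
}

def normalize(x: np.ndarray, band_order) -> np.ndarray:
    """Scale each band of a (timesteps, bands) array independently."""
    means = np.array([BAND_STATS[b][0] for b in band_order])
    stds = np.array([BAND_STATS[b][1] for b in band_order])
    return (x - means) / stds

x = np.array([[0.17, 295.0, 0.006]])  # one timestep, three bands
print(normalize(x, list(BAND_STATS)))  # each band is now on a comparable scale
```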
Context: Individual earth engine tasks are launched to get remote sensing data for each training label.
Issue: Earth Engine limits the number of tasks per user to 3000 and each task can take multiple minutes to run. This means that training data creation takes a long time.
Potential solution: Google has recently updated the size of tifs that are downloadable using a URL (see: https://developers.google.com/earth-engine/apidocs/ee-image-getdownloadurl). This means it may be possible to substitute the earth engine tasks with parallelizable URL calls to vastly speed up the remote sensing data acquisition.
More resources: https://gorelick.medium.com/fast-er-downloads-a2abd512aa26
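The parallelization pattern could be sketched as below. The fetch function is left pluggable because real calls require Earth Engine authentication; `ee.Image.getDownloadURL` and `requests` are mentioned in the comment only.

```python
from concurrent.futures import ThreadPoolExecutor

def download_patches(urls, fetch, max_workers=16):
    """Fetch many small download URLs concurrently instead of queueing one
    Earth Engine task per training label."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))

# In real use each url would come from ee.Image.getDownloadURL(...) and
# fetch could be something like `lambda url: requests.get(url).content`.
```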
Context:
The Artifact Registry stores the docker images used with OpenMapFlow. It is created if it doesn't exist on deployment:
openmapflow/openmapflow/scripts/deploy.sh
Line 50 in e1b3bb9
Issue
If a project named crop-mask2
already exists, then for a project whose name is a prefix of it, e.g. crop-mask
, the Artifact Registry repository will not be created, because the line above only matches on a substring.
Potential Solution
Modify the if statement to require an exact (whole-name) match.
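The exact deploy.sh line is not reproduced here, but assuming it uses a grep-style substring check, the difference between substring and whole-line matching can be sketched as:

```shell
existing="crop-mask2"   # name already present in the Artifact Registry listing

# Substring match (the bug): "crop-mask" is found inside "crop-mask2",
# so the script would wrongly conclude the repository already exists.
echo "$existing" | grep -q "crop-mask" && echo "substring match: skip creation"

# Whole-line match (a fix): only an exact name counts as existing.
echo "$existing" | grep -qx "crop-mask" || echo "no exact match: create repository"
```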
Context:
Issue: This will create a circular dependency since current OpenMapFlow imports from CropHarvest and CropHarvest would be importing from OpenMapFlow.
Potential solution: Move EarthEngineExporter and other earth observation functions from CropHarvest to OpenMapFlow. This would also make it possible to add different types of EarthEngineExporter to OpenMapFlow in the future.
Context: Currently inference is done completely within the Google Colab notebook.
Issue: Checking inference progress sometimes involves rerunning prior cells of the notebook which takes time.
Potential solution: Create a CLI command for running inference and checking progress, to be less bound to the Colab environment.
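One way such a CLI could be shaped, sketched with argparse; the subcommand names are purely illustrative and not part of the actual openmapflow CLI:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical subcommand layout: "run" launches inference,
    # "status" checks progress without rerunning notebook cells.
    parser = argparse.ArgumentParser(prog="openmapflow-inference")
    sub = parser.add_subparsers(dest="command", required=True)
    run = sub.add_parser("run", help="launch inference for a region")
    run.add_argument("--model", required=True)
    sub.add_parser("status", help="check inference progress")
    return parser

args = build_parser().parse_args(["run", "--model", "crop_model"])
print(args.command, args.model)
```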
Context: RawLabels transforms a raw labels file to a standard csv using configuration parameters.
openmapflow/openmapflow/raw_labels.py
Line 194 in 9cfbabe
Issue: In many cases the way that an individual raw label file is transformed to a standard csv is unique to that specific file. Therefore it is hard to account for all transformations with a configuration file.
Potential solution: Allow the RawLabels class to take a processing function as input. The function would read in the file and output a standard csv. Also add a csv validity check to ensure the generated csv has all the correct columns and data types.
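The validity check could be sketched as below; the required column names and dtypes are assumptions for illustration, not openmapflow's actual schema:

```python
import pandas as pd

# Hypothetical required schema for the standardized csv.
REQUIRED = {"lat": "float64", "lon": "float64", "class_probability": "float64"}

def check_standard_csv(df: pd.DataFrame) -> None:
    """Raise if a csv produced by a custom processing function is malformed."""
    missing = set(REQUIRED) - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for col, dtype in REQUIRED.items():
        if str(df[col].dtype) != dtype:
            raise TypeError(f"{col} should be {dtype}, got {df[col].dtype}")

check_standard_csv(pd.DataFrame({"lat": [0.1], "lon": [2.3], "class_probability": [0.9]}))
```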
Context: The 3 most recent models are chosen for deployment using this line:
openmapflow/openmapflow/config.py
Line 84 in e1b3bb9
Issue: The above line sorts by file age; however, the file age reflects the latest dvc pull
rather than when the model was actually trained. Therefore the most recent models are not always deployed.
Potential solution: Find another indicator to sort by for deployment.
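One possible indicator, assuming a training date were encoded in the model filename (a hypothetical convention, not the current one):

```python
import re

# If filenames carried the training date (e.g. "maize_2023-01-15.pt"),
# ordering would no longer depend on the mtime that `dvc pull` assigns.
def latest_models(filenames, n=3):
    dated = [(re.search(r"\d{4}-\d{2}-\d{2}", f).group(), f) for f in filenames]
    return [f for _, f in sorted(dated, reverse=True)[:n]]

models = [
    "maize_2022-05-01.pt",
    "maize_2023-01-15.pt",
    "maize_2021-12-31.pt",
    "maize_2022-11-02.pt",
]
print(latest_models(models))  # newest three, regardless of file mtime
```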
Context
Adding a custom labeled dataset to the crop-mask project.
Problem:
Running openmapflow create-datasets
in the crop-mask project on a Windows machine results in a UnicodeEncodeError
and dataset creation fails. Screenshot attached for the exact message.
Potential Solution
Add an encoding
specification to the context manager at the lines below:
openmapflow/openmapflow/labeled_dataset.py
Lines 623 to 624 in 46ee6dd
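A minimal illustration of the proposed fix; the filename and text are placeholders:

```python
from pathlib import Path

# Without encoding=, open() falls back to the locale's default codepage on
# Windows (often cp1252), which raises UnicodeEncodeError on characters it
# cannot represent. Passing encoding="utf-8" makes the write portable.
path = Path("labels_report.txt")
with path.open("w", encoding="utf-8") as f:
    f.write("Ségou, Mali: 1203 labels")

print(path.read_text(encoding="utf-8"))
```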
Context: Currently a user is required to add their own raw label files to generate machine learning ready features.
Issue: This adds to the start-up cost of creating an initial map.
Potential solution: Pull raw labels directly from Radiant MLHub to avoid dealing with raw label files directly.
I have Python 3 installed, but it is not recognized when I run the openmapflow create dataset command (python3 -c "from datasets import datasets; from openmapflow.labeled_dataset import create_datasets; create_datasets(datasets)").
I am attaching a screenshot of what appears when I run python3.
Thank you
Context:
In the most recent Google Colab update (2/17/23), gdal
was updated from version 3.0.4 to 3.3.2.
Issue:
Running build_vrt
in create_map.ipynb
does not result in vrts being built from .nc prediction files. The build_vrt
call to gdal_cmd
terminates with exit code 1: gdalbuildvrt
warns that the images are ungeoreferenced and skips them.
Temporary Fix:
Add the following to a cell in create_map.ipynb
to downgrade gdal
back to 3.0.4, which makes build_vrt
work as it did prior to the Colab update.
%%shell
yes | add-apt-repository ppa:ubuntugis/ppa
apt-get update
apt-get install python3-gdal=3.0.4+dfsg-1build3
apt-get install gdal-bin=3.0.4+dfsg-1build3
apt-get install libgdal-dev=3.0.4+dfsg-1build3
export C_INCLUDE_PATH=/usr/include/gdal
export CPLUS_INCLUDE_PATH=/usr/include/gdal
python -m pip install GDAL==3.0.4
Potential Solution:
Investigate the changed behavior of gdalbuildvrt
between versions 3.0.4 and 3.3.2.
An addition to the README.md explaining prerequisites and valid/invalid uses for OpenMapFlow.
Context: The CLI is written in bash to allow for both python and bash commands to be accessible. Arguments are parsed using positional indexes. For example openmapflow cp <src> <dest>
is coded as:
"cp")
cp -r "$(librarydir)"/"$2" "$3"
;;
Issue: This command (and others) will break if the <src>
and <dest>
arguments are not the 2nd and 3rd arguments. For example, the following commands will fail:
openmapflow cp -r <src> <dest>
cd some_dir && openmapflow cp <src> <dest>
Potential solution: Use something like getopt: https://stackabuse.com/how-to-parse-command-line-arguments-in-bash/
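A sketch of flag-tolerant parsing (a plain while/case loop rather than getopt, but the same idea); the function and its output format are illustrative only, not the actual openmapflow CLI code:

```shell
# Flags are consumed first, so <src> and <dest> no longer have to be the
# 2nd and 3rd arguments.
parse_cp() {
  recursive=""
  while [ $# -gt 0 ]; do
    case "$1" in
      -r) recursive="yes"; shift ;;
      -*) echo "unknown flag: $1" >&2; return 1 ;;
      *)  break ;;
    esac
  done
  src="$1"; dest="$2"
  echo "src=$src dest=$dest recursive=$recursive"
}

parse_cp -r notebooks/ my_dir
parse_cp notebooks/ my_dir
```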
I tried running the following script:
!dvc pull -q
After running the above script I got the following output:
Go to the following link in your browser:
https://accounts.google.com/o/oauth2/auth?client_id=710796635688-iivsgbgsb6uv1fap6635dhvuei09o66c.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.appdata&access_type=offline&response_type=code&approval_prompt=force
Enter verification code:
When I click the above link I get the following message:
Access blocked: DVC’s request is invalid
[email protected]
You can’t sign in because DVC sent an invalid request. You can try again later, or contact the developer about this issue. Learn more about this error
If you are a developer of DVC, see error details.
Error 400: invalid_request
Please assist.
Context: openmapflow create-features
generates a standardized csv from several raw label files. If a standardized csv already exists, it is not recreated to save time. For each raw label file, it's possible to set a train_val_test
parameter to indicate the ratio of the labels which should be used for training, validation, and testing.
Issue: If the train_val_test
parameter is updated after the standardized csv has already been created, the new distribution of train, validation, and test will not be propagated. The current way to address this issue is to manually delete the standardized csv to force recreation.
Potential solution: After checking if the standardized csv already exists, verify that the train_val_test
ratio specified matches the training, validation, and test distribution in the standardized csv. If train_val_test
ratio does not match the standardized csv distribution, recreate the standardized csv.
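The proposed check could be sketched as follows; the subset labels and tolerance are assumptions for illustration:

```python
def needs_recreation(subset_counts: dict, requested_ratio: tuple, tolerance: float = 0.05) -> bool:
    """Return True if the train/val/test split in an existing standardized
    csv no longer matches the requested train_val_test ratio."""
    total = sum(subset_counts.values())
    actual = [subset_counts.get(k, 0) / total for k in ("train", "val", "test")]
    return any(abs(a - r) > tolerance for a, r in zip(actual, requested_ratio))

# Ratio unchanged: keep the cached csv.
assert not needs_recreation({"train": 80, "val": 10, "test": 10}, (0.8, 0.1, 0.1))
# Ratio changed: force recreation.
assert needs_recreation({"train": 80, "val": 10, "test": 10}, (0.6, 0.2, 0.2))
```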
Context: The inference colab notebook is setup in such a way that the map is only viewable once it is created for the entire region of interest. When creating large maps it may take more than one day to export all the data for one map.
Issue: If there is something wrong with the model being used to create a large map it will not become apparent until the entire map is generated.
Potential solution: A method for viewing the map as it is being generated would be useful to avoid creating an entire bad map.
Context: When the number of predicted files does not match the number of input files, the create_map.ipynb
notebook prompts the user to retry all files with missing predictions.
Issue: When all or most predictions are missing, the system is either still in the process of making predictions or predictions have failed for a reason other than latency (e.g. a bug in the prediction code, or EO files with missing values). When this occurs, retrying the files with missing predictions is not effective and wastes time.
Potential solution: Instead of allowing the user to retry any time predicted file amount does not match input file amount, only prompt user to retry when predicted file amount is at least 50% of input file amount. Otherwise alert the user that the system may be unstable and link the logs that should be referenced.
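The proposed gating could be as simple as:

```python
def should_prompt_retry(n_predicted: int, n_input: int, threshold: float = 0.5) -> bool:
    """Only offer a retry when most predictions already exist; otherwise the
    missing files likely indicate an in-progress or failing system and the
    user should be pointed at the logs instead."""
    return n_input > 0 and n_predicted / n_input >= threshold

assert should_prompt_retry(900, 1000)      # a few stragglers: retry is useful
assert not should_prompt_retry(10, 1000)   # almost nothing predicted: check logs
```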
Context: The inference colab notebook uses a bounding box to determine which area to export data for (and thereby make a map for).
Issue: The bounding box limitation means that predictions are made for pixels not necessarily inside the region of interest and these predictions are not useful.
Potential solution: A shapefile or something similar could be used to determine which area to export data for and ensure useless predictions are not made.
See: https://developers.google.com/earth-engine/apidocs/export-table-tocloudstorage
Potential gain: A solution to this issue could result in a significant speed up and cost saving for map creation.
Context: Inference is done using the Google Cloud Run service and Google Cloud Run logs are important for debugging any issues that arise during inference.
Issue: Cloud Run logs are cluttered with the following message:
Container Sandbox: Unsupported syscall setsockopt(0x83,0x1,0xd,0x3e8fec5fdf80,0x8,0x4). It is very likely that you can safely ignore this message and that this is not the cause of any error you might be troubleshooting. Please, refer to https://gvisor.dev/docs/user_guide/compatibility/linux/amd64/#setsockopt for more information.
Potential Solution: Upgrading to Cloud Run gen2 execution environment: google/gvisor#1739
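Assuming the service is deployed with gcloud, switching an existing service to the gen2 execution environment (which runs on full Linux rather than the gVisor sandbox that emits the setsockopt warnings) might look like this; the service name and region are placeholders:

```shell
gcloud run services update inference-service \
  --region us-central1 \
  --execution-environment gen2
```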
Context: To start training a model a user must first add data to the project.
Issue: This adds to the initial start-up cost and may deter future users.
Potential solution: Some method to import ML-ready data directly into datasets.py.
Context: The tutorial.ipynb
notebook includes example Earth observation data from Togo for making an example map. This method uses existing data stored in a public Google Cloud bucket.
Issue: To the user, it is unclear how the data in the bucket is obtained from EarthEngine.
Potential solution: Add example EarthEngine code to pull in data for any region inside the tutorial notebook. This will have to pull the data directly into Colab to avoid a dependency on Google Cloud Storage or Google Drive.
Context: Deployment script currently takes 5-7 minutes
Issue: It can probably be faster.
Potential solution: Profile the deployment to find the biggest time sinks and investigate ways of speeding it up.
Context: A new docker image is released with every package version: https://github.com/nasaharvest/openmapflow/actions/workflows/docker.yaml
Issue: From version 0.1.0 to 0.1.1 the docker image size grew substantially.
https://hub.docker.com/r/ivanzvonkov/openmapflow/tags
Potential solution: Build locally and investigate differences.
Issue: There are data integration tests but no model integration tests.
Potential solution:
Should follow the same technique as the data integration tests. Example tests:
Context: The create_map.ipynb
notebook creates a map using one of the deployed models.
Issue: There is currently no integration test verifying that the notebook runs to completion.
Potential solution: An additional GitHub action that runs optionally after deployment to verify predictions work as expected.
Context: Upon uploading the cropland map to Earth Engine, the user is asked to paste a script into Earth Engine to visualize the generated cropland map.
Issue: A smoother user experience would allow for visualizing the map directly in the colab notebook with the option of also visualizing in Earth Engine.
This may be possible to achieve using https://geemap.org/ or a similar tool.
Port from cropharvest
Context: openmapflow create-features
is significantly sped up by the fact that tifs are cached on Google Cloud.
Issue: Features (individual pixels) are not stored on Google Cloud and therefore need to be recreated when a new project uses similar labels to another project.
Potential solution: Manually upload features to Cloud Storage periodically and add functionality to check which features are already available before creating new ones.
Problem:
Attempting to run openmapflow commands in a Miniconda/Anaconda prompt results in the error:
'openmapflow' is not recognized as an internal or external command, operable program or batch file.