nasaharvest / openmapflow
Rapid map creation with machine learning and earth observation data.
License: Apache License 2.0
Context
For each coordinate a time series is generated starting from start_date
and ending at end_date
. To obtain each time step, the current strategy increments the date
by 31 days. A time series from Jan 2021 to Jan 2023 then looks like this:
Timestep 0: 2021-01-01 -> 2021-02-01
Timestep 1: 2021-02-01 -> 2021-03-04
Timestep 2: 2021-03-04 -> 2021-04-04
Timestep 3: 2021-04-04 -> 2021-05-05
...
Timestep 23: 2022-12-15 -> 2023-01-15
Issue
A time series starting at a different time does not align with this one, e.g. one starting Jan 2022:
Timestep 0: 2022-01-01 -> 2022-02-01
...
Timestep 11: 2022-12-08 -> 2023-01-08
Potential Solution
This can be addressed by incrementing by one calendar month instead of by 31 days.
Relevant code here
openmapflow/openmapflow/ee_exporter.py
Line 117 in d96c9f1
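A minimal sketch of the month-based increment (pure stdlib, not the actual ee_exporter.py code) shows how two series with different start dates stay aligned:

```python
from datetime import date

def add_months(d: date, n: int) -> date:
    """Advance a first-of-month date by n calendar months."""
    months = d.month - 1 + n
    return date(d.year + months // 12, months % 12 + 1, 1)

def monthly_timesteps(start: date, num_steps: int):
    """Return (start, end) pairs that advance by calendar month, not 31 days."""
    return [(add_months(start, i), add_months(start, i + 1)) for i in range(num_steps)]

# Series starting Jan 2021 and Jan 2022 now share identical boundaries
# where they overlap, so their timesteps stay aligned:
s21 = monthly_timesteps(date(2021, 1, 1), 24)
s22 = monthly_timesteps(date(2022, 1, 1), 12)
assert s21[12:] == s22
```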
Context: The train.py
example script contains all the code to train any PyTorch model on available data.
Issue: The current data normalization done inside the model is not valid for the ERA5 bands of the data, whose value ranges differ greatly from those of the optical bands.
Potential solution: Use per band scaling values.
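A sketch of what per-band scaling could look like; the band names and statistics below are hypothetical placeholders, not values from the project:

```python
import numpy as np

# Hypothetical per-band statistics (mean, std); in practice these would be
# computed from the training data. ERA5 bands such as 2m temperature
# (Kelvin) sit on a completely different scale than optical reflectance.
BAND_STATS = {
    "B4": (0.12, 0.05),
    "temperature_2m": (285.0, 10.0),
    "total_precipitation": (0.002, 0.004),
}

def normalize(x: np.ndarray, band_order) -> np.ndarray:
    """Scale each band of a (timesteps, bands) array independently."""
    means = np.array([BAND_STATS[b][0] for b in band_order])
    stds = np.array([BAND_STATS[b][1] for b in band_order])
    return (x - means) / stds

x = np.array([[0.17, 295.0, 0.006]])  # one timestep, three bands
print(normalize(x, list(BAND_STATS)))  # each band is now on a comparable scale
```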
Context: Individual earth engine tasks are launched to get remote sensing data for each training label.
Issue: Earth Engine limits the number of tasks per user to 3000 and each task can take multiple minutes to run. This means that training data creation takes a long time.
Potential solution: Google has recently updated the size of tifs that are downloadable using a URL (see: https://developers.google.com/earth-engine/apidocs/ee-image-getdownloadurl). This means it may be possible to substitute the earth engine tasks with parallelizable URL calls to vastly speed up the remote sensing data acquisition.
More resources: https://gorelick.medium.com/fast-er-downloads-a2abd512aa26
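The parallelization pattern could be sketched as below. The fetch function is left pluggable because real calls require Earth Engine authentication; `ee.Image.getDownloadURL` and `requests` are mentioned in the comment only.

```python
from concurrent.futures import ThreadPoolExecutor

def download_patches(urls, fetch, max_workers=16):
    """Fetch many small download URLs concurrently instead of queueing one
    Earth Engine task per training label."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))

# In real use each url would come from ee.Image.getDownloadURL(...) and
# fetch could be something like `lambda url: requests.get(url).content`.
```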
Context:
The Artifact Registry stores the docker images used with OpenMapFlow. It is created if it doesn't exist on deployment:
openmapflow/openmapflow/scripts/deploy.sh
Line 50 in e1b3bb9
Issue
If a project named crop-mask2
already exists, then for a project whose name is a prefix of it, e.g. crop-mask
, the Artifact Registry repository will not be created, because the line above only matches on a substring.
Potential Solution
Modify the if statement to require an exact (whole-name) match.
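The exact deploy.sh line is not reproduced here, but assuming it uses a grep-style substring check, the difference between substring and whole-line matching can be sketched as:

```shell
existing="crop-mask2"   # name already present in the Artifact Registry listing

# Substring match (the bug): "crop-mask" is found inside "crop-mask2",
# so the script would wrongly conclude the repository already exists.
echo "$existing" | grep -q "crop-mask" && echo "substring match: skip creation"

# Whole-line match (a fix): only an exact name counts as existing.
echo "$existing" | grep -qx "crop-mask" || echo "no exact match: create repository"
```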
Context:
Issue: This will create a circular dependency since current OpenMapFlow imports from CropHarvest and CropHarvest would be importing from OpenMapFlow.
Potential solution: Move EarthEngineExporter and other earth observation functions from CropHarvest to OpenMapFlow. This would also make it possible to add different types of EarthEngineExporter to OpenMapFlow in the future.
Context: Currently inference is done completely within the Google Colab notebook.
Issue: Checking inference progress sometimes involves rerunning prior cells of the notebook which takes time.
Potential solution: Create a CLI command for running inference and checking progress, to be less bound to the Colab environment.
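One way such a CLI could be shaped, sketched with argparse; the subcommand names are purely illustrative and not part of the actual openmapflow CLI:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical subcommand layout: "run" launches inference,
    # "status" checks progress without rerunning notebook cells.
    parser = argparse.ArgumentParser(prog="openmapflow-inference")
    sub = parser.add_subparsers(dest="command", required=True)
    run = sub.add_parser("run", help="launch inference for a region")
    run.add_argument("--model", required=True)
    sub.add_parser("status", help="check inference progress")
    return parser

args = build_parser().parse_args(["run", "--model", "crop_model"])
print(args.command, args.model)
```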
Context: RawLabels transforms a raw labels file to a standard csv using configuration parameters.
openmapflow/openmapflow/raw_labels.py
Line 194 in 9cfbabe
Issue: In many cases the way that an individual raw label file is transformed to a standard csv is unique to that specific file. Therefore it is hard to account for all transformations with a configuration file.
Potential solution: Allow the RawLabels class to take a processing function as input. The function would read in the file and output a standard csv. Also add a csv validity check to ensure the generated csv has all the correct columns and data types.
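The validity check could be sketched as below; the required column names and dtypes are assumptions for illustration, not openmapflow's actual schema:

```python
import pandas as pd

# Hypothetical required schema for the standardized csv.
REQUIRED = {"lat": "float64", "lon": "float64", "class_probability": "float64"}

def check_standard_csv(df: pd.DataFrame) -> None:
    """Raise if a csv produced by a custom processing function is malformed."""
    missing = set(REQUIRED) - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for col, dtype in REQUIRED.items():
        if str(df[col].dtype) != dtype:
            raise TypeError(f"{col} should be {dtype}, got {df[col].dtype}")

check_standard_csv(pd.DataFrame({"lat": [0.1], "lon": [2.3], "class_probability": [0.9]}))
```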
Context: The 3 most recent models are chosen for deployment using this line:
openmapflow/openmapflow/config.py
Line 84 in e1b3bb9
Issue: The above line sorts by file age; however, the file age reflects the latest dvc pull
rather than when the model was actually trained. Therefore the most recent models are not always deployed.
Potential solution: Find another indicator to sort by for deployment.
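One possible indicator, assuming a training date were encoded in the model filename (a hypothetical convention, not the current one):

```python
import re

# If filenames carried the training date (e.g. "maize_2023-01-15.pt"),
# ordering would no longer depend on the mtime that `dvc pull` assigns.
def latest_models(filenames, n=3):
    dated = [(re.search(r"\d{4}-\d{2}-\d{2}", f).group(), f) for f in filenames]
    return [f for _, f in sorted(dated, reverse=True)[:n]]

models = [
    "maize_2022-05-01.pt",
    "maize_2023-01-15.pt",
    "maize_2021-12-31.pt",
    "maize_2022-11-02.pt",
]
print(latest_models(models))  # newest three, regardless of file mtime
```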
Context
Adding a custom labeled dataset to the crop-mask project.
Problem:
Running openmapflow create-datasets
in the crop-mask project on a Windows machine results in a UnicodeEncodeError
and dataset creation fails. Screenshot attached for the exact message.
Potential Solution
Add an encoding
specification to the context manager at the lines below:
openmapflow/openmapflow/labeled_dataset.py
Lines 623 to 624 in 46ee6dd
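A minimal illustration of the proposed fix; the filename and text are placeholders:

```python
from pathlib import Path

# Without encoding=, open() falls back to the locale's default codepage on
# Windows (often cp1252), which raises UnicodeEncodeError on characters it
# cannot represent. Passing encoding="utf-8" makes the write portable.
path = Path("labels_report.txt")
with path.open("w", encoding="utf-8") as f:
    f.write("Ségou, Mali: 1203 labels")

print(path.read_text(encoding="utf-8"))
```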
Context: Currently a user is required to add their own raw label files to generate machine learning ready features.
Issue: This adds to the start-up cost of creating an initial map.
Potential solution: Pull raw labels directly from Radiant MLHub to avoid dealing with raw label files directly.
I have Python 3 installed, but it is not recognized when I run the openmapflow create dataset command (python3 -c "from datasets import datasets; from openmapflow.labeled_dataset import create_datasets; create_datasets(datasets)").
I am attaching a screenshot of what appears when I run python3.
Thank you
Context:
In the most recent Google Colab update (2/17/23), gdal
was updated from version 3.0.4 to 3.3.2.
Issue:
Running build_vrt
in create_map.ipynb
does not result in vrts being built from .nc prediction files. The build_vrt
call to gdal_cmd
terminates with exit code 1: gdalbuildvrt
warns that the images are ungeoreferenced and skips them.
Temporary Fix:
Add the following to a cell in create_map.ipynb
to downgrade gdal
back to 3.0.4, which makes build_vrt
work as it did prior to the Colab update.
%%shell
yes | add-apt-repository ppa:ubuntugis/ppa
apt-get update
apt-get install python3-gdal=3.0.4+dfsg-1build3
apt-get install gdal-bin=3.0.4+dfsg-1build3
apt-get install libgdal-dev=3.0.4+dfsg-1build3
export C_INCLUDE_PATH=/usr/include/gdal
export CPLUS_INCLUDE_PATH=/usr/include/gdal
python -m pip install GDAL==3.0.4
Potential Solution:
Investigate the changed behavior of gdalbuildvrt
between versions 3.0.4 and 3.3.2.
An addition to the README.md explaining prerequisites and valid/invalid uses for OpenMapFlow.
Context: The CLI is written in bash to allow for both python and bash commands to be accessible. Arguments are parsed using positional indexes. For example openmapflow cp <src> <dest>
is coded as:
"cp")
cp -r "$(librarydir)"/"$2" "$3"
;;
Issue: This command (and others) will break if the <src>
and <dest>
arguments are not the 2nd and 3rd arguments. For example, the following commands will fail:
openmapflow cp -r <src> <dest>
cd some_dir && openmapflow cp <src> <dest>
Potential solution: Use something like getopt: https://stackabuse.com/how-to-parse-command-line-arguments-in-bash/
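A sketch of flag-tolerant parsing (a plain while/case loop rather than getopt, but the same idea); the function and its output format are illustrative only, not the actual openmapflow CLI code:

```shell
# Flags are consumed first, so <src> and <dest> no longer have to be the
# 2nd and 3rd arguments.
parse_cp() {
  recursive=""
  while [ $# -gt 0 ]; do
    case "$1" in
      -r) recursive="yes"; shift ;;
      -*) echo "unknown flag: $1" >&2; return 1 ;;
      *)  break ;;
    esac
  done
  src="$1"; dest="$2"
  echo "src=$src dest=$dest recursive=$recursive"
}

parse_cp -r notebooks/ my_dir
parse_cp notebooks/ my_dir
```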
I tried running the following script:
!dvc pull -q
After running the above script I got the following output:
Go to the following link in your browser:
https://accounts.google.com/o/oauth2/auth?client_id=710796635688-iivsgbgsb6uv1fap6635dhvuei09o66c.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.appdata&access_type=offline&response_type=code&approval_prompt=force
Enter verification code:
When I click the above link I get the following message:
Access blocked: DVC’s request is invalid
[email protected]
You can’t sign in because DVC sent an invalid request. You can try again later, or contact the developer about this issue. Learn more about this error
If you are a developer of DVC, see error details.
Error 400: invalid_request
Please assist.
Context: openmapflow create-features
generates a standardized csv from several raw label files. If a standardized csv already exists, it is not recreated to save time. For each raw label file, it's possible to set a train_val_test
parameter to indicate the ratio of the labels which should be used for training, validation, and testing.
Issue: If the train_val_test
parameter is updated after the standardized csv has already been created, the new distribution of train, validation, and test will not be propagated. The current way to address this issue is to manually delete the standardized csv to force recreation.
Potential solution: After checking if the standardized csv already exists, verify that the train_val_test
ratio specified matches the training, validation, and test distribution in the standardized csv. If train_val_test
ratio does not match the standardized csv distribution, recreate the standardized csv.
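The proposed check could be sketched as follows; the subset labels and tolerance are assumptions for illustration:

```python
def needs_recreation(subset_counts: dict, requested_ratio: tuple, tolerance: float = 0.05) -> bool:
    """Return True if the train/val/test split in an existing standardized
    csv no longer matches the requested train_val_test ratio."""
    total = sum(subset_counts.values())
    actual = [subset_counts.get(k, 0) / total for k in ("train", "val", "test")]
    return any(abs(a - r) > tolerance for a, r in zip(actual, requested_ratio))

# Ratio unchanged: keep the cached csv.
assert not needs_recreation({"train": 80, "val": 10, "test": 10}, (0.8, 0.1, 0.1))
# Ratio changed: force recreation.
assert needs_recreation({"train": 80, "val": 10, "test": 10}, (0.6, 0.2, 0.2))
```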
Context: The inference colab notebook is setup in such a way that the map is only viewable once it is created for the entire region of interest. When creating large maps it may take more than one day to export all the data for one map.
Issue: If there is something wrong with the model being used to create a large map it will not become apparent until the entire map is generated.
Potential solution: A method for viewing the map as it is being generated would be useful to avoid creating an entire bad map.
Context: When the number of predicted files does not match the number of input files, the create_map.ipynb
notebook prompts the user to retry all files with missing predictions.
Issue: When all or most predictions are missing, the system is either still in the process of making predictions or predictions have failed for a reason other than latency (e.g. a bug in the prediction code, or EO files with missing values). When this occurs, retrying the files with missing predictions is not effective and wastes time.
Potential solution: Instead of allowing the user to retry any time predicted file amount does not match input file amount, only prompt user to retry when predicted file amount is at least 50% of input file amount. Otherwise alert the user that the system may be unstable and link the logs that should be referenced.
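The proposed gating could be as simple as:

```python
def should_prompt_retry(n_predicted: int, n_input: int, threshold: float = 0.5) -> bool:
    """Only offer a retry when most predictions already exist; otherwise the
    missing files likely indicate an in-progress or failing system and the
    user should be pointed at the logs instead."""
    return n_input > 0 and n_predicted / n_input >= threshold

assert should_prompt_retry(900, 1000)      # a few stragglers: retry is useful
assert not should_prompt_retry(10, 1000)   # almost nothing predicted: check logs
```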
Context: The inference colab notebook uses a bounding box to determine which area to export data for (and thereby make a map for).
Issue: The bounding box limitation means that predictions are made for pixels not necessarily inside the region of interest and these predictions are not useful.
Potential solution: A shapefile or something similar could be used to determine which area to export data for and ensure useless predictions are not made.
See: https://developers.google.com/earth-engine/apidocs/export-table-tocloudstorage
Potential gain: A solution to this issue could result in a significant speed up and cost saving for map creation.
Context: Inference is done using the Google Cloud Run service and Google Cloud Run logs are important for debugging any issues that arise during inference.
Issue: Cloud Run logs are cluttered with the following message:
Container Sandbox: Unsupported syscall setsockopt(0x83,0x1,0xd,0x3e8fec5fdf80,0x8,0x4). It is very likely that you can safely ignore this message and that this is not the cause of any error you might be troubleshooting. Please, refer to https://gvisor.dev/docs/user_guide/compatibility/linux/amd64/#setsockopt for more information.
Potential Solution: Upgrading to Cloud Run gen2 execution environment: google/gvisor#1739
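Assuming the service is deployed with gcloud, switching an existing service to the gen2 execution environment (which runs on full Linux rather than the gVisor sandbox that emits the setsockopt warnings) might look like this; the service name and region are placeholders:

```shell
gcloud run services update inference-service \
  --region us-central1 \
  --execution-environment gen2
```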
Context: To start training a model a user must first add data to the project.
Issue: This adds to the initial start-up cost and may deter future users.
Potential solution: Some method to import ML-ready data directly into datasets.py.
Context: The tutorial.ipynb
notebook includes example Earth observation data from Togo for making an example map. This method uses existing data stored in a public Google Cloud bucket.
Issue: To the user, it is unclear how the data in the bucket is obtained from EarthEngine.
Potential solution: Add example EarthEngine code to pull in data for any region inside the tutorial notebook. This will have to pull the data directly into Colab to avoid a dependency on Google Cloud Storage or Google Drive.
Context: Deployment script currently takes 5-7 minutes
Issue: It can probably be faster.
Potential solution: Profile the deployment to find the biggest time sinks and investigate ways of speeding it up.
Context: A new docker image is released with every package version: https://github.com/nasaharvest/openmapflow/actions/workflows/docker.yaml
Issue: From version 0.1.0 to 0.1.1 the docker image size grew substantially.
https://hub.docker.com/r/ivanzvonkov/openmapflow/tags
Potential solution: Build locally and investigate differences.
Issue: There are data integration tests but no model integration tests.
Potential solution:
Should follow the same technique as the data integration tests. Example tests:
Context: The create_map.ipynb
notebook creates a map using one of the deployed models.
Issue: There is currently no integration test verifying that the notebook runs to completion.
Potential solution: An additional GitHub action that runs optionally after deployment to verify predictions work as expected.
Context: Upon uploading the cropland map to Earth Engine, the user is asked to paste a script into Earth Engine to visualize the generated cropland map.
Issue: A smoother user experience would allow for visualizing the map directly in the colab notebook with the option of also visualizing in Earth Engine.
This may be possible to achieve using https://geemap.org/ or a similar tool.
Port from cropharvest
Context: openmapflow create-features
is significantly sped up by the fact that tifs are cached on Google Cloud.
Issue: Features (individual pixels) are not stored on Google Cloud and therefore need to be recreated when a new project uses similar labels to another project.
Potential solution: Manually upload features to Cloud Storage periodically and add functionality to check which features are already available before creating new ones.
Problem:
Attempting to run openmapflow commands in a Miniconda/Anaconda prompt results in the error:
'openmapflow' is not recognized as an internal or external command, operable program or batch file.