openmapflow's Issues

Exporter Temporal Alignment

Context
For each coordinate, a time series is generated starting from start_date and ending at end_date. The current strategy obtains each time step by incrementing the date by 31 days. A time series from Jan 2021 to Dec 2022 then looks like this:

Timestep 0: 2021-01-01 -> 2021-02-01
Timestep 1: 2021-02-01 -> 2021-03-04
Timestep 2: 2021-03-04 -> 2021-04-04
Timestep 3: 2021-04-04 -> 2021-05-05
...
Timestep 23: 2022-12-15 -> 2023-01-15

Issue
A time series starting at a different time does not align with this one, e.g. one starting Jan 2022:

Timestep 0: 2022-01-01 -> 2022-02-01
...
Timestep 11: 2022-12-08 -> 2023-01-08

Potential Solution
This can be addressed by incrementing by month instead of by 31 days.

Relevant code here

cur_end_date = cur_date + timedelta(days=days_per_timestep)
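A month-based increment could be sketched with the standard library; the add_months helper below is a hypothetical illustration, not part of the current exporter:

```python
import calendar
from datetime import date

def add_months(d: date, months: int) -> date:
    # Advance a date by whole calendar months, clamping the day
    # so that e.g. Jan 31 + 1 month -> Feb 28 (or 29).
    total = d.month - 1 + months
    year, month = d.year + total // 12, total % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

# Timestep t of any series starting on the 1st of a month now begins
# on the 1st of month t, so series with different start dates align.
start = date(2021, 1, 1)
steps = [(add_months(start, t), add_months(start, t + 1)) for t in range(24)]
```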

Tutorial Mega Issue

  • Visualize predictions in notebook
  • Remove visualization from train.ipynb
  • Test timing
  • Test with external user

Use Earth Engine's getDownloadURL to create training data faster

Context: Individual earth engine tasks are launched to get remote sensing data for each training label.

Issue: Earth Engine limits the number of tasks per user to 3000 and each task can take multiple minutes to run. This means that training data creation takes a long time.

Potential solution: Google has recently updated the size of tifs that are downloadable using a URL (see: https://developers.google.com/earth-engine/apidocs/ee-image-getdownloadurl). This means it may be possible to substitute the earth engine tasks with parallelizable URL calls to vastly speed up the remote sensing data acquisition.

More resources: https://gorelick.medium.com/fast-er-downloads-a2abd512aa26
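The parallelization itself is straightforward. A minimal sketch, assuming each label's URL has already been produced by ee.Image.getDownloadURL; the fetch callable is a stand-in for an HTTP GET:

```python
from concurrent.futures import ThreadPoolExecutor

def download_all(urls, fetch, max_workers=8):
    # fetch: url -> bytes. Replaces one Earth Engine task per label
    # with parallel URL downloads; EE rate limits still apply, so
    # max_workers should stay modest.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```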

Artifact Registry project should still be created if similar project already exists

Context:
The Artifact Registry stores the docker images used with OpenMapFlow. It is created if it doesn't exist on deployment:

if [ -z "$(gcloud artifacts repositories list --format='get(name)' --filter "$OPENMAPFLOW_PROJECT")" ]; then

Issue
If a project named crop-mask2 already exists, then for a project whose name is a substring of it (e.g. crop-mask) the Artifact Registry repository will not be created, because the filter in the line above matches the existing project.

Potential Solution
Make the if statement's filter match the project name exactly rather than as a substring.

Move EarthEngineExporter to OpenMapFlow

Context:

  • CropHarvest will be updated to include data beyond the February-to-February window, using the data pipeline in OpenMapFlow (LabeledDataset objects, CSVs with earth observation data).
  • EarthEngineExporter and other earth observation functions are currently imported from CropHarvest:
    from cropharvest.eo import EarthEngineExporter

Issue: This will create a circular dependency, since OpenMapFlow currently imports from CropHarvest and CropHarvest would be importing from OpenMapFlow.

Potential solution: Move EarthEngineExporter and other earth observation functions from CropHarvest to OpenMapFlow. This would also make it possible to add different types of EarthEngineExporter to OpenMapFlow in the future.

Create CLI command for inference

Context: Currently inference is done completely within the Google Colab notebook.

Issue: Checking inference progress sometimes involves rerunning prior cells of the notebook which takes time.

Potential solution: Create a CLI command for running inference and checking progress, to be less bound to the Colab environment.

RawLabels should work with custom processing functions

Context: RawLabels transforms a raw labels file to a standard csv using configuration parameters.

class RawLabels:

Issue: In many cases the way that an individual raw label file is transformed to a standard csv is unique to that specific file. Therefore it is hard to account for all transformations with a configuration file.

Potential solution: Allow the RawLabels class to take a processing function as input. The function would read in the file and output a standard csv. Also add a validation check to ensure the generated csv has all the correct columns and data types.
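A sketch of that interface; the column names, Row type, and method names here are illustrative assumptions, not the project's actual schema:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]
REQUIRED_COLUMNS = {"lat", "lon", "class_probability"}  # assumed schema

@dataclass
class RawLabels:
    filename: str
    # User-supplied function that reads the raw file and returns
    # rows in the standard format.
    process: Callable[[str], List[Row]]

    def to_standard_rows(self) -> List[Row]:
        rows = self.process(self.filename)
        # Validity check: every row must carry the required columns.
        for i, row in enumerate(rows):
            missing = REQUIRED_COLUMNS - row.keys()
            if missing:
                raise ValueError(f"row {i} is missing columns: {missing}")
        return rows
```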

Most recent models are not always deployed

Context: The 3 most recent models are chosen for deployment using this line:

model_files.sort(key=os.path.getmtime)

Issue: The above line sorts by file modification time; however, that timestamp reflects the latest dvc pull rather than the model file's actual age. Therefore the most recent models are not always deployed.

Potential solution: Find another indicator to sort by for deployment.
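One option, assuming model filenames embed a training date; the naming pattern below is a hypothetical example:

```python
import re

def sort_by_embedded_date(model_files):
    # Sort by a date embedded in the filename, e.g.
    # "Togo_maize_2022_02_01.pt", instead of mtime, which only
    # reflects the latest `dvc pull`.
    def key(path):
        match = re.search(r"(\d{4})_(\d{2})_(\d{2})", path)
        return match.groups() if match else ("0000", "00", "00")
    return sorted(model_files, key=key)
```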

Windows UnicodeEncodeError

Context
Adding custom labeled dataset to crop-mask project.
Problem:
Running openmapflow create-datasets in the crop-mask project on a Windows machine results in a UnicodeEncodeError, and dataset creation fails. Screenshot attached for the exact message.
image

Potential Solution
Add an encoding specification to the context manager, e.g.:

with (PROJECT_ROOT / dp.REPORT).open("w", encoding="utf-8") as f:
    f.write(report)

Radiant MLHub integration

Context: Currently a user is required to add their own raw label files to generate machine learning ready features.

Issue: This adds to the start-up cost of creating an initial map.

Potential solution: Pull raw labels directly from Radiant MLHub to avoid dealing with raw label files directly.

Python3 and openmapflow setup on Windows

I have Python 3 installed but it is not recognized when I run the openmapflow create-datasets command (python3 -c "from datasets import datasets; from openmapflow.labeled_dataset import create_datasets; create_datasets(datasets)").

I am adding a screenshot of what pops up when I run python3.

image

Thank you

Google Colab GDAL update 3.0.4 -> 3.3.2

Context:
In the most recent Google Colab update (2/17/23), GDAL was updated from version 3.0.4 to 3.3.2.
Issue:
Running build_vrt in create_map.ipynb does not result in vrts being built from .nc prediction files.

  1. build_vrt call to gdal_cmd terminates with exit code 1
  2. Manually calling gdalbuildvrt results in a warning that the images are ungeoreferenced, and they are skipped.
    image

Temporary Fix:
Add the following to a cell in create_map.ipynb to downgrade GDAL to 3.0.4, which restores the pre-update build_vrt behavior.

%%shell
yes | add-apt-repository ppa:ubuntugis/ppa
apt-get update
apt-get install python3-gdal=3.0.4+dfsg-1build3
apt-get install gdal-bin=3.0.4+dfsg-1build3
apt-get install libgdal-dev=3.0.4+dfsg-1build3
export C_INCLUDE_PATH=/usr/include/gdal
export CPLUS_INCLUDE_PATH=/usr/include/gdal
python -m pip install GDAL==3.0.4

Potential Solution:
Investigate the changed behavior of gdalbuildvrt between versions 3.0.4 and 3.3.2.

CLI should be more robust

Context: The CLI is written in bash to allow for both python and bash commands to be accessible. Arguments are parsed using positional indexes. For example openmapflow cp <src> <dest> is coded as:

"cp")
    cp -r "$(librarydir)"/"$2" "$3"
    ;;

Issue: This command (and others) will break if <src> and <dest> are not the 2nd and 3rd arguments. For example, the following commands will fail:

openmapflow cp -r <src> <dest>
cd some_dir && openmapflow cp <src> <dest>

Potential solution: Use something like getopt to parse arguments: https://stackabuse.com/how-to-parse-command-line-arguments-in-bash/

Access blocked: DVC’s request is invalid

I tried running the following script:

Pull in data already available

!dvc pull -q

After running the above script, I got the following output:

Go to the following link in your browser:
https://accounts.google.com/o/oauth2/auth?client_id=710796635688-iivsgbgsb6uv1fap6635dhvuei09o66c.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.appdata&access_type=offline&response_type=code&approval_prompt=force

Enter verification code:

When I clicked the above link, I got the following message:

Access blocked: DVC’s request is invalid

[email protected]
You can’t sign in because DVC sent an invalid request. You can try again later, or contact the developer about this issue. Learn more about this error
If you are a developer of DVC, see error details.
Error 400: invalid_request

Please assist

openmapflow create-features should check if train, val, test distribution has changed

Context: openmapflow create-features generates a standardized csv from several raw label files. If a standardized csv already exists, it is not recreated to save time. For each raw label file, it's possible to set a train_val_test parameter to indicate the ratio of the labels which should be used for training, validation, and testing.

Issue: If the train_val_test parameter is updated after the standardized csv has already been created, the new distribution of train, validation, and test will not be propagated. The current way to address this issue is to manually delete the standardized csv to force recreation.

Potential solution: After checking if the standardized csv already exists, verify that the train_val_test ratio specified matches the training, validation, and test distribution in the standardized csv. If train_val_test ratio does not match the standardized csv distribution, recreate the standardized csv.
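A sketch of that check; the subset names and tolerance below are assumptions, not the project's actual values:

```python
from collections import Counter

def distribution_matches(splits, expected, tol=0.05):
    # splits: the subset column values from the existing standardized
    # csv; expected: (train, val, test) fractions from train_val_test.
    # Recreate the csv when the observed fractions drift more than
    # `tol` from the configured ratio.
    n = len(splits)
    counts = Counter(splits)
    observed = tuple(counts[s] / n for s in ("training", "validation", "testing"))
    return all(abs(o - e) <= tol for o, e in zip(observed, expected))
```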

Ability to view map as it is being generated

Context: The inference colab notebook is set up in such a way that the map is only viewable once it has been created for the entire region of interest. When creating large maps, it may take more than a day to export all the data for one map.

Issue: If there is something wrong with the model being used to create a large map it will not become apparent until the entire map is generated.

Potential solution: A method for viewing the map as it is being generated would be useful to avoid creating an entire bad map.

Smarter retry prediction mechanism

Context: When the number of predicted files does not match the number of input files, create_map.ipynb prompts the user to retry all files with missing predictions.

Issue: When all or most predictions are missing, the system is either still in the process of making predictions or predictions have failed for a reason other than latency (e.g. a bug in the prediction code, or EO files with missing values). In that case, retrying the files with missing predictions is ineffective and wastes time.

Potential solution: Instead of offering a retry whenever the predicted file count does not match the input file count, only prompt the user to retry when the predicted count is at least 50% of the input count. Otherwise, alert the user that the system may be unstable and link the logs that should be consulted.
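The gating logic is simple; a sketch with the 50% threshold described above (function name and signature are hypothetical):

```python
def should_prompt_retry(n_predicted: int, n_input: int, threshold: float = 0.5) -> bool:
    # Only offer a retry when most predictions already succeeded;
    # below the threshold, failures are likely systematic and the
    # user should inspect the logs instead.
    if n_input == 0 or n_predicted >= n_input:
        return False
    return n_predicted / n_input >= threshold
```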

Maps should be made for regions of interest instead of bounding boxes

Context: The inference colab notebook uses a bounding box to determine which area to export data for (and thereby make a map for).

Issue: The bounding box limitation means that predictions are made for pixels not necessarily inside the region of interest and these predictions are not useful.

Potential solution: A shapefile or something similar could be used to determine which area to export data for and ensure useless predictions are not made.
See: https://developers.google.com/earth-engine/apidocs/export-table-tocloudstorage

Potential gain: A solution to this issue could result in a significant speed up and cost saving for map creation.
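Server-side, Earth Engine can clip exports to a geometry. The core idea of skipping pixels outside the region can also be sketched client-side with a standard ray-casting test (a generic illustration, not the project's code):

```python
def point_in_polygon(x, y, poly):
    # Ray-casting point-in-polygon test; poly is a list of (x, y)
    # vertices. Pixels outside the region of interest could be
    # skipped before export or prediction.
    inside = False
    j = len(poly) - 1
    for i in range(len(poly)):
        xi, yi = poly[i]
        xj, yj = poly[j]
        if (yi > y) != (yj > y) and x < (xj - xi) * (y - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside
```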

Upgrade Cloud Run to gen2 execution environment

Context: Inference is done using the Google Cloud Run service and Google Cloud Run logs are important for debugging any issues that arise during inference.

Issue: Cloud Run logs are cluttered with the following message:

Container Sandbox: Unsupported syscall setsockopt(0x83,0x1,0xd,0x3e8fec5fdf80,0x8,0x4). It is very likely that you can safely ignore this message and that this is not the cause of any error you might be troubleshooting. Please, refer to https://gvisor.dev/docs/user_guide/compatibility/linux/amd64/#setsockopt for more information.

Potential Solution: Upgrading to Cloud Run gen2 execution environment: google/gvisor#1739

Existing importable LabeledDatasets

Context: To start training a model a user must first add data to the project.

Issue: This adds to the initial start-up cost and may deter future users.

Potential solution: Some method to import ML ready data directly into datasets.py

Tutorial notebook should include ability to make predictions on any region

Context: The tutorial.ipynb notebook includes example Earth observation data from Togo for making an example map. This method uses existing data stored in a public Google Cloud bucket.

Issue: To the user, it is unclear how the data in the bucket is obtained from EarthEngine.

Potential solution: Add example EarthEngine code to pull in data for any region inside the tutorial notebook. This will have to pull the data directly into Colab to avoid a dependency on Google Cloud Storage or Google Drive.

Profile deployment script

Context: The deployment script currently takes 5-7 minutes.

Issue: It can probably be faster.

Potential solution: Profile the deployment to find the biggest time sinks and investigate ways of speeding them up.

Model integration tests

Issue: There are data integration tests but no model integration tests.

Potential solution:
These should follow the same technique as the data integration tests. Example tests:

  • Test model outputs are non constant
  • Test model outputs are between 0 and 1
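The two example tests above can be sketched as follows; predict here is a placeholder standing in for a deployed model's forward pass:

```python
import unittest

class TestModelOutputs(unittest.TestCase):
    @staticmethod
    def predict(batch):
        # Placeholder model: replace with a real forward pass over
        # a fixed batch of earth observation data.
        return [(i % 7) / 10 for i in range(len(batch))]

    def test_outputs_are_non_constant(self):
        preds = self.predict(list(range(16)))
        self.assertGreater(len(set(preds)), 1)

    def test_outputs_are_probabilities(self):
        for p in self.predict(list(range(16))):
            self.assertGreaterEqual(p, 0.0)
            self.assertLessEqual(p, 1.0)
```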

Creating map integration test

Context: The create_map.ipynb notebook creates a map using one of the deployed models.

Issue: There's currently no integration test for verifying the notebook successfully runs to completion.

Potential solution: An additional GitHub action that runs optionally after deployment to verify predictions work as expected.

Visualize cropland map in inference colab notebook

Context: Upon uploading the cropland map to Earth Engine, the user is asked to paste a script into Earth Engine to visualize the generated cropland map.

Issue: A smoother user experience would allow for visualizing the map directly in the colab notebook with the option of also visualizing in Earth Engine.

This may be possible to achieve using https://geemap.org/ or a similar tool.

Allow feature caching through Google Cloud Storage

Context: openmapflow create-features is significantly sped up by the fact that tifs are cached on Google Cloud.

Issue: Features (individual pixels) are not stored on Google Cloud and therefore need to be recreated when a new project uses similar labels to another project.

Potential solution: Manually upload features to Cloud Storage periodically and add functionality to check which features are already available before creating new ones.

Openmapflow not recognized on Windows machine

Problem:
Attempting to run openmapflow commands in miniconda/anaconda prompt results in the error:
'openmapflow' is not recognized as an internal or external command, operable program or batch file.

More:

  • Windows 10 machine
  • openmapflow project crop-mask
  • openmapflow version, error, and local location on machine:
    image
