trace's Introduction

carbonplan / trace

working repo for carbonplan's climate trace project

This repository includes example Jupyter notebooks and other utilities for a collaborative project CarbonPlan is working on to track emissions from biomass losses in forests.

This project is a work in progress. Nothing here is final or complete.

We have completed the scripts and notebooks for delivery of version 0 of the dataset (carbonplan_trace/v0), which largely reproduces and extends work by Zarin et al. (2016) as hosted on the Global Forest Watch platform.

Input datasets include:

  • aboveground biomass for year 2000 (Zarin et al., 2016)
  • binary masks of tree cover loss year for 2001-2020 (Hansen et al., 2013)
  • Suomi NPP (VIIRS) Fire Masks for 2011-2020 (Schroeder et al., 2014)
  • country boundary shapefile from the Database of Global Administrative Areas (GADM) version 3.6. Note: Geopolitical boundaries that have changed over the period of record will be tagged to the static country designation as defined in GADM v3.6.

Some tips for reproducing this effort:

  • The scripts/aggregate_emissions.v0.py script can be run to reproduce both the 3 km global emissions raster dataset and the country-average estimates. As a warning, depending on the size of the machine you're running on, you might encounter memory issues when dealing with the 30m datasets. For that reason, we opted to process the 30m tiles in serial. If you are struggling, check that your cluster isn't getting overloaded.
  • We recommend using the sample notebook notebooks/blogpost_sample_notebook.ipynb as a starting point to introduce yourself to the structure of the data and how to work with a high resolution global product. We emphasize that the 3km product may be sufficient for some users; a minimal sketch for opening the coarse product follows this list.
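
The following is a minimal sketch, not the project's published recipe, for opening a coarse zarr store with xarray and poking at its structure. The store path and the "emissions"/"year" names are assumptions; substitute the actual v0 output location.

```python
import xarray as xr

# hypothetical location of the 3 km global emissions zarr store
store = "s3://example-bucket/carbonplan-trace/v0/emissions_3km.zarr"

# open lazily and inspect variables, coordinates, and chunking
ds = xr.open_zarr(store, consolidated=True)
print(ds)

# e.g. total emissions for a single year, summed over space
# (assumes an "emissions" variable with year/lat/lon dimensions)
annual_total = ds["emissions"].sel(year=2019).sum(dim=["lat", "lon"])
print(annual_total.compute())
```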

license

All the code in this repository is MIT-licensed. Where possible, the data used by this project is licensed under CC-BY-4.0. We include attribution and additional license information for third party datasets, and we request that you maintain that attribution if using this data.

about us

CarbonPlan is a nonprofit organization that uses data and science for climate action. We aim to improve the transparency and scientific integrity of climate solutions with open data and tools. Find out more at carbonplan.org or get in touch by opening an issue or sending us an email.

trace's People

Contributors

dependabot[bot], freeman-lab, katamartin, maxrjones, norlandrhagen, orianac, pre-commit-ci[bot], tcchiao

trace's Issues

consolidate tile utilities

@tcchiao and I have both been writing some useful utilities for parsing 10x10-degree tile IDs. We should consider consolidating these into carbonplan_trace.tiles.
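
As a point of reference, here is a rough sketch (not the existing utilities) of what such a parser might look like, assuming Hansen-style tile IDs like "50N_120W" that name the top-left corner of a 10x10-degree tile.

```python
import re

def parse_tile_id(tile_id: str) -> dict:
    """Return lat/lon bounds for a tile id like '50N_120W'.

    Assumes the id names the NW corner of a 10x10-degree tile.
    """
    match = re.match(r"^(\d+)([NS])_(\d+)([EW])$", tile_id)
    if match is None:
        raise ValueError(f"unrecognized tile id: {tile_id}")
    lat_str, ns, lon_str, ew = match.groups()
    lat = int(lat_str) * (1 if ns == "N" else -1)
    lon = int(lon_str) * (1 if ew == "E" else -1)
    return {"min_lat": lat - 10, "max_lat": lat, "min_lon": lon, "max_lon": lon + 10}

print(parse_tile_id("50N_120W"))
# {'min_lat': 40, 'max_lat': 50, 'min_lon': -120, 'max_lon': -110}
```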

30m data export

We're interested in testing a 30m web map. That'll require generating the pyramid starting from a much higher resolution version of the data (either 30m, or something slightly coarser). We can start experimenting with this and document any challenges here (or over on ndpyramid).
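
As a starting point for experimenting, a generic coarsening loop with plain xarray is sketched below; the actual pyramid for the web map would likely be built with ndpyramid. The store names, dimension names, and coarsening factors are assumptions.

```python
import xarray as xr

ds = xr.open_zarr("30m_biomass.zarr")  # hypothetical 30 m input store

# write progressively coarser levels; level 0 is the native resolution
for level, factor in enumerate([1, 2, 4, 8, 16]):
    if factor == 1:
        coarse = ds
    else:
        coarse = ds.coarsen(lat=factor, lon=factor, boundary="trim").mean()
    coarse.to_zarr(f"pyramid/level_{level}.zarr", mode="w")
```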

Running list of TODOs

For MVP for Washington:

  • basic form of training dataset:
    • All GLAS shots translated to biomass using one allometric equation (Cindy) [done]
    • Look up sampling strategy of GLAS and allometric equation assumptions wrt leaf conditions (Cindy/Ori) [done]
    • Calculate seasonal averages for each year from Landsat as a spatially continuous map for WA (Ori + Joe); relatedly, decide on the Landsat data structure (snap to a uniform Hansen 30m grid, annually)
    • Extract the Landsat variables to use into a tabular format (all raw bands)
  • set up ML model for training (Cindy)
    • random forest + XGBoost!
    • set up inference function (a minimal training/inference sketch follows this list)
  • Set up inference inputs
    • extract the same Landsat variables into tabular format for all of Washington
  • Plotting function from ML model output (altair)(Ori) (lat/lon/time)
    • spatial maps
    • time series
  • Set up validation dataset
    • Find 4 well-respected datasets
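
Here is a minimal sketch of the training/inference setup listed above, assuming a tabular training set with raw Landsat band columns and a GLAS-derived biomass column; the file path and column names are hypothetical. An xgboost.XGBRegressor could be swapped in for the random forest.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_parquet("training_data_wa.parquet")  # hypothetical training table
features = ["band1", "band2", "band3", "band4", "band5", "band7"]  # raw Landsat bands

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["biomass"], test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0)
model.fit(X_train, y_train)
print("held-out R^2:", model.score(X_test, y_test))

def predict_biomass(model, table: pd.DataFrame) -> pd.Series:
    """Inference helper: predict biomass for a table of Landsat pixels."""
    return pd.Series(model.predict(table[features]), index=table.index)
```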

To expand to global:

  • Transforming the Harris et al. spreadsheet into Python
    • Mask of column 2 (ecoregion + NLCD) -> allometric equation
    • allometric equation = dictionary of functions (see the sketch after this list)
    • height metrics = another dictionary of functions [done]
    • parameter to indicate whether to preprocess (whether input is smoothed or raw)
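
The "dictionary of functions" pattern might look like the sketch below: one allometric equation per group, looked up by a key derived from the ecoregion/land-cover mask. The group names, height metrics, and coefficients are placeholders, not values from Harris et al.

```python
# placeholder allometric equations keyed by group; coefficients are made up
def eq_conifer(height_metrics):
    return 0.1 * height_metrics["h50"] ** 1.5

def eq_hardwood(height_metrics):
    return 0.2 * height_metrics["h75"] ** 1.3

ALLOMETRIC_EQUATIONS = {
    "conifer": eq_conifer,
    "hardwood": eq_hardwood,
}

def estimate_biomass(group: str, height_metrics: dict) -> float:
    """Look up the equation for this group and apply it to the height metrics."""
    return ALLOMETRIC_EQUATIONS[group](height_metrics)

print(estimate_biomass("conifer", {"h50": 20.0, "h75": 25.0}))
```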

Improvements by April:
GLAS/biomass:

  • apply GLAS filtering based on Harris et al. (Cindy) [done]
  • double check how GLAS elevation should be calculated from GLAH14 data
  • decide whether we should use smoothed or raw waveforms to make height metric calculations
  • Double check terrain calculations by reading Duncanson et al. more closely
  • potentially change the raw extracted GLAS data into the original variable names
  • interpolate between bins (currently at 15 cm intervals; see the sketch after this list)
  • double check that compression ratio does not change during the valid signal part (between sig beg and sig end)
  • Figure out which allometric equations can be used for leaf-off conditions. Allometric equations are trained predominantly on leaf-on conditions, so we should determine whether estimates for leaf-off conditions are valid. This is relevant for our reporting/updating interval. Proposal: update twice a year, after the end of the growing season in each hemisphere (September and March(?)).
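
For the bin-interpolation item, a simple linear interpolation onto a finer grid could look like the sketch below; the waveform and bin positions here are synthetic.

```python
import numpy as np

bin_spacing = 0.15  # meters, the native 15 cm interval noted above
distance = np.arange(0, 10, bin_spacing)      # synthetic bin positions
waveform = np.exp(-((distance - 5.0) ** 2))   # synthetic waveform values

fine_distance = np.arange(0, 10, 0.01)        # 1 cm target grid
fine_waveform = np.interp(fine_distance, distance, waveform)
```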

Landsat

  • Masking clouds (potentially via https://github.com/ubarsc/python-fmask, or using the *_BQA.TIF files in the Landsat archive)
  • Smoothing Landsat images using CCDC
  • Grabbing multiple Landsat pixels for each GLAS record? GLAS footprints are ~70 m in diameter and Landsat pixels are 30 m, so we could use ~4 Landsat pixels, or a bounding box of all overlapping Landsat pixels.
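
One way to sketch the multiple-pixels-per-shot idea is to average a small window of 30 m pixels centered on each GLAS footprint; the dataset, variable names, and window size below are assumptions.

```python
import numpy as np
import xarray as xr

landsat = xr.open_zarr("landsat_wa_seasonal.zarr")  # hypothetical seasonal composite

def sample_footprint(da: xr.DataArray, lat: float, lon: float, window: int = 3) -> float:
    """Mean of a window x window block of 30 m pixels centered on a GLAS shot."""
    ilat = int(np.argmin(np.abs(da.lat.values - lat)))
    ilon = int(np.argmin(np.abs(da.lon.values - lon)))
    half = window // 2
    block = da.isel(
        lat=slice(ilat - half, ilat + half + 1),
        lon=slice(ilon - half, ilon + half + 1),
    )
    return float(block.mean())

red_mean = sample_footprint(landsat["red"], lat=47.6, lon=-122.3)
```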

ML model

  • Training different model for each ecoregion
  • Incorporating a climate dataset into the training of the model (Others have used Worldclim, though we could use Terraclim)
  • out of sample validation
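
For the out-of-sample validation item, holding out whole ecoregions (rather than random rows) gives a better sense of transfer to unseen regions; a sketch using scikit-learn's GroupKFold is below, with hypothetical column names.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

df = pd.read_parquet("training_data_global.parquet")  # hypothetical table
features = ["band1", "band2", "band3", "band4", "band5", "band7"]

# each fold holds out entire ecoregions
scores = cross_val_score(
    RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0),
    df[features],
    df["biomass"],
    groups=df["ecoregion"],
    cv=GroupKFold(n_splits=5),
    scoring="r2",
)
print("per-fold R^2:", scores, "mean:", scores.mean())
```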

v0 and v1 data cleanup

We've been pushing hard on both v0 and v1 lately. As these pushes wrap up, we've likely left quite a bit of data on s3 and gcs that needs to be cleaned up.

@orianac and @tcchiao, can you each provide a short writeup below outlining the data we want to keep/delete from v0 and v1 respectively?

add methods docs to this repo

We currently have two draft google docs that we've been using to summarize the technical methods. We should move these documents to this repo in order to properly version control them and to document the methods deployed here.

Recreate V0 and do some QA/QC

This would be a three step process:

  • run the global_emission.v0.py script to recreate all emission tiles (zarr) (~200 tiles)
  • run the aggregate_emissions.v0.py script to create global coverage gridded (1 zarr file for sharing) and country-total timeseries (1 json file)
  • Visualization notebook (the gridded analysis is so nicely done through the website so this notebook might just focus on the country roll-ups)
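
For the country roll-up part of the visualization notebook, a minimal sketch might read the country-total JSON and plot a few countries with altair; the file name and record layout are assumptions.

```python
import altair as alt
import pandas as pd

# assumes records like {"iso3": "BRA", "year": 2001, "emissions": ...}
df = pd.read_json("country_rollups.json")
subset = df[df["iso3"].isin(["BRA", "IDN", "COD"])]

alt.Chart(subset).mark_line().encode(
    x="year:O",
    y="emissions:Q",
    color="iso3:N",
)
```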

data exports for v1 explainer

We had a very productive call today discussing the data needs for the v1 explainer. This issue documents the outputs of that call:

Figure 1 (global maps): a zarr group with a 2d "biomass" variable (coarsened to 0.5 deg) and 1d "lat" and "long" variables as coordinates (~700x1 and ~300x1)

Figure 5 (difference maps): a zarr group with a 3d "difference" variable (coarsened to 0.5 deg) holding the three difference maps we want to show, and 1d "lat" and "long" variables as coordinates (~700x1 and ~300x1)

GeoJSON for the "study domain"

R2 and other stats in a JSON
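
For the Figure 1 export, the coarsening step could be as simple as the sketch below; the input store and the coarsening factor (which depends on the native grid spacing) are assumptions.

```python
import xarray as xr

biomass = xr.open_zarr("v1_biomass.zarr")["biomass"]  # hypothetical input

# e.g. if the native grid is ~0.005 degrees, a factor of 100 gives ~0.5 degrees
coarse = biomass.coarsen(lat=100, lon=100, boundary="trim").mean()

coarse.to_dataset(name="biomass").to_zarr("explainer/fig1_biomass_0p5deg.zarr", mode="w")
```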

Alternatives and/or additions to Landsat collection 2 for biomass prediction

For the V1 data produced here, we've used Landsat 7 ETM Analysis-Ready-Data (Collection 2, Level 2). Along the way, we've had plenty of discussions about the potential to pull in alternate and/or additional datasets for biomass model training/inference. I'd like to use this issue to enumerate alternatives and discuss their potential inclusion in future work.

I'm specifically hoping to see some discussion of the following datasets (please add to this list):

pinging @tcchiao, @orianac, and @badgley
