The phenology_forecasts from sdtaylor

updates to observation data

I'm switching from the "Status and Intensity" data to the "Individual Phenometrics" data. The latter summarizes observations into first "yes" dates for all individual trees, which I had been doing by hand. This NPN-summarized data also has better conflict flags, which can be used to filter out most of the problematic group sites.
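
A minimal sketch of the conflict-flag filter, assuming the download has been read into a list of dicts. The column names (`first_yes_doy`, `observed_status_conflict_flag`), the flag text, and the `-9999` "no conflict" code are assumptions for illustration and should be checked against the actual file.

```python
# Hypothetical rows standing in for the Individual Phenometrics download.
rows = [
    {"individual_id": 1, "first_yes_doy": 95, "observed_status_conflict_flag": "-9999"},
    {"individual_id": 2, "first_yes_doy": 101, "observed_status_conflict_flag": "MultiObserver-StatusConflict"},
    {"individual_id": 3, "first_yes_doy": 88, "observed_status_conflict_flag": "-9999"},
]

NO_CONFLICT = "-9999"  # assumed code meaning "no conflict recorded"


def drop_conflicts(records):
    """Keep only records whose conflict flag equals the no-conflict code."""
    return [r for r in records if r["observed_status_conflict_flag"] == NO_CONFLICT]


clean = drop_conflicts(rows)
print([r["individual_id"] for r in clean])
```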

Data download for all prior data used in model building (2008 - 2017)
https://data.usanpn.org/observations?search=3dd370f197c6ae95e881f0e93cc56ae8

oak leaf image

in case anyone asks, this was a public domain image from here https://openclipart.org/detail/110371/oak-leaf-silhouette

PrettyBigData Issues

Notes on bottlenecks when dealing with 100s of GB (several TB in the future) of weather forecast data.

Using chunks in NetCDF
https://www.unidata.ucar.edu/blogs/developer/entry/chunking_data_why_it_matters

Xarray discussion on aligning dask and netcdf chunks
pydata/xarray#1440

example of using apply_ufunc in downscaling observed and modelled arrays
https://groups.google.com/forum/#!topic/xarray/eyWr_ajTmL4

evaluation paper todo

Writing

confirm the 3 methodologies (climatology, climatology + current year temp, current year temp + temp forecasts) are consistently named and lettered (a, b, c) throughout methods, results, discussion, and supplement.
abstract
npn data cite
all software citations
acknowledgements (not in main paper, they go in a special submission box)
moore funding statement in acknowledgements for bioRxiv version
ethan affiliations
do latex /refs so peerj will have an easier time. (though not technically needed till resubmission)

Logistics

clean up eval code on serenity
upload zenodo https://doi.org/10.5281/zenodo.3990010
collect / confirm all references
cover page
save all figures to files

required packages

NOAA CFSv2 forecasts are in GRIB2 format. Reading GRIB files in xarray requires pynio, which was only recently ported to Python 3 and requires a dev version. Here's how I got it to work.

From a fresh anaconda3 install

conda install -c ncar -c conda-forge pynio=dev python=3

Then download the latest xarray master and install with python setup.py install

In a few months this will hopefully just be

conda install xarray pynio

large bottleneck in np.exp

Profiling shows np.exp() taking quite a long time in the forecast runs.

Testing shows the following on serenity (ubuntu 16.04, conda install python3.6)

shawn@serenity:~$ python
Python 3.6.3 | packaged by conda-forge | (default, Nov  4 2017, 10:10:56) numpy 1.15
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import timeit
>>> timeit.timeit('np.exp(d)', setup='import numpy as np;d=np.ones((420,1405,620))',number=1)
27.85343541414477

A windows machine in the library runs the same command in 1.5 seconds.

This issue potentially points to a glibc issue.
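
A rough sketch for comparing machines more quickly. It uses an array with 1/1000 of the elements of the one timed earlier and also tries float32, on the guess that dtype and numpy version are useful things to compare. The function name and default sizes are made up for this note.

```python
import timeit

import numpy as np


def time_exp(shape=(42, 140, 62), dtype=np.float64, repeats=3):
    """Return the best wall-clock time (seconds) for one np.exp call on an array of ones."""
    d = np.ones(shape, dtype=dtype)
    return min(timeit.repeat(lambda: np.exp(d), number=1, repeat=repeats))


print("numpy", np.__version__)
for dtype in (np.float64, np.float32):
    print(dtype.__name__, time_exp(dtype=dtype))
```

Running the same script on serenity and on the fast machine should show whether the slowdown scales with array size or is specific to one dtype.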

do cutoff for minimum number of observations

I think I chose species based on having >1000 observations in the raw data, but after processing this boiled down to <10 usable observations for some species. Need to fix this.
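
One way to apply the cutoff is to count observations after processing rather than in the raw data. A minimal sketch, where the cutoff value of 30 and the record layout are placeholders, not a decision:

```python
from collections import Counter

MIN_OBS = 30  # hypothetical cutoff; the right value still needs to be chosen


def species_with_enough_obs(processed_obs, min_obs=MIN_OBS):
    """Return species whose count of usable (post-processing) observations meets the cutoff."""
    counts = Counter(obs["species"] for obs in processed_obs)
    return sorted(sp for sp, n in counts.items() if n >= min_obs)


# Made-up processed observations: 40 usable for one species, 8 for another.
processed = [{"species": "acer saccharum"}] * 40 + [{"species": "alnus incana"}] * 8
print(species_with_enough_obs(processed))
```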

archive forecasts

from Ethan:
At this size we can store 1000 sets of forecasts in a single Zenodo archive, which is over a decade of forecasts. I'd recommend starting to automatically archive there, either by pushing forecasts to a directory in https://github.com/weecology/forecasts (you already have write access) or by us setting up a similar system for just the phenology forecasts. This would address Deitze's interest in downloading forecasts and you could (at some point in the future) add a feature to the website that can download forecasts for a selected species and forecast date.

constrain fall models to only predict into fall

ie look at this picture. how are the high elevation places getting predictions in may?

consolidate R tools

various R scripts with their own helper functions are hanging out in different places, namely model_building/phenology and the map building R scripts

use container

conda-pack looks like a good solution here

https://conda.github.io/conda-pack/

make the cfs stuff into a package

This seems like a good option to break out and clear a lot of code out in this repo

pyDownscaledCFSv2

-create a downscale model (or download a premade one)
option to downscale via PRISM or DayMet

-downloads cfs data
-converts them to netcdf format
-downscales via

need classes with generalized methods for
PRISM data
daymet data

temperature data

Weather forecasts are available from the NOAA CFSv2 model. A new run is released every 6 hours and forecasts 9 months out, with slight variations on output depending on the time of day.
These are deterministic forecasts, so to create an ensemble I'd need to use several days worth and combine them.
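
Combining runs from successive initialization times like this is usually called a lagged ensemble. A minimal sketch of the summary step, using made-up temperatures and assuming the runs have already been aligned to the same valid dates:

```python
from statistics import mean, stdev

# Hypothetical daily mean temperatures (deg C) for the same three valid dates,
# taken from four different CFSv2 initialization times.
runs = {
    "2018-01-04 18Z": [1.2, 2.5, 3.1],
    "2018-01-05 00Z": [0.8, 2.9, 3.6],
    "2018-01-05 06Z": [1.5, 2.2, 2.8],
    "2018-01-05 12Z": [1.1, 2.6, 3.3],
}


def ensemble_summary(member_series):
    """Per-date mean and standard deviation across ensemble members."""
    per_date = list(zip(*member_series))
    return [(mean(vals), stdev(vals)) for vals in per_date]


for day, (m, s) in enumerate(ensemble_summary(runs.values())):
    print(day, round(m, 2), round(s, 2))
```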

Downscaling requirements

This will involve taking the CFSv2 reanalysis and combining it with PRISM to get comparisons.

Potential steps

On each forecast day (ie. 1st and 15th of the month)

Download latest 10 (?) model runs.
Also download latest operations re-analysis.
(could also use the op. analysis that NPN uses)
Extract North America.
Downscale each one.
combine time series of the obs. temp from some cutoff thru the spring/summer

End up with:
An nc file of 10 runs spanning cutoff to end of forecast
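
A small sketch of working out which runs "latest 10" refers to, given the 6 hour release cycle. It assumes cycles at 00/06/12/18 UTC and ignores any delay between a cycle time and the files actually being posted, so the newest entry may not be downloadable yet. The function name is made up.

```python
from datetime import datetime, timedelta


def latest_run_times(now, n_runs=10, cycle_hours=6):
    """Return the n most recent cycle times at or before `now`, newest first."""
    latest = now.replace(
        hour=(now.hour // cycle_hours) * cycle_hours, minute=0, second=0, microsecond=0
    )
    return [latest - timedelta(hours=cycle_hours * i) for i in range(n_runs)]


runs = latest_run_times(datetime(2018, 1, 15, 14, 30))
print(runs[0], runs[-1])
```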

scope

Goals

using species/phenophase specific phenology models, making forecasts several months out of the spring/summer season
do this using true weather forecasts from the CFSv2 model.
automatically repeat forecasts every 2 weeks or so

Needed

Automatic download, ingest, and downscaling of CFSv2 data. (probably the hardest part)
fitting of model to large spatial dataset and producing maps.
potentially combining observed data with forecast data (like forecast made on Feb 1)
potentially using phenology database compiled by brian stucky

change scale in uncertainty map

too much variation in this between species. Fix the scale to something like 0 - 20+
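
The maps themselves are built in R, but the logic is simple enough to sketch in Python: clamp values to the shared range and label the top bin as open-ended. The step of 5 is an arbitrary choice for illustration.

```python
SCALE_MAX = 20  # upper end of the shared scale suggested above


def clip_uncertainty(values, upper=SCALE_MAX):
    """Clamp values to [0, upper] so every species map shares one color scale."""
    return [min(max(v, 0), upper) for v in values]


def scale_labels(upper=SCALE_MAX, step=5):
    """Legend labels, with the top bin marked as open-ended."""
    ticks = list(range(0, upper + 1, step))
    return [str(t) for t in ticks[:-1]] + [f"{ticks[-1]}+"]


print(clip_uncertainty([3.2, 18.0, 41.5]))
print(scale_labels())
```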

make log file

lots of output to log

data provenance

Use netcdf attributes to record some history in the files

recent forecast files

url of forecast file
downscaled using such and such method, observed data
date range of prism data, prism homepage url

plant forecast files

weather forecast note above
data from NPN.
using model XXX
weecology
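
A sketch of collecting these notes into a flat dict of strings, which is the shape netCDF global attributes take. The attribute names, the example URL, and the method label are all placeholders.

```python
from datetime import datetime, timezone


def provenance_attrs(forecast_url, downscale_method, prism_start, prism_end):
    """Build a flat dict of string attributes suitable for netCDF global attributes."""
    return {
        "forecast_source_url": forecast_url,
        "downscaling_method": downscale_method,
        "prism_date_range": f"{prism_start}/{prism_end}",
        "history": f"{datetime.now(timezone.utc):%Y-%m-%dT%H:%M:%SZ} created",
    }


attrs = provenance_attrs(
    "https://example.org/cfsv2/run.grb2",  # placeholder URL
    "hypothetical-method",
    "2000-01-01",
    "2017-12-31",
)
# With an xarray Dataset `ds` this could be attached via: ds.attrs.update(attrs)
print(attrs)
```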

have all or nothing create for API client

Ran into an issue where forecast entries from the static image metadata file were duplicated, causing the API update to error out on the duplicate entries, which left an incomplete API update and website errors. Not sure where the duplicates came from, but a good guard regardless is having an all or nothing update for every forecast iteration inside api_client.py. There is likely something for this built into the django stuff.
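
On the Django side, `django.db.transaction.atomic` is the built-in for this. The sketch below shows the same all-or-nothing behavior with the standard library's sqlite3 so it can be run standalone. The table and column names are invented for the example and are not the site's actual schema.

```python
import sqlite3


def update_all_or_nothing(conn, rows):
    """Insert every row or none of them; return True if the batch was committed."""
    try:
        with conn:  # commits on success, rolls back if any statement raises
            conn.executemany(
                "INSERT INTO forecast (species, phenophase, issue_date) VALUES (?, ?, ?)",
                rows,
            )
    except sqlite3.IntegrityError:
        return False
    return True


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE forecast (species TEXT, phenophase TEXT, issue_date TEXT, "
    "UNIQUE (species, phenophase, issue_date))"
)

# A batch containing a duplicate entry is rejected as a whole.
batch = [
    ("acer saccharum", "flowers", "2018-04-09"),
    ("acer saccharum", "leaves", "2018-04-09"),
    ("acer saccharum", "leaves", "2018-04-09"),
]
print(update_all_or_nothing(conn, batch))
print(conn.execute("SELECT COUNT(*) FROM forecast").fetchone()[0])
```

The point of the design is that a bad batch leaves the previous forecast set untouched instead of half-replaced.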

account for timezone

A fairly important step that's easily overlooked. All the times in the CFS forecast are GMT. Need to convert things to their own timezone.

Or ... with daily mean temperature it mayyyy be fine.
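
A quick illustration of why it could matter: an early-morning GMT timestep belongs to the previous calendar day in North America, so daily means computed on GMT days are shifted by several hours. The sketch uses a fixed UTC offset, which is an assumption that ignores daylight saving time.

```python
from datetime import datetime, timedelta, timezone


def utc_to_local(dt_utc, utc_offset_hours):
    """Convert an aware UTC datetime to a fixed-offset local time."""
    return dt_utc.astimezone(timezone(timedelta(hours=utc_offset_hours)))


stamp = datetime(2018, 3, 1, 3, 0, tzinfo=timezone.utc)  # a 03Z forecast step
local = utc_to_local(stamp, -5)  # UTC-5, e.g. US Eastern standard time
print(stamp.date(), local.date())
```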

key forecast dates

Key issue dates where new things were implemented.

2018-01-05 - First full automated run
2018-01-20 - first time having issue_date and crs in attributes (easily fixable)
2018-01-23 - first time using the larger species set (66 instead of 44) by having a larger set of range masks

PRISM connection issue

Currently cannot connect to the PRISM ftp server. Running this on a node just hangs

In [3]: from ftplib import FTP

In [4]: ftp_con = FTP(host='prism.nacse.org', user='anonymous',passwd='abc123')

In [5]: ftp_con.nlst('/')

BUT, running the same on the login node is fine.

In [1]: from ftplib import FTP

In [2]: ftp_con = FTP(host='prism.nacse.org', user='anonymous',passwd='abc123')

In [3]: ftp_con.nlst('/')
Out[3]: 
['/PRISM_datasets.pdf',
 '/normals_800m',
 '/normals_4km',
 '/daily',
 '/monthly',
 '/data_archive']
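
A sketch of a version that fails fast instead of hanging, by passing ftplib's `timeout` argument. This does not fix whatever is blocking the compute nodes (a firewall on the FTP data connection is one guess, not something confirmed here), but it turns the hang into an error that can be logged. The function name is made up.

```python
from ftplib import FTP


def list_ftp_root(host, timeout=30, passive=True):
    """List the server root, raising within roughly `timeout` seconds instead of hanging."""
    ftp = FTP()
    try:
        ftp.connect(host, timeout=timeout)
        ftp.login()  # anonymous login
        ftp.set_pasv(passive)  # passive is ftplib's default; exposed to try both modes
        return ftp.nlst("/")
    finally:
        ftp.close()  # safe to call even if connect() failed
```

Calling `list_ftp_root('prism.nacse.org')` on both the login node and a compute node, once with `passive=True` and once with `passive=False`, would show whether the transfer mode makes a difference.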

forecast archiving

figshare - unlimited data
zenodo - 50GB limit (but can ask for more), versioning (very important)

Current daily forecasts are ~31mb compressed.

Infrastructure paper eco-apps final checklist

Highlighted notes from staff

Please provide the main manuscript in Word or PDF+LaTeX format. Word is preferred. Figures and appendices may remain in PDF format. See Items 1, 8, 9, and 10 in the Checklist for Authors below.
List the Running Head on the title page of the manuscript file, matching the entry in the corresponding field of the online submission form.
Please provide Figs. 1, 2, and 3 sized for PDF publication at no larger than portrait layout (maximum 6 inches wide x 8 inches high) or landscape layout (maximum 8.75 inches wide x 5.25 inches high). All text must be sized between 6 and 10 point when the image is sized for publication. For readability, we suggest using a text size hierarchy, sizing axis numbers between 6-7 point, axis labels between 8-9 point, panel labels that consist of words between 7-8 point, and panel labels that consist of a single letter at 10 point.
Remove line numbers from Appendix S1 to prepare it for posting online.

DEFAULT CHECKLIST FOR AUTHORS
How to Prepare Your Accepted Manuscript for Publication

updates to species_list.csv

Update the range_map_made column in misc_data_prep/create_species_masks.py

remove occurences_downloaded column

Phenophase_ID -> phenophase

this column name has stuck around for some reason, need to change it everywhere in one go

hindcasting notes

dates to hindcast from = jan 1 - June 30, every 4 days
45 hindcasts / year
~ 100 mb per species/phenophase (compressed) = 4.5GB * ~~138~~ 27 spp = 121 GB
~ 30 min per species/phenophase = 14 hours * 45 hindcast dates = 630 total hours
(~ 2 days w/ 16 cores. but will likely need 25-30GB ram each)
(note the 30 min time was with ThermalTime model, a Uniforc model took 2 hours)
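
The same back-of-envelope estimate as a script, so the inputs are easy to change. It uses unrounded values, which gives 607.5 hours rather than 630 because 13.5 hours per hindcast date was rounded up to 14 above. It also assumes perfect scaling across cores.

```python
n_dates = 45         # hindcast dates per year (from the notes above)
n_species = 27       # species/phenophase count used in the estimate
gb_per_run = 0.1     # ~100 MB compressed per species/phenophase
hours_per_run = 0.5  # ~30 min per species/phenophase (ThermalTime)
n_cores = 16

storage_gb = gb_per_run * n_dates * n_species
cpu_hours = hours_per_run * n_species * n_dates
wall_days = cpu_hours / n_cores / 24  # assumes perfect scaling across cores

print(f"storage: {storage_gb:.1f} GB")
print(f"compute: {cpu_hours:.1f} core-hours, ~{wall_days:.1f} days on {n_cores} cores")
```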

for each hindcast_date
  obtain current_weather_observation nc file (or just use the one already built)
  cutoff the current_weather_observation file to hindcast_date - 1 day
  get latest forecasts for hindcast_date
  make folder of "current forecasts"
  pass that folder to apply_phenology_forecasts, but this must use the bootstrap versions of all models
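
A sketch of just the date bookkeeping for that loop. The file handling and the call to apply_phenology_forecasts are left out, and the function names here are made up.

```python
from datetime import date, timedelta


def hindcast_dates(year, step_days=4):
    """Dates from Jan 1 through Jun 30 of `year`, every `step_days` days."""
    current, end = date(year, 1, 1), date(year, 6, 30)
    while current <= end:
        yield current
        current += timedelta(days=step_days)


def plan_hindcasts(year):
    """Pair each hindcast date with the last day of observed weather to keep."""
    return [(d, d - timedelta(days=1)) for d in hindcast_dates(year)]


plan = plan_hindcasts(2018)
print(len(plan), plan[0], plan[-1])
```

With both endpoints included this gives 46 dates for 2018, close to the 45 per year estimated above.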

infrastructure paper final todo

Manuscript Logistics

zenodo
confirm all references
npn data citation
all software refs in final methods paragraph.
get nice X's in prediction paragraph (latitude X longitude X time)
Use $\times$
cited papers by white or taylor from the past 12 months.
portal forecasting paper (MEE)
pyphenology (JOSS)
npn-lter paper
portal data paper (PLOS BIO)
figures and figure legends, with references right after main text.
potentially have to edit the latex template: https://rmarkdown.rstudio.com/authoring_bibliographies_and_citations.html
fiddle with figure font sizes. shoot for final widths from the instructions:

Figures should be drawn/submitted at their smallest practicable size (to fit a single column (82 mm), two-thirds page width (110 mm) or full page width (173 mm).

Writing Stuff

different types of forecasts

peak flower forecasts - some papers have looked into this
community forecasts - like a mountain meadow, i'm very interested in this

infrastructure paper first resubmission todo

cite little dataset for range maps
in supplement include stars on Janets and Lori's species

other downscaling methods

Looks like a nice package here https://github.com/JiaweiZhuang/xESMF, for the initial interpolation only

need some sort of model log file

for every species/phenophase give some status update on the model
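
A minimal sketch with the standard logging module, writing one timestamped line per species/phenophase. The logger name, message layout, and status strings are placeholders.

```python
import logging


def make_model_logger(log_path):
    """Create a logger that appends one timestamped line per status update."""
    logger = logging.getLogger("phenology_forecast_models")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate lines if called twice
        handler = logging.FileHandler(log_path)
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


def log_status(logger, species, phenophase, status):
    """Record the outcome for one species/phenophase model."""
    logger.info("species=%s phenophase=%s status=%s", species, phenophase, status)
```

Usage would be something like `log_status(logger, species, phenophase, "fit ok")` inside the per-model loop, with the failure message logged in the except branch.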

have ability to link to specific forecast

Perm links to a specific species/phenophase/issue date OR the latest issue date

project folder structure

Things are getting a bit crowded, so...

phenology_forecasts/
  tools/
    ...
  model_building/  
    phenology/  
      build_phenology_models.py
      download_species_observations.R
      download_species_observation_temperature.R
      phenology_observation_functions.R
    climate/
      download_historic_observations.py
      download_historic_reanalysis.py
      download_historic_forecasts.py
      build_downscaling_model.py
  automated_forecasting/
    climate/
      download_latest_forecasts.py
      download_latest_observations.py
    phenology/
      apply_phenology_models.py
    presentation/
      build_maps.R
  misc_data_prep/
    create_species_masks.py
    create_mask.py

add crs to derived climate & phenology forecasts

it's in the current_season_observation.nc file

USGS Tree ranges

There is some data processing for this done by hand.

wget https://www.fs.fed.us/nrs/atlas/littlefia/species_table.html
grep IV_Little species_table.html | cut -d"<" -f15 | cut -d"_" -f 3,4 > species_list_cut_output.csv

Go thru and x out the 8 intermediate characters on each line in vim

sort species_list_cut_output.csv | uniq > species_list.csv

go back thru and put in commas. (quicker than it sounds)

April 9th forecast for some forecasts shows incorrect range

Looks like there's something funny going on with at least the Sugar Maple forecasts which went from this on April 5th:

To this on April 9th:

I've checked about half a dozen other species and don't see anything comparable, which is weird.

which time in reanalysis to use?

The CFSv2 reanalysis has 6 hour timesteps, but between those timesteps it has hourly "forecasts" where the model is run with no assimilation. Since I'm just getting daily means I only want the primary timesteps at 6 hour intervals. But this description says the 00 forecast for some things is essentially invalid, and I'm not sure if it applies to the tmp2m that I'm using.

Text from the pdf

Important Note: The forecast at the first time step (f00) of 3 minutes constitutes a spin up of the model physics, and extreme care should be taken when using it as a proxy of any type of validation. IT IS NOT THE ANALYSIS.

The tmp2m files I'm using don't have this 3 minute initial timestep so for the time being I'm using the initial timestep.

add data integration/assimilation via model ensemble weights

From @ethanwhite. Instead of picking one model for each species/phenophase, use an ensemble with appropriate weights. As new observations come in adjust the weights somehow for the next forecasts.

This will essentially be observations in the southern/lower elev. range of species affecting the forecast for more northern/upper elev individuals.
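
One simple candidate for the "somehow" is inverse-error weighting: score each model against the observations that have come in so far this season and weight it by 1 / mean absolute error. This is only one option, picked here because it is easy to sketch. The error values and predictions below are made up.

```python
def inverse_error_weights(abs_errors, floor=1e-6):
    """Weights proportional to 1 / mean absolute error, normalized to sum to 1."""
    inv = {name: 1.0 / max(err, floor) for name, err in abs_errors.items()}
    total = sum(inv.values())
    return {name: v / total for name, v in inv.items()}


def weighted_forecast(predictions, weights):
    """Weighted mean of per-model predicted day-of-year values."""
    return sum(predictions[name] * weights[name] for name in predictions)


# Hypothetical mean absolute errors (days) against this season's observations so far.
errors = {"ThermalTime": 4.0, "Uniforc": 8.0}
weights = inverse_error_weights(errors)
print(weights)
print(weighted_forecast({"ThermalTime": 120.0, "Uniforc": 126.0}, weights))
```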

species notes

some species in the Tree Atlas data have synonym issues

Alnus tenuifolia and Alnus rugosa -> A. incana
Cornus stolonifera -> C. sericea
Sambucus sp. -> S. nigra (this genus is a mess, see below)

Others are just neat plants that have the NPN data available but just need the range map

From Janet Prevey

Vaccinium membranaceum (black huckleberry)
mean flowering GDD: 698, mean fruiting GDD: 953
Gaultheria shallon (salal)
mean flowering GDD: 1113, mean fruiting GDD: 1790
Berberis aquifolium (Oregon grape)
mean flowering GDD: 460, mean fruiting GDD: 1447
Corylus cornuta (Hazelnut)
mean flowering GDD: 418, mean fruiting GDD: 1914

website

a fancy website

making selectable leaflet maps https://github.com/stefanocudini/leaflet-panel-layers

epic tutorial on leaflet/js everything online mapping https://www.e-education.psu.edu/geog585/node/776

custom tile creation with gdal (and made very complex with AWS) https://hi.stamen.com/stamen-aws-lambda-tiler-blog-post-76fc1138a145, and the code https://github.com/hotosm/oam-dynamic-tiler

leaflet color scale https://gis.stackexchange.com/questions/193161/add-legend-to-leaflet-map
