phenology_dataset_study: Issues
basler 2016 discussion
potential new analysis writeup
We draw heavily from Basler 2016, who did a comprehensive model comparison using budburst dates for six tree species collected over 40 years from 139-283 sites across central Europe. The best performing models, with an RMSE of 4-6 days, were fit and validated with data from only a single site. Another suite of models, fit by "pooling" all site data into a single model, had, at best, RMSE values of 7-9 days among the 6 species. A third suite performed the worst, with RMSE values of 9-10 days; these models were fit using information from a single site and used to make predictions across many sites.
Here we outline 4 scenarios of how the LTS and NPN datasets can be evaluated. A) LTS-derived models used to predict LTS observations, B) LTS-derived models used to predict NPN observations, C) NPN-derived models used to predict NPN observations, and D) NPN-derived models used to predict LTS observations. Scenario A is analogous to the best performing models in Basler 2016, thus we expect it to perform the best (i.e. have the lowest error). Scenario B is analogous to the worst performing models in Basler 2016, and we expect it to perform the worst. Scenario C is analogous to the "pooling" models, which had performance midway between the other two in Basler 2016. There is no equivalent comparison that matches scenario D, but we expect it to perform better than scenario C since the LTS observations have less variance than NPN observations.
Expected ranking regardless of model or species:
- Scenario A
- Scenario D
- Scenario C
- Scenario B
confirm data cleaning choices
All the long term datasets have various protocols which required decisions on when to assign budburst/flowering. Need to go through all this again, document what I did for the methods, and confirm that the decisions make logical sense and match the NPN data as closely as possible.
switch to sqlite
The results file containing all the predictions is 1.5 GB now and just too big. The model parameter file, at 35 MB, could probably be put into the same database too.
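A minimal sketch of the switch, using pandas and the stdlib sqlite3 module. The file, table, and column names here are assumptions for illustration, not the repo's actual layout:

```python
import sqlite3

import pandas as pd

# Stand-in for the predictions currently written to CSV.
predictions = pd.DataFrame({"species": ["acer_rubrum", "acer_rubrum"],
                            "doy_prediction": [120, 123]})

with sqlite3.connect("results.sqlite") as con:
    # Write the table (replacing any prior version), then read it back.
    predictions.to_sql("predictions", con, if_exists="replace", index=False)
    back = pd.read_sql("SELECT * FROM predictions", con)
```

A nice side effect is that downstream scripts can pull just the rows they need with a `WHERE` clause instead of loading the whole 1.5 GB file.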
fix npn temperature data
Currently there are a lot of NaNs in the temperature data when it is converted to a NumPy array in models.py, due to some missing data and many sites simply missing the value for Nov 1.
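A quick way to audit the gaps before fitting, assuming a 2-D (sites x days) array like the one built in models.py; the values below are made up:

```python
import numpy as np

# Stand-in temperature array (sites x days) with gaps like the ones that
# show up after the conversion in models.py.
temp = np.array([[5.0, np.nan, 7.0],
                 [np.nan, 4.0, 6.0],
                 [3.0, 4.5, 5.5]])

sites_with_gaps = np.isnan(temp).any(axis=1)  # which sites have missing days
n_missing = int(np.isnan(temp).sum())         # total missing values
```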
differential evolution maximum iterations
DE has an argument called maxiter, which is the maximum number of iterations to run. I've always had this set to None, assuming the algorithm would keep going until a solution was reached that met the other convergence criteria. In some related testing I figured out that setting maxiter to None actually means it defaults to 1000, so it will quit at 1000 iterations regardless. Note this is documented in newer versions but not older ones. It probably doesn't make a difference for most of the models, but for the complex models it might.
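A small sketch of the fix: pass an explicit maxiter rather than None, so the tolerance criteria, not the silent 1000-iteration default, decide when to stop. The objective here is just a toy function:

```python
from scipy.optimize import differential_evolution

# Simple objective with its minimum at the origin.
def sphere(x):
    return float(sum(xi ** 2 for xi in x))

# maxiter=None silently falls back to the documented default of 1000
# iterations; passing a large explicit value gives the tol-based
# convergence criteria room to terminate the run instead.
result = differential_evolution(sphere, bounds=[(-5, 5)] * 3,
                                maxiter=5000, tol=1e-8)
```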
send paper to data providers
- NPN
- Harvard Forest
- Jornada
- Hubbard Brook
- HJ Andrews
resubmission #1 checklist
still to do
- find home for "rules of thumb paragraph"
- fix supp image absolute_scenario_rmse
- write reviewer reply into paragraph
Sign off from co-authors
- Joan
- Kristina
- Michael
- Ethan
Manuscript upload checklist
- main
- reviewer reply
- supp image
- supp methods
- main text diff pdf
Misc
- update all images on GitHub
- upload new zenodo version
- upload new bioRxiv version (main, & 2 supplements)
do a true out of sample comparison
Use some held-out data in the NPN and LTS datasets to do actual out-of-sample comparisons at the end.
journal
- PeerJ
- MDPI Plants
- MDPI Forests
- Ecology (as a report)
manuscript authorship
We originally had this for author order
Taylor, Meiners, Orr, Riemer, White
But Michael has suggested switching places with Kristina. Does that sound fine to everyone?
Taylor, Meiners, Riemer, Orr, White
final submission edits
SME's Comments to Author:
- l. 29. Clarify what "this difference" refers to.
I've changed this from "This difference is likely due to variation in phenological requirements within species." to "This difference in the cross-scale model comparison is likely due to variation in phenological requirements within species."
- ll. 137-138. These numbers don't seem to match what's reported in Fig. S1 (35 vs. 33?) - please check.
Thank you for catching this. This is part typo, part complex analysis. I've reworded the Figure S1 caption to be more clear. 33 in the Fig. S1 caption should be 32. Figure S1 only uses USA-NPN data and thus has 32 comparisons to make. The main text Line 137 stating 35 comparisons is correct since 3 species are in both the Harvard and Hubbard Brook datasets, thus there are 3 more comparisons to use in the primary analysis comparing USA-NPN and LTER datasets.
- Table 1. What is the logic behind the ordering of models here? It doesn't match that in the text (ll. 143-164). I suggest reordering to match text descriptions, or explaining the rationale in the table caption.
I have reordered this table to the same order the models are introduced in the text.
- Also, please define the parameters in the table caption.
I have added parameter descriptions to the Table 2 caption.
Everything below has been corrected.
- l. 224. Comma should be semi-colon.
- l. 240. Fix "two models sets"; should be "two model sets".
- l. 260. Should be "Uniforc or GDD model".
- l. 262 and 352. Data = plural.
- l. 351. Insert commas before and after "though".
check on Crimmins et al. values
At first glance, the AGDD values (which should match my gdd_fixed model values) are pretty far off (much higher in their paper). Their sample size from the NPN dataset also seems to be a lot lower: for example, 461 observations for Acer rubrum when I have > 1000.
Get bootstrapped errors
Get DOY estimates and variances by running all 500 bootstrapped models. Will have to do this in the Python code.
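The summary step can be sketched with plain NumPy. The array below is stand-in data; the real rows would be the predictions from the 500 fitted bootstrap models:

```python
import numpy as np

# Stand-in predictions: rows = 500 bootstrapped models, columns = observations.
rng = np.random.default_rng(42)
bootstrap_doy = rng.normal(loc=120, scale=3, size=(500, 10))

doy_estimate = bootstrap_doy.mean(axis=0)  # point estimate per observation
doy_variance = bootstrap_doy.var(axis=0)   # bootstrap variance per observation
```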
incorporate parameterization into model structure
The current setup works well for parameterizing the complex (uniforc, unichill, gdd) models using scipy.optimize. But scipy.optimize is overkill for the new simpler models (naive, linear_temp). One solution is to move the optimizer from its current location in scipy_bootstrapped_estimates.py into models.py, allowing a simple solution for the simple models. This would also allow different scipy.optimize.differential_evolution arguments for the more complex models.
final submission formatting/logistics checklist
Specific things called for by review specialist
- Please provide the main text/tables in Word or PDF+LaTeX format. Word is preferred. See Item 1 in the Checklist for Authors below.
- Rather than rotating the bodies of Tables 1 and 2 to read from bottom-to-top, the page layouts should instead be rotated from portrait to landscape view, with the tables set reading left-to-right. The same applies for figure images (Figs. 1, 3, and 4).
- Present the Figure Legends together as a group, on the pages following the last table and preceding the first figure.
- Replace the current version of Fig. 1 with a higher resolution version suitable for publication.
- Figs. 2, 3, and 4 will need sizing/layout adjustments to meet the page requirements. Please provide these figures sized for PDF publication, with each entire image sized no larger than portrait layout (maximum 6 inches wide x 8 inches high) or landscape layout (maximum 8.75 inches wide x 5.25 inches high). All text must be sized between 6 and 10 point when the image is sized for publication. For readability, we suggest using a text size hierarchy, sizing axis numbers between 6-7 point, axis labels between 8-9 point, panel labels that consist of words between 7-8 point, and panel labels that consist of a single letter at 10 point.
- Relabel the Supplementary Methods as Appendix S1 (including the title at the top of the document, the name of the file itself, and references to the material).
- Relabel the Supplementary Figures/Tables file as Appendix S2 (again, including the title at the top of the document, the name of the file itself, and references to the material).
- Prepare Appendices S1 and S2 in the format intended for publication. Note this material is not copy edited or typeset. Please check it carefully; I see the initial word is missing from the manuscript title given at the top of both files, and that in some cases page numbering is incorrect (page 5 of Appendix S2, for example).
- In the Data Archiving section of the online submission form you've indicated that all code, original data, and derived data will be posted on Zenodo. Although data archiving is not a requirement of this journal, if you are planning to archive data it should be completed at this stage because the production process moves quickly. If archiving data with GitHub/Zenodo, please do so at this time and update this section of the online form to provide the Zenodo DOI assigned to the data deposit. If you choose not to archive data at this time, please update the form to indicate this.
on when parameters predict no budburst/flowering
It's possible, with all the models, to have a set of parameters which predicts an event to happen well into the winter. This is obviously unrealistic. In the model fitting routines if the differential evolution algorithm tries out parameters which predict an event any time past around ~Oct 15, the doy estimate is instead set to 1000, which returns a very large error and those parameters are seen as non-optimal by DE.
This works very well until the fitted model is used on out of sample data. For example when I take NPN built models and try to predict Jornada data I sometimes get doy 1000, which is basically saying the model predicts the event will not happen that year.
How to deal with this? I don't know. In the phenology literature virtually everyone uses the same setup (usually simulated annealing to minimize RMSE), but only one paper (Olsson & Jonsson 2014) mentions this issue, and not how they deal with the resulting predictions.
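The penalty described above can be sketched as a wrapper around the error metric. The cutoff constant and function name are assumptions for illustration:

```python
import numpy as np

DOY_CUTOFF = 288  # ~Oct 15, matching the description above (an assumption)

def penalized_rmse(predicted_doy, observed_doy):
    """Hypothetical sketch: any prediction past the cutoff is replaced
    with 1000, which inflates the error so differential evolution treats
    those parameter sets as non-optimal."""
    predicted = np.where(predicted_doy > DOY_CUTOFF, 1000.0, predicted_doy)
    return float(np.sqrt(np.mean((predicted - observed_doy) ** 2)))
```

On out-of-sample data the same replacement is what produces the literal DOY-1000 predictions, i.e. "the event does not happen this year".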
interpolate budburst doy
As suggested by Gerst et al. 2016, instead of using the first date when budburst was observed, use the midpoint between that date and the previous observation date.
- NPN
- Harvard
- Hubbard Brook
- Jornada
- RMBL
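The adjustment itself is a one-liner; a hypothetical helper (the name and DOY arguments are mine, not the repo's):

```python
# Hypothetical sketch of the Gerst et al. 2016 adjustment: take the
# midpoint between the last observation before budburst and the first
# observation where budburst was recorded.
def interpolated_budburst_doy(last_no_doy, first_yes_doy):
    return (last_no_doy + first_yes_doy) / 2
```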
final model runs
Will be going through the modelling one more time to deal with #22, #17, and #23
- implement hold out data for #22
Without grouping sites together
- process data from scratch
- run models on hipergator
- do predictions
- make graphs
With grouping sites together
- process data from scratch
- run models on hipergator
- do predictions
- make graphs
* minimum observation cutoff for Harvard with this is set to 20 instead of 30
* total species comparisons drops to 20 from 38 because of tighter restrictions
add spatial models
The spatial component is a big factor in the NPN dataset, so I feel I should include some spatially explicit models to address it. There are some straightforward ones I can use.
Macro-scale specific budburst model (MSB)
This is an extension of the alternating model which has a correction using the mean Jan-Feb temperature. From Jeong et al. 2013.
M1 model
Extension of the classic GDD model which includes a daylength component. From Blumel & Chmielewski 2012.
rmbl issue
Taraxacum and Heracleum have blooming dates past DOY 180, which is the limit in some of the models. Need to increase that in the parameter bounds, possibly after doing #15.
acknowledgements
Use of data and software, either by citation or end-of-manuscript acknowledgement.
Data
Acknowledgements
- PRISM
- Harvard Phenology Data
- Jornada
- Hubbard Brook
- National Phenology Network
- HJ Andrews Experimental Forest
Citations
- PRISM
- Harvard Phenology Data
- Jornada
- Hubbard Brook
- National Phenology Network
- HJ Andrews Experimental Forest
Software
- R
- Python
Packages (R)
- dplyr
- tidyr
- ggplot2
- lubridate
- prism
- raster
- sp
Packages (Python)
- pandas
- scipy
- numpy
- mpi4py
supplementary materials
model_parameters.csv
predictions.csv
predictions_large.csv
- predictions from all 250 bootstrapped models, a 3 GB file
model_errors.csv
- derived from predictions
GitHub repo deposited somewhere, which includes all the raw data.
confirm model parameters
Basler 2016 has some weird parameter ranges in the supplement; some are restricted to > 0 where I have them ranging negative and positive. Double check those. Specifically the alternating model and the uniforc model (they use the unified model, but the sigmoid function is the same).
On grouping sites together
Crimmins et al. does this with the NPN dataset. If multiple individuals are observed at a single site, they take the mean of their budburst/flowering DOY. I'm not sure if I want to copy it though, as it takes away effects of plasticity.
Solution: do both. More than likely the overall results will be exactly the same.
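The grouped variant is a one-step aggregation in pandas. Column names below are assumptions, not the repo's actual schema:

```python
import pandas as pd

# Stand-in observations with multiple individuals per site.
obs = pd.DataFrame({
    "site_id": [1, 1, 2],
    "species": ["acer_rubrum"] * 3,
    "doy": [110, 120, 130],
})

# Grouped version: one mean budburst DOY per site x species, as in the
# Crimmins et al. approach; the ungrouped version just keeps obs as-is.
grouped = obs.groupby(["site_id", "species"], as_index=False)["doy"].mean()
```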
paper to do
- abstract
- finalize title
- update citation ordering (year -> author last name)
- data/software citations formatted nicely in references
- details on LTER data processing in supplement (#17)
- parameter comparison between 3 common species of Hubbard/Harvard (for supplement)
- split the above figure into 2
- all model names italicized
- active voice throughout
- deal with species/phenophase, species x phenophase, species phenophase combinations, blah
- all figures/tables to the end
- make a separate supplement pdf
- acknowledgements #11
- rename all image filenames to their order (i.e. fig1, figS1, etc.) in text and code
- LTS -> LTER in figures
- fix all figure references; with the map included at the beginning they are now all off
- orcids from everyone
- put the NPN data I used into the raw_data folder so it can go up on Zenodo, but don't commit it because it's so big
breakout models to their own repo
Make just the models their own Python package with an API like scikit-learn:
uniforc = phenology_models(type='uniforc')
uniforc.fit(temp_data, doy_data, method=['simulated_annealing','basin_hopping','DE'])
uniforc.predict(temp_data)
uniforc.get_params()
uniforc.set_params(**dict of params)
uniforc.score(metric=['aic','r2','rmse'])
- built-in Travis testing
- installer?
- packaging instructions: https://python-packaging.readthedocs.io/en/latest/
switch from MPI to dask
Because MPI is impossible to debug
try different models
Probably the unichill model, a simplified GDD model, and a correlative model (i.e. doy ~ mean_winter_temp).
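The correlative model is just an ordinary least-squares line; a sketch with made-up data values:

```python
import numpy as np

# Hypothetical sketch of the correlative model (doy ~ mean_winter_temp);
# the temperatures and DOYs here are illustrative, not real observations.
mean_winter_temp = np.array([-2.0, 0.0, 1.5, 3.0])
budburst_doy = np.array([130.0, 124.0, 120.0, 115.0])

# Degree-1 polynomial fit returns (slope, intercept).
slope, intercept = np.polyfit(mean_winter_temp, budburst_doy, 1)
predicted_doy = slope * 2.0 + intercept  # prediction for a 2 degree C winter
```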
compare datasets as out of sample verifications
Fit harvard data with NPN models and vice versa
include conifers
Harvard has Pinus and Tsuga. They are being excluded from the NPN data because I filter for phenophase ID == 371, which only works for broadleaf deciduous species.
do doy estimate better
A potentially better (quicker) way to calculate doy_estimate in models.py. Must deal with #6 first.
def doy_estimator2(self, gdd, doy_index, F):
    # np.tile already returns a copy, so no explicit .copy() is needed
    doy_index = np.tile(doy_index, (gdd.shape[0], 1))
    # mask out days where accumulated forcing hasn't reached the threshold F
    doy_index[gdd < F] = 10000
    # the first unmasked (smallest) day index per row is the estimated DOY
    return doy_index.argmin(1)