
bbs-forecasting's Introduction

weecology

bbs-forecasting's People

Contributors

davharris, ethanwhite, sdtaylor


bbs-forecasting's Issues

Daily weather variables

Paper at: http://onlinelibrary.wiley.com/doi/10.1111/ecog.02321/full

R Package is on CRAN under the name "RFc"

Here's a list of what they have available (note that the NCEP/NCAR Reanalysis has 4x daily data going back to 1948 for temp and precip; prate is short for "precipitation rate").

Joan and I played with it a bit today, and the API seems very good (it's from Microsoft Research).

[screenshot: list of available variables]

Example script I used:

library(RFc)  # climate API client from Microsoft Research; provides fcTimeSeriesDaily()

# daily air temperature ("airt") for a 1-degree lat/long box, starting in 1989
x <- fcTimeSeriesDaily("airt", latitude = c(35, 36), longitude = c(-117, -118), firstYear = 1989)

Decide on variable selection

Better to do it once for all models than to let each model choose its own optimal predictors?

Certainly shouldn't let models see the test set when deciding which predictors to use...

Add ensemble forecasting

It's known that ensembles often perform better for forecasting/prediction. We should be trying them.
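A minimal sketch of what an unweighted ensemble could look like, assuming each component model already produces a point forecast per site and year (the forecasts data frame and its column names are hypothetical):

library(dplyr)

# one row per model x site x year; the unweighted ensemble is the mean across models
ensemble <- forecasts %>%
  group_by(site_id, year) %>%
  dplyr::summarise(richness_pred = mean(richness_pred)) %>%
  ungroup()

Weighted versions (e.g. weighting by each model's error on a validation window) would be a straightforward extension.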

Add data on bodies of water

As an alternative to excluding water birds, @davharris has suggested adding something like the proportion of area within a 40 km buffer that is water as a predictor for the spatial models (and possibly the temporal ones, if this changes much over the time period of the data, which it might).
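A sketch of one way to compute that predictor, assuming a binary water raster (1 = water, 0 = land) and a sites data frame with lat/long columns; the file name, column names, and projection handling are all placeholders, not decisions:

library(sf)
library(terra)

water <- rast("water_mask.tif")                        # hypothetical 0/1 water raster
site_pts <- st_as_sf(sites, coords = c("long", "lat"), crs = 4326)
site_pts <- st_transform(site_pts, crs(water))         # put the points in the raster's CRS
buffers <- st_buffer(site_pts, dist = 40000)           # 40 km buffer (assumes a metric CRS)

# the mean of a 0/1 raster over each buffer is the proportion of water
sites$prop_water <- terra::extract(water, vect(buffers), fun = mean, na.rm = TRUE)[, 2]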

train-test split

What did we decide about the train-test split? Was it the last 5 years that were reserved for evaluation? Sorry if I'm just missing it---I don't see anything about it in either of the notebooks or in the core functions.

Add data on land cover as predictors

Ideally we'd like fairly finely resolved land cover data going back to the 1970s.

The National Land Cover Database is one option, but its big limitation is that there are only 4 time points: 1992, 2001, 2006, 2011. If the algorithms are available we could think about processing the full time-series (though this could be a pretty massive computational task). We could also interpolate between the neighboring points. This has some risks, but I suspect that land cover change is gradual in most cases. It still leaves us without data prior to 1992 and post 2011. NLCD also only covers the United States.
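For the interpolation option, a minimal sketch with stats::approx() for one site; the land cover values here are made up, and rule = 2 just holds the 1992 and 2011 values constant outside that range rather than truly extrapolating:

nlcd_years <- c(1992, 2001, 2006, 2011)
lc_value   <- c(0.42, 0.40, 0.38, 0.35)   # e.g. proportion forest within a buffer (made-up values)
all_years  <- 1982:2013
lc_interp  <- approx(nlcd_years, lc_value, xout = all_years, rule = 2)$y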

Other sources of land cover data:

Predict to 2050

Pure time series:

  • auto-arima
  • naive

Pure environment

  • mistnet
  • rf
  • gbm

Other:

  • average
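For the pure time-series entries above, a minimal sketch using the forecast package on a single site's richness series; site_richness and first_year are placeholders:

library(forecast)

rich_ts <- ts(site_richness, start = first_year)    # one site's annual richness values
h <- 2050 - end(rich_ts)[1]                          # horizon out to 2050

arima_fc <- forecast(auto.arima(rich_ts), h = h)     # auto-arima
naive_fc <- naive(rich_ts, h = h)                    # naive: carry the last observed value forward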

filter_ts with shorter training time

Right now, we're filtering the sites so that we have at least 25 observations between 1982 and 2013. Does that need to change if we're only doing 10 years of training? This could give us some sites with only 4 years of training data.

Maybe the criterion should be more like "visited during at least 70% of the training years"? Then we could use the same criterion for both.

As far as I can tell, there's no downside to including a site that doesn't get visited much in the test set, right?
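A sketch of the proposed criterion, assuming a bbs_data data frame with site_id and year columns (names are placeholders):

library(dplyr)

training_years <- 1982:1991   # e.g. a 10-year training window

keep_sites <- bbs_data %>%
  filter(year %in% training_years) %>%
  group_by(site_id) %>%
  dplyr::summarise(n_years = n_distinct(year)) %>%
  ungroup() %>%
  filter(n_years >= 0.7 * length(training_years)) %>%
  pull(site_id)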

Predict with 10 years' training data

I'm thinking about adding a longer-term forecast analysis to push the longest forecast out to 20+ years so that we get a better idea of whether the spatial and time-series approaches really end up converging. I'm envisioning leaving the bulk of the analysis as is, but adding one additional analysis where we train on the first 5-10 years and forecast on the last 20-25.

Make graphs that compare predictions of future state to predictions of changes

This is potentially a nice way to communicate what the benchmark comparisons tell us.

I'm basically envisioning an observed-predicted plot of forecast vs. observed richness side-by-side with an observed-predicted plot of forecast change in richness vs. observed change in richness. The first will look good, demonstrating that if we just need to know what richness will look like in 20 years, we can do that; the second will look bad, showing that the more sophisticated models don't really buy us much over just assuming that things don't change.

This ties back to the results in Rapacciuolo 2012, which shows that SDMs work well for forecasting species locations because they are good at predicting the areas that don't change.
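A rough sketch of the two panels with ggplot2, assuming a preds data frame holding, for each site, the observed and predicted richness at the end of the forecast window plus the richness at the start (all column names are hypothetical):

library(ggplot2)

p_state <- ggplot(preds, aes(x = richness_pred, y = richness_obs)) +
  geom_point() +
  geom_abline() +
  labs(x = "Predicted richness", y = "Observed richness")

p_change <- ggplot(preds, aes(x = richness_pred - richness_start, y = richness_obs - richness_start)) +
  geom_point() +
  geom_abline() +
  labs(x = "Predicted change in richness", y = "Observed change in richness")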

Add summed SDM models to analysis

@davharris started some work on this in #2, but it is waiting on some fixes.

@sdtaylor is doing a lot of this modeling for population level work already, so we could also integrate that code at some point if it's sufficiently modularized to let us use it easily.

Graphs and tables in paper

Tables

  • List of models and their attributes. Example:
Model   Site effect   Env. vars   Species-specific
TS      No            No          No
JSDM    No            Yes         Yes
SSDM    No            Yes         Yes
GBM     No            Yes         Yes

Main Results figures

  • error over time for both RMSE & deviance (A/B plot)
  • total error over the forecast period for both RMSE & deviance (A/B plot)
  • Model coverage / Calibration
  • Comparing using high resolution environmental data to no env. data or low resolution data. (Potentially a table)
  • Conceptual image of different model inputs, with an example forecast of an individual site.

Use retriever to download PRISM data

get_prism_data currently uses the prism R package. Since we'll be combining data from lots of different sources, let's keep the data acquisition step straightforward by using the retriever for everything.

I don't have strong opinions on using prism vs. Postgres for merging the data with BBS at the moment, but that may change as the number of different sources for predictors increases.

Add post-2013 NDVI data

The GIMMS NDVI data we are using at the moment is only available through 2013. We need to fill this in moving forward, probably using MODIS or LTDR.

From #7:

LTDR appears to provide daily NDVI data back to 1981 using a combination of AVHRR and MODIS data. The data are available via FTP as one HDF file per day (25-150 MB/file depending on the satellite) and are split up by satellite (with different satellites covering different time periods). The NDVI data are in the AVH13C1 files. Description of the NDVI product. Unfortunately these are daily data, which would force us to do the compositing ourselves.

pre-1982 predictors?

It looks like our NDVI and climate/weather data sets both start in 1982. I was wondering what our options are for going back before 1982 (especially for the bioclim variables). Do we have that written down somewhere?

Move all database work to either SQLite or PostgreSQL

We currently have one (and maybe soon 2) calls to a postgres database. Since we're already using SQLite to store and pass around environmental data we should go ahead and add the BBS data to that database as well and extract it from there.

oddball species

How do we want to handle oddball "species" (e.g. unidentified species, hybrid species & subspecies)? They're usually pretty identifiable in the data using regexes, but it's not obvious to me how best to handle them.

My proposal:

  • For subspecies, it might be best to lump them in with the rest of their species.
  • For hybrids and unidentified species, we might have to drop them.
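A sketch of flagging these groups with regexes on the species name column (the column name and exact patterns are assumptions based on the BBS-style names quoted elsewhere in these issues):

library(dplyr)
library(stringr)

species_flagged <- species_table %>%
  mutate(
    unidentified = str_detect(english_common_name, fixed("unid.")),
    hybrid       = str_detect(english_common_name, regex("hybrid", ignore_case = TRUE))
  )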

On a related note, did we make a decision somewhere about whether to remove super-rare species from the data set?

Get data on climate forecasts

To start making actual forecasts using models that involve exogenous variables, we will need forecasts for those variables. This issue is a place for us to start looking into options for doing this.

Get longer-term NDVI data

Either landsat or AVHRR.

Initial work on this is started in setup_ndvi_data.R and the README in ./data/ndvidata/. This uses NDVI data from AVHRR acquired from EarthExplorer. These data come pre-composited, so they are relatively plug and play. Data go back to 1989 and are available as GeoTIFFs.

LTDR appears to provide daily NDVI data back to 1981 using a combination of AVHRR and MODIS data. The data are available via FTP as one HDF file per day (25-150 MB/file depending on the satellite) and are split up by satellite (with different satellites covering different time periods). The NDVI data are in the AVH13C1 files. Description of the NDVI product. Unfortunately these are daily data, which would force us to do the compositing ourselves.
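If we do end up compositing ourselves, a sketch of a simple maximum-value composite over 16-day windows, assuming the daily NDVI values have already been extracted to a data frame (column names are placeholders):

library(dplyr)
library(lubridate)

composited <- daily_ndvi %>%
  mutate(period = (yday(date) - 1) %/% 16) %>%           # 16-day windows within each year
  group_by(site_id, year, period) %>%
  dplyr::summarise(ndvi = max(ndvi, na.rm = TRUE)) %>%    # keep the largest value per window
  ungroup()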

add config file

Add a config.R file for common variables, e.g. the years of analysis and the path to the SQLite database file.
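A sketch of what config.R could contain; the values are examples pulled from the current defaults mentioned in other issues, and the path is hypothetical:

# config.R -- shared settings for the analysis
start_yr       <- 1982
end_yr         <- 2013
min_num_yrs    <- 25                                  # minimum observations per site
sqlite_db_path <- "./data/bbs_forecasting.sqlite"     # hypothetical path to the SQLite database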

Incorporate observer effects

I'd feel better if I knew what was going on with big richness jumps like the one plotted below (site_id 27057).

[figure: richness time series for site_id 27057]

Is that a real change in richness, or did a new observer just start counting the birds in 1993? These big jumps have a big impact on the tails of the random walk, so I think we'll want to see if we can understand them better.

My RPostgres isn't behaving at the moment, so I can't easily access the observer data right now.
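Once the observer data are accessible, one possible way to incorporate observer effects would be a random intercept per observer; this is just a sketch of that idea (lme4 is my choice here, and the column names are placeholders):

library(lme4)

obs_model <- lmer(richness ~ year + (1 | observer_id) + (1 | site_id), data = richness_obs_data)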

filter_ts() not actually filtering

If I call get_richness_ts_env_data(start_yr, end_yr, min_num_yrs) %>% na.omit(), there are sites within it that have fewer than `min_num_yrs` years. It looks like this is due to the ordering of operations.

This is from forecast-bbs-core.R line 186. complete(site_id, year) fills in all possible years for all sites, so filter_ts() a few lines down thinks that every site has plenty of years in its time series. If I take out complete(site_id, year) it seems to work, but I'm not sure what that might break elsewhere.

get_richness_ts_env_data <- function(start_yr, end_yr, min_num_yrs){
  bbs_data <- get_bbs_data(start_yr, end_yr, min_num_yrs)
  richness_data <- bbs_data %>%
    group_by(site_id, year, lat, long) %>%
    dplyr::summarise(richness = n_distinct(species_id)) %>%
    ungroup() %>%
    complete(site_id, year)                      # fills in every year for every site
  richness_ts_env_data <- richness_data %>%
    add_env_data() %>%
    filter_ts(start_yr, end_yr, min_num_yrs)     # so every site now looks like it has a full time series
}
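One possible fix, sketched below: apply filter_ts() before complete() so the year counts only include years that were actually observed, then fill in the missing site-by-year combinations afterwards. I haven't checked whether filter_ts() or add_env_data() assume the completed structure, so this is only a starting point.

get_richness_ts_env_data <- function(start_yr, end_yr, min_num_yrs){
  bbs_data <- get_bbs_data(start_yr, end_yr, min_num_yrs)
  richness_data <- bbs_data %>%
    group_by(site_id, year, lat, long) %>%
    dplyr::summarise(richness = n_distinct(species_id)) %>%
    ungroup()
  richness_data %>%
    filter_ts(start_yr, end_yr, min_num_yrs) %>%   # count only observed years
    complete(site_id, year) %>%                    # then fill in missing years for the kept sites
    add_env_data()
}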

Make figures and tables

At our last meetings we planned out the following figures and tables for the ms:

  • Table 1: Model methods
  • Figure 1: Example time-series w/forecasts
  • Figure 2: Observed-predicted plot for 3 time lags
  • Figure 3: Time lag results figure (Main results)
  • Figure 4: Violins of fit at 3 time lags
  • Figure 5: Observation model figure

The sketches for each of these figures are below. If you are already working on, or want to work on, a figure just leave a note in the comments (and maybe ping the Slack channel) to avoid duplicate work.

[photos: whiteboard sketches of the planned figures from the 2017-05-11 meeting]

New sketch for the observation model figure from 2017-05-18 meeting:

[photo: observation model figure sketch from the 2017-05-18 meeting]

inconsistent use of `species_id`

In the retriever, species_id refers to the column before AOU. In this project, we use species_id as a synonym for AOU.

What's the best way to avoid inconsistency when we try to do things like join the species table to the abundance table? We could change get_species_data with one line of code, but we'd still have this inconsistency between our database and our R code. Or we could change all instances of species_id to match the database, but that seems like a pain.
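For the first option, a sketch of what the one-line change in get_species_data might look like (I haven't checked the actual column names in the database, so aou and species_id here are assumptions):

# hypothetical: drop the database's own species_id column and expose AOU under
# the name the rest of the R code already expects
species <- species %>%
  dplyr::select(-species_id) %>%
  dplyr::rename(species_id = aou)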

test filter_species with combine_subspecies

Currently, filter_species throws out anything with unid. in the name, on the theory that it can't be identified to species. But the following three species (and possibly some others?) are identified to species, just not to subspecies. I think that means we're throwing out observations for them incorrectly.

1    (unid. Red/Yellow Shafted) Northern Flicker
2                   (unid. race) Dark-eyed Junco
3 (unid. Myrtle/Audubon's) Yellow-rumped Warbler
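A sketch of the kind of test this issue asks for, assuming combine_subspecies() maps these records onto their parent species before filter_species() runs (the input data frame and column name are placeholders):

library(testthat)
library(dplyr)

test_that("subspecies-level 'unid.' records survive filtering once subspecies are combined", {
  cleaned <- raw_species_data %>%
    combine_subspecies() %>%
    filter_species()
  expect_true(all(c("Northern Flicker", "Dark-eyed Junco", "Yellow-rumped Warbler") %in%
                    cleaned$english_common_name))
})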
