srivastavalab / bwgtools

This is an R package that validates data and reproduces all analyses for the bromeliad working group rainfall experiment.

License: Other

R 100.00%

bwgtools's Issues

Argentina: incorrect values in Oxygen columns

as of this writing, the offending values are in cells AC15 and AD15 of the bromeliad.physical tab.
The values of the cells are "20,2/10,4" and "7,2/3,9" respectively
they should actually be a number (50, 20.3, etc)
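A check along these lines could flag such cells before they reach any analysis. This is only a sketch: `find_non_numeric()` is a hypothetical helper, not an existing bwgtools function, and the example values other than the two quoted above are made up.

```r
# Hypothetical helper: return the entries of a column that are present
# but cannot be parsed as numbers.
find_non_numeric <- function(x) {
  x <- as.character(x)
  parsed <- suppressWarnings(as.numeric(x))
  x[!is.na(x) & is.na(parsed)]
}

# Two valid entries, one empty cell, and the two offending values
oxygen <- c("50", "20.3", "20,2/10,4", NA, "7,2/3,9")
bad_oxygen <- find_non_numeric(oxygen)
bad_oxygen
```

Empty cells are deliberately not reported, so that only malformed entries show up.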

fix the names from bwg_names

the names of the data were changed by Srivastavalab/bwg_names@e73ac3396540b8a7f2c4de6ca841d11457ac9ef2

need to update the package for dealing with these new names!

explain hydrological measurements

especially driedout and overflow, i.e. the new variables from Paraty

Calculate water summary statistics -- schedules

We would like to take the 60 day schedule and condense it into a few numbers:

gi

Duplicate columns in Cardoso

There are two columns in the Cardoso bromeliad.inverts.final sheet that are duplicated:

"Anopheles sp. - early instars" & "Leptagrion sp. - early instars"

These are exactly the same as others with the same name immediately to the left, except they lack biomass measurements

correct filtering of long invert data

long_out %>% dplyr::filter(abundance != 0 | biomass != 0) is what it should be.

I had thought that abundance could be recorded without biomass, but the opposite was impossible. Such hubris! that's not true at all 😳
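A small base R illustration of the corrected condition, using made-up rows. The column names `abundance` and `biomass` come from the filter above; `species` and the values are assumptions for the example. `which()` is used so that rows where the condition is `NA` are dropped, as `dplyr::filter()` would do.

```r
# Toy long-format invertebrate data (values are invented)
long_out <- data.frame(
  species   = c("sp_a", "sp_b", "sp_c", "sp_d"),
  abundance = c(3, 0, 0, 2),
  biomass   = c(0.5, 0, 1.2, 0),
  stringsAsFactors = FALSE
)

# Keep a row if EITHER measurement is non-zero
keep <- long_out$abundance != 0 | long_out$biomass != 0
filtered <- long_out[which(keep), ]
filtered$species
```

Only `sp_b`, where both values are zero, is removed; `sp_c` (biomass without abundance) is kept.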

read_site_sheet() vs. combine_tab()

Arg <- read_site_sheet("Argentina","leaf.waterdepths")
Arg1 <- combine_tab("Argentina","leaf.waterdepths")

firstday(Arg) => it works fine!
firstday(Arg1) => it doesn't work! when you call for combine_tab(): "bromeliad.id" column is not found in the sheetname=leaf.waterdepths

The question is: why do they look for different columns even though it's the same sheetname?
This is the same issue with the function from_start()!

Incorporating a few of the Paraty fixes

We made a lot of progress at Paraty, it would be nice sometime to have these incorporated into the BWGtools pkg as we now have snippets of code on some papers' R scripts and not others. Only when you have time @aammd !

(1) make site a factor within BWGtools (I need to do it manually outside)
(2) make this summarize code below part of the hydro function, not having it in the function threw me for a loop by creating duplication of rownames (as two leaves per bromeliad) in the final fulldata:
mean_hydro <- hydro %>%
ungroup %>%
select(c(2,7:17, long_dry, long_wet, n_driedout, n_overflow)) %>%
group_by(site_brom.id) %>%
summarise_each(funs(median(., na.rm = TRUE))) %>%
replace_na(list(long_dry = 0, long_wet = 0, n_driedout = 0, n_overflow = 0))

when_last <- hydro %>%
ungroup %>%
select(site_brom.id, last_dry, last_wet) %>%
group_by(site_brom.id) %>%
summarise_each(funs(min(., na.rm = TRUE))) %>%
replace_na(list(last_dry = 65, last_wet = 65))

brom_hydro <- mean_hydro %>% left_join(when_last)

(3) adding this code to summarize the ibutton data, which I think is not yet part of BWGtools:
ibuttons <- combine_tab(sheetname = "bromeliad.ibuttons")
ibutton_data <- ibuttons %>%
group_by(site, site_brom.id) %>%
summarise(mean_max = mean(max.temp, na.rm = TRUE), mean_min = mean(min.temp, na.rm = TRUE),
mean_mean = mean(mean.temp, na.rm = TRUE), sd_max = sd(max.temp, na.rm = TRUE),
sd_min = sd(min.temp, na.rm = TRUE), sd_mean = sd(mean.temp, na.rm = TRUE),
cv_max = 100*(sd_max/mean_max), cv_min = 100*(sd_min/mean_min),
cv_mean = 100*(sd_mean/mean_mean)) %>%
ungroup %>%
gather(variable, observed, 3:11) %>%
replace_na(list(observed = "NA")) %>%
select(-site) %>%
spread(variable, observed, fill = 0) %>%
rename(max_temp = mean_max, min_temp = mean_min, mean_temp = mean_mean,
sd_max_temp = sd_max, sd_min_temp = sd_min, sd_mean_temp = sd_mean,
cv_max_temp = cv_max, cv_min_temp = cv_min, cv_mean_temp = cv_mean)

Improve the documentation

Right now, bwgtools is a tool that only Andrew can use (in its entirety), when actually it is a tool for everyone! The package should be accompanied by clear documentation. The first priority is clear communication with all members of the BWG research group. The second is communication with reviewers, editors and readers of the manuscripts we produce.

I would like this issue to be a place to hold all the discussion we have on this topic. We can open separate ones for specific tasks once they arise.

I really encourage everyone who is using this package to consider contributing to documentation! I will help settle any doubts you might have about exactly how to do it (e.g. if you want to edit the documentation of a function).

here are some things I think could be better:

Function names -- many functions are named different things, even though they are related. For example, functions which download data are called read_sheet, read_site_sheet, get_all_sites and combine_tab. Should they all be renamed to download_x, where x is something specific?
Function documentation -- there is minimal documentation, but lots could be improved there. Examples would be great, plus an expanded "details" section
README -- is OK, but could have more developed sections.
- Accessing data
- Combining data
- Calculations
- Analyses (when we have those)
- Plots
Translations -- Should we have a French/Spanish/Portuguese translation of the README? Of any part of this? Would anyone like that?

Ideas for better offline use

read_site_sheet needs better error catching if the path is wrong. Right now, if you give it the wrong path it tries to go to Dropbox anyway. This gives an unhelpful error and wastes time.

So let it check not whether the file exists, but whether the argument is a certain length.

If it is long, it must be a path, so look for the file.

If the file is absent, exit with a warning.
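The logic above could be sketched roughly as follows. `resolve_source()` and the 20-character threshold are hypothetical and not part of bwgtools. The sketch stops with an error rather than a warning, so that no Dropbox request is attempted after a bad path.

```r
# Hypothetical helper: decide where to read from, based on string length.
# Short strings are treated as site names, long ones as local file paths.
resolve_source <- function(x, min_path_chars = 20) {
  if (nchar(x) < min_path_chars) {
    return("dropbox")
  }
  if (!file.exists(x)) {
    stop("'", x, "' looks like a file path, but no file exists there.",
         call. = FALSE)
  }
  "local"
}

resolve_source("Argentina")
```

A length threshold is a crude heuristic; it is shown here only because it is what the issue proposes.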

function to run hydro calculations for all sites

This function should have an argument which permits users to specify whether

the centre leaf is included (default = FALSE)
the calculations are performed on leaves first (default) or on bromeliad.averages

data validation for each tab

check column names
are data values out of range -- too large or too many very small
are any columns filled with NAs
is the site variable missing any values
does the site value contain all the same word
are any observational rows filled with NAs
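Most of the checks in this list could be collected into one function that returns a report. The sketch below is an assumption about how that might look: `validate_tab()` and the example column names are hypothetical, and the out-of-range check is left out because sensible ranges depend on the tab.

```r
# Hypothetical validator: returns a named list describing problems in one tab.
validate_tab <- function(dat, expected_names) {
  has_site <- "site" %in% names(dat)
  list(
    # expected columns that are absent
    missing_columns = setdiff(expected_names, names(dat)),
    # columns that contain nothing but NA
    all_na_columns  = names(dat)[vapply(dat, function(col) all(is.na(col)),
                                        logical(1))],
    # row numbers in which every cell is NA
    all_na_rows     = unname(which(rowSums(!is.na(dat)) == 0)),
    # is the site variable missing any values?
    site_has_na     = has_site && anyNA(dat$site),
    # do all non-missing site values contain the same word?
    site_is_unique  = has_site && length(unique(na.omit(dat$site))) == 1
  )
}

dat <- data.frame(
  site         = c("argentina", "argentina", NA),
  bromeliad.id = c("a1", "a2", NA),
  max.water    = c(NA, NA, NA),
  stringsAsFactors = FALSE
)
report <- validate_tab(dat, c("site", "bromeliad.id", "max.water", "oxygen"))
report
```

Returning a list rather than stopping at the first problem lets a whole tab be reviewed in one pass.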

using `match.fun()` to find functions with the same name as the sheet being read

Is it dangerous or risky to use match.fun() to find a package's own functions? small example of code in question here

Background: the package has a specialized purpose: reading data in from Excel files which are stored in Dropbox. The backstory is that we have results from many international replicates of an experiment. The package uses Karthik Ram's rdrop2 and Hadley Wickham's readxl packages. This way scientists on the project can get their data directly into R from the Excel files in their Dropbox, without ever actually opening Dropbox.

All the excel files are made from the same template, so they should be in a standardized form. Therefore a function that works for one will work for all.

Within each excel file there are multiple tabs. each tab is different, and needs to be read in with specific arguments.

the problem: how can I allow users to simply say which tab (aka "sheet") they want, without having to set default arguments manually every time?

Andrew's solution: create a function with the exact same name as the tab in question (here leaf.waterdepths()). the function reads in the tab of the same name with all the correct arguments. When users ask for a sheet, my function read_sheet() uses match.fun() to find the correct reading function.
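A minimal sketch of the pattern, next to one possible alternative. The reader below is a stand-in that only returns a string (the real one would call readxl with tab-specific arguments), and `read_sheet_matchfun()`, `read_sheet_lookup()` and `sheet_readers` are hypothetical names. One thing to be aware of: given a character string, `match.fun()` looks the name up starting from the environment of the caller's caller, so a function of the same name defined by the user could in principle be found first.

```r
# Stand-in for a tab-specific reading function
leaf.waterdepths <- function(path) {
  paste("reading leaf.waterdepths from", path)
}

# The approach described above: find the reader by the sheet's name
read_sheet_matchfun <- function(path, sheet) {
  reader <- match.fun(sheet)
  reader(path)
}

# Alternative sketch: an explicit registry of readers, with a clear error
sheet_readers <- list(leaf.waterdepths = leaf.waterdepths)

read_sheet_lookup <- function(path, sheet) {
  reader <- sheet_readers[[sheet]]
  if (is.null(reader)) {
    stop("No reading function registered for sheet '", sheet, "'.",
         call. = FALSE)
  }
  reader(path)
}

read_sheet_lookup("Argentina.xlsx", "leaf.waterdepths")
```

The registry keeps the lookup inside the package and gives an informative error for unknown sheets, at the cost of having to list each reader once.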

`site.info` in Colombia has multiple rows

unclear how these should be interpreted. Should they have been averaged into a single value?

from_start()

This function does not work properly for French Guiana, as the experiment lasted from 2012 to 2013. So the nday column gives weird values in 2013!
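How from_start() computes nday is not shown here, so the following is only a guess at the cause. If the day count were derived from the day of the year, it would restart in January; subtracting the first date instead keeps counting across the year boundary. The dates are made up.

```r
# Measurement dates spanning a year boundary (invented for illustration)
dates <- as.Date(c("2012-12-30", "2012-12-31", "2013-01-01", "2013-01-02"))

# Day-of-year restarts on 1 January
day_of_year <- as.integer(format(dates, "%j"))

# Days since the first measurement, with the first day counted as day 1
nday <- as.integer(dates - min(dates)) + 1L

day_of_year  # 365 366 1 2
nday         # 1 2 3 4
```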

Upload hydrological stability scripts

save and store the existing analysis for hydrological stability
adapt script for use with this package?

Script for importing bromeliad.final.inverts in tidy format, merging with species info (Distribution organisms)

This script is currently written for accessing the invert and distribution organism files from offline locations, but can be modified to access these from BWG Dropbox locations as soon as we have bwgtools set up to read these files directly. Diane

invert.final<-read.table("Drought_data_PuertoRico_bromeliad.final.inverts.csv", sep=",", header=TRUE)
head(invert.final)
library(tidyr)
library(dplyr)
library(magrittr)
invert.long<-invert.final %>% gather(species, quantity,Diptera.292:Ostracoda.7)%>%
spread(abundance.or.biomass, quantity)%>%
separate(trt.name, c("mu", "k"), "k")%>%
mutate(mu = extract_numeric(mu))
head(invert.long)

input error to fix with names in Distributions_organisms_full file, perhaps the apostrophe in line 93?

dist.org<-read.table("Distributions_organisms_full_nonames.csv", sep=",", na.strings="",header=TRUE)
head(dist.org); tail(dist.org)
invert.full <- merge(invert.long, dist.org, by.x = "species", by.y = "nickname")

Line endings

May I push a .gitattributes file to keep linux line endings?

Colombia biomass measurements

we need at least some blank cells; having no biomass rows changes the shape of the data later

add the latest trait data

the forupload file contains our most recent bwgdb insect trait data. These should be combined with this package. add these data to the package, and get get_bwg_tools

function to read in terrestrial data

sheet name is bromeliad.terrestrial
data to be used by @DEZERALDOlivier

Duplicate insect in Costa Rica

Diptera.192 is mentioned twice in bromeliad.inverts.final

Create RLQ matrices for Regis

Regis requires three matrices for his analysis:

a species x traits matrix (fuzzy coding) = matrix Q
a species x bromeliad matrix (abundance data) = matrix L
a bromeliad x environmental variables matrix (plant-specific data, including physical, hydrological, ..) = matrix R

The rows of Q and L must be identical, and the columns of L and the rows of R must be identical

Perhaps the easiest thing is to write a function that does this for only one site, then give it all sites.

start with trait data merged on abundance data, and the physical data too.
filter and spread trait data
separate into matrices.
filter and arrange the physical data.
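For a single site, the alignment requirement could be sketched in base R as below. All three input tables are invented, and the trait and environmental column names (`BS1`, `BS2`, `max.water`) are placeholders rather than the real bwgtools names.

```r
# Toy inputs for one site (all values invented)
abund <- data.frame(
  species      = c("sp1", "sp2", "sp1"),
  bromeliad.id = c("b1", "b1", "b2"),
  abundance    = c(4, 1, 2),
  stringsAsFactors = FALSE
)
traits <- data.frame(species = c("sp2", "sp1"), BS1 = c(0, 1), BS2 = c(1, 0),
                     stringsAsFactors = FALSE)
env <- data.frame(bromeliad.id = c("b2", "b1"), max.water = c(250, 100),
                  stringsAsFactors = FALSE)

# L: species x bromeliad abundances; absent combinations become 0
L <- tapply(abund$abundance, list(abund$species, abund$bromeliad.id), sum)
L[is.na(L)] <- 0

# Q: species x traits, rows reordered to match the rows of L
Q <- as.matrix(traits[, -1, drop = FALSE])
rownames(Q) <- traits$species
Q <- Q[rownames(L), , drop = FALSE]

# R: bromeliad x environment, rows reordered to match the columns of L
R <- as.matrix(env[, -1, drop = FALSE])
rownames(R) <- env$bromeliad.id
R <- R[colnames(L), , drop = FALSE]

# The two identities required above
stopifnot(identical(rownames(Q), rownames(L)),
          identical(colnames(L), rownames(R)))
```

Indexing Q and R by the dimnames of L enforces the ordering, and fails with an error if a species or bromeliad is missing from the trait or environmental table.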

Calculate water summary statistics -- water depths

this involves calculations performed with observed measurements of bromeliad water depth

check for duplicate bromeliads

We need a test for duplicate bromeliads. What if somebody has inadvertently typed in the same code? We need to find this before it causes problems.
Which sheets need to be checked?
Are bromeliad.id s always unique within each of those sheets, or are there some that have duplicate values by design (ie repeated rows)?
Should we actually be checking for consistency with treatment?
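For sheets where each bromeliad should appear exactly once, a test could be as simple as the sketch below. `find_duplicate_ids()` is a hypothetical helper and the ids are invented; which sheets this applies to is exactly the open question above.

```r
# Hypothetical helper: return each id that occurs more than once in a sheet
find_duplicate_ids <- function(dat, id_col = "bromeliad.id") {
  ids <- dat[[id_col]]
  unique(ids[duplicated(ids)])
}

sheet <- data.frame(bromeliad.id = c("B1", "B2", "B2", "B3", "B1"),
                    stringsAsFactors = FALSE)
dups <- find_duplicate_ids(sheet)
dups
```

Sheets that repeat a bromeliad by design (for example one row per leaf) would need the check applied to the combination of identifying columns instead.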

format errors in bromeliad.final.inverts tab:

Argentina columns 19-42 contain a single row of blank cells
PuertoRico 18 to 42 are blank NA columns
CostaRica has old names in columns

NAs in water measurements

As reported by @nacmarino , @dsrivast :

both the cv.depth as well as the wetness for argentina_15 are missing. This shouldn't be so, as there is a mean and a maximum value on the spreadsheet.

And time.since.minimum is entirely NA, for all bromeliads at all sites.

argentina_15 is not NA but NaN, suggesting that an invalid arithmetic operation occurs somewhere in the calculation.
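Because `is.na()` is also TRUE for NaN, the two kinds of missing value are easy to confuse. A small helper can count them separately; `count_missing()` is hypothetical and the values are invented.

```r
# Hypothetical helper: count NaN and "true" NA separately
count_missing <- function(x) {
  c(n_NaN = sum(is.nan(x)),
    n_NA  = sum(is.na(x) & !is.nan(x)))
}

# 0 / 0 is one operation that yields NaN rather than NA
cv.depth <- c(12.5, NA, 0 / 0, 30.1)
counts <- count_missing(cv.depth)
counts
```

A non-zero NaN count points at a calculation problem (such as a division of zero by zero), whereas plain NA usually means the input itself was missing.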

Argentina duplicate species name

in bromeliad.inverts.final