jonesor / rcompadre

R tools for obtaining and manipulating data from the COMPADRE and COMADRE Plant and Animal Matrix Databases

Home Page: https://jonesor.github.io/Rcompadre/

R 100.00%

rcompadre's People

Contributors

Stargazers

Watchers

Forkers

kkougiou iainmstott patrickbarks gesaroemer ecosantos chelseacthomas levisc8 farcego datosforall wpetry darrennorris shafieisabets

rcompadre's Issues

Handling NAs in identifyReproStages

Add functionality to provide the user different options of dealing with values of NA within matF. Currently, the function assumes NA indicates positive fecundity.
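
A rough sketch of what such an option could look like. The function name, the na_handling argument and its three values are all hypothetical (none of them is confirmed by this issue), and a stage is treated as reproductive when its matF column has any positive entry.

```r
# Hypothetical helper: flag reproductive stages (columns of matF with any
# positive fecundity), letting the caller choose how NA entries are treated.
repro_stages_na <- function(matF,
                            na_handling = c("return.true", "return.na", "return.false")) {
  na_handling <- match.arg(na_handling)
  apply(matF, 2, function(col) {
    if (any(col > 0, na.rm = TRUE)) return(TRUE)  # a known positive entry settles it
    if (!anyNA(col)) return(FALSE)                # all entries known and zero
    # only zeros and NAs remain: defer to the user's choice
    switch(na_handling,
           return.true  = TRUE,   # current behaviour: NA means positive fecundity
           return.na    = NA,
           return.false = FALSE)
  })
}

matF <- rbind(c(0, NA, 2),
              c(0, 0,  0),
              c(0, 0,  0))
repro_stages_na(matF)                  # NA treated as positive fecundity
repro_stages_na(matF, "return.false")  # NA treated as zero fecundity
```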

Create "meanF" function

Function to produce mean fecundity matrices from individual matrices.
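
A minimal sketch of the core calculation, assuming the input is a plain list of equally sized matF matrices. The name mean_matF is hypothetical, and grouping by population or species is left out.

```r
# Hypothetical helper: element-wise mean of a list of matF matrices.
mean_matF <- function(matF_list) {
  dims <- vapply(matF_list, dim, integer(2))  # 2 x n matrix of dimensions
  stopifnot(all(dims == dims[, 1]))           # all matrices must share one dimension
  Reduce(`+`, matF_list) / length(matF_list)
}

F1 <- matrix(c(0, 0, 2, 0), nrow = 2)
F2 <- matrix(c(0, 0, 4, 0), nrow = 2)
mean_matF(list(F1, F2))
```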

Create cleanDB function

Function to remove matrices that do not pass a series of (independent) tests.
e.g. ergodicity, primitivity, reducibility, splitness, etc.

Rather than the nested structure of the current mat slot (or "column", à la #27), I propose having separate list-columns for matA, matU, matF, matC, MatrixClassOrganized, and MatrixClassAuthor. The reason is that most MPM functions (e.g. in popdemo, popbio) act on a matrix, so we should make it easier for users to access matrices, and particularly, to vectorize over a set of matrices.

Vectorizing

Currently, vectorizing with popdemo or popbio functions requires writing a custom function to access the slots within mat, e.g.

ergodic <- sapply(db@mat, function(x) popdemo::isErgodic(x@matA))
lambda <- sapply(db@mat, function(x) popbio::lambda(x@matA))

With a flat structure a user could vectorize over matA without having to write a custom function, e.g.

ergodic <- sapply(db$matA, popdemo::isErgodic)
lambda <- sapply(db$matA, popbio::lambda)

The situation is more nuanced when it comes to Rage, because Rage functions could eventually work with CompadreM objects, in which case a user could vectorize over db@mat without writing custom functions. However, for the Rage functions that take a single matrix (i.e. kEntropy, qsdConverge, reprodStages, identifyReprodStages, splitMatrix), vectorizing with the flat version would be just as easy, e.g.

# CompadreM version
k_ent <- sapply(db@mat, Rage::kEntropy)

# flat version
k_ent <- sapply(db$matU, Rage::kEntropy)

For some Rage functions that take multiple matrix arguments (e.g. matrixElementPerturbation), vectorizing with the CompadreM object would admittedly be nicer, e.g.

# CompadreM version
mat_pert <- lapply(db@mat, Rage::matrixElementPerturbation)

# flat version
mat_pert <- mapply(Rage::matrixElementPerturbation, db$matU, db$matF, SIMPLIFY = FALSE)

For Rage functions that will often be used with additional row-specific arguments (e.g. longevity, rearrangeMatrix, reprodStages), vectorizing with the flat version is just as easy (e.g. calculating lifespan with respect to the first non-propagule stage)

# CompadreM version
start_life <- sapply(db@mat, function(x) min(which(x@matrixClass$MatrixClassOrganized == "active")))
lifespan <- mapply(Rage::longevity, db@mat, startLife = start_life, SIMPLIFY = FALSE)

# flat version
start_life <- sapply(db$MatrixClassOrganized, function(x) min(which(x == "active")))
lifespan <- mapply(Rage::longevity, db$matU, startLife = start_life, SIMPLIFY = FALSE)

Finally, there are a few Rage functions (R0, dEntropy, lifeTimeRepEvents, and makeLifeTable) for which having a CompadreM method would make the function overly difficult to document and understand (in my opinion), because the user may wish to apply them to matF-only, matC-only, or matF and matC (or in the extreme case of makeLifeTable, also matU-only).

As outlined in jonesor/Rage#19, I think we should simplify these functions so that they only take one 'reproductive matrix' argument (e.g. matR), and correspondingly make it easier for users to derive and store matFC (= matF + matC) within db.

Printing

The flat version makes it easier to examine a series of matrices from the same study (e.g. reflecting different years or populations)

# CompadreM version
db@mat[[450]]@matA
db@mat[[451]]@matA
db@mat[[452]]@matA

# flat version
db$matA[450:452]

Though the CompadreM version makes it easier to examine all the matrices for a given row

# CompadreM version
db@mat[[450]]

# flat version
db$matA[[450]]
db$matU[[450]]
db$matF[[450]]
db$matC[[450]]

Matrix validation

I think the matrix-validation function of the CompadreM class could be moved to a new Rcompadre function (e.g. db_validate). I don't see too much benefit of 'built-in' validation of things like matrix dimension, non-negative values, etc. Definitely those things should be validated on the COMPADRE side, but after that I think it's fine to leave things to the user.

update to S4 addressing

Current functions use $ to access elements of compadre database objects. With the new S4 class for these data, the following functions (and possibly others I've missed) will need updating with @ to access S4 slots:

Relates to #1 and #14

Ternary plots throwing odd error when called from Knitr

Related to content of d88789d: the ternaryplot() function in the vignette returns the following error when run from Knitr, but not when called interactively:

Error in unit(x, default.units) : 'x' and 'units' must have length > 0

I haven't found a decent explanation online for why this happens, so I've set the chunk options to eval = FALSE until I can figure out what's going on. At the very least, users can cut and paste that section to produce figures in their own interactive sessions.

Cryptic warning msg if db doesn't have SpeciesAccepted column

The CompadreDB show method prints a cryptic warning if db doesn't contain the column SpeciesAccepted. E.g.

Compadre[,c("mat", "Genus")]
## A COM(P)ADRE database ('CompadreDB') object with 0 SPECIES and 150 MATRICES.
##
## # A tibble: 150 x 2
##   mat               Genus       
##   <list>            <chr>       
## 1 <S4: CompadreMat> Setaria 
## 2 <S4: CompadreMat> Lechea
...
## Warning message:
## Unknown or uninitialised column: 'SpeciesAccepted'.

This is ultimately coming from the function NumberAcceptedSpecies().

Suggest adding a more helpful message to the NumberAcceptedSpecies() function, and modifying the show method to prevent any warning message in this instance. Though I'm not sure exactly what we should do with the header message if db doesn't contain SpeciesAccepted. Maybe just replace the number with a question mark? E.g.

Compadre[,c("mat", "Genus")]
## A COM(P)ADRE database ('CompadreDB') object with ? SPECIES and 150 MATRICES.
##
...
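
One way the header could be built, as a sketch only: the helper names below are hypothetical stand-ins (not the actual NumberAcceptedSpecies() or show method), and a plain data.frame stands in for the data inside a CompadreDB object.

```r
# Hypothetical stand-in for the species count used in the header line:
# returns "?" instead of touching a column that isn't there.
n_species_label <- function(dat) {
  if ("SpeciesAccepted" %in% names(dat)) {
    as.character(length(unique(dat$SpeciesAccepted)))
  } else {
    "?"
  }
}

header_line <- function(dat) {
  sprintf("A COM(P)ADRE database ('CompadreDB') object with %s SPECIES and %d MATRICES.",
          n_species_label(dat), nrow(dat))
}

with_species <- data.frame(SpeciesAccepted = c("sp a", "sp a", "sp b"),
                           stringsAsFactors = FALSE)
no_species <- data.frame(Genus = c("Setaria", "Lechea"), stringsAsFactors = FALSE)
header_line(with_species)
header_line(no_species)
```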

Create function "merge" for compadre.

Merge 2 compadre databases together.

development package name

Suggestion: for the popdemo development package branch I use the package name 'popdemoDev' instead: perhaps this would be an idea for Rcompadre and Rage too. It means you can have both packages installed on your machine at the same time: one which works and one which may be buggy.

Before merging the development branch into master you need to make sure you change it, though. And then change it back once the merge is completed.

This would be a problem if code included any RcompadreDev::function(... but at least in popdemo this isn't the case. Maybe I'm not doing things 'properly' (although it got past the CRAN pearly gates so I guess it should be good enough)

Function to obtain 1 matrix per species at random

Maybe called "sampleMatrices"?
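
A possible starting point, sketched on a plain data.frame rather than a database object. The function name and the species_col argument are hypothetical.

```r
# Hypothetical helper: keep one randomly chosen row per species.
sample_one_per_species <- function(dat, species_col = "SpeciesAccepted") {
  rows_by_species <- split(seq_len(nrow(dat)), dat[[species_col]])
  # sample.int() avoids the sample(x) pitfall when a species has a single row
  keep <- vapply(rows_by_species,
                 function(i) i[sample.int(length(i), 1)],
                 integer(1))
  dat[sort(keep), , drop = FALSE]
}

dat <- data.frame(SpeciesAccepted = c("a", "a", "b", "c", "c", "c"),
                  MatrixID = 1:6,
                  stringsAsFactors = FALSE)
set.seed(1)
out <- sample_one_per_species(dat)
```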

Non-html version of Plot-life-cycle

I will give this a go, otherwise I will attempt to improve usability of the current function.

Vignette on vectorizing with Rcompadre

I think it would be useful to have a vignette devoted solely to vectorizing with Rcompadre. Could have some 'basic vectorizing' material, some material on using accessor functions within vectorized code, and also some 'avoiding pitfalls' type material.

For instance, the fact that accessors can be used both with CompadreDB and CompadreMat makes it easy to write vectorized code that's very inefficient. E.g.

# different ways to calculate stage-specific survival, and time to run in seconds
db <- cdb_fetch("~/COMPADRE_v.X.X.X.RData")
x <- lapply(seq_along(db$mat), function(i) colSums(matU(db)[[i]]))     # 56 s
x <- lapply(seq_along(db$mat), function(i) colSums(matU(db[i,])[[1]])) # 5 s
x <- lapply(db$mat, function(x) colSums(matU(x)))                      # 0.2 s
x <- lapply(matU(db), colSums)                                         # 0.04 s

Create checkSpecies function

Create a function to check whether a particular species is present in the database, and if it is to return that subset of the database.

matrix accessor functions return lists for all CompadreDB objects

I think matA, matU, matF, matC and any of the others in the ClassUnionMethods.R file always return a list, even if the database passed is just one row. Intuitively one might want to use e.g.
matA(compadre[123, ]) or matA(compadre[123]) expecting a matrix, but getting a list of length 1 containing the matrix. We can get it with e.g. matA(compadre$mat[[123]]), but it would be good for the others to work too.

subsetDB not subsetting slot matrixClass

Function subsetDB does not currently subset slot matrixClass. This was my fault... I mistakenly removed the line:

ssdb$matrixClass <- ssdb$matrixClass[subsetID]

Standardize style of documentation

When documenting functions, begin title with verb (e.g. "Calculate life table entropy" rather than "A function that calculates life table entropy").

For descriptions, can use longer form (e.g. "This function calculates...").

Add Travis/Appveyor to `master` branch?

Since we need a stable master branch for workshops (and since the master branch is now stable after a807f7a!), does it make sense to add continuous integration to make sure all pushes/pull requests to master are checked properly? If you're interested, you can read more on what continuous integration does in jonesor/Rage#9 and here.

If so, @jonesor (or someone else with administrative privileges) will need to activate it from the Github page (Settings -> Integrations & Services). From there, you can grant access to Travis-CI and it should redirect you to the Travis homepage so you can activate it from that end as well. Once that account is activated, I can set up the .travis.yml file.

Thoughts?

Add fetchDB function?

I only recently learned that it's possible to download and load COM(P)ADRE directly from R (can't remember where I first saw this, but props to that person):

url_dl <- 'https://www.compadre-db.org/Data/CompadreDownload'
load(url(url_dl))

By loading db in a separate environment, a fetchDB function could allow the user to assign db to an object of their chosen name, rather than getting an already-named object with load.

compdb <- fetchDB('compadre')  # from web
compdb <- fetchDB('my_path/COMPADRE_v.X.X.X.RData')  # from file
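
The separate-environment idea could be sketched roughly as follows. The name fetch_db_sketch is hypothetical, the web shortcut ('compadre') is not handled, and the sketch assumes the .RData file contains exactly one object.

```r
# Hypothetical sketch: load an .RData file (or a URL to one) into a private
# environment and return the single object it contains.
fetch_db_sketch <- function(path) {
  env <- new.env()
  src <- if (grepl("^https?://", path)) url(path) else path
  obj_names <- load(src, envir = env)  # load() returns the names it created
  stopifnot(length(obj_names) == 1)
  get(obj_names, envir = env)
}

# Usage with a throwaway file standing in for a real COMPADRE .RData file
compadre <- list(metadata = data.frame(SpeciesAuthor = "sp_A",
                                       stringsAsFactors = FALSE))
tmp <- tempfile(fileext = ".RData")
save(compadre, file = tmp)
rm(compadre)

my_db <- fetch_db_sketch(tmp)  # assigned to a name of the user's choosing
```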

This would also enable piping right from the initial load operation, e.g.

compdb <- fetchDB('compadre') %>% 
  tidyDB() %>% 
  cleanDB() %>% 
  subset(check_ergodic == TRUE)

Adding derived vectors/matrices to a CompadreData object

Related to #25, I think it would be useful if there was a way for a user to add new slots containing either vectors, matrices, or lists to a CompadreData object.

In many demographic analyses users will want to derive vectors or matrices from an MPM (e.g. population vector, stable distribution, sensitivity matrix, collapsed or rearranged matrix), and it would be ideal if these remained part of the CompadreData object to maintain the mapping with the metadata and original MPM. This becomes particularly important if the user wishes to subsequently subset the db based on some derived value(s).

Many of the functions in Rcompadre/Rage return vectors or matrices derived from an MPM (e.g. reprodStages, identifyReprodStages, rearrangeMatrix, splitMatrix, collapseMatrix), and I gather these currently need to be stored separately from the db.

I don't know much about S4 and so don't have a sense of whether this is feasible, but perhaps @tdjames1 and @iainmstott can weigh in.

Add brief overview of tibbles to documentation

Probably in 'Getting Started' vignette and maybe README. Could include link to tibble vignette, perhaps brief summary of differences from data.frame, and tips for changing default setting, e.g.

options(tibble.print_max = Inf) # always print all rows
options(tibble.width = Inf)     # always print all columns

Rename classes

Following discussions with Owen, we felt it may make sense to change the names of the classes. The database is called CompadreData, but all of our functions use the syntax "DB" (subsetDB, cleanDB, compareDBs, etc.). Similarly, we use "mat" to refer to matrices (matA, matU, etc.) but the matrix-focused class is CompadreM. The proposal is to change these: CompadreDB and CompadreMat respectively.

This may confuse us developers initially, and although users may not have to interact so much with the classes directly, it could make life better for anyone who wants to build on the package, specifically the classes, as well as anyone wanting to convert between the old and the new data structures.

Continuity is king

Add dplyr::summarize method?

I didn't initially add a method for summarize alongside the other tidyverse functions because summarize won't ever return a valid CompadreDB object. But now I keep forgetting that I have to use CompadreData() or as_tibble() before calling summarize().

Compadre %>% 
  CompadreData() %>%
  group_by(OrganismType, Continent) %>% 
  summarize(n = n())

In hindsight, I don't think the 'must return a CompadreDB object' requirement was necessary, and a summarize method that returns a regular tibble would be rather convenient. E.g.

Compadre %>% 
  group_by(OrganismType, Continent) %>% 
  summarize(n = n())

Everyone ok with this?

new function to extract metadata

I think it would be useful to have a function that only extracts the metadata (without the mat-column). The as_tibble() method converts the database to a tibble that includes mat. But it would be convenient to have a separate function that also drops mat.
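
A sketch of the idea on a plain data.frame with a mat list-column. The name drop_mat is hypothetical, and a real method would first need to pull the data out of the database object.

```r
# Hypothetical helper: return the metadata only, without the mat list-column.
drop_mat <- function(dat) {
  dat[, setdiff(names(dat), "mat"), drop = FALSE]
}

db_tbl <- data.frame(SpeciesAuthor = c("sp_A", "sp_B"),
                     MatrixDimension = c(2L, 3L),
                     stringsAsFactors = FALSE)
db_tbl$mat <- list(diag(2), diag(3))  # stand-in for the CompadreMat list-column

meta <- drop_mat(db_tbl)
```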

as_cdb bug?

From Haydee:

We noticed that the function as_cdb changes the amount of data stored in the R objects. E.g. when as_cdb is not used to convert the loaded R objects, Clarkia_xantiana_subsp._xantiana exists (with at least one matrix), but if as_cdb is run, the species disappears.

rename "data" generic?

I noted that #32 introduces a data generic to extract the metadata w/ list column from the CompadreData structure. However, data() is also a function from utils used to load packages' internal data sets (e.g. data(mtcars)).

I don't think it would cause any actual problems to have the method named as such, but it may be confusing for users interpreting code that employs it, because it doesn't really do the same thing as the utils function, which is often used in teaching materials (e.g. vignettes, online tutorials, etc.). Maybe calling it extract_data or something similar is safer?

Speed up validCompadreM() checks

... and therefore also asCompadreData() and convertLegacyDB().

The conversion for COMPADRE_v.X.X.X takes 11-12 seconds on my computer, ~70% of which is due to the validCompadreM() checks. The use of data.frame for the dims object is causing most of the overhead. If dims is instead constructed as a matrix, the conversion time on my computer is reduced from ~11.5 to ~4.5 seconds. E.g.

dims <- cbind(matA = dim(object@matA),
              matU = dim(object@matU),
              matF = dim(object@matF),
              matC = dim(object@matC))

I think we can get it down to about 3.5 s with additional optimization.
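
A sketch of how a matrix-based dims object might feed an equal-dimension check. What validCompadreM() actually does with dims isn't shown in this issue, so the comparison below is an assumption, and plain matrices stand in for the slots of a CompadreM object.

```r
# Hypothetical check: do matA, matU, matF and matC all share one dimension?
same_dims <- function(matA, matU, matF, matC) {
  # 2 x 4 integer matrix: one column of (nrow, ncol) per component matrix
  dims <- cbind(matA = dim(matA),
                matU = dim(matU),
                matF = dim(matF),
                matC = dim(matC))
  all(dims == dims[, 1])  # compare every column against the first
}

same_dims(diag(3), diag(3), diag(3), diag(3))
same_dims(diag(3), diag(3), diag(2), diag(3))
```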

Help users avoid pseudoreplication

This is something I still struggle with. For most (but not all) analyses I would like a single 'grand mean matrix' for each MatrixPopulation or SpeciesAuthor.

This gets challenging because for a given MatrixPopulation, there are cases for every possible combination of MatrixComposite (multiple Individual + Mean, multiple Individual no Mean, multiple Mean, multiple Mean and Pooled, etc.). Plus, there are instances where matrix dimension or stage class definitions vary within populations.

Rather than a function that finds and subsets to the highest-level matrix/matrices for a given population or species, I think the best/only-feasible approach is a function that collapses a database by averaging matrices over levels of one or more grouping variables supplied by the user. For example, a sensible set of grouping variables might be c('SpeciesAuthor', 'id_stages'), where id_stages is a column identifying rows with the same set of MatrixClassAuthor (or perhaps MatrixClassOrganized).

It's not entirely satisfying to average over a set of matrices that may include Individual and Mean/Pooled matrices, but on the other hand, adding the average to a set being averaged shouldn't much change the outcome:

x <- 1:4
mean(x) == mean(c(x, mean(x)))

Here's the logic I propose. For each group (i.e. 1+ rows of the db; as defined by the user):

if only 1 row in group, return that row; else...
fail if matrix dimension differs among rows
warn if AnnualPeriodicity or paste(MatrixClassAuthor) differs
warn if any MatrixComposite == "Seasonal" (these should be matrix-multiplied)
collapse mat by averaging matrices (and reconstructing CompadreMat object)
collapse MatrixComposite by returning "Collapsed"
collapse SurvivalIssue by re-calculating from collapsed matU
collapse all other columns: if single unique value return that value, else collapse all unique values with paste(x, collapse = "; ")
(any list-columns apart from mat will have to be removed before collapsing)
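
The rules above could be sketched for a single group roughly like this. It is not the version going into the PR: the function names are hypothetical, plain matrices stand in for CompadreMat objects, and the AnnualPeriodicity/MatrixClassAuthor warning and the SurvivalIssue recalculation are left out.

```r
# Collapse one metadata column: keep a single unique value, else paste them.
collapse_column <- function(x) {
  ux <- unique(x)
  if (length(ux) == 1) ux else paste(ux, collapse = "; ")
}

# Hypothetical per-group collapse: meta is a data.frame of metadata rows
# (no list-columns), mats is the matching list of matrices.
collapse_group <- function(meta, mats) {
  if (nrow(meta) == 1) return(list(meta = meta, mat = mats[[1]]))
  n_stage <- vapply(mats, nrow, integer(1))
  if (length(unique(n_stage)) > 1) stop("matrix dimension differs among rows")
  if (any(meta$MatrixComposite == "Seasonal", na.rm = TRUE)) {
    warning("group contains Seasonal matrices, which should be matrix-multiplied")
  }
  out <- as.data.frame(lapply(meta, collapse_column), stringsAsFactors = FALSE)
  out$MatrixComposite <- "Collapsed"
  list(meta = out, mat = Reduce(`+`, mats) / length(mats))  # element-wise mean
}

meta <- data.frame(SpeciesAuthor = c("sp_A", "sp_A"),
                   MatrixComposite = c("Individual", "Mean"),
                   MatrixStartYear = c(2001, 2002),
                   stringsAsFactors = FALSE)
mats <- list(matrix(1, 2, 2), matrix(3, 2, 2))
res <- collapse_group(meta, mats)
```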

It's admittedly a bit dangerous, and will require good documentation and user caution. I've implemented a version that I'll submit in a PR shortly. Let me know if you have thoughts/alternatives.

Restructure the CompadreData class?

This has kind of been bugging me recently when working on some internal code. Basically, I think the current data structure is a bit more complicated than it needs to be from a programming and usability perspective. The points below outline some potentially beneficial alterations which would require re-writing the CompadreData class (but not the CompadreM class! :) ), but here's why I think it is a good idea.

We can still use S4 to ensure rigidity to a point, but use S3 at lower levels that provide users a more straightforward interface. For example, implementing an S4 CompadreData class with 2 slots (call them something like data and version) would likely work. The data slot could be a data.frame containing a list column of S4 CompadreM objects and their associated metadata. This is not an original idea of mine (@patrickbarks in #25), but it could simplify a number of features we've discussed (e.g. @patrickbarks' suggestions in #26, reordering data for phylogenetic analysis (e.g. @iainmstott's usage in #22), adding columns, etc.) and make methods easier to write. For example, the [ method would basically just extract the data slot, use [.data.frame, and return the subsetted object (e.g. #24). We'd also still retain the beneficial rigidity of the CompadreM class in that list-column so that we don't let people accidentally include negative matrix elements, etc. I think this implementation would also reflect the ideas that @iainmstott and @tdjames1 are discussing towards the end of the thread in #22.
We aren't really retaining the structure of the SQL database with the current CompadreData structure anyway. The SQL database schema breaks the metadata up into a variety of other, smaller tables while we lump it all together into the metadata slot (which I think is a good thing). Additionally, the SQL database itself is designed to use the publication as the "atomic unit" of organization while our current CompadreData structure uses the matrix itself (also a good thing - that's way more intuitive). The SQL database has a very rigid structure once it's implemented, but we can do whatever we want with that structure once it's pulled into R (and we already are taking that approach with the metadata table). I don't think we should constrain ourselves to a rigid approach if a more flexible (and simpler) one is available. In short, SQL is rigid, but R doesn't have to be. We should take advantage of that!
The current validCompadreData method only checks to make sure each row of metadata has a matrix associated with it in the mat slot. We could change that to check for NAs in the list-column. Otherwise, I'm not sure we actually need to change very much beyond tweaking a few functions.
Finally, most other database API packages are built on top of S3 (e.g. BIEN, rnoaa) and Compadre doesn't exist in a vacuum; there are real use cases for cross-database analysis (e.g. @robcito 's talk at ESA). I'm not suggesting we abandon S4 entirely (our data is a bit more constrained than others), but it would help with cross-database compatibility to make our interface as similar as possible.

I've started to experiment with this on my fork (hopefully will push tonight/tomorrow). I've been using dplyr syntax for a lot of it, but that's by no means a requirement (I worry less about depending on tidyverse packages, since they're generally pretty stable with the introduction of quasi-quotation in rlang 0.2.0, but I understand the concern in general). Functions like map_* from purrr work really well in this set-up and don't constrain the user on what data can be added/subtracted from their analysis.

Sorry for the wall of text! tl;dr - I think it'll save us a bunch of pain further down the line to change the internal CompadreData structure now. Curious to hear other thoughts on this approach.

Error in getMeanMatF (dev branch)

Error is produced:

Error in eval(e, x, parent.frame()) : object 'PopId' not found

I think this is caused by a problem building the newdata object (lines 55-64).

Move matrix-level functions to Rage?

Four functions in Rcompadre act on individual matrices (i.e. their first argument is matA, or matU, etc.), and so I think would fit more naturally into Rage. These are:

collapseMatrix, identifyReprodStages, rearrangeMatrix, and splitMatrix.

Two of these (collapseMatrix and rearrangeMatrix) are used by Rage::standardizedVitalRates, so having them in the same package would be nice. None of the four functions are called by other functions in Rcompadre, so moving them shouldn't break anything.

Online documentation

I think lots of R users only read vignettes online, rather than within R.

Until we're on CRAN (which publishes HTML vignettes), and since GitHub no longer renders Rmarkdown, I think it would be good to 'manually' publish formatted versions of our vignettes (perhaps including the workshop materials, @jonesor).

One option is to just render the vignettes as regular markdown and host them on the Rcompadre repo, and perhaps include links in the README. Though I'm not sure exactly how to do this... I think we'd have to specify multiple output formats in the Rmd header.

Some quick Googling makes me think it's possible.

Current COMPADRE release and development versions fail to meet CompadreData/CompadreM definition

Currently, Compadre v4.0.1 and Compadre vX.X.X (unreleased version) fail to meet the CompadreData class definition. It seems that 24 of the A and/or U matrices have negative elements which are currently rejected during validation of CompadreM class objects. For example:

load('COMPADRE_V.4.0.1.RData')
Compadre <- asCompadreData(compadre)
Error in validObject(.Object) : invalid class “CompadreM” object: matU is not nonnegative

I've extracted the DOIs for the offending studies and have found the papers.

In the Jongejans 2010 paper (DOI: 10.1111/j.1365-2745.2009.01612.x, Carlina vulgaris, matrix b2), there was no data entry error - the element as calculated by them is slightly negative.

In the Kesler et al 2008 paper (DOI: 10.1007/s00442-008-1022-1, 23 U matrices with slightly negative elements), I am guessing the negative matrix elements are a byproduct of the regressions they used. However, as noted in the Observation column of the metadata, the matrices were provided by the author so I can't check them by hand at the moment.

I remember discussing this issue in Ghent and seem to recall that we wanted to retain the matrices as reported by the authors. I think this would mean we need to loosen our definition of the class (perhaps throw a warning instead of an error) or subset the problem studies out each time a user submits a query that includes them (which would largely defeat the purpose of retaining them in the first place). Alternative solutions that don't require loosening the class definition or getting rid of the offending studies would be ideal, but I can't think of any at the moment.

@jonesor @robcito @tdjames1 @iainmstott - thoughts on how to correct this?

tldr: 2 studies with a total of 24 matrices in Compadre have negative matrix elements and prevent it from satisfying our current class definition. Relax the class definition, get rid of the studies, clarify with authors, or something else entirely?

DOIs for the studies in question: 10.1111/j.1365-2745.2009.01612.x and 10.1007/s00442-008-1022-1

Replace matFmu arguments

Replace matFmu argument in rearrangeMatrix with more generic argument to identify reproductive stages, because identifying reproductive stages using the mean fecundity matrix is just one possible approach to identifying reproductive stages.

accessor functions for S4 classes

We need accessor functions for the S4 classes. e.g. matA(CompadreMObject) gives a single A matrix, matA(CompadreDataObject) gives a list of matA, SpeciesAccepted(CompadreDataObject) gives the character vector of SpeciesAccepted, metadata(CompadreDataObject) gives the metadata, and so on.

I've forked the repo and have been working on this there (perhaps should have just created a new branch, but hey). Actually made good progress, but R CMD check is not working with vignettes, tests and so on. I'll submit a pull request when it's more ready.

Add check_singular_U flag to cleanDB

Rage functions longevity (life expectancy component), R0, and lifeTimeRepEvents will fail if matU is singular (i.e. non-invertible). This usually indicates infinite life expectancy due to a 100% stasis loop in the final stage(s). Adding a flag will make it easier for users to remove rows with singular matU ahead of time, if desired.

# try calculating fundamental matrix
N <- try(solve(diag(nrow(matU)) - matU), silent = TRUE)

# flag if singular (inherits() is safer than class(N) == ..., because
# class() can return more than one element)
check_singular_U <- inherits(N, 'try-error') && grepl('singular', N[1])
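
Wrapped in a function, the same check can be applied across a list of U matrices. This is only a sketch: is_singular_U is a hypothetical name, and like the snippet above it relies on the English wording of the solve() error message.

```r
# Hypothetical helper: TRUE if (I - matU) cannot be inverted.
is_singular_U <- function(matU) {
  N <- try(solve(diag(nrow(matU)) - matU), silent = TRUE)
  inherits(N, "try-error") && grepl("singular", N[1])
}

matU_ok   <- rbind(c(0.1, 0.0),
                   c(0.5, 0.8))
matU_loop <- rbind(c(0.1, 0.0),
                   c(0.5, 1.0))  # 100% stasis in the final stage

flags <- vapply(list(matU_ok, matU_loop), is_singular_U, logical(1))
```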

Make db@data printable

Picking up from #32, I think print(db@data) throwing an error is an issue that should be solved, even if we don't want users to access the data slot that way. Presumably many users will use db@data$ to create derived columns. I think some users will realize db@data is just a data.frame, and that's where all the important stuff is, and then be confused when normal data frame stuff fails (this is exactly what happened to me). The options that I can think of (there may well be others):

1. make db@data a tibble
2. move db@data$mat back to its own slot (db@mat)
3. make a custom class and print method for db@data
4. drop the CompadreM class and revert db@data$mat to a list of lists (that way print(db@data) won't throw an error, though the printing won't be pretty)

I think option 1 or 2 are the least worst.

Bug in cleanDB (dev branch)

Using cleanDB produces Errors.

Error in subset.default(newdata_sub, select = c("index", "check_ergodic",  : 
  argument "subset" is missing, with no default

I think this is caused by the way that data(db) and data(db_sub) work.
e.g. lines 59-63 are supposed to build a new object with a column called index but (I think) this doesn't happen.

Cut species/study/matrix counts from db@version

They...

have no impact on reproducibility
can be calculated from db@data using existing Rcompadre fns
can become outdated with the use of [, subset(), merge(), etc. on db@data
complicate CompadreDB methods such as [, and methods we might add in the future for subset(), merge(), etc.
- e.g. if a user's column selection with [ or subset doesn't include YearPublication, then NumberStudies can't be updated
- incidentally, the current [ method doesn't update species/study/matrix counts based on the given row subset, whereas subsetDB does

Also, the species and matrix counts (i.e. the underlying calculations) are already part of the show method, which at least ensures those counts are accurate at the time of printing.

The counts will still exist in the version element of legacy dbs, but I suggest we drop them during the asCompadreData conversion (and from other Rcompadre functions).

(I still think we should consider removing the whole version slot, but that's a separate issue)

Define S4 class

@tdjames1 and @iainmstott are handling this

Making Rcompadre pipe-friendly

As noted briefly in PR #24, I'm interested in making Rcompadre work nicely with dplyr's pipe operator (%>%). The pipe operator passes an object on the left side to a function on the right side (e.g. x %>% mean() == mean(x)), and when used in series can make code more readable.

A particular piping sequence I often perform with compadre is to calculate some quantity for every row of the db, add it as a column to the metadata, and then subset the db based on that new column (and repeat). In the past I've worked with a tibble version of the db (i.e. metadata + list-columns for matA, matU, ..., matrixClass) to make this sequence easier:

library(dplyr)

compadre_tb <- as_tibble(compadre$metadata) %>% 
  mutate(matA = lapply(compadre$mat, function(x) x$matA),
         matU = lapply(compadre$mat, function(x) x$matU),
         matF = lapply(compadre$mat, function(x) x$matF),
         matC = lapply(compadre$mat, function(x) x$matC),
         matrixClass = compadre$matrixClass)

For instance, say I want to work with a set of matrices reflecting populations in decline (lambda < 1), and I only want ergodic matrices with no NAs. With the tibble version I can use a sequence of dplyr and purrr functions to repeatedly add columns (mutate) and subset (filter) based on those new columns.

library(purrr)

compadre_use <- compadre_tb %>% 
  mutate(na_matA = map_lgl(matA, ~ any(is.na(.x)))) %>% 
  filter(na_matA == FALSE) %>% 
  mutate(ergodic = map_lgl(matA, popdemo::isErgodic)) %>% 
  filter(ergodic == TRUE) %>% 
  mutate(lambda = map_dbl(matA, popbio::lambda)) %>% 
  filter(lambda < 1)

With a CompadreData object, the equivalent sequence might look something like this:

compadre_s4_use <- compadre_s4 %>% 
  cleanDB() %>% 
  subsetDB(check_NA_A == FALSE & check_ergodic == TRUE)

compadre_s4_use@metadata$lambda <- sapply(compadre_s4_use@mat,
                                          function(x) popbio::lambda(x@matA))

compadre_s4_use <- subsetDB(compadre_s4_use, lambda < 1)

So piping works fine with subsetting (and cleanDB), but I can't replicate the fully-piped sequence without an Rcompadre equivalent to dplyr::mutate(). What we would need is a function that takes a CompadreData object as the first argument, and returns a CompadreData object with an additional metadata column (based on some transformation specified in the second argument). I don't know what this function would entail in practice, but I think this general type of functionality would be desirable (to me anyway). Thoughts?
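
To make the idea concrete, here is a sketch of such a function on a toy S4 class. Everything in it is hypothetical: DemoDB only mimics the metadata/mat layout, its mat slot holds plain A matrices rather than CompadreM objects, and add_metadata_column is an invented name.

```r
library(methods)

# Minimal stand-in for CompadreData: a metadata slot plus a list of A matrices.
setClass("DemoDB", representation(metadata = "data.frame", mat = "list"))

# Hypothetical mutate-like helper: apply fn to every matrix, store the result
# as a new metadata column, and return the modified object so it can be piped.
add_metadata_column <- function(db, name, fn) {
  db@metadata[[name]] <- vapply(db@mat, fn, numeric(1))
  db
}

# Dominant eigenvalue modulus, used here in place of popbio::lambda
spectral_radius <- function(A) max(Mod(eigen(A, only.values = TRUE)$values))

db <- new("DemoDB",
          metadata = data.frame(id = 1:2),
          mat = list(diag(c(0.5, 0.9)), diag(c(0.2, 1.1))))
db <- add_metadata_column(db, "lambda", spectral_radius)
db@metadata
```

Because the object is both the first argument and the return value, a call like this can sit in a %>% chain between cleanDB() and subsetDB().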

Vignette

We need a vignette that guides users through a workflow of using the package.
This can draw on materials used for COMPADRE workshops that are in our compadreDB repository.

More than one vignette is OK.

Write split function

Split - to split the database into two database objects, one matching a criterion, the other not.

Alternative to 47 accessor functions for the metadata columns

I'm wondering if there's an alternative to having accessor functions for every single metadata column (the mat-related ones I like)? A few (fairly minor) downsides I see:

they pollute the namespace and make Rcompadre:: less useful for finding functions (I assume the :: autocomplete is limited to RStudio)
relatedly, they reduce the ease of using RStudio's column autocomplete options for piped expressions to subset(), mutate(), etc. (though this feature is currently finicky anyway)
based on my limited experience, other S4 classes don't seem to use accessors for columns within a data slot (e.g. sp)

One alternative (based on the sp package) would be to have CompadreData methods for $, $<-, and names that directly access the data slot, e.g. so db$SpeciesAuthor is equivalent to db@data$SpeciesAuthor. This enables autocomplete in RStudio using db$. Even apart from autocomplete, I think it would be useful in that it gives users quicker access to the good stuff (i.e. everything except version).
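For reference, a minimal sketch of the sp-style approach on a toy class. ToyData and its slots are assumptions standing in for CompadreData; only the setMethod and setReplaceMethod calls are the actual mechanism being proposed.

```r
library(methods)

# Toy stand-in for CompadreData (assumption): a data slot plus version info.
setClass("ToyData", representation(data = "data.frame", version = "list"))

# `$` reads a column straight from the data slot.
setMethod("$", "ToyData", function(x, name) x@data[[name]])

# `$<-` writes a column to the data slot and returns the modified object.
setReplaceMethod("$", "ToyData", function(x, name, value) {
  x@data[[name]] <- value
  x
})

# names() lists the columns of the data slot.
setMethod("names", "ToyData", function(x) names(x@data))

db <- new("ToyData",
          data = data.frame(SpeciesAuthor = c("Alaria_nana", "Ziziphus_jujuba"),
                            stringsAsFactors = FALSE),
          version = list(Version = "toy"))

db$MatrixDimension <- c(3L, 5L)
```

With these methods db$SpeciesAuthor and db@data$SpeciesAuthor return the same vector, and names(db) gives the column names without exposing the version slot.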

Unit testing

At some point in the not so distant future, we should begin to develop unit tests for as many of these functions as we can. I'm happy to lead the charge on this one, just need to know which functions are completed and which are still in progress.

"show" method bug

The show method should produce something like this:

>compadre
A com(p)adre database ('CompadreData') object with 695 SPECIES and 7024 MATRICES.
See ?CompadreData and ?CompadreUnionMethods for methods of accessing data.

but produces this:

>compadre
A com(p)adre database ('CompadreData') object with 695 SPECIES and 7024 MATRICES.
See ?CompadreData and ?CompadreUnionMethods for methods of accessing data.

 A com(p)adre database ('CompadreData') object with 695 SPECIES and 1 MATRICES.
See ?CompadreData and ?CompadreUnionMethods for methods of accessing data.

The second part here should obviously be removed.

Create a sub-sample of COM(P)ADRE to distribute with the package?

Would it be worthwhile to include a subsample of one or both databases as example data sets? I suspect it will make our lives easier in the long run.

For example, we could then use something like data(compadre) in the vignettes and examples rather than relying on file paths to the objects (which change on every machine).

If others agree with the general concept, I'm happy to work on identifying a good subsample from each database to distribute with the package.

Function re-naming proposals

(1) Change verb in cleanDB?
'Clean' to me implies that something will be changed or removed, but cleanDB just adds columns flagging potential issues with the MPMs. I suggest changing to something like flagDB (or perhaps something more verbose, like flagMatrixIssues).

(2) Change verb in mergeDBs?
mergeDBs is essentially rbind.data.frame. We have a separate CompadreDB method for the base R merge function (which performs a join operation), so I suggest changing the verb in mergeDBs to 'bind' or 'rbind', or moving this functionality to an rbind.CompadreDB method.

(3) Switch to snake_case for all non-accessor functions?
To avoid the awkwardness of camelCase with acronyms ('cleanDB' or 'cleanDb'?), and the mild ambiguity of names like 'DBToFlat' (is the object 'DB' or 'DBT'?).

I think the object_verb format (recommended by rOpenSci) would work really well in Rcompadre. I'd suggest switching the object from db_ to cdb_ (COMPADRE Database) because dplyr has a bunch of db_ functions. E.g.

fetchDB         ->  cdb_fetch
cleanDB         ->  cdb_flag
compareDBs      ->  cdb_compare (or cdbs_compare)
mergeDBs        ->  cdb_rbind (or cdbs_rbind, or rbind method)
DBToFlat        ->  cdb_flatten
checkSpecies    ->  cdb_check_species (or cdb_species; or remove)
getMeanMatF     ->  cdb_get_mean_F (or mean_mat_F)
asCompadreDB    ->  as_cdb
convertLegacyDB ->  (remove and just use as_cdb)
stringToMatrix  ->  string_to_mat

If we make that change, we could consider changing the 'CompadreDB' class to 'cdb', and then for consistency, the 'CompadreMat' class to something like 'cmat' (COMPADRE Matrix) or 'cmpm' (COMPADRE Matrix Population Model). We could alternatively use a longer object name to avoid potential collisions with other packages, like 'compdb' or 'compadre'. Or keep the current class names and use the cdb_ convention for functions anyway.

Thoughts?

Error produced by mergeDBs() (dev branch)

To reproduce the error:

x <- subsetDB(compadre, SpeciesAccepted == "Alaria nana")
y <- subsetDB(compadre, SpeciesAccepted == "Ziziphus jujuba")
z <- mergeDBs(x, y)

The error message is:

Error in VersionData(db1) : could not find function "VersionData"
In addition: Warning messages:
1: In data(db1) : data set ‘db1’ not found
2: In data(db2) : data set ‘db2’ not found
3: In data(db1) : data set ‘db1’ not found
4: In data(db2) :
 Error in VersionData(db1) : could not find function "VersionData"

indexing within subsetDB

It would be good to have the ability in subsetDB to index using numbers or variable values that aren't Boolean, i.e. rather than SpeciesAccepted == "Acinonyx_jubatus" | SpeciesAccepted == "Panthera_leo", to use SpeciesAccepted = c("Panthera_leo", "Acinonyx_jubatus"). Note these aren't in alphabetical order; it would be nice if the function returned them in the user's desired order.

Otherwise (and perhaps easier to implement / more useful), rather than a Boolean variable it would be good to be able to use e.g. subsetDB(comadre, c(990, 646, 461, 1754, 1927)) and have the function return a database containing those rows, in that order.
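Both behaviours can be sketched on a plain data frame standing in for the database. The helper names subset_by_value and subset_by_index are hypothetical, and the toy data are made up for illustration.

```r
# Toy metadata (assumption): four matrices from three species.
db <- data.frame(
  SpeciesAccepted = c("Acinonyx_jubatus", "Panthera_leo",
                      "Ursus_arctos", "Panthera_leo"),
  MatrixID = 1:4,
  stringsAsFactors = FALSE
)

# Return rows whose `column` matches any of `values`, grouped in the order
# the values were supplied rather than the order they occur in the database.
subset_by_value <- function(db, column, values) {
  idx <- unlist(lapply(values, function(v) which(db[[column]] == v)))
  db[idx, , drop = FALSE]
}

# Return rows by position, in the order given.
subset_by_index <- function(db, i) db[i, , drop = FALSE]

cats   <- subset_by_value(db, "SpeciesAccepted",
                          c("Panthera_leo", "Acinonyx_jubatus"))
picked <- subset_by_index(db, c(3, 1))
```

Here cats holds the two Panthera_leo rows followed by the Acinonyx_jubatus row, and picked holds row 3 followed by row 1.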

Create branch to preserve current (non-dev) version?

We are getting close to the point where we will want to merge the changes we have made on the dev branch to those on the master branch. It has occurred to me that people might still be using or want to use the current version (e.g. they are currently working on analyses).

We could, I think, create a new branch from the master before we merge the dev branch, and call it "v.0.1", or whatever, so that to install that old version you would simply need to use:

remotes::install_github("jonesor/Rcompadre", ref = "v.0.1")

What do you all think?