jonesor / rcompadre
R tools for obtaining and manipulating data from the COMPADRE and COMADRE Plant and Animal Matrix Databases
Home Page: https://jonesor.github.io/Rcompadre/
Add functionality to provide the user different options of dealing with values of NA within matF. Currently, the function assumes NA indicates positive fecundity.
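A minimal sketch of what such an option could look like. The helper name repro_stages() and its na_handling argument are hypothetical (neither exists in Rcompadre); the sketch only illustrates letting the caller choose how NA entries in matF are interpreted.

```r
# Hypothetical helper (not part of Rcompadre): flag reproductive stages in matF,
# letting the caller decide how NA entries are interpreted.
repro_stages <- function(matF, na_handling = c("positive", "zero", "error")) {
  na_handling <- match.arg(na_handling)
  if (anyNA(matF)) {
    if (na_handling == "error") stop("matF contains NA values")
    # treat each NA as either a positive or a zero fecundity
    matF[is.na(matF)] <- if (na_handling == "positive") 1 else 0
  }
  # a stage (column) is reproductive if it has any positive fecundity
  colSums(matF) > 0
}

matF <- matrix(c(0, 0, NA, 0, 0, 0, 2.5, 0, 0), nrow = 3)
repro_stages(matF, "positive")  # NA counted as reproduction
repro_stages(matF, "zero")      # NA counted as no reproduction
```

Defaulting to "positive" would keep the current behaviour, so existing code would not change unless the user opts in.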
Function to produce mean fecundity matrices from individual matrices.
Function to remove matrices that do not pass a series of (independent) tests.
e.g. ergodicity, primitivity, reducible, splitness etc.
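One way this could be structured, sketched in base R: keep each test as an independent function, apply all of them to every matrix, and retain only the matrices that pass everything. The test names and the irreducibility check below are illustrative assumptions, not existing Rcompadre functions; in practice the tests would more likely wrap popdemo functions.

```r
# Hypothetical test functions (not Rcompadre API); each returns TRUE or FALSE
tests <- list(
  no_na       = function(m) !anyNA(m),
  nonnegative = function(m) all(m >= 0, na.rm = TRUE),
  irreducible = function(m) {
    # a non-negative n x n matrix A is irreducible iff (I + A)^(n - 1) > 0
    if (anyNA(m)) return(FALSE)
    n <- nrow(m)
    B <- diag(n) + (m > 0)
    P <- diag(n)
    for (i in seq_len(n - 1)) P <- P %*% B
    all(P > 0)
  }
)

mats <- list(
  matrix(c(0, 0.5, 2, 0.8), nrow = 2),   # passes every test
  matrix(c(0.5, 0.2, 0, 0.9), nrow = 2), # reducible
  matrix(c(0, NA, 1, 0.5), nrow = 2)     # contains NA
)

# rows = tests, columns = matrices
results <- sapply(mats, function(m) vapply(tests, function(f) f(m), logical(1)))
keep <- apply(results, 2, all)
mats_ok <- mats[keep]
```

Keeping the full results matrix around also lets the user see which test each matrix failed, rather than only the final subset.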
(With apologies for the length)
Rather than the nested structure of the current mat slot (or "column", à la #27), I propose having separate list-columns for matA, matU, matF, matC, MatrixClassOrganized, and MatrixClassAuthor. The reason is that most MPM functions (e.g. in popdemo, popbio) act on a matrix, so we should make it easier for users to access matrices, and particularly, to vectorize over a set of matrices.
Currently, vectorizing with popdemo or popbio functions requires writing a custom function to access the slots within mat, e.g.
ergodic <- sapply(db@mat, function(x) popdemo::isErgodic(x@matA))
lambda <- sapply(db@mat, function(x) popbio::lambda(x@matA))
With a flat structure a user could vectorize over matA without having to write a custom function, e.g.
ergodic <- sapply(db$matA, popdemo::isErgodic)
lambda <- sapply(db$matA, popbio::lambda)
The situation is more nuanced when it comes to Rage, because Rage functions could eventually work with CompadreM objects, in which case a user could vectorize over db@mat without writing custom functions. However, for the Rage functions that take a single matrix (i.e. kEntropy, qsdConverge, reprodStages, identifyReprodStages, splitMatrix), vectorizing with the flat version would be just as easy, e.g.
# CompadreM version
k_ent <- sapply(db@mat, Rage::kEntropy)
# flat version
k_ent <- sapply(db$matU, Rage::kEntropy)
For some Rage functions that take multiple matrix arguments (e.g. matrixElementPerturbation), vectorizing with the CompadreM object would admittedly be nicer, e.g.
# CompadreM version
mat_pert <- lapply(db@mat, Rage::matrixElementPerturbation)
# flat version
mat_pert <- mapply(Rage::matrixElementPerturbation, db$matU, db$matF, SIMPLIFY = FALSE)
For Rage functions that will often be used with additional row-specific arguments (e.g. longevity, rearrangeMatrix, reprodStages), vectorizing with the flat version is just as easy (e.g. calculating lifespan with respect to the first non-propagule stage):
# CompadreM version
start_life <- sapply(db@mat, function(x) min(which(x@matrixClass$MatrixClassOrganized == "active")))
lifespan <- mapply(Rage::longevity, db@mat, startLife = start_life, SIMPLIFY = FALSE)
# flat version
start_life <- sapply(db$MatrixClassOrganized, function(x) min(which(x == "active")))
lifespan <- mapply(Rage::longevity, db$matU, startLife = start_life, SIMPLIFY = FALSE)
Finally, there are a few Rage functions (R0, dEntropy, lifeTimeRepEvents, and makeLifeTable) for which having a CompadreM method would make the function overly difficult to document and understand (in my opinion), because the user may wish to apply them to matF-only, matC-only, or matF and matC (or in the extreme case of makeLifeTable, also matU-only).
As outlined in jonesor/Rage#19, I think we should simplify these functions so that they only take one 'reproductive matrix' argument (e.g. matR), and correspondingly make it easier for users to derive and store matFC (= matF + matC) within db.
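To illustrate how cheap deriving matFC would be with the flat structure, here is a small base R sketch. The db object is a toy data.frame with list-columns standing in for a flat COMPADRE object, and the matrices are made up.

```r
# Toy stand-in for a flat db: a data.frame with list-columns of matrices
db <- data.frame(id = 1:2)
db$matF <- list(matrix(c(0, 0, 3, 0), nrow = 2),
                matrix(c(0, 0, 1.5, 0), nrow = 2))
db$matC <- list(matrix(c(0, 0.2, 0, 0), nrow = 2),
                matrix(0, nrow = 2, ncol = 2))

# element-wise sum of each matF/matC pair, stored as a new list-column
db$matFC <- Map(`+`, db$matF, db$matC)
db$matFC[[1]]
```

A function that takes a single matR argument could then be vectorized over db$matF, db$matC, or db$matFC in exactly the same way.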
The flat version makes it easier to examine a series of matrices from the same study (e.g. reflecting different years or populations)
# CompadreM version
db@mat[[450]]@matA
db@mat[[451]]@matA
db@mat[[452]]@matA
# flat version
db$matA[450:452]
Though the CompadreM version makes it easier to examine all the matrices for a given row
# CompadreM version
db@mat[[450]]
# flat version
db$matA[[450]]
db$matU[[450]]
db$matF[[450]]
db$matC[[450]]
I think the matrix-validation function of the CompadreM class could be moved to a new Rcompadre function (e.g. db_validate). I don't see too much benefit of 'built-in' validation of things like matrix dimension, non-negative values, etc. Definitely those things should be validated on the COMPADRE side, but after that I think it's fine to leave things to the user.
Current functions use $ to access elements of compadre database objects. With the new S4 class for these data, the following functions (and possibly others I've missed) will need updating with @ to access S4 slots:
findSpecies
utility function)
Related to content of d88789d - The ternaryplot() function in the vignette returns the following when run from knitr:
Error in unit(x, default.units) : 'x' and 'units' must have length > 0
but not when called interactively. I haven't found a decent explanation online for why this happens, so I've set the chunk options to eval = FALSE until I can figure out what's going on. At the very least, users can cut and paste that section to produce figures in their own interactive sessions.
The CompadreDB show method prints a cryptic warning if db doesn't contain the column SpeciesAccepted. E.g.
Compadre[,c("mat", "Genus")]
## A COM(P)ADRE database ('CompadreDB') object with 0 SPECIES and 150 MATRICES.
##
## # A tibble: 150 x 2
## mat Genus
## <list> <chr>
## 1 <S4: CompadreMat> Setaria
## 2 <S4: CompadreMat> Lechea
...
## Warning message:
## Unknown or uninitialised column: 'SpeciesAccepted'.
This is ultimately coming from the function NumberAcceptedSpecies().
Suggest adding a more helpful message to the NumberAcceptedSpecies() function, and modifying the show method to prevent any warning message in this instance. Though I'm not sure exactly what we should do with the header message if db doesn't contain SpeciesAccepted. Maybe just replace the number with a question mark? E.g.
Compadre[,c("mat", "Genus")]
## A COM(P)ADRE database ('CompadreDB') object with ? SPECIES and 150 MATRICES.
##
...
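A rough sketch of how the header could fall back to a question mark. Both functions below are hypothetical stand-ins written for a plain data.frame; they are not the actual show method or NumberAcceptedSpecies().

```r
# Hypothetical stand-in for the species count used in the header
n_species_label <- function(data) {
  # avoid touching a column that isn't there (which is what triggers the warning)
  if (!"SpeciesAccepted" %in% names(data)) return("?")
  as.character(length(unique(data$SpeciesAccepted)))
}

# Hypothetical stand-in for the first line printed by the show method
header <- function(data) {
  sprintf("A COM(P)ADRE database ('CompadreDB') object with %s SPECIES and %d MATRICES.",
          n_species_label(data), nrow(data))
}

with_species <- data.frame(SpeciesAccepted = c("Alaria nana", "Alaria nana"),
                           Genus = c("Alaria", "Alaria"))
header(with_species)
header(with_species["Genus"])
```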
Merge 2 compadre databases together.
Suggestion: for the popdemo development package branch I use the package name 'popdemoDev' instead: perhaps this would be an idea for Rcompadre and Rage too. It means you can have both packages installed on your machine at the same time: one which works and one which may be buggy.
Before merging the development branch into master you need to make sure you change it, though. And then change it back once the merge is completed.
This would be a problem if code included any RcompadreDev::function(... but at least in popdemo this isn't the case. Maybe I'm not doing things 'properly' (although it got past the CRAN pearly gates so I guess it should be good enough)
Maybe called "sampleMatrices"?
I will give this a go, otherwise I will attempt to improve usability of the current function.
I think it would be useful to have a vignette devoted solely to vectorizing with Rcompadre. Could have some 'basic vectorizing' material, some material on using accessor functions within vectorized code, and also some 'avoiding pitfalls' type material.
For instance, the fact that accessors can be used both with CompadreDB and CompadreMat makes it easy to write vectorized code that's very inefficient. E.g.
# different ways to calculate stage-specific survival, and time to run in seconds
db <- cdb_fetch("~/COMPADRE_v.X.X.X.RData")
x <- lapply(seq_along(db$mat), function(i) colSums(matU(db)[[i]])) # 56 s
x <- lapply(seq_along(db$mat), function(i) colSums(matU(db[i,])[[1]])) # 5 s
x <- lapply(db$mat, function(x) colSums(matU(x))) # 0.2 s
x <- lapply(matU(db), colSums) # 0.04 s
Create a function to check whether a particular species is present in the database, and if it is to return that subset of the database.
I think matA, matU, matF, matC and any of the others in the ClassUnionMethods.R file always return a list, even if the database passed is just one row. Intuitively one might want to use e.g. matA(compadre[123, ]) or matA(compadre[123]) expecting a matrix, but instead get a list of length 1 containing the matrix. We can get it with e.g. matA(compadre$mat[[123]]), but it would be good for the others to work too.
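One possible way to get that behaviour, sketched with a hypothetical helper (unwrap_single() is not an Rcompadre function) that an accessor could call before returning:

```r
# Hypothetical helper (not part of Rcompadre): drop the list wrapper
# when an accessor result holds exactly one element
unwrap_single <- function(x) {
  if (is.list(x) && length(x) == 1L) x[[1]] else x
}

one_row   <- list(diag(2))           # stand-in for matA(compadre[123, ])
many_rows <- list(diag(2), diag(3))  # stand-in for matA(compadre[1:2, ])

unwrap_single(one_row)    # a matrix
unwrap_single(many_rows)  # still a list
```

The trade-off is that the return type then depends on the number of rows, which can surprise code that loops over subsets of varying size, so an explicit argument to opt in might be safer than doing this silently.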
Function subsetDB does not currently subset slot matrixClass. This was my fault... I mistakenly removed the line:
ssdb$matrixClass <- ssdb$matrixClass[subsetID]
When documenting functions, begin title with verb (e.g. "Calculate life table entropy" rather than "A function that calculates life table entropy").
For descriptions, can use longer form (e.g. "This function calculates...").
Since we need a stable master branch for workshops (and since the master branch is now stable after a807f7a!), does it make sense to add continuous integration to make sure all pushes/pull requests to master are checked properly? If you're interested, you can read more on what continuous integration does in jonesor/Rage#9 and here.
If so, @jonesor (or someone else with administrative privileges) will need to activate it from the Github page (Settings -> Integrations & Services). From there, you can grant access to Travis-CI and it should redirect you to the Travis homepage so you can activate it from that end as well. Once that account is activated, I can set up the .travis.yml file.
Thoughts?
I only recently learned that it's possible to download and load COM(P)ADRE directly from R (can't remember where I first saw this, but props to that person):
url_dl <- 'https://www.compadre-db.org/Data/CompadreDownload'
load(url(url_dl))
By loading db in a separate environment (e.g.), a fetchDB function could allow the user to assign db to an object of their chosen name, rather than getting an already-named object with load.
compdb <- fetchDB('compadre') # from web
compdb <- fetchDB('my_path/COMPADRE_v.X.X.X.RData') # from file
This would also enable piping right from the initial load operation, e.g.
compdb <- fetchDB('compadre') %>%
tidyDB() %>%
cleanDB() %>%
subset(check_ergodic == TRUE)
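A minimal sketch of the separate-environment idea, using a temporary .RData file so it runs without network access. The function name fetch_db_sketch() is hypothetical, and the toy object saved below only stands in for a real COMPADRE download; for the web case the same load() call could be given url(url_dl) instead of a path.

```r
# Hypothetical sketch of a fetchDB-style loader (not the actual implementation)
fetch_db_sketch <- function(path) {
  env <- new.env()
  # load() returns the names of the objects it created, here inside env
  obj_names <- load(path, envir = env)
  stopifnot(length(obj_names) == 1L)
  get(obj_names, envir = env)
}

# demo: save a toy object under the name 'compadre', then load it under a new name
tmp <- tempfile(fileext = ".RData")
compadre <- list(metadata = list(SpeciesAccepted = "Alaria nana"))
save(compadre, file = tmp)
rm(compadre)

compdb <- fetch_db_sketch(tmp)
```

Because the object is loaded into a throwaway environment, nothing named compadre is created in the caller's workspace.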
Related to #25, I think it would be useful if there was a way for a user to add new slots containing either vectors, matrices, or lists to a CompadreData object.
In many demographic analyses users will want to derive vectors or matrices from an MPM (e.g. population vector, stable distribution, sensitivity matrix, collapsed or rearranged matrix), and it would be ideal if these remained part of the CompadreData object to maintain the mapping with the metadata and original MPM. This becomes particularly important if the user wishes to subsequently subset the db based on some derived value(s).
Many of the functions in Rcompadre/Rage return vectors or matrices derived from an MPM (e.g. reprodStages, identifyReprodStages, rearrangeMatrix, splitMatrix, collapseMatrix), and I gather these currently need to be stored separately from the db.
I don't know much about S4 and so don't have a sense of whether this is feasible, but perhaps @tdjames1 and @iainmstott can weigh in.
Probably in 'Getting Started' vignette and maybe README. Could include link to tibble vignette, perhaps brief summary of differences from data.frame, and tips for changing default setting, e.g.
options(tibble.print_max = Inf) # always print all rows
options(tibble.width = Inf) # always print all columns
Following discussions with Owen, we felt it may make sense to change the names of the classes. The database is called CompadreData, but all of our functions use the syntax "DB" (subsetDB, cleanDB, compareDBs, etc.). Similarly, we use "mat" to refer to matrices (matA, matU, etc.) but the matrix-focused class is CompadreM. The proposal is to rename these to CompadreDB and CompadreMat, respectively.
This may confuse us developers initially, and although users may not have to interact so much with the classes directly, it could make life better for anyone who wants to build on the package, specifically the classes, as well as anyone wanting to convert between the old and the new data structures.
Continuity is king
I didn't initially add a method for summarize alongside the other tidyverse functions because summarize won't ever return a valid CompadreDB object. But now I keep forgetting that I have to use CompadreData() or as_tibble() before calling summarize().
Compadre %>%
CompadreData() %>%
group_by(OrganismType, Continent) %>%
summarize(n = n())
In hindsight, I don't think the 'must return a CompadreDB object' requirement was necessary, and a summarize method that returns a regular tibble would be rather convenient. E.g.
Compadre %>%
group_by(OrganismType, Continent) %>%
summarize(n = n())
Everyone ok with this?
I think it would be useful to have a function that only extracts the metadata (without the mat-column). The as_tibble() method converts the database to a tibble that includes mat. But it would be convenient to have a separate function that also drops mat.
From Haydee:
We noticed that the function as_cdb changes the amount of data stored in the R objects. E.g. when as_cdb is not used to convert the loaded R objects, Clarkia_xantiana_subsp._xantiana exists (with at least one matrix); if the as_cdb function is run, the species disappears.
I noted that #32 introduces a data generic to extract the metadata w/ list column from the CompadreData structure. However, data() is also a function from utils used to load packages' internal data sets (e.g. data(mtcars)).
I don't think it would cause any actual problems to have the method named as such, but it may be confusing for users interpreting code that employs it, because it doesn't really do the same thing as the utils function, which is often used in teaching materials (e.g. vignettes, online tutorials, etc). Maybe calling it extract_data or something is safer?
... and therefore also asCompadreData() and convertLegacyDB().
The conversion for COMPADRE_v.X.X.X takes 11-12 seconds on my computer, ~70% of which is due to the validCompadreM() checks. The use of data.frame for the dims object is causing most of the overhead. If dims is instead constructed as a matrix, the conversion time on my computer is reduced from ~11.5 to ~4.5 seconds. E.g.
dims <- cbind(matA = dim(object@matA),
matU = dim(object@matU),
matF = dim(object@matF),
matC = dim(object@matC))
I think we can get it down to about 3.5 s with additional optimization.
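For illustration, a base R sketch of a dimension check that works directly on such a matrix. It assumes the check being made is "all four matrices are square and of equal dimension"; the toy list mats stands in for the slots of a CompadreM object.

```r
# Toy stand-in for the four matrix slots of one CompadreM object
mats <- list(matA = diag(3),
             matU = diag(3),
             matF = matrix(0, nrow = 3, ncol = 3),
             matC = matrix(0, nrow = 3, ncol = 3))

# 2 x 4 integer matrix: row 1 = number of rows, row 2 = number of columns
dims <- vapply(mats, dim, integer(2))

# every entry equal to the first one means all matrices are square and same-sized
dims_ok <- all(dims == dims[1, 1])
```

Working on a plain integer matrix avoids the per-object cost of building a data.frame, which is where the overhead described above comes from.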
This is something I still struggle with. For most (but not all) analyses I would like a single 'grand mean matrix' for each MatrixPopulation or SpeciesAuthor.
This gets challenging because for a given MatrixPopulation, there are cases for every possible combination of MatrixComposite (multiple Individual + Mean, multiple Individual no Mean, multiple Mean, multiple Mean and Pooled, etc.). Plus, there are instances where matrix dimension or stage class definitions vary within populations.
Rather than a function that finds and subsets to the highest-level matrix/matrices for a given population or species, I think the best/only-feasible approach is a function that collapses a database by averaging matrices over levels of one or more grouping variables supplied by the user. For example, a sensible set of grouping variables might be c('SpeciesAuthor', 'id_stages'), where id_stages is a column identifying rows with the same set of MatrixClassAuthor (or perhaps MatrixClassOrganized).
It's not entirely satisfying to average over a set of matrices that may include Individual and Mean/Pooled matrices, but on the other hand, adding the average to a set being averaged shouldn't much change the outcome:
x <- 1:4
mean(x) == mean(c(x, mean(x)))
Here's the logic I propose. For each group (i.e. 1+ rows of the db; as defined by the user):
AnnualPeriodicity or paste(MatrixClassAuthor) differs
MatrixComposite == "Seasonal" (these should be matrix-multiplied)
mat by averaging matrices (and reconstructing CompadreMat object)
MatrixComposite by returning "Collapsed"
SurvivalIssue by re-calculating from collapsed matU
paste(x, collapse = "; ")
mat will have to be removed before collapsing)
It's admittedly a bit dangerous, and will require good documentation and user caution. I've implemented a version that I'll submit in a PR shortly. Let me know if you have thoughts/alternatives.
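The matrix-averaging step of this proposal can be sketched in base R as follows. This is only an illustration under simplifying assumptions (all matrices within a group have the same dimension, and a single grouping vector is used); the names mean_matrix, species and matA are made up for the example and this is not the version in the PR.

```r
# element-wise mean of a list of equally-sized matrices
mean_matrix <- function(mats) Reduce(`+`, mats) / length(mats)

# toy data: one grouping value per matrix
species <- c("a", "a", "b")
matA <- list(matrix(1, 2, 2), matrix(3, 2, 2), matrix(5, 2, 2))

# split the list of matrices by group, then average within each group
collapsed <- lapply(split(matA, species), mean_matrix)
collapsed$a
```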
This has kind of been bugging me recently when working on some internal code. Basically, I think the current data structure is a bit more complicated than it needs to be from a programming and usability perspective. The points below outline some potentially beneficial alterations which would require re-writing the CompadreData class (but not the CompadreM class! :) ), but here's why I think it is a good idea.
We can still use S4 to ensure rigidity to a point, but use S3 at lower levels that provide users a more straightforward interface. For example, implementing an S4 CompadreData class with 2 slots (call them something like data and version) would likely work. The data slot could be a data.frame containing a list column of S4 CompadreM objects and their associated metadata. This is not an original idea of mine (@patrickbarks in #25), but it could simplify a number of features we've discussed (e.g. @patrickbarks' suggestions in #26, reordering data for phylogenetic analysis (e.g. @iainmstott's usage in #22), adding columns, etc) and make methods easier to write (for example, the [ method would basically just extract the data slot, use [.data.frame, and return the subsetted object (e.g. #24)). We'd also still retain the beneficial rigidity of the CompadreM class in that list-column so that we don't let people accidentally include negative matrix elements, etc. I think this implementation would also reflect the ideas that @iainmstott and @tdjames1 are discussing towards the end of the thread in #22.
We aren't really retaining the structure of the SQL database with the current CompadreData structure anyway. The SQL database schema breaks the metadata up into a variety of other, smaller tables while we lump it all together into the metadata slot (which I think is a good thing). Additionally, the SQL database itself is designed to use the publication as the "atomic unit" of organization while our current CompadreData structure uses the matrix itself (also a good thing - that's way more intuitive). The SQL database has a very rigid structure once it's implemented, but we can do whatever we want with that structure once it's pulled into R (and we already are taking that approach with the metadata table). I don't think we should constrain ourselves to a rigid approach if a more flexible (and simpler) one is available. In short, SQL is rigid, but R doesn't have to be. We should take advantage of that!
The current validCompadreData method only checks to make sure each row of metadata has a matrix associated with it in the mat slot. We could change that to check for NAs in the list-column. Otherwise, I'm not sure we actually need to change very much beyond tweaking a few functions.
Finally, most other database API packages are built on top of S3 (e.g. BIEN, rnoaa) and Compadre doesn't exist in a vacuum; there are real use cases for cross-database analysis (e.g. @robcito 's talk at ESA). I'm not suggesting we abandon S4 entirely (our data is a bit more constrained than others), but it would help with cross-database compatibility to make our interface as similar as possible.
I've started to experiment with this on my fork (hopefully will push tonight/tomorrow). I've been using dplyr syntax for a lot of it, but that's by no means a requirement (I worry less about depending on tidyverse packages - they're generally pretty stable with the introduction of quasi-quotation in rlang 0.2.0, but I understand the concern in general). Functions like map_* from purrr work really well in this set-up and don't constrain the user on what data can be added/subtracted from their analysis.
Sorry for the wall of text! tl;dr - I think it'll save us a bunch of pain further down the line to change the internal CompadreData structure now. Curious to hear other thoughts on this approach.
Error is produced:
Error in eval(e, x, parent.frame()) : object 'PopId' not found
I think this is caused by a problem building the newdata object (line 55-64).
Four functions in Rcompadre act on individual matrices (i.e. their first argument is matA, or matU, etc.), and so I think would fit more naturally into Rage. These are: collapseMatrix, identifyReprodStages, rearrangeMatrix, and splitMatrix.
Two of these (collapseMatrix and rearrangeMatrix) are used by Rage::standardizedVitalRates, so having them in the same package would be nice. None of the four functions are called by other functions in Rcompadre, so moving them shouldn't break anything.
I think lots of R users only read vignettes online, rather than within R.
Until we're on CRAN (which publishes HTML vignettes), and since GitHub no longer renders Rmarkdown, I think it would be good to 'manually' publish formatted versions of our vignettes (perhaps including the workshop materials, @jonesor).
One option is to just render the vignettes as regular markdown and host them on the Rcompadre repo, and perhaps include links in the README. Though I'm not sure exactly how to do this... I think we'd have to specify multiple output formats in the Rmd header.
Some quick Googling makes me think it's possible.
Currently, Compadre v4.0.1 and Compadre vX.X.X (unreleased version) fail to meet the CompadreData class definition. It seems that 24 of the A and/or U matrices have negative elements which are currently rejected during validation of CompadreM class objects. For example:
load('COMPADRE_V.4.0.1.RData')
Compadre <- asCompadreData(compadre)
Error in validObject(.Object) : invalid class “CompadreM” object: matU is not nonnegative
I've extracted the DOIs for the offending studies and have found the papers.
In the Jongejans 2010 paper (DOI: 10.1111/j.1365-2745.2009.01612.x, Carlina vulgaris, matrix b2), there was no data entry error - the element as calculated by them is slightly negative.
In the Kesler et al 2008 paper (DOI: 10.1007/s00442-008-1022-1, 23 U matrices with slightly negative elements), I am guessing the negative matrix elements are a byproduct of the regressions they used. However, as noted in the Observation column of the metadata, the matrices were provided by the author so I can't check them by hand at the moment.
I remember discussing this issue in Ghent and seem to recall that we wanted to retain the matrices as reported by the authors. I think this would mean we need to loosen our definition of the class (perhaps throw a warning instead of an error) or subset the problem studies out each time a user submits a query that includes them (which would largely defeat the purpose of retaining them in the first place). Alternative solutions that don't require loosening the class definition or getting rid of the offending studies would be ideal, but I can't think of any at the moment.
@jonesor @robcito @tdjames1 @iainmstott - thoughts on how to correct this?
tldr: 2 studies with a total of 24 matrices in Compadre have negative matrix elements and prevent it from satisfying our current class definition. Relax the class definition, get rid of the studies, clarify with authors, or something else entirely?
DOIs for the studies in question: 10.1111/j.1365-2745.2009.01612.x and 10.1007/s00442-008-1022-1
Replace matFmu argument in rearrangeMatrix with more generic argument to identify reproductive stages, because identifying reproductive stages using the mean fecundity matrix is just one possible approach to identifying reproductive stages.
We need accessor functions for the S4 classes. e.g. matA(CompadreMObject) gives a single A matrix, matA(CompadreDataObject) gives a list of matA, SpeciesAccepted(CompadreDataObject) gives the character vector of SpeciesAccepted, metadata(CompadreDataObject) gives the metadata, and so on.
I've forked the repo and have been working on this there (perhaps should have just created a new branch, but hey). Actually made good progress, but R CMD check is not working with vignettes, tests and so on. I'll submit a pull request when it's more ready.
Rage functions longevity (life expectancy component), R0, and lifeTimeRepEvents will fail if matU is singular (i.e. non-invertible). This usually indicates infinite life expectancy due to a 100% stasis loop in the final stage(s). Adding a flag will make it easier for users to remove rows with singular matU ahead of time, if desired.
# try calculating the fundamental matrix N = (I - U)^-1
N <- try(solve(diag(nrow(matU)) - matU), silent = TRUE)
# flag if the inversion failed because (I - U) is singular
# (inherits() is safer than class(N) == ..., which breaks for multi-class objects)
check_singular_U <- inherits(N, 'try-error') && grepl('singular', N[1])
Picking up from #32, I think print(db@data) throwing an error is an issue that should be solved, even if we don't want users to access the data slot that way. Presumably many users will use db@data$ to create derived columns. I think some users will realize db@data is just a data.frame, and that's where all the important stuff is, and then be confused when normal data frame stuff fails (this is exactly what happened to me). The options that I can think of (there may well be others):
db@data a tibble
db@data$mat back to its own slot (db@mat)
print method for db@data
CompadreM class and revert db@data$mat to a list of lists (that way print(db@data) won't throw an error, though the printing won't be pretty)
I think option 1 or 2 are the least worst.
Using cleanDB produces errors.
Error in subset.default(newdata_sub, select = c("index", "check_ergodic", :
argument "subset" is missing, with no default
I think this is caused by the way that data(db) and data(db_sub) work.
e.g. lines 59-63 are supposed to build a new object with a column called index but (I think) this doesn't happen.
They...
db@data using existing Rcompadre fns
[, subset(), merge(), etc. on db@data
CompadreDB methods such as [, and methods we might add in the future for subset(), merge(), etc.
[ or subset doesn't include YearPublication, then NumberStudies can't be updated
[ method doesn't update species/study/matrix counts based on the given row subset, whereas subsetDB does
Also, the species and matrix counts (i.e. the underlying calculations) are already part of the show method, which at least ensures those counts are accurate at the time of printing.
The counts will still exist in the version element of legacy dbs, but I suggest we drop them during the asCompadreData conversion (and from other Rcompadre functions).
(I still think we should consider removing the whole version slot, but that's a separate issue)
@tdjames1 and @iainmstott are handling this
As noted briefly in PR #24, I'm interested in making Rcompadre work nicely with the pipe operator (%>%) that dplyr re-exports from magrittr. The pipe operator passes an object on the left side to a function on the right side (e.g. x %>% mean() == mean(x)), and when used in series can make code more readable.
A particular piping sequence I often perform with compadre is to calculate some quantity for every row of the db, add it as a column to the metadata, and then subset the db based on that new column (and repeat). In the past I've worked with a tibble version of the db (i.e. metadata + list-columns for matA, matU, ..., matrixClass) to make this sequence easier:
library(dplyr)
compadre_tb <- as_tibble(compadre$metadata) %>%
mutate(matA = lapply(compadre$mat, function(x) x$matA),
matU = lapply(compadre$mat, function(x) x$matU),
matF = lapply(compadre$mat, function(x) x$matF),
matC = lapply(compadre$mat, function(x) x$matC),
matrixClass = compadre$matrixClass)
For instance, say I want to work with a set of matrices reflecting populations in decline (lambda < 1), and I only want ergodic matrices with no NAs. With the tibble version I can use a sequence of dplyr and purrr functions to repeatedly add columns (mutate) and subset (filter) based on those new columns.
library(purrr)
compadre_use <- compadre_tb %>%
mutate(na_matA = map_lgl(matA, ~ any(is.na(.x)))) %>%
filter(na_matA == FALSE) %>%
mutate(ergodic = map_lgl(matA, popdemo::isErgodic)) %>%
filter(ergodic == TRUE) %>%
mutate(lambda = map_dbl(matA, popbio::lambda)) %>%
filter(lambda < 1)
With a CompadreData object, the equivalent sequence might look something like this:
compadre_s4_use <- compadre_s4 %>%
cleanDB() %>%
subsetDB(check_NA_A == FALSE & check_ergodic == TRUE)
compadre_s4_use@metadata$lambda <- sapply(compadre_s4_use@mat,
function(x) popbio::lambda(x@matA))
compadre_s4_use <- subsetDB(compadre_s4_use, lambda < 1)
So piping works fine with subsetting (and cleanDB), but I can't replicate the fully-piped sequence without an Rcompadre equivalent to dplyr::mutate(). What we would need is a function that takes a CompadreData object as the first argument, and returns a CompadreData object with an additional metadata column (based on some transformation specified in the second argument). I don't know what this function would entail in practice, but I think this general type of functionality would be desirable (to me anyway). Thoughts?
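To make the request concrete, here is a rough base R sketch of the kind of function I have in mind. Everything in it is hypothetical: db is a plain list standing in for a CompadreData object, and add_column(), keep_rows() and dominant_eig() are made-up names, not Rcompadre or popbio functions.

```r
# Toy stand-in for a CompadreData object: metadata plus a parallel list of matrices
db <- list(
  metadata = data.frame(id = 1:2),
  mat = list(diag(c(0.5, 0.9)), matrix(c(0, 0.5, 2, 0.8), nrow = 2))
)

# Hypothetical mutate-like helper: apply fn to every matrix, store as a metadata column
add_column <- function(db, name, fn) {
  db$metadata[[name]] <- vapply(db$mat, fn, numeric(1))
  db
}

# Hypothetical filter-like helper that keeps metadata and matrices in sync
keep_rows <- function(db, keep) {
  db$metadata <- db$metadata[keep, , drop = FALSE]
  db$mat <- db$mat[keep]
  db
}

# dominant eigenvalue (population growth rate) of a projection matrix
dominant_eig <- function(m) max(Mod(eigen(m, only.values = TRUE)$values))

db <- add_column(db, "lambda", dominant_eig)
declining <- keep_rows(db, db$metadata$lambda < 1)
```

Because both helpers take the db as their first argument and return a db, they could be chained with %>% in the same way as the tibble example above.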
We need a vignette that guides users through a workflow of using the package.
This can draw on materials used for COMPADRE workshops that are in our compadreDB repository.
More than one vignette is OK.
Split - to split the database into two database objects, one matching a criteria, the other not.
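A possible shape for such a function, sketched on a plain data.frame. The name cdb_split_sketch() and the toy metadata are assumptions for illustration only; a real version would also need to carry the matrices along.

```r
# Hypothetical sketch: split rows into those matching a condition and the rest
cdb_split_sketch <- function(data, condition) {
  keep <- condition %in% TRUE  # rows where the condition is NA go to 'rest'
  list(match = data[keep, , drop = FALSE],
       rest  = data[!keep, , drop = FALSE])
}

meta <- data.frame(Genus = c("Setaria", "Lechea", "Setaria"),
                   lambda = c(0.9, 1.2, NA))
parts <- cdb_split_sketch(meta, meta$lambda < 1)
```

Sending NA rows to the non-matching part is a design choice; it guarantees that every row of the input ends up in exactly one of the two outputs.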
I'm wondering if there's an alternative to having accessor functions for every single metadata column (the mat-related ones I like)? A few (fairly minor) downsides I see:
they pollute the namespace and make Rcompadre:: less useful for finding functions (I assume the :: autocomplete is limited to RStudio)
relatedly, they reduce the ease of using RStudio's column autocomplete options for piped expressions to subset(), mutate(), etc. (though this feature is currently finicky anyway)
based on my limited experience, other S4 classes don't seem to use accessors for columns within a data slot (e.g. sp)
One alternative (based on the sp package) would be to have CompadreData methods for $, $<-, and names that directly access the data slot, e.g. so db$SpeciesAuthor is equivalent to db@data$SpeciesAuthor. This enables autocomplete in RStudio using db$. Even apart from autocomplete, I think it would be useful in that it gives users quicker access to the good stuff (i.e. everything except version).
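Here is a small sketch of what those methods might look like, using a toy S4 class (ToyDB) rather than the real CompadreData class; the slot names data and version follow the discussion above, and the example values are made up.

```r
library(methods)

# Toy class with the two slots discussed above (a stand-in for CompadreData)
setClass("ToyDB", representation(data = "data.frame", version = "list"))

# $ reads a column of the data slot
setMethod("$", "ToyDB", function(x, name) x@data[[name]])

# $<- writes a column of the data slot and returns the modified object
setReplaceMethod("$", "ToyDB", function(x, name, value) {
  x@data[[name]] <- value
  x
})

# names() lists the columns of the data slot
setMethod("names", "ToyDB", function(x) names(x@data))

db <- new("ToyDB",
          data = data.frame(Genus = c("Setaria", "Lechea"), stringsAsFactors = FALSE),
          version = list())
db$Genus
db$n_stages <- c(3L, 4L)
names(db)
```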
At some point in the not so distant future, we should begin to develop unit tests for as many of these functions as we can. I'm happy to lead the charge on this one, just need to know which functions are completed and which are still in progress.
The show method should produce something like this:
>compadre
A com(p)adre database ('CompadreData') object with 695 SPECIES and 7024 MATRICES.
See ?CompadreData and ?CompadreUnionMethods for methods of accessing data.
but produces this:
>compadre
A com(p)adre database ('CompadreData') object with 695 SPECIES and 7024 MATRICES.
See ?CompadreData and ?CompadreUnionMethods for methods of accessing data.
A com(p)adre database ('CompadreData') object with 695 SPECIES and 1 MATRICES.
See ?CompadreData and ?CompadreUnionMethods for methods of accessing data.
The second part here should obviously be removed.
Would it be worthwhile to include a subsample of one or both databases as example data sets? I suspect it will make our lives easier in the long run.
For example, we could then use something like data(compadre) in the vignettes and examples rather than relying on file paths to the objects (which change on every machine).
If others agree with the general concept, I'm happy to work on identifying a good subsample from each data base to distribute with the package.
(1) Change verb in cleanDB?
'Clean' to me implies that something will be changed or removed, but cleanDB just adds columns flagging potential issues with the MPMs. I suggest changing to something like flagDB (or perhaps something more verbose, like flagMatrixIssues).
(2) Change verb in mergeDBs?
mergeDBs is essentially rbind.data.frame. We have a separate CompadreDB method for the base R merge function (which performs a join operation), so I suggest changing the verb in mergeDBs to 'bind' or 'rbind', or moving this functionality to an rbind.CompadreDB method.
(3) Switch to snake_case for all non-accessor functions?
This would avoid the awkwardness of camelCase with acronyms ('cleanDB' or 'cleanDb'?), and the mild ambiguity of names like 'DBToFlat' (is the object 'DB' or 'DBT'?).
I think the object_verb format (recommended by rOpenSci) would work really well in Rcompadre. I'd suggest switching the object from db_ to cdb_ (COMPADRE Database) because dplyr has a bunch of db_ functions. E.g.
fetchDB -> cdb_fetch
cleanDB -> cdb_flag
compareDBs -> cdb_compare (or cdbs_compare)
mergeDBs -> cdb_rbind (or cdbs_rbind, or rbind method)
DBToFlat -> cdb_flatten
checkSpecies -> cdb_check_species (or cdb_species; or remove)
getMeanMatF -> cdb_get_mean_F (or mean_mat_F)
asCompadreDB -> as_cdb
convertLegacyDB -> (remove and just use as_cdb)
stringToMatrix -> string_to_mat
If we make that change, we could consider changing the 'CompadreDB' class to 'cdb', and then for consistency, the 'CompadreMat' class to something like 'cmat' (COMPADRE Matrix) or 'cmpm' (COMPADRE Matrix Population Model). We could alternatively use a longer object name to avoid potential collisions with other packages, like 'compdb' or 'compadre'. Or keep the current class names and use the cdb_ convention for functions anyway.
Thoughts?
To reproduce the error:
x <- subsetDB(compadre, SpeciesAccepted == "Alaria nana")
y <- subsetDB(compadre, SpeciesAccepted == "Ziziphus jujuba")
z <- mergeDBs(x, y)
The error message is:
Error in VersionData(db1) : could not find function "VersionData"
In addition: Warning messages:
1: In data(db1) : data set ‘db1’ not found
2: In data(db2) : data set ‘db2’ not found
3: In data(db1) : data set ‘db1’ not found
4: In data(db2) :
Error in VersionData(db1) : could not find function "VersionData"
It would be good to have the ability in subsetDB to index using numbers or variable values that aren't Boolean, i.e. rather than SpeciesAccepted == "Acinonyx_jubatus" | SpeciesAccepted == "Panthera_leo", to use SpeciesAccepted = c("Panthera_leo", "Acinonyx_jubatus"). Note these aren't in alphabetical order; it would be nice if the function returned them in the user's desired order.
Otherwise (and perhaps easier to implement / more useful), rather than a Boolean variable it would be good to be able to use e.g. subsetDB(comadre, c(990, 646, 461, 1754, 1927)), and the function would return a database with those numbers, in that order.
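To make the requested behaviour concrete, here is a rough base R sketch on a plain data frame. It does not use subsetDB or any Rcompadre internals; the data frame meta and its contents are made up for illustration.

```r
# Toy metadata table standing in for a database (hypothetical contents)
meta <- data.frame(
  SpeciesAccepted = c("Acinonyx_jubatus", "Loxodonta_africana",
                      "Panthera_leo", "Acinonyx_jubatus"),
  MatrixID = 1:4,
  stringsAsFactors = FALSE
)

# Species in the order the user wants them back
wanted <- c("Panthera_leo", "Acinonyx_jubatus")

# Collect row numbers species by species, so the result follows `wanted`
# rather than the row order of the table
idx <- unlist(lapply(wanted, function(s) which(meta$SpeciesAccepted == s)))
by_species <- meta[idx, ]

# Plain integer indexing already preserves the order given
by_row <- meta[c(3, 1), ]
```

A logical condition such as SpeciesAccepted %in% wanted would keep the table's own row order, which is why the sketch builds an integer index instead.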
We are getting close to the point where we will want to merge the changes we have made on the dev branch into the master branch. It has occurred to me that people might still be using, or want to use, the current version (e.g. they are currently working on analyses).
We could, I think, create a new branch from the master before we merge the dev branch, and call it "v.0.1", or whatever, so that to install that old version you would simply need to use:
install_github("jonesor/Rcompadre",ref = "v.0.1")
What do you all think?