
ssdtools

Lifecycle: Maturing · R-CMD-check · codecov · CRAN status · CRAN downloads

ssdtools is an R package to fit and plot Species Sensitivity Distributions (SSD).

SSDs are cumulative probability distributions fitted to toxicity concentrations for different species, as described by Posthuma et al. (2001). The ssdtools package uses maximum likelihood to fit distributions such as the gamma, log-Gumbel (identical to the inverse Weibull), log-logistic, log-normal and Weibull to censored and/or weighted data. Multiple distributions can be averaged using Akaike Information Criterion weights. Confidence intervals on hazard concentrations and proportions are produced by parametric bootstrapping.
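For intuition only, here is a minimal base-R sketch (not ssdtools' model-averaged machinery) of how a 5% hazard concentration (HC5) falls out of a single fitted log-normal: the maximum-likelihood fit to log-concentrations is just the mean and (n-divisor) standard deviation of the logs, and the HC5 is the 5th percentile of that fitted distribution. The concentrations below are hypothetical illustrative values.

```r
# Hypothetical toxicity concentrations (mg/L) for six species
conc <- c(2.1, 2.4, 4.1, 10, 15.6, 18.3)

# MLE of a log-normal: mean and sd of the log-concentrations
# (using the n divisor for the variance, as maximum likelihood does)
logc <- log(conc)
meanlog <- mean(logc)
sdlog <- sqrt(mean((logc - meanlog)^2))

# The HC5 is the 5th percentile of the fitted CDF: the concentration
# expected to affect 5% of species
hc5 <- qlnorm(0.05, meanlog = meanlog, sdlog = sdlog)
hc5  # roughly 1.5 for these values
```

ssdtools does considerably more than this (censoring, weighting, multiple distributions, model averaging, bootstrapped intervals), but the underlying idea is the same.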

Installation

To install the latest version from CRAN

install.packages("ssdtools")

To install the latest development version from GitHub

# install.packages("pak", repos = sprintf("https://r-lib.github.io/p/pak/stable/%s/%s/%s", .Platform$pkgType, R.Version()$os, R.Version()$arch))
pak::pak("bcgov/ssdtools")

Introduction

ssdtools provides a data set for several chemicals including Boron.

library(ssdtools)
ssddata::ccme_boron
#> # A tibble: 28 × 5
#>    Chemical Species                  Conc Group        Units
#>    <chr>    <chr>                   <dbl> <fct>        <chr>
#>  1 Boron    Oncorhynchus mykiss       2.1 Fish         mg/L 
#>  2 Boron    Ictalurus punctatus       2.4 Fish         mg/L 
#>  3 Boron    Micropterus salmoides     4.1 Fish         mg/L 
#>  4 Boron    Brachydanio rerio        10   Fish         mg/L 
#>  5 Boron    Carassius auratus        15.6 Fish         mg/L 
#>  6 Boron    Pimephales promelas      18.3 Fish         mg/L 
#>  7 Boron    Daphnia magna             6   Invertebrate mg/L 
#>  8 Boron    Opercularia bimarginata  10   Invertebrate mg/L 
#>  9 Boron    Ceriodaphnia dubia       13.4 Invertebrate mg/L 
#> 10 Boron    Entosiphon sulcatum      15   Invertebrate mg/L 
#> # ℹ 18 more rows

Distributions are fit using ssd_fit_dists()

fits <- ssd_fit_dists(ssddata::ccme_boron, dists = c("lnorm", "llogis"))

and can be quickly plotted using autoplot()

library(ggplot2)

theme_set(theme_bw())

autoplot(fits) +
  scale_colour_ssd()

The goodness of fit can be assessed using ssd_gof()

ssd_gof(fits)
#> # A tibble: 2 × 9
#>   dist      ad     ks    cvm   aic  aicc   bic delta weight
#>   <chr>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
#> 1 lnorm  0.507 0.107  0.0703  239.  240.  242.  0      0.73
#> 2 llogis 0.487 0.0994 0.0595  241.  241.  244.  1.99   0.27
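The weight column holds Akaike weights computed from the AICc differences (delta): the relative likelihood exp(-delta/2) of each distribution, normalized to sum to one. A base-R sketch using the delta values above reproduces the weights:

```r
# AICc differences from the best (lowest-AICc) distribution, as in the
# ssd_gof() output above: lnorm first, llogis second
delta <- c(0, 1.99)

# Akaike weights: relative likelihood of each distribution, normalized
rel_lik <- exp(-delta / 2)
weight <- rel_lik / sum(rel_lik)

round(weight, 2)  # 0.73 0.27, matching the weight column
```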

and the model-averaged 5% hazard concentration can be estimated by parametric bootstrapping using ssd_hc().

set.seed(99)
hc5 <- ssd_hc(fits, ci = TRUE, nboot = 100) # 100 bootstrap samples for speed
print(hc5)
#> # A tibble: 1 × 10
#>   dist    percent   est    se   lcl   ucl    wt method     nboot pboot
#>   <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>      <dbl> <dbl>
#> 1 average       5  1.65 0.599 0.923  3.34     1 parametric   100     1

To bootstrap in parallel, set the evaluation strategy with future::plan(). For example:

future::plan(future::multisession, workers = 2)
hc5 <- ssd_hc(fits, ci = TRUE, nboot = 100)

Model-averaged predictions, complete with confidence intervals, can also be estimated by parametric bootstrapping using the stats generic predict(). To perform the bootstrapping for each distribution in parallel, register the future backend and then select the evaluation strategy.

doFuture::registerDoFuture()
future::plan(future::multisession)

set.seed(99)
boron_pred <- predict(fits, ci = TRUE)

and plotted together with the original data using ssd_plot().

ssd_plot(ssddata::ccme_boron, boron_pred,
  shape = "Group", color = "Group", label = "Species",
  xlab = "Concentration (mg/L)", ribbon = TRUE
) +
  expand_limits(x = 3000) +
  scale_colour_ssd()

References

Posthuma, L., Suter II, G.W., and Traas, T.P. 2001. Species Sensitivity Distributions in Ecotoxicology. CRC Press.

Information

Get started with ssdtools at https://bcgov.github.io/ssdtools/articles/ssdtools.html.

A shiny app to allow non-R users to interface with ssdtools is available at https://github.com/bcgov/shinyssdtools.

For the latest changes visit NEWS.

The citation for the shiny app:

Dalgarno, S. 2021. shinyssdtools: A web application for fitting Species Sensitivity Distributions (SSDs). JOSS 6(57): 2848. https://joss.theoj.org/papers/10.21105/joss.02848.

The ssdtools package was developed as a result of earlier drafts of:

Schwarz, C., and Tillmanns, A. 2019. Improving Statistical Methods for Modeling Species Sensitivity Distributions. Province of British Columbia, Victoria, BC.

For recent developments in SSD modeling including a review of existing software see:

Fox, D.R., et al. 2021. Recent Developments in Species Sensitivity Distribution Modeling. Environ Toxicol Chem 40(2): 293–308. https://doi.org/10.1002/etc.4925.

The CCME data.csv data file is derived from a factsheet prepared by the Canadian Council of Ministers of the Environment. See the data-raw folder for more information.

Getting Help or Reporting an Issue

To report bugs/issues/feature requests, please file an issue.

How to Contribute

If you would like to contribute to the package, please see our CONTRIBUTING guidelines.

Code of Conduct

Please note that the ssdtools project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Licensing

Copyright 2023 Province of British Columbia, Environment and Climate Change Canada, and Australian Government Department of Climate Change, Energy, the Environment and Water

The documentation is released under the CC BY 4.0 License

The code is released under the Apache License 2.0

Contributors

atillmanns, cschwarz-stat-sfu-ca, hadley, ibarraespinosa, joethorley, nadinehussein, repo-mountie[bot], sarahly9, sebdalgarno, stephhazlitt


ssdtools's Issues

Unable to plot

Keep getting this error when trying to plot:

"Error in is.finite(x) :
default method not implemented for type 'language'"

This happens when using the script straight from the Shiny App with Boron as the example, as well as with my own data.
Any suggestions?

fit standard errors?

I reran all the commands to obtain the HC5 and could not reproduce the numbers reported by Joe and team.

> ssd_hc5(boron_dists)
    prop   est    se   lcl   ucl
   <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.0500  1.25 0.726 0.623  3.19

gompertz not fitting

Hi there, when I try to fit my data set using the R code it works, but I get errors when I attempt to plot or make predictions. I've attached the data set: Cu_lowdata.xlsx

This is the code and errors:

Cu_low_dists <- ssd_fit_dists(LowerCuMarData)
autoplot(Cu_low_dists)
Error in pgompertz(q = c(0.7, 1.3, 1.94), shape = 0.391386784692936, scale = 0.313425202466778) :
unused argument (q = c(0.7, 1.3, 1.94))
ssd_gof(Cu_low_dists)
Error in pgompertz(q = c(0.7, 1.3, 1.94), shape = 0.391386784692936, scale = 0.313425202466778) :
unused argument (q = c(0.7, 1.3, 1.94))

It works when I do it in the R Shiny app.

Is there a way to exclude some fits when using the R code, as you can in the Shiny app?
Thanks,
Jenni

Correction to initial values for gamma distribution

The gamma distribution with shape (a) and scale (s) has
"mean and variance E(X) = as and Var(X) = as^2" according to the help file for dgamma.

Solving for $s$ gives
$s = Var(X) / E(X)$,
so using the method of moments, the starting value is $s = var(x) / mean(x)$.

Solving for $a$ then gives
$a = E(X) / s$,
and the moment estimator is $a = mean(x) / s = mean(x) / (var(x) / mean(x)) = mean(x)^2 / var(x)$.

We currently have

if (dist$dist == "gamma") {
  dist$start <- list(
    scale = stats::var(x) / mean(x),
    shape = mean(x)^2 / stats::var(x)^2
  )
}

so it appears we have an extra square in the last line. This should read

if (dist$dist == "gamma") {
  dist$start <- list(
    scale = stats::var(x) / mean(x),
    shape = mean(x)^2 / stats::var(x)
  )
}

i.e. remove the ^2 from the var(x) in the starting value for the shape.

Fortunately, the fitdist() function appears to be robust to making the shape parameter too small, so there is no obvious user impact.
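The method-of-moments algebra can be checked numerically: for a gamma with known shape a and scale s, plugging the theoretical mean as and variance as^2 into the corrected starting-value formulas recovers a and s exactly (the values below are arbitrary):

```r
# True parameters of a hypothetical gamma distribution
shape <- 2
scale <- 3

# Theoretical moments: E(X) = a*s, Var(X) = a*s^2
m <- shape * scale    # mean
v <- shape * scale^2  # variance

# Method-of-moments starting values (without the extra square)
scale_start <- v / m    # var(x) / mean(x)
shape_start <- m^2 / v  # mean(x)^2 / var(x)

stopifnot(all.equal(scale_start, scale), all.equal(shape_start, shape))
```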

discrepancies in Table 1 of Carl and Angeline's report

Namely, the HC5s appear to be incorrectly assigned to distributions in Table 1.

For example

> ssd_hc5(boron_lnorm)
# A tibble: 1 x 5
   prob      est        se       lcl      ucl
  <dbl>    <dbl>     <dbl>     <dbl>    <dbl>
1  0.05 1.681175 0.6848542 0.8618846 3.574394

the ssdca package with lnorm produces an hc5 estimate of 1.68

and below

> ssd_hc5(boron_dists, average = F)
# A tibble: 6 x 7
      dist  prop      est        se       lcl      ucl weight
     <chr> <dbl>    <dbl>     <dbl>     <dbl>    <dbl>  <dbl>
1    lnorm  0.05 1.681175 0.6874052 0.8851202 3.611764  0.133
2     llog  0.05 1.562052 0.7760264 0.6636144 3.652784  0.049
3 gompertz  0.05 1.299302 0.4419386 0.9581149 2.788982  0.271
4  lgumbel  0.05 1.768918 0.5369794 1.1299498 3.231247  0.010
5    gamma  0.05 1.075854 0.8355977 0.2652210 3.476163  0.268
6  weibull  0.05 1.087170 0.7784698 0.3654033 3.292916  0.269

but in Table 1 1.68 is the hc5 value for log-logistic

note aicc values in Table 1 appear to be correctly assigned

> ssd_gof(boron_dists)
# A tibble: 6 x 7
      dist        ad         ks        cvm      aic     aicc      bic
     <chr>     <dbl>      <dbl>      <dbl>    <dbl>    <dbl>    <dbl>
1    lnorm 0.5070335 0.10651430 0.07033164 239.0284 239.5084 241.6928
2     llog 0.4870694 0.09934088 0.05950233 241.0149 241.4949 243.6793
3 gompertz 0.6019563 0.12018745 0.08221977 237.6112 238.0912 240.2756
4  lgumbel 0.8286390 0.15826649 0.13401800 244.1860 244.6660 246.8504
5    gamma 0.4409319 0.11691481 0.05550569 237.6303 238.1103 240.2947
6  weibull 0.4346273 0.11697580 0.05426727 237.6253 238.1053 240.2897

Gamma distribution reports NA for standard error estimate; warn user; solution is rescale data

For some extreme sets of data, the gamma distribution reports NA for the standard error. E.g.

library(ssdtools)
test.csv <- textConnection(
"dummy , Conc , Units
all , 0.1 , ug/L
all , 0.12 , ug/L
all , 0.24 , ug/L
all , 0.42 , ug/L
all , 0.67 , ug/L
all , 0.78 , ug/L
all , 120 , ug/L
all , 2030 , ug/L
all , 9033 , ug/L
all , 15000 , ug/L
all , 15779 , ug/L
all , 20000 , ug/L
all , 31000 , ug/L
all , 40000 , ug/L
all , 105650 , ug/L")
test <- read.csv(test.csv, header=TRUE, strip.white=TRUE, as.is=TRUE)

ssd_fit_dists(test, dists="gamma")

reports

Fitting of the distribution ' gamma ' by maximum likelihood
Parameters:
estimate Std. Error
scale 9.695693e+04 NaN
shape 1.642224e-01 0.04142257

Estimates of scale and shape are ok, but it was unable to compute a standard error for the scale parameter. This appears to be a scaling problem when inverting the Hessian. The standard error for the scale parameter is about 67000 and the SE for the shape is about 0.04, so the variance-covariance matrix will have terms on the order of 67000^2 and 0.04^2, which differ by a factor of 2.8 x 10^12 (!), so I suspect that it is a numerical issue.

If you rescale the data

factor=100
test2 <- test
test2$Conc <- test2$Conc/ factor
ssd_fit_dists(test2, dists="gamma")

all is well
Fitting of the distribution ' gamma ' by maximum likelihood
Parameters:
estimate Std. Error
scale 969.5359604 674.2705097
shape 0.1641717 0.0454135

The pdf for the gamma has a term x/s, so if you multiply x by a factor k and multiply s by the same factor k, the pdf is unchanged except for the change in the scale factor.

I would advise the user when any estimates have NA for the standard errors that a rescaling may be appropriate. We could make a specific message for each distribution, i.e. typically advise gamma users to divide by a factor k, or a generic message where the user needs to try multiplying or dividing the data by a factor until the message disappears.

Estimates are ok, so estimates of HCx are ok. Bootstrapping is used to compute the SE for HCx, so this should also work fine.

Carl
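The scaling argument can be sanity-checked with the method-of-moments estimators: dividing the data by k leaves the shape estimate mean(x)^2/var(x) unchanged (the k's cancel), while the scale estimate var(x)/mean(x) is divided by k. A base-R check on the data from this issue:

```r
# Concentrations from the example above (ug/L)
x <- c(0.1, 0.12, 0.24, 0.42, 0.67, 0.78, 120, 2030, 9033,
       15000, 15779, 20000, 31000, 40000, 105650)

k <- 100
x2 <- x / k  # rescaled data

# Method-of-moments estimators for the gamma distribution
mom <- function(x) c(scale = var(x) / mean(x), shape = mean(x)^2 / var(x))

# shape is invariant to rescaling; scale is divided by k
stopifnot(all.equal(mom(x)["shape"], mom(x2)["shape"]),
          all.equal(mom(x)["scale"] / k, mom(x2)["scale"]))
```

This is why rescaling changes the conditioning of the Hessian without changing the fitted distribution (up to units).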

information criterion for lgumbel is too low

# A tibble: 6 x 7
      dist        ad         ks        cvm      aic     aicc      bic
     <chr>     <dbl>      <dbl>      <dbl>    <dbl>    <dbl>    <dbl>
1    lnorm 0.5070335 0.10651430 0.07033164 239.0284 239.5084 241.6928
2     llog 0.4870694 0.09934088 0.05950233 241.0149 241.4949 243.6793
3 gompertz 0.6024097 0.12019409 0.08229579 237.6112 238.0912 240.2756
4  lgumbel 0.8287721 0.15825915 0.13402809 100.7338 101.2138 103.3982
5    gamma 0.4409319 0.11691481 0.05550569 237.6303 238.1103 240.2947
6  weibull 0.4346273 0.11697580 0.05426727 237.6253 238.1053 240.2897
