
modleR's Introduction

modleR: a workflow for ecological niche models


modleR is a workflow based on the dismo package (Hijmans et al. 2017), designed to automate some of the common steps in ecological niche modeling. Given occurrence records and a set of environmental predictors, it prepares the data by removing duplicates, discarding occurrences with no environmental information, and applying geographic and environmental filters. It sets up crossvalidation or bootstrap designs and then fits ecological niche models with several algorithms, some implemented in the dismo package itself and others drawn from elsewhere in the R ecosystem, such as GLMs, Support Vector Machines and Random Forests.

Citation

Andrea Sánchez-Tapia, Sara Ribeiro Mortara, Diogo Souza Bezerra Rocha, Felipe Sodré Mendes Barros, Guilherme Gall, Marinez Ferreira de Siqueira. modleR: a modular workflow to perform ecological niche modeling in R. https://www.biorxiv.org/content/10.1101/2020.04.01.021105v1

Installing

Currently modleR can be installed from GitHub:

# Without vignette
remotes::install_github("Model-R/modleR", build = TRUE)
# With vignette
remotes::install_github("Model-R/modleR",
                        build = TRUE,
                        dependencies = TRUE,
                        build_opts = c("--no-resave-data", "--no-manual"),
                        build_vignettes = TRUE)

Note regarding vignette building: the default parameters in build_opts include --no-build-vignettes. In theory, removing this option should include the vignette in the installation, but we have found that build_vignettes = TRUE is also necessary. During installation, R may ask to install or update some packages. If any of these returns an error, you can install them separately with install.packages() and retry. Building the vignette requires the rJava package and a JDK, and the maxent.jar file must be available in the java folder of the dismo package (it can be downloaded from the MaxEnt website). Vignette building may take a while during installation.
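To check whether maxent.jar is already in place, you can look for it in dismo's java folder, which is where dismo expects the file (a minimal sketch):

# TRUE when maxent.jar sits where dismo expects it
jar <- file.path(system.file("java", package = "dismo"), "maxent.jar")
file.exists(jar)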

Packages kuenm and maxnet should be installed from GitHub:

remotes::install_github("marlonecobos/kuenm")
remotes::install_github("mrmaxent/maxnet")

The workflow

The workflow consists mainly of four functions, meant to be used sequentially (a minimal end-to-end sketch follows the list):

  1. Setup: setup_sdmdata() prepares and cleans the data, samples the pseudoabsences, and organizes the experimental design (bootstrap, crossvalidation or repeated crossvalidation). It creates a metadata file with the details of the current round and a sdmdata file with the data used for modeling.
  2. Model fitting and projecting: do_any() fits the ENM for one algorithm and partition; do_many() calls do_any() to fit multiple algorithms.
  3. Partition joining: final_model() joins the partition models into one model per species per algorithm.
  4. Ensemble: ensemble_model() joins the different models per algorithm into an ensemble model (algorithmic consensus) using several methods.
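Put together, a minimal run over the package's example data (introduced below) looks like this. This is a sketch only: most arguments are left at their defaults and only two algorithms are switched on.

library(modleR)
sp <- names(example_occs)[1]
occs <- example_occs[[1]]
# 1. data setup: cleaning, pseudoabsence sampling and partitioning
setup_sdmdata(species_name = sp,
              occurrences = occs,
              predictors = example_vars,
              models_dir = "./models")
# 2. one model per algorithm and partition
do_many(species_name = sp,
        predictors = example_vars,
        models_dir = "./models",
        bioclim = TRUE,
        glm = TRUE)
# 3. join the partitions: one model per species per algorithm
final_model(species_name = sp,
            models_dir = "./models")
# 4. join the algorithms into an ensemble model
ensemble_model(species_name = sp,
               occurrences = occs,
               models_dir = "./models")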

Folder structure created by this package

modleR writes the outputs in the hard disk, according to the following folder structure:

models_dir
├── projection1
│   ├── data_setup
│   ├── partitions
│   ├── final_models
│   └── ensemble_models
└── projection2
    ├── data_setup
    ├── partitions
    ├── final_models
    └── ensemble_models
  • We define a partition as an individual modeling round (one training and test data set and one algorithm)

  • We define the final models as the result of joining the partitions together, obtaining one model per species per algorithm

  • Ensemble models join together the results obtained by different algorithms (Araújo and New 2007)

  • When projecting models into the present, the projection folder is called present; other projections will be named after their environmental variables

  • You can set models_dir anywhere on the hard disk, but if you do not modify the default value, the output will be created under the working directory (the default is ./models, where the period refers to the working directory)

  • The names of the final and ensemble folders can be modified, but the nested subfolder structure will remain the same. If you change the default value of final_dir ("final_models"), you will need to pass the new value when calling ensemble_model() (final_dir = "[new name]") to tell the function where to look for the models; see the sketch below. This partial flexibility allows experimenting with final-model and ensemble construction (by running the final or ensemble step twice into different output folders, for example).
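For instance, a sketch of writing the final models to a renamed folder and telling ensemble_model() where to find them (the folder name "final_models_v2" is illustrative; species, occs and test_folder are the objects defined in the examples below):

final_model(species_name = species[1],
            models_dir = test_folder,
            final_dir = "final_models_v2")
ensemble_model(species_name = species[1],
               occurrences = occs,
               models_dir = test_folder,
               final_dir = "final_models_v2")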

The example dataset

modleR comes with example data, a list called example_occs with occurrence data for four species, and predictor variables called example_vars.

library(modleR)
str(example_occs)
#> List of 4
#>  $ Abarema_langsdorffii:'data.frame':    104 obs. of  3 variables:
#>   ..$ sp : chr [1:104] "Abarema_langsdorffii" "Abarema_langsdorffii" "Abarema_langsdorffii" "Abarema_langsdorffii" ...
#>   ..$ lon: num [1:104] -40.6 -40.7 -41.2 -41.7 -42.5 ...
#>   ..$ lat: num [1:104] -19.9 -20 -20.3 -20.5 -20.7 ...
#>  $ Eugenia_florida     :'data.frame':    341 obs. of  3 variables:
#>   ..$ sp : chr [1:341] "Eugenia_florida" "Eugenia_florida" "Eugenia_florida" "Eugenia_florida" ...
#>   ..$ lon: num [1:341] -35 -34.9 -34.9 -36.4 -42.1 ...
#>   ..$ lat: num [1:341] -6.38 -7.78 -8.1 -10.42 -2.72 ...
#>  $ Leandra_carassana   :'data.frame':    82 obs. of  3 variables:
#>   ..$ sp : chr [1:82] "Leandra_carassana" "Leandra_carassana" "Leandra_carassana" "Leandra_carassana" ...
#>   ..$ lon: num [1:82] -39.3 -39.6 -40.7 -41.2 -41.5 ...
#>   ..$ lat: num [1:82] -15.2 -15.4 -20 -20.3 -20.4 ...
#>  $ Ouratea_semiserrata :'data.frame':    90 obs. of  3 variables:
#>   ..$ sp : chr [1:90] "Ouratea_semiserrata" "Ouratea_semiserrata" "Ouratea_semiserrata" "Ouratea_semiserrata" ...
#>   ..$ lon: num [1:90] -40 -42.5 -42.4 -42.9 -42.6 ...
#>   ..$ lat: num [1:90] -16.4 -20.7 -19.5 -19.6 -19.7 ...
species <- names(example_occs)
species
#> [1] "Abarema_langsdorffii" "Eugenia_florida"      "Leandra_carassana"   
#> [4] "Ouratea_semiserrata"
library(sp)
par(mfrow = c(2, 2), mar = c(2, 2, 3, 1))
for (i in 1:length(example_occs)) {
  plot(!is.na(example_vars[[1]]),
       legend = FALSE,
       main = species[i],
       col = c("white", "#00A08A"))
  points(lat ~ lon, data = example_occs[[i]], pch = 19)
}
par(mfrow = c(1, 1))

Figure 1. The example dataset: predictor variables and occurrences for four species.

We will subset the example_occs list to keep only the data for the first species:

occs <- example_occs[[1]]

Cleaning and setting up the data: setup_sdmdata()

The first step of the workflow is to set up the data: partition it according to the needs of each project, sample pseudoabsence/background points, and apply data-cleaning procedures and filters. This is done by the function setup_sdmdata().

setup_sdmdata() has a large number of parameters:

args(setup_sdmdata)
#> function (species_name, occurrences, predictors, lon = "lon", 
#>     lat = "lat", models_dir = "./models", real_absences = NULL, 
#>     buffer_type = NULL, dist_buf = NULL, env_filter = FALSE, 
#>     env_distance = "centroid", buffer_shape = NULL, min_env_dist = NULL, 
#>     min_geog_dist = NULL, write_buffer = FALSE, seed = NULL, 
#>     clean_dupl = FALSE, clean_nas = FALSE, clean_uni = FALSE, 
#>     geo_filt = FALSE, geo_filt_dist = NULL, select_variables = FALSE, 
#>     cutoff = 0.8, sample_proportion = 0.8, png_sdmdata = TRUE, 
#>     n_back = 1000, partition_type = c("bootstrap"), boot_n = 1, 
#>     boot_proportion = 0.7, cv_n = NULL, cv_partitions = NULL) 
#> NULL
  • species_name is the name of the species to model
  • occurrences is the data frame with the occurrences; lat and lon are the names of the columns for latitude and longitude, respectively. If they are already named lat and lon they need not be specified.
  • predictors is the rasterStack of the environmental variables

There are a few options for data cleaning:

  • clean_dupl will delete exact duplicates in the occurrence data
  • clean_nas will delete any occurrence with no environmental data in the predictor set
  • clean_uni will leave only one occurrence per pixel

The function also sets up different experimental designs:

  • partition_type can be either bootstrap or k-fold crossvalidation
  • boot_n and cv_n perform repeated bootstraps and repeated k-fold crossvalidations, respectively (see the sketch after this list)
  • boot_proportion sets the proportion of data to be sampled as the training set (defaults to 0.7)
  • cv_partitions sets the number of partitions in the k-fold crossvalidation (defaults to 3). When there are fewer than 10 occurrence records it is overridden and set to the number of records (a jackknife partition).
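For instance, a repeated-bootstrap design would be requested like this (a sketch: five rounds of 70/30 training/test splits, every other argument left at its default; species and occs as defined above):

setup_sdmdata(species_name = species[1],
              occurrences = occs,
              predictors = example_vars,
              models_dir = "~/modleR_test",
              partition_type = "bootstrap",
              boot_n = 5,
              boot_proportion = 0.7)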

Pseudoabsence sampling also has some options:

  • real_absences can be used to specify a set of user-defined absences, with species name, lat and lon columns
  • geo_filt will eliminate records that are closer than geo_filt_dist to each other, in order to control for spatial autocorrelation
  • buffer_type can build a distance buffer around the occurrence points, taking either the maximal, median or mean distance between points. It can also take a user-defined shapefile as the area for pseudoabsence sampling
  • env_filter calculates the euclidean distance in environmental space and excludes the areas closest to the occurrences from the sampling of pseudoabsences
  • seed sets the random seed, for reproducibility

Pseudoabsence points will be sampled (using dismo::randomPoints(), illustrated below) within the buffer and outside the environmental filter, in order to control for the area accessible to the species (M in the BAM diagram).
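For illustration, the background sampling that modleR performs internally is akin to the following standalone call (a sketch; the mask layer and n are arbitrary choices):

# sample 500 background points from the non-NA cells of the first predictor layer
bg <- dismo::randomPoints(mask = example_vars[[1]], n = 500)
head(bg)

A full call to setup_sdmdata() looks like this: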
test_folder <- "~/modleR_test"
sdmdata_1sp <- setup_sdmdata(species_name = species[1],
                             occurrences = occs,
                             predictors = example_vars,
                             models_dir = test_folder,
                             partition_type = "crossvalidation",
                             cv_partitions = 5,
                             cv_n = 1,
                             seed = 512,
                             buffer_type = "mean",
                             png_sdmdata = TRUE,
                             n_back = 500,
                             clean_dupl = TRUE,
                             clean_uni = TRUE,
                             clean_nas = TRUE,
                             geo_filt = FALSE,
                             geo_filt_dist = 10,
                             select_variables = TRUE,
                             sample_proportion = 0.5,
                             cutoff = 0.7)
#> metadata file found, checking metadata
#> running data setup
#> cleaning data
#> cleaning duplicates
#> cleaning occurrences with no environmental data
#> cleaning occurrences within the same pixel
#> 5 points removed
#> 99 clean points
#> creating buffer
#> Applying buffer
#> Warning in RGEOSDistanceFunc(spgeom1, spgeom2, byid, "rgeos_distance"): Spatial
#> object 1 is not projected; GEOS expects planar coordinates
#> Warning: GEOS support is provided by the sf and terra packages among others
#> Warning in rgeos::gBuffer(spgeom = occurrences, byid = FALSE, width =
#> dist.buf): Spatial object is not projected; GEOS expects planar coordinates
#> sampling pseudoabsence points with mean buffer
#> selecting variables...
#> No variables were excluded with cutoff = 0.7
#> saving metadata
#> extracting environmental data
#> extracting background data
#> performing data partition
#> saving sdmdata
#> Plotting the dataset...
#> DONE!
  • The function returns an sdmdata data frame, with the groups for training and testing in bootstrap or crossvalidation, a pa vector that marks presences and absences, and the environmental data. The same data frame is written to disk as sdmdata.txt
  • It also writes a metadata.txt with the parameters of the latest modeling round. If a cleaning step was performed, the "original.n" and "final.n" columns will show different values.
  • NOTE: setup_sdmdata() checks whether a prior folder structure with sdmdata.txt and metadata.txt files exists, in order to avoid repeating the data partitioning.
    • If a call to the function finds previously written metadata with the same parameters as the current round, it skips the data partitioning. A message is displayed: #> metadata file found, checking metadata #> same metadata, no need to run data partition
    • If a previous metadata file is found but its parameters differ (i.e. there is an inconsistency between the existing metadata and the current parameters), the function runs with the current parameters.

Fitting a model per partition: do_any() and do_many()

Functions do_any() and do_many() create one model per partition, per algorithm. The difference between them is that do_any() fits a single algorithm at a time, chosen with the algorithm parameter, while do_many() can select multiple algorithms with TRUE or FALSE flags (just as biomod2 functions do).

The available algorithms are:

  • "bioclim", "maxent", "mahal", "domain", as implemented in dismo package (Hijmans et al. 2017),
  • Support Vector Machines (SVM), as implemented by packages kernlab (svmk Karatzoglou et al. 2004) and e1071 (svme Meyer et al. 2017),
  • GLM from base R, here implemented with a stepwise selection approach
  • Random Forests (from package randomForest Liaw and Wiener 2002)
  • Boosted regression trees (BRT) as implemented by gbm.step() function in dismo package (Hastie, Tibshirani, and Friedman 2001; Elith, Leathwick, and Hastie 2009).

Details for the implementation of each model can be accessed in the documentation of the function.

Here you can see the differences between the parameters of both functions. do_many() calls several instances of do_any(). Often you will only want to call do_many(), but for better control and for parallelizing by algorithm it may be better to call do_any() individually.

args(do_any)
#> function (species_name, predictors, models_dir = "./models", 
#>     algorithm = c("bioclim"), project_model = FALSE, proj_data_folder = "./data/proj", 
#>     mask = NULL, write_rda = FALSE, png_partitions = FALSE, write_bin_cut = FALSE, 
#>     dismo_threshold = "spec_sens", equalize = TRUE, sensitivity = 0.9, 
#>     proc_threshold = 0.5, ...) 
#> NULL
args(do_many)
#> function (species_name, bioclim = FALSE, domain = FALSE, glm = FALSE, 
#>     mahal = FALSE, maxent = FALSE, maxnet = FALSE, rf = FALSE, 
#>     svmk = FALSE, svme = FALSE, brt = FALSE, ...) 
#> NULL

Calling do_many() and setting bioclim = TRUE is therefore equivalent to calling do_any() and setting algorithm = "bioclim".
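For example, the two calls below fit the same bioclim partition models (a sketch, reusing the objects defined above):

do_any(species_name = species[1],
       algorithm = "bioclim",
       predictors = example_vars,
       models_dir = test_folder)

do_many(species_name = species[1],
        predictors = example_vars,
        models_dir = test_folder,
        bioclim = TRUE)

The call below fits maxnet models for the first species: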

sp_maxnet <- do_any(species_name = species[1],
                    algorithm = "maxnet",
                    predictors = example_vars,
                    models_dir = test_folder,
                    png_partitions = TRUE,
                    write_bin_cut = FALSE,
                    equalize = TRUE,
                    write_rda = TRUE)

The resulting object is a table with the performance metrics, but the actual output is written to disk:

sp_maxnet
#>                kappa spec_sens no_omission prevalence equal_sens_spec
#> thresholds 0.5466117 0.4121507   0.2633284  0.1707096       0.3985237
#>            sensitivity         species_name algorithm run partition presencenb
#> thresholds   0.3257664 Abarema_langsdorffii    maxnet   1         1         20
#>            absencenb correlation    pvaluecor   AUC AUC_pval AUCratio     pROC
#> thresholds       100    0.747932 9.702981e-23 0.971       NA    1.942 1.882305
#>            pROC_pval TSSmax  KAPPAmax dismo_threshold prevalence.value
#> thresholds         0   0.82 0.8043478       spec_sens        0.1666667
#>                  PPP       NPP TPR  TNR  FPR FNR       CCR     Kappa   F_score
#> thresholds 0.6923077 0.9787234 0.9 0.92 0.08 0.1 0.9166667 0.7321429 0.7826087
#>              Jaccard
#> thresholds 0.6428571

The following lines fit bioclim, GLM, maxnet, random forest, BRT, svme (from package e1071) and svmk (from package kernlab) models:

many <- do_many(species_name = species[1],
                predictors = example_vars,
                models_dir = test_folder,
                png_partitions = TRUE,
                write_bin_cut = FALSE,
                write_rda = TRUE,
                bioclim = TRUE,
                domain = FALSE,
                glm = TRUE,
                svmk = TRUE,
                svme = TRUE,
                maxent = FALSE,
                maxnet = TRUE,
                rf = TRUE,
                mahal = FALSE,
                brt = TRUE,
                equalize = TRUE)

In addition:

  • mask will crop and mask the partition models to a user-supplied shapefile (see the sketch below)
  • png_partitions will create a .png file of the outputs
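A sketch of masking the partition models (the shapefile path "my_mask.shp" is hypothetical; any polygon object readable by raster::shapefile() should work):

library(raster)
mask_shp <- shapefile("my_mask.shp") # hypothetical shapefile with the study area
do_any(species_name = species[1],
       algorithm = "bioclim",
       predictors = example_vars,
       models_dir = test_folder,
       mask = mask_shp,
       png_partitions = TRUE)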

At the end of a modeling round, the partition folder contains:

  • A .tif file for each partition: continuous, binary and cut by the threshold that maximizes its TSS (TSSmax). Its name indicates the algorithm, the type of model (cont, bin or cut), the name of the species, and the run and partition.
  • Figures in .png to explore the results readily, without reloading them into R or opening them in a GIS program. The creation of these figures is controlled by the png_partitions parameter.
  • A .txt table with the evaluation data for each partition: evaluate_[Species name ]_[partition number]_[algorithm].txt. These files will be read by the final_model() function to generate the final model per species.
  • A file called sdmdata.txt with the data used for each partition
  • A file called metadata.txt with the metadata of the current modeling round.
  • An optional .png image of the data (controlled by the parameter png_sdmdata = TRUE)

Joining partitions: final_model()

There are many ways to create a final model per algorithm per species. final_model() applies the following logic:

  • The partitions to be joined can be either the raw, uncut models or the binary models from the previous step; they form a raster::rasterStack object.
  • The mean of the raw models can be calculated (raw_mean)
  • From raw_mean, a binary model can be obtained by cutting it at the mean of the thresholds that maximize the selected performance metric in each partition (bin_th_par); this is raw_mean_th. From this, the values above the threshold can be recovered (raw_mean_cut).
  • In the case of binary models, since they have already been transformed into binary, a mean can be calculated (bin_mean). This bin_mean reflects the consensus between partitions, and its scale is categorical.
  • From bin_mean, a specific consensus level can be chosen (i.e. how many of the models predict an area, consensus_level) and the resulting binary model can be built (bin_consensus). The parameter consensus_level sets this level of consensus (defaults to 0.5: a majority-consensus approach).
  • NOTE: The final models can be built from a subset of the algorithms available on the hard disk, using the parameter algorithms. If left unspecified, all algorithms listed in the evaluate files will be used.
args(final_model)
#> function (species_name, algorithms = NULL, scale_models = TRUE, 
#>     consensus_level = 0.5, models_dir = "./models", final_dir = "final_models", 
#>     proj_dir = "present", which_models = c("raw_mean"), mean_th_par = c("spec_sens"), 
#>     uncertainty = FALSE, png_final = TRUE, sensitivity = 0.9, 
#>     ...) 
#> NULL
final_model(species_name = species[1],
            algorithms = NULL, #if null it will take all the algorithms in disk
            models_dir = test_folder,
            which_models = c("raw_mean",
                             "bin_mean",
                             "bin_consensus"),
            consensus_level = 0.5,
            uncertainty = TRUE,
            overwrite = TRUE)

final_model() creates a .tif file for each final model (one per algorithm) under the specified folder (default: final_models).
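To inspect what was written, you can list the GeoTIFFs under the output folder (a sketch; the exact nesting includes the species, projection and final-model folders):

list.files(test_folder, pattern = "tif$", recursive = TRUE, full.names = TRUE)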

Figure 2. The raw_mean final models for each algorithm.

Algorithmic consensus with ensemble_model()

The fourth step of the workflow is joining the models of each algorithm into a final ensemble model. ensemble_model() calculates the mean, standard deviation, minimum and maximum values of the final models and saves them under the folder specified by ensemble_dir. It can also create these models by a consensus rule (the proportion of final models that predict a presence in each pixel: 0.5 is a majority rule, 0.3 would require 30% of the models).

ensemble_model() uses a which_final parameter (analogous to which_models in final_model()) to specify which final model(s) (Figure 2) should be assembled together (the default is the mean of the raw continuous models: which_final = c("raw_mean")).

args(ensemble_model)
#> function (species_name, occurrences, lon = "lon", lat = "lat", 
#>     models_dir = "./models", final_dir = "final_models", ensemble_dir = "ensemble", 
#>     proj_dir = "present", algorithms = NULL, which_ensemble = c("average"), 
#>     which_final = c("raw_mean"), performance_metric = "TSSmax", 
#>     dismo_threshold = "spec_sens", consensus_level = 0.5, png_ensemble = TRUE, 
#>     write_occs = FALSE, write_map = FALSE, scale_models = TRUE, 
#>     uncertainty = TRUE, ...) 
#> NULL
ens <- ensemble_model(species_name = species[1],
                      occurrences = occs,
                      performance_metric = "pROC",
                      which_ensemble = c("average",
                                         "best",
                                         "frequency",
                                         "weighted_average",
                                         "median",
                                         "pca",
                                         "consensus"),
                      consensus_level = 0.5,
                      which_final = "raw_mean",
                      models_dir = test_folder,
                      overwrite = TRUE) #argument from writeRaster
#> [1] "Thu Aug  3 11:36:24 2023"
#> [1] "DONE!"
#> [1] "Thu Aug  3 11:36:36 2023"
plot(ens)

Workflows with multiple species

Our example_occs dataset has data for four species. One option for modeling all of them is to use a for loop:

args(do_many)
args(setup_sdmdata)

for (i in 1:length(example_occs)) {
  sp <- species[i]
  occs <- example_occs[[i]]
  setup_sdmdata(species_name = sp,
                models_dir = "~/modleR_test/forlooptest",
                occurrences = occs,
                predictors = example_vars,
                buffer_type = "distance",
                dist_buf = 4,
                write_buffer = TRUE,
                clean_dupl = TRUE,
                clean_nas = TRUE,
                clean_uni = TRUE,
                png_sdmdata = TRUE,
                n_back = 1000,
                partition_type = "bootstrap",
                boot_n = 5,
                boot_proportion = 0.7
  )
}

for (i in 1:length(example_occs)) {
  sp <- species[i]
  do_many(species_name = sp,
          predictors = example_vars,
          models_dir = "~/modleR_test/forlooptest",
          png_partitions = TRUE,
          bioclim = TRUE,
          maxnet = FALSE,
          rf = TRUE,
          svmk = TRUE,
          svme = TRUE,
          brt = TRUE,
          glm = TRUE,
          domain = FALSE,
          mahal = FALSE,
          equalize = TRUE,
          write_bin_cut = TRUE)
}

for (i in 1:length(example_occs)) {
  sp <- species[i]
  final_model(species_name = sp,
              consensus_level = 0.5,
              models_dir = "~/modleR_test/forlooptest",
              which_models = c("raw_mean",
                               "bin_mean",
                               "bin_consensus"),
              uncertainty = TRUE,
              overwrite = TRUE)
}

for (i in 1:length(example_occs)) {
  sp <- species[i]
  occs <- example_occs[[i]]
  ensemble_model(species_name = sp,
                 occurrences = occs,
                 which_final = "bin_consensus",
                 png_ensemble = TRUE,
                 models_dir = "~/modleR_test/forlooptest")
}

Another option is to use the purrr package (Henry and Wickham 2017).

library(purrr)

example_occs %>% purrr::map2(.x = .,
                             .y = as.list(names(.)),
                             ~ setup_sdmdata(species_name = .y,
                                             occurrences = .x,
                                             partition_type = "bootstrap",
                                             boot_n = 5,
                                             boot_proportion = 0.7,
                                             clean_nas = TRUE,
                                             clean_dupl = TRUE,
                                             clean_uni = TRUE,
                                             buffer_type = "distance",
                                             dist_buf = 4,
                                             predictors = example_vars,
                                             models_dir = "~/modleR_test/temp_purrr",
                                             n_back = 1000))

species %>%
  as.list(.) %>%
  purrr::map(~ do_many(species_name = .,
                       predictors = example_vars,
                       models_dir = "~/modleR_test/temp_purrr",
                       bioclim = TRUE,
                       maxnet = FALSE,
                       rf = TRUE,
                       svme = TRUE,
                       svmk = TRUE,
                       domain = FALSE,
                       glm = TRUE,
                       mahal = FALSE,
                       brt = TRUE,
                       equalize = TRUE))
species %>%
  as.list(.) %>%
  purrr::map(~ final_model(species_name = .,
                           consensus_level = 0.5,
                           models_dir =  "~/modleR_test/temp_purrr",
                           which_models = c("raw_mean",
                                            "bin_mean",
                                            "bin_consensus"),
                           overwrite = TRUE))
example_occs %>% purrr::map2(.x = .,
                             .y = as.list(names(.)),
                             ~ ensemble_model(species_name = .y,
                                              occurrences = .x,
                                              which_final = "raw_mean",
                                              png_ensemble = TRUE,
                                              models_dir = "~/modleR_test/temp_purrr",
                                              overwrite = TRUE))

These workflows can also be parallelized by species or by species × algorithm combinations.
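For example, a minimal sketch of parallelizing the model-fitting step by species with the parallel package (this assumes setup_sdmdata() has already been run for every species, as in the loop above; mclapply() relies on forking and therefore does not parallelize on Windows):

library(parallel)
mclapply(seq_along(example_occs), function(i) {
  do_many(species_name = species[i],
          predictors = example_vars,
          models_dir = "~/modleR_test/forlooptest",
          bioclim = TRUE,
          rf = TRUE,
          equalize = TRUE)
}, mc.cores = 2)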

References

Araújo, M, and M New. 2007. “Ensemble Forecasting of Species Distributions.” Trends in Ecology & Evolution 22 (1): 42–47. https://doi.org/10.1016/j.tree.2006.09.010.

Elith, J., J. R. Leathwick, and T. Hastie. 2009. “A Working Guide to Boosted Regression Trees.” Journal of Animal Ecology 77 (4): 802–13. https://doi.org/fn6m6v.

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2001. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Heidelberg.

Henry, Lionel, and Hadley Wickham. 2017. “Purrr: Functional Programming Tools. R Package Version 0.2.4.”

Hijmans, Robert J., Steven Phillips, John Leathwick, and Jane Elith. 2017. “Dismo: Species Distribution Modeling. R Package Version 1.1-4.”

Karatzoglou, Alexandros, Alex Smola, Kurt Hornik, and Achim Zeileis. 2004. “Kernlab - An S4 Package for Kernel Methods in R.” Journal of Statistical Software 11 (9): 1–20.

Liaw, Andy, and Matthew Wiener. 2002. “Classification and Regression by randomForest.” R News 2 (3): 18–22.

Meyer, David, Evgenia Dimitriadou, Kurt Hornik, Andreas Weingessel, and Friedrich Leisch. 2017. “E1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien.”


modleR's Issues

overwrite=TRUE

Check the consistency of overwrite = TRUE in final_model() and ensemble_model().

Error with writing rasters with do_any

Hello everyone,
I am running the code with the example dataset to try the package before using it with my data.

The first step works but I get some warnings:

sdmdata_1sp <- setup_sdmdata(species_name = species[1],
                             occurrences = occs,
                             predictors = example_vars,
                             models_dir = test_folder,
                             partition_type = "crossvalidation",
                             cv_partitions = 5,
                             cv_n = 1,
                             seed = 512,
                             buffer_type = "mean",
                             png_sdmdata = TRUE,
                             n_back = 500,
                             clean_dupl = FALSE,
                             clean_uni = FALSE,
                             clean_nas = FALSE,
                             geo_filt = FALSE,
                             geo_filt_dist = 10,
                             select_variables = TRUE,
                             sample_proportion = 0.5,
                             cutoff = 0.7)

Warning messages:
1: In wkt(obj) : CRS object has no comment
2: In wkt(obj) : CRS object has no comment
3: In RGEOSDistanceFunc(spgeom1, spgeom2, byid, "rgeos_distance") :
Spatial object 1 is not projected; GEOS expects planar coordinates
4: In wkt(obj) : CRS object has no comment
5: In wkt(obj) : CRS object has no comment
6: In rgeos::gBuffer(spgeom = occurrences, byid = FALSE, width = dist.buf) :
Spatial object is not projected; GEOS expects planar coordinates

Then I get an error when fitting the models

sp_maxnet <- do_any(species_name = species[1],
                    algorithm = "maxnet",
                    predictors = example_vars,
                    models_dir = test_folder,
                    png_partitions = TRUE,
                    write_bin_cut = FALSE,
                    equalize = TRUE)

maxnet
Ano_exi maxnet run number 1 part. nb. 3
fitting models
projecting the models
evaluating the models
writing evaluation tables
writing raster files
Error in .gd_SetProjectWkt(object, ...) :
STRING_ELT() can only be applied to a 'character vector', not a 'closure'
In addition: Warning message:
In .gd_SetProjectWkt(object, ...) : NO WKT AVAILABLE FOR PROJ >= 6

Any help would be appreciated. Thank you very much

if(any()) in clean_na

The indexing in clean_na needs to be fixed. It does not work when all the points fall outside.

Error with index characterization

I've been trying to create basic CJS models, before adding more complex covariates from my dataset. I've been using the following link as guidance:

https://jamesepaterson.github.io/jamespatersonblog/2020-04-26_introduction_to_CJS.html

Instead of comparing sexes, I am looking at differences between locations.
When I've tried cjs.m2 on my own dataset, I keep receiving this error that does not allow me to continue:

Error in intI(i, n = x@Dim[1], dn[[1]], give.dn = FALSE) :
index larger than maximal 42

I have not been able to fix this error. This is my code:

cjs.m1 <- crm(fish)

cjs.m1 <- cjs.hessian(cjs.m1)
cjs.m1

exp(cjs.m1$results$beta$Phi) / (1 + exp(cjs.m1$results$beta$Phi))

predict(cjs.m1,
        newdata = data.frame(location = c("River1", "River2")),
        se = TRUE)

cjs.m1.unequaltime <- crm(fish,
                          time.intervals = c(1.7, 2.9, 2.5, 6.5, 3.99, 3.4))

predict(cjs.m1.unequaltime)

fish.proc <- process.data(fish,
                          group = "location")

fish.ddl <- make.design.data(fish.proc)

Phi.dot <- list(formula = ~1) # ~1 is always a constant (or single estimate)
Phi.location <- list(formula = ~location)
p.location <- list(formula = ~location)

cjs.m2 <- crm(fish.proc,
              fish.ddl,
              model.parameters = list(Phi = Phi.dot,
                                      p = p.location),
              accumulate = FALSE)

Use balanced pseudo-absence points for some algorithms (e.g. randomForest)

For some algorithms (e.g. RF), the ideal is to use a balanced number of presence and absence points. However, in the package we use the same number of absences for all algorithms.
So I think we could go into the functions of the algorithms that need balanced points and add code to subsample the pseudo-absence points down to the same number as the presence points.

Error in randomForest with `equalize = T`

Error in sample.int(length(x), size, replace, prob) :
  cannot take a sample larger than the population when 'replace = FALSE'
Called from: sample.int(length(x), size, replace, prob)
Browse[1]> library(purrr)
Browse[1]> coordenadas %>% split(.$sp) %>%
+     purrr::map(~ do_enm(species_name = unique(.$sp), ...

In the purrr code of the vignette, with equalize = T when running random forests.
Commit b6d3af9

How to set parameters for different models

Hello everyone,
Is it possible to set parameters for the different models in modleR? For example, defining "ntree" for the random forest model, or the regularization multiplier (RM) and feature class types for MaxEnt?

Many thanks,
Iman

General check

We need to give the latest changes a general check.
I am sure the last commit on master and the last commit on clean are not letting the vignette run;
commit b6d3af9 knits, but even so I am not sure that mahal is being cut properly (with the LPT as the minimum value) or that the other algorithms are scaling well between 0 and 1.
I went to run the purrr code and got an error. I also get inconsistent errors and cannot always find their origin; we should build the list of errors here, in the issues.

I have already started: issue #26
I closed several others that had already been solved.

min_geog_dist >= dist.buf in create_buffer()

Hello!

In the create_buffer() function, line 161 returns a warning if min_geog_dist >= dist.buf. But if this condition is true, the workflow will break, since the r_buffer cutout will have no pixels.
So I think we could replace the warning with a stop, so that the user has to review the min_geog_dist they provided.

do_many,do_any, and final_model

I was running the following code:
final_model(species_name = species[1],
            algorithms = NULL, # if null it will take all the algorithms on disk
            models_dir = test_folder,
            which_models = c("raw_mean",
                             "bin_mean",
                             "bin_consensus"),
            consensus_level = 0.5,
            uncertainty = TRUE,
            overwrite = TRUE)

ens <- ensemble_model

However, I got the following errors:

Error in final_model(species_name = species[1], algorithms = NULL, models_dir = test_folder, :
  could not find function "final_model"

Error in ensemble_model(species_name = species[1], occurrences = occs, :
  could not find function "ensemble_model"

I need your help on how to fix this issue.

evaluate: presencenb and absencenb

The evaluate files have the presencenb and absencenb columns; are these the numbers of records used to generate the models of each partition?
If so, the values are not correct, because they show the number of points each partition holds (presence and absence) and not the number of points actually used.

For example:
With a dataset of 16 presence points and 100 absences, each partition will use 12 presence points and 75 absences for training. However, the evaluate file will show 4 for presences and 25 for absences.

do_enm()

In the do_enm() function, shall we include generating the ensembles and the final models?
That way this function would carry out the complete modeling process.

Parameter inheritance between functions

Today, everything inherits from setup_sdmdata(), which has an ellipsis (...) for create_buffer(). This needs organizing.
To test:
create_buffer -> setup_sdmdata -> do_any -> do_enm -> final_model -> ensemble?

Implement projections

Implement the possibility of generating multiple projections (past, future and/or another area).
This is perhaps the most complex implementation we will make, so I imagine it as a long-term goal.

Pseudoabsences

Dear all,

I have a question about how to choose among the different pseudoabsence strategies considering the Barbet-Massin et al. 2012 paper.
The suggestion would be, when using statistical algorithms, to choose between these two strategies depending on the number of occurrences: "When 100 or less presence points, a minimum of 10 runs with 100 PA; SRE strategy. When more than 300 presence points, 1000 PA should be select"; disk strategy. When using machine-learning algorithms, the ideal would be to use 10000 PA with a single run.
In modleR, should I run the setup and do_any for each algorithm, choosing the different strategies? If I do it that way, should I change/create different objects to call in the next functions, or is there a function in modleR itself that does this automatically?

Thank you very much in advance,

Luara Tourinho

AUC and pROC curve

Hi,
I don't have any issue; your guidelines were perfect and very easy to follow, so thank you for that. I was wondering whether there is a way to get the graphs of the curves for the final models or for the ensembles?

Also, is there a way to get the variable importance of the final models?

Thanks

Decide how the example data will be made available

There are 2 ways of making data available in an R package:

  1. exported data: used to provide ready-to-use R objects, such as data.frames and RasterStacks
  2. raw data: used to provide the raw files, such as CSV files or those .grd files we read with raster::stack()

The 2nd approach is used when you want to show how to load/parse the raw data. In that case, the function system.file() is used to return the file path.

I chose the 1st option (1e18efe), and to show how to load the data and use the functions in the vignettes (which I have not written yet).

Do you agree with this path? I made this decision unilaterally because @gomesvilasboas wants to commit some things and is just waiting for me to finish this. If you prefer, I can change it later. I opened this issue to hear your opinions.

Error message with (setup_sdmdata) step

Hello Everyone,

When I run the first code in modleR, it gives me an error message. I am posting the R code and the error message here. Can someone please help me with this?

setup_sdmdata(species_name = "L_agrorensis", occurrences = occ, predictors = stack,
              lon = "longitude", lat = "latitude", models_dir = "./models",
              real_absences = NULL, buffer_type = NULL, dist_buf = NULL,
              env_filter = FALSE, env_distance = "centroid", buffer_shape = NULL,
              min_env_dist = NULL, min_geog_dist = NULL, write_buffer = FALSE,
              seed = NULL, clean_dupl = FALSE, clean_nas = FALSE, clean_uni = FALSE,
              geo_filt = FALSE, geo_filt_dist = NULL, select_variables = FALSE,
              cutoff = 0.8, sample_proportion = 0.8, png_sdmdata = TRUE,
              n_back = 1000, partition_type = C("bootstrap"), boot_n = 1,
              boot_proportion = 0.7, cv_n = NULL, cv_partitions = NULL)
Error in (function (classes, fdef, mtable) :
  unable to find an inherited method for function ‘res’ for signature ‘"standardGeneric"’

Shapefile to select background points from

Hi,
I tried many different types of shape files as buffers but none of them seems to be the accepted format. Each time I get the following error message.

setup_sdmdata(species_name = "L_agrorensis", occurrences = occ, predictors = stack,
              lon = "longitude", lat = "latitude", models_dir = "./models",
              real_absences = NULL, buffer_type = "user", buffer_shape = marea,
              dist_buf = NULL, env_filter = FALSE, env_distance = "centroid",
              min_env_dist = NULL, min_geog_dist = NULL, write_buffer = FALSE,
              seed = NULL, clean_dupl = FALSE, clean_nas = FALSE, clean_uni = FALSE,
              geo_filt = FALSE, geo_filt_dist = NULL, select_variables = FALSE,
              cutoff = 0.8, sample_proportion = 0.8, png_sdmdata = TRUE,
              n_back = 1000, partition_type = c("bootstrap"), boot_n = 1,
              boot_proportion = 0.7, cv_n = NULL, cv_partitions = NULL)

sampling pseudoabsence points with user buffer
Error in .doCellFromXY(object@ncols, object@nrows, object@extent@xmin, :
  Not compatible with requested type: [type=list; target=double].

Is there any specific format for the user defined M area shapefile?

ROC plot, Variable Contribution and AICc

I am trying to see whether there is a way to obtain a ROC plot, the contribution and permutation values for the variables (especially in MaxEnt models), and whether it is possible to compute the AICc value.

In the case of MaxEnt, if the return value of do_any() were the model itself, the first questions would be easy to solve. Is there any way to do this (even a pointer to some direct modification of the code)?

Thank you!!

Threshold choice

Is the choice of the threshold applied to generate the binary models implemented?
In my view, it would be interesting to be able to choose which threshold is applied to the partitions.
Choose among:

  1. Fixed cumulative value 1 (value fixed at 1% of A.A.)
  2. Fixed cumulative value 5 (value fixed at 5% of A.A.)
  3. Fixed cumulative value 10 (value fixed at 10% of A.A.)
  4. Minimum training presence (omission = 0% of the training points)
  5. 10 percentile training presence (omission = 10% of the training points)
  6. Equal training sensitivity and specificity (equal omission and commission for the training points)
  7. Maximum training sensitivity plus specificity (lowest training omission over the smallest predicted area)
  8. Equal test sensitivity and specificity (equal omission and commission for the test points)
  9. Maximum test sensitivity plus specificity (lowest test omission over the smallest predicted area)

Import required packages

We need to test which packages need to be imported/installed when loading the modleR package.
Currently 'rgdal' and 'maps' are still missing.

Rename the package

Let's choose a name for the package, one that can be used on CRAN.
The proposal raised a while ago was modelR (just with a capital R).
I like that name, but the chance of confusion with another package (modelr), which already exists on CRAN, is high. At the same time, I don't know whether that would be a problem.
That is why I opened this issue, so we can settle on a name and have it documented here.

Implement partition 'rounds'

We can implement the possibility of running several rounds of partitions, which would be equivalent to a cross-validation with bootstrap, as this can give greater reliability to the final models.

The procedure is basically to build the partition models, then create new partitions and build the models again.
Example:
We have 9 points and choose to make 3 partitions; in the first round we have:

  • partition 1 = pts 1,2,3
  • partition 2 = pts 4,5,6
  • partition 3 = pts 7,8,9

In the second round it could be:

  • partition 1 = pts 1,4,7
  • partition 2 = pts 2,5,8
  • partition 3 = pts 3,6,9

And we can repeat each round several times.
This way, with 5 algorithms, 5 rounds and 3 partitions, we get 5 x 5 x 3 = 75 models per species.

List of features to implement

  • round metadata: a summary of the information of each round (sp, n, design, nback)
  • tests that the metadata match what has just been requested
  • projections (issue #8)
  • environmental filter, with euclidean distance
  • geographic filter with spthin()
  • geographic filter with Diogo's geo_filt()
  • optional dataset plot; go back to saving sdmdata.txt
  • separate the dataset generation from the modeling
  • go back to the other ways of building final_model(), but with conditions
  • likewise for the ensemble: mean, max, min
  • implement different designs (issue #6)
  • parameters that vary between species
  • general review of the evaluation, final and ensemble methods!
  • euclidean distance

Issue with randomforest

Hi. I'm trying to model the distribution of an invasive fish. modleR is using 900 points and 4 predictors. When getting to the randomForest bit, it stops and returns this error:

>rf
Remember a variable selection was performed 
 retained variables: wc2.1_30s_bio_2-wc2.1_30s_bio_3-wc2.1_30s_bio_4-srad_mean-tmax 

Pseudoxiphophorus bimaculatus rf run number 1 part. nb. 3
fitting models
Error in sample.int(length(x), size, replace, prob) : 
  cannot take a sample larger than the population when 'replace = FALSE'

How can I fix this?

Package checklist

  • CITATION
  • README.md
  • LICENSE
  • DESCRIPTION
  • Version
  • Documentation for the whole package (R/modelr.R)
  • Documentation for the functions (examples, description, details, see also)
  • Vignette
  • Decide which data go into the package and how; they are large (6 MB), compress to 1 MB
  • Resolve undefined global functions or variables

Passing arguments to maxent

Is there a way to pass arguments to maxent ("quadratic=TRUE", "product=FALSE", "hinge=FALSE", "outputformat=logistic", etc.)? I did not see it in the docs. Thanks
