fitzlab-al / gdm Goto Github PK

View Code? Open in Web Editor NEW

33.0 9.0 11.0 13.72 MB

R package for Generalized Dissimilarity Modeling

License: GNU General Public License v3.0

C 3.36% C++ 45.75% R 50.89%

gdm's Introduction

gdm

The gdm package provides functions to fit, plot, summarize, and apply Generalized Dissimilarity Models.

Installation

The gdm package is available on CRAN, development versions are available on GitHub.

Install from CRAN:

install.packages("gdm")

Install latest development version from GitHub (requires devtools package):

if (!require("devtools")) {
  install.packages("devtools")
}
devtools::install_github("fitzLab-AL/gdm")

Package Citation

Fitzpatrick MC, Mokany K, Manion G, Nieto-Lugilde D, Ferrier S. (2021) gdm: Generalized Dissimilarity Modeling. R package version 1.5.

New update of v1.6

The gdm package has been updated to leverage the terra package as its raster processing engine, leading to faster raster file processing. Preferably, inputs should be provided as SpatRaster objects, or any convertible object to terra, such as raster package objects or stars objects.

With the transition to terra, the gdm package is now capable of efficiently handling very large raster files, thanks to the underlying terra functionalities. Memory management is handled automatically by terra, but in the event of encountering out-of-memory errors, you can utilize terra::terraOptions(steps = ...) to increase the number of processing steps for large files.

Getting Started

GDM has been used in many published studies. In addition to working through the examples here and those throughout the package documentation, we recommend reading these publications for background information:

Ferrier S, Manion G, Elith J, Richardson, K (2007) Using generalized dissimilarity modelling to analyse and predict patterns of beta diversity in regional biodiversity assessment. Diversity & Distributions 13: 252-264.https://doi.org/10.1111/j.1472-4642.2007.00341.x

Mokany K, Ware C, Woolley, SNC, Ferrier S, Fitzpatrick MC (2022) A working guide to harnessing generalized dissimilarity modelling for biodiversity analysis and conservation assessment. Global Ecology and Biogeography, 31, 802– 821. https://doi.org/10.1111/geb.13459

Introduction

The R package gdm implements Generalized Dissimilarity Modeling Ferrier et al. 2007 to analyze and map spatial patterns of biodiversity. GDM models biological variation as a function of environment and geography using distance matrices – specifically by relating biological dissimilarity between sites to how much sites differ in their environmental conditions (environmental distance) and how isolated they are from one another (geographical distance). Here we demonstrate how to fit, apply, and interpret GDM in the context of analyzing and mapping species-level patterns. GDM also can be used to model other biological levels of organization, notably genetic Fitzpatrick & Keller 2015, phylogenetic Rosauer et al. 2014, or function/traits Thomassen et al. 2010, and the approaches for doing so are largely identical to the species-level case with the exception of using a different biological dissimilarity metric depending on the type of response variable.

Preparing the data for GDM: The site-pair table.

The initial step in fitting a generalized dissimilarity model is to combine the biological and environmental data into “site-pair” table format using the formatsitepair function.

GDM can use several data formats as input. Most common are site-by-species tables (sites in rows, species across columns) for the response and site-by-environment tables (sites in rows, predictors across columns) as the predictors, though distance matrices and rasters also are accommodated as demonstrated below.

The gdm package comes with two example biological data sets and two example environmental data sets in a number of formats. Example data include: - southwest: A data frame that contains x-y coordinates, 10 columns of predictors (five soil and five bioclimatic variables), and occurrence data for 900+ species of plants from southwest Australia (representing a subset of the data used in [@fitzpatrick_2013]). Note that the format of the southwest table is an x-y species list (i.e., bioFormat = 2, see below) where there is one row per species record rather than per site. These biological data are similar to what would be obtained from online databases such as GBIF. - gdmDissim: A pairwise biological dissimilarity matrix derived from the species data provided in southwest. gdmDissim is provided to demonstrate how to proceed when you when you want to fit GDM using an existing biological distance matrix (e.g., pairwise Fst) as the response variable (i.e., bioFormat = 3, see below). Note however that distance matrices can also be used as predictors (e.g., to model compositional variation in one group as a function of compositional variation in another group Jones et al 2013. - swBioclims: a raster stack of the five bioclimatic predictors provided in the southwest data.

Note that for all input data the rows and their order must match in the biological and environmental data frames and must not include NAs. This is best accomplished by making sure your tables have a column with a unique identifier for each site and that the order of these IDs are the same across all tables.

To build a site-pair table, we need individual tables for the biological and environmental data, so we first index the southwest table to create a table for the species data and a second for the environmental data:

library(gdm)
# have a look at the southwest data set
str(southwest)
#> 'data.frame':    29364 obs. of  14 variables:
#>  $ species   : Factor w/ 974 levels "spp1","spp10",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ site      : int  1066 1026 1025 1026 1027 1047 1048 1066 1066 1067 ...
#>  $ awcA      : num  14.5 16.3 23.1 16.3 17 ...
#>  $ phTotal   : num  546 471 460 471 489 ...
#>  $ sandA     : num  71.3 68.9 71.5 68.9 74.7 ...
#>  $ shcA      : num  178.9 105.8 88.4 105.8 147.2 ...
#>  $ solumDepth: num  875 928 892 928 952 ...
#>  $ bio5      : num  31.4 33.1 32.8 33.1 33.2 ...
#>  $ bio6      : num  5.06 4.85 4.82 4.85 4.59 ...
#>  $ bio15     : num  40.4 48.2 53.9 48.2 44 ...
#>  $ bio18     : int  0 0 43 0 0 0 0 0 0 0 ...
#>  $ bio19     : num  133 140 145 140 136 ...
#>  $ Lat       : num  -33 -32 -32 -32 -32.1 ...
#>  $ Long      : num  119 118 118 118 119 ...

# biological data
# get columns with xy, site ID, and species data
sppTab <- southwest[, c("species", "site", "Long", "Lat")]

# # columns 3-7 are soils variables, remainder are climate
# get columns with site ID, env. data, and xy-coordinates
envTab <- southwest[, c(2:ncol(southwest))]

Because the southwest data is x-y species list format, we use bioFormat=2. Otherwise, we just need to provide the required column names to create the site-pair table:

# x-y species list example
gdmTab <- formatsitepair(bioData=sppTab, 
                         bioFormat=2, #x-y spp list
                         XColumn="Long", 
                         YColumn="Lat",
                         sppColumn="species", 
                         siteColumn="site", 
                         predData=envTab)

#>        distance weights s1.xCoord s1.yCoord s2.xCoord s2.yCoord s1.awcA
#> 132   0.4485981       1   115.057 -29.40472  115.5677 -29.46599 23.0101
#> 132.1 0.7575758       1   115.057 -29.40472  116.0789 -29.52556 23.0101
#> 132.2 0.8939394       1   115.057 -29.40472  116.5907 -29.58342 23.0101
#>       s1.phTotal s1.sandA  s1.shcA s1.solumDepth s1.bio5 s1.bio6 s1.bio15
#> 132     480.3266 83.99326 477.5656      1129.933  34.668   8.908    86.64
#> 132.1   480.3266 83.99326 477.5656      1129.933  34.668   8.908    86.64
#> 132.2   480.3266 83.99326 477.5656      1129.933  34.668   8.908    86.64
#>       s1.bio18 s1.bio19 s2.awcA s2.phTotal s2.sandA  s2.shcA s2.solumDepth
#> 132          0   267.44 22.3925   494.1225  76.6900 357.7225     1183.9025
#> 132.1        0   267.44 17.0975   415.1275  70.0175 112.4800      985.5300
#> 132.2        0   267.44 17.0300   333.4400  71.5950 165.7250      956.5425
#>        s2.bio5  s2.bio6 s2.bio15 s2.bio18 s2.bio19
#> 132   35.50571 7.448572 75.37143        0 228.6572
#> 132.1 36.05000 6.605882 64.52941        0 168.8824
#> 132.2 36.18750 6.131250 58.75000        0 141.1250

The first column of a site-pair table contains a biological distance measure (the default is Bray-Curtis distance though any measure scaled between 0-1 is acceptable). The second column contains the weight to be assigned to each data point in model fitting (defaults to 1 if equal weighting is used, but can be customized by the user or can be scaled to site richness, see below). The remaining columns are the coordinates and environmental values at a site (s1) and those at a second site (s2) making up a site pair. Rows represent individual site-pairs. While the site-pair table format can produce extremely large data frames and contain numerous repeat values (because each site appears in numerous site-pairs), it also allows great flexibility. Most notably, individual site pairs easily can be excluded from model fitting.

A properly formatted site-pair table will have at least six columns (distance, weights, s1.xCoord, s1.yCoord, s2.xCoord, s2.yCoord) and some number more depending on how many predictors are included. See ?formatsitepair and ?gdm for more details.

Formatting a site-pair table using a distance matrix.

What if you already have a biological distance matrix because you are working with, say, genetic data? In that case, it is simple as changing the bioFormat argument and providing that matrix as the bioData object to the sitepairformat function. However, in addition to the pairwise dissimilarity values, the object must include a column containing the site IDs. Let’s have a quick look at gdmDissim, a pairwise biological distance matrix provided with the package (note that the first column contains site IDs):

# Biological distance matrix example
dim(gdmDissim)
#> [1] 94 95
gdmDissim[1:5, 1:5]
#>    site         1         2         3         4
#> V1  881 0.0000000 0.4485981 0.7575758 0.8939394
#> V2  882 0.4485981 0.0000000 0.5837563 0.8170732
#> V3  883 0.7575758 0.5837563 0.0000000 0.4782609
#> V4  884 0.8939394 0.8170732 0.4782609 0.0000000
#> V5  885 0.9178082 0.8202247 0.5813953 0.4375000

We can provide the gdmDissim object to formatsitepair as follows:

gdmTab.dis <- formatsitepair(bioData=gdmDissim, 
                             bioFormat=3, #diss matrix 
                             XColumn="Long", 
                             YColumn="Lat", 
                             predData=envTab, 
                             siteColumn="site")

#>        distance weights s1.xCoord s1.yCoord s2.xCoord s2.yCoord s1.awcA
#> 132   0.4485981       1   115.057 -29.40472  115.5677 -29.46599 23.0101
#> 132.1 0.7575758       1   115.057 -29.40472  116.0789 -29.52556 23.0101
#> 132.2 0.8939394       1   115.057 -29.40472  116.5907 -29.58342 23.0101
#>       s1.phTotal s1.sandA  s1.shcA s1.solumDepth s1.bio5 s1.bio6 s1.bio15
#> 132     480.3266 83.99326 477.5656      1129.933  34.668   8.908    86.64
#> 132.1   480.3266 83.99326 477.5656      1129.933  34.668   8.908    86.64
#> 132.2   480.3266 83.99326 477.5656      1129.933  34.668   8.908    86.64
#>       s1.bio18 s1.bio19 s2.awcA s2.phTotal s2.sandA  s2.shcA s2.solumDepth
#> 132          0   267.44 22.3925   494.1225  76.6900 357.7225     1183.9025
#> 132.1        0   267.44 17.0975   415.1275  70.0175 112.4800      985.5300
#> 132.2        0   267.44 17.0300   333.4400  71.5950 165.7250      956.5425
#>        s2.bio5  s2.bio6 s2.bio15 s2.bio18 s2.bio19
#> 132   35.50571 7.448572 75.37143        0 228.6572
#> 132.1 36.05000 6.605882 64.52941        0 168.8824
#> 132.2 36.18750 6.131250 58.75000        0 141.1250

In addition to starting with tablular data, environmental data can be extracted directly from rasters, assuming the x-y coordinates of sites are provided in either a site-species table (bioFormat=1) or as a x-y species list (bioFormat=2).

# environmental raster data for sw oz
swBioclims <- terra::rast(system.file("./extdata/swBioclims.grd", package="gdm"))

gdmTab.rast <- formatsitepair(bioData=sppTab, 
                              bioFormat=2, # x-y spp list
                              XColumn="Long", 
                              YColumn="Lat", 
                              sppColumn="species",
                              siteColumn="site",
                              predData=swBioclims) #raster stack

Because some sites might not overlap with the rasters, we should check for and remove NA values from the site-pair table:

sum(is.na(gdmTab.rast))
#> [1] 465
gdmTab.rast <- na.omit(gdmTab.rast)

Note that the formatsitepair function assumes that the coordinates of the sites are in the same coordinate system as the rasters. At present, no checking is performed to ensure this is the case. Note also that if your site coordinates are longitude-latitude that the calculation of geographic distances between sites will have errors, the size of which will depend on the geographic extent and location of your study region. We hope to deal with this in a later release, but for now you can avoid these problems by using a projected coordinate system (e.g., equidistant).

Dealing with biases associated with presence-only data

The ideal biological data for fitting a GDM are occurrence records (presence-absence or abundance) from a network of sites where all species (from one or more taxonomic groups) have been intensively sampled such that compositional dissimilarity can be reliably estimated between sites. However most species data are collected as part of ad hoc surveys and are presence-only. Under these circumstances, there is no systematic surveying and no sites per se, but rather grid cells with some number of occurrence records depending on the number of species observed, with many grid cells having none, a few, or even a single species record. When these data are used to calculate compositional dissimilarity, erroneously high values will result, which will bias the model.

The formatsitepair function provides a few options for dealing with this potential bias, including (i) weighting sites relative to the number of species observed (weightType="richness"), (ii) removing sites with few species (e.g., speciesFilter=10) or (iii) both. Decisions regarding which approach to use will depend on the nature of the data and study system. See Ferrier et al. (2007) for further discussion.

# weight by site richness using weightType="richness"
gdmTab.rw <- formatsitepair(bioData=sppTab, 
                            bioFormat=2, 
                            XColumn="Long", 
                            YColumn="Lat",
                            sppColumn="species", 
                            siteColumn="site", 
                            predData=envTab, 
                            weightType="richness")

# weights based on richness (number of species records)
gdmTab.rw[1:5, 1:5]
#>        distance   weights s1.xCoord s1.yCoord s2.xCoord
#> 132   0.4485981 0.2449866   115.057 -29.40472  115.5677
#> 132.1 0.7575758 0.1916207   115.057 -29.40472  116.0789
#> 132.2 0.8939394 0.1635852   115.057 -29.40472  116.5907
#> 132.3 0.9178082 0.1858930   115.057 -29.40472  117.1029
#> 132.4 0.9787234 0.1337957   115.057 -29.40472  117.6156

# remove sites with < 10 species records using
# sppFilter = 10
gdmTab.sf <- formatsitepair(bioData=sppTab, 
                            bioFormat=2, 
                            XColumn="Long", 
                            YColumn="Lat",
                            sppColumn="species", 
                            siteColumn="site", 
                            predData=envTab, 
                            sppFilter=10)

GDM fitting

GDM is a nonlinear extension of permutational matrix regression that uses flexible splines and generalized linear modeling (GLM) to accommodate two types of nonlinearity common in ecological datasets: (1) variation in the rate of compositional turnover (non-stationarity) along environmental gradients, and (2) the curvilinear relationship between biological distance and environmental and geographical distance.

The function gdm fits generalized dissimilarity models and is simple to use once the biological and predictor data have been formatted to a site-pair table. In addition to specifying whether or not the model should be fit with geographical distance as a predictor variable, the user has the option to specify (i) the number of I-spline basis functions (the default is three, with larger values producing more complex splines) and (ii) the locations of “knots” along the splines (defaults to 0 (minimum), 50 (median), and 100 (maximum) quantiles when three I-spline basis functions are used). Even though these option are available, using the default values for these parameters will work fine for most applications. In other words, unless you have a good reason, you should probably use the default settings for splines and knots. The effects (and significance) of altering the number of splines and knot locations has not been systematically explored.

Here we fit GDM with geo=T and default settings for all other parameters.

gdm.1 <- gdm(data=gdmTab, geo=TRUE)

The summary function provides an overview of the model, the most important items to note are:

Percent Deviance Explained: goodness-of-fit
Intercept: expected dissimilarity between sites that do not differ in the predictors
Summary of the fitted I-splines for each predictor, including the values of the coefficients and their sum. The sum indicates the amount of compositional turnover associated with that variable, holding all other variables constant. I-spline summaries are order by coefficient sum. Variables with all coefficients=0 have no relationship with the modeled biological pattern.

summary(gdm.1)
#> [1] 
#> [1] 
#> [1] GDM Modelling Summary
#> [1] Creation Date:  Wed Mar 27 10:15:19 2024
#> [1] 
#> [1] Name:  gdm.1
#> [1] 
#> [1] Data:  gdmTab
#> [1] 
#> [1] Samples:  4371
#> [1] 
#> [1] Geographical distance used in model fitting?  TRUE
#> [1] 
#> [1] NULL Deviance:  651.914
#> [1] GDM Deviance:  129.025
#> [1] Percent Deviance Explained:  80.208
#> [1] 
#> [1] Intercept:  0.277
#> [1] 
#> [1] PREDICTOR ORDER BY SUM OF I-SPLINE COEFFICIENTS:
#> [1] 
#> [1] Predictor 1: bio19
#> [1] Splines: 3
#> [1] Min Knot: 114.394
#> [1] 50% Knot: 172.416
#> [1] Max Knot: 554.771
#> [1] Coefficient[1]: 0.941
#> [1] Coefficient[2]: 0.868
#> [1] Coefficient[3]: 0
#> [1] Sum of coefficients for bio19: 1.809
#> [1] 
#> [1] Predictor 2: phTotal
#> [1] Splines: 3
#> [1] Min Knot: 277.978
#> [1] 50% Knot: 584.609
#> [1] Max Knot: 1860.37
#> [1] Coefficient[1]: 1.127
#> [1] Coefficient[2]: 0.23
#> [1] Coefficient[3]: 0
#> [1] Sum of coefficients for phTotal: 1.357
#> [1] 
#> [1] Predictor 3: bio5
#> [1] Splines: 3
#> [1] Min Knot: 25.571
#> [1] 50% Knot: 32.16
#> [1] Max Knot: 36.188
#> [1] Coefficient[1]: 0.127
#> [1] Coefficient[2]: 0.453
#> [1] Coefficient[3]: 0.114
#> [1] Sum of coefficients for bio5: 0.694
#> [1] 
#> [1] Predictor 4: solumDepth
#> [1] Splines: 3
#> [1] Min Knot: 705.02
#> [1] 50% Knot: 1017.628
#> [1] Max Knot: 1247.705
#> [1] Coefficient[1]: 0.682
#> [1] Coefficient[2]: 0
#> [1] Coefficient[3]: 0
#> [1] Sum of coefficients for solumDepth: 0.682
#> [1] 
#> [1] Predictor 5: awcA
#> [1] Splines: 3
#> [1] Min Knot: 12.975
#> [1] 50% Knot: 22.186
#> [1] Max Knot: 50.7
#> [1] Coefficient[1]: 0
#> [1] Coefficient[2]: 0
#> [1] Coefficient[3]: 0.523
#> [1] Sum of coefficients for awcA: 0.523
#> [1] 
#> [1] Predictor 6: Geographic
#> [1] Splines: 3
#> [1] Min Knot: 0.452
#> [1] 50% Knot: 2.46
#> [1] Max Knot: 6.532
#> [1] Coefficient[1]: 0.014
#> [1] Coefficient[2]: 0.372
#> [1] Coefficient[3]: 0
#> [1] Sum of coefficients for Geographic: 0.386
#> [1] 
#> [1] Predictor 7: sandA
#> [1] Splines: 3
#> [1] Min Knot: 56.697
#> [1] 50% Knot: 72.951
#> [1] Max Knot: 83.993
#> [1] Coefficient[1]: 0.092
#> [1] Coefficient[2]: 0
#> [1] Coefficient[3]: 0.139
#> [1] Sum of coefficients for sandA: 0.231
#> [1] 
#> [1] Predictor 8: shcA
#> [1] Splines: 3
#> [1] Min Knot: 78.762
#> [1] 50% Knot: 179.351
#> [1] Max Knot: 521.985
#> [1] Coefficient[1]: 0
#> [1] Coefficient[2]: 0.156
#> [1] Coefficient[3]: 0
#> [1] Sum of coefficients for shcA: 0.156
#> [1] 
#> [1] Predictor 9: bio6
#> [1] Splines: 3
#> [1] Min Knot: 4.373
#> [1] 50% Knot: 5.509
#> [1] Max Knot: 9.224
#> [1] Coefficient[1]: 0.121
#> [1] Coefficient[2]: 0
#> [1] Coefficient[3]: 0
#> [1] Sum of coefficients for bio6: 0.121
#> [1] 
#> [1] Predictor 10: bio15
#> [1] Splines: 3
#> [1] Min Knot: 29.167
#> [1] 50% Knot: 55.008
#> [1] Max Knot: 87.143
#> [1] Coefficient[1]: 0.027
#> [1] Coefficient[2]: 0
#> [1] Coefficient[3]: 0
#> [1] Sum of coefficients for bio15: 0.027
#> [1] 
#> [1] Predictor 11: bio18
#> [1] Splines: 3
#> [1] Min Knot: 0
#> [1] 50% Knot: 0
#> [1] Max Knot: 52
#> [1] Coefficient[1]: 0
#> [1] Coefficient[2]: 0
#> [1] Coefficient[3]: 0
#> [1] Sum of coefficients for bio18: 0

GDM plots

The fitted splines represent one of the most informative components of a fitted GDM and so plotting and scrutinizing the splines is a major part of interpreting GDM and the analyzed biological patterns. The fitted model and I-splines can be viewed using the plot function, which produces a multi-panel plot that includes two model summary plots showing (i) the fitted relationship between predicted ecological distance and observed compositional dissimilarity and (ii) predicted versus observed biological distance, followed by a series of panels showing each I-spline with at least one non-zero coefficient (plotted in order by sum of the I-spline coefficients). Note that in the example bio18 is not plotted because all three coefficients equaled zero and so had no relationship with the response.

The maximum height of each spline indicates the magnitude of total biological change along that gradient and thereby corresponds to the relative importance of that predictor in contributing to biological turnover while holding all other variables constant (i.e., is a partial ecological distance). The spline’s shape indicates how the rate of biological change varies with position along that gradient. Thus, the splines provide insight into the total magnitude of biological change as a function of each gradient and where along each gradient those changes are most pronounced. In this example, compositional turnover is greatest along gradients of bio19 (winter precipitation) and phTotal (soil phosphorus) and most rapid near the low ends of these gradients.

length(gdm.1$predictors) # get ideal of number of panels
#> [1] 11
plot(gdm.1, plot.layout=c(4,3))

The fitted model (first two panels) and I-splines (remaining panels).

To allow easy customization of I-spline plots, the isplineExtract function will extract the plotted values for each I-spline.

gdm.1.splineDat <- isplineExtract(gdm.1)
str(gdm.1.splineDat)
#> List of 2
#>  $ x: num [1:200, 1:11] 0.452 0.483 0.513 0.544 0.574 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr [1:11] "Geographic" "awcA" "phTotal" "sandA" ...
#>  $ y: num [1:200, 1:11] 0 0.00045 0.00095 0.0015 0.0021 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr [1:11] "Geographic" "awcA" "phTotal" "sandA" ...
plot(gdm.1.splineDat$x[,"bio19"], 
     gdm.1.splineDat$y[,"bio19"], 
     lwd=3,
     type="l", 
     xlab="Winter precipitation (mm)", 
     ylab="Partial ecological distance")

Custom I-spline plot for geographic distance.

GDM predictions

The I-splines provide an indication of how species composition (or any other fitted biological response variable) changes along each environmental gradient. Beyond these insights, a fitted model also can be used to (i) predict biological dissimilarity between site pairs in space or between times using the predict function and (ii) transform the predictor variables from their arbitrary environmental scales to a common biological importance scale using the gdm.transform function.

The following examples show predictions between site pairs in space and locations through time, and transformation of both tabular and raster data. For the raster example, the transformed layers are used to map spatial patterns of biodiversity.

Using a fitted GDM to predict biological dissimilarity between sites

The predict function requires a site-pair table in the same format as that used to fit the model. For demonstration purposes, we use the same table as that was used to fit the model, though predictions to new sites (or times) can be made as well assuming the same set of environmental/spatial predictors are available at those locations (or times).

gdm.1.pred <- predict(object=gdm.1, data=gdmTab)

head(gdm.1.pred)
#> [1] 0.4720423 0.7133571 0.8710175 0.8534788 0.9777208 0.3996694

plot(gdmTab$distance, 
     gdm.1.pred, 
     xlab="Observed dissimilarity", 
     ylab="Predicted dissimilarity", 
     xlim=c(0,1), 
     ylim=c(0,1), 
     pch=20, 
     col=rgb(0,0,1,0.5))
lines(c(-1,2), c(-1,2))

Predicted vs. observed compositional dissimilarity.

Predicting biological change through time

The predict function can be used to make predictions through time, for example, under climate change scenarios to estimate the magnitude of expected change in biological composition in response to environmental change [@fitzpatrick_2011]. In this case, rasters must be provided for two time periods of interest.

First we fit a new model using only the climate variables and then create some fake future climate rasters to use as example data.

# fit a new gdm using a table with climate data only (to match rasters)
gdm.rast <- gdm(gdmTab.rast, geo=T)

# make some fake climate change data
futRasts <- swBioclims
##reduce winter precipitation by 25% & increase temps
futRasts[[3]] <- futRasts[[3]]*0.75
futRasts[[4]] <- futRasts[[4]]+2
futRasts[[5]] <- futRasts[[5]]+3

We again use the predict function, but with time=TRUE and provide the current and future climate raster stacks. Th resulting map shows the expected magnitude of change in vegetation composition, which can be interpreted as a biologically-scaled metric of climate stress.

timePred <- predict(gdm.rast, swBioclims, time=T, predRasts=futRasts)
terra::plot(timePred, col=rgb.tables(1000))

Predicted magnitude of biological change through time

Transforming spatial predictor layers using a fitted GDM

Using GDM to transform environmental data rescales the individual predictors to a common scale of biological importance. Spatially explicit predictor data to be transformed can be a raster stack or brick with one layer per predictor. If the model was fit with geographical distance and raster data are provided to the transform function, there is no need to provide x- or y-raster layers as these will be generated automatically. However, the character names of the x- and y-coordinates (e.g., “Long” and “Lat”) used to fit the model need to be provided.

First we fit a new model using only the climate variables.

# fit the GDM
gdmRastMod <- gdm(data=gdmTab.rast, geo=TRUE)

We then use the gdm.transform function to rescale the rasters.

transRasts <- gdm.transform(model=gdmRastMod, data=swBioclims)
terra::plot(transRasts, col=rgb.tables(1000))

Visualizing multi-dimensional biological patterns

Site-pair based biological distances are difficult to visualize. However, if the transform function is applied to rasters, the resulting multi-dimensional biological space can be mapped to reveal biological patterns in geographic space. Alternatively, a biplot can be used to depict where sites fall relative to each other in biological space and therefore how sites differ in predicted biological composition. In either case, the multi-dimensional biological space can be most effectively visualized by taking a PCA to reduce dimensionality and assigning the first three components to an RGB color palette. In the resulting map, color similarity corresponds to the similarity of expected plant species composition (in other words, cells with similar colors are expected to contain similar plant communities).

# Get the data from the gdm transformed rasters as a table
# rastDat <- na.omit(terra::values(transRasts))

# The PCA can be fit on a sample of grid cells if the rasters are large
rastDat <- terra::spatSample(transRasts, 50000, na.rm = TRUE) 

# perform the principle components analysis
pcaSamp <- prcomp(rastDat)
 
# Predict the first three principle components for every cell in the rasters
# note the use of the 'index' argument
pcaRast <- terra::predict(transRasts, pcaSamp, index=1:3)

# scale the PCA rasters to make full use of the colour spectrum
pcaRast[[1]] <- terra::app(pcaRast[[1]], fun = scales::rescale, to = c(0, 255))
pcaRast[[2]] <- terra::app(pcaRast[[2]], fun = scales::rescale, to = c(0, 255))
pcaRast[[3]] <- terra::app(pcaRast[[3]], fun = scales::rescale, to = c(0, 255))

# Plot the three PCA rasters simultaneously, each representing a different colour 
#  (red, green, blue)
terra::plotRGB(pcaRast, r=1, g=2, b=3)

Predicted spatial variation in plant species composition. Colors represent gradients in species composition derived from transformed environmental predictors. Locations with similar colors are expected to contain similar plant communities.

SECTION 2 - Advanced spatial analyses using GDM

gdm's People

Contributors

Stargazers

Watchers

Forkers

ianmcfadden howl-anderson kdm9 feltrot joshualiuxu mitramenon cmu002 2692289267 andrewiom rvalavi xinxxxin

gdm's Issues

dimname error when predictor gets dropped in gdm.varImp

When gdm.varImp() drops a predictor variable due to zero sum of I-spline coefficients, it throws a dimname error when setting the colnames of varImpTable

Error:

GPC.varImp <- gdm::gdm.varImp(GPC_dat, geo = T, nPerm = 1000, parallel = FALSE)
Fitting initial model with all 4 predictors...
Sum of I-spline coefficients for predictors(s) X = 0
Removing predictor(s) X and proceeding with permutation testing...
Error in dimnames(x) <- dn : 
  length of 'dimnames' [2] not equal to array extent

Suggested fix is to change lines 411-416 of gdm.variable.importance.R from:

  varImpTable <- matrix(NA, nVars, nVars*5-1)
  rownames(varImpTable) <- varNames
  colnames(varImpTable) <- c("All predictors",
                             "1-removed",
                             paste(seq(2,nVars*5-2), "-removed", sep=""))

to something like:

  varImpTable <- matrix(NA, nVars, nVars - 1)
  rownames(varImpTable) <- varNames
  varImpTable_colnames <- c("All predictors", "1-removed")
  if(nVars - 2 > 1) {
    varImpTable_colnames <- c(varImpTable_colnames, paste(seq(2, nVars - 2), "-removed", sep = ""))
  }
  colnames(varImpTable) <- varImpTable_colnames

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.0 (2022-04-22)
#>  os       macOS Monterey 12.6
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/Los_Angeles
#>  date     2022-09-27
#>  pandoc   2.18 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  cli           3.3.0   2022-04-25 [1] CRAN (R 4.2.0)
#>  clipr         0.8.0   2022-02-22 [1] CRAN (R 4.2.0)
#>  digest        0.6.29  2021-12-01 [1] CRAN (R 4.2.0)
#>  evaluate      0.16    2022-08-09 [1] CRAN (R 4.2.0)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.2.0)
#>  fs            1.5.2   2021-12-08 [1] CRAN (R 4.2.0)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.2.0)
#>  htmltools     0.5.3   2022-07-18 [1] CRAN (R 4.2.0)
#>  knitr         1.40    2022-08-24 [1] CRAN (R 4.2.0)
#>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.2.0)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  reprex        2.0.2   2022-08-17 [1] CRAN (R 4.2.0)
#>  rlang         1.0.5   2022-08-31 [1] CRAN (R 4.2.0)
#>  rmarkdown     2.16    2022-08-24 [1] CRAN (R 4.2.0)
#>  rstudioapi    0.14    2022-08-22 [1] CRAN (R 4.2.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi       1.7.8   2022-07-11 [1] CRAN (R 4.2.0)
#>  stringr       1.4.1   2022-08-20 [1] CRAN (R 4.2.0)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun          0.32    2022-08-10 [1] CRAN (R 4.2.0)
#>  yaml          2.3.5   2022-02-21 [1] CRAN (R 4.2.0)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

gdm.varImp() returns NA when using only one covariate

I run gdm.varImp() with only one covariate (I tried both "awcA" and "phTotal") and it returns "NA" for "model deviance", "p-value" etc. This does not happen if I run it with both "awcA" and "phTotal".

library(gdm)

sppData <- southwest[c(1,2,13,14)]
envTab <- southwest[c(2:ncol(southwest))]
sitePairTab <- formatsitepair(sppData, 2, XColumn="Long", YColumn="Lat",
                              sppColumn="species", siteColumn="site", predData=envTab)

library(tidyverse)
test<- sitePairTab %>% select(distance,weights,s1.xCoord,s1.yCoord,s2.xCoord,
                              s2.yCoord,s1.awcA,s2.awcA)


modTest <- gdm.varImp(test, geo=F, nPerm=50, parallel=T, cores=10)

> modTest$`Predictor p-values`
     All predictors
awcA             NA
> modTest$`Model assessment`
                           All predictors
Model deviance                         NA
Percent deviance explained             NA
Model p-value                          NA
Fitted permutations                    NA
> modTest$`Predictor Importance`
     All predictors
awcA             NA
> modTest$`Model Convergence`
     All predictors
awcA             NA
> modTest$`Predictor p-values`
     All predictors
awcA             NA
> modTest$`Model assessment`
                           All predictors
Model deviance                         NA
Percent deviance explained             NA
Model p-value                          NA
Fitted permutations                    NA
> modTest$`Predictor Importance`
     All predictors
awcA             NA
> modTest$`Model Convergence`
     All predictors
awcA             NA

Non-default splines argument breaks isplineExtract()

The original isplineExtract() function is hardcoded for the default of 3 splines per variable. The problem is in this line of the function definition:

xDat[,i] <-  seq(from=model$knots[[(i*3)-2]],to=model$knots[[(i*3)]], length=PSAMPLE)

And it can be fixed with:

xDat[,i] <-  seq(from=model$knots[[(i*numsplines)-(numsplines-1)]],to=model$knots[[(i*numsplines)]], length=PSAMPLE)

error in formatsiterepair function 'aggregation function missing: defaulting to length'

I followed the same code used in vignette (https://github.com/fitzLab-AL/GDM/blob/master/vignettes/gdmVignette.pdf) for formatsiterepair(). Although my data structure was exactly similar to that used in the vignette. My dataset has about 6000 rows. Using 1000 rows don't give any error. There is not NA values in my dataset. Any solution please?

> gdmTab <- formatsitepair(sppTab, bioFormat = 2, siteColumn = "site", XColumn = "Long", YColumn = "Lat", sppColumn = "grasshopper", predData = envTab) Aggregation function missing: defaulting to length Error in createsitepair(dist = distData, spdata = bioData, envInfo = predData, : The length of distance values are not the same as the expected number of rows of the site-pair table, unable to proceed. In addition: Warning message: In formatsitepair(sppTab, bioFormat = 2, siteColumn = "site", XColumn = "Long", : No abundance column specified, assuming species data are presence

Discussion: GDM assumption on increasing dissimilarity

It is not a issue but I would like to open a discussion on one assumption I found in the Ferrier et al (2007), they say:

In GDM we make the reasonable assumption that compositional dissimilarity can only increase, not decrease, with increasing separation of sites along an environmental gradient.

Despite I think it is a reasonable assumption I was wondering that the opposite may be easy to test if we scale the data to its inverse. For instance, first I can check if plant dissimilarity increases in a increasing P-content in soil and then I would look if it increase in an scale that reverses the pH values. Does it make sense?

Another approach would be to analyze the data in a similarity distance instead of a dissimilarity one. What would be the best approach?

gdm.VarImp failes if the sum of I-spline coefficients for the "Geographic" predictor == 0

From line 286 in gdm.variable.importance.R predictors got removed where the I-Spline coefficients == 0. If this is the "Geographic" predictor which is not a column name in the spTable, the whole table gets empty. I don't have the full overview for this rather complex function, but I think it would be enough to check for this case and set geo <- F:

  zeroSum <- fullGDM$predictors[which(sumCoeff==0)]
  if(length(zeroSum)>0){
    message(paste0("Sum of I-spline coefficients for predictors(s) ", zeroSum," = 0"))
    Sys.sleep(1)
    message(paste0("Removing predictor(s) ", zeroSum, " and proceeding with permutation testing..."))
    Sys.sleep(1)
    
    for(z in zeroSum){
      ##select variable columns to be removed from original site-pair table
      if(z == "Geographic") { geo <- F} else {
        testVarCols1 <- grep(paste("^s1.", z, "$", sep=""), colnames(spTable))
        testVarCols2 <- grep(paste("^s2.", z, "$", sep=""), colnames(spTable))
        spTable <- spTable[,-c(testVarCols1, testVarCols2)]
      }
    }
  }

predict error

Hi @fitzLab-AL @mliskAL

I would like to make some predictions but got an error.
I used two sets of factors in my gdm model,1) 19 bioclimate factors(original value) and 2) land cover data(range from 0~1). When I tried to make a prediction through time the result only had zero value, I checked whether this was brought out by having the same raster layer(i.e. climate or land cover not change) for the two time periods stack but the answer was not.
I've had no problems with the gdm model using 19 bioclimate factors in the past. So I run two models with the same response data but using only climate or land cover data, this time the predict function worked well.
So I cannot figure out why the function produces an error result when the two sets of factors are combined, they have the same resolution and extent, and different period stacks have the same order for factors.

Looking forward to your reply and thanks in advance.

Reproducing Plot 4a in Mokany et al (2022)

Hello @fitzLab-AL and gdm users (also @rvalavi if you're inclined),

Is the code for reproducing plot 4a available? Section 1.7 of the supplementary materials (and online material - link below) includes code for reproducing plot 4b, but not 4a. A similar plot (to 4a) is one of many produced via plot(gdm). I can reproduce that individual plot from plot(gdm) (in part) using plot(gdm$ecological, gdm$observed). But I lack the line showing GDM-predicted dissimilarity.

Having this code will also help clarify whether a data object from predict(gdm, gdmTab) is required to produce plot 4a; and thus further reveal the relationship between plots produced via plot(gdm) and plot(x, gdm.pred).

My expectation is that other users have had this question. Thanks very much

[https://github.com/fitzLab-AL/gdm?tab=readme-ov-file#visualizing-multi-dimensional-biological-patterns]

gdm.varImp sensitive to predictor names

Due to line 569 in the file gdm.variable.importance.R:
grepper <- grep(var, names(permVarDev))
the function gdm.varImp failes if the varible names are character subssets of each other. For example for varibles "var1" and "var10" grep returns both while searching for the first one. The function failes afterwards while trying:
varDevTab <- permVarDev[[grepper]]
with the error message: "Error in permVarDev[[grepper]] : recursive indexing failed at level 2"
I would suggest using:
grepper <- which(var == names(permVarDev))
at line 569 to search for an exact match.

predData is necessary in formatsitepair, even though the help file states it is not

I'm trying to use formatsitepair to analyse the effect of two sets of distance matrices on a third one, but the formatsitepair function gives the following error:
Error in formatsitepair(bioData = In.fmt, bioFormat = 3, abundance = TRUE, :
argument "predData" is missing, with no default

However, in the help of this function, distPreds is stated to be:
An optional list of distance matrices to be used as predictors, either in combination with, or as a substitute for, predData.

How am I supposed to use only distance matrices and no predData?

Consider forcing user to specify spline number

I have found that the results of a GDM can be very sensitive to spline number (supplied in the splines argument), and I would suggest that future versions of gdm() require users to supply the splines argument to avoid false confidence in the default settings.

Along the same lines, it would be helpful to include some means of comparing models of different complexity to decide what the spline number should be. Ferrier et al. (2007) suggest such procedures, albeit vaguely:

"The predictors included in a model, and the number of I-spline basis functions employed for each predictor (controlling the allowable complexity of the nonlinear transformation of that predictor), can be determined using various automated selection strategies, including forward-selection, backward-elimination, and stepwise procedures."

GDM without Coordinates

Hello, I would like to perform GDM, but my data comes from a designed experiment so the geographical distance do not provide much information. Therefore, I would like to know if it is possible to apply GDM without adding this information.

Thanks in advance

par(ask=TRUE) will improve plot.gdm

It took me a while to figure out that the plots of predictor variable splines with plot.gdm() were done sequentially, and that I had to press BACK in RStudio to find them. A more intuitive way would be if plot.gdm() set par(ask=TRUE) before plotting to allow users to click through the plots, then par(ask=FALSE) to turn it off at the end.

Geographic distance calculated using Pythagorean theorem

Dear Fitzpatrick Lab,

I noticed that the geographic distance between two lat-lon points is calculated using the Pythagorean theorem (e.g. v <- sqrt((data[,3]-data[,5])^2 + (data[,4]-data[,6])^2)) rather than a formula that takes into account the spherical(-ish) shape of the Earth. This causes the calculated distances to become more distorted the farther samples are from the equator. May I recommend using something like the haversine formula to calculate geographic distance as an alternative?

Cheers,
Clara

subsample argument in formatsitepair does not actually subsample sites

regardless of what value is provided for the subsample argument, function always returns 100% of site pairs.

formatsiterepair undefined columns

Hi! I'm trying to produce a gdm with presence/absence data that I have and this is the issue that I am having.

This is the command that I am running, and the error that I get:
gdmtab <- formatsitepair(bioData = spp, bioFormat = 2, XColumn="Longitude", YColumn="Latitude", sppColumn = "Trait", site= "zone", predData = env) No abundance column was specified, so the species data are assumed to be presences.Aggregation function missing: defaulting to length cannot xtfrm data framesError in[.data.frame(envInfo, , "siteUltimateCoolness") : undefined columns selected
Here is the head of my biodata:
head(spp)
Longitude Latitude Trait zone
1 -110.3976 31.46074 n A
2 -104.0201 30.65312 n W
3 -109.2110 31.89133 y B
4 -109.2393 31.87379 y B
5 -107.4298 28.77738 y BB
6 -105.7348 28.92134 n CC

Trait is the y/n of having the trait, and the zone is the locality found. My environment data is worldclim, elevation, Latitude, Longitude. Do you have any suggestions for how to address this issue? Thanks!

Error when running gdm.varImp()

Running gdm.varImp() with the example works as expected

library(gdm)

sppData <- southwest[c(1,2,13,14)]
envTab <- southwest[c(2:ncol(southwest))]
sitePairTab <- formatsitepair(sppData, 2, XColumn="Long", YColumn="Lat",
                              sppColumn="species", siteColumn="site", predData=envTab)

modTest <- gdm.varImp(sitePairTab, geo=T, nPerm=50, parallel=T, cores=10)

However, when I try to run it with only a few covariates, it returns an error:

library(tidyverse)
test<- sitePairTab %>% select(distance,weights,s1.xCoord,s1.yCoord,s2.xCoord,
                              s2.yCoord,s1.awcA,s1.phTotal,s2.awcA,s2.phTotal)


modTest <- gdm.varImp(test, geo=F, nPerm=50, parallel=T, cores=10)



Fitting initial model with all 2 predictors...
Error in matrix(NA, 4, nVars, dimnames = list(c("Model deviance", "Percent deviance explained",  : 
  length of 'dimnames' [2] not equal to array extent

Am I doing something wrong?

Variable importance assessment (knots error)

Hi all,

I would like to assess variance importance using gdm.varImp. In the manual, it says that if user supplied knots are provided, the knots argument should be the same as used in model fitting. However, this results in an error:

EXPLO_gdmRes <- gdm(data = EXPLO_spt,
                          geo = FALSE,
                          splines = EXPLO_splines_vector,
                          knots = EXPLO_knots_values)
      
      # OPERATIONS TO PERFORM IF A MODEL WAS SUCCESSFULLY FIT
      if (!is.null(EXPLO_gdmRes)) {
        # 1. PERFORM PREDICTOR AND MODEL SIGNIFICANCE TESTING
        EXPLO_varImp <- gdm.varImp(
          EXPLO_spt,
          geo = FALSE,
          splines = EXPLO_splines_vector,
          knots = EXPLO_knots_values,
          predSelect = TRUE,
          nPerm = 100,
          pValue = 0.05,
          parallel = TRUE,
          cores = 3,
          sampleSites = 1, # the % of sites to use for purmutation routines; 1 = 100 %
          sampleSitePairs = 1, # the % of site pairs to use for purmutation routines; 1 = 100 %
          outFile = paste0(i, "_", y, "_", s, "_varImp_log.RData")
          )

Creating 100 permuted site-pair tables...
Starting model assessment...
Error in gdm(currSitePair, geo = geo, splines = splines, knots = knots) : 
  When knots are supplied by the user, there should be 6 items in the knots argument, not 141 items.

Obviously, following the construction of permuted site-pair tables, the knots argument is not longer the one used for model fitting.
How can I solve this issue?

Thank your for your assistance.

Custom weights - error incorrect number of dimensions

I would like to use the custom weights option of the formatsitepair function. However the code produces the error: Error in custWeights[, "gettingCoolSiteColumn"][, 1] : incorrect number of dimensions. I think the offending line is hwap <- custWeights[, "gettingCoolSiteColumn"][, 1], which seems to incorrectly subset the custom weights data.frame (unless I've incorrectly set up the custom weights data.frame). Here's a reproducible example:

# set up data
sppData <- southwest[c(1,2,13,14)]
envTab <- southwest[c(2:ncol(southwest))]

# create custom weights data.frame
custw <- data.frame(site = unique(sppData$site), weights = runif(length(unique(sppData$site))))

# run format site pair
sitePairTab <- formatsitepair(sppData, 2, XColumn = "Long", YColumn = "Lat", sppColumn = "species", 
siteColumn = "site", predData = envTab, weightType = "custom", custWeights = custw, verbose = TRUE)

Best,
Rob

GDM not supported anymore by CRAN

I'm sure you're aware of this, but CRAN doesn't allow installing the GDM package anymore. It'd be nice if you updated the package so that it continues to be supported.

Issue with long vectors

Hello,

I have a large gridded (raster) geographic area (>17000 sites) for which I have a phylogenetic distance matrix (phyloSimpson). When I create the site x pair matrix with the distance matrix, the lat/lon info, and my predictor variables, the total vector length of the matrix is over 2^31 elements, and thus is long vector. When I run gdm on this site x pair matrix, I get the following error: "long vectors (argument 2) are not supported in .C" (argument 2 would be the matrix input for GDM_FitFromTable).

I notice the R code for the various functions in this package use .C to call the C++ code. From searching around, it looks like .C and .Fortran cannot handle long vectors but .Call can. When I make a duplicate function locally of gdm using .Call instead of .C (for GDM_FitFromTable), I am able to run the code on my long vector, however, I run into a segmentation fault where memory does not map. I'm guessing this is related to the mismatch in my R environment and the original .cpp from the package, though I am not sure.

Would it be possible to make an adjustment for long vectors in the gdm package? I could also subset the matrix and use predict.gdm for the other cells if needed, though I would like to try gdm for the entire matrix, and long vectors may be useful for other large raster-based projects.

Please let me know if you would like any more information or if I am mistaken about the source of the issue. Thank you.

Possible bug in gdm.varImp

Hello,
gdm.varImp() is returning the following error message:

"Fitting initial model with all 18 predictors...
Sum of I-spline coefficients for predictors(s) Geographic = 0Sum of I-spline coefficients for predictors(s) t.mean = 0Sum of I-spline coefficients for predictors(s) t.max = 0
Removing predictor(s) Geographic and proceeding with permutation testing...Removing predictor(s) t.mean and proceeding with permutation testing...Removing predictor(s) t.max and proceeding with permutation testing...
Error in [.data.frame(spTable, c(7:(6 + nVars))) :
undefined columns selected"

Used this function before with no issues. Any hints? Thanks a lot!

Issues with gdm.varImp()

Hello,
I've encountered two issues with the gdm.varImp() function. The first is easy to solve:
For GDMs fit to just one environmental variable, with geo=T, gdm.varImp() fails to run, due to the line

varNames <- colnames(spTable[,c(7:6+nVars))])

Presumably this should be replaced with

varNames <- colnames(spTable)[c(7:6+nVars))]

Second:
gdm.varImp() decides how many sites to expect based on the unique number of site-coordinates. This seems reasonable enough, except that some users (like me) will input two different site names with different environmental conditions at a single set of coordinates (due to a matched-pair design). This is easy enough to solve by adding a tiny offset to the coordinates of the sites in condition 2, but it would probably stump anybody not comfortable looking at the source code. It would be nice if gdm.varImp figured out how many sites to expect based on the site names, instead.

Cheers
Jacob

plotUncertainty(): more documentation, custom plotting method

A couple notes on plotUncertainty():

After reading the documentation, I'm still a bit unclear on exactly how exactly it works and how I would explain it in a paper. Could the documentation on this command be elaborated a bit?
I use isplineExtract() to tabulate splines for custom plotting, but there does not seem to be any corresponding method for extract the results of plotUncertainty() for custom plotting. This would be nice.

Non-default spline number breaks gdm.varImp()

This works fine:
gdm.varImp_WN <- gdm.varImp(sitepair_WN, geo = FALSE)

This breaks in a puzzling way.
gdm.varImp_WN <- gdm.varImp(sitepair_WN, geo = FALSE, splines = c(6, 6, 6, 6, 6))

Error in gdm(currSitePair, geo = geo, splines = splines, knots = knots) : 
  Number of splines does not equal the number of predictors. 
              Splines argument has 6 items but needs 5 items.

One might think that shortening the splines argument by one item would satisfy it:
gdm.varImp_WN <- gdm.varImp(sitepair_WN, geo = FALSE, splines = c(6, 6, 6, 6))

. . . but it is fickle. Now it wants the 5th item back.

Error in gdm(currSitePair, geo = geo, splines = splines, knots = knots) : 
  Number of splines does not equal the number of predictors. 
              Splines argument has 4 items but needs 5 items.

gdm.transform function error

R-Studio crashes when trying to run the gdm.transform function for a rasterStack object, giving the following error:

Error in transformed[, i] <- tmp :
'pairlist' object cannot be coerced to type 'double'

Problem with gdm.varImp function

Hello,

I am using the latest CRAN version of GDM (v. 1.4.2.2) and I am able to run the gdm function on my data and the results look reasonable.

However, I keep getting this error when running gdm.varImp:

modTest_b <- gdm.varImp(gdm_sitepair, fullModelOnly = F, geo=T, nPerm=100, parallel=F,splines=rep(3,13))
Error in data[, 3] - data[, 5] : non-numeric argument to binary operator

I realize that asking for a solution without details of the input data is a bit nebulous.

Therefore, I have attached the data that can be read in with dget(gdm_sitepair.txt)->gdm.varImp.txt

Advice would be really appreciated!

Thanks so much for any help!

Frank

gdm_sitepair.txt

There appears to be a typo in the chunk of code that determines species count for each site and then filters sites with few species. There should be a comma inside the braket after "bioData so the apply function does not return "dim(x) must have a positive length." I think this is how the line in question should read: sppTotals <- cbind(as.data.frame(zst[,"siteCol"]), apply(sppDat, 1, function(m){sum(as.numeric(m))})) . The formatsitepair function does not work for me and I think this is why.

Installation error on Ubuntu 20.04 with R 4.1.3

Hi,
I'm running Ubuntu 20.04 and R 4.1.3 and I'm getting this installation error. I also tried devtools::install_github("fitzLab-AL/gdm") with the same result.

install.packages("gdm")
Installing package into ‘/home/fd/R/x86_64-pc-linux-gnu-library/4.1’
(as ‘lib’ is unspecified)
trying URL 'https://cloud.r-project.org/src/contrib/gdm_1.5.0-1.tar.gz'
Content type 'application/x-gzip' length 688341 bytes (672 KB)
==================================================
downloaded 672 KB

* installing *source* package ‘gdm’ ...
** package ‘gdm’ successfully unpacked and MD5 sums checked
** using staged installation
** libs
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG  -I'/usr/lib/R/site-library/Rcpp/include'    -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-lENDSu/r-base-4.1.3=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c Gdmlib.cpp -o Gdmlib.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG  -I'/usr/lib/R/site-library/Rcpp/include'    -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-lENDSu/r-base-4.1.3=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c Message.cpp -o Message.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG  -I'/usr/lib/R/site-library/Rcpp/include'    -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-lENDSu/r-base-4.1.3=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c NNLSDoubleCore.cpp -o NNLSDoubleCore.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG  -I'/usr/lib/R/site-library/Rcpp/include'    -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-lENDSu/r-base-4.1.3=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c NNLS_Double.cpp -o NNLS_Double.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG  -I'/usr/lib/R/site-library/Rcpp/include'    -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-lENDSu/r-base-4.1.3=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c dllmain.cpp -o dllmain.o
gcc -I"/usr/share/R/include" -DNDEBUG  -I'/usr/lib/R/site-library/Rcpp/include'    -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-lENDSu/r-base-4.1.3=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c registerDynamicSymbol.c -o registerDynamicSymbol.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG  -I'/usr/lib/R/site-library/Rcpp/include'    -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-lENDSu/r-base-4.1.3=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c stdafx.cpp -o stdafx.o
g++ -std=gnu++11 -shared -L/usr/lib/R/lib -Wl,-Bsymbolic-functions -Wl,-z,relro -o gdm.so Gdmlib.o Message.o NNLSDoubleCore.o NNLS_Double.o dllmain.o registerDynamicSymbol.o stdafx.o -L/usr/lib/R/lib -lR
installing to /home/fd/R/x86_64-pc-linux-gnu-library/4.1/00LOCK-gdm/00new/gdm/libs
** R
** data
*** moving datasets to lazyload DB
** inst
** byte-compile and prepare package for lazy loading
double free or corruption (out)
Aborted (core dumped)
ERROR: lazy loading failed for package ‘gdm’
* removing ‘/home/fd/R/x86_64-pc-linux-gnu-library/4.1/gdm’
Warning in install.packages :
  installation of package ‘gdm’ had non-zero exit status

The downloaded source packages are in
	‘/tmp/RtmpfbOcEj/downloaded_packages’

How can I obtain the unique identifiers of the sites that are being compared by the Bray-Curtis index using formatsitepair()

I would like two add two columns to an object created by formatsitepair() containing unique identifiers for the sites that are being compared by the Bray-Curtis index. Something like this:

site_id1, site_id2, distance, weights,
1,2,0.5,1
1,3,0.8,1

I run the example code

head(southwest)
sppData <- southwest[c(1,2,13,14)]
envTab <- southwest[c(2:ncol(southwest))]

sitePairTab <- formatsitepair(sppData, 2, XColumn="Long", YColumn="Lat", sppColumn="species",
                              siteColumn="site", predData=envTab)

head(sitePairTab )

 distance weights s1.xCoord s1.yCoord s2.xCoord s2.yCoord s1.awcA s1.phTotal s1.sandA  s1.shcA s1.solumDepth
132   0.4485981       1   115.057 -29.40472  115.5677 -29.46599 23.0101   480.3266 83.99326 477.5656      1129.933
132.1 0.7575758       1   115.057 -29.40472  116.0789 -29.52556 23.0101   480.3266 83.99326 477.5656      1129.933
132.2 0.8939394       1   115.057 -29.40472  116.5907 -29.58342 23.0101   480.3266 83.99326 477.5656      1129.933
132.3 0.9178082       1   115.057 -29.40472  117.1029 -29.63957 23.0101   480.3266 83.99326 477.5656      1129.933

and noticed that some row.names seem to be a combination of two numbers (e.g. 131.1) but others do not (e.g. 132). I also noticed these numbers do not to match the column "site" from "envTab".

I think I need help from someone who knows the inner workings of the function.

Thank you.

gdm.varimp error

When I attempt to run gdm.varimp() I am returned with an error "Error in [.data.frame(x, , -c(remVarCols1, remVarCols2)) :
object 'remVarCols1' not found"

I have a number of datasets and for some of them the code works and for others it doesn't. The only difference is the species included in the analysis, the landscape variables are all the same for each site. I have attached an example to demonstrate the error, I hope it works! (and by that I mean I hope it gives you an error)

Thank you

Using terra() data objects with gdm ver v1.6.0 [e.g., for visualizing Multi-dimensional Biological patterns]

Hello @rvalavi and @fitzLab-AL,

I'm testing gdm ver 1.6.0 using terra data objects. To lower computational time for this test, I limited my model to 3 predictors (instead of the 17 I'm normally running) with geo=TRUE. Tests were run on a desktop PC.

The code in Section 2.1 from Appendix S1 of the user guide (Mokany et al 2022) ran okay with a couple minor code amendments. Next, I want to create maps so I can visualize compositional variation across geographic space. The code in section 2.2 of Appendix S1 worked fine until I got to "scaling the PCA rasters to make full use of the colour spectrum"

I get an error when running any one of the following code chunks. Note, I added a space after the "at" symbol in these code chunks because a github user has that as a handle i.e., @ data (no space)

pcaRast[[1]] <- (pcaRast[[1]]-pcaRast[[1]]@ data@min) / (pcaRast[[1]]@ data@max-pcaRast[[1]]@ data@min)*255
pcaRast[[2]] <- (pcaRast[[2]]-pcaRast[[2]]@ data@min) / (pcaRast[[2]]@ data@max-pcaRast[[2]]@ data@min)*255
pcaRast[[3]] <- (pcaRast[[3]]-pcaRast[[3]]@ data@min) / (pcaRast[[3]]@ data@max-pcaRast[[3]]@ data@min)*255

-- Error: no slot of name "data" for this object of class "SpatRaster"

I'm assuming this is because SpatRaster objects were not part of the original pipeline. I tried changing the references (e.g., [[1]]) in these chunks and it didn't work. Note, the name of each raster in pcaRast are lyr1, lyr2, and lyr3. Any suggestions?

Overall, the addition of terra() functionality to this version of gdm makes things much faster, which is great to see.

Thanks.

Increasing the number of splines and knots reduces propability of model fitting

Hello all,
first of all, thank you for providing this powerful piece of software to the scientific community.

I have quite complex models to fit for multiple species and sub-setting approaches. If I use the default settings (# splines 3), at least two models can be fit. However, as increasing the number of splines and knots might result in accommodating more complex models, I hoped I could get more models to fit by increasing these parameters. From the manual, it is not completely clear whether the number of knots is automatically adjusted for the number of splines requested. Hence, I decided to calculate the quantiles for the knots myself using the following approach:

 EXPLO_npred_splines <- ((ncol(EXPLO_spt) - 6) / 2)
      EXPLO_number_splines <- GLOBAL_nsplines
      EXPLO_splines_vector <-
        rep(EXPLO_number_splines, EXPLO_npred_splines)
      
      # Adjust knot vector
      EXPLO_knots_values <- c()
      for (o in 2:ncol(EXPLO_env_data)) {
        EXPLO_pred_col <- EXPLO_env_data[o]
        EXPLO_quant_vec <-
          c(0:(EXPLO_number_splines - 1)) / (EXPLO_number_splines - 1)
        EXPLO_pred_vals <-
          quantile(EXPLO_pred_col,
                   EXPLO_quant_vec,
                   na.rm = TRUE,
                   names = FALSE)
        # Append the knots values to vector to hold the values for all predictors
        EXPLO_knots_values <-
          c(EXPLO_knots_values, EXPLO_pred_vals)
      }

However, if I pursue this approach, the two models that were fit with the default values fail. I tried different values for the number of splines (4, 5, 10, 50, 100) but still, no model can be fit. I am aware of the fact that the impact of these parameters is not fully understood yet (https://github.com/fitzLab-AL/gdm#gdm-fitting). However, I am really surprised that even changing the number of splines to 4 results in model fit failures and I still don't understand why.

Thank you for your assistance.

Error running formatsitepair function

Dear Fitzpatrick Lab,

First, thanks for this great package. I am experiencing an issue when running the function formatsitepair. As biological data, I use a dissimilarity matrix, and as environmental predictor I use a distance matrix with the argument distPreds. As the formatsitepair function does not accept only distances matrices as predictors, I provide one fake predictor for the argument predData what includes coordinate columns and a fake variable. However, it returns the following error:

Error in if ((dim(mat1)[1] != dim(mat)[1]) & (dim(mat1)[2] != dim(mat)[2])) { :
argument is of length zero

This is my code, and I attach the input files:

sim <- read.table("sim.txt", h=T)
euc= read.table("euc.txt", h=T)
fake= read.table("fake.txt", h=T)

simdata = formatsitepair(bioData=sim, bioFormat=3, predData=fake, distPreds=euc, XColumn="x", YColumn="y", siteColumn="site")

I am not able to find what I am doing wrong here.

Many tanks in advance,

Pedro

euc.txt
fake.txt
sim.txt

Specifying non-equal spline numbers across variables breaks plot() and isplineExtract()

When I specify non-equal numbers of splines for the variables in my model, the model call itself seems to run correctly, but both plot() and isplineExtract() throw "subscript out of bounds" errors. I would presume this is because these functions are expecting equal dimensions across variables and then trying to access rows that don't exist.

gdm_WN6 <- gdm(sitepair_WN, geo = FALSE, splines = c(6, 4, 5, 6, 6))

plot(gdm_WN6)

Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'plot': subscript out of bounds

Can you share a vignette of phylogenetic generalised dissimilarity model with gdm package?

Hi, @fitzLab-AL & @mliskAL

I have trouble running phylo-GDM, the site-by-site biological distance (dissimilarity) is either 0 or 1, how to address it?
Can gdm package implement cluster(Laffan et al., 2010) based on a Sorenson distance in Biodiverse tool?

Data:
Example_tree.nex
Example_site_data.csv

library(gdm)
library(ape)
library(tidytree)
tre <- read.nexus("Example_tree.nex")
tre$tip.label <- gsub("Genus:","",tre$tip.label)
tre_tb <- as_tibble(tre)
dt <- read.csv("Example_site_data.csv")
test <- dt %>% left_join(tre_tb[,3:4], by = c("species" = "label"))

sp <- test[,c("num","species","x", "y")]
phy <- test[,c("num","branch.length")]

exFormat <- formatsitepair(sp, bioFormat=2, siteColumn="num", XColumn="x",sppColumn = "species",
                             YColumn="y", predData=phy)

head(exFormat)

distance weights s1.xCoord s1.yCoord s2.xCoord s2.yCoord s1.branch.length s2.branch.length
1        0       1   3229628   3078198   3216986   2951192       0.07551968       0.16666667
2        0       1   3229628   3078198   3216038   2960749       0.07551968       0.07551968
3        1       1   3229628   3078198   3217265   2960874       0.07551968       0.16666667
4        1       1   3229628   3078198   3205746   2963060       0.07551968       0.07551968
5        1       1   3229628   3078198   3214891   2928147       0.07551968       0.16666667
6        1       1   3229628   3078198   3219317   2971030       0.07551968       0.03299437

reference
Laffan, S. W., Lubarsky, E., & Rosauer, D. F. (2010). Biodiverse, a tool for the spatial analysis of biological and related diversity. Ecography, 33(4), 643-647. 10.1111/j.1600-0587.2010.06237.x

Thank you！

Looking forward to your reply.