Community modelling workflow

Generalised Dissimilarity Modelling (GDM)

Author details: EcoCommons Platform
Contact details: [email protected]
Copyright statement: This script is the product of EcoCommons platform.
Please refer to the EcoCommons website for more details:
https://www.ecocommons.org.au/
Date: October 2022

About this document

Generalised Dissimilarity Modelling (GDM) has become a widely-used method to link differences in composition between samples and explanatory environmental variables ("covariates"). The dependent or predicted variable in a GDM may be any form of distance or dissimilarity measure scaled to range between 0 (exactly matching pairs)and 1 (totally dissimilar pairs). GDMs may be fitted using any assemblage of covariates thought to be important in explaining the differences in composition between pairs of samples. Geographical distance can be optionally included as a covariate to account for distance-decay or "isolation by distance" (IBD) effects.

The prototype R-package, cmGDM is designed to implement GDM modelling as the first method within a new EcoCommons Community Modelling module. Fitting a GDM is performed by the R-package gdm available from the CRAN repository . cmGDM has been designed to implement a simple, robust workflow for basic fitting, review and reporting of GDMs.

In the near future, this material may form part of comprehensive support materials available to EcoCommons users. If you have any corrections or suggestions, please contact the EcoCommons support and communications team.

This document provides an outline of a prototype R-package, cmGDM, which is designed to implement GDM modelling as the first method within a new EcoCommons Community Modelling module. Fitting a GDM is performed by the R-package gdm available from the CRAN repository. cmGDM has been designed to implement a simple, robust workflow for basic fitting, review and reporting of GDMs.

Design concepts and principles

1. One streamlined protocol for data collation & upload

The R-package gdm provides a preparatory function, formatsitepair(), which allows users to present data components in four formats, or "bioformats" in gdm terminology, which are then compiled into a data table used by the function gdm() to fit the model. Although this flexibility is suitable for experienced R users, it can be very challenging for casual users to make appropriate choices among numerous possible combinations, and prepare data components in the required formats.

The four formats represent different combinations of pre-processed fundamental data tables needed to fit a GDM. The concept implemented in cmGDM is to begin with basic table formats and generate pre-processed or derived tables by asking users to supply necessary information. For example, users will be asked to load community data as either a presence-absence table, an abundance table, or a dissimilarity table. In the first two table types, they are standard tables which may be assembled by hand (in spreadsheets for example) or output by functions in other R-packages. For dissimilarity tables, many functions in other R-packages can produce dissimilarity tables including the R-package vegan, and several packages for analysing phylogenetic and population genetic data.

The path to data importation chosen during the development of cmGDM is to follow procedures referred to as "bioformat = 3" in the gdm package.

The expected advantages of this approach include:

Users can easily assemble necessary tables for each data component using any application with which they feel comfortable (typically, a spreadsheet application);
More robust data checking is possible enabling better guidance and support to users who need additional assistance;
No loss of modelling flexibility even though data input preparations are streamlined into a simplified protocol.

2. Staged data entry

Implementation of the data entry steps in the prototype workflow is based on the idea of staged data entry. That is, users are asked to step through a sequence of uploading each data component.

This approach is appropriate because of ordered chains of dependencies. For example, a basic requirement is that site/sample labels associated with geographical coordinates match those used in the community ("biological") data table. Feedback on errors encountered at each step is provided and progress to the next step is only possible when data quality checks have been passed for the current step.

3. Intercept data format errors

As mentioned above, data entry problems are intercepted at each step through the data upload sequence. These data quality checks duplicate many corresponding checks and error messages generated by the function formatsitepair(). However, the error messages returned by that function are not easily interpreted by less-experienced R-users. A key design decision was taken to perform data quality checks independent of formatsitepair() and generate error messages designed to be more informative for all users. This is particularly important given the design choice described in the next section.

4. Flexible data file formats

EcoCommons users have a wide range of skills and experience in preparing and manipulating data required for model fitting. More experienced users, particularly those users with good R skills, will be able to load data into an R session from a range of formats (e.g. Excel spreadsheet, comma-separated value or csv file, etc), re-organise and export data to a variety of formats suitable for use in fitting a GDM. Existing modelling modules in EcoCommons permit only the uploading of csv-formatted text files as data tables. A design decision taken in developing the prototype R-package cmGDM, was allowing users to upload data tables in a wide selection of standard data file types. This is easily implemented in R because of the very extensive list of R-packages available in the CRAN repository which cater for many file formats.

The file formats currently supported by cmGDM include:

comma-separated value (csv) text files;
TAB-separated (tsv) text files;
space-separated text files;
openDocument (ods) format spreadsheets;
Excel spreadsheets (xls & xlsx) spreadsheets.

For spreadsheets, users can select the sheet within a multi-sheet document (more correctly referred to as a 'workbook') by sheet number or name as this is a parameter available in the functions called from relevant packages to import spreadsheets.

5. Loosely OO design

A major design choice in the prototype package cmGDM was to use an object-oriented approach as much as possible. The package implements a simple S3 object of class "cm_experiment" which embodies the staged approach to data collation and quality checking.

Test Data set and testing process

An important aspect of implementing a flexible data entry front-end is rigorous testing of the many combinations or data states likely to be encountered in use. A test data suite was developed to facilitate comprehensive testing during development.

Data files

Sets of small data files within each data component required to fit a GDM have been prepared to provide a cross-section of the errors which must be trapped. The files and the conditions they test are listed in the following table. The files are available in the test_data folder of the GitLab repository.

File names reflect the error condition contained within each file (where error condition includes the state No error as indicated by "OK" within the file name).

Test procedure

The "Community modelling" Jupyter Notebook sets four tests examples.

Because a trapped error invokes the stop() function, running each test involves highlighting the appropriate lines and then clicking on the "Run" option in R-Studio. An error-checking test is expected to emit the appropriate error message to the R-console.

Running the "OK" test should emit no message - just a return to the command prompt. Typing thisExperiment$status on the command line will show that the status flag for the data component under test is TRUE. For example, immediately after instantiating a new experiment object we will see the following:

> library(cmGDM)
Loading required package: gdm
> thisExperiment <- cm_create_new_experiment("user123", "Blah")
> thisExperiment$status
      siteData_OK biologicalData_OK      covarData_OK predictionData_OK       modelFit_OK 
            FALSE             FALSE             FALSE             FALSE             FALSE 
>

After successfully loading site/sample data, we expect to see the following:

> thisExperiment$status
      siteData_OK biologicalData_OK      covarData_OK predictionData_OK       modelFit_OK 
             TRUE             FALSE             FALSE             FALSE             FALSE

The workflow follows the sequence left to right along the status vector. Progression depends on successful completion of the preceding step with the exception that a GDM model can be fitted (by a call to cm_run_gdm_experiment()) which sets 'modelFit_OK' to TRUE) immediately after a successful call to cm_load_covars(). Prediction covariates are only needed when a fitted GDM is being "projected" (i.e. used to make predictions) using a covariate set not used to fit the model. Two typical prediction situations are: (a) predicting composition at all grid cells covering a study area (note that a GDM is fitted only to environmental conditions at sites/sample locations); and (b) predicting changes in composition related to climate change either at the sites/sample locations used to fit the GDM or in grid cells covering a study region. At the time of writing, that function has not been written although the function cm_load_prediction_data() has been written and tested.

Integration into EcoCommons framework

Info supplied by user or uploaded by user, plus messages back to the user, is considered here. See source code for the R data structure below..

Data expected to be supplied by UI

field	data type	role
userID	string	Unique EcoCommons user ID
userName	string	Username (ID for humans)
experimentName	string	User-supplied meaningful name for the experiment
description	string	User-supplied description of the experiment
includeGeo	logical	Will this experiment include geographic distance as a covariate?

data$siteData$srcFile	string	File name of the site data table
data$siteData$siteCol	string	Index to the column storing site name or labels (maybe supplied as a column name or integer representing col number converted to string)
data$siteData$longitudeCol	string	Index to the column storing longitude of sites (maybe supplied as a column name or integer representing col number converted to string)
data$siteData$latitudeCol	string	Index to the column storing latitude of sites (maybe supplied as a column name or integer representing col number converted to string)
data$siteData$weightType	string	One of "equal" (default), "richness" or "custom"
data$siteData$weightCol	string	Optional for weightType == "equal" or weightType == "richness". Index to the column storing site weights (maybe supplied as a column name or integer representing col number converted to string)
data$siteData$dataTable	R data.frame	Site data table

data$biologicalData$srcFile	string	Full path to the file holding biological data
data$biologicalData$fileType	string	Which of the accepted file types is to be uploaded? Expected values incl. 'csv', 'txt', 'ods', 'xls', or 'xlsx'
data$biologicalData$sheet	string	Label for the sheet in data file when 'dataType' == ''
data$biologicalData$siteCol	string	Column in 'dataTable' storing the site/sample labels
data$biologicalData$dataType	string	String representing the 'absence' state when 'dataType' == ''
data$biologicalData$presenceMarker	string
data$biologicalData$absenceMarker	string	String representing the 'absence' state when 'dataType' == ''
data$biologicalData$dissimMeasure	string	Dissimilarity measure used to generate 'dissimMatrix'
data$biologicalData$dataTable	R data.frame	Stores raw community table supplied by the user
data$biologicalData$dissimMatrix	R numeric matrix	If dataType == 'Dissimilarity', stores uploaded dissimilarity matrix. Otherwise, stores result of applying 'dissomMeasure' to 'dataTable'.

data$covarData$srcFolder	string	Full path to the folder holding covariate layer files
data$covarData$filenames	string vector	Vector of names to covariate layers
data$covarData$label	string
data$covarData$covarNames	string vector	Names of the coavariate layers; defaults to the base name component of filenames
data$covarData$dataTable	R numeric matrix	Data for covariates at each site defined in siteData extracted from the stacked covariate layers

data$predictionData$srcFolder	string	Full path to the folder holding covariate layer files used for prediction to new environemtal conditions
data$predictionData$label	string
data$predictionData$covarNames	string vector	Names of the coavariate layers; defaults to the base name component of filenames; must match the covariate names stored in covarData which are used to fit the model
data$predictionData$dataTable	R numeric matrix	Data for covariates at each site defined in siteData extracted from the stacked prediction covariate layers

The following panel shows the source-code used to create the data structure of a cm_experiment S3 object. Note that a number of fields are populated by R code and are not solicited from the user via the GUI. These include:

data table structures storing derived data (e.g. a dissimilarity matrix created by function cm_load_community_data when the user has uploaded a presence-absence community table and asked for the Bray-Curtis dissimilarity measure to be computed)
values stored in the "status"" vector which are updated by each cm_load_XXXXX function upon successful uploading and processing of a data component
date fields which are likewise updated by code (e.g. dateCreated, dateDataUpdated, etc.)
data structures generated by a successful GDM fit and subsequent post-fit processing (e.g. the 'model' component)

  # Create (instantiate) a community modelling S3 object
  statusVec <- vector(length = 5) # type is "logical" by default
  names(statusVec) <- c("siteData_OK", "biologicalData_OK",
                        "covarData_OK", "predictionData_OK", "modelFit_OK")
  
  thisExperiment <- list(userID = userID,
                         userName = userName,
                         experimentName = experimentName,
                         dateCreated = as.character(Sys.Date()),
                         dateDataUpdated = "",
                         dateLastModelRun = "",
                         description = description,
                         status = statusVec,
                         includeGeo = FALSE,
                         data = list(siteData = list(srcFile = "",
                                                     siteCol = "",
                                                     longitudeCol = "",
                                                     latitudeCol = "",
                                                     weightType = "",
                                                     weightCol = "",
                                                     dataTable = data.frame()),
                                     biologicalData = list(srcFile = "",
                                                           fileType = "",
                                                           sheet = "",
                                                           siteCol = "",
                                                           dataType = "",
                                                           presenceMarker = "1",
                                                           absenceMarker = "0",
                                                           dissimMeasure = "",
                                                           sppFilter = 0,
                                                           dataTable = data.frame(),
                                                           dissimMatrix = matrix()),
                                     covarData = list(srcFolder = "",
                                                      filenames = "",
                                                      label = "",
                                                      covarNames = "",
                                                      covarSiteMin = numeric(),
                                                      covarSiteMax = numeric(),
                                                      covarExtentMin = numeric(),
                                                      covarExtentMax = numeric(),
                                                      dataTable = matrix()),
                                     predictionData = list(srcFolder = "",
                                                           filenames = "",
                                                           label = "",
                                                           covarNames = "",
                                                           covarSiteMin = numeric(),
                                                           covarSiteMax = numeric(),
                                                           covarExtentMin = numeric(),
                                                           covarExtentMax = numeric(),
                                                           dataTable = matrix()),
                                     sitepair = data.frame()),
                         model = list(gdm = list(),
                                      varImp = list(),
                                      perfSummary = list()))
  
  class(thisExperiment) <- c("cm_experiment", class(thisExperiment))

A worked example

A small set of data files has also been included in the folder Workflow_examples. An R-script is included to run the complete workflow, but of course each step can be run from the command line if desired.

The worked example illustrates the workflow to fit a very basic GDM.

Successful completion of the worked example should result in the following output in the console when the function cm_gdm_summary() is called:

> cat(cmGDM::cm_gdm_summary(gdmObj))
-----------------------------------------------
EcoCommons Community Modelling Module
GDM Model Summary
-----------------------------------------------

Experiment name: RF_Refugia_GDM_Argyrodendron_trifoliolatum
Run date: 2022-01-30

Number of site/sample pairs: 120

Covariates:
  Geographical distance included: No
  Covariates: Number used = 26
    aspect
    bio01
    bio02
    bio03
    bio04
    bio05
    bio06
    bio07
    bio08
    bio09
    bio10
    bio11
    bio12
    bio13
    bio14
    bio15
    bio16
    bio17
    bio18
    bio19
    clay
    sand
    silt
    slope
    TPI
    TWI

Model performance:
      Model deviance: 0.36
  Explained deviance: 78.91%
       NULL deviance: 1.69
          Interecept: 0.045

Covariate importance:
    Covariate       Contrib.
    ------------    -----------
        aspect          7.00
         bio01          0.00
         bio02          0.00
         bio03          0.00
         bio04          0.36
         bio05          0.00
         bio06          0.00
         bio07          0.00
         bio08          0.00
         bio09          0.00
         bio10          0.00
         bio11          0.20
         bio12          0.00
         bio13          0.03
         bio14          0.02
         bio15          0.04
         bio16          0.00
         bio17          0.00
         bio18          0.00
         bio19          0.00
          clay         84.07
          sand          0.00
          silt          3.46
         slope          0.00
           TPI          0.00
           TWI          0.00
>

Name		Name	Last commit message	Last commit date
Latest commit History 35 Commits
Workflow_examples		Workflow_examples
cmGDM		cmGDM
.Rhistory		.Rhistory
Community workflow_cmGDM.ipynb		Community workflow_cmGDM.ipynb
Community workflow_cmGDM.md		Community workflow_cmGDM.md
README.md		README.md
README.pdf		README.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Community modelling workflow

Generalised Dissimilarity Modelling (GDM)

About this document

Design concepts and principles

Test Data set and testing process

Data files

Test procedure

Integration into EcoCommons framework

Data expected to be supplied by UI

A worked example

About

Releases

Packages

Contributors 2

Languages

EcoCommons-Australia/community-modelling

Folders and files

Latest commit

History

Repository files navigation

Community modelling workflow

Generalised Dissimilarity Modelling (GDM)

About this document

Design concepts and principles

Test Data set and testing process

Data files

Test procedure

Integration into EcoCommons framework

Data expected to be supplied by UI

A worked example

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages