add detailed documentation on the different types of models, the correction factors and the preservation effects
haganjam committed Jul 28, 2023
1 parent b7930c2 commit 3da7000
Showing 1 changed file with 191 additions and 0 deletions.
191 changes: 191 additions & 0 deletions vignettes/InvTraitR_documentation.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,191 @@
---
title: "InvTraitR model descriptions"
author: "James G. Hagan"
date: "`r Sys.Date()`"
vignette: >
%\VignetteIndexEntry{Description of the model forms for the equations}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```

### Forms of the length-dry biomass allometric equations in InvTraitR

Based on an extensive review of the literature of length-dry biomass allometric equations for freshwater invertebrates, there are two different model forms. The standard form is a power function with a multiplicative error structure (*model1* in the *InvTraitR* package):

Equation 1:
$$
Y = a(X^b) \cdot e^\epsilon
$$

This is the model that is fit when both the Y (i.e. dry biomass) and X (i.e. body size) variables are log-transformed. For example, let's consider taking the natural logarithm of both sides of this equation:

Equation 2:
$$
ln(Y) = ln(a) + ln(X^b) + \epsilon
$$
This equation simplifies to:

Equation 3:
$$
ln(Y) = ln(a) + (b \cdot ln(X)) + \epsilon
$$

This simplified version is the standard linear regression equation. However, because we fit the model in log-log space with additive error, the error becomes multiplicative when we back-transform it, and its expected value on the original scale is no longer one. This means that we typically require a correction factor to obtain the appropriate back-transformed Y value (compare Equation 1 with Equation 3).
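As a minimal sketch of how Equation 3 relates to Equation 1, the following simulates data from a power function with multiplicative error and recovers the parameters with ordinary linear regression in log-log space. The parameter values and object names (`a_true`, `b_true`, `fit_log`) are illustrative assumptions and are not taken from the *InvTraitR* database.

```r
# Simulated example: all values are hypothetical
set.seed(1)
a_true <- 0.01                           # assumed scaling coefficient
b_true <- 2.8                            # assumed scaling exponent
x <- runif(200, min = 1, max = 20)       # body size
eps <- rnorm(200, mean = 0, sd = 0.3)    # additive error on the log-scale
y <- a_true * x^b_true * exp(eps)        # Equation 1

# Equation 3: linear regression of ln(Y) on ln(X)
fit_log <- lm(log(y) ~ log(x))

# The intercept estimates ln(a), the slope estimates b
a_hat <- exp(coef(fit_log)[["(Intercept)"]])
b_hat <- coef(fit_log)[["log(x)"]]
c(a = a_hat, b = b_hat)
```

The slope can be read directly as the exponent b, whereas the intercept has to be exponentiated to obtain a.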

The second model form (*model2* in the *InvTraitR* package) is fit directly on the original scale using non-linear regression:

$$
Y = a \cdot (X^b) + \epsilon
$$

When models are fit using this type of non-linear regression equation, then we do not need any corrections to get the predictions of Y back onto the natural scale.
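For comparison, a model of this second form could be fit with `stats::nls()`, as in the sketch below. The simulated data, the starting values and the object name `fit_nls` are assumptions made for illustration. This is not necessarily how the equations in the literature were fit.

```r
# Simulated example with additive error on the original scale (hypothetical values)
set.seed(2)
x <- runif(200, min = 1, max = 20)
y <- 0.01 * x^2.8 + rnorm(200, mean = 0, sd = 0.5)

# Non-linear least squares fit of Y = a * X^b + error
# nls() needs starting values; these are rough guesses
fit_nls <- nls(y ~ a * x^b, start = list(a = 0.02, b = 2.5))

# Predictions are already on the original scale, so no correction factor is applied
y_pred <- predict(fit_nls, newdata = data.frame(x = c(5, 10)))
coef(fit_nls)
```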

The vast majority of allometric equations in the literature and in the *InvTraitR* database take the form of model1.

### Correction factors for back-transforming log-log models (i.e. *model1*)

The problems with using log-log models for prediction on the original scale of the response variable are well known (e.g. Mahrlein et al. 2016). Models of the form:

$$
Y = a(X^b) \cdot e^\epsilon
$$
are typically transformed onto the log-scale as follows:

$$
log(Y) = log(a) + (b \cdot log(X)) + \epsilon
$$
The problem with this transformation is that the errors are additive on the log-scale but multiplicative on the original scale. Thus, naively back-transformed predictions estimate the geometric mean of Y rather than the arithmetic mean, and the geometric mean is never larger than the arithmetic mean. This means that predictions on the original scale tend to be biased downwards in such models. How does this bias work mathematically?

When we fit the model on the log-scale, we model log(Y) as the fitted model coefficients multiplied by the predictor variables plus some additive error with mean zero:

$$
log(Y) = \beta_{fitted} \cdot log(X) + \epsilon
$$
The expected value of log(Y) given X is therefore:

$$
E(log(Y)|log(X)) = \beta_{fitted} \cdot log(X)
$$
We, however, want the following:

$$
E(e^{log(Y)} |log(X))
$$

But, if we simply take the exponential of the fitted expectation, we get the following:

$$
e^{E(log(Y)|log(X))} = e^{\beta_{fitted} \cdot log(X)}
$$

This is not the quantity we want, because the exponential of an expectation is not the expectation of the exponential. Instead, we exponentiate the model for log(Y) and then take the expectation. Assuming that the errors are independent of X, we get to:

$$
E(Y|X) = e^{\beta_{fitted} \cdot log(X)} \cdot E(e^{\epsilon})
$$
This shows that to get the quantity we want (i.e. E(Y|X)), we need to multiply the back-transformed model prediction by the expectation of the exponentiated errors.
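The size of this bias can be illustrated with a small simulation. The sketch below assumes normally distributed errors with a hypothetical standard deviation of 0.5 on the natural-log scale. Under that assumption, the expectation of the exponentiated errors is exp(sigma^2 / 2).

```r
# Errors centred on zero on the log-scale (sigma is an assumed value)
set.seed(3)
sigma <- 0.5
eps <- rnorm(1e5, mean = 0, sd = sigma)

mean_eps <- mean(eps)            # close to zero
mean_exp_eps <- mean(exp(eps))   # greater than one

# Compare the simulated expectation with the value for normal errors
c(simulated = mean_exp_eps, theoretical = exp(sigma^2 / 2))
```

Because the mean of the exponentiated errors is greater than one, a back-transformed prediction that omits this factor underestimates E(Y|X).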

There are several correction factors that can be used to solve this issue. *InvTraitR* uses three of these correction factors which depend on different sources of information (e.g. model fit: r^2, maximum and minimum y-values, residual mean square and the residuals).

*Duan's smearing factor*

The first correction that we use is the standard Duan's smearing factor (Duan 1983, *Journal of the American Statistical Association*). This correction factor requires data on the residuals to calculate and, therefore, it is typically not easy to extract from the literature. However, some papers report it and we directly digitised scatter plots and calculated it for some other equations.

The smearing factor is calculated as:

$$
E(e^{residuals})
$$
Or, if using a different logarithm base:

$$
E(base^{residuals})
$$
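A minimal sketch of this calculation is given below. The function name `duan_smearing` and the example residuals are hypothetical. The expectation is estimated as the sample mean of the back-transformed residuals.

```r
# Duan's smearing factor: mean of the back-transformed residuals
# residuals: residuals of the model on the log-scale
# base: base of the logarithm used to fit the model (natural log by default)
duan_smearing <- function(residuals, base = exp(1)) {
  mean(base^residuals)
}

# Hypothetical residuals from a model fit with natural logarithms
res_example <- c(-0.4, -0.1, 0.0, 0.2, 0.3)
sf_ln <- duan_smearing(res_example)

# The same residuals treated as coming from a log10 model
sf_log10 <- duan_smearing(res_example, base = 10)
```

The back-transformed prediction is then multiplied by the smearing factor.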

*Bird and Prairie (1985)*

The correction factor proposed by Bird and Prairie (1985, *Journal of Plankton Research*) is to multiply the antilog of the predicted log dry biomass by the following correction factor, which uses the residual mean square (RMS):

$$
e^{\frac{RMS}{2}}
$$
Here, RMS is the residual mean square which is the same as the mean squared error (see *MSE, RMS and RMSE* section below).
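As a sketch of how this correction might be applied, assume a model fit with natural logarithms and a reported RMS. The function name `bird_prairie_cf`, the coefficients and the RMS value below are all illustrative assumptions.

```r
# Correction factor based on the residual mean square (RMS)
bird_prairie_cf <- function(rms) {
  exp(rms / 2)
}

# Hypothetical equation: ln(Y) = ln(0.01) + 2.8 * ln(X), with RMS = 0.09
log_pred <- log(0.01) + 2.8 * log(10)   # prediction on the log-scale for X = 10
naive_pred <- exp(log_pred)             # antilog without correction
corrected_pred <- naive_pred * bird_prairie_cf(0.09)
```

The corrected prediction is always at least as large as the naive one because the RMS cannot be negative.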

*Strimbu et al. (2018)*

The vast majority of correction factors to deal with the log-transformation bias rely on knowing the standard deviation of the model residuals. However, when researchers compile equations from the literature, usually they do not have access to the standard deviation of the residuals. This means that the typical correction factors do not work in these cases.

Strimbu et al. (2018, *Forestry*) recently proposed a set of correction factors that work when the residuals of the model are not available. For *InvTraitR*, we generally only have information on the equation, the range of X values that were used to fit the equation and the coefficient of determination (i.e. the r^2 value). Using Strimbu et al.'s (2018, *Forestry*) correction factors, this information tends to be sufficient.

Specifically, Strimbu et al. (2018, *Forestry*) propose two correction factors that we may be able to use on log-linear equations in our database:

$$
BC3 = e^{(0.5 \cdot (1-r^2)) \cdot (\frac{1}{log(a)} \cdot \frac{( log_{b=a}(y_{max})-log_{b=a}(y_{min}) )}{6})^2}
$$

The other correction factor that does not require r^2 is:

$$
BC4 = e^{\frac{1}{log(a)^2} \cdot \frac{( log_{b=a}(y_{max})-log_{b=a}(y_{min}) )^2}{72}}
$$
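The two correction factors can be sketched in R for the special case where the model was fit with natural logarithms. In that case the logarithm base a is e, so the log(a) term equals one and drops out. The function names `bc3` and `bc4` and the example values are hypothetical, and the log range is taken as log(y_max) minus log(y_min). Other logarithm bases are deliberately not handled here.

```r
# BC3: uses r^2 and the range of the response (natural-log models only)
bc3 <- function(r2, y_min, y_max) {
  log_range <- log(y_max) - log(y_min)
  exp(0.5 * (1 - r2) * (log_range / 6)^2)
}

# BC4: uses only the range of the response (natural-log models only)
bc4 <- function(y_min, y_max) {
  log_range <- log(y_max) - log(y_min)
  exp(log_range^2 / 72)
}

# Hypothetical equation with r^2 = 0.9 and dry biomass between 0.05 and 12 mg
bc3_example <- bc3(r2 = 0.9, y_min = 0.05, y_max = 12)
bc4_example <- bc4(y_min = 0.05, y_max = 12)
```

Written this way, BC4 is the value that BC3 takes when r^2 is zero, and BC3 shrinks towards one as r^2 approaches one.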


### MSE, RMS and RMSE

In much of the older literature, there tends to be some confusion about the use of different terms like mean square error, residual mean square and root mean squared error.

To make sure that this is clear, as far as we can tell, the mean squared error (MSE) is calculated in the same way as residual mean square (RMS). It appears that RMS is the older term. Nonetheless, it is calculated as:

$$
MSE = \frac{1}{n} \cdot \sum_{i=1}^{n} (y_{i} - yhat_{i})^2
$$
Where y_i is the observed data and yhat_i is the predicted data. Therefore, in words, the MSE (and the RMS) is simply the average of the squared residuals.

Then, if we take the square root of the RMS we get the RMSE which is the root mean squared error.

$$
RMSE = \sqrt{MSE}
$$
We generally use the RMSE as the standard deviation of the regression. But, the denominator is not always n. It is typically n - k, where k is the number of parameters that the model is fit with. Therefore, we can write the standard deviation of the regression (s) as:

$$
s = \sqrt{\frac{1}{n-k} \cdot \sum_{i=1}^{n} (y_{i} - yhat_{i})^2}
$$

So, given that we are only dealing with models with two parameters, this small difference can probably be ignored and we can use RMSE as a measure of the standard deviation of the model.
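The relationship between these quantities can be checked on a fitted model. The sketch below uses simulated data with hypothetical parameter values and compares the RMSE (denominator n) with the standard deviation of the regression (denominator n - k), which is what `sigma()` returns for an `lm` object.

```r
# Simulated log-log regression (hypothetical values)
set.seed(4)
x <- runif(50, min = 1, max = 20)
y <- 0.01 * x^2.8 * exp(rnorm(50, mean = 0, sd = 0.3))
fit <- lm(log(y) ~ log(x))

res <- residuals(fit)
n <- length(res)        # number of observations
k <- length(coef(fit))  # number of parameters (intercept and slope)

mse <- mean(res^2)               # denominator n
rmse <- sqrt(mse)
s <- sqrt(sum(res^2) / (n - k))  # denominator n - k

c(rmse = rmse, s = s, sigma = sigma(fit))
```

With two parameters, the ratio of the RMSE to s is sqrt((n - 2) / n), which approaches one as the sample size grows.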



### Preservation effects

The recommended approach to generating length-dry biomass equations is to use fresh specimens or frozen specimens (Benke et al. 1999). This is because preservation in commonly used preservatives like formalin and ethanol can cause mass loss as lipids and other compounds may dissolve.

Loss in dry mass due to preservation can be substantial. For example, Mahrlein et al. (2016, *Hydrobiologia*) report dry weight loss of more than 20% due to preservation in ethanol. Similarly, Wetzel et al. (2005) report dry weight loss of more than 30% due to both formalin and ethanol preservation. This is in contrast to the common doctrine that preservation-induced mass loss is less prevalent when animals are preserved in formalin (Leuven et al. 1985). These numbers (20 or 30%) are substantial and, therefore, we should account for them as best we can.

Generating equations or gathering test data that used different preservation methods can lead to a large additional source of error when estimating dry biomass. However, there is another issue with preservation methods. Usually, we want to estimate dry biomass for communities or species that were collected and preserved. However, individuals that are preserved in ethanol or formalin can also change their lengths. Thus, when these preserved individuals are measured for length, the measured lengths can be incorrect.

We need to look carefully at which preservation methods were used to generate the equation and which preservation is used for the testing data or for the dataset where the length data come from.

*What information do we need to extract?*

Add preservation method information for each equation. I think I'm going to stick with three possibilities:

+ unpreserved
+ formaldehyde
+ ethanol

Add a preservation correction factor for equations generated with preserved specimens. For example, Dekanova et al. (2022) generated their length-dry biomass equations using formalin-preserved specimens. We need to add some correction factor to these equations that increases the dry biomass estimate to account for this. For example, Mahrlein et al. (2016) propose a 22% correction factor but we'll need to check how consistent this is. The point is that we'll need to add a correction factor to each equation that was generated using preserved specimens. This will be based on the preservation method used.

In addition, we need to provide a length correction factor because preserved animals also change their length. Whether this length correction factor is required will depend on the preservation status of the equation:

+ length preserved -> equation unpreserved (correct length upwards)
+ length unpreserved -> equation preserved (correct length downwards)

The details of how to do this will depend and I'll need to figure it out.

A nice comparison figure for the manuscript will be something like comparing accuracy of the method when accounting for preservation effects and when not accounting for preservation effects.

I think this is actually very important because most people that use these equations generally don't think too much about the preservation effects. This makes sense given that it's a lot of extra work. But, if it only needs to be done once (by us) then it might improve these biomass estimations a lot.
