060-prediction_models.Rmd

# Prediction models

```{r, results = "asis", echo = FALSE}
status("drafting")
```

We have received important feedback since the release of the first models and web services (Dec. 2021). The feedback was that the previous model versions were doing a great job, but for others not that much. 

Variable model performance may have happened due to the specific soil types not being well represented or due to the spectra not being well aligned with the OSSL instruments. This led us to improve the current outputs to include a flag that indicates if the new samples to be predicted are represented by the calibration set. In addition to that, we revised our uncertainty estimation method by switching to conformal predictions, a simple and robust method for delivering uncertainty bands.

In addition to that, we conducted a systematic analysis of learning algorithms, compression strategies, and preprocessing using the OSSL database and external test sets, and the insights from ring trial experiment - a separate project that was developed to understand the dissimilarity across multiple soil spectroscopy laboratories.

We also removed spatial covariates and even not fused spectral regions compared to the first models, in order to make them simpler and generally applicable. These further combinations require more time to verify the potential performance improvements, although recent literature have been supporting it.

The following sections describe the steps used to produce the OSSL models, which 
are used in the OSSL Engine and API, and also public available from Google Cloud Storage.

It is important to note that the current models are not free of error and may still provide variable results. Please, let us know about any inconsistency you might find to help us improve even more the current models.

For an in-depth inspection of the modeling steps, please visit the [ossl-models](https://github.com/soilspectroscopy/ossl-models) GitHub repository.

## Modeling framework

We have used the [MLR3 framework](https://mlr3book.mlr-org.com/) [@mlr3] for fitting machine learning models, specifically with the [Cubist algorithm](https://cran.r-project.org/web/packages/Cubist/vignettes/cubist.html) [@quinlan1992learning; @quinlan1993combining].

The [`R-mlr` folder](https://github.com/soilspectroscopy/ossl-models/tree/main/R-mlr) on `ossl-models` GitHub presents the code used to calibrate the models.

In summary, we have provided 5 different model types depending on the availability of samples in for each soil property, which was fitted without the use of ancillary information (`na` code), i.e., site information or environmental layers are not used as predictors, only the spectral variation.

The model types are composed of two different subsets, i.e. using the KSSL soil spectral library alone (`kssl` code) or the full OSSL database (`ossl` code), in combination with three spectral types: VisNIR (`visnir` code), NIR from the Neospectra instrument (`nir.neospectra` code), and MIR (`mir` code).

```{r model_types, echo=FALSE}
readr::read_csv("https://github.com/soilspectroscopy/ossl-models/raw/main/out/fitted_modeling_combinations_v1.2.csv", show_col_types = FALSE) |>
  dplyr::distinct(spectra_type, subset, geo, model_name) |>
  knitr::kable()
```

>> **We highly recommend using the OSSL model types**. The models fitted exclusively with the KSSL can be used when the spectra to be predicted has the same instrument manufacturer/model as the spectrometers used to build the KSSL VisNIR and MIR libraries, and the KSSL library is representative for the new spectra based both on spectral similarity and range of soil properties of interest.  

The machine learning algorithm Cubist (coding name `cubist`) takes advantage of a decision-tree splitting method but fits linear regression models at each terminal leaf. It also uses a boosting mechanism (sequential trees adjusted by weights) that allows the growth of a forest by tuning the number of committees. We haven't used the correction of final predictions by the nearest neighbors' influence due to the lack of this feature in the MLR3 framework.

As predictors, we have used the first 120 PCs of the compressed spectra, a threshold that considers the trade-off between spectral representation and compression magnitude [@chang2001near]. The remaining farther components were used for the trustworthiness test, i.e., checking if a sample is underrepresented in respect of the feature space from the calibration set because of some unique features presented in those farther components [@Jackson1979;@deSantana2023].

Before PCA compression, spectra was preprocessed with Standard Normal Variate (SNV) to reduce some of the variability caused by contrasting instruments and SOPs, besides the minimization of particle size, light scattering, and multicollinearity known issues of diffuse reflectance [@Barnes1989].

Hyperparameter optimization was done with internal resampling (`inner`) using 5-fold cross-validation and a smaller subset for speeding up this operation [@Yang2020]. This task was performed with a grid search of the hyperparameter space testing up to 5 configurations of Comitees to find the lowest Root Mean Squared Error (RMSE). The final model with the best optimal hyperparameter was fitted at the end with the full train data.

Some highly-skewed soil properties were natural-log transformed (with offset = 1, that is why we use the `log1p` function) to improve the prediction performance [@Dangal2019]. They were back-transformed only at the end after running all modeling steps, including performance estimation and the definition of the uncertainty intervals.

Final fitted models are listed on [GitHub](https://github.com/soilspectroscopy/ossl-models/blob/main/out/fitted_modeling_combinations_v1.2.csv).

## Model performance

Model evaluation was performed with external (`outer`) 10-fold cross validation of the tuned models using RMSE (`rmse`), mean error (`bias`), R squared (`rsq`), Lin's concordance correlation coefficient (`ccc`), and the ratio of performance to the interquartile range (`rpiq`) [@malone2021soil].

The final fitted models along with their performance metrics are listed on [GitHub](https://github.com/soilspectroscopy/ossl-models/raw/main/out/fitted_models_performance_v1.2.csv).

All validation scatterplots are available on [GitHub](https://github.com/soilspectroscopy/ossl-models/tree/main/out/plots). Example of a validation plot.

```{r ac-toc, echo=FALSE, fig.cap="Validation scatterplot for `log..c.tot_usda.a622_w.pct` using `mir_cubist_ossl_na_v1.2`.", out.width="70%"}
knitr::include_graphics("https://storage.googleapis.com/soilspec4gg-public/models/log..c.tot_usda.a622_w.pct/valplot_mir_cubist_ossl_na_v1.2.png")
```

<!-- ```{r perf_metrics, echo=FALSE} -->
<!-- readr::read_csv("https://github.com/soilspectroscopy/ossl-models/raw/main/out/fitted_models_performance_v1.2.csv", show_col_types = FALSE) |> -->
<!--   knitr::kable() -->
<!-- ``` -->

## Uncertainty and trustworthiness

The cross-validated predictions were used to estimate the unbiased absolute error, which were  employed to calibrate uncertainty models via [conformal prediction intervals](https://en.wikipedia.org/wiki/Conformal_prediction).The conformal prediction is built upon past experience, i.e. the error model is calibrated using the same fine-tuned structure of the respective response model. Non-conformity of the calibration set were estimated from the predicted error with a defined confidence level of 68% to approximate one standard deviation. The confidence interval (CI) can be derived with the prediction of the response and error values corrected by the non-conformity score of the calibration set [@Norinder2014; @CortsCiriano2015]. The main advantage of conformal prediction is that the intervals are always valid, i.e. a confidence of 80% means that the predicted confidence regions may contain the observed value in at least 80% of the cases. Another big advantage of conformal prediction methods is that we take rid of resampling mechanisms such as bootstrapping or Monte Carlo simulation which significantly requires a huge computing time for large sample sizes.  

The trustworthiness test was built using the PCA model employed for compression [@Jackson1979; @deSantana2023]. Considering that only the first 120 PCs are used to decompose spectra for modeling fitting, the remaining variation from farther components are denominated residual, which is calculated by the difference between the original spectra and back-transformed with the first 120 PCs, which makes possible to calculate the Q-statistics (sum of squared residual across the spectrum). The Q statistic of new samples are compared against a Q critical value estimated across the calibration set with a 99% confidence level.

## Using the OSSL models

To load the complete analysis-ready models, train data, cross-validated predictions, validation performance metrics, and validation plot in R, please use the public URLs described [GitHub](https://github.com/soilspectroscopy/ossl-models/blob/main/out/fitted_models_access.csv).

`qs` is a serialized and compressed file format that is faster than native R `rds`. You need to have [qs package](https://github.com/traversc/qs) version >= 0.25.5 to load files direct from the URLs.

**NOTE: For using the trained models with new spectra, the supplemented data must be formatted following the defined spectral ranges:**  
**- VisNIR: 400 - 2500 nm**  
**- NIR Neospectra: 1350 - 2550 nm**  
**- MIR: 600 - 4000 cm<sup>-1</sup>**  

**Shorter intervals will raise an error! Users can impute the edges and fill the missing columns with the last available value of the border wavelengths.**

We provided on GitHub both [examples of datasets](https://github.com/soilspectroscopy/ossl-models/tree/main/sample-data) and a [prediction bundle](https://github.com/soilspectroscopy/ossl-models/blob/main/R-mlr/OSSL_functions.R) that incorporates all processing operations to get the outputs.

Please, check the example datasets for formatting your spectra to the minimum required level of the prediction bundle You can provide either `csv` files or directly `asd` or opus (`.0`) for VisNIR and MIR scans, respectively.

The results table has the prediction value (already back-transformed if log transformation was used) for the soil property of interest, 1-standard-deviation uncertainty band, and a flag column for **potential underrepresented** samples given the calibration data.

The prediction function requires the [tidyverse](https://www.tidyverse.org/), [mlr3 and mlr3extralearners](https://mlr3book.mlr-org.com), [Cubist](https://topepo.github.io/Cubist//). [qs](https://github.com/traversc/qs), [asdreader](https://github.com/pierreroudier/asdreader), [opusreader2](https://github.com/spectral-cockpit/opusreader2), and [matrixStats](https://cran.rstudio.com/web/packages/matrixStats/index.html) packages.

Parameters:  

- `target`: the soil property of interest. Log-transformed properties must have `log..` appended at the beginning of the name as indicated in the `export_name` column.  
- `spectra.file`: the path for the spectral measurements, either a `csv` table (following the sample-data specifications), `asd`, or opus (`.0`) file.  
- `spectra.type`: the spectral range of interest. Either `visnir`, `nir.neospectra`, or `mir`.  
- `subset.type`: the subset of interest, either the whole `ossl` or the `kssl` alone.  
- `geo.type`: only available for `na`.  
- `models.dir`: the path for the `ossl_models` folder. Should follow the same structure and naming code as represented in [fitted_models_access.csv](out/fitted_models_access.csv).

All files that represent the **ossl_models** directory tree for local run or online access are described in [ossl_models_directory_tree.csv](https://github.com/soilspectroscopy/ossl-models/blob/main/out/ossl_models_directory_tree.csv).

> Please note that the soil properties indication follows the export name. In addition, check [fitted_models_performance_v1.2.csv](https://github.com/soilspectroscopy/ossl-models/blob/main/out/fitted_models_performance_v1.2.csv) for the complete list of models as some spectra types are not completely available. More importantly, for natural-log soil properties, the upper and lower bands were estimated before the back-transformation.

```{r prediction, echo=TRUE, eval=FALSE, message=FALSE}
source("R-mlr/OSSL_functions.R")

list.files("sample-data")

clay.visnir.ossl <- predict.ossl(target = "clay.tot_usda.a334_w.pct",
                                 spectra.file = "sample-data/101453MD01.asd",
                                 spectra.type = "visnir",
                                 subset.type = "ossl",
                                 geo.type = "na",
                                 models.dir = "~/mnt-ossl/ossl_models/")
```

### Registering and hosting models

If you plan to contribute with your own models to the Soil Spectroscopy for Global Good project, 
please follow these steps:

1. We recommend registering and sharing new models via S3 file 
    service system (Amazon or Google Cloud) with RDS files produced using 
    [Dockerized solution for full scientific reproducibility](https://github.com/nuest/docker-reproducible-research).  
2. For each model the following metadata need to be supplied:
    - Registered [docker image](https://hub.docker.com/r/opengeohub/r-geo); which should specify in detail: software in 
    use, R package versions etc;
    - DOI of the zenodo repository with a back-up for the data (i.e. how to cite your models);
    - URL for accessing the RDS file via S3; 

We can host your models on our infrastructure at no additional costs. Please 
[contact us](https://soilspectroscopy.org/contact/) and apply for registering your models via soilspectroscopy.org.