Commit daa6bfc (1 parent: 8366475)
Showing 1 changed file with 44 additions and 41 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -61,7 +61,7 @@ University of Michigan, Ann Arbor, Michigan, USA | |
${^*}$ Current Affiliation: | ||
|
||
${^\#}$ Current Affiliation: Bristol Myers Squibb, Summit, New Jersey, | ||
USA | ||
USA | ||
|
||
$\dagger$ To whom correspondence should be addressed: | ||
[pschloss\@umich.edu](mailto:pschloss@umich.edu) | ||
|
@@ -79,27 +79,27 @@ $\dagger$ To whom correspondence should be addressed: | |
Machine learning classification of disease based on the gut microbiome | ||
often relies on clustering 16S rRNA gene sequences into operational | ||
taxonomic units (OTUs) to quantify microbial composition. The standard | ||
approach to clustering sequences into OTUs leverages the similarity of | ||
*de novo* approach to clustering sequences into OTUs leverages the similarity of | ||
the sequences to each other rather than to a reference database. The | ||
abundance of each OTU is used to train a classification model. However, | ||
OTU assignments depend on the sequences in the data set and therefore | ||
can change if new data are added. This lack of stability complicates | ||
can change if new data are added. This lack of consistency complicates | ||
||
classification because in order to use the model to classify additional | ||
samples, the OTUs have to be reclustered to include the new sequences | ||
and the model retrained with the new OTU clusters. A new reference-based | ||
clustering algorithm, called OptiFit, addresses this issue by fitting | ||
new sequences into existing OTUs. While OptiFit is proven to produce | ||
high quality OTU clusters, it is unclear whether this method for fitting | ||
new sequence data into existing OTUs will impact the performance of | ||
new sequences into existing OTUs. While OptiFit has been shown to produce | ||
high quality OTU clusters, it is unknown whether this method will impact | ||
the performance of | ||
classification models. We used OptiFit to cluster additional data into | ||
existing OTU clusters and quantified model performance in classifying a | ||
data set containing samples from patients with and without colonic | ||
screen relevant neoplasias (SRN). We compared the performance of this | ||
model to the standard procedure of clustering all the data together. We | ||
model to the standard procedure of *de novo* clustering all the data together. We | ||
found that both approaches performed equally well in classifying SRNs. | ||
Moving forward, when OTUs are used in classification problems, OptiFit | ||
can be used to avoid the need to retrain models using reclustered | ||
sequences when classifying new samples. | ||
can streamline the process of classifying new samples by avoiding the need to | ||
retrain models using reclustered sequences. | ||
|
||
## Importance | ||
|
||
|
@@ -112,7 +112,7 @@ generated from new patients seeking a diagnosis, then it would be | |
necessary to reassign sequences to OTUs and retrain the classification | ||
model. Yet there is a desire to have a single, validated model that can | ||
be deployed. To overcome this obstacle, we applied the OptiFit | ||
clustering algorithm which fits new sequence data to existing OTUs | ||
clustering algorithm which fits new sequence data to existing OTUs, | ||
allowing the reuse of a consistent model. A random forest machine | ||
learning model deployed using OptiFit performed as well as the | ||
traditional reassignment and retraining approach. This result indicates | ||
|
@@ -126,59 +126,59 @@ classification of diseases, including colorectal cancer [@baxter2016; | |
@zackular2014]. Amplicon sequencing of the 16S rRNA gene is a reliable | ||
tool for assessing the taxonomic composition of microbial communities, | ||
which is the input to these models. Analysis of 16S rRNA gene sequence | ||
data generally relies on clustering of sequences based on similarity | ||
data generally relies on clustering sequences based on similarity | ||
into operational taxonomic units (OTUs). The process of OTU clustering | ||
can either be reference-based or *de novo*. The quality of OTUs | ||
generated with reference-based clustering is generally poor compared to | ||
those generated with *de novo* clustering @westcott2015. While *de novo* | ||
clustering produces high-quality OTU clusters where sequences are | ||
accurately grouped based on similarity thresholds, the resulting OTU | ||
clusters depend on the data in the data set and the addition of new data | ||
could change the overall OTU clusters. The unstable nature of OTU | ||
could change the overall OTU clusters. The inconsistent nature of *de novo* OTU | ||
clustering complicates deployment of machine learning models since | ||
integration of additional data requires reclustering all the data and | ||
retraining of the model. The ability to integrate new data into a | ||
retraining the model. The ability to integrate new data into a | ||
validated model without reclustering and retraining could allow for | ||
deployment of a single model that new data can be continually added to. | ||
Recently Sovacool *et al* introduced OptiFit: a method for fitting new | ||
deployment of a single model that new data can be continually tested against. | ||
Recently, Sovacool *et al* introduced OptiFit: a method for fitting new | ||
sequence data into existing OTUs @sovacool2022. While OptiFit is proven | ||
to effectively fit new sequence data to existing OTU clusters, it is | ||
unknown if the use of OptiFit will have an impact on classification. | ||
Here we tested the ability of OptiFit to cluster new sequence data into | ||
unknown if the use of OptiFit will have an impact on classification performance. | ||
Here, we tested the ability of OptiFit to cluster new sequence data into | ||
existing OTU clusters for the purpose of classification of disease based | ||
on gut microbiome composition. | ||
|
||
We compared two approaches, one using all of the data to generate OTU | ||
clusters and the other generating OTU clusters with a portion of the | ||
We compared two approaches, one using all of the data to generate *de novo* OTU | ||
clusters, and the other generating *de novo* OTU clusters with a portion of the | ||
data and then fitting the remaining sequence data to the existing OTUs | ||
using OptiFit. In the first approach, all of the 16S rRNA sequence data | ||
was *de novo* clustered into OTUs with the OptiClust algorithm in mothur | ||
@westcott2017. The resulting abundance data was then split into training | ||
and testing sets, where the training set was used to tune | ||
hyperparameters and ultimately train the model. The testing set was then | ||
classified with the model and the performance of the model was | ||
quantified (Figure 1A). However, with this methodology we would have to | ||
quantified (Figure 1A). However, with this methodology, we would have to | ||
regenerate the OTU clusters and retrain the model if we wanted to | ||
classify additional samples. The OptiFit algorithm @sovacool2022 | ||
addresses this problem by enabling new sequences to be clustered into | ||
existing OTUs. The OptiFit workflow is similar to the OptiClust workflow | ||
existing OTUs. The OptiFit workflow is similar to the OptiClust workflow, | ||
where the data was clustered into OTUs and used to tune hyperparameters | ||
and ultimately train the model. Then, we used OptiFit to fit sequence | ||
data of samples not part of the original data set into the existing OTUs | ||
data of samples not part of the original data set into the existing OTUs, | ||
and used the same model to classify the samples (Figure 1B). To test how | ||
the model performance compares between these two methodologies, we used | ||
a publicly available data set of 16S rRNA gene sequences from stool | ||
samples of healthy subjects as well as subjects with SRN consisting of | ||
advanced adenoma and carcinoma @baxter2016. The data set was randomly | ||
split into an 80% train set and 20% test set. For the standard OptiClust | ||
workflow, the training and test sets were *de novo* clustered together | ||
into OTUs then the resulting abundance table was split into the training | ||
into OTUs, then the resulting abundance table was split into the training | ||
and testing set. For the OptiFit workflow, the train set was clustered | ||
*de novo* into OTUs and the remaining test set was fit to the OTU | ||
*de novo* into OTUs, and the remaining test set was fit to the OTU | ||
clusters using the OptiFit algorithm. For both workflows, the abundance | ||
table of the train set was used to tune hyperparameters and train a | ||
random forest model to classify SRN. The test set was classified as | ||
either control or SRN using the trained models.To account for variation | ||
either control or SRN using the trained models. To account for variation | ||
depending on the split of the data, the data set was randomly split 100 | ||
times and the process repeated for each of the 100 data splits. By | ||
comparing the model performance of classifying the samples in the test | ||
|
@@ -213,7 +213,7 @@ OTUs, we expected the MCC scores produced by the OptiClust and OptiFit | |
workflows to be similar. Since the data was only clustered once in the | ||
OptiClust workflow there was only one MCC score while the OptiFit | ||
workflow produced an MCC score for the OTU clusters from each data | ||
split. Overall the MCC scores were similar between OptiClust (MCC = | ||
split. Overall, the MCC scores were similar between OptiClust (MCC = | ||
`r round(opticlust_mcc,digits=3)`) and OptiFit (average MCC = | ||
`r round(optifit_avg_mcc,digits=3)`). This indicated that OptiFit | ||
performed as well as OptiClust when integrating new sequences into the | ||
|
@@ -240,7 +240,7 @@ pvals <- read_csv("../results/tables/pvalues.csv",col_types = cols(p_value = col | |
|
||
After verifying that the quality of the OTUs was consistent between | ||
OptiClust and OptiFit, we examined the model performance for classifying | ||
samples in the held out test data set. To quantify model performance we | ||
samples in the held out test data set. To quantify model performance, we | ||
used the OTU relative abundances from the training data from the | ||
OptiClust and OptiFit workflows to train a model to predict SRNs. Using | ||
the predicted and actual diagnosis classification, we calculated the | ||
|
@@ -268,10 +268,10 @@ We tested the ability of OptiFit to integrate new data into existing | |
OTUs for the purpose of machine learning classification using OTU | ||
relative abundance. A potential problem with using OptiFit is that any | ||
sequences in the new data that do not map to the existing OTU clusters | ||
will be discarded resulting in a possible loss of information. However, | ||
will be discarded, resulting in a possible loss of information. However, | ||
we demonstrated that OptiFit can be used to fit new sequence data into | ||
existing OTU clusters and perform equally well in predicting SRN | ||
compared to clustering all of the sequence data together. The ability to | ||
existing OTU clusters and performs equally well in predicting SRN | ||
compared to *de novo* clustering all of the sequence data together. The ability to | ||
integrate data from new samples into existing OTUs enables the | ||
deployment of a single machine learning model. These results are based | ||
on a single data set and disease. Further analysis is needed to | ||
|
@@ -287,7 +287,7 @@ stool samples was downloaded from NCBI Sequence Read Archive (accession | |
no. SRP062005) [@edgar2011; @baxter2016]. This data set contains stool | ||
samples from a total of 490 subjects. For this analysis, samples from | ||
subjects identified in the metadata as normal, high risk normal, or | ||
adenoma were categorized as "normal" while samples from subjects | ||
adenoma were categorized as "normal", while samples from subjects | ||
identified as advanced adenoma or carcinoma were categorized as "screen | ||
relevant neoplasia" (SRN). The resulting data set consisted of 261 | ||
normal samples and 229 SRN samples. | ||
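
The grouping of diagnoses described in this hunk can be sketched as a simple lookup. This is a minimal illustration in Python rather than the manuscript's R. The metadata label strings and the `recode_diagnosis` helper are hypothetical; only the grouping of categories into "normal" and SRN comes from the text.

```python
# Hypothetical metadata labels: the exact strings used in the study's
# metadata are an assumption. Only the grouping itself follows the text.
NORMAL_LABELS = {"normal", "high risk normal", "adenoma"}
SRN_LABELS = {"advanced adenoma", "carcinoma"}


def recode_diagnosis(label: str) -> str:
    """Collapse a metadata diagnosis into the binary outcome (normal or SRN)."""
    key = label.strip().lower()
    if key in NORMAL_LABELS:
        return "normal"
    if key in SRN_LABELS:
        return "SRN"
    # Fail loudly instead of silently assigning an unknown label to a class.
    raise ValueError(f"unrecognized diagnosis label: {label!r}")


if __name__ == "__main__":
    for raw in ["Normal", "adenoma", "Advanced Adenoma", "carcinoma"]:
        print(raw, "->", recode_diagnosis(raw))
```

Raising an error on an unknown label is a design choice of this sketch, so that a misspelled metadata value cannot quietly change the class balance.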
|
@@ -321,23 +321,26 @@ the test set an average of `r n_test_train %>% pull(avg_test)` times | |
(SD=`r format_decimal(n_test_train %>% pull(sd_test),digits = 1)`). | ||
|
||
The data was processed through two workflows. First, the standard | ||
workflow using the OptiClust algorithm @westcott2017. In this pathway, | ||
workflow using the OptiClust algorithm @westcott2017. In this workflow, | ||
all of the data was clustered together with OptiClust to generate OTUs | ||
and the resulting abundance tables were split into the training and | ||
testing sets. In the second workflow, the preprocessed data was split | ||
into the training and testing sets. The training set was clustered into | ||
OTUs, then the test set was fit to the OTUs of the training set using | ||
the OptiFit algorithm @sovacool2022. The OptiFit algorithm was run with | ||
method open so that any sequences that did not map to the existing OTU | ||
clusters would form new OTUs. For both pathways, the shared files were | ||
the OptiFit algorithm @sovacool2022. The OptiFit algorithm was run with the open | ||
method so that any sequences that did not map to the existing OTU | ||
clusters would form new OTUs. | ||
<!-- Did you then remove the columns corresponding to the additional OTUs? | ||
Typically you'd want to run the closed method for this use-case, right? --> | ||
For both workflows, the shared files were | ||
sub-sampled to 10,000 reads per sample. | ||
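
Sub-sampling to a fixed depth can be illustrated with a short sketch. This is Python rather than the manuscript's R, it is not mothur's implementation, and the `subsample_counts` helper and the example OTU names are hypothetical. It assumes reads are drawn at random without replacement and that a sample with fewer reads than the target depth cannot be sub-sampled.

```python
import random
from collections import Counter


def subsample_counts(counts, depth, seed=None):
    """Randomly draw `depth` reads without replacement from one sample.

    `counts` maps OTU name -> read count. Returns None when the sample has
    fewer than `depth` reads, since it cannot be sub-sampled to that depth.
    """
    total = sum(counts.values())
    if total < depth:
        return None
    rng = random.Random(seed)
    # Expand to one entry per read, then sample reads without replacement.
    reads = [otu for otu, n in counts.items() for _ in range(n)]
    drawn = rng.sample(reads, depth)
    return dict(Counter(drawn))


if __name__ == "__main__":
    # Hypothetical sample with 15,500 reads across three OTUs.
    sample = {"Otu001": 12000, "Otu002": 3000, "Otu003": 500}
    sub = subsample_counts(sample, 10000, seed=1)
    print(sum(sub.values()))
```

Because every sample ends up with the same number of reads, OTU counts are comparable across samples regardless of the original sequencing depth.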
|
||
***Machine Learning.*** Machine learning using Random Forest was | ||
conducted with the R package mikropml (v 1.2.0) @topçuoglu2021 to | ||
predict the diagnosis (SRN or normal) for the samples in the test set | ||
for each data split. The training set was preprocessed to normalize OTU | ||
counts (scale/center), collapse correlated OTUs, and remove OTUs with | ||
zero-variance. The preprocessing from the training set was then applied | ||
counts (scale and center), collapse correlated OTUs, and remove OTUs with | ||
zero variance. The preprocessing from the training set was then applied | ||
to the test set. Any OTUs in the test set that were not in the training | ||
set were removed. P values comparing model performance were calculated | ||
as previously described @topçuoglu2020. The averaged ROC curves were | ||
|
@@ -387,8 +390,8 @@ This work was supported through a grant from the NIH (R01CA215574). | |
was clustered into OTUs using the OptiClust algorithm in mothur. The | ||
data was then split into two sets where 80% of the samples were assigned | ||
to the training set and 20% to the testing set. The training set was | ||
preprocessed with mikropml to normalize values (scale/center), collapse | ||
correlated features, and remove features with zero-variance. Using | ||
preprocessed with mikropml to normalize values (scale and center), collapse | ||
correlated features, and remove features with zero variance. Using | ||
mikropml, the training set was split into train and validate sets to | ||
compare results using different hyperparameter settings. The highest | ||
performing hyperparameter setting was then used to train the model with | ||
|
@@ -405,12 +408,12 @@ sets where 80% of the samples were assigned to the training set and 20% | |
to the testing set. The training set was then clustered into OTUs using | ||
the OptiClust algorithm in mothur. The resulting abundance data was | ||
preprocessed with mikropml to normalize values (scale/center), collapse | ||
correlated features, and remove features with zero-variance. Using | ||
correlated features, and remove features with zero variance. Using | ||
mikropml, the training set was split into train and validate sets to | ||
compare results using different hyperparameter settings. The highest | ||
performing hyperparameter setting was then used to train the model with | ||
the full training set. The OptiFit algorithm in mothur was used to | ||
cluster the left out testing data set using the OTUs of the training set | ||
cluster the held out testing data set using the OTUs of the training set | ||
as a reference. The preprocessing scale from the training set was | ||
applied to the test data set, then the trained model was used to | ||
classify the samples in the test set. Based on the actual classification | ||
|
I originally had "consistency", but Pat specifically changed it to "stability".