Algorithm overview

Arkadiy-Garber edited this page Jan 20, 2019 · 8 revisions

The TaxonSluice algorithm analyzes OTUs present in environmental samples for overlap with i) sample-specific blanks and ii) non-sample-specific blanks. It is not an automated solution against contaminants. Instead, the tool provides a software-assisted analysis in which the user ultimately decides, based on multiple lines of evidence, whether a specific flagged OTU is retained or discarded. The key algorithm steps and their rationale are summarized below.

Sample-specific blank overlap.
For every comparison, if an OTU occurs in both a sample and its blank, the decision to “flag” it as a potential contaminant is based on a user-defined proportional abundance threshold. By default, the program flags an OTU if its proportional abundance in the blank is at or above 10% of its abundance in the sample. We justify this arbitrary threshold by assuming that, following geometric amplification during PCR, genuine kit contaminant sequences, present in both sample and blank reactions at equal template amounts, should not diverge in counts by more than an order of magnitude. If an OTU present in a sample and its blank has more than an order of magnitude more sequence counts in the sample than in the blank, it is likely that the environment contains lineages closely related to kit contaminants and that these sequences have been artificially merged by the OTU clustering step. In such a scenario, we would opt to retain the OTU, since it likely represents legitimate environmental data despite its similarity to lineages present in kit reagents. However, since this is not a quantitative assessment, our assumption is imperfect and the user is welcome to explore other threshold cutoffs. Note: If an OTU is only present in blanks, it is automatically removed from the data set.
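
The flagging rule for a single sample and blank pair can be sketched as follows. This is a minimal illustration, not TaxonSluice's actual code: the function name `compare_to_paired_blank`, the dictionary input format and the status labels are hypothetical, and "proportional abundance" is assumed here to mean an OTU's read count divided by the total reads in its library.

```python
def compare_to_paired_blank(sample, blank, threshold=0.10):
    """Classify each OTU in one sample/blank pair.

    sample, blank: dicts mapping OTU id -> read count (assumed input format).
    threshold: flag when blank proportion >= threshold * sample proportion.
    Returns a dict mapping OTU id -> "flag", "retain" or "blank_only".
    """
    sample_total = sum(sample.values())
    blank_total = sum(blank.values())
    status = {}
    for otu in set(sample) | set(blank):
        s = sample.get(otu, 0)
        b = blank.get(otu, 0)
        if s == 0 and b == 0:
            continue  # OTU not observed in either library
        if s == 0:
            # Seen only in the blank for this pair; removal from the data set
            # would additionally require absence from every sample.
            status[otu] = "blank_only"
        elif b == 0:
            status[otu] = "retain"  # no overlap with the blank
        else:
            s_prop = s / sample_total
            b_prop = b / blank_total
            status[otu] = "flag" if b_prop >= threshold * s_prop else "retain"
    return status


sample = {"OTU1": 980, "OTU2": 10, "OTU4": 10}
blank = {"OTU1": 5, "OTU2": 90, "OTU3": 5}
result = compare_to_paired_blank(sample, blank)
```

In this toy input, OTU1 is far more abundant in the sample than in the blank and is retained, OTU2 dominates the blank and is flagged, and OTU3 occurs only in the blank. Raising or lowering `threshold` corresponds to exploring other cutoffs.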

Non-sample-specific blank overlap.
The presence of multiple blanks in a dataset allows higher-order comparisons that take into account laboratory-introduced contaminants in addition to potential reagent contamination. A single sample and blank pair may fail to catch laboratory-introduced contamination that follows technical error (poor aseptic practice, pipetting error, etc.) or contamination of laboratory disposables (microcentrifuge tubes, pipette tips, etc.). If the environmental sample is inadvertently contaminated and its blank is not, comparing OTUs across multiple non-sample-specific blanks may still catch such contaminants. TaxonSluice therefore checks every OTU that was not flagged on the basis of its sample-specific blank against every available non-sample-specific blank.
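
A minimal sketch of this second pass is given below. It assumes that the same proportional abundance threshold is reused for non-sample-specific blanks, which the text above does not state. The function name `flag_against_other_blanks` and the input format are hypothetical.

```python
def flag_against_other_blanks(sample, other_blanks, already_flagged, threshold=0.10):
    """Check OTUs not yet flagged against every non-sample-specific blank.

    sample: dict mapping OTU id -> read count for one environmental sample.
    other_blanks: dict mapping blank name -> (dict of OTU id -> read count).
    already_flagged: set of OTU ids flagged by the sample-specific blank.
    Returns a dict mapping OTU id -> list of blank names that triggered a flag.
    """
    sample_total = sum(sample.values())
    flagged = {}
    for otu, s in sample.items():
        if s == 0 or otu in already_flagged:
            continue  # absent from the sample, or already handled in the first pass
        s_prop = s / sample_total
        for blank_name, blank in other_blanks.items():
            b = blank.get(otu, 0)
            if b == 0:
                continue  # no overlap with this blank
            b_prop = b / sum(blank.values())
            if b_prop >= threshold * s_prop:
                flagged.setdefault(otu, []).append(blank_name)
    return flagged


sample = {"OTU1": 980, "OTU2": 10, "OTU4": 10}
other_blanks = {
    "blank_B": {"OTU4": 50, "OTU9": 50},
    "blank_C": {"OTU1": 1, "OTU7": 99},
}
second_pass = flag_against_other_blanks(sample, other_blanks, already_flagged={"OTU2"})
```

Here OTU4 passed the paired-blank comparison but is abundant in an unrelated blank, so it is flagged along with the name of the blank that triggered the flag. Recording which blank was responsible gives the user one more piece of evidence for the final decision.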

Inspection of closest relatives in the SILVA database.
With the exception of the automatic elimination of OTUs found only in blanks, the user is charged with making the final decision on the retention or removal of all flagged OTUs. To inform this decision, TaxonSluice outputs the ten closest relatives of each flagged OTU in the SILVA database. Identity, coverage and e-value metrics are also provided, along with environmental isolation metadata, if available. We hope this information will help guide the user in making a retain-or-remove decision for each flagged OTU in the dataset. Thus, TaxonSluice parses through the bulk of the environmental sequence data and, based on three independent lines of evidence [i) sample-specific blanks, ii) non-sample-specific blanks and iii) environmental context of close sequence relatives], provides the user with as much information as possible to robustly determine the retention or removal of taxonomic lineages in the dataset that were also present in sample blanks.
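
Selecting the ten closest relatives per flagged OTU could look like the sketch below. It assumes that database hits are already available as records with `otu`, `subject`, `identity`, `coverage` and `evalue` fields, and that "closest" means lowest e-value with ties broken by higher identity. Both the record layout and the ranking rule are illustrative assumptions, not a description of how TaxonSluice ranks hits.

```python
from collections import defaultdict


def top_hits(hits, flagged_otus, n=10):
    """Return the n best database hits for each flagged OTU.

    hits: iterable of dicts with keys "otu", "subject", "identity",
          "coverage" and "evalue" (hypothetical record layout).
    flagged_otus: set of OTU ids to report on.
    """
    per_otu = defaultdict(list)
    for hit in hits:
        if hit["otu"] in flagged_otus:
            per_otu[hit["otu"]].append(hit)
    # Rank by ascending e-value, then by descending percent identity.
    return {
        otu: sorted(rows, key=lambda h: (h["evalue"], -h["identity"]))[:n]
        for otu, rows in per_otu.items()
    }


# Twelve made-up hits for one flagged OTU, plus one hit for an unflagged OTU.
hits = [
    {"otu": "OTU2", "subject": f"ref_{i}", "identity": 99.0 - i,
     "coverage": 100.0, "evalue": 10.0 ** (-50 + i)}
    for i in range(12)
]
hits.append({"otu": "OTU1", "subject": "ref_x", "identity": 97.0,
             "coverage": 98.0, "evalue": 1e-40})
report = top_hits(hits, flagged_otus={"OTU2"})
```

Each entry in `report` keeps the identity, coverage and e-value of the hit, so the user can inspect them side by side when deciding whether to retain or remove the OTU.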
