Skip to content

Commit

Permalink
MRG: adjust documentation to recommend tax over lca for taxonomic…
Browse files Browse the repository at this point in the history
… analysis (#2777)

This PR adds cautionary notes to the command line docs, and updates the
information on classifying signatures to suggest using tax instead of
LCA, and even explains why :).

There is more work to be done - we need to add more tutorials, and
adjust the language in classifying-signatures around gather and LCA -
but this is a nice standalone PR!

Fixes #2562
Fixes #2772
Fixes #2773

Adds information from
#2760
Addresses #2535
  • Loading branch information
ctb authored Sep 25, 2023
1 parent 256a705 commit 66bc910
Show file tree
Hide file tree
Showing 2 changed files with 139 additions and 19 deletions.
108 changes: 93 additions & 15 deletions doc/classifying-signatures.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,9 @@
# Classifying signatures: `search`, `gather`, and `lca` methods.

```{contents} Contents
:depth: 3
```

sourmash provides several different techniques for doing
classification and breakdown of signatures.

Expand Down Expand Up @@ -41,6 +45,8 @@ question to ask is the reverse: **what genomes are in my metagenome?**

We have implemented two approaches in sourmash to do this.

<!-- CTB refactor this soon :) -->

One approach uses taxonomic information from e.g. GenBank to classify
individual k-mers, and then infers taxonomic distributions of
metagenome contents from the presence of these individual
Expand All @@ -65,21 +71,59 @@ Some benchmarking on CAMI suggests that `gather` is a very accurate
method for doing strain-level resolution of genomes. More on
that as we move forward!

## To do taxonomy, or not to do taxonomy?
## Taxonomic profiling with sourmash

By default, there is no structured taxonomic information available in
sourmash signatures or SBT databases of signatures. Generally what
this means is that you will have to provide your own mapping from a
match to some taxonomic hierarchy. This is generally the case when
you are working with lots of genomes that have no taxonomic
information.
sourmash supports two basic kinds of taxonomic profiling, under the
`lca` and `tax` modules. **As of 2023, we strongly recommend the
`tax`-based profiling approach.**

The `lca` subcommands, however, work with LCA databases, which contain
taxonomic information by construction. This is one of the main
differences between the `sourmash lca` subcommands and the basic
`sourmash search` functionality. So the `lca` subcommands will generally
output structured taxonomic information, and these are what you should look
to if you are interested in doing classification.
But first, let's back up! By default, there is no structured taxonomic
information available in sourmash signatures or collections. What
this means is that you will have to provide your own mapping from a
match to some taxonomic hierarchy. Both the `lca` and `tax` modules
support identifier-based taxonomic mappings, in which identifiers
from the signature names can be linked to the standard seven NCBI/GTDB
taxonomy ranks - superkingdom, phylum, class, order, family, genus, and
species. These are typically provided in a spreadsheet _separately_ from
the signature collection. The `tax` module also supports `lins` taxonomies,
for which we have a tutorial.

There are several advantages that this approach affords sourmash. One
is that sourmash is not tied closely to a specific taxonomy - you can
use either GTDB or NCBI as you wish. Another advantage is that you can
create your own custom taxonomic ranks and even use them with private
databases of genomes to classify your own metagenomes.

The main disadvantage of sourmash's approach to taxonomy is that
sourmash doesn't classify individual metagenomic reads to either a genome
or a taxon. (Note that we're not sure
this can be done robustly in practice - neither short nor long reads typically
contain enough information to uniquely identify a single genome.) If you
want to do this, we suggest running `sourmash gather` first, and then
mapping the reads to the matching genomes; then you can use the mapping
to determine which read maps to which genome. This is the approach taken by
[the genome-grist pipeline](https://dib-lab.github.io/genome-grist/).

<!-- link to tutorials and examples -->

### Using `sourmash tax` to do taxonomic analysis

We recommend using the `tax` module to do taxonomic classification of
genomes (with `tax genome`) and metagenomes (with `tax metagenome`).
The `tax` module commands operate downstream of `sourmash gather`,
which builds a minimum set cover of the query against the database -
intuitively speaking, this is the shortest possible list of genomes
that the query would map to. Then, both `tax genome` and `tax
metagenome` take the CSV output of `sourmash gather` and produce
taxonomic profiles. (You can read more about minimum set covers
in
[Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers, Irber et al., 2022](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2).)

The `tax metagenome` approach was benchmarked in
[Evaluation of taxonomic classification and profiling methods for long-read shotgun metagenomic sequencing datasets, Portik et al., 2022](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05103-0)
and appears to be both very accurate and very sensitive, unless you're
using Nanopore data or other data types that have a high sequencing
error rate.

It's important to note that taxonomy based on k-mers is very, very
specific and if you get a match, it's pretty reliable. On the
Expand All @@ -88,6 +132,40 @@ to evolutionary divergence, so if you don't get a match it may only
mean that the specific species or genus you're searching for isn't in
the database.

### Using `sourmash lca` to do taxonomic analysis

The `sourmash lca` module supports taxonomic classification using
single hashes, corresponding to single k-mers, in an approach inspired
by Kraken. Briefly, you first build an LCA database using `lca index`,
which takes a taxonomy spreadsheet and a collection of sketches. Then,
you can use `lca classify` to classify single-genome sketches or
`lca summarize` to classify metagenomes.

The `lca` approach is not published anywhere, but we're happy to discuss
it in more detail; just [post to the issue tracker](https://github.com/sourmash-bio/sourmash/issues).

While we do not recommend the `lca` approach for general taxonomic
classification purposes (see below!), it remains useful for certain
kinds of diagnostic evaluation of sequences, so we are leaving the
functionality in sourmash.

### `sourmash tax` vs `sourmash lca`

Why do we recommend using the `tax` module over `lca`? `sourmash lca`
was the first implementation in sourmash, and over the years we've
found that it is prone to false positives: that is, individual k-mers
are very sensitive but are often misassigned to higher taxonomic ranks
than they need to be, either because of contamination in the reference
database or because the taxonomy is not based directly on genome
similarity. Instead of using single k-mers, `sourmash gather` estimates
the best matching genome based on combinations of k-mers, which is much
more specific than the LCA approach; only then is a taxonomy assigned
using `sourmash tax`.

The bottom line is that in our experience, `sourmash tax` is as
sensitive as `lca`, and a lot more specific. Please let us know if you
discover differently!

## Abundance weighting

By default, sourmash tracks k-mer presence, *not* their abundance. The
Expand Down Expand Up @@ -166,9 +244,9 @@ an abundance-weighted query, and automatically apply `--ignore-abundance`.
As of v4.4, `sourmash` can estimate Average Nucleotide Identity (ANI)
between two FracMinHash ("scaled") sketches. `sourmash compare` can now
produce a matrix of ANI values estimated from Jaccard, Containment,
or Max Containment by specifiing `--ani` (optionally along with search type,
or Max Containment by specifying `--ani` (optionally along with search type,
e.g. `--containment`). `sourmash search`, `sourmash prefetch`, and
`sourmash gather` will now output ANI estimates to output csvs.
`sourmash gather` will now output ANI estimates to output CSVs.

Note that while ANI can be estimated from either the Jaccard Index or
the Containment Index, ANI from Containment is preferable (more accurate).
Expand Down
50 changes: 46 additions & 4 deletions doc/command-line.md
Original file line number Diff line number Diff line change
Expand Up @@ -97,13 +97,21 @@ information; these are grouped under the `sourmash tax` and

`sourmash lca` commands:

```{attention}
We do not recommend using the `lca` subcommands for taxonomic analysis
any more; please use `sourmash tax` instead. See
[taxonomic profiling with sourmash](classifying-signatures.md#taxonomic-profiling-with-sourmash)
for more information.
```

* `lca classify` classifies many signatures against an LCA database.
* `lca summarize` summarizes the content of metagenomes using an LCA database.
* `lca index` creates a database for use with LCA subcommands.
* `lca rankinfo` summarizes the content of a database.
* `lca compare_csv` compares lineage spreadsheets, e.g. those output by `lca classify`.

> See [the LCA tutorial](tutorials-lca.md) for a
See [the LCA tutorial](tutorials-lca.md) for a
walkthrough of some of these commands.

Finally, there are a number of utility and information commands:
Expand Down Expand Up @@ -484,8 +492,14 @@ signatures, rather than all the signatures in the database.

## `sourmash tax` subcommands for integrating taxonomic information into gather results

The `sourmash tax` subcommands support taxonomic analysis of genomes
and taxonomic profiling of metagenomes.
See
[taxonomic profiling with sourmash](classifying-signatures.md#taxonomic-profiling-with-sourmash)
for more information.

The sourmash `tax` or `taxonomy` commands integrate taxonomic
information into the results of `sourmash gather`. All `tax` commands
information with the results of `sourmash gather`. All `tax` commands
require one or more properly formatted `taxonomy` files where the
identifiers correspond to those in the database(s) used for
`gather`. Note that if using multiple databases, the `gather` needs
Expand Down Expand Up @@ -1075,15 +1089,21 @@ and 120,757 within the `p__Proteobacteria`.
## `sourmash lca` subcommands for in-memory taxonomy integration

These commands use LCA databases (created with `lca index`, below, or
prepared databases such as
[genbank-k31.lca.json.gz](databases.md)).
prepared databases such as [genbank-k31.lca.json.gz](databases.md)).

### `sourmash lca classify` - classify a genome using an LCA database

`sourmash lca classify` classifies one or more signatures using the given
list of LCA DBs. It is meant for classifying metagenome-assembled genome
bins (MAGs) and single-cell genomes (SAGs).

```{attention}
We no longer recommend using `sourmash lca` for taxonomic analysis;
please use `sourmash tax` instead. See
[taxonomic profiling with sourmash](classifying-signatures.md#taxonomic-profiling-with-sourmash)
for more information.
```

Usage:

```
Expand Down Expand Up @@ -1132,6 +1152,21 @@ number of additional k-mers in the input signature were classified as
taxonomic assignment below genus *Shewanella* and it would report
a status of `disagree` with the genus-level assignment of *Shewanella*;
species level assignments would not be reported.
Here, the assigned rank is the rank immediately *above* where there is
a taxonomic disagreement, and the taxid & lineage refer to the name at
that rank (the least-common-ancestor at which an assignment can be
made).

For another example, if you saw this line in the CSV file:

```
TARA_ASW_MAG_00029,1224,disagree,phylum,Bacteria;Proteobacteria
```

you would know that TARA_ASW_MAG_00029 has k-mers that are shared
between different orders: 'Pseudomonadales' and
'Rhodobacterales'. Therefore, the classifier status is `disagree`, and
the classified taxid is at rank `phylum` - just above `order`.

(This is the approach that Kraken and other lowest common ancestor
implementations use, we believe.)
Expand All @@ -1151,6 +1186,13 @@ exploring metagenomes and metagenome-assembled genome bins.
that output percentages are weighted by the number of times a k-mer is
seen; this can be turned off with `--ignore-abundance`.

```{attention}
We no longer recommend using `sourmash lca` for taxonomic analysis;
please use `sourmash tax` instead. See
[taxonomic profiling with sourmash](classifying-signatures.md#taxonomic-profiling-with-sourmash)
for more information.
```

Usage:

```
Expand Down

0 comments on commit 66bc910

Please sign in to comment.