Skip to content

Commit

Permalink
Merge remote-tracking branch 'origin/latest' into lirber/mastiff
Browse files Browse the repository at this point in the history
  • Loading branch information
luizirber committed Sep 29, 2023
2 parents feaafb1 + ba1fd3e commit 81c5bce
Show file tree
Hide file tree
Showing 9 changed files with 239 additions and 113 deletions.
5 changes: 0 additions & 5 deletions .github/workflows/rust.yml
Original file line number Diff line number Diff line change
Expand Up @@ -4,11 +4,6 @@ on:
push:
branches: [latest]
pull_request:
paths:
- 'Cargo.lock'
- 'src/core/**'
- 'tests/test-data/**'
- '.github/workflows/rust.yml'
schedule:
- cron: "0 0 * * *" # daily

Expand Down
47 changes: 8 additions & 39 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

164 changes: 120 additions & 44 deletions doc/classifying-signatures.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,9 @@
# Classifying signatures: `search`, `gather`, and `lca` methods.

```{contents} Contents
:depth: 3
```

sourmash provides several different techniques for doing
classification and breakdown of signatures.

Expand Down Expand Up @@ -31,63 +35,135 @@ analysis only.
See [the main sourmash tutorial](tutorial-basic.md#make-and-search-a-database-quickly)
for information on using `search` with and without `--containment`.

## Breaking down metagenomic samples with `gather` and `lca`
## Analyzing metagenomic samples with `gather`

Neither search option (similarity or containment) is effective when
comparing or searching with metagenomes, which typically have a
comparing or searching with metagenomes, which typically contain a
mixture of many different genomes. While you might use containment to
see if a query genome is present in one or more metagenomes, a common
question to ask is the reverse: **what genomes are in my metagenome?**

We have implemented two approaches in sourmash to do this.

One approach uses taxonomic information from e.g. GenBank to classify
individual k-mers, and then infers taxonomic distributions of
metagenome contents from the presence of these individual
k-mers. (This is the approach pioneered by
[Kraken](https://ccb.jhu.edu/software/kraken/) and used by many other tools.)
`sourmash lca` can be used to classify individual genome bins with
`classify`, or summarize metagenome taxonomy with `summarize`. The
[sourmash lca tutorial](tutorials-lca.md)
shows how to use the `lca classify` and `lca summarize` commands, and also
provides guidance on building your own database.

The other approach, `gather`, breaks a metagenome down into individual
genomes based on greedy partitioning. Essentially, it takes a query
metagenome and searches the database for the most highly contained
genome; it then subtracts that match from the metagenome, and repeats.
At the end it reports how much of the metagenome remains unknown. The
An alternative phrasing is this: **what reference genomes should I map
my metagenomic reads to?**

The main approach we provide in sourmash is `sourmash gather`. This
constructs the shortest possible list of reference genomes that cover
all of the known k-mers in a metagenome. We call this a *minimum
metagenome cover*.

From an algorithmic perspective, `gather` generates a minimum set
cover for a query metagenome, using the reference database you give
it. The minimum set cover is calculated using a greedy approximation
algorithm. Essentially, `gather` takes a query metagenome and
searches the database for the most highly contained genome; it then
subtracts that match from the metagenome, and repeats. At the end it
reports how much of the metagenome remains unknown. The
[basic sourmash tutorial](tutorial-basic.md#whats-in-my-metagenome)
has some sample output from using gather with GenBank. See Appendix A at
the bottom of this page for more technical details.
has some sample output from using gather with GenBank. See Appendix A
at the bottom of this page for more technical details.

Some benchmarking on CAMI suggests that `gather` is a very accurate
method for doing strain-level resolution of genomes. More on
that as we move forward!
The `gather` method is described in
[Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers, Irber et al., 2022](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2).
Our benchmarking in that paper and also in
[Evaluation of taxonomic classification and profiling methods for long-read shotgun metagenomic sequencing datasets, Portik et al., 2022](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05103-0)
suggests that it is a very sensitive and specific method for
analyzing metagenomes.

## To do taxonomy, or not to do taxonomy?
## Taxonomic profiling with sourmash

By default, there is no structured taxonomic information available in
sourmash signatures or SBT databases of signatures. Generally what
sourmash supports two basic kinds of taxonomic profiling, under the
`lca` and `tax` modules. **As of 2023, we strongly recommend the
`tax`-based profiling approach.**

But first, let's back up! By default, there is no structured taxonomic
information available in sourmash signatures or collections. What
this means is that you will have to provide your own mapping from a
match to some taxonomic hierarchy. This is generally the case when
you are working with lots of genomes that have no taxonomic
information.

The `lca` subcommands, however, work with LCA databases, which contain
taxonomic information by construction. This is one of the main
differences between the `sourmash lca` subcommands and the basic
`sourmash search` functionality. So the `lca` subcommands will generally
output structured taxonomic information, and these are what you should look
to if you are interested in doing classification.

It's important to note that taxonomy based on k-mers is very, very
specific and if you get a match, it's pretty reliable. On the
match to some taxonomic hierarchy. Both the `lca` and `tax` modules
support identifier-based taxonomic mappings, in which identifiers
from the signature names can be linked to the standard seven NCBI/GTDB
taxonomy ranks - superkingdom, phylum, class, order, family, genus, and
species. These are typically provided in a spreadsheet _separately_ from
the signature collection. The `tax` module also supports `lins` taxonomies,
for which we have a tutorial.

There are several advantages that this approach affords sourmash. One
is that sourmash is not tied closely to a specific taxonomy - you can
use either GTDB or NCBI as you wish. Another advantage is that you can
create your own custom taxonomic ranks and even use them with private
databases of genomes to classify your own metagenomes.

The main disadvantage of sourmash's approach to taxonomy is that
sourmash doesn't classify individual metagenomic reads to either a
genome or a taxon. (Note that we're not sure this can be done robustly
in practice - neither short nor long reads typically contain enough
information to uniquely identify a single genome, especially if there
are many genomes from the same species present in the database.) If
you want to do this, we suggest running `sourmash gather` first, and
then mapping the reads to the matching genomes; then you can determine
which read maps to which genome. This is the approach taken by
[the genome-grist pipeline](https://dib-lab.github.io/genome-grist/).

<!-- link to tutorials and examples -->

### Using `sourmash tax` to do taxonomic analysis

We recommend using the `tax` module to do taxonomic classification of
genomes (with `tax genome`) and metagenomes (with `tax metagenome`).
The `tax` module commands operate downstream of `sourmash gather`,
which builds a minimum set cover of the query against the database -
intuitively speaking, this is the shortest possible list of genomes
that the query would map to. Then, both `tax genome` and `tax
metagenome` take the CSV output of `sourmash gather` and produce
taxonomic profiles. (You can read more about minimum set covers
in
[Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers, Irber et al., 2022](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2).)

The `tax metagenome` approach was benchmarked in
[Evaluation of taxonomic classification and profiling methods for long-read shotgun metagenomic sequencing datasets, Portik et al., 2022](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05103-0)
and appears to be both very accurate and very sensitive, unless you're
using Nanopore data or other data types that have a high sequencing
error rate.

It's important to note that taxonomy based on multiple k-mers is very,
very specific and if you get a match, it's pretty reliable. On the
converse, however, k-mer identification is very brittle with respect
to evolutionary divergence, so if you don't get a match it may only
mean that the specific species or genus you're searching for isn't in
the database.

### Using `sourmash lca` to do taxonomic analysis

The `sourmash lca` module supports taxonomic classification using
single hashes, corresponding to single k-mers, in an approach inspired
by Kraken. Briefly, you first build an LCA database using `lca index`,
which takes a taxonomy spreadsheet and a collection of sketches. Then,
you can use `lca classify` to classify single-genome sketches or
`lca summarize` to classify metagenomes.

The `lca` approach is not published anywhere, but we're happy to discuss
it in more detail; just [post to the issue tracker](https://github.com/sourmash-bio/sourmash/issues).

While we do not recommend the `lca` approach for general taxonomic
classification purposes (see below!), it remains useful for certain
kinds of diagnostic evaluation of sequences, so we are leaving the
functionality in sourmash.

### `sourmash tax` vs `sourmash lca`

Why do we recommend using the `tax` module over `lca`? `sourmash lca`
was the first implementation in sourmash, and over the years we've
found that it is prone to false positives: that is, individual k-mers
are very sensitive but are often misassigned to higher taxonomic ranks
than they need to be, either because of contamination in the reference
database or because the taxonomy is not based directly on genome
similarity. Instead of using single k-mers, `sourmash gather` estimates
the best matching genome based on combinations of k-mers, which is much
more specific than the LCA approach; only then is a taxonomy assigned
using `sourmash tax`.

The bottom line is that in our experience, `sourmash tax` is as
sensitive as `lca`, and a lot more specific. Please let us know if you
discover differently!

## Abundance weighting

By default, sourmash tracks k-mer presence, *not* their abundance. The
Expand Down Expand Up @@ -166,9 +242,9 @@ an abundance-weighted query, and automatically apply `--ignore-abundance`.
As of v4.4, `sourmash` can estimate Average Nucleotide Identity (ANI)
between two FracMinHash ("scaled") sketches. `sourmash compare` can now
produce a matrix of ANI values estimated from Jaccard, Containment,
or Max Containment by specifiing `--ani` (optionally along with search type,
or Max Containment by specifying `--ani` (optionally along with search type,
e.g. `--containment`). `sourmash search`, `sourmash prefetch`, and
`sourmash gather` will now output ANI estimates to output csvs.
`sourmash gather` will now output ANI estimates to output CSVs.

Note that while ANI can be estimated from either the Jaccard Index or
the Containment Index, ANI from Containment is preferable (more accurate).
Expand Down
Loading

0 comments on commit 81c5bce

Please sign in to comment.