Skip to content

Commit

Permalink
MRG: revise documentation structure; add internals page. (#2184)
Browse files Browse the repository at this point in the history
This PR rearranges the docs to according to the https://diataxis.fr/
structure, per #2054.

New pages:
* [A heavily revised index
page](https://sourmash--2184.org.readthedocs.build/en/2184/index.html)
* [A guide to the internals of
sourmash](https://sourmash--2184.org.readthedocs.build/en/2184/sourmash-internals.html)
* [Frequently asked
questions](https://sourmash--2184.org.readthedocs.build/en/2184/faq.html)
*
[Publications](https://sourmash--2184.org.readthedocs.build/en/2184/publications.html)
*
[Funding](https://sourmash--2184.org.readthedocs.build/en/2184/funding.html)

Fixes #2054 (document restructuring)
Fixes #932 (add an FAQ)
Fixes #2760 (tax preferred to lca)
Tackles #1227 (what is
gather)
Fixes #971 (funding acks)
Fixes #1289 (p_match and
p_query)
Fixes #1531 (document
memory tradeoffs in save formats)
Fixes #1532 (order of
database load/reporting)
Fixes #1609 (better
gather description, `f_unique_query`, etc.)
Fixes #2170 (use
`detection`)
Fixes #1881 (correlation
with read mapping)
Fixes #2566 (retrieving
reads)
Fixes #2775 (vision &
mission)

---------

Co-authored-by: ccbaumler <[email protected]>
  • Loading branch information
ctb and ccbaumler authored Oct 15, 2023
1 parent 9ff523e commit 0db31c8
Show file tree
Hide file tree
Showing 12 changed files with 1,443 additions and 155 deletions.
2 changes: 1 addition & 1 deletion doc/api-example.md
Original file line number Diff line number Diff line change
Expand Up @@ -476,7 +476,7 @@ building the minhashes.

If you're interested in building comparison matrices and dendrograms,
please see the notebook
[Building plots from `sourmash compare` output](plotting-compare.md).
[Building plots from `sourmash compare` output](plotting-compare.ipynb).

## Saving and loading signature files

Expand Down
140 changes: 139 additions & 1 deletion doc/classifying-signatures.md
Original file line number Diff line number Diff line change
Expand Up @@ -293,7 +293,8 @@ match that is _unique_ with respect to the original query. It will
always decrease as you get more matches. The sum of
`f_unique_to_query` across all rows is what is reported in by gather
as the fraction of query k-mers hit by the recovered matches
(unweighted) and should never be greater than 1!
(unweighted) and should never be greater than 1! This column should
be used in any analysis that needs to avoid double-counting matches.

The second column, `f_match_orig`, is how much of the match is in the
_original_ query. For this fictional metagenome, each match is
Expand Down Expand Up @@ -395,3 +396,140 @@ overlap p_query p_match avg_abund
The cause of the poor approximation for genome s10 is unclear; it
could be due to low coverage, or small genome size, or something
else. More investigation needed.

## Appendix C: sourmash gather output examples

Below we show two real gather analyses done with a mock metagenome,
SRR606249 (from
[Shakya et al., 2014](https://pubmed.ncbi.nlm.nih.gov/23387867/)) and
three of the known genomes contained within it - two *Shewanella baltica*
strains and one *Akkermansia muciniphila* genome

### sourmash gather with a query containing abundance information

```
% sourmash gather -k 31 SRR606249.trim.sig.zip podar-ref/2.fa.sig podar-ref/47.fa.sig podar-ref/63.fa.sig
== This is sourmash version 4.8.5.dev0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
selecting specified query k=31
loaded query: SRR606249... (k=31, DNA)
--
loaded 9 total signatures from 3 locations.
after selecting signatures compatible with search, 3 remain.
Starting prefetch sweep across databases.
Prefetch found 3 signatures with overlap >= 50.0 kbp.
Doing gather to generate minimum metagenome cover.
overlap p_query p_match avg_abund
--------- ------- ------- ---------
5.2 Mbp 0.8% 99.0% 11.7 NC_011663.1 Shewanella baltica OS223...
2.7 Mbp 0.9% 100.0% 24.5 CP001071.1 Akkermansia muciniphila A...
5.2 Mbp 0.3% 51.0% 8.1 NC_009665.1 Shewanella baltica OS185...
found less than 50.0 kbp in common. => exiting
found 3 matches total;
the recovered matches hit 2.0% of the abundance-weighted query.
the recovered matches hit 2.5% of the query k-mers (unweighted).
```

### sourmash gather with the same query, *ignoring* abundances

```
% sourmash gather -k 31 SRR606249.trim.sig.zip podar-ref/2.fa.sig podar-ref/47.fa.sig podar-ref/63.fa.sig --ignore-abundance
== This is sourmash version 4.8.5.dev0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
selecting specified query k=31
loaded query: SRR606249... (k=31, DNA)
--
loaded 9 total signatures from 3 locations.
after selecting signatures compatible with search, 3 remain.
Starting prefetch sweep across databases.
Prefetch found 3 signatures with overlap >= 50.0 kbp.
Doing gather to generate minimum metagenome cover.
overlap p_query p_match
--------- ------- -------
5.2 Mbp 1.2% 99.0% NC_011663.1 Shewanella baltica OS223, complete...
2.7 Mbp 0.6% 100.0% CP001071.1 Akkermansia muciniphila ATCC BAA-83...
5.2 Mbp 0.6% 51.0% NC_009665.1 Shewanella baltica OS185, complete...
found less than 50.0 kbp in common. => exiting
found 3 matches total;
the recovered matches hit 2.5% of the query k-mers (unweighted).
```

### Notes and comparisons

There are a few interesting things to point out about the above output:

* `p_match` is the same whether or not abundance information is used.
This is because it is the fraction of the matching genome detected in
the metagenome, which is inherently "flat". It is also reported
progressively: only the portions of the metagenome that have not
matched to any previous matches are used in `p_match`; read on for
details :).
* `p_query` is different when abundance information is used. For
queries with abundance information, `p_query` provides a weighted
estimate that approximates the number of metagenome reads that would
map to this genome _after_ mapping reads to all previously reported
matches, i.e. all matches above this match.
* When abundance information is not available or
not used, `p_query` is an estimate of what fraction of the remaining k-mers
in the metagenome match to this genome, after all previous matches
have been removed.
* The `avg_abund` column only shows up when abundance information is
supplied. This is the k-mer coverage of the detected portion of the
match; it is a lower bound on the expected mapping-based coverage
for metagenome reads mapped to the detected portion of the match.
* The percent of recovered matches for the abundance-weighted query
is the sum of the `p_query` column and estimates the total fraction
of metagenome reads that will map across all of the matching references.
* The percent of recovered matches when _ignoring_ abundances is likewise
the sum of the (unweighted) `p_query` column and is not particularly
informative - it will always be low for real metagenomes, because sourmash
cannot match erroneous k-mers created by sequencing errors.
* The `overlap` column is the estimated size of the overlap between the
(unweighted) original query metagenome and the match. It does not take
into account previous matches.

Last but not least, something interesting is going on here with strains.
While it is not highlighted in the text output of gather, there is
substantial overlap between the two *Shewanella baltica* genomes. And,
in fact, both of them are entirely (well, 99%) present in the metagenome
if measured individually with `sourmash search --containment`.

Consider a few more details:

* `p_match` for the first *Shewanella* match, `NC_011663.1`, is 99%!
* `p_match` for the second *Shewanella* match, `NC_009665.1`, is only 50%!
* and, confusingly, the `overlap` for both matches is 5.2 Mbp!

What's up?!

What's happening here is that `sourmash gather` is subtracting the match
to the first *Shewanella* genome from the metagenome before moving on to
the next result, and `p_match` reports only the amount of the match
detected in the _remaining_ metagenome after that subtraction.
However, `overlap` is reported as the amount of overlap with the
_original_ metagenome, which is essentially the entire genome in all
three cases.

The main things to keep in mind for gather are this:

* `p_query` and `p_match` do not double-count k-mers or matches; in particular, you can sum across `p_query` for a metagenome without
counting anything more than once.
* `overlap` _does_ count matches redundantly.
* the percent of recovered matches is a useful summary of the whole
metagenome!

We know it's confusing but it's the best output we've been able to
figure out across all of the different use cases for gather. Perhaps
in the future we'll find a better way to represent all of these
numbers in a more clear, concise, and interpretable way; in the
meantime, we welcome your questions and comments!
40 changes: 38 additions & 2 deletions doc/command-line.md
Original file line number Diff line number Diff line change
Expand Up @@ -361,11 +361,40 @@ overlap p_query p_match
0.9 Mbp 7.4% 11.8% BA000019.2 Nostoc sp. PCC 7120 DNA, c...
0.7 Mbp 5.9% 23.0% FOVK01000036.1 Proteiniclasticum rumi...
0.7 Mbp 5.3% 17.6% AE017285.1 Desulfovibrio vulgaris sub...
```
...
found less than 50.0 kbp in common. => exiting
found 64 matches total;
the recovered matches hit 94.0% of the abundance-weighted query.
the recovered matches hit 45.6% of the query k-mers (unweighted).
```

For each match,
* 'overlap', the first column, is the estimated number of k-mers shared between the match and the query.
* 'p_query' is the _percentage_ of the query that overlaps with the match; it is the amount of the metagenome "explained" by this match.
* 'p_match' is the percentage of the _match_ that overlaps with the query; it is the "detection" of the match in the metagenome.

Quite a bit more information per match row is available in the CSV
output saved with `-o`; for details, see
[Classifying signatures: how sourmash gather works](classifying-signatures.md#appendix-a-how-sourmash-gather-works).

The "recovered matches" lines detail how much of the query is
explained by the entire collection of matches. You will get two numbers if
your metagenome sketch has been calculated with `-p abund`, and only
one if it does not have abundances. The abundance-weighted
number should approximate the fraction of metagenome reads that will
map to at least one reference genome, while the unweighted number
describes how much of the metagenome itself matches to genomes.
Here's another way to put it: if the metagenome could be perfectly
assembled into contigs, the unweighted number would approximate the
number of bases from the contigs that would match perfectly to at
least one genome in the reference database. More practically,
the abundance-weighted number is less sensitive to sequencing errors.
See [classifying signatures](classifying-signatures.md#abundance-weighting) or [the FAQ](faq.md) for more information!

The command line option `--threshold-bp` sets the threshold below
which matches are no longer reported; by default, this is set to
50kb. see the Appendix in
50kb. See the Appendix in
[Classifying Signatures](classifying-signatures.md) for details.

As of sourmash 4.2.0, `gather` supports `--picklist`, to
Expand Down Expand Up @@ -2033,6 +2062,13 @@ All of these save formats can be loaded by sourmash commands.
**We strongly suggest using .zip files to store signatures: they are fast,
small, and fully supported by all the sourmash commands.**

Note that when outputting large collections of signatures, some save
formats require holding all the sketches in memory until they can be
written out, and others can save progressively. This can affect memory
usage! Currently `.sig` and `.sig.gz` formats are held in memory,
while `.zip`, directory outputs, and `.sqldb` formats write progressively
to disk.

For more detailed information on database formats and performance
tradeoffs, please see [the advanced usage information for
databases!](databases-advanced.md)
Expand Down
10 changes: 7 additions & 3 deletions doc/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -59,8 +59,8 @@

# General information about the project.
project = 'sourmash'
copyright = '2016-2020, C. Titus Brown and Luiz Irber'
author = 'C. Titus Brown and Luiz Irber'
copyright = '2016-2023, C. Titus Brown, Luiz Irber, and N. Tessa Pierce-Ward'
author = 'C. Titus Brown, Luiz Irber, and N. Tessa Pierce-Ward'

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
Expand Down Expand Up @@ -114,6 +114,9 @@
# If true, `todo` and `todoList` produce output, else they produce nothing.
todo_include_todos = False

# CTB: suppress warnings about circularity in ToC.
# see https://github.com/sphinx-doc/sphinx/issues/7410.
suppress_warnings = ['toc.circular']

# -- Options for HTML output ----------------------------------------------

Expand All @@ -127,7 +130,8 @@
html_theme_options = {
'logo': 'logo.png',
'logo_name': True,
'description': 'Quickly search, compare, and analyze genomic and metagenomic data sets'
'description': 'Quickly search, compare, and analyze genomic and metagenomic data sets',
'sidebar_collapse': False,
}

# Add any paths that contain custom themes here, relative to this directory.
Expand Down
Loading

0 comments on commit 0db31c8

Please sign in to comment.