Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MRG: revise documentation structure; add internals page. #2184

Merged
merged 66 commits into from
Oct 15, 2023
Merged
Show file tree
Hide file tree
Changes from 51 commits
Commits
Show all changes
66 commits
Select commit Hold shift + click to select a range
77e4f49
first try at revised index
ctb Aug 6, 2022
9f9f046
Merge branch 'latest' into update_doc_structure
ctb Aug 15, 2022
33eeb06
lots of details
ctb Aug 16, 2022
68fe6d7
much docs, so wow
ctb Aug 16, 2022
0bf170d
even more docs
ctb Aug 16, 2022
f3ccf11
even more docs
ctb Aug 16, 2022
4c60cd7
mo'
ctb Aug 16, 2022
471ddf4
add issue links
ctb Aug 16, 2022
419e433
fix more links
ctb Aug 16, 2022
8e9248f
notes
ctb Aug 17, 2022
91fbf7c
update text
ctb Aug 17, 2022
b67d38f
Apply suggestions from code review
ctb Aug 18, 2022
fbbcee5
address many of @ccbaumler suggestions
ctb Aug 18, 2022
87ab0dc
Update doc/sourmash-internals.md
ctb Aug 18, 2022
ee17cf3
misc changes
ctb Aug 18, 2022
bb4a89d
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Aug 19, 2022
6d4a397
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Aug 20, 2022
8fce224
whups, add faq.md
ctb Aug 20, 2022
9e1fbfc
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Aug 21, 2022
df8d0e2
Merge branch 'latest' into update_doc_structure
ctb Aug 22, 2022
a967534
Merge branch 'latest' into update_doc_structure
ctb Aug 26, 2022
a1ec24b
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Aug 30, 2022
2b6e3df
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Aug 31, 2022
e468205
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Sep 23, 2023
2f1eb4b
update some text
ctb Sep 23, 2023
6576d07
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Sep 26, 2023
87ae6a2
more FAQ
ctb Sep 26, 2023
1d5b754
update / add verify notes
ctb Sep 26, 2023
7a254ee
more writing
ctb Sep 26, 2023
651ca6e
more
ctb Sep 26, 2023
01ed183
more
ctb Sep 27, 2023
8b2fac3
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Sep 27, 2023
d99d7c9
fix links etc
ctb Sep 27, 2023
c60da96
add faq
ctb Sep 27, 2023
0120b78
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Sep 28, 2023
47aa759
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Sep 29, 2023
b21282e
add funding acks
ctb Sep 29, 2023
de2b5a4
detailed gather output
ctb Sep 30, 2023
b789b73
minor updates
ctb Sep 30, 2023
fb45b6c
add information about save formats and memory usage
ctb Sep 30, 2023
69dfbc0
add section on choosing a reference for read mapping
ctb Sep 30, 2023
e828ff9
add section on retrieving reads to FAQ
ctb Sep 30, 2023
4d2a91b
collection load order
ctb Sep 30, 2023
4e97e8e
update author/copyright
ctb Sep 30, 2023
571c865
minor corrections
ctb Sep 30, 2023
c6251d3
close to done with a first pass
ctb Oct 1, 2023
1d1e460
finish off internals
ctb Oct 1, 2023
4ed0367
finish things off?
ctb Oct 1, 2023
a4c1244
clean up missing refs
ctb Oct 3, 2023
3967b15
add publications
ctb Oct 3, 2023
30ed122
Merge branch 'latest' into update_doc_structure
ctb Oct 3, 2023
362a48f
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Oct 13, 2023
1f44043
Merge branch 'update_doc_structure' of https://github.com/sourmash-bi…
ctb Oct 13, 2023
535d7eb
update new -> index
ctb Oct 13, 2023
6026ecb
Apply suggestions from code review
ctb Oct 13, 2023
247d4e4
Merge branch 'update_doc_structure' into update_doc_structure_index
ctb Oct 14, 2023
82aa9ed
update
ctb Oct 14, 2023
14e9a51
add global toc
ctb Oct 14, 2023
c30aea6
remove ToC at top
ctb Oct 14, 2023
a08a36f
try unhiding
ctb Oct 14, 2023
d128c65
add sidebar
ctb Oct 14, 2023
13fde86
add TOC to sidebar
ctb Oct 14, 2023
8144c3c
upd sidebar
ctb Oct 14, 2023
6a5bb1a
rename internals page
ctb Oct 14, 2023
817092f
clean up
ctb Oct 14, 2023
b944b69
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Oct 15, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
140 changes: 139 additions & 1 deletion doc/classifying-signatures.md
Original file line number Diff line number Diff line change
Expand Up @@ -293,7 +293,8 @@ match that is _unique_ with respect to the original query. It will
always decrease as you get more matches. The sum of
`f_unique_to_query` across all rows is what is reported in by gather
as the fraction of query k-mers hit by the recovered matches
(unweighted) and should never be greater than 1!
(unweighted) and should never be greater than 1! This column should
be used in any analysis that needs to avoid double-counting matches.

The second column, `f_match_orig`, is how much of the match is in the
_original_ query. For this fictional metagenome, each match is
Expand Down Expand Up @@ -395,3 +396,140 @@ overlap p_query p_match avg_abund
The cause of the poor approximation for genome s10 is unclear; it
could be due to low coverage, or small genome size, or something
else. More investigation needed.

## Appendix C: sourmash gather output examples

Below we show two real gather analyses done with a mock metagenome,
SRR606249 (from
[Shakya et al., 2014](https://pubmed.ncbi.nlm.nih.gov/23387867/)) and
three of the known genomes contained within it - two *Shewanella baltica*
strains and one *Akkermansia muciniphila* genome

### sourmash gather with a query containing abundance information

```
% sourmash gather -k 31 SRR606249.trim.sig.zip podar-ref/2.fa.sig podar-ref/47.fa.sig podar-ref/63.fa.sig

== This is sourmash version 4.8.5.dev0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

selecting specified query k=31
loaded query: SRR606249... (k=31, DNA)
--
loaded 9 total signatures from 3 locations.
after selecting signatures compatible with search, 3 remain.

Starting prefetch sweep across databases.
Prefetch found 3 signatures with overlap >= 50.0 kbp.
Doing gather to generate minimum metagenome cover.

overlap p_query p_match avg_abund
--------- ------- ------- ---------
5.2 Mbp 0.8% 99.0% 11.7 NC_011663.1 Shewanella baltica OS223...
2.7 Mbp 0.9% 100.0% 24.5 CP001071.1 Akkermansia muciniphila A...
5.2 Mbp 0.3% 51.0% 8.1 NC_009665.1 Shewanella baltica OS185...
found less than 50.0 kbp in common. => exiting

found 3 matches total;
the recovered matches hit 2.0% of the abundance-weighted query.
the recovered matches hit 2.5% of the query k-mers (unweighted).
```

### sourmash gather with the same query, *ignoring* abundances

```
% sourmash gather -k 31 SRR606249.trim.sig.zip podar-ref/2.fa.sig podar-ref/47.fa.sig podar-ref/63.fa.sig --ignore-abundance

== This is sourmash version 4.8.5.dev0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

selecting specified query k=31
loaded query: SRR606249... (k=31, DNA)
--
loaded 9 total signatures from 3 locations.
after selecting signatures compatible with search, 3 remain.

Starting prefetch sweep across databases.
Prefetch found 3 signatures with overlap >= 50.0 kbp.
Doing gather to generate minimum metagenome cover.

overlap p_query p_match
--------- ------- -------
5.2 Mbp 1.2% 99.0% NC_011663.1 Shewanella baltica OS223, complete...
2.7 Mbp 0.6% 100.0% CP001071.1 Akkermansia muciniphila ATCC BAA-83...
5.2 Mbp 0.6% 51.0% NC_009665.1 Shewanella baltica OS185, complete...
found less than 50.0 kbp in common. => exiting

found 3 matches total;
the recovered matches hit 2.5% of the query k-mers (unweighted).
```

### Notes and comparisons

There are a few interesting things to point out about the above output:

* `p_match` is the same whether or not abundance information is used.
This is because it is the fraction of the matching genome detected in
the metagenome, which is inherently "flat". It is also reported
progressively: only the portions of the metagenome that have not
matched to any previous matches are used in `p_match`; read on for
details :).
* `p_query` is different when abundance information is used. For
queries with abundance information, `p_query` provides a weighted
estimate that approximates the number of metagenome reads that would
map to this genome _after_ mapping reads to all previously reported
matches, i.e. all matches above this match.
* When abundance information is not available or
not used, `p_query` is an estimate of what fraction of the remaining k-mers
in the metagenome match to this genome, after all previous matches
have been removed.
* The `avg_abund` column only shows up when abundance information is
supplied. This is the k-mer coverage of the detected portion of the
match; it is a lower bound on the expected mapping-based coverage
for metagenome reads mapped to the detected portion of the match.
* The percent of recovered matches for the abundance-weighted query
is the sum of the `p_query` column and estimates the total fraction
of metagenome reads that will map across all of the matching references.
* The percent of recovered matches when _ignoring_ abundances is likewise
the sum of the (unweighted) `p_query` column and is not particularly
informative - it will always be low for real metagenomes, because sourmash
cannot match erroneous k-mers created by sequencing errors.
* The `overlap` column is the estimated size of the overlap between the
(unweighted) original query metagenome and the match. It does not take
into account previous matches.

Last but not least, something interesting is going on here with strains.
While it is not highlighted in the text output of gather, there is
substantial overlap between the two *Shewanella baltica* genomes. And,
in fact, both of them are entirely (well, 99%) present in the metagenome
if measured individually with `sourmash search --containment`.

Consider a few more details:

* `p_match` for the first *Shewanella* match, `NC_011663.1`, is 99%!
* `p_match` for the second *Shewanella* match, `NC_009665.1`, is only 50%!
* and, confusingly, the `overlap` for both matches is 5.2 Mbp!

What's up?!

What's happening here is that `sourmash gather` is subtracting the match
to the first *Shewanella* genome from the metagenome before moving on to
the next result, and `p_match` reports only the amount of the match
detected in the _remaining_ metagenome after that subtraction.
However, `overlap` is reported as the amount of overlap with the
_original_ metagenome, which is essentially the entire genome in all
three cases.

The main things to keep in mind for gather are this:

* `p_query` and `p_match` do not double-count k-mers or matches; in particular, you can sum across `p_query` for a metagenome without
counting anything more than once.
* `overlap` _does_ count matches redundantly.
* the percent of recovered matches is a useful summary of the whole
metagenome!

We know it's confusing but it's the best output we've been able to
figure out across all of the different use cases for gather. Perhaps
in the future we'll find a better way to represent all of these
numbers in a more clear, concise, and interpretable way; in the
meantime, we welcome your questions and comments!
40 changes: 38 additions & 2 deletions doc/command-line.md
Original file line number Diff line number Diff line change
Expand Up @@ -358,11 +358,40 @@ overlap p_query p_match
0.9 Mbp 7.4% 11.8% BA000019.2 Nostoc sp. PCC 7120 DNA, c...
0.7 Mbp 5.9% 23.0% FOVK01000036.1 Proteiniclasticum rumi...
0.7 Mbp 5.3% 17.6% AE017285.1 Desulfovibrio vulgaris sub...
```
...
found less than 50.0 kbp in common. => exiting

found 64 matches total;
the recovered matches hit 94.0% of the abundance-weighted query.
the recovered matches hit 45.6% of the query k-mers (unweighted).
```

For each match,
* 'overlap', the first column, is the estimated number of k-mers shared between the match and the query.
* 'p_query' is the _percentage_ of the query that overlaps with the match; it is the amount of the metagenome "explained" by this match.
* 'p_match' is the percentage of the _match_ that overlaps with the query; it is the "detection" of the match in the metagenome.

Quite a bit more information per match row is available in the CSV
output saved with `-o`; for details, see
[Classifying signatures: how sourmash gather works](classifying-signatures.md#appendix-a-how-sourmash-gather-works).

The "recovered matches" lines detail how much of the query is
explained by the entire collection of matches. You will get two numbers if
your metagenome sketch has been calculated with `-p abund`, and only
one if it does not have abundances. The abundance-weighted
number should approximate the fraction of metagenome reads that will
map to at least one reference genome, while the unweighted number
describes how much of the metagenome itself matches to genomes.
Here's another way to put it: if the metagenome could be perfectly
assembled into contigs, the unweighted number would approximate the
number of bases from the contigs that would match perfectly to at
least one genome in the reference database. More practically,
the abundance-weighted number is less sensitive to sequencing errors.
See [classifying signatures](classifying-signatures.md#abundance-weighting) or [the FAQ](faq.md) for more information!

The command line option `--threshold-bp` sets the threshold below
which matches are no longer reported; by default, this is set to
50kb. see the Appendix in
50kb. See the Appendix in
[Classifying Signatures](classifying-signatures.md) for details.

As of sourmash 4.2.0, `gather` supports `--picklist`, to
Expand Down Expand Up @@ -2030,6 +2059,13 @@ All of these save formats can be loaded by sourmash commands.
**We strongly suggest using .zip files to store signatures: they are fast,
small, and fully supported by all the sourmash commands.**

Note that when outputting large collections of signatures, some save
formats require holding all the sketches in memory until they can be
written out, and others can save progressively. This can affect memory
usage! Currently `.sig` and `.sig.gz` formats are held in memory,
while `.zip`, directory outputs, and `.sqldb` formats write progressively
to disk.

For more detailed information on database formats and performance
tradeoffs, please see [the advanced usage information for
databases!](databases-advanced.md)
Expand Down
4 changes: 2 additions & 2 deletions doc/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -59,8 +59,8 @@

# General information about the project.
project = 'sourmash'
copyright = '2016-2020, C. Titus Brown and Luiz Irber'
author = 'C. Titus Brown and Luiz Irber'
copyright = '2016-2023, C. Titus Brown, Luiz Irber, and N. Tessa Pierce-Ward'
author = 'C. Titus Brown, Luiz Irber, and N. Tessa Pierce-Ward'

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
Expand Down
Loading
Loading