sourmash-bio · ctb · Oct 15, 2023 · Aug 6, 2022 · Aug 15, 2022 · Aug 16, 2022
diff --git a/doc/classifying-signatures.md b/doc/classifying-signatures.md
@@ -293,7 +293,8 @@ match that is _unique_ with respect to the original query. It will
 always decrease as you get more matches. The sum of
 `f_unique_to_query` across all rows is what is reported in by gather
 as the fraction of query k-mers hit by the recovered matches
-(unweighted) and should never be greater than 1!
+(unweighted) and should never be greater than 1! This column should
+be used in any analysis that needs to avoid double-counting matches.
 
 The second column, `f_match_orig`, is how much of the match is in the
 _original_ query.  For this fictional metagenome, each match is
@@ -395,3 +396,140 @@ overlap     p_query p_match avg_abund
 The cause of the poor approximation for genome s10 is unclear; it
 could be due to low coverage, or small genome size, or something
 else. More investigation needed.
+
+## Appendix C: sourmash gather output examples
+
+Below we show two real gather analyses done with a mock metagenome,
+SRR606249 (from
+[Shakya et al., 2014](https://pubmed.ncbi.nlm.nih.gov/23387867/)) and
+three of the known genomes contained within it - two *Shewanella baltica*
+strains and one *Akkermansia muciniphila* genome
+
+### sourmash gather with a query containing abundance information
+
+```
+% sourmash gather -k 31 SRR606249.trim.sig.zip podar-ref/2.fa.sig podar-ref/47.fa.sig podar-ref/63.fa.sig
+
+== This is sourmash version 4.8.5.dev0. ==
+== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
+
+selecting specified query k=31
+loaded query: SRR606249... (k=31, DNA)
+--
+loaded 9 total signatures from 3 locations.
+after selecting signatures compatible with search, 3 remain.
+
+Starting prefetch sweep across databases.
+Prefetch found 3 signatures with overlap >= 50.0 kbp.
+Doing gather to generate minimum metagenome cover.
+
+overlap     p_query p_match avg_abund
+---------   ------- ------- ---------
+5.2 Mbp        0.8%   99.0%      11.7    NC_011663.1 Shewanella baltica OS223...
+2.7 Mbp        0.9%  100.0%      24.5    CP001071.1 Akkermansia muciniphila A...
+5.2 Mbp        0.3%   51.0%       8.1    NC_009665.1 Shewanella baltica OS185...
+found less than 50.0 kbp in common. => exiting
+
+found 3 matches total;
+the recovered matches hit 2.0% of the abundance-weighted query.
+the recovered matches hit 2.5% of the query k-mers (unweighted).
+```
+
+### sourmash gather with the same query, *ignoring* abundances
+
+```
+% sourmash gather -k 31 SRR606249.trim.sig.zip podar-ref/2.fa.sig podar-ref/47.fa.sig podar-ref/63.fa.sig --ignore-abundance
+
+== This is sourmash version 4.8.5.dev0. ==
+== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
+
+selecting specified query k=31
+loaded query: SRR606249... (k=31, DNA)
+--
+loaded 9 total signatures from 3 locations.
+after selecting signatures compatible with search, 3 remain.
+
+Starting prefetch sweep across databases.
+Prefetch found 3 signatures with overlap >= 50.0 kbp.
+Doing gather to generate minimum metagenome cover.
+
+overlap     p_query p_match
+---------   ------- -------
+5.2 Mbp        1.2%   99.0%    NC_011663.1 Shewanella baltica OS223, complete...
+2.7 Mbp        0.6%  100.0%    CP001071.1 Akkermansia muciniphila ATCC BAA-83...
+5.2 Mbp        0.6%   51.0%    NC_009665.1 Shewanella baltica OS185, complete...
+found less than 50.0 kbp in common. => exiting
+
+found 3 matches total;
+the recovered matches hit 2.5% of the query k-mers (unweighted).
+```
+
+### Notes and comparisons
+
+There are a few interesting things to point out about the above output:
+
+* `p_match` is the same whether or not abundance information is used.
+  This is because it is the fraction of the matching genome detected in
+  the metagenome, which is inherently "flat". It is also reported
+  progressively: only the portions of the metagenome that have not
+  matched to any previous matches are used in `p_match`; read on for
+  details :).
+* `p_query` is different when abundance information is used. For
+  queries with abundance information, `p_query` provides a weighted
+  estimate that approximates the number of metagenome reads that would
+  map to this genome _after_ mapping reads to all previously reported
+  matches, i.e. all matches above this match.
+* When abundance information is not available or
+  not used, `p_query` is an estimate of what fraction of the remaining k-mers
+  in the metagenome match to this genome, after all previous matches
+  have been removed.
+* The `avg_abund` column only shows up when abundance information is
+  supplied. This is the k-mer coverage of the detected portion of the
+  match; it is a lower bound on the expected mapping-based coverage
+  for metagenome reads mapped to the detected portion of the match.
+* The percent of recovered matches for the abundance-weighted query
+  is the sum of the `p_query` column and estimates the total fraction
+  of metagenome reads that will map across all of the matching references.
+* The percent of recovered matches when _ignoring_ abundances is likewise
+  the sum of the (unweighted) `p_query` column and is not particularly
+  informative - it will always be low for real metagenomes, because sourmash
+  cannot match erroneous k-mers created by sequencing errors.
+* The `overlap` column is the estimated size of the overlap between the
+  (unweighted) original query metagenome and the match. It does not take
+  into account previous matches.
+
+Last but not least, something interesting is going on here with strains.
+While it is not highlighted in the text output of gather, there is
+substantial overlap between the two *Shewanella baltica* genomes. And,
+in fact, both of them are entirely (well, 99%) present in the metagenome
+if measured individually with `sourmash search --containment`.
+
+Consider a few more details:
+
+* `p_match` for the first *Shewanella* match, `NC_011663.1`, is 99%!
+* `p_match` for the second *Shewanella* match, `NC_009665.1`, is only 50%!
+* and, confusingly, the `overlap` for both matches is 5.2 Mbp!
+
+What's up?!
+
+What's happening here is that `sourmash gather` is subtracting the match
+to the first *Shewanella* genome from the metagenome before moving on to
+the next result, and `p_match` reports only the amount of the match
+detected in the _remaining_ metagenome after that subtraction.
+However, `overlap` is reported as the amount of overlap with the
+_original_ metagenome, which is essentially the entire genome in all
+three cases.
+
+The main things to keep in mind for gather are this:
+
+* `p_query` and `p_match` do not double-count k-mers or matches; in particular, you can sum across `p_query` for a metagenome without
+  counting anything more than once.
+* `overlap` _does_ count matches redundantly.
+* the percent of recovered matches is a useful summary of the whole
+  metagenome!
+
+We know it's confusing but it's the best output we've been able to
+figure out across all of the different use cases for gather.  Perhaps
+in the future we'll find a better way to represent all of these
+numbers in a more clear, concise, and interpretable way; in the
+meantime, we welcome your questions and comments!
diff --git a/doc/command-line.md b/doc/command-line.md
@@ -358,11 +358,40 @@ overlap     p_query p_match
 0.9 Mbp       7.4%   11.8%      BA000019.2 Nostoc sp. PCC 7120 DNA, c...
 0.7 Mbp       5.9%   23.0%      FOVK01000036.1 Proteiniclasticum rumi...
 0.7 Mbp       5.3%   17.6%      AE017285.1 Desulfovibrio vulgaris sub...
-```
+...
+found less than 50.0 kbp in common. => exiting
+
+found 64 matches total;
+the recovered matches hit 94.0% of the abundance-weighted query.
+the recovered matches hit 45.6% of the query k-mers (unweighted).
+```
+
+For each match,
+* 'overlap', the first column, is the estimated number of k-mers shared between the match and the query.
+* 'p_query' is the _percentage_ of the query that overlaps with the match; it is the amount of the metagenome "explained" by this match.
+* 'p_match' is the percentage of the _match_ that overlaps with the query; it is the "detection" of the match in the metagenome.
+
+Quite a bit more information per match row is available in the CSV
+output saved with `-o`; for details, see
+[Classifying signatures: how sourmash gather works](classifying-signatures.md#appendix-a-how-sourmash-gather-works).
+
+The "recovered matches" lines detail how much of the query is
+explained by the entire collection of matches. You will get two numbers if
+your metagenome sketch has been calculated with `-p abund`, and only
+one if it does not have abundances. The abundance-weighted
+number should approximate the fraction of metagenome reads that will
+map to at least one reference genome, while the unweighted number
+describes how much of the metagenome itself matches to genomes.
+Here's another way to put it: if the metagenome could be perfectly
+assembled into contigs, the unweighted number would approximate the
+number of bases from the contigs that would match perfectly to at
+least one genome in the reference database.  More practically,
+the abundance-weighted number is less sensitive to sequencing errors.
+See [classifying signatures](classifying-signatures.md#abundance-weighting) or [the FAQ](faq.md) for more information!
 
 The command line option `--threshold-bp` sets the threshold below
 which matches are no longer reported; by default, this is set to
-50kb. see the Appendix in
+50kb. See the Appendix in
 [Classifying Signatures](classifying-signatures.md) for details.
 
 As of sourmash 4.2.0, `gather` supports `--picklist`, to
@@ -2030,6 +2059,13 @@ All of these save formats can be loaded by sourmash commands.
 **We strongly suggest using .zip files to store signatures: they are fast,
 small, and fully supported by all the sourmash commands.**
 
+Note that when outputting large collections of signatures, some save
+formats require holding all the sketches in memory until they can be
+written out, and others can save progressively. This can affect memory
+usage! Currently `.sig` and `.sig.gz` formats are held in memory,
+while `.zip`, directory outputs, and `.sqldb` formats write progressively
+to disk.
+
 For more detailed information on database formats and performance
 tradeoffs, please see [the advanced usage information for
 databases!](databases-advanced.md)

diff --git a/doc/conf.py b/doc/conf.py
@@ -59,8 +59,8 @@
 
 # General information about the project.
 project = 'sourmash'
-copyright = '2016-2020, C. Titus Brown and Luiz Irber'
-author = 'C. Titus Brown and Luiz Irber'
+copyright = '2016-2023, C. Titus Brown, Luiz Irber, and N. Tessa Pierce-Ward'
+author = 'C. Titus Brown, Luiz Irber, and N. Tessa Pierce-Ward'
 
 # The version info for the project you're documenting, acts as replacement for
 # |version| and |release|, also used in various other places throughout the