From de2b5a4d79c4f7e2bc12a65f0be0fbb070ad21ad Mon Sep 17 00:00:00 2001
From: "C. Titus Brown" <titus@idyll.org>
Date: Fri, 29 Sep 2023 18:03:31 -0700
Subject: [PATCH] detailed gather output

---
 doc/classifying-signatures.md | 137 ++++++++++++++++++++++++++++++++++
 1 file changed, 137 insertions(+)

diff --git a/doc/classifying-signatures.md b/doc/classifying-signatures.md
index d30c36b3fa..089793c731 100644
--- a/doc/classifying-signatures.md
+++ b/doc/classifying-signatures.md
@@ -396,3 +396,140 @@ overlap     p_query p_match avg_abund
 The cause of the poor approximation for genome s10 is unclear; it
 could be due to low coverage, or small genome size, or something
 else. More investigation needed.
+
+## Appendix C: sourmash gather output examples
+
+Below we show two real gather analyses done with a mock metagenome,
+SRR606249 (from
+[Shakya et al., 2014](https://pubmed.ncbi.nlm.nih.gov/23387867/)) and
+three of the known genomes contained within it - two *Shewanella baltica*
+strains and one *Akkermansia muciniphila* genome
+
+### sourmash gather with a query containing abundance information
+
+```
+% sourmash gather -k 31 SRR606249.trim.sig.zip podar-ref/2.fa.sig podar-ref/47.fa.sig podar-ref/63.fa.sig
+
+== This is sourmash version 4.8.5.dev0. ==
+== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
+
+selecting specified query k=31
+loaded query: SRR606249... (k=31, DNA)
+--
+loaded 9 total signatures from 3 locations.
+after selecting signatures compatible with search, 3 remain.
+
+Starting prefetch sweep across databases.
+Prefetch found 3 signatures with overlap >= 50.0 kbp.
+Doing gather to generate minimum metagenome cover.
+
+overlap     p_query p_match avg_abund
+---------   ------- ------- ---------
+5.2 Mbp        0.8%   99.0%      11.7    NC_011663.1 Shewanella baltica OS223...
+2.7 Mbp        0.9%  100.0%      24.5    CP001071.1 Akkermansia muciniphila A...
+5.2 Mbp        0.3%   51.0%       8.1    NC_009665.1 Shewanella baltica OS185...
+found less than 50.0 kbp in common. => exiting
+
+found 3 matches total;
+the recovered matches hit 2.0% of the abundance-weighted query.
+the recovered matches hit 2.5% of the query k-mers (unweighted).
+```
+
+### sourmash gather with the same query, *ignoring* abundances
+
+```
+% sourmash gather -k 31 SRR606249.trim.sig.zip podar-ref/2.fa.sig podar-ref/47.fa.sig podar-ref/63.fa.sig --ignore-abundance
+
+== This is sourmash version 4.8.5.dev0. ==
+== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
+
+selecting specified query k=31
+loaded query: SRR606249... (k=31, DNA)
+--
+loaded 9 total signatures from 3 locations.
+after selecting signatures compatible with search, 3 remain.
+
+Starting prefetch sweep across databases.
+Prefetch found 3 signatures with overlap >= 50.0 kbp.
+Doing gather to generate minimum metagenome cover.
+
+overlap     p_query p_match
+---------   ------- -------
+5.2 Mbp        1.2%   99.0%    NC_011663.1 Shewanella baltica OS223, complete...
+2.7 Mbp        0.6%  100.0%    CP001071.1 Akkermansia muciniphila ATCC BAA-83...
+5.2 Mbp        0.6%   51.0%    NC_009665.1 Shewanella baltica OS185, complete...
+found less than 50.0 kbp in common. => exiting
+
+found 3 matches total;
+the recovered matches hit 2.5% of the query k-mers (unweighted).
+```
+
+### Notes and comparisons
+
+There are a few interesting things to point out about the above output:
+
+* `p_match` is the same whether or not abundance information is used.
+  This is because it is the fraction of the matching genome detected in
+  the metagenome, which is inherently "flat". It is also reported
+  progressively: only the portions of the metagenome that have not
+  matched to any previous matches are used in `p_match`; read on for
+  details :).
+* `p_query` is different when abundance information is used. For
+  queries with abundance information, `p_query` provides a weighted
+  estimate that approximates the number of metagenome reads that would
+  map to this genome _after_ mapping reads to all previously reported
+  matches, i.e. all matches above this match.
+* When abundance information is not available or
+  not used, `p_query` is an estimate of what fraction of the remaining k-mers
+  in the metagenome match to this genome, after all previous matches
+  have been removed.
+* The `avg_abund` column only shows up when abundance information is
+  supplied. This is the k-mer coverage of the detected portion of the
+  match; it is a lower bound on the expected mapping-based coverage
+  for metagenome reads mapped to the detected portion of the match.
+* The percent of recovered matches for the abundance-weighted query
+  is the sum of the `p_query` column and estimates the total fraction
+  of metagenome reads that will map across all of the matching references.
+* The percent of recovered matches when _ignoring_ abundances is likewise
+  the sum of the (unweighted) `p_query` column and is not particularly
+  informative - it will always be low for real metagenomes, because sourmash
+  cannot match erroneous k-mers created by sequencing errors.
+* The `overlap` column is the estimated size of the overlap between the
+  (unweighted) original query metagenome and the match. It does not take
+  into account previous matches.
+
+Last but not least, something interesting is going on here with strains.
+While it is not highlighted in the text output of gather, there is
+substantial overlap between the two *Shewanella baltica* genomes. And,
+in fact, both of them are entirely (well, 99%) present in the metagenome
+if measured individually with `sourmash search --containment`.
+
+Consider a few more details:
+
+* `p_match` for the first *Shewanella* match, `NC_011663.1`, is 99%!
+* `p_match` for the second *Shewanella* match, `NC_009665.1`, is only 50%!
+* and, confusingly, the `overlap` for both matches is 5.2 Mbp!
+
+What's up?!
+
+What's happening here is that `sourmash gather` is subtracting the match
+to the first *Shewanella* genome from the metagenome before moving on to
+the next result, and `p_match` reports only the amount of the match
+detected in the _remaining_ metagenome after that subtraction.
+However, `overlap` is reported as the amount of overlap with the
+_original_ metagenome, which is essentially the entire genome in all
+three cases.
+
+The main things to keep in mind for gather are this:
+
+* `p_query` and `p_match` do not double-count k-mers or matches; in particular, you can sum across `p_query` for a metagenome without
+  counting anything more than once.
+* `overlap` _does_ count matches redundantly.
+* the percent of recovered matches is a useful summary of the whole
+  metagenome!
+
+We know it's confusing but it's the best output we've been able to
+figure out across all of the different use cases for gather.  Perhaps
+in the future we'll find a better way to represent all of these
+numbers in a more clear, concise, and interpretable way; in the
+meantime, we welcome your questions and comments!