diff --git a/doc/api-example.md b/doc/api-example.md index 782ca86c07..0be78a4fc2 100644 --- a/doc/api-example.md +++ b/doc/api-example.md @@ -476,7 +476,7 @@ building the minhashes. If you're interested in building comparison matrices and dendrograms, please see the notebook -[Building plots from `sourmash compare` output](plotting-compare.md). +[Building plots from `sourmash compare` output](plotting-compare.ipynb). ## Saving and loading signature files diff --git a/doc/classifying-signatures.md b/doc/classifying-signatures.md index 756e5b4728..089793c731 100644 --- a/doc/classifying-signatures.md +++ b/doc/classifying-signatures.md @@ -293,7 +293,8 @@ match that is _unique_ with respect to the original query. It will always decrease as you get more matches. The sum of `f_unique_to_query` across all rows is what is reported in by gather as the fraction of query k-mers hit by the recovered matches -(unweighted) and should never be greater than 1! +(unweighted) and should never be greater than 1! This column should +be used in any analysis that needs to avoid double-counting matches. The second column, `f_match_orig`, is how much of the match is in the _original_ query. For this fictional metagenome, each match is @@ -395,3 +396,140 @@ overlap p_query p_match avg_abund The cause of the poor approximation for genome s10 is unclear; it could be due to low coverage, or small genome size, or something else. More investigation needed. + +## Appendix C: sourmash gather output examples + +Below we show two real gather analyses done with a mock metagenome, +SRR606249 (from +[Shakya et al., 2014](https://pubmed.ncbi.nlm.nih.gov/23387867/)) and +three of the known genomes contained within it - two *Shewanella baltica* +strains and one *Akkermansia muciniphila* genome + +### sourmash gather with a query containing abundance information + +``` +% sourmash gather -k 31 SRR606249.trim.sig.zip podar-ref/2.fa.sig podar-ref/47.fa.sig podar-ref/63.fa.sig + +== This is sourmash version 4.8.5.dev0. == +== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. == + +selecting specified query k=31 +loaded query: SRR606249... (k=31, DNA) +-- +loaded 9 total signatures from 3 locations. +after selecting signatures compatible with search, 3 remain. + +Starting prefetch sweep across databases. +Prefetch found 3 signatures with overlap >= 50.0 kbp. +Doing gather to generate minimum metagenome cover. + +overlap p_query p_match avg_abund +--------- ------- ------- --------- +5.2 Mbp 0.8% 99.0% 11.7 NC_011663.1 Shewanella baltica OS223... +2.7 Mbp 0.9% 100.0% 24.5 CP001071.1 Akkermansia muciniphila A... +5.2 Mbp 0.3% 51.0% 8.1 NC_009665.1 Shewanella baltica OS185... +found less than 50.0 kbp in common. => exiting + +found 3 matches total; +the recovered matches hit 2.0% of the abundance-weighted query. +the recovered matches hit 2.5% of the query k-mers (unweighted). +``` + +### sourmash gather with the same query, *ignoring* abundances + +``` +% sourmash gather -k 31 SRR606249.trim.sig.zip podar-ref/2.fa.sig podar-ref/47.fa.sig podar-ref/63.fa.sig --ignore-abundance + +== This is sourmash version 4.8.5.dev0. == +== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. == + +selecting specified query k=31 +loaded query: SRR606249... (k=31, DNA) +-- +loaded 9 total signatures from 3 locations. +after selecting signatures compatible with search, 3 remain. + +Starting prefetch sweep across databases. +Prefetch found 3 signatures with overlap >= 50.0 kbp. +Doing gather to generate minimum metagenome cover. + +overlap p_query p_match +--------- ------- ------- +5.2 Mbp 1.2% 99.0% NC_011663.1 Shewanella baltica OS223, complete... +2.7 Mbp 0.6% 100.0% CP001071.1 Akkermansia muciniphila ATCC BAA-83... +5.2 Mbp 0.6% 51.0% NC_009665.1 Shewanella baltica OS185, complete... +found less than 50.0 kbp in common. => exiting + +found 3 matches total; +the recovered matches hit 2.5% of the query k-mers (unweighted). +``` + +### Notes and comparisons + +There are a few interesting things to point out about the above output: + +* `p_match` is the same whether or not abundance information is used. + This is because it is the fraction of the matching genome detected in + the metagenome, which is inherently "flat". It is also reported + progressively: only the portions of the metagenome that have not + matched to any previous matches are used in `p_match`; read on for + details :). +* `p_query` is different when abundance information is used. For + queries with abundance information, `p_query` provides a weighted + estimate that approximates the number of metagenome reads that would + map to this genome _after_ mapping reads to all previously reported + matches, i.e. all matches above this match. +* When abundance information is not available or + not used, `p_query` is an estimate of what fraction of the remaining k-mers + in the metagenome match to this genome, after all previous matches + have been removed. +* The `avg_abund` column only shows up when abundance information is + supplied. This is the k-mer coverage of the detected portion of the + match; it is a lower bound on the expected mapping-based coverage + for metagenome reads mapped to the detected portion of the match. +* The percent of recovered matches for the abundance-weighted query + is the sum of the `p_query` column and estimates the total fraction + of metagenome reads that will map across all of the matching references. +* The percent of recovered matches when _ignoring_ abundances is likewise + the sum of the (unweighted) `p_query` column and is not particularly + informative - it will always be low for real metagenomes, because sourmash + cannot match erroneous k-mers created by sequencing errors. +* The `overlap` column is the estimated size of the overlap between the + (unweighted) original query metagenome and the match. It does not take + into account previous matches. + +Last but not least, something interesting is going on here with strains. +While it is not highlighted in the text output of gather, there is +substantial overlap between the two *Shewanella baltica* genomes. And, +in fact, both of them are entirely (well, 99%) present in the metagenome +if measured individually with `sourmash search --containment`. + +Consider a few more details: + +* `p_match` for the first *Shewanella* match, `NC_011663.1`, is 99%! +* `p_match` for the second *Shewanella* match, `NC_009665.1`, is only 50%! +* and, confusingly, the `overlap` for both matches is 5.2 Mbp! + +What's up?! + +What's happening here is that `sourmash gather` is subtracting the match +to the first *Shewanella* genome from the metagenome before moving on to +the next result, and `p_match` reports only the amount of the match +detected in the _remaining_ metagenome after that subtraction. +However, `overlap` is reported as the amount of overlap with the +_original_ metagenome, which is essentially the entire genome in all +three cases. + +The main things to keep in mind for gather are this: + +* `p_query` and `p_match` do not double-count k-mers or matches; in particular, you can sum across `p_query` for a metagenome without + counting anything more than once. +* `overlap` _does_ count matches redundantly. +* the percent of recovered matches is a useful summary of the whole + metagenome! + +We know it's confusing but it's the best output we've been able to +figure out across all of the different use cases for gather. Perhaps +in the future we'll find a better way to represent all of these +numbers in a more clear, concise, and interpretable way; in the +meantime, we welcome your questions and comments! diff --git a/doc/command-line.md b/doc/command-line.md index 8b3b43780d..b93a6c7c10 100644 --- a/doc/command-line.md +++ b/doc/command-line.md @@ -361,11 +361,40 @@ overlap p_query p_match 0.9 Mbp 7.4% 11.8% BA000019.2 Nostoc sp. PCC 7120 DNA, c... 0.7 Mbp 5.9% 23.0% FOVK01000036.1 Proteiniclasticum rumi... 0.7 Mbp 5.3% 17.6% AE017285.1 Desulfovibrio vulgaris sub... -``` +... +found less than 50.0 kbp in common. => exiting + +found 64 matches total; +the recovered matches hit 94.0% of the abundance-weighted query. +the recovered matches hit 45.6% of the query k-mers (unweighted). +``` + +For each match, +* 'overlap', the first column, is the estimated number of k-mers shared between the match and the query. +* 'p_query' is the _percentage_ of the query that overlaps with the match; it is the amount of the metagenome "explained" by this match. +* 'p_match' is the percentage of the _match_ that overlaps with the query; it is the "detection" of the match in the metagenome. + +Quite a bit more information per match row is available in the CSV +output saved with `-o`; for details, see +[Classifying signatures: how sourmash gather works](classifying-signatures.md#appendix-a-how-sourmash-gather-works). + +The "recovered matches" lines detail how much of the query is +explained by the entire collection of matches. You will get two numbers if +your metagenome sketch has been calculated with `-p abund`, and only +one if it does not have abundances. The abundance-weighted +number should approximate the fraction of metagenome reads that will +map to at least one reference genome, while the unweighted number +describes how much of the metagenome itself matches to genomes. +Here's another way to put it: if the metagenome could be perfectly +assembled into contigs, the unweighted number would approximate the +number of bases from the contigs that would match perfectly to at +least one genome in the reference database. More practically, +the abundance-weighted number is less sensitive to sequencing errors. +See [classifying signatures](classifying-signatures.md#abundance-weighting) or [the FAQ](faq.md) for more information! The command line option `--threshold-bp` sets the threshold below which matches are no longer reported; by default, this is set to -50kb. see the Appendix in +50kb. See the Appendix in [Classifying Signatures](classifying-signatures.md) for details. As of sourmash 4.2.0, `gather` supports `--picklist`, to @@ -2033,6 +2062,13 @@ All of these save formats can be loaded by sourmash commands. **We strongly suggest using .zip files to store signatures: they are fast, small, and fully supported by all the sourmash commands.** +Note that when outputting large collections of signatures, some save +formats require holding all the sketches in memory until they can be +written out, and others can save progressively. This can affect memory +usage! Currently `.sig` and `.sig.gz` formats are held in memory, +while `.zip`, directory outputs, and `.sqldb` formats write progressively +to disk. + For more detailed information on database formats and performance tradeoffs, please see [the advanced usage information for databases!](databases-advanced.md) diff --git a/doc/conf.py b/doc/conf.py index 23331efe08..fdd819b93a 100644 --- a/doc/conf.py +++ b/doc/conf.py @@ -59,8 +59,8 @@ # General information about the project. project = 'sourmash' -copyright = '2016-2020, C. Titus Brown and Luiz Irber' -author = 'C. Titus Brown and Luiz Irber' +copyright = '2016-2023, C. Titus Brown, Luiz Irber, and N. Tessa Pierce-Ward' +author = 'C. Titus Brown, Luiz Irber, and N. Tessa Pierce-Ward' # The version info for the project you're documenting, acts as replacement for # |version| and |release|, also used in various other places throughout the @@ -114,6 +114,9 @@ # If true, `todo` and `todoList` produce output, else they produce nothing. todo_include_todos = False +# CTB: suppress warnings about circularity in ToC. +# see https://github.com/sphinx-doc/sphinx/issues/7410. +suppress_warnings = ['toc.circular'] # -- Options for HTML output ---------------------------------------------- @@ -127,7 +130,8 @@ html_theme_options = { 'logo': 'logo.png', 'logo_name': True, - 'description': 'Quickly search, compare, and analyze genomic and metagenomic data sets' + 'description': 'Quickly search, compare, and analyze genomic and metagenomic data sets', + 'sidebar_collapse': False, } # Add any paths that contain custom themes here, relative to this directory. diff --git a/doc/faq.md b/doc/faq.md new file mode 100644 index 0000000000..88af9dc82c --- /dev/null +++ b/doc/faq.md @@ -0,0 +1,247 @@ +# Frequently Asked Questions + +```{contents} Contents +:depth: 3 +``` + +## How is sourmash different from mash? + +[mash](https://mash.readthedocs.org/) is an awesome piece of software +that inspired sourmash - hence the name we chose for sourmash! But are +they the same? No, they are not! + +mash is based on +[MinHash sketching](https://en.wikipedia.org/wiki/MinHash), a +technique for randomly selecting a very small set of k-mers to +represent a potentially very large sequence. MinHash sketches can be +used to estimate Jaccard similarity (a distance metric that lets you +find closely related sequences) with very high accuracy, under one +condition - that the two sequences be of similar size. + +So, MinHash sketching is great when comparing bacterial genomes, which are +all around 2-10 MB in size. + +But... MinHash sketching doesn't work when comparing _metagenomes_ to +genomes, because metagenomes are usually _much_ larger than genomes. + +So we built sourmash around a different kind of sketch - FracMinHash - +which can estimate Jaccard similarity between sets of very different +sizes. FracMinHash sketches also support overlap and containment +analysis, and are convenient in a variety of other ways - see our +paper on FracMinHash, +[Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2). + +There are some drawbacks to FracMinHash sketches - read on! + +## What are the drawbacks to FracMinHash and sourmash? + +There are two drawbacks to FracMinHash relative to MinHash. Neither +one is a showstopper, but you should know about them! + +One drawback is that FracMinHash sketches can be a _lot_ bigger than +MinHash sketches - technically speaking, MinHash sketches are bounded +in size (you pick the number of hashes to keep!) while FracMinHash +sketches can get arbitrarily large. In practice this means that when +you're sketching large metagenomes, you end up with large sketches. +How large depends on the _cardinality_ of the metagenome k-mers - if a +metagenome has 1 billion unique k-mers, a FracMinHash sketch with scaled +parameter of 1000 will contain a million hashes. + +The other drawback is that FracMinHash sketches _don't work well_ for +very small sequences. Our default parameter choice for DNA +(scaled=1000) works well for finding 10 kb or larger matches between +sequences - some simple Poisson matching math suggests that about +99.98% of 5kb overlaps will be found with scaled=1000. + +## How can I better understand FracMinHash and sourmash intuitively? + +Please see [the k-mers and minhash tutorial](kmers-and-minhash.ipynb). + +## What papers should I read to better understand the FracMinHash approach used by sourmash? + +I would suggest reading these four papers, in order: + +[Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2), +Irber et al., 2022. This is the fullest technical description of FracMinHash available. + +[Mash: fast genome and metagenome distance estimation using MinHash](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x), +Ondov et al., 2016. This is the original paper that inspired +sourmash. It discusses sketching with MinHash and does a great job of +showing how well Jaccard estimation works for comparing genomes! A +good contrasting point to take into account is that _MinHash cannot do +overlap or containment estimation_, which nicely motivates the +previous paper and the next two. + +[Mash Screen: high-throughput sequence containment estimation for genome discovery](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1841-x), +Ondov et al., 2019; and +[CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices](https://academic.oup.com/bioinformatics/article/38/Supplement_1/i28/6617499). Both +papers discusses containment and metagenome analysis extensively, and +use an approach that can be usefully contrasted with sourmash. There +is a nice blog post on +[mash screen](https://genomeinformatics.github.io/mash-screen/) that +is worth reading, too! + +If you want a nice chaser, please see this section at the end of the +blog post above: + +>It would be great to see additional methods developed to process +containment scores, reduce the output redundancy, and report accurate +compositional estimates for metagenomes. One easy approach is a +“winner take all” model, like sourmash implements. + +## What k-mer size(s) should I use with sourmash? + +The short answer is: for DNA, use k=31. + +Slightly longer answer: when we look at the k-mer distribution +across all of the bacterial genomes in GTDB, we find that 99% (or +more) of 31-mers are _genome_, _species_, or _genus_ specific. + +If you go lower (say, k=21), then you get a few percent of k-mers +that match above the genus level - family or above. + +If you go higher (k=51), a higher percentage of k-mers are genome-specific. + +For the core sourmash operations - search, gather, and compare - we +believe (with evidence!) that (a) the differences between k=21, k=31, +and k=51 are negligible; and that (b) k=31 works fine for most +day-to-day use of sourmash. + +We also provide [Genbank and GTDB databases](databases.md) for k=21, +k=31, and k=51. + +For some background on k-mer specificity, we recommend this paper: +[MetaPalette: a k-mer Painting Approach for Metagenomic Taxonomic Profiling and Quantification of Novel Strain Variation](https://journals.asm.org/doi/10.1128/msystems.00020-16), +Koslicki & Falush, 2016. + +## How do k-mer-based analyses compare with read mapping? + +tl;dr very well! But it's a bit one sided: if k-mers match, reads will +map, but not necessarily vice versa. So read mapping rates are almost always +higher than k-mer matching rates. + +### Mapping reads to reference vs k-mer "detection" or containment + +Let's start by looking at a simple example: suppose you have Illumina +shotgun sequencing of a new isolate, and you want to compare it to a +reference genome for a member of the same species. You calculate +k-mer containment of the reference genome in the isolate shotgun +sequence to be 65%, using e.g. `sourmash search --containment +ref.sig.gz isolate.sig.gz`. You then map the reads from the isolate +data to the reference genome. What should you expect to see? + +What you should see is that 65% or more of the reference genome is +covered by at least one read. This is known as the mapping-based +"detection" of the reference genome in the read data set, and k-mer +detection typically *underestimates* mapping-based detection. + +(If you want to know _how many of the reads will map_, you need to use +a different number that is output by `sourmash gather` - although, for +single genomes with only a few repeats, the percentage of reads that +map should match the k-mer detection number you calculate with +containment. Read the section below on metagenome analysis for more +information on read mapping rates!) + +### Why do k-mers underestimate read mapping? + +K-mers underestimate read mapping because k-mers rely on exact matches, +while read mapping tolerates mismatches. So unless you are looking at +entirely error free data with no strain variation, mismatches will +sometimes prevent k-mers from matching. + +For some math: consider a k-mer of size 21. It will only match to a +specific 21 base sequence exactly. If you are matching two isolate genomes +that align completely but have a 95% average nucleotide identity, then +one out of every 20 bases will be different (on average), and, on average, +every 21-mer will be different! (In practice you'd expect about 1/3 of the +k-mers to match if the variation is random, since there will be many 21-base +windows that don't have any mismatches.) However, alignment-based approaches +like read mapping will happily align across single-base mismatches, and so +even at 95% ANI you would expect many reads to align. + +In practice, the numbers will differ, but the intuition remains! + +### How do read mapping rates for metagenomes compare with k-mer statistics? + +Shotgun metagenome sequencing adds several complicating factors when +comparing metagenome reads to a reference database. First, each genome +will generally be present at a different abundance in the metagenome. +Second, metagenomes rarely contain the exact strain genome present in +a reference database; often the genomes in the metagenome differ both +in terms of SNPs _and_ accessory elements from what's in the reference +database. And third, metagenomes often contain complicated mixtures of +strains - even if one strain is dominant. + +The `sourmash gather` method is built for analyzing metagenomes against +reference databases, and it does so by finding the shortest list of +reference genomes that "cover" all of the k-mers in the metagenome. +This list is arranged by how many k-mers from the metagenome are covered +by that entry in the output: the first match is the biggest, and the +second match is the second biggest, and so on. We call this a "minimum +metagenome cover" and it is described in the Irber et al., 2022, paper below. + +When we construct the minimum metagenome cover, it correlates well with +mapping (per Irber et al., 2022), with one caveat: you need to assign +reads to the reference genomes in the rank order output by gather. +This is needed to properly assign reads that map to multiple genomes +to the "best" match - the one that will capture the most reads. + +`sourmash gather` is still a research method but it seems to work +pretty well - in particular, it is both highly sensitive _and_ highly +specific in taxonomic benchmarking. Please ask questions as you have them! + +### Further reading and links on k-mers and mapping: + +* The paper + [Biogeographic Distribution of Five Antarctic Cyanobacteria Using Large-Scale k-mer Searching with sourmash branchwater](https://www.biorxiv.org/content/10.1101/2022.10.27.514113v1), + Lumian et al., 2022, shows that k-mer detection almost always + underestimates mapping, and k-mer abundance analysis is always more + conservative than mapping-based analyses. + +* The paper + [Deriving confidence intervals for mutation rates across a wide range of evolutionary distances using FracMinHash](https://genome.cshlp.org/content/33/7/1061), + Rahman Hera et al., 2023, shows how to translate between average + nucleotide identity (ANI) and k-mer statistics. + +* The paper +[Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2), +Irber et al., 2022, describes how `sourmash gather` assigns k-mers +from metagenomes to a set of reference genomes, and shows that read +mapping correlates pretty well with k-mer overlap. Note that it is +focused on *systematic decomposition of metagenomes against reference +databases*, so it tackles the question of analyzing an entire +metagenome against all available references, not just a single +matching genome. + +## Can I use sourmash to determine the best reference genome for mapping my reads? + +Yes! (And see the FAQ above, +[How do k-mer analyses compare with read mapping?](#how-do-k-mer-based-analyses-compare-with-read-mapping)) + +If you're interested in picking a single best reference genome (from a +large database) for read mapping, you can do the following: + +* sketch all your reference genomes with `sourmash sketch dna`, and/or download one of our [prepared databases](databases.md). You can use default parameters for `sourmash sketch`. +* sketch your read collection with `sourmash sketch dna`. +* run `sourmash prefetch reads.sig -o matches.csv` +* sort `matches.csv` on the `f_match_query` column and pick the highest value - this is the k-mer detection - and pick the match name from the `match_name` column; +* use that reference genome as your mapping target. + +If you want to map a metagenome to _multiple_ references, consider +using `sourmash gather` and/or +[the genome-grist workflow](https://dib-lab.github.io/genome-grist/). + +(This is also known as "read recruitment.") + +## How do I get the sequences for a particular reference genome from a metagenome, using sourmash? + +If sourmash reports that a particular strain or genome is present in a +metagenome, how do you retrieve the reads using sourmash? + +The short answer is: you have to use a different tool. You can do +read mapping between the metagenome and the relevant reference genome +(which can be automated with +[the genome-grist workflow](https://dib-lab.github.io/genome-grist/); +or, if you are interested in retrieving accessory elements, you can +try out +[spacegraphcats](https://spacegraphcats.github.io/spacegraphcats/02-spacegraphcats-use-cases/). diff --git a/doc/funding.md b/doc/funding.md new file mode 100644 index 0000000000..9a717042b7 --- /dev/null +++ b/doc/funding.md @@ -0,0 +1,9 @@ +# Funding + +Sourmash development was initiated under award GBMF4551 from the Gordon +and Betty Moore Foundation to C. Titus Brown. + +Further development was sponsored by: + +* [NSF DBI-2018911](https://www.nsf.gov/awardsearch/showAward?AWD_ID=2018911&HistoricalAwards=false), BBSRC-NSF/BIO:Collaborative Research: genomeRxiv: a microbial whole-genome database and diagnostic marker design resource for classification, identification, and data sharing. +* [NIH 1R03OD030596-01](https://reporter.nih.gov/search/aktj48mYz0ObXo62p_Tx7Q/project-details/10112077), Large-scale annotation-free disease correlation analysis of the iHMP. diff --git a/doc/index.md b/doc/index.md index 4e7a9927e3..2adbf317e9 100644 --- a/doc/index.md +++ b/doc/index.md @@ -1,191 +1,133 @@ # Welcome to sourmash! -sourmash is a command-line tool and Python library for computing -[hash sketches](https://en.wikipedia.org/wiki/MinHash) from DNA -sequences, comparing them to each other, and plotting the results. -This allows you to estimate sequence similarity between even very -large data sets quickly and accurately. +sourmash is a command-line tool and Python/Rust library for +**metagenome analysis** and **genome comparison** using k-mers. It +supports the compositional analysis of metagenomes, rapid search of +large sequence databases, and flexible taxonomic profiling with both +NCBI and GTDB taxonomies +([see our prepared databases for more information](databases.md)). sourmash +works well with sequences 30kb or larger, including bacterial and +viral genomes. -sourmash can be used to quickly search large databases of genomes -for matches to query genomes and metagenomes; see [our list of -available databases](databases.md). +You might try sourmash if you want to - -sourmash also includes k-mer based taxonomic exploration and -classification routines for genome and metagenome analysis. These -routines can use the NCBI and GTDB taxonomies but do not depend on them -specifically. +* identify which reference genomes to map your metagenomic reads to +* search all Genbank microbial genomes with a sequence query +* cluster many genomes by similarity +* taxonomically classify genomes or metagenomes against NCBI and/or GTDB +* search thousands of metagenomes with a query genome or sequence -We have [several tutorials](tutorials.md) available! Start with -[Making signatures, comparing, and searching](tutorial-basic.md). +Our **vision**: sourmash strives to support biologists in analyzing +modern sequencing data at high resolution and with full context, +including all public reference genomes and metagenomes. -The paper [Large-scale sequence comparisons with sourmash (Pierce et al., 2019)](https://f1000research.com/articles/8-1006) -gives an overview of how sourmash works and what its major use cases are. -Please also see the `mash` [software](http://mash.readthedocs.io/en/latest/) and -[paper (Ondov et al., 2016)](http://dx.doi.org/10.1186/s13059-016-0997-x) for -background information on how and why MinHash works. +## Mission statement -**Questions? Thoughts?** Ask us on the [sourmash issue tracker](https://github.com/sourmash-bio/sourmash/issues/)! +This project's mission is to provide practical tools and approaches for +analyzing extremely large sequencing data sets, with an emphasis on +high resolution results. Our designs follow these guiding principles: -**Want to migrate to sourmash v4?** sourmash v4 is now available, and -has a number of incompatibilites with v2 and v3. Please see -[our migration guide](support.md#migrating-from-sourmash-v3x-to-sourmash-v4x)! +* genomic and metagenomic analyses should be able to make use of all + available reference genomes. +* metagenomic analyses should support assembly independent approaches, + to avoid biases stemming from low coverage or high strain + variability. +* private and public databases should be equally well supported. +* a variety of data structures and algorithms are necessary to support + a wide set of use cases, including efficient command-line analysis, + real-time queries, and massive-scale batch analyses. +* our tools should be well behaved members of the bioinformatics + analysis tool ecosystem, and use common installation approaches, + standard formats, and semantic versioning. +* our tools should be robustly tested, well documented, and supported. +* we discuss scientific and computational tradeoffs and make specific + recommendations where possible, admitting uncertainty as needed. ----- +## How does sourmash work? -To use sourmash, you must be comfortable with the UNIX command line; -programmers may find the [Python library and API](api.md) useful as well. +Underneath, sourmash uses [FracMinHash sketches](https://www.biorxiv.org/content/10.1101/2022.01.11.475838) for fast and +lightweight sequence comparison; FracMinHash builds on +[MinHash sketching](https://en.wikipedia.org/wiki/MinHash) to support both Jaccard similarity +_and_ containment analyses with k-mers. This significantly expands +the range of operations that can be done quickly and in low +memory. sourmash also implements a number of new and powerful techniques +for analysis, including minimum metagenome covers and alignment-free ANI +estimation. -If you use sourmash, please cite us! +sourmash is inspired by [mash](https://mash.readthedocs.io), and +supports most mash analyses. sourmash also implements an expanded set +of functionality for metagenome and taxonomic analysis. -> Brown and Irber (2016), -> **sourmash: a library for MinHash sketching of DNA**. -> Journal of Open Source Software, 1(5), 27, [doi:10.21105/joss.00027](https://joss.theoj.org/papers/3d793c6e7db683bee7c03377a4a7f3c9) +sourmash development was initiated with a grant from the Moore +Foundation under the Data Driven Discovery program, and has been +supported by further funding from the NIH and NSF. Please see +[funding acknowledgements](funding.md) for details! -## sourmash in brief +## Using sourmash -sourmash uses MinHash-style sketching to create "signatures", compressed -representations of DNA/RNA sequence. These signatures can then -be stored, searched, explored, and taxonomically annotated. + -* `sourmash` provides command line utilities for creating, comparing, - and searching signatures, as well as plotting and clustering - signatures by similarity (see [the command-line docs](command-line.md)). +### Tutorials and examples -* `sourmash` can **search very large collections of signatures** to find matches - to a query. +These tutorials are command line tutorials that should work on Mac OS +X and Linux. They require about 5 GB of disk space and 5 GB of RAM. -* `sourmash` can also **identify parts of metagenomes that match known genomes**, - and can **taxonomically classify genomes and metagenomes** against databases - of known species. +* [The first sourmash tutorial - making signatures, comparing, and searching](tutorial-basic.md) -* `sourmash` can be used to **search databases of public sequences** - (e.g. all of GenBank) and can also be used to create and search databases - of **private sequencing data**. +* [Using sourmash LCA to do taxonomic classification](tutorials-lca.md) -* `sourmash` supports saving, loading, and communication of signatures - via [JSON](http://www.json.org/), a ~human-readable and editable format. +* [Analyzing the genomic and taxonomic composition of an environmental genome using GTDB and sample-specific MAGs with sourmash](tutorial-lemonade.md) -* `sourmash` also has a simple Python API for interacting with signatures, - including support for online updating and querying of signatures - (see [the API docs](api.md)). +* [Some sourmash command line examples!](sourmash-examples.ipynb) -* `sourmash` relies on an underlying Rust core for performance. +### How-To Guides -* `sourmash` is developed [on GitHub](https://github.com/sourmash-bio/sourmash) - and is **freely and openly available** under the BSD 3-clause license. - Please see [the README](https://github.com/sourmash-bio/sourmash/blob/latest/README.md) - for more information on development, support, and contributing. +* Installing sourmash -You can take a look at sourmash analyses on real data -[in a saved Jupyter notebook](https://github.com/sourmash-bio/sourmash/blob/latest/doc/sourmash-examples.ipynb), -and experiment with it yourself -[interactively in a Jupyter Notebook](https://mybinder.org/v2/gh/sourmash-bio/sourmash/latest?labpath=doc%2Fsourmash-examples.ipynb) -at [mybinder.org](http://mybinder.org). +* [Classifying genome sketches](classifying-signatures.md) -## Installing sourmash +* [Working with private collections of genome sketches](sourmash-collections.ipynb) -You can use pip: -```bash -$ pip install sourmash -``` - -or conda: -```bash -$ conda install -c conda-forge -c bioconda sourmash -``` - -Please see [the README file in github.com/sourmash-bio/sourmash](https://github.com/sourmash-bio/sourmash/blob/latest/README.md) -for more information. - -## Memory and speed - -sourmash has relatively small disk and memory requirements compared to -many other software programs used for genome search and taxonomic -classification. - -`sourmash search` and `sourmash gather` can be used to search 100k -genbank microbial genomes ([using our prepared databases](databases.md)) -with about 20 GB of disk and in under 1 GB of RAM. -Typically a search for a single genome takes about 30 seconds on a laptop. +* [Using the `LCA_Database` API](using-LCA-database-API.ipynb) -`sourmash lca` can be used to search/classify against all genbank -microbial genomes with about 200 MB of disk space and about 10 GB of -RAM. Typically a metagenome classification takes about 1 minute on a -laptop. +* [Building plots from `sourmash compare` output](plotting-compare.ipynb). -## sourmash versioning +* [A short guide to using sourmash output with R](other-languages.md). -We support the use of sourmash in pipelines and applications -by communicating clearly about bug fixes, feature additions, and feature -changes. We use version numbers as follows: +### How sourmash works under the hood -* Major releases, like v4.0.0, may break backwards compatibility at - the command line as well as top-level Python/Rust APIs. -* Minor releases, like v4.1.0, will remain backwards compatible but - may introduce significant new features. -* Patch releases, like v4.1.1, are for minor bug fixes; full backwards - compatibility is retained. +* [An introduction to k-mers for genome comparison and analysis](kmers-and-minhash.ipynb) +* [Support, versioning, and migration between versions](support.md) -If you are relying on sourmash in a pipeline or application, we -suggest specifying your version requirements at the major release, -e.g. in conda you would specify `sourmash>=3,<4`. +### Reference material -See [the Versioning docs](support.md) for more information on what our -versioning policy means in detail, and how to migrate between major -versions! +* [Full table of contents for all docs](toc.md) +* [UNIX command-line documentation](command-line.md) +* [Genbank and GTDB databases and taxonomy files](databases.md) +* [Python examples using the API](api-example.md) +* [Publications about sourmash](publications.md) +* [A guide to the internal design and structure of sourmash](sourmash-internals.md) +* [Funding acknowledgements](funding.md) -## Limitations +## Developing and extending sourmash -**sourmash cannot find matches across large evolutionary distances.** +* [Releasing a new version of sourmash](release.md) -sourmash seems to work well to search and compare data sets for -nucleotide matches at the species and genus level, but does not have much -sensitivity beyond that. (It seems to be particularly good at -strain-level analysis.) You should use protein-based analyses -to do searches across larger evolutionary distances. + ```{toctree} --- -maxdepth: 2 +maxdepth: 1 +hidden: true --- +sidebar command-line -tutorials -using-sourmash-a-guide -classifying-signatures +api-example databases -api -more-info -support -developer ``` - -# Indices and tables - -* {ref}`genindex` -* {ref}`modindex` -* {ref}`search` diff --git a/doc/publications.md b/doc/publications.md new file mode 100644 index 0000000000..44e8e9c4f9 --- /dev/null +++ b/doc/publications.md @@ -0,0 +1,53 @@ +# Publications + +Below are publications from the sourmash team. + +## sourmash fundamentals + +[Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2), Irber et al., 2022. This is the core technical paper describing both FracMinHash and `sourmash gather`. + +[Large-scale sequence comparisons with sourmash](https://f1000research.com/articles/8-1006), +Pierce et al., 2019. This is the original sourmash use case paper. + +## Evaluation and benchmarking + +[Evaluation of taxonomic classification and profiling methods for long-read shotgun metagenomic sequencing datasets](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05103-0), +Portik et al., 2022. This paper shows that sourmash is extremely +sensitive and very specific for taxonomic classification. + +## Petabase-scale search + +[Biogeographic Distribution of Five Antarctic Cyanobacteria Using Large-Scale k-mer Searching with sourmash branchwater](https://www.biorxiv.org/content/10.1101/2022.10.27.514113v1), +Lumian et al., 2022. This paper uses sourmash and branchwater to +search ~500,000 public metagenomes for 5 query genomes, validates the +results using mapping, and discusses the biogeography of the query +species. + +[Sourmash Branchwater Enables Lightweight Petabyte-Scale Sequence Search](https://www.biorxiv.org/content/10.1101/2022.11.02.514947v1), +Irber et al., 2022. This paper describes the technical underpinnings +of the first version of sourmash branchwater, for petabase-scale search. + +## Advanced uses of sourmash + +[Single-cell transcriptomics for the 99.9% of species without reference genomes](https://www.biorxiv.org/content/10.1101/2021.07.09.450799v1.abstract), +Botvinnik et al., 2021. This paper uses sourmash (and many other +techniques!) to analyze single cell data from the Chinese horseshoe +bat. + +[Meta-analysis of metagenomes via machine learning and assembly graphs reveals strain switches in Crohn’s disease](https://www.biorxiv.org/content/10.1101/2022.06.30.498290v1.abstract), +Reiter et al., 2022. This paper uses sourmash and +[spacegraphcats](https://spacegraphcats.github.io/spacegraphcats/) to +detect and analyze strain-specific signals in fecal microbiomes from +the iHMP. + +[Protein k-mers enable assembly-free microbial metapangenomics](https://www.biorxiv.org/content/10.1101/2022.06.27.497795v1), +Reiter et al., 2022. This paper develops a technique to use protein +k-mers to analyze metapangenome graphs from metagenomes. + +## Additional works + +Dr. Luiz Irber's PhD thesis, +[Decentralizing Indices for Genomic Data](https://github.com/luizirber/phd/releases), +describes several additional features of the sourmash ecosystem, +including wort, which monitors the SRA for new data sets and sketches +them automatically. diff --git a/doc/sidebar.md b/doc/sidebar.md new file mode 100644 index 0000000000..b0d5e6e22d --- /dev/null +++ b/doc/sidebar.md @@ -0,0 +1,46 @@ +## Tutorials and examples + +These tutorials are command line tutorials that should work on Mac OS +X and Linux. They require about 5 GB of disk space and 5 GB of RAM. + +* [The first sourmash tutorial - making signatures, comparing, and searching](tutorial-basic.md) + +* [Using sourmash LCA to do taxonomic classification](tutorials-lca.md) + +* [Analyzing the genomic and taxonomic composition of an environmental genome using GTDB and sample-specific MAGs with sourmash](tutorial-lemonade.md) + +* [Some sourmash command line examples!](sourmash-examples.ipynb) + +## How-To Guides + +* Installing sourmash + +* [Classifying genome sketches](classifying-signatures.md) + +* [Working with private collections of genome sketches](sourmash-collections.ipynb) + +* [Using the `LCA_Database` API](using-LCA-database-API.ipynb) + +* [Building plots from `sourmash compare` output](plotting-compare.ipynb). + +* [A short guide to using sourmash output with R](other-languages.md). + +## How sourmash works under the hood + +* [An introduction to k-mers for genome comparison and analysis](kmers-and-minhash.ipynb) +* [Support, versioning, and migration between versions](support.md) + +## Reference material + +* [UNIX command-line documentation](command-line.md) +* [Genbank and GTDB databases and taxonomy files](databases.md) +* [Python examples using the API](api-example.md) +* [Publications about sourmash](publications.md) +* [A guide to the internal design and structure of sourmash](sourmash-internals.md) +* [Funding acknowledgements](funding.md) + +## Developing and extending sourmash + +* [Releasing a new version of sourmash](release.md) + +## [Full table of contents for all docs](toc.md) diff --git a/doc/sourmash-internals.md b/doc/sourmash-internals.md new file mode 100644 index 0000000000..a615844290 --- /dev/null +++ b/doc/sourmash-internals.md @@ -0,0 +1,772 @@ +# A guide to the internal design and structure of sourmash + +```{contents} Contents +:depth: 3 +``` + +sourmash was created in 2015, and has been repeatedly reorganized, +refactored, and optimized to support ever larger databases, faster +queries, and new use cases. We've also regularly added new +functionality and features. So sourmash can be pretty complicated +internally, and our user-facing documentation only covers a fraction +of its potential! + +This document is a brain dump intended for expert users and sourmash +developers who want to understand how, why, and when to use various +sourmash features. It is unlikely ever to be comprehensive, so the +information you are interested in may not yet exist in this document, +but we are always happy to add to it - +[just ask in an issue!](https://github.com/sourmash-bio/sourmash/issues) + +## Signatures and sketches + +sourmash operates on sketches. Each sketch is a collection of hashes, +which are in turn built from k-mers by applying a hash function +(currently always murmurhash) and a filtering function. Each sketch +is contained in a signature wrapper that contains some metadata. + +Internally, sketches (class `MinHash`) contain the following information: +* a set of hashes; +* an optional abundance for each hash (when `track_abund` is True); +* a seed; +* a k-mer size; +* a molecule type; +* either a `num` (for MinHash) or a `scaled` value (for FracMinHash); + +Signature objects (class `SourmashSignature`) contain a sketch (property `.minhash`) as well as additional information: +* an optional `name` +* an optional `filename` +* a license (currently must be CC0); +* an `md5sum(...)` method that returns a hash of the sketch. + +For now, we tend to refer to signatures and sketches interchangeably, +because they are almost entirely 1:1 in the code base (but see [sourmash#616](https://github.com/sourmash-bio/sourmash/issues/616)). + +The default signature interchange/serialization format is JSON, optionally +gzipped. This format is written and read by Rust code. + +In general, a lot of effort in sourmash is spent managing collections of +signatures _before_ actually doing comparisons with them; see manifests, +and `Index` objects, below. + +### Making sketches + +Sketches are produced by hashing k-mers with murmurhash and then +keeping either the lowest `num` hashes (for MinHashes sketches) or +keeping all hashes below `2**64 / scaled` (for FracMinHash sketches). +This has the effect of selecting approximately one hash for every +`scaled` k-mers - so, when sketching a set of 100,000 distinct k-mers, +a scaled value of 1,000 would yield approximately 100 hashes to be +retained in the sketch. + +The default MinHash sketches use parameters so that they are +compatible with mash sketches. + +See [utils/compute-dna-mh-another-way.py](https://github.com/sourmash-bio/sourmash/blob/latest/utils/compute-dna-mh-another-way.py) for details on how k-mers are +hashed. + +Note that if hashes are produced some other way (with a different hash +function) or from some source other than DNA, sourmash can still work +with them; only `sourmash sketch` actually cares about DNA sequences, +everything else works with hashes. + +### Compatibility checking + +The point of the signatures and sketches is to enable certain kinds of +rapid comparisons - Jaccard similarity and number of overlapping k-mers, +specifically. However, these comparisons can only be done between +compatible sketches. + +Here, "compatible" means - +* the same MurmurHash seed (default 42); +* the same k-mer size/ksize (see k-mer sizes, below); +* the same molecule type (see molecule types, below); +* the same `num` or `scaled` (although see [this downsampling discussion](api-example.md#downsampling-signatures), and the next two sections); + +sourmash uses selectors (`Index.select(...)`) to select sketches with +compatible ksizes, molecule types, and sketch types. + +### Scaled (FracMinHash) sketches support similarity and containment + +Per our discussion in [Irber et al., 2022](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2), FracMinHash sketches can always be compared +by downsampling to the max of the two scaled values. (This is not always +indexed collections of sketches, e.g. SBTs; see [sourmash#1799](https://github.com/sourmash-bio/sourmash/issues/1799).) + +In practice, sourmash does all necessary downsampling dynamically, but +returns the original sketches. This means that (for example) you can +do a low-resolution/high-scaled search quickly by specifying a +high `scaled` value, and then use a higher resolution comparison with +the resulting matches for a more refined and accurate analysis (see +below, [Speeding up `gather` and `search`](#speeding-up-gather-and-search).) + +### Num (MinHash) sketches support Jaccard similarity + +"Regular" MinHash (or "num MinHash") sketches are implemented the same +way as in mash. However, they are less well supported in sourmash, +because they don't offer the same opportunities for metagenome +analysis. (See also [sourmash#1354](https://github.com/sourmash-bio/sourmash/issues/1354).) + +Num MinHash sketches can always be compared by downsampling to a +common `num` value. This may need to be done manually using `sourmash +sig downsample`, however. + +### Conversion between Scaled (FracMinHash) and Num (MinHash) signatures with `downsample` + +As discussed in the previous sections, it is possible to adjust the `scaled` and `num` values to compare two FracMinHash signatures or two Num MinHash signatures. However, it is also possible to covert between the `scaled` and `num` signatures with the `sourmash sig downsample` command. For more details, review the [command line docs for `sig downsample`](https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-signature-downsample-decrease-the-size-of-a-signature). + +### Operations you can do safely with FracMinHash sketches + +As described in +[Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2), +FracMinHash sketches support a wide range of operations that mirror +actions taken on the underlying data set _without_ revisiting the +underlying data. This allows users to build sketches once (requiring +the original data) and then do all sorts of manipulations on the +sketches, and know that the results of the sketch manipulations +represent what would happen if they did the same thing on the original +data. For example, + +* set unions, intersections, and subtractions all perform the same + when done on the sketches as when applied to the underling data. + So, for example, you can sketch two files separately and merge + the sketches (with `sig merge`), and get the same result as if you'd + concatenated the files first and then sketched them. +* if you filter hashes on abundance with `sig filter`, you get the + same result as if you filtered the data set on k-mer abundance and + then sketched it. +* downsampling: you can sketch the original data set at a high resolution + (e.g. scaled=100) and then downsample it later (to e.g. scaled=1000), + and get the same result as if you'd sketched the data set at scaled=1000. + +## K-mer sizes + +There is no explicit restriction on k-mer sizes built into sourmash. + +For highly specific genome and metagenome comparisons, we typically +use k=21, k=31, or k=51. For a longer discussion, see [Assembly Free Analysis with K-mers](https://github.com/mblstamps/stamps2022/blob/main/kmers_and_sourmash/2022-stamps-assembly-free%20class.pdf) from STAMPS 2022 +and a more general overview at [Using sourmash:a practical guide](using-sourmash-a-guide.md). + +## Molecule types - DNA, protein, Dayhoff, and hydrophobic-polar + +sourmash supports four different sequence encodings, which we refer to +as "molecule": DNA (`--dna`), protein (`--protein`), [Dayhoff encoding](https://en.wikipedia.org/wiki/Margaret_Oakley_Dayhoff#Table_of_Dayhoff_encoding_of_amino_acids), +(`--dayhoff`), and [hydrophobic-polar](sourmash-sketch.md#protein-encodings) (`--hp`). + +All FracMinHash sketches have exactly one molecule type, and can only +be compared to the same molecule type (and ksize). + +DNA moltype sketches can be constructed from DNA input sequences using +`sourmash sketch dna`. + +Protein, Dayhoff, and HP moltype sketches can be constructed from +protein input sequences using `sourmash sketch protein`, or from DNA +input sequences using `sourmash sketch translate`; `translate` will +translate in all six reading frames (see also +[orpheum](https://github.com/czbiohub/orpheum) from +[Botvinnik et al., 2021](https://www.biorxiv.org/content/10.1101/2021.07.09.450799v1)). +By default protein sketches will be created; dayhoff sketches can be +created by including `dayhoff` in the param string, e.g. `sourmash +sketch protein -p dayhoff`, and hydrophobic-polar sketches can be +built with `hp` in the param string, e.g. `sourmash sketch protein -p hp`. + +## Manifests + +Manifests are catalogs of sketches: they include most of the information +about a sketch, except for the actual hashes. The idea of manifests is +that you can use them to identify *which* sketches you are interested +in before actually working with them (loading them, for example). + +sourmash makes extensive use of signature manifests to support rapid +selection and lazy loading of signatures based on signature metadata +(name, ksize, moltype, etc.) See +[Blog post: Scaling sourmash to millions of samples](http://ivory.idyll.org/blog/2021-sourmash-scaling-to-millions.html) +for some of the motivation. + +Manifests are an internal format that is not meant to be particularly +human readable, but the CSV format can be loaded into a spreadsheet +program if you're curious :). + +If a collection of signatures is in a zipfile (`.zip`) or SBT zipfile (`.sbt.zip`), manifests must +be named `SOURMASH-MANIFEST.csv`. They can also be stored directly on +disk in CSV/gzipped CSV, or in a sqlite database; see +`sourmash sig manifest`, `sourmash sig check`, and `sourmash sig collect` +for manifest creation, management, and export utilities. + +Where signatures are stored individually in `Index` collections, +e.g. as separate files in a zipfile, manifests may be stored alongside +them; for other subclasses of `Index` such as the inverted indices, +manifests are generated dynamically by the class itself. + +Currently (sourmash 4.x) manifests do not contain information about the +hash seed or sketch license. This will be fixed in the future - see [sourmash#1849](https://github.com/sourmash-bio/sourmash/issues/1849). + +Manifests are very flexible and, especially when stored in a sqlite +database, can be extremely performant for organizing hundreds of +thousands to millions of sketches. Please see `StandaloneManifestIndex` +for a lazy-loading `Index` class that supports such massive-scale +organization. + +## Index implementations + +The `Index` class and its various subclasses (in `sourmash.index`) are +containers that provide an API for organizing, selecting, and +searching (potentially) large numbers of signatures. + +`sourmash sig summarize` is a good way to determine what type of `Index` +class is used to handle a collection. + +Loading and saving of `Index` objects is handled separately from the +class: loading can be done in Python via the +`sourmash.load_file_as_index(...)` method, while creation and/or +updating of `Index` objects is done via +`sourmash.sourmash_args.SaveSignaturesToLocation(...)`. These are the +same APIs used by the command-line functionality. + +There are quite a few different `Index` subclasses and they all have +distinct features. We have a high-level guide to which collection +type to use [here](command-line.md#loading-many-signatures). + +Conceptually, `Index` classes are either organized around storing +individual signatures often with metadata that permits loading, +selecting, and/or searching them more efficiently +(e.g. `ZipFileLinearIndex` and `SBTs`); or they store signatures +as inverted indices (`LCA_Database` and `SqliteIndex`) that permit +certain kinds of fast queries. + +Unless otherwise noted, the `Index` classes below can be loaded +concurrently in "read only" mode - that is, you should build the +collection _once_, and then use it from multiple processes. We +currently do not test for or support concurrent read/write. Note also +that (generally speaking) memory footprints will be additive, so +loading the same `LCA_Database` twice will consume twice the memory. +(If you're interested in concurrency, we suggest using the sqlite +containers - see `SqliteIndex`.) + +### In-memory storage and search. + +The simplest way to handle collections of signatures is to load them +into memory, but it is also the least performant and most memory +intensive mechanism! + +`LinearIndex` and `MultiIndex` both support sketches loaded from +JSON files; both will load the sketches once and then keep them in +memory. `LinearIndex` does not use manifests while `MultiIndex` builds +a manifest as it loads the sketches. + +Note that `MultiIndex` is the class used to load signatures from +pathlists, directory hierarchies, and so on; because it stores +sketches in memory, this can incur a significant memory penalty (see +[sourmash#1899](https://github.com/sourmash-bio/sourmash/issues/1899)). Therefore where possible we suggest building a standalone +manifest (`StandaloneManifestIndex`) to do lazy loading from the disk +instead; you can use `sourmash sig collect` to do this. + +### Zipfile collections + +`ZipFileLinearIndex` stores signature files in a zip file with an +accompanying manifest. This is the most versatile and compressed +option for working with large collections of sketches - it supports +rapid selection and loading of specific sketches from disk, and can +store and search any mixture of sketch types (ksize, molecule type, +scaled values, etc.) + +By default, `ZipFileLinearIndex` stores one signature (equiv. one +sketch) in each member file in the zip archive. Each signature is +stored uncompressed. The accompanying manifest stores the full member +file path in `internal_location`, so that sketches can be retrieved +directly. + +Searching a `ZipFileLinearIndex` is done linearly, as implied by the +name. This is fine for `gather` but if you are doing repeated queries +with `search` you may want to use an SBT or LCA database instead; see +below. + +In the future we expect to parallelize searching `ZipFileLinearIndex` +files in Rust; see [sourmash#1752](https://github.com/sourmash-bio/sourmash/issues/1752). + +`ZipFileLinearIndex` does support zip files without manifests as well +as multiple signatures in a single file; this was originally intended +to support simply zipping entire directory hierarchies into a zipfile. +However, this slows down performance and is not recommended. If you +have an existing zipfile (or really any collection of signatures) and +you want to turn them into a proper `ZipFileLinearIndex`, you can use +`sig cat -o combined.zip` to create a +`ZipFileLinearIndex` file named `combined.zip` that will have a +manifest and signatures broken out into individual files. + +### Sequence Bloom Trees (SBTs) + +Sequence Bloom Trees (SBTs; see +[the Kingsford Lab page for details](http://www.cs.cmu.edu/~ckingsf/software/bloomtree/)) +provide a faster (but more memory intensive) on-disk storage and +search mechanism. In brief, SBTs implement a binary tree organization +of an arbitrary number of signatures; each internal node is a Bloom +filter containing all of the hashes for the nodes below them. This +permits potentially rapid elimination of irrelevant nodes on search. + +SBTs are restricted to storing and searching sketches with the +same/single k-mer size and molecule type, as well as either a single +num value or a single scaled value. + +We suggest using SBTs when you are doing multiple Jaccard search or +containment searches with genomes via `sourmash search`. + +### Lowest common ancestor (LCA) databases + +The `LCA_Database` index class stores signatures in an inverted index, +where a Python dictionary is used to link individual hashes back to +signatures and/or taxonomic lineages. This supports the individualized +hash analyses used in the `lca` submodule. + +LCA databases only support a single ksize, moltype, and scaled. They +can only be used with FracMinHash (scaled) sketches. + +The default `LCA_Database` class is serialized via JSON, and loads +everything into memory when requested. The load time incurs a +significant latency penalty when used from the command line, as well +as having a potentially large memory footprint; this makes it +difficult to use the default `LCA_Database` for very large databases, +e.g. genbank bacteria. + +The newer `LCA_SqliteDatabase` (based on `SqliteIndex`, described +below) also supports LCA-style queries, and is stored on disk, is fast to +load, and uses very little memory. The tradeoff is that the underlying +sqlite database can be quite large. `LCA_SqliteDatabase` should also +support rapid concurrent access (see [sourmash#909](https://github.com/sourmash-bio/sourmash/issues/909)). + +Both types of LCA database can be constructed with `sourmash lca index`. + +### SqliteIndex + +The `SqliteIndex` storage class uses sqlite3 to store hashes and +sketch information for search and retrieval; see +[this blog post](http://ivory.idyll.org/blog/2022-storing-ulong-in-sqlite-sourmash.html) +for background information and details. These are fast, low-memory, +on-disk databases, with the tradeoff that they can be quite large. +This is probably currently the best solution for concurrent access to +sketches via e.g. a Web server (see also [sourmash#909](https://github.com/sourmash-bio/sourmash/issues/909)). + +`SqliteIndex` can only contain FracMinHash sketches and can only store +sketches with the same scaled parameter. However, it can store +multiple ksizes and moltypes as long as the same scaled is used. + +`SqliteIndex` objects can be constructed using `sourmash sig cat +... -o filename.sqldb`. + +### Standalone manifests + +The `StandaloneManifestIndex` class loads standalone manifests generated +by `sourmash sig collect`. They support rapid selection and lazy loading +on potentially extremely large collections of signatures. + +The underlying mechanism uses the `internal_location` field of +manifests to point to the container file. When particular sketches are +requested, the container file is loaded into an `Index` object with +`sourmash.load_file_as_index` and the `md5` values of the requested +sketches are used as a picklist to retrieve the desired signatures. + +Thus, while standalone manifests can point at any kind of container, +including JSON files or LCA databases, they are most efficient when +`internal_location` points at a file with either a single sketch in +it, or a manifest that supports direct loading of sketches. Therefore, +we suggest using standalone manifest indices. + +Note that searching a standalone manifest is currently done through a +linear iteration, and does not use any features of indexed containers +such as SBTs or LCAs. This is fine for `gather` with the default +approach, but is probably suboptimal for a `search`. + +### Pathlists and `--from-file` + +All (or most) sourmash commands natively support taking in lists of +signature collections via pathlists, `--from-file`, or paths to +directories. This is useful for situations where you have thousands of +signature files and don't want to provide them explicitly on the +command line; you can simply put a list of the files in a text file, +and pass it in directly (or use `--from-file` to pass it in). + +Both pathlists and files passed to `--from-file` contain a list of +paths to be loaded; relatives paths will be interpreted relative to +the current working directory of sourmash. Pathlists should be +universally available on sourmash commands. When `--from-file` is +available for a command, sourmash will behave as if the file paths in +the file were provided on the command line. + +We suggest avoiding pathlists. Instead, we suggest using `--from-file` +or a standalone manifest index (generated with `sourmash sig +collect`). This is because the signatures from pathlists are loaded +into memory (see `MultiIndex`, above) it is generally a bad idea to +use them - they may be slow to load and may consume a lot of +memory. They also do not support good loading error messages; +see [sourmash#1414](https://github.com/sourmash-bio/sourmash/issues/1414). + +### Extensions for outputting index classes + +Most commands that support saving signatures will save them in a +variety of formats, based on the extension provided (see +[sourmash#1890](https://github.com/sourmash-bio/sourmash/issues/1890) +for exceptions). The supported extensions are - + +* `.zip` for `ZipFileLinearIndex` +* `.sqldb` for `SqliteIndex` +* `.sig` or `.sig.gz` for JSON/gzipped JSON +* `dirname/` to save in a directory hierarchy + +The default signature save format is JSON, if the extension is not +recognized. + +## Speeding up `gather` and `search` + +There are two primary search commands in sourmash: `gather` and +`search`. + +`gather` calculates a minimum metagenome cover as discussed in [Irber et al., 2022](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2). It +is mostly intended for querying a database with a metagenome, although +it can be used with genome queries, as well. This approach relies on overlaps +between genomes and metagenomes and can only be used with FracMinHash sketches. + +`search` does a straight Jaccard similarity search on MinHash and +FracMinHash sketches (or, with `--containment`, a containment search +on FracMinHash sketches). It is typically used to find matches to a +query genome sketch in a large database of sketches. + +The `prefetch` command does a containment search and is intended for +power users; it is a standalone implementation of the prefetch +algorithm discussed below for `gather`. It only works with FracMinHash +sketches. + +Note that all of these commands work with any and all `Index` +collection/container types, and will return the same results however +the collections are organized - see the "online behavior" section, +below. In practice this means that you can provide additional +collections of signatures via the command line without building a +combined index of all your signatures. It also means that the only +reason to choose different collections/containers is for +optimization - you should select the containers that help you achieve +the desired performance characteristics for your search +(i.e. the right memory/time/disk space tradeoffs). + +### Running `search` many times on the same database + +`search` typically is used to search a large database of sketches for +all similarity or containment matches above a threshold. Depending on +the query and the database, certain kinds of database indices may make +search much faster, especially when only a few matches are expected. + +If you are doing many searches against the same database, indexing the +database as an SBT (with `sourmash index`) or as a `SqliteIndex`/sqldb +database is likely to provide a significant speed increase, albeit +with increased memory usage (SBT) or increased disk space (sqldb). + +Conversely, `ZipFileLinearIndex` and the default `LCA_Database` are likely +to be poor choices for many searches - the former only supports linear +searches, and the latter needs to be loaded from disk and deserialized each +time. + +### Running `gather` once + +`gather` is typically used to search a metagenome against a large +database of sketches, as part of finding a minimum set cover. This can +be quite slow! Our current implementation (as of sourmash 4.1.0, [pull request sourmash#1370](https://github.com/sourmash-bio/sourmash/pull/1370)) does a single pass across the database to find all matches +with a Jaccard similarity or containment above the provided threshold, and then organizes +the matches for rapid min-set-cov analysis. This single pass across the +database is called a "prefetch", and it is also implemented in the +`prefetch` subcommand. + +With this single pass approach, benchmarks - [sourmash#2014](https://github.com/sourmash-bio/sourmash/issues/2014) - show that a +linearly searchable database is performant enough to be used with +`gather`. We therefore suggest using a `ZipFileLinearIndex` container +with gather, or in cases where low-memory concurrency is desired, a +`SqliteIndex` container. + +### Using `prefetch` and `gather` together + +If you want to use `prefetch` independently of `gather`, you can use +the prefetch output as a picklist passed into gather - see +[picklists](#picklists), below. This can be useful when you want to +experiment with different threshold parameters for `gather` - first, +do a very sensitive/low-threshold search with `prefetch` and save the +results to a CSV file with `-o`, + +Repeated gathers and searches. CTB + +Using prefetch explicitly. CTB + +### Using a higher scaled value + +With FracMinHash sketches, you can downsample the query to make both +`search` and `gather` _much_ faster. A good rule of thumb is to use a +scaled value that is about 5x smaller than the minimum overlap to +detect; so, if you want to be able to detect 50kb of similarity, you +can use a scaled value of 10,000. Conversely, the default scaled value +of 1,000 (for DNA sketches) should robustly detect overlaps of 5kb. + +You can supply `--scaled` to `gather` and `prefetch` to dynamically +downsample the query FracMinHash. For `search` you will need to use +`sourmash sig downsample` to generate a downsampled sketch. + +### Running `gather` many times - `multigather` + +In situations where loading the search database is slow (e.g. +`LCA_Database` or zipfiles with very large manifests), the `sourmash +multigather` command supports many queries against many databases. + +(We don't particularly suggest using `multigather`; we would prefer +to make search databases faster. But it's there! :) + +### Much faster search and gather with branchwater + +We also have a reasonably stable plugin, +[pyo3_branchwater](https://github.com/sourmash-bio/pyo3_branchwater), +that implements multithreaded operations using Rust. It is 100-1000 +times faster than sourmash, and 5-50 times lower memory. In exchange, +it's not quite as flexible as the full sourmash package. But if you're +running into speed or memory problems, you should give it a try! + +## Taxonomy and assigning lineages + +All sourmash taxonomy handling is done within the `lca` and `tax` +subcommands (CLI) and submodules (Python). + +In the case of the `lca` subcommands, the taxonomic information is +incorporated into the LCA database construction (see the `lca index` +command), while the `tax` subcommands load taxonomic information +on demand from taxonomy databases (CSVs or databases). + +sourmash anchors all taxonomy to identifiers, and uses the signature +name to do so - this is the name as set by the `--name` parameter to +`sourmash sketch`, and output by `sourmash sig describe` as the +`signature:` field. + +### Identifier handling + +sourmash prefers identifiers to be the first space-separated token in +the signature name. This token can contain any alphanumeric letters +other than space, and should contain at most one period. The version +of the identifier will be the component after the period. + +So, for example, for a signature name of + +``` +CP001941.1 Aciduliprofundum boonei T469, complete genome +``` +the identifier would be `CP001941.1` and the version would be 1. +There are no other constraints placed on the identifier, and +versions are not handled in any special way other than as below. + +The `lca index` and `tax` commands both support some modified +identifier handling in sourmash 3.x and 4.x, but in the future, we +plan to deprecate these as they mostly cause confusion and internal +complexity. + +The two modifiers are: + +* `--keep-full-identifiers` will use the entire signature +name instead of just the first space-separated token. It is by default +off (set to False). + +* `--keep-identifier-versions` turns on keeping the full identifier, +including what is after the first period. It is by default off (set to +False), stripping identifiers of their version on load. When it is on (True), identifiers are not stripped of their version on load. + +### Taxonomies, or lineage spreadsheets + +sourmash supports arbitrary (free) taxonomies, and new taxonomic +lineages can be created and used internally as long as they +are provided in the appropriate spreadsheet format. + +You can also mix and match taxonomies as you need; for example, it is +entirely legitimate in sourmash-land to combine the GTDB taxonomy for +bacterial and archaeal sequence classification, with the NCBI taxonomy +for eukaryotic and viral sequence classification. (You probably don't +want to mix and match within superkingdoms, though!) + +As of sourmash v4, lineage spreadsheets should contain columns for +superkingdom, phylum, class, order, family, genus, and species. Some +commands may also support a 'strain' column, although this is +inconsistently handled within sourmash internally at the moment. + +For spreadsheet organization, `lca index` expects the columns to be +present in order from superkingdom on down, while the `tax` +subcommands use CSV column headers instead. We are planning to +consolidate around the `tax` subcommand handling in the future (see [sourmash#2198](https://github.com/sourmash-bio/sourmash/issues/2198)). + +An example spreadsheet is +[here, bacteria_refseq_lineage.csv](https://github.com/sourmash-bio/sourmash/blob/latest/tests/test-data/tax/bacteria_refseq_lineage.csv). (The +`taxid` column is not used by most sourmash functions and is mostly +ignored, but it is needed for the `kreport` and `bioboxes` report +formats.) + +### `LCA_SqliteDatabase` - a special case + +The `LCA_SqliteDatabase` index class can serve multiple purposes: as +an index of sketches (for regular search and gather); as a taxonomy +database for use with the `tax` subcommands; and as an LCA database +for use with the `lca` subcommands. + +When used as a taxonomy database, an `LCA_SqliteDatabase` file +contains the same SQL tables as a sqlite taxonomy database. + +When used as an LCA database, an `LCA_SqliteDatabase` dynamically loads +the taxonomic lineages from the sqlite database and applies them to the +individual hashes, permitting the same kind of hash-to-lineage query +capability as the `LCA_Database`. + +## Picklists + +Picklists are a generic mechanism used to select a (potentially small) +subset of signatures for search/display. + +The general idea of picklists is that you create a list of signatures +you're interested in - by name, or identifier, or md5sum - and then +supply that list in a csvfile on the command line via `--picklist`. +For example, `--picklist list.csv:colname:ident` would load the +values in the column named `colname` in the file `list.csv` as identifiers +to be used to restrict the search. + +The support picklist column types are `name`, `ident` +(space-delimited identifier), `identprefix` (identifier with version +removed), `md5`, `md5prefix8`, and `md5short`. Generally the `md5` +and derived values are used to reference signatures found some other +way with sourmash, while the identifiers are more broadly useful. + +There are also four special column types that can be used without a column +name: `gather`, `prefetch`, `search`, and `manifest`. These take the +CSV output of the respective sourmash commands as inputs for picklists, +so that you can use prefetch to generate a picklist and then use that +picklist with `--picklist prefetch_out.csv.gz::prefetch`. + +### Differing internal behavior + +Picklists behave differently with different `Index` classes. + +For indexed databases like SBT, LCA, and `SqliteIndex`, the search is +done _first_, and then only those results that match the picklist are +selected. + +For linear search databases like `ZipFileLinearIndex` or standalone +manifests, picklists are _first_ used to subselect the desired +signatures, and only those signatures are searched. + +This means that picklists can dramatically speed up searches on some +`Index` types, but won't affect performance on others. But +the results will be the same. + +### Taxonomy / lineage spreadsheets as picklists + +Note that lineage CSV spreadsheets, as consumed by `sourmash tax` commands +and as output by `sourmash tax grep`, can be used as `ident` picklists. + + + +## Online and streaming; and adding to collections of sketches. + +One of the big challenges with Big Data is looking at it all at once - +loading all your data into memory, for example, will fail with really large +data sets. The ability to look at subsets of data without looking at _all_ +of it is called "streaming" (much like when you watch a streaming +movie online - you can start watching the movie without downloading the +whole video, and you can also usually jump to a particular +location in the video without downloading the intervening bits.) + +Another related challenge is analyzing data against a database that is +constantly growing, either because you're adding to it or because it's +being updated by others. For example, in genomics, often you want to +repeat the same analysis you did last time but with more reference +genomes. With many software packages, this requires rebuilding your +indexed database, which can be challenging for large genomes. In +computer science parlance, the ability to add new data at the end +_without_ performing an expensive reindexing operation is referred +to as "online". + +sourmash tackles these challenges in a few different ways, and does its +best to support streaming and online behavior. + +First, all sourmash commands can take multiple databases and will +return the same results with multiple databases as they would with a +single database containing the same sketches, unless otherwise +noted. This allows you to incrementally expand your sketch collections +over time without building new databases. _Performance_ may vary +(i.e. if you're using an SBT to do search, and you add an unindexed +collection of sketches to the search, the search may take longer than +if you'd add the new sketches to the SBT) but the _results_ will be +the same. In this sense, many of the sourmash algorithms are online. + +Second, several sourmash algorithms use _streaming_ when searching +databases - in particular, `prefetch` will load and unload sketches as +it goes, as long as the underlying collection data structure supports +it (`.sig.gz` and LCA JSON databases do _not_, but zip files, SBTs, +and SQLite databases _do_). This lets you do containment searches +against really large collections without consuming large amounts of +memory. Another example is the `manysearch` command in the +[pyo3_branchwater](https://github.com/sourmash-bio/pyo3_branchwater) +plugin, which loads and searches a limited number of metagenomes from +a large collection, rather than loading the entire collection into +memory - which would be impossible. + +Last but not least, one of the interesting guarantees that FracMinHash +sketches provide is that no hash is ever _removed_ when sketching. +This supports various types of input streaming, which we haven't spent +too much time exploring, but (for example) means that "watching" +sequencing runs and/or downloads of sequencing data, and reporting +interim results with certainty, is possible. If you're interested +in making use of this, please reach out! + +### Gather on multiple collections, and order of search and reporting + +Since `sourmash gather` will pick only one "best match" if there +are several (and will ignore the others), the order of searching +can matter for large collections. How does this work? + +In brief, sourmash doesn't guarantee a particular load order for +sketches in a single collection, but it _does_ guarantee that +collections are loaded and searched in their entirety in the order +that you provide them. So, for example, if you have a large zipfile +database of sketches that contains duplicates, you can't predict which +of the duplicates will be chosen as a match; but you _can_ build your +own collection of prioritized matches as a separate database, and put +it first on the command line. A practical application of this might +be to list the GTDB "representatives" database first on the command +line, with the full GTDB database second, in order to prioritize +choosing representative genomes as matches over the rest. + +This also plays a role in the order of reporting for `prefetch` +output - `prefetch` will report matching sketches in the order it +encounters them, which will match the order in which collections are +given to `sourmash prefetch` on the command line. + +## Formats natively understood by sourmash + +sourmash should always autodetect the format of a collection or +database, in most cases based on its content (and not its +filename). Please file a bug report if this doesn't work for you! + +`sourmash sig summarize` is a good way to examine the properties of a +signature collection. + +### Reading and writing gzipped CSV files + +(As of sourmash v4.5) + +When a CSV filename is specified (e.g. `sourmash gather ... -o +mygather.csv`), you can always provide a name that ends with `.gz` to +produce a gzip-compressed file instead. This can save quite a bit of +space for prefetch results and manifests in particular! + +All sourmash commands that take in a CSV (via manifest, or picklist, +or taxonomy) will autodetect a gzipped CSV based on content (the file +does not need to end with `.gz`). The one exception is manifests, +where the CSV needs to end with `.gz` to be loaded as a gzipped CSV; +see +[sourmash#2214](https://github.com/sourmash-bio/sourmash/issues/2214) +for an issue to fix this. diff --git a/doc/toc.md b/doc/toc.md new file mode 100644 index 0000000000..5e41325360 --- /dev/null +++ b/doc/toc.md @@ -0,0 +1,41 @@ +# Full table of contents + +```{toctree} +--- +maxdepth: 2 +--- + +api-example.md +api.md +classifying-signatures.md +command-line.md +databases-advanced.md +databases.md +dev_plugins.md +developer.md +faq.md +funding.md +index.md +kmers-and-minhash.ipynb +legacy-databases.md +more-info.md +other-languages.md +plotting-compare.ipynb +publications.md +release.md +requirements.md +sourmash-collections.ipynb +sourmash-examples.ipynb +sourmash-internals.md +sourmash-sketch.md +storage.md +support.md +tutorial-basic.md +tutorial-lemonade.md +tutorial-lin-taxonomy.md +tutorial-long.md +tutorials-lca.md +tutorials.md +using-LCA-database-API.ipynb +using-sourmash-a-guide.md +``` diff --git a/doc/tutorial-basic.md b/doc/tutorial-basic.md index ebc8cb9ea3..1ea7cc1416 100644 --- a/doc/tutorial-basic.md +++ b/doc/tutorial-basic.md @@ -308,4 +308,4 @@ and `gather`; see `sourmash index`, above, [the LCA tutorial][4], or [2]:https://pubmed.ncbi.nlm.nih.gov/23387867/ [3]:index.md [4]:tutorials-lca.md -[5]:sourmash-collections.md +[5]:sourmash-collections.ipynb