update new -> index

sourmash-bio · Oct 13, 2023 · 535d7eb · 535d7eb
1 parent 1f44043
commit 535d7eb
Show file tree

Hide file tree

Showing 2 changed files with 80 additions and 271 deletions.
diff --git a/doc/index.md b/doc/index.md
@@ -1,191 +1,114 @@
 # Welcome to sourmash!
 
-sourmash is a command-line tool and Python library for computing
-[hash sketches](https://en.wikipedia.org/wiki/MinHash) from DNA
-sequences, comparing them to each other, and plotting the results.
-This allows you to estimate sequence similarity between even very
-large data sets quickly and accurately.
-
-sourmash can be used to quickly search large databases of genomes
-for matches to query genomes and metagenomes; see [our list of
-available databases](databases.md).
-
-sourmash also includes k-mer based taxonomic exploration and
-classification routines for genome and metagenome analysis. These
-routines can use the NCBI and GTDB taxonomies but do not depend on them
-specifically.
-
-We have [several tutorials](tutorials.md) available! Start with
-[Making signatures, comparing, and searching](tutorial-basic.md).
-
-The paper [Large-scale sequence comparisons with sourmash (Pierce et al., 2019)](https://f1000research.com/articles/8-1006)
-gives an overview of how sourmash works and what its major use cases are.
-Please also see the `mash` [software](http://mash.readthedocs.io/en/latest/) and
-[paper (Ondov et al., 2016)](http://dx.doi.org/10.1186/s13059-016-0997-x) for
-background information on how and why MinHash works.
-
-**Questions? Thoughts?** Ask us on the [sourmash issue tracker](https://github.com/sourmash-bio/sourmash/issues/)!
-
-**Want to migrate to sourmash v4?** sourmash v4 is now available, and
-has a number of incompatibilites with v2 and v3. Please see 
-[our migration guide](support.md#migrating-from-sourmash-v3x-to-sourmash-v4x)!
-
-----
-
-To use sourmash, you must be comfortable with the UNIX command line;
-programmers may find the [Python library and API](api.md) useful as well.
-
-If you use sourmash, please cite us!
-
-> Brown and Irber (2016),
-> **sourmash: a library for MinHash sketching of DNA**.
-> Journal of Open Source Software, 1(5), 27, [doi:10.21105/joss.00027](https://joss.theoj.org/papers/3d793c6e7db683bee7c03377a4a7f3c9)
-
-## sourmash in brief
-
-sourmash uses MinHash-style sketching to create "signatures", compressed
-representations of DNA/RNA sequence.  These signatures can then
-be stored, searched, explored, and taxonomically annotated.
-
-* `sourmash` provides command line utilities for creating, comparing,
-  and searching signatures, as well as plotting and clustering
-  signatures by similarity (see [the command-line docs](command-line.md)).
-
-* `sourmash` can **search very large collections of signatures** to find matches
-  to a query.
-
-* `sourmash` can also **identify parts of metagenomes that match known genomes**,
-  and can **taxonomically classify genomes and metagenomes** against databases
-  of known species.
+```{contents} Contents
+:depth: 3
+```
 
-* `sourmash` can be used to **search databases of public sequences**
-  (e.g. all of GenBank) and can also be used to create and search databases
-  of **private sequencing data**.
+sourmash is a command-line tool and Python/Rust library for
+**metagenome analysis** and **genome comparison** with k-mers.  It
+supports the compositional analysis of metagenomes, rapid search of
+large sequence databases, and flexible taxonomic analysis with both
+NCBI and GTDB taxonomies. sourmash works well with sequences 30kb or
+larger, including bacterial and viral genomes.
 
-* `sourmash` supports saving, loading, and communication of signatures
-  via [JSON](http://www.json.org/), a ~human-readable and editable format.
+You might try sourmash if you want to -
 
-* `sourmash` also has a simple Python API for interacting with signatures,
-  including support for online updating and querying of signatures
-  (see [the API docs](api.md)).
+* identify which reference genomes to map your metagenomic reads to
+* search all Genbank microbial genomes with a sequence query
+* cluster many genomes by similarity
+* taxonomically classify genomes or metagenomes against NCBI and/or GTDB;
+* search thousands of metagenomes with a query genome or sequence
 
-* `sourmash` relies on an underlying Rust core for performance.
+Our **vision**: sourmash strives to support biologists in analyzing
+modern sequencing data at high resolution and with full context,
+including all public reference genomes and metagenomes.
 
-* `sourmash` is developed [on GitHub](https://github.com/sourmash-bio/sourmash)
-  and is **freely and openly available** under the BSD 3-clause license.
-  Please see [the README](https://github.com/sourmash-bio/sourmash/blob/latest/README.md)
-  for more information on development, support, and contributing.
+## How does sourmash work?
 
-You can take a look at sourmash analyses on real data
-[in a saved Jupyter notebook](https://github.com/sourmash-bio/sourmash/blob/latest/doc/sourmash-examples.ipynb),
-and experiment with it yourself
-[interactively in a Jupyter Notebook](https://mybinder.org/v2/gh/sourmash-bio/sourmash/latest?labpath=doc%2Fsourmash-examples.ipynb)
-at [mybinder.org](http://mybinder.org).
+Underneath, sourmash uses [FracMinHash sketches](https://www.biorxiv.org/content/10.1101/2022.01.11.475838) for fast and
+lightweight sequence comparison; FracMinHash builds on
+[MinHash sketching](https://en.wikipedia.org/wiki/MinHash) to support both Jaccard similarity
+_and_ containment analyses with k-mers.  This significantly expands
+the range of operations that can be done quickly and in low
+memory. sourmash also implements a number of new and powerful analysis
+techniques, including minimum metagenome covers and alignment-free ANI
+estimation.
 
-## Installing sourmash
+sourmash is inspired by [mash](https://mash.readthedocs.io), and
+supports most mash analyses. sourmash also implements an expanded set
+of functionality for metagenome and taxonomic analysis.
 
-You can use pip:
-```bash
-$ pip install sourmash
-```
+sourmash development was initiated with a grant from the Moore
+Foundation under the Data Driven Discovery program, and has been
+supported by further funding from the NIH and NSF. Please see
+[funding acknowledgements](funding.md) for details!
 
-or conda:
-```bash
-$ conda install -c conda-forge -c bioconda sourmash
-```
+## Mission statement
 
-Please see [the README file in github.com/sourmash-bio/sourmash](https://github.com/sourmash-bio/sourmash/blob/latest/README.md)
-for more information.
+The project mission is to provide practical tools and approaches for
+analyzing extremely large sequencing data sets, with an emphasis on
+high resolution results. We design around the following principles:
 
-## Memory and speed
+* genomic and metagenomic analyses should be able to make use of all
+  available reference genomes.
+* metagenomic analyses should support assembly independent approaches,
+  to avoid biases stemming from low coverage or high strain
+  variability.
+* private and public databases should be equally well supported.
+* a variety of data structures and algorithms are necessary to support
+  a wide set of use cases, including efficient command-line analysis,
+  real-time queries, and massive-scale batch analyses.
+* our tools should be well behaved members of the bioinformatics
+  analysis tool ecosystem, and use common installation approaches,
+  standard formats, and semantic versioning.
+* our tools should be robustly tested, well documented, and supported.
+* we discuss scientific and computational tradeoffs and make specific
+  recommendations where possible, admitting uncertainty as needed.
 
-sourmash has relatively small disk and memory requirements compared to
-many other software programs used for genome search and taxonomic
-classification.
+## Using sourmash
 
-`sourmash search` and `sourmash gather` can be used to search 100k
-genbank microbial genomes ([using our prepared databases](databases.md))
-with about 20 GB of disk and in under 1 GB of RAM.
-Typically a search for a single genome takes about 30 seconds on a laptop.
+### Tutorials and examples
 
-`sourmash lca` can be used to search/classify against all genbank
-microbial genomes with about 200 MB of disk space and about 10 GB of
-RAM. Typically a metagenome classification takes about 1 minute on a
-laptop.
+These tutorials are command line tutorials that should work on Mac OS
+X and Linux. They require about 5 GB of disk space and 5 GB of RAM.
 
-## sourmash versioning
+* [The first sourmash tutorial - making signatures, comparing, and searching](tutorial-basic.md)
 
-We support the use of sourmash in pipelines and applications
-by communicating clearly about bug fixes, feature additions, and feature
-changes. We use version numbers as follows:
+* [Using sourmash LCA to do taxonomic classification](tutorials-lca.md)
 
-* Major releases, like v4.0.0, may break backwards compatibility at
-  the command line as well as top-level Python/Rust APIs.
-* Minor releases, like v4.1.0, will remain backwards compatible but
-  may introduce significant new features.
-* Patch releases, like v4.1.1, are for minor bug fixes; full backwards
-  compatibility is retained.
+* [Analyzing the genomic and taxonomic composition of an environmental genome using GTDB and sample-specific MAGs with sourmash](tutorial-lemonade.md)
 
-If you are relying on sourmash in a pipeline or application, we
-suggest specifying your version requirements at the major release,
-e.g. in conda you would specify `sourmash>=3,<4`.
+* [Some sourmash command line examples!](sourmash-examples.ipynb)
 
-See [the Versioning docs](support.md) for more information on what our
-versioning policy means in detail, and how to migrate between major
-versions!
+### How-To Guides
 
-## Limitations
+* Installing sourmash
 
-**sourmash cannot find matches across large evolutionary distances.**
+* [Classifying genome sketches](classifying-signatures.md)
 
-sourmash seems to work well to search and compare data sets for
-nucleotide matches at the species and genus level, but does not have much
-sensitivity beyond that.  (It seems to be particularly good at
-strain-level analysis.)  You should use protein-based analyses
-to do searches across larger evolutionary distances.
+* [Working with private collections of genome sketches.](sourmash-collections.ipynb)
 
-**sourmash signatures can be very large.**
+* [Using the `LCA_Database` API.](using-LCA-database-API.ipynb)
 
-We use a modification of the MinHash sketch approach that allows us
-to search the contents of metagenomes and large genomes with no loss
-of sensitivity, but there is a tradeoff: there is no guaranteed limit
-to signature size when using 'scaled' signatures.
+* [Building plots from `sourmash compare` output](plotting-compare.ipynb).
 
-## Logo
+* [A short guide to using sourmash output with R](other-languages.md).
 
-The sourmash logo was designed by Stéfanie Fares Sabbag,
-with feedback from Clara Barcelos,
-Taylor Reiter and Luiz Irber.
+### How sourmash works under the hood
 
-<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img
-alt="Creative Commons License" style="border-width:0"
-src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />
+* [An introduction to k-mers for genome comparison and analysis](kmers-and-minhash.ipynb)
+* [Support, versioning, and migration between versions](support.md)
 
-The logo
-is licensed under a <a rel="license"
-href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons
-Attribution-ShareAlike 4.0 International License</a>.
+### Reference material
 
-## Contents:
+* [UNIX command-line documentation](command-line.md)
+* [Genbank and GTDB databases and taxonomy files](databases.md)
+* [Python examples using the API](api-example.md)
+* [Publications about sourmash](publications.md)
+* [A guide to the internals of sourmash](sourmash-internals.md)
+* [Funding acknowledgements](funding.md)
 
-```{toctree}
----
-maxdepth: 2
----
-
-command-line
-tutorials
-using-sourmash-a-guide
-classifying-signatures
-databases
-api
-more-info
-support
-developer
-```
+## Developing and extending sourmash
 
-# Indices and tables
+* [Releasing a new version of sourmash](release.md)
 
-* {ref}`genindex`
-* {ref}`modindex`
-* {ref}`search`
diff --git a/doc/new.md b/doc/new.md