Commit

Merge branch 'latest' of https://github.com/sourmash-bio/sourmash into update_doc_add_dev_and_install
ctb committed Oct 16, 2023
2 parents 16bf8b5 + 009e462 commit 499e172
Showing 3 changed files with 66 additions and 29 deletions.
83 changes: 60 additions & 23 deletions doc/command-line.md
@@ -5,7 +5,7 @@
```

From the command line, sourmash can be used to create
[MinHash sketches][0] from DNA and protein sequences, compare them to
[FracMinHash sketches][0] from DNA and protein sequences, compare them to
each other, and plot the results; these sketches are saved into
"signature files". These signatures allow you to estimate sequence
similarity and containment quickly and accurately in large
@@ -45,7 +45,10 @@ Next, compare all the signatures to each other:
sourmash compare *.sig -o cmp.dist
```

Finally, plot a dendrogram: ``` sourmash plot cmp.dist --labels ```
Finally, plot a dendrogram:
```
sourmash plot cmp.dist --labels
```
This will output three files, `cmp.dist.dendro.png`,
`cmp.dist.matrix.png`, and `cmp.dist.hist.png`, containing a
clustering & dendrogram of the sequences, a similarity matrix and
@@ -562,15 +565,16 @@ As with all reference-based analysis, results can be affected by the
and redundancy of reference databases.

For more details on how `gather` works and can be used to classify
signatures, see [classifying-signatures](classifying-signatures.md).
signatures, see <project:classifying-signatures.md>.

### `sourmash tax metagenome` - summarize metagenome content from `gather` results

`sourmash tax metagenome` summarizes gather results for each query metagenome by
taxonomic lineage.
taxonomic lineage.

example command to summarize a single `gather csv`, where the query was gathered
against `gtdb-rs202` representative species database:
Here is an example command to summarize a single `gather csv`, where
the query was gathered against the `gtdb-rs202` representative species
database:

```
sourmash tax metagenome
@@ -589,10 +593,10 @@ The possible output formats are:
#### `csv_summary` output format

`csv_summary` is the default output format. This outputs a `csv` with lineage
summarization for each taxonomic rank. This output currently consists of six
columns, `query_name,rank,fraction,lineage,query_md5,query_filename`, where
`fraction` is the fraction of the query matched to the reported rank and
lineage.
summarization for each taxonomic rank. This output currently consists of six
columns, `query_name,rank,fraction,lineage,query_md5,query_filename`, where
`fraction` is the fraction of the query matched to the reported rank and
lineage.

example `csv_summary` output from the command above:

@@ -611,7 +615,7 @@ o__Bacteroidales;f__Bacteroidaceae;g__Prevotella;s__Prevotella copri
HSMA33MX,species,0.016,d__Bacteria;p__Bacteroidota;c__Bacteroidia;
o__Bacteroidales;f__Bacteroidaceae;g__Phocaeicola;s__Phocaeicola vulgatus
```
> The `query_md5` and `query_filename` columns are omitted here for brevity.
The `query_md5` and `query_filename` columns are omitted here for brevity.
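
The six `csv_summary` columns described above can be read with any CSV library. The following is a minimal Python sketch, not part of sourmash itself: the helper name `fractions_at_rank` is hypothetical, the sample row reuses the *Phocaeicola vulgatus* line from the example output, and the `query_md5` and `query_filename` fields are left empty because the example omits them.

```python
import csv
import io


def fractions_at_rank(handle, rank):
    """Map each lineage reported at `rank` to its matched fraction of the query."""
    reader = csv.DictReader(handle)
    return {
        row["lineage"]: float(row["fraction"])
        for row in reader
        if row["rank"] == rank
    }


# One row taken from the example output above; md5 and filename are left blank.
SAMPLE = (
    "query_name,rank,fraction,lineage,query_md5,query_filename\n"
    "HSMA33MX,species,0.016,"
    "d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;"
    "f__Bacteroidaceae;g__Phocaeicola;s__Phocaeicola vulgatus,,\n"
)

species = fractions_at_rank(io.StringIO(SAMPLE), "species")
for lineage, fraction in species.items():
    # The last ';'-separated field is the name at the reported rank.
    print(lineage.split(";")[-1], fraction)
```

Keying on the full lineage string rather than only the final name keeps entries distinct when the same name appears under different parent lineages.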

#### `krona` output format

@@ -1005,14 +1009,14 @@ commands.

All `sourmash tax` commands must be given one or more taxonomy files as
parameters to the `--taxonomy` argument. These files can be either CSV
files or (as of sourmash 4.2.1) sqlite3 databases. sqlite3 databases
files or (as of sourmash 4.2.1) SQLite databases. SQLite databases
are much faster for large taxonomies, while CSV files are easier to view
and modify using spreadsheet software.

`sourmash tax prepare` is a utility function that can ingest and validate
multiple CSV files or sqlite3 databases, and output a CSV file or a sqlite3
multiple CSV files or SQLite databases, and output a CSV file or a SQLite
database. It can be used to combine multiple taxonomies into a single file,
as well as change formats between CSV and sqlite3.
as well as change formats between CSV and SQLite.

The following command will take in two taxonomy files and combine them into
a single taxonomy SQLite database.
@@ -1077,7 +1081,7 @@ Note: `tax grep` only searches taxonomic ranks, not identifier strings.
Use `sig grep` to search for identifiers in sketch collections.

Currently only CSV output (optionally gzipped) is supported; use `sourmash tax prepare` to
convert CSV output from `tax grep` into a sqlite3 taxonomy database.
convert CSV output from `tax grep` into a SQLite taxonomy database.

### `sourmash tax summarize` - print summary information for lineage spreadsheets or taxonomy databases

@@ -1188,7 +1192,7 @@ a status of `disagree` with the genus-level assignment of *Shewanella*;
species level assignments would not be reported.
Here, the assigned rank is the rank immediately *above* where there is
a taxonomic disagreement, and the taxid & lineage refer to the name at
that rank (the least-common-ancestor at which an assignment can be
that rank (the lowest common ancestor at which an assignment can be
made).
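
The idea of assigning at the rank immediately above a disagreement can be sketched in a few lines of Python. This is an illustration of the lowest-common-ancestor concept only, not sourmash's implementation: the function name, the rank list, and the two example lineages are assumptions chosen for the demonstration.

```python
# Rank names assumed for illustration, ordered from broadest to most specific.
RANKS = ("superkingdom", "phylum", "class", "order", "family", "genus", "species")


def lowest_common_ancestor(lineages):
    """Return (rank, shared_lineage) for the deepest rank at which all lineages agree."""
    shared = []
    # Walk the ranks in parallel and stop at the first disagreement.
    for names in zip(*lineages):
        if len(set(names)) != 1:
            break
        shared.append(names[0])
    if not shared:
        return None, ()
    return RANKS[len(shared) - 1], tuple(shared)


# Two illustrative lineages that agree down to genus and differ at species.
lineage_a = ("Bacteria", "Proteobacteria", "Gammaproteobacteria",
             "Alteromonadales", "Shewanellaceae", "Shewanella",
             "Shewanella baltica")
lineage_b = ("Bacteria", "Proteobacteria", "Gammaproteobacteria",
             "Alteromonadales", "Shewanellaceae", "Shewanella",
             "Shewanella oneidensis")

rank, lineage = lowest_common_ancestor([lineage_a, lineage_b])
print(rank, lineage[-1])
```

Here the two lineages disagree at species, so the sketch reports the genus *Shewanella*, one rank above the disagreement.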

For another example, if you saw this line in the CSV file:
@@ -1364,8 +1368,8 @@ agreement/disagreement. Please see the blog post
[Why are taxonomic assignments so different for Tara bins?](http://ivory.idyll.org/blog/2017-taxonomic-disagreements-in-tara-mags.html)
for an example use case.

[0]:https://en.wikipedia.org/wiki/MinHash
[1]:http://mash.readthedocs.io/en/latest/__
[0]:https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2
[1]:http://mash.readthedocs.io/en/latest/
[2]:http://biorxiv.org/content/early/2015/10/26/029827
[3]:https://en.wikipedia.org/wiki/Jaccard_index

@@ -1403,6 +1407,12 @@ sourmash signature cat file1.sig file2.sig -o all.zip
will combine all signatures in `file1.sig` and `file2.sig` and put them
in the file `all.zip`.

#### Using picklists with `sourmash sig cat`

As of sourmash 4.2.0, `cat` also supports picklists, a feature by
which you can select signatures based on values in a CSV file. See
[Using picklists to subset large collections of signatures](#using-picklists-to-subset-large-collections-of-signatures), below.

### `sourmash signature describe` - display detailed information about signatures

Display signature details.
@@ -1767,9 +1777,36 @@ and/or containment might be very close to zero.

For example,
```
sourmash signature overlap file1.sig file2.sig
sourmash signature overlap tests/test-data/63.fa.sig \
tests/test-data/47.fa.sig
```
will display the detailed comparison of the two files like so:
```text
loaded one signature each from tests/test-data/63.fa.sig and tests/test-data/47.fa.sig
first signature:
signature filename: tests/test-data/63.fa.sig
signature: NC_011663.1 Shewanella baltica OS223, complete genome
md5: 38729c6374925585db28916b82a6f513
k=31 molecule=DNA num=0 scaled=1000
second signature:
signature filename: tests/test-data/47.fa.sig
signature: NC_009665.1 Shewanella baltica OS185, complete genome
md5: 09a08691ce52952152f0e866a59f6261
k=31 molecule=DNA num=0 scaled=1000
similarity: 0.32069
first contained in second: 0.48282
second contained in first: 0.48851
number of hashes in first: 5238
number of hashes in second: 5177
number of hashes in common: 2529
only in first: 2709
only in second: 2648
total (union): 7886
```
will display the detailed comparison of `file1.sig` and `file2.sig`.
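
The figures in the `sig overlap` output for `63.fa.sig` and `47.fa.sig` are related by simple set arithmetic on the hash counts. The sketch below recomputes them in Python from the three counts reported in that output; the function and key names are hypothetical, and the code only mirrors the arithmetic, not how sourmash computes it internally.

```python
def overlap_stats(n_first, n_second, n_common):
    """Derive similarity and containment values from hash counts."""
    union = n_first + n_second - n_common
    return {
        "similarity": n_common / union,           # Jaccard index
        "first_in_second": n_common / n_first,    # containment of first in second
        "second_in_first": n_common / n_second,   # containment of second in first
        "only_in_first": n_first - n_common,
        "only_in_second": n_second - n_common,
        "union": union,
    }


# Hash counts taken from the example output: 5238, 5177, and 2529 in common.
stats = overlap_stats(5238, 5177, 2529)
print(f"similarity:                {stats['similarity']:.5f}")
print(f"first contained in second: {stats['first_in_second']:.5f}")
print(f"second contained in first: {stats['second_in_first']:.5f}")
print(f"total (union):             {stats['union']}")
```

The recomputed values agree with the reported ones (0.32069, 0.48282, 0.48851, and a union of 7886), which shows that similarity is the Jaccard index and each containment is the shared count divided by that sketch's own size.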

`sig overlap` can only work with compatible sketches - if there are multiple
k-mer sizes or molecule types present in any of the signature files,
@@ -1970,7 +2007,7 @@ CSV file (based on the headers in the first line of the CSV file), and
`:include` or `:exclude`, can be added as a fourth parameter; if
omitted, the default is `:include`.
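
The include/exclude behavior can be pictured as a set-membership filter over one CSV column. The following Python sketch is a conceptual illustration, not sourmash's picklist code: the function name `select_by_picklist` is hypothetical, it handles only exact matching, and the two signature names are borrowed from the test data shown elsewhere in this document.

```python
import csv
import io


def select_by_picklist(names, pick_csv, colname, mode="include"):
    """Keep (or drop) names that exactly match a value in column `colname`."""
    # The column is looked up by the header in the first line of the CSV.
    wanted = {row[colname] for row in csv.DictReader(pick_csv)}
    if mode == "include":
        return [name for name in names if name in wanted]
    if mode == "exclude":
        return [name for name in names if name not in wanted]
    raise ValueError(f"unknown picklist mode: {mode!r}")


SIGNATURE_NAMES = [
    "NC_011663.1 Shewanella baltica OS223, complete genome",
    "NC_009665.1 Shewanella baltica OS185, complete genome",
]

# A one-column picklist; the value is quoted because it contains a comma.
PICKLIST = 'name\n"NC_011663.1 Shewanella baltica OS223, complete genome"\n'

included = select_by_picklist(SIGNATURE_NAMES, io.StringIO(PICKLIST), "name")
excluded = select_by_picklist(
    SIGNATURE_NAMES, io.StringIO(PICKLIST), "name", mode="exclude"
)
print(included)
print(excluded)
```

With the default `include` mode only the listed signature is kept; with `exclude` the listed signature is dropped and the other one remains.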

The following `coltype`s are currently supported by `sourmash sig extract`:
The following `coltype`s are currently supported for picklists:

* `name` - exact match to signature's name
* `md5` - exact match to signature's md5sum
@@ -2032,8 +2069,8 @@ slow, especially for many (100s or 1000s) of signatures.

All of the `sourmash` commands support loading collections of
signatures from zip files. You can create a compressed collection of
signatures using `zip -r collection.zip *.sig` and then specify
`collections.zip` on the command line.
signatures using `sourmash sig cat *.sig -o collections.zip` and then
specify `collections.zip` on the command line in place of `*.sig`.

### Choosing signature output formats

10 changes: 5 additions & 5 deletions doc/index.md
@@ -11,11 +11,11 @@ viral genomes.

You might try sourmash if you want to -

* identify which reference genomes to map your metagenomic reads to
* search all Genbank microbial genomes with a sequence query
* cluster many genomes by similarity
* taxonomically classify genomes or metagenomes against NCBI and/or GTDB
* search thousands of metagenomes with a query genome or sequence
* identify which reference genomes to use for metagenomic read mapping;
* search all GenBank microbial genomes with a sequence query;
* cluster hundreds or thousands of genomes by similarity;
* taxonomically classify genomes or metagenomes against NCBI and/or GTDB;
* search thousands of metagenomes with a query genome or sequence.

Our **vision**: sourmash strives to support biologists in analyzing
modern sequencing data at high resolution and with full context,
2 changes: 1 addition & 1 deletion doc/publications.md
@@ -13,7 +13,7 @@ Pierce et al., 2019. This is the original sourmash use case paper.

[Evaluation of taxonomic classification and profiling methods for long-read shotgun metagenomic sequencing datasets](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05103-0),
Portik et al., 2022. This paper shows that sourmash is extremely
sensitive and very specific for taxonomic classification.
sensitive and very specific for taxonomic profiling.

## Petabase-scale search
