Commit

Merge branch 'update_smash' of github.com:sourmash-bio/sourmash_plugin_branchwater into try-skipmers
ctb committed Dec 21, 2024
2 parents 19a9537 + bab2d89 commit a8dba08
Showing 16 changed files with 446 additions and 311 deletions.
32 changes: 16 additions & 16 deletions Cargo.lock

6 changes: 3 additions & 3 deletions Cargo.toml
@@ -9,18 +9,18 @@ name = "sourmash_plugin_branchwater"
crate-type = ["cdylib"]

[dependencies]
pyo3 = { version = "0.23.2", features = ["extension-module", "anyhow"] }
pyo3 = { version = "0.23.3", features = ["extension-module", "anyhow"] }
rayon = "1.10.0"
serde = { version = "1.0.216", features = ["derive"] }
#sourmash = { version = "0.17.2", features = ["branchwater"] }
sourmash = {git = "https://github.com/sourmash-bio/sourmash/", branch = "latest", features = ["branchwater"]}
sourmash = { git = "https://github.com/sourmash-bio/sourmash.git", branch = "latest", features = ["branchwater"] }
serde_json = "1.0.133"
niffler = "2.4.0"
log = "0.4.22"
env_logger = { version = "0.11.5" }
simple-error = "0.3.1"
anyhow = "1.0.94"
zip = { version = "=2.0", default-features = false }
zip = { version = "2.0", default-features = false }
tempfile = "3.14"
needletail = "0.5.1"
csv = "1.3.1"
41 changes: 28 additions & 13 deletions doc/README.md
@@ -29,13 +29,11 @@ be processed differently. The plugin commands are also a bit less
user friendly, because (for now) we're more focused on speed than
polish and user experience.

**Note:** As of v0.9.5, the outputs of `fastgather` and `fastmultigather` almost completely match the output of `sourmash gather`; see below for details.

## Input file formats

sourmash supports a variety of different storage formats for sketches (see [sourmash docs](https://sourmash.readthedocs.io/en/latest/command-line.html#choosing-signature-output-formats)), and the branchwater plugin works with some (but not all) of them. Branchwater _also_ supports an additional database type, a RocksDB-based inverted index, that is not (yet) supported natively by sourmash (through v4.8.11).

**As of v0.9.8, we recommend using zip files or standalone manifest CSVs pointing to zip files whenever you need to provide multiple sketches.**
**We recommend using zip files or standalone manifest CSVs pointing to zip files whenever you need to provide multiple sketches.**

| command | command input | database format |
| -------- | -------- | -------- |
@@ -58,8 +56,8 @@ When working with large collections of small sketches such as genomes, we sugges
* in particular, _single_ sketches can be loaded on demand, supporting lower memory requirements for certain kinds of searches.

For all these reasons, zip files are the most efficient and effective
basic storage type for sketches in sourmash, and as of the branchwater
plugin v0.9.0, they are fully supported!
basic storage type for sketches in sourmash, and the branchwater
plugin fully supports them!

You can create zipfiles with sourmash like so:
```
@@ -152,7 +150,7 @@ at the start in order to generate a manifest. To avoid memory issues,
the signatures are not kept in memory, but instead re-loaded as
described below for each command (see: Notes on concurrency and
efficiency). This makes using pathlists less efficient than `zip`
files (as of v0.9.0) or manifests (as of v0.9.8).
files.

## Running the commands

@@ -304,7 +302,7 @@ version of `sourmash gather`.
sourmash scripts fastgather query.sig.gz database.zip -o results.csv --cores 4
```

As of v0.9.5, `fastgather` outputs the same columns as `sourmash gather`, with only a few exceptions:
`fastgather` outputs the same columns as `sourmash gather`, with only a few exceptions:
* `match_name` is output instead of `name`;
* `match_md5` is output instead of `md5`;
* `match_filename` is output instead of `filename`, and the value is different;
@@ -392,6 +390,11 @@ To report _any_ overlap between two sketches, set the threshold to 0.
(This will produce many, many results when searching a collection of
metagenomes!)

Using `-A/--output-all-comparisons` will ignore the threshold parameter
and output all comparisons done. Against a RocksDB database, only matches
with some overlap will be reported; with collections of sketches, all
pairs will be reported.
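
The reporting rule described above can be sketched as a small filter. This is an illustration only, not the plugin's implementation: the `Comparison` struct and both function names are hypothetical, and the `>=` comparison against the threshold is an assumption. It models the collection-of-sketches case, where every pair is compared and zero-overlap pairs can therefore appear in the output.

```rust
/// Hypothetical record for one query/match comparison (not a plugin type).
#[derive(Debug, Clone, PartialEq)]
pub struct Comparison {
    pub query: String,
    pub matched: String,
    pub containment: f64,
}

/// Decide whether one comparison is written to the output.
/// Assumption: a comparison passes when its containment is at or above
/// the threshold; with `output_all_comparisons` set, the threshold is ignored.
pub fn should_report(containment: f64, threshold: f64, output_all_comparisons: bool) -> bool {
    output_all_comparisons || containment >= threshold
}

/// Apply the rule to a batch of comparisons, keeping only the reported ones.
pub fn filter_results(
    results: &[Comparison],
    threshold: f64,
    output_all_comparisons: bool,
) -> Vec<Comparison> {
    results
        .iter()
        .filter(|c| should_report(c.containment, threshold, output_all_comparisons))
        .cloned()
        .collect()
}
```

Against a RocksDB database the input to such a filter would already exclude zero-overlap pairs, since the inverted index only yields matches that share at least one hash, which is why the two cases report different sets of rows.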

By default, `manysearch` will display the contents of the CSV file in a
human-readable format. This can be disabled with `-N/--no-pretty-print`
when executing large searches.
@@ -452,14 +455,14 @@ pathlist format, and specify the desired output directory; we suggest
using the `.rocksdb` extension for RocksDB databases, e.g. `-o
gtdb-rs214-k31.rocksdb`.

By default, as of v0.9.7, `index` will store a copy of the sketches
By default, `index` will store a copy of the sketches
along with the inverted index. This will substantially increase the
disk space required for large databases. You can provide an optional
`--no-internal-storage` to `index` to store them externally, which
reduces the disk space needed for the index. Read below for technical
details!

As of v0.9.8, `index` can take any of the supported input types, but
`index` can take any of the supported input types, but
unless you are using a zip file or a pathlist of JSON files, it may
need to load all the sketches into memory before indexing
them. Moreover, you can only use external storage with a zip file. We
@@ -470,9 +473,6 @@ the sketches are being loaded into memory.

#### Internal vs external storage of sketches in a RocksDB index

(The below applies to v0.9.7 and later of the plugin; for v0.9.6 and
before, only external storage was implemented.)

RocksDB indexes support containment queries (a la the
[branchwater application](https://github.com/sourmash-bio/branchwater)),
as well as `gather`-style mixture decomposition (see
@@ -489,7 +489,7 @@ the original source sketches used to construct the database, wherever
they reside on your disk.

The sketches *are not used* by `manysearch`, but *are used* by
`fastmultigather`: with v0.9.6 and later, you'll get an error if you
`fastmultigather`: you'll get an error if you
run `fastmultigather` against a RocksDB index where the sketches
cannot be loaded.

@@ -521,6 +521,21 @@ in downstream software packages (this plugin, and
[the branchwater application code](https://github.com/sourmash-bio/branchwater)).
The above documentation applies to sourmash core v0.15.0.

## Notes on versioning and semantic versioning guarantees

Unlike sourmash,
[which provides guarantees that command-line options and outputs will not change within minor versions](https://sourmash.readthedocs.io/en/latest/support.html#versioning-and-stability-of-features-and-apis),
we make no guarantees of stability within the branchwater plugin. This
is because the branchwater plugin is intended to move fast and
occasionally break things.

Eventually we expect to provide all of the branchwater plugin's functionality within the sourmash package, at which time the sourmash guarantees will apply!

However, we do not expect command line options and output file formats
to change quickly.

We will also endeavor to avoid changing column names in CSV output, although we may change the _order_ of columns on occasion. Please use the column headers (column names) to select specific columns.
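
As a minimal sketch of what selecting by column name looks like, the following looks up a column's position from the header row instead of hard-coding an index. The function names are hypothetical, the `containment` column in the usage note is illustrative, and the comma split is deliberately naive: it assumes no quoted fields containing commas, so real code should use a proper CSV parser.

```rust
use std::collections::HashMap;

/// Map each column name in a CSV header line to its position.
/// Naive split on ',' (assumes no quoted fields that contain commas).
pub fn header_index(header: &str) -> HashMap<String, usize> {
    header
        .trim_end()
        .split(',')
        .enumerate()
        .map(|(i, name)| (name.trim().to_string(), i))
        .collect()
}

/// Return the values of `column` for every data row, looked up by name.
/// Returns None if the input is empty or the column is not in the header.
pub fn select_column<'a>(csv_text: &'a str, column: &str) -> Option<Vec<&'a str>> {
    let mut lines = csv_text.lines();
    let header = lines.next()?;
    let idx = *header_index(header).get(column)?;
    Some(
        lines
            .filter(|line| !line.trim().is_empty())
            .filter_map(|line| line.split(',').nth(idx))
            .map(str::trim)
            .collect(),
    )
}
```

With this approach, `select_column(text, "match_name")` returns the same values whether the header reads `match_name,containment` or `containment,match_name`, so a change in column order does not break downstream code.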

## Notes on concurrency and efficiency

Each command does things somewhat differently, with implications for
14 changes: 11 additions & 3 deletions src/lib.rs
@@ -28,7 +28,7 @@ mod singlesketch;
use camino::Utf8PathBuf as PathBuf;

#[pyfunction]
#[pyo3(signature = (querylist_path, siglist_path, threshold, ksize, scaled, moltype, output_path=None, ignore_abundance=false))]
#[pyo3(signature = (querylist_path, siglist_path, threshold, ksize, scaled, moltype, output_path=None, ignore_abundance=false, output_all_comparisons=false))]
#[allow(clippy::too_many_arguments)]
fn do_manysearch(
querylist_path: String,
@@ -39,13 +39,15 @@ fn do_manysearch(
moltype: String,
output_path: Option<String>,
ignore_abundance: Option<bool>,
output_all_comparisons: Option<bool>,
) -> anyhow::Result<u8> {
let againstfile_path: PathBuf = siglist_path.clone().into();
let selection = build_selection(ksize, scaled, &moltype);
eprintln!("selection scaled: {:?}", selection.scaled());
let allow_failed_sigpaths = true;

let ignore_abundance = ignore_abundance.unwrap_or(false);
let output_all_comparisons = output_all_comparisons.unwrap_or(false);

// if siglist_path is revindex, run rocksdb manysearch; otherwise run manysearch
if is_revindex_database(&againstfile_path) {
@@ -57,6 +59,7 @@ fn do_manysearch(
threshold,
output_path,
allow_failed_sigpaths,
output_all_comparisons,
) {
Ok(_) => Ok(0),
Err(e) => {
@@ -73,6 +76,7 @@ fn do_manysearch(
output_path,
allow_failed_sigpaths,
ignore_abundance,
output_all_comparisons,
) {
Ok(_) => Ok(0),
Err(e) => {
@@ -232,7 +236,7 @@ fn do_check(index: String, quick: bool) -> anyhow::Result<u8> {
}

#[pyfunction]
#[pyo3(signature = (querylist_path, siglist_path, threshold, ksize, scaled, moltype, estimate_ani, estimate_prob_overlap, output_path=None))]
#[pyo3(signature = (querylist_path, siglist_path, threshold, ksize, scaled, moltype, estimate_ani, estimate_prob_overlap, output_all_comparisons, output_path=None))]
#[allow(clippy::too_many_arguments)]
fn do_multisearch(
querylist_path: String,
@@ -243,6 +247,7 @@ fn do_multisearch(
moltype: String,
estimate_ani: bool,
estimate_prob_overlap: bool,
output_all_comparisons: bool,
output_path: Option<String>,
) -> anyhow::Result<u8> {
let _ = env_logger::try_init();
@@ -258,6 +263,7 @@ fn do_multisearch(
allow_failed_sigpaths,
estimate_ani,
estimate_prob_overlap,
output_all_comparisons,
output_path,
) {
Ok(_) => Ok(0),
@@ -270,7 +276,7 @@ fn do_multisearch(

#[pyfunction]
#[allow(clippy::too_many_arguments)]
#[pyo3(signature = (siglist_path, threshold, ksize, scaled, moltype, estimate_ani, write_all, output_path=None))]
#[pyo3(signature = (siglist_path, threshold, ksize, scaled, moltype, estimate_ani, write_all, output_all_comparisons, output_path=None))]
fn do_pairwise(
siglist_path: String,
threshold: f64,
@@ -279,6 +285,7 @@ fn do_pairwise(
moltype: String,
estimate_ani: bool,
write_all: bool,
output_all_comparisons: bool,
output_path: Option<String>,
) -> anyhow::Result<u8> {
let selection = build_selection(ksize, scaled, &moltype);
@@ -290,6 +297,7 @@ fn do_pairwise(
allow_failed_sigpaths,
estimate_ani,
write_all,
output_all_comparisons,
output_path,
) {
Ok(_) => Ok(0),