MRG: refactor fastgather/fastmultigather CSV output to use mpsc channels #567

ctb · 2025-01-03T17:43:45Z

Note: PR into #568

This PR adds support for a single output file with -o for fastmultigather on a non-RocksDB database. It also refactors fastgather to use the same underlying mpsc mechanism, so all three gather output mechanisms share more code. The ultimate goal is to enable a better internal API, ref #569 and #552.
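
For readers unfamiliar with the pattern: the Rust code uses mpsc channels, and the sketch below is only a rough Python analogue of the same single-writer idea, not the plugin's implementation. Worker threads put rows on a `queue.Queue` while one writer thread owns the output file. The function names and the two-column layout are hypothetical.

```python
import csv
import io
import queue
import threading

# Sentinel object that tells the writer no more rows are coming.
_DONE = object()


def writer(rows, out):
    """Single consumer: the only thread that touches the output file."""
    w = csv.writer(out)
    w.writerow(["query_name", "match_name"])  # hypothetical columns
    while True:
        row = rows.get()
        if row is _DONE:
            break
        w.writerow(row)


def worker(query, matches, rows):
    """Producer: sends one row per match for a single query."""
    for match in matches:
        rows.put((query, match))


def gather_all(work, out):
    """Run one producer thread per query, all feeding one writer thread."""
    rows = queue.Queue()
    writer_thread = threading.Thread(target=writer, args=(rows, out))
    writer_thread.start()

    workers = [
        threading.Thread(target=worker, args=(query, matches, rows))
        for query, matches in work.items()
    ]
    for t in workers:
        t.start()
    for t in workers:
        t.join()

    # All producers are finished, so it is safe to stop the writer.
    rows.put(_DONE)
    writer_thread.join()
```

Because only the writer thread touches the file, rows from different queries can interleave in any order but are never written on top of each other.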

In particular, this means that now -o works the same way on a RocksDB database and on a non-RocksDB database :).

This PR also disables the current functionality of creating individual output files for each query, which simplifies matters greatly, but does break backwards compatibility. See long-ranging discussions over in sourmash-bio/sourmash#2722 and sourmash-bio/sourmash#2328.

Finally, one last breakage: --create-empty-results now only creates empty prefetch results files, and not gather results files.

This PR also:

uses the new sourmash Collection::min_max_scaled method (ref update code to use min_max_scaled #527);
makes sure utils::prefetch(...) only yields non-zero overlaps, preventing infinite loops;
fixes clippy messages about complex return tuples/types.
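
To illustrate why the non-zero-overlap guarantee matters, here is a simplified greedy gather loop over plain Python sets. It is a sketch of the general idea under stated assumptions, not the code in utils::prefetch: the `prefetch` and `gather` functions and the dict-of-sets database are hypothetical stand-ins.

```python
def prefetch(query, database):
    """Yield (name, overlap) only for entries sharing at least one hash."""
    for name, hashes in database.items():
        overlap = len(query & hashes)
        if overlap > 0:  # dropping this check lets the loop below spin forever
            yield name, overlap


def gather(query, database):
    """Greedy loop: repeatedly take the best match and subtract its hashes."""
    remaining = set(query)
    results = []
    while True:
        candidates = list(prefetch(remaining, database))
        if not candidates:
            break
        # Pick the largest overlap; ties are broken by name for determinism.
        name, overlap = max(candidates, key=lambda c: (c[1], c[0]))
        results.append((name, overlap))
        # overlap > 0 guarantees this removes at least one hash.
        remaining -= database[name]
    return results
```

In this sketch every iteration removes at least one hash from `remaining`, so the loop must terminate. If zero-overlap candidates were yielded, the candidate list would never become empty and nothing would be subtracted.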

TODO:

adjust output_empty_results - ref MRG: optionally, create empty results files for in-memory fastmultigather #446
update docs
fix remaining tests
write example/demo code to split results out by query

ctb · 2025-01-06T23:37:08Z

Ready for review @bluegenes!

bluegenes

a couple documentation qualms - otherwise lgtm 🎉

ctb · 2025-01-07T15:10:17Z

Here's the complete updated doc section:

Running `fastmultigather`

fastmultigather takes a collection of query metagenomes and a collection of sketches as a database, and outputs a CSV file containing all the matches.

sourmash scripts fastmultigather queries.manifest.csv database.zip --cores 4 --save-matches -o results.csv

We suggest using standalone manifest CSVs wherever possible, especially if
the queries are large.

The main advantage of fastmultigather over running fastgather once
per query is that fastmultigather loads the database only once for
all queries; this can save significant time with large databases.

Output files for `fastmultigather`

fastmultigather writes the gather results for all queries to a
single CSV file, specified with -o/--output. This CSV provides the
same columns as fastgather output, above.
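
If you still need one gather CSV per query, the combined file can be split after the fact. The following is a minimal sketch using only the Python standard library; it assumes the combined CSV has a column identifying the query, taken here to be `query_name` (check your own file's header and pass a different `query_column` if needed). The `split_by_query` helper and the output file naming are hypothetical.

```python
import csv
from collections import defaultdict
from pathlib import Path


def split_by_query(results_csv, outdir, query_column="query_name"):
    """Write one CSV per distinct query; return {query: output path}."""
    rows_by_query = defaultdict(list)
    with open(results_csv, newline="") as fp:
        reader = csv.DictReader(fp)
        fieldnames = reader.fieldnames
        for row in reader:
            rows_by_query[row[query_column]].append(row)

    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)

    written = {}
    # Number the files rather than naming them after the query, so that
    # unusual characters in query names cannot produce bad paths.
    for n, query in enumerate(sorted(rows_by_query)):
        outpath = outdir / f"query{n:04d}.gather.csv"
        with open(outpath, "w", newline="") as out:
            w = csv.DictWriter(out, fieldnames=fieldnames)
            w.writeheader()
            w.writerows(rows_by_query[query])
        written[query] = outpath
    return written
```

For example, `split_by_query("results.csv", "split_results")` returns a mapping from each query name to the file holding its rows. The csv module handles quoted fields, so match names containing commas are preserved.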

In addition, on a database of sketches (but not on RocksDB indexes),
fastmultigather will output a prefetch file for each query, containing
all overlapping matches between that query and the database. The
prefetch CSV will be named {signame}.prefetch.csv, where {signame} is
the name of the query signature.

--save-matches is an optional flag that will save the matched hashes
for each query in a separate sourmash signature
{signame}.matches.sig. This can be useful for debugging or for
further analysis.

Warning: At the moment, if two different queries have the same
{signame}, the output files for one query will be overwritten by
the results from the other query. The behavior here is undefined in
practice, because of multithreading: we don't know which queries will
be executed when, or which files will be written first.

ctb added 6 commits January 3, 2025 13:42

refactor CSV output for fastgather/fastmultigather to use mpsc

04ea44b

cargo fmt

656f870

tests mostly pass

fd3ce53

fix skipmer test

8bc9d33

upd comment

cc17722

Merge branch 'fix_skip_test' into refactor_gather_csv

c6a34f8

ctb changed the base branch from main to fix_skip_test January 4, 2025 12:09

ctb added 9 commits January 4, 2025 08:10

black

88a6466

Merge branch 'fix_skip_test' into refactor_gather_csv

e755b0b

switch to min_max_scaled for rocksdb

ec91bc1

black

42ecb2e

ensure overlap is > 0

3f40c6b

rm print

0e483ce

cleanup

ff40d6b

fix clippy messages about too-complex returns

41f1b07

cargo fmt

2f05442

ctb mentioned this pull request Jan 5, 2025

MRG: refactor the internal Rust API to support container-level functionality #569

Merged

2 tasks

ctb added 3 commits January 5, 2025 13:29

upd overlap

46da554

fix

746ea88

fix tests

ebbd67b

ctb changed the title ~~WIP: refactor fastgather/fastmultigather CSV output to use mpsc channels~~ MRG: refactor fastgather/fastmultigather CSV output to use mpsc channels Jan 6, 2025

ctb added 2 commits January 6, 2025 15:35

update docs

547484a

upd

bf9a5b7

bluegenes reviewed Jan 7, 2025

doc/README.md (outdated, resolved)

bluegenes reviewed Jan 7, 2025

src/python/sourmash_plugin_branchwater/__init__.py (resolved)

bluegenes approved these changes Jan 7, 2025

ctb added 3 commits January 7, 2025 06:22

Merge branch 'main' of github.com:sourmash-bio/sourmash_plugin_branch…

089f6ab

…water into fix_skip_test

break test again

6091cf8

do heinous dev stuff

92d634a

ctb added 4 commits January 7, 2025 06:40

fix fix comment

8fea8a7

Merge branch 'fix_skip_test' into refactor_gather_csv

4734806

upd

ea52473

upd

38326e3

Base automatically changed from fix_skip_test to main January 7, 2025 14:55

ctb added 2 commits January 7, 2025 06:55

Merge branch 'main' of github.com:sourmash-bio/sourmash_plugin_branch…

a1a646d

…water into refactor_gather_csv

do not require -o after all

401562c

ctb merged commit 40880f0 into main Jan 7, 2025
3 checks passed

ctb deleted the refactor_gather_csv branch January 7, 2025 15:12

ctb mentioned this pull request Jan 15, 2025

MRG: bump plugin version to v0.9.13 #584

Merged

1 task
