MRG: refactor fastgather/fastmultigather CSV output to use mpsc channels #567

Merged
merged 29 commits into from
Jan 7, 2025

Conversation

ctb
Collaborator

@ctb ctb commented Jan 3, 2025

Note: PR into #568

This PR adds support for a single output file with -o for fastmultigather on a non-RocksDB database. It also refactors fastgather to use the same underlying mpsc mechanism, so all three gather output mechanisms share more code in common. The ultimate goal is to enable a better internal API, ref #569 and #552.

In particular, this means that now -o works the same way on a RocksDB database and on a non-RocksDB database :).
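The shared output mechanism described above can be pictured roughly as follows. This is a minimal sketch using only the standard library, not the code from this PR: `GatherRow`, `write_all_results`, the three columns, and the placeholder match are hypothetical, and the naive `writeln!` formatting does no CSV escaping.

```rust
use std::io::Write;
use std::sync::mpsc::{sync_channel, SyncSender};
use std::thread;

/// Hypothetical row type; real gather results carry many more columns.
#[derive(Debug, Clone)]
pub struct GatherRow {
    pub query_name: String,
    pub match_name: String,
    pub intersect_bp: u64,
}

/// Spawn one worker per query, send every row through a bounded channel,
/// and let a single writer thread serialize the rows to `out`.
pub fn write_all_results<W: Write + Send + 'static>(
    queries: Vec<String>,
    mut out: W,
) -> std::io::Result<W> {
    let (tx, rx) = sync_channel::<GatherRow>(1024);

    // Single consumer: the only thread that ever touches the output.
    let writer = thread::spawn(move || -> std::io::Result<W> {
        writeln!(out, "query_name,match_name,intersect_bp")?;
        // The loop ends once every sender has been dropped.
        for row in rx {
            writeln!(
                out,
                "{},{},{}",
                row.query_name, row.match_name, row.intersect_bp
            )?;
        }
        out.flush()?;
        Ok(out)
    });

    // Producers: each worker owns a clone of the sender.
    let workers: Vec<_> = queries
        .into_iter()
        .map(|query_name| {
            let tx: SyncSender<GatherRow> = tx.clone();
            thread::spawn(move || {
                // Placeholder for the real gather computation.
                let row = GatherRow {
                    query_name,
                    match_name: "example_match".to_string(),
                    intersect_bp: 1000,
                };
                // send() fails only if the writer has already exited;
                // this sketch ignores that case.
                let _ = tx.send(row);
            })
        })
        .collect();

    // Drop the original sender so the channel closes when the workers finish.
    drop(tx);
    for worker in workers {
        worker.join().expect("worker thread panicked");
    }
    writer.join().expect("writer thread panicked")
}
```

Because only the writer thread owns the output handle, rows from different queries cannot interleave within a line, and the same function works whether the rows come from one query or many. Row order across queries is not deterministic.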

This PR also disables the current functionality of creating individual output files for each query, which simplifies matters greatly, but does break backwards compatibility. See long-ranging discussions over in sourmash-bio/sourmash#2722 and sourmash-bio/sourmash#2328.

Finally, one last breakage: --create-empty-results now only creates empty prefetch results files, and not gather results files.

This PR also:

  • uses the new sourmash Collection::min_max_scaled method (ref #527, "update code to use min_max_scaled");
  • makes sure utils::prefetch(...) only yields non-zero overlaps, preventing infinite loops;
  • fixes clippy messages about complex return tuples/types.
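To illustrate why zero-overlap results can cause an infinite loop, here is a simplified sketch of a greedy gather loop over plain hash sets. It is not the plugin's implementation: `prefetch_nonzero` and `greedy_gather` are hypothetical names, and the real code works on sketches rather than `HashSet<u64>`. The point is that each round must remove at least one hash from the query, which only holds if every candidate has a non-zero overlap.

```rust
use std::cmp::Reverse;
use std::collections::HashSet;

/// Hypothetical stand-in for a prefetch step: return (index, overlap) for
/// each database entry, keeping only entries that share at least one hash
/// with the query.
pub fn prefetch_nonzero(
    query: &HashSet<u64>,
    db: &[(String, HashSet<u64>)],
) -> Vec<(usize, usize)> {
    db.iter()
        .enumerate()
        .map(|(i, (_, hashes))| (i, query.intersection(hashes).count()))
        .filter(|&(_, overlap)| overlap > 0)
        .collect()
}

/// Greedy loop: pick the best remaining match, subtract its hashes from the
/// query, and repeat until no candidate overlaps the query any more.
pub fn greedy_gather(
    mut query: HashSet<u64>,
    db: &[(String, HashSet<u64>)],
) -> Vec<String> {
    let mut picked = Vec::new();
    loop {
        let candidates = prefetch_nonzero(&query, db);
        // Largest overlap wins; ties go to the lowest index.
        let best = candidates
            .iter()
            .max_by_key(|&&(i, overlap)| (overlap, Reverse(i)));
        let Some(&(idx, _)) = best else { break };
        picked.push(db[idx].0.clone());
        // Overlap > 0 guarantees the query shrinks, so the loop terminates.
        query.retain(|h| !db[idx].1.contains(h));
    }
    picked
}
```

If the filter on `overlap > 0` were removed, a database entry with no shared hashes could be selected repeatedly without ever shrinking the query, and the loop would never reach the `break`.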

This PR:

TODO:

@ctb ctb changed the base branch from main to fix_skip_test January 4, 2025 12:09
@ctb ctb changed the title WIP: refactor fastgather/fastmultigather CSV output to use mpsc channels MRG: refactor fastgather/fastmultigather CSV output to use mpsc channels Jan 6, 2025
@ctb
Collaborator Author

ctb commented Jan 6, 2025

Ready for review @bluegenes!

Review thread on doc/README.md (outdated, resolved)
Contributor

@bluegenes bluegenes left a comment


a couple documentation qualms - otherwise lgtm 🎉

Base automatically changed from fix_skip_test to main January 7, 2025 14:55
@ctb
Collaborator Author

ctb commented Jan 7, 2025

Here's the complete updated doc section:

Running fastmultigather

fastmultigather takes a collection of query metagenomes and a collection of sketches as a database, and outputs a CSV file containing all the matches.

sourmash scripts fastmultigather queries.manifest.csv database.zip --cores 4 --save-matches -o results.csv

We suggest using standalone manifest CSVs wherever possible, especially if
the queries are large.

The main advantage that fastmultigather has over running
fastgather on multiple queries is that fastmultigather only needs
to load the database once for all queries, rather than once per query;
this can be a significant time savings for large databases.

Output files for fastmultigather

fastmultigather writes the gather results for all queries to a single
CSV file, specified with -o/--output. fastmultigather gather CSVs
provide the same columns as fastgather, above.

In addition, on a database of sketches (but not on RocksDB indexes),
fastmultigather will output a prefetch file for each query, containing
all overlapping matches between that query and the database. The prefetch
CSV will be named {signame}.prefetch.csv, where {signame} is the
name of the query's sourmash signature.

--save-matches is an optional flag that will save the matched hashes
for each query in a separate sourmash signature
{signame}.matches.sig. This can be useful for debugging or for
further analysis.

Warning: At the moment, if two different queries have the same
{signame}, the output files for one query will be overwritten by
the results from the other query. Which query's results survive is
undefined in practice, because of multithreading: we don't know which
queries will be executed when, or which files will be written first.
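One way to guard against the overwrite described in the warning is to check the query names for duplicates before running. The sketch below is only an illustration and is not part of the plugin: `duplicate_names` is a hypothetical helper, and it assumes the signature names have already been read into a slice of strings.

```rust
use std::collections::BTreeMap;

/// Return every name that occurs more than once, in sorted order.
/// Each returned name corresponds to per-query output files
/// ({signame}.prefetch.csv, {signame}.matches.sig) that would collide.
pub fn duplicate_names<'a>(names: &[&'a str]) -> Vec<&'a str> {
    let mut counts: BTreeMap<&'a str, usize> = BTreeMap::new();
    for name in names {
        *counts.entry(*name).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .filter(|&(_, n)| n > 1)
        .map(|(name, _)| name)
        .collect()
}
```

Calling `duplicate_names(&["s1", "s2", "s1"])` returns a vector containing only `"s1"`; an empty result means no per-query output files share a name.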

@ctb ctb merged commit 40880f0 into main Jan 7, 2025
3 checks passed
@ctb ctb deleted the refactor_gather_csv branch January 7, 2025 15:12
@ctb ctb mentioned this pull request Jan 15, 2025
1 task