MRG: add skipmers; switch to reading frame approach for translation, skipmers #3395

Merged
merged 22 commits into from
Dec 20, 2024

Conversation

bluegenes
Contributor

@bluegenes bluegenes commented Nov 13, 2024

This PR enables skipmers ONLY in the Rust code.

  • enables two skipmer types: m1n3, m2n3
  • switches SeqToHashes to use a reading frame struct, which simplifies and unifies the code across the different methods. The reading frame code handles any modifications needed, i.e. translation or skipping. Then we just kmerize the reading frame as usual. The main difference for translation is that we no longer need to store a buffer of all hashes from the reading frames.
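
A minimal sketch of the reading frame idea described above, not the actual sourmash code: first build a skipmer frame by keeping `m` of every `n` bases, then take ordinary k-mers from that frame. The function names `skipmer_frame` and `kmers` are hypothetical and chosen only for illustration.

```rust
/// Build one skipmer reading frame: starting at `start`, keep the first `m`
/// bases of every cycle of `n` bases and drop the remaining `n - m`.
/// (Hypothetical helper; not part of the sourmash API.)
fn skipmer_frame(seq: &[u8], start: usize, m: usize, n: usize) -> Vec<u8> {
    assert!(m > 0 && m <= n, "require 0 < m <= n");
    seq.iter()
        .skip(start)
        .enumerate()
        // position within the current cycle decides keep vs. skip
        .filter(|&(i, _)| i % n < m)
        .map(|(_, &base)| base)
        .collect()
}

/// Kmerize a frame as usual: every contiguous window of length `k`.
/// Returns no k-mers if the frame is shorter than `k`.
fn kmers(frame: &[u8], k: usize) -> Vec<&[u8]> {
    assert!(k > 0, "k must be positive");
    frame.windows(k).collect()
}
```

With this layout, the two skipmer types in this PR would correspond to calling `skipmer_frame` with `(m, n) = (1, 3)` or `(2, 3)`; how many start offsets and strands are used per sequence is left out of the sketch.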

Since this changes the SeqToHashes strategy a bit, there's one python test where we now see a different error (modified).

Skipmer References:


codspeed-hq bot commented Nov 13, 2024

CodSpeed Performance Report

Merging #3395 will not alter performance

Comparing try-skipmers (e077599) with latest (b69c960)

Summary

✅ 21 untouched benchmarks


codecov bot commented Nov 13, 2024

Codecov Report

Attention: Patch coverage is 83.82353% with 22 lines in your changes missing coverage. Please review.

Project coverage is 86.35%. Comparing base (b69c960) to head (e077599).
Report is 1 commit behind head on latest.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/core/src/signature.rs | 87.37% | 13 Missing ⚠️ |
| src/core/src/ffi/minhash.rs | 0.00% | 5 Missing ⚠️ |
| src/core/src/encodings.rs | 66.66% | 2 Missing ⚠️ |
| src/core/src/sketch/minhash.rs | 0.00% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           latest    #3395      +/-   ##
==========================================
- Coverage   86.39%   86.35%   -0.04%     
==========================================
  Files         137      137              
  Lines       16125    16195      +70     
  Branches     2219     2219              
==========================================
+ Hits        13931    13985      +54     
- Misses       1887     1903      +16     
  Partials      307      307              

| Flag | Coverage Δ |
|---|---|
| hypothesis-py | 25.43% <ø> (ø) |
| python | 92.40% <ø> (ø) |
| rust | 62.44% <83.82%> (+0.32%) ⬆️ |


@mr-eyes
Member

mr-eyes commented Nov 19, 2024

Hi Tessa,
Will you be allowing user-defined n, m, k here? And will you decide to construct the skipmer after accepting the hash value range, or will you construct all skipmers, then hash them and either accept or skip?

@bluegenes
Contributor Author

bluegenes commented Nov 20, 2024

> Hi Tessa, Will you be allowing user-defined n, m, k here?

Hi Mo! At the moment I just got the basics working, and am testing it out over in branchwater. Is there a strong reason for flexible n,m?

> And will you decide to construct the skipmer after accepting the hash value range, or will you construct all skipmers, then hash them and either accept or skip?

By "hash value range" do you mean the FracMinHash selection process (i.e. max hash)? I'm just using the SeqToHashes approach, so my reading of it is that we construct all, then add if it meets the threshold.
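
To make the "construct all, then add if it meets the threshold" reading concrete, here is a small sketch. It assumes the common FracMinHash convention that a `scaled` value keeps roughly 1/scaled of the 64-bit hash space; the exact rounding used by sourmash may differ, and both function names are illustrative only.

```rust
/// Largest hash value that is kept for a given `scaled`.
/// Assumption: the threshold is u64::MAX / scaled (integer division);
/// the real implementation may round differently.
fn max_hash_for_scaled(scaled: u64) -> u64 {
    assert!(scaled > 0, "scaled must be positive");
    u64::MAX / scaled
}

/// Every k-mer (or skipmer) is hashed first; only afterwards are hashes
/// at or below the threshold added to the sketch.
fn select_hashes(all_hashes: impl IntoIterator<Item = u64>, max_hash: u64) -> Vec<u64> {
    all_hashes.into_iter().filter(|&h| h <= max_hash).collect()
}
```

Under this scheme the cost of building and hashing every skipmer is paid regardless of `scaled`, which is the slowdown concern raised in this thread.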

Do you have something more efficient/flexible already implemented? And/or what things do you think would be useful here?

@mr-eyes
Member

mr-eyes commented Nov 20, 2024

> Hi Mo! At the moment I just got the basics working, and am testing it out over in branchwater. Is there a strong reason for flexible n,m?

Not a strong one. Adding skipmers to sourmash sketching is an excellent addition. So, having the flexibility to change n,m, and k would be good for changing the dispersity/contiguity of the extracted skipmers and, therefore, helping in different applications.

> By "hash value range" do you mean the FracMinHash selection process (i.e. max hash)? I'm just using the SeqToHashes approach, so my reading of it is that we construct all, then add if it meets the threshold.

Gotcha! Just expect that small n,m will have a noticeable slowdown in sketching time.

> Do you have something more efficient/flexible already implemented? And/or what things do you think would be useful here?

Not really flexible in that context, but you might find the skipmers implementation in kmerDecoder helpful. https://github.com/dib-lab/kmerDecoder/blob/master/src/KD_skipmers.cpp
and this very old example: https://github.com/mr-eyes/OLD_kmerDecoder/blob/d5eb475875ecbe1f3440e1448a56a5ab3b1984fc/python_preview/skipmers.ipynb

@bluegenes
Contributor Author

bluegenes commented Nov 21, 2024

> > Hi Mo! At the moment I just got the basics working, and am testing it out over in branchwater. Is there a strong reason for flexible n,m?
>
> Not a strong one. Adding skipmers to sourmash sketching is an excellent addition. So, having the flexibility to change n,m, and k would be good for changing the dispersity/contiguity of the extracted skipmers and, therefore, helping in different applications.

Got it. I think this shouldn't be too hard if we get a good implementation in, and we could have users specify m= and n= in the param string. I think I'll probably leave this to the future, but I can try to add m,n variables in to make future changes easier.

> > By "hash value range" do you mean the FracMinHash selection process (i.e. max hash)? I'm just using the SeqToHashes approach, so my reading of it is that we construct all, then add if it meets the threshold.
>
> Gotcha! Just expect that small n,m will have a noticeable slowdown in sketching time.

Good point. I haven't done any thinking about optimization yet.

> > Do you have something more efficient/flexible already implemented? And/or what things do you think would be useful here?
>
> Not really flexible in that context, but you might find the skipmers implementation in kmerDecoder helpful. https://github.com/dib-lab/kmerDecoder/blob/master/src/KD_skipmers.cpp and this very old example: https://github.com/mr-eyes/OLD_kmerDecoder/blob/d5eb475875ecbe1f3440e1448a56a5ab3b1984fc/python_preview/skipmers.ipynb

After reading your implementation, it seems that the main difference is that you take the entire sequence and skipmerize it (remove the skipped bases), then take k-mers/hashes from that sequence as usual. Is that right? Was that a pretty significant speedup compared with just generating skipmers as you go?

@mr-eyes
Member

mr-eyes commented Nov 21, 2024

> Got it. I think this shouldn't be too hard if we get a good implementation in, and we could have users specify m= and n= in the param string. I think I'll probably leave this to the future, but I can try to add m,n variables in to make future changes easier.

Parameterizing it should be an ideal solution, yes! Thank you!

> After reading your implementation, it seems that the main difference is that you take the entire sequence and skipmerize it (remove the skipped bases), then take k-mers/hashes from that sequence as usual. Is that right? Was that a pretty significant speedup compared with just generating skipmers as you go?

I haven't documented any benchmark here, but I believe I did it that way for performance.

@bluegenes
Contributor Author

bluegenes commented Nov 21, 2024

> Parameterizing it should be an ideal solution, yes! Thank you!

Hey Mo! Reading through the 2017 skipmer paper again, they note that triplet (n=3) skipmer patterns performed best, namely m=2,n=3 and m=1,n=3. Do you have a good argument for allowing more patterns than that?

I'm using the hashfunctions (moltype) enum to build sketches and ensure that only compatible sketches are comparable down the road. We don't want incompatible skipmer sketches to be compared. I think unless we currently have evidence to show other combos are useful, I may just enable these two and make them two enums, e.g. Murmur64Skipm2n3 and Murmur64Skipm1n3 or similar. I don't think there's any reason we couldn't add more later.
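
As a rough illustration of the enum-based approach, the sketch below uses the two variant names proposed above plus a DNA variant, and refuses to compare sketches whose hash function or k-mer size differ. The `SketchParams` struct, `skipmer_params`, and `check_compatible` are hypothetical names for this example and are not taken from the sourmash codebase.

```rust
/// Hash function / molecule type of a sketch. One variant per supported
/// skipmer pattern, so incompatible patterns can never compare equal.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HashFunctions {
    Murmur64Dna,
    Murmur64Skipm1n3,
    Murmur64Skipm2n3,
}

impl HashFunctions {
    /// Returns `Some((m, n))` for skipmer variants, `None` otherwise.
    fn skipmer_params(self) -> Option<(usize, usize)> {
        match self {
            HashFunctions::Murmur64Skipm1n3 => Some((1, 3)),
            HashFunctions::Murmur64Skipm2n3 => Some((2, 3)),
            HashFunctions::Murmur64Dna => None,
        }
    }
}

/// Hypothetical minimal description of a sketch for this example.
#[derive(Debug, Clone, Copy)]
struct SketchParams {
    hash_function: HashFunctions,
    ksize: u32,
}

/// Two sketches are comparable only if they share hash function and ksize.
fn check_compatible(a: &SketchParams, b: &SketchParams) -> Result<(), String> {
    if a.hash_function != b.hash_function {
        return Err(format!(
            "mismatched hash functions: {:?} vs {:?}",
            a.hash_function, b.hash_function
        ));
    }
    if a.ksize != b.ksize {
        return Err(format!("mismatched ksize: {} vs {}", a.ksize, b.ksize));
    }
    Ok(())
}
```

Adding another pattern later would mean adding a variant and one `match` arm, which fits the "we could add more later" plan.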

Open to other ideas, though!

@mr-eyes
Member

mr-eyes commented Nov 21, 2024

> > Parameterizing it should be an ideal solution, yes! Thank you!
>
> Hey Mo! Reading through the 2017 skipmer paper again, they note that triplet (n=3) skipmer patterns performed best, namely m=2,n=3 and m=1,n=3. Do you have a good argument for allowing more patterns than that?
>
> I'm using the hashfunctions (moltype) enum to build sketches and ensure that only compatible sketches are comparable down the road. We don't want incompatible skipmer sketches to be compared. I think unless we currently have evidence to show other combos are useful, I may just enable these two and make them two enums, e.g. Murmur64Skipm2n3 and Murmur64Skipm1n3 or similar. I don't think there's any reason we couldn't add more later.
>
> Open to other ideas, though!

I don't really have a use case in mind for different configurations. So this is good enough for the implementation, and as you said, we could add more later if needed.

Make skipmers robust, but keep #3395 functional in the meantime.

This PR:
- enables a second skipmer type, so we have m1n3 in addition to m2n3
- switches to a reading frame approach for both translation + skipmers,
which means we first build the reading frame, then kmerize, rather than
building kmers + translating/skipping on the fly
- avoids "extended length" needed for skipping on the fly
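
The "extended length" is not defined in this thread. One plausible reading, stated here as an assumption, is the number of original bases that a single skipmer of `k` kept bases spans when skipping is done on the fly. Under the keep-`m`-of-`n` pattern, that span can be computed as follows (the function name is hypothetical):

```rust
/// Number of consecutive original bases covered by one skipmer that keeps
/// `k` bases under a keep-`m`-of-`n` pattern starting at a cycle boundary.
/// The j-th kept base (0-indexed) sits at offset (j / m) * n + (j % m).
fn skipmer_span(k: usize, m: usize, n: usize) -> usize {
    assert!(k > 0 && m > 0 && m <= n, "require k > 0 and 0 < m <= n");
    let last = k - 1; // index of the final kept base
    (last / m) * n + (last % m) + 1
}
```

For example, with k = 4 and m2n3 the kept offsets are 0, 1, 3, 4, so the span is 5 bases. When the reading frame is built first, the k-mer window is taken directly from the already-skipped frame and no such span needs to be tracked.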

Since this changes the `SeqToHashes` strategy a bit, there's one python
test where we now see a different error.

Future thoughts:
- with the new structure, it would be straightforward to add validation
to exclude protein k-mers with invalid amino acids (`X`). I guess I'm
not entirely sure what happens to those atm...
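
A possible shape for that validation, sketched under the assumption that a translated reading frame is available as a byte slice of amino acid codes. This is not current sourmash behavior, and `valid_protein_kmers` is a hypothetical name.

```rust
/// Return the protein k-mers of a translated frame, excluding any k-mer
/// that contains the ambiguous/invalid amino acid code `X`.
fn valid_protein_kmers(frame: &[u8], k: usize) -> Vec<&[u8]> {
    assert!(k > 0, "k must be positive");
    frame
        .windows(k)
        .filter(|kmer| !kmer.contains(&b'X'))
        .collect()
}
```

Because the frame is built before kmerizing, the check is a single filter over windows rather than something threaded through on-the-fly translation.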
@bluegenes bluegenes changed the title from "EXP: skipmers" to "MRG: add skipmers; switch to reading frame approach for translation, skipmers" Dec 12, 2024
@bluegenes
Contributor Author

@ctb @mr-eyes @luizirber ready for review.

@ctb ctb added the rust label Dec 13, 2024
@ctb
Contributor

ctb commented Dec 13, 2024

On a first pass, looks good to me! I'm not thrilled with the Murmur64Skipm1n3 abbreviation style and would prefer something longer, but I don't have good suggestions and am not particularly against it, either; so, if you had a longer option in mind that you like, please consider it :)

Now I'm curious how hard it would be to add these to the Python layer 🤔

@ctb
Contributor

ctb commented Dec 16, 2024

@luizirber any concerns, at least at a hot-take level?

@bluegenes
Contributor Author

ref #659 -- I think it might need minor modification for this...

@luizirber luizirber self-requested a review December 19, 2024 21:59
Member

@luizirber luizirber left a comment


Awesome! Some minor comments, but thanks for all the test coverage increases =]

Comment on lines +20 to +21
- dependency-name: "js-sys"
- dependency-name: "web-sys"
Member


Are these off-target? I'm guessing it's because they generate complications downstream on the plugins when updated?

Contributor Author


yes, updating these prevented installing with the branchwater plugin :/

src/core/src/signature.rs Outdated Show resolved Hide resolved
@bluegenes bluegenes merged commit 5680efc into latest Dec 20, 2024
41 of 44 checks passed
@bluegenes bluegenes deleted the try-skipmers branch December 20, 2024 17:47
ctb added a commit that referenced this pull request Dec 20, 2024
## [0.18.0] - 2024-12-20

MSRV: 1.66

Changes/additions:

* add skipmer capacity to sourmash python layer via ffi (#3446)
* add skipmers; switch to reading frame approach for translation,
skipmers (#3395)
* Refactor: Use to_writer/from_reader across the codebase (#3443)
* adjust `Signature::name()` to return `Option<String>` instead of
`filename()` and `md5sum()` (#3434)
* propagate zipfile errors (#3431)

Updates:

* Bump proptest from 1.5.0 to 1.6.0 (#3437)
* Bump roaring from 0.10.8 to 0.10.9 (#3438)
* Bump serde from 1.0.215 to 1.0.216 (#3436)
* Bump statrs from 0.17.1 to 0.18.0 (#3426)
* Bump roaring from 0.10.7 to 0.10.8 (#3423)
* Bump needletail from 0.6.0 to 0.6.1 (#3427)
* Bump web-sys from 0.3.72 to 0.3.74 (#3411)
* Bump js-sys from 0.3.72 to 0.3.74 (#3412)
* Bump roaring from 0.10.6 to 0.10.7 (#3413)
* Bump serde_json from 1.0.132 to 1.0.133 (#3402)
* Bump serde from 1.0.214 to 1.0.215 (#3403)
@ctb ctb mentioned this pull request Jan 11, 2025
ctb added a commit that referenced this pull request Jan 11, 2025
Release issue: #3481

----

NOTE: This release adds basic support for skipmers, but they are not
yet fully supported.

Minor new features:

* add genbank plant db to docs (#3429)
* add skipmer capacity to sourmash python layer via ffi (#3446)
* add skipmers; switch to reading frame approach for translation,
skipmers (#3395)
* additional moltype specification needed for `sig downsample` with
skipmers (#3457)
* update with misc animal genomes (#3422)

Cleanup and documentation updates:

* add comment about semver and column headings (#3433)

Developer updates:

* Deps: update to rocksdb 0.23 (#3456)
* Refactor: Use to_writer/from_reader across the codebase (#3443)
* adjust `Signature::name()` to return `Option<String>` instead of
`filename()` and `md5sum()` (#3434)
* bump version to 4.8.13-dev (#3474)
* fix comment in _set_num_scaled (#3451)
* propagate zipfile errors (#3431)
* update rust CHANGELOG in preparation for r0.18.0 (#3450)
* CI: github actions updates (#3476)

Dependabot updates:

* Bump itertools from 0.13.0 to 0.14.0 (#3471)
* Bump needletail from 0.6.0 to 0.6.1 (#3427)
* Bump proptest from 1.5.0 to 1.6.0 (#3437)
* Bump roaring from 0.10.7 to 0.10.8 (#3423)
* Bump roaring from 0.10.8 to 0.10.9 (#3438)
* Bump serde from 1.0.215 to 1.0.216 (#3436)
* Bump serde from 1.0.216 to 1.0.217 (#3464)
* Bump serde_json from 1.0.133 to 1.0.134 (#3453)
* Bump statrs from 0.17.1 to 0.18.0 (#3426)
* Bump tempfile from 3.14.0 to 3.15.0 (#3472)
* Bump thiserror from 2.0.3 to 2.0.6 (#3425)
* Bump thiserror from 2.0.6 to 2.0.7 (#3435)
* Bump thiserror from 2.0.7 to 2.0.8 (#3448)
* Bump thiserror from 2.0.8 to 2.0.9 (#3452)
* Update maturin requirement from <1.8.0,>=1 to >=1,<1.9.0 (#3465)
* [pre-commit.ci] pre-commit autoupdate (#3428)
* [pre-commit.ci] pre-commit autoupdate (#3439)
* [pre-commit.ci] pre-commit autoupdate (#3454)
* [pre-commit.ci] pre-commit autoupdate (#3473)