MRG: add `kmers_and_hashes` method to get canonical kmers + hashes #40

ctb · 2024-09-18T15:11:03Z

Tackles #21.

This PR:

* makes `hash_kmer` accessible to Python;
* provides a method `kmers_and_hashes` that returns a list of `(canonical_kmer, hashval)` tuples without modifying the HashMap;
* adds appropriate tests.

ctb · 2024-09-18T15:46:54Z

@Adamtaranto I am not attached to this code or implementation. Please feel free to munge at will.

src/lib.rs

Adamtaranto · 2024-09-24T09:21:25Z

@ctb What about making the kmer_index attribute of SeqToHashes accessible via SeqToHashes.get_kmer_index?

Something like:

pub fn get_kmer_index(&self) -> usize {
    self.kmer_index
}

That way we wouldn't have to worry about keeping seq slices and hashes in sync, we could just extract the seq slice from wherever the hasher is up to.
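To make the suggestion above concrete, here is a minimal sketch of how a caller could use such an accessor. `ToyHasher`, its FNV-style hash, and `kmers_with_hashes` are hypothetical stand-ins written for this illustration; they are not sourmash's `SeqToHashes`, and the sketch assumes `kmer_index` holds the start offset of the next k-mer to be hashed and that the input is ASCII.

```rust
/// Hypothetical stand-in for a k-mer hasher; NOT sourmash's SeqToHashes.
pub struct ToyHasher<'a> {
    seq: &'a [u8],
    ksize: usize,
    kmer_index: usize, // assumed: start offset of the next k-mer to hash
}

impl<'a> ToyHasher<'a> {
    pub fn new(seq: &'a [u8], ksize: usize) -> Self {
        ToyHasher { seq, ksize, kmer_index: 0 }
    }

    /// The accessor proposed in the comment above.
    pub fn get_kmer_index(&self) -> usize {
        self.kmer_index
    }
}

impl<'a> Iterator for ToyHasher<'a> {
    type Item = u64;

    fn next(&mut self) -> Option<u64> {
        if self.ksize == 0 || self.kmer_index + self.ksize > self.seq.len() {
            return None;
        }
        let kmer = &self.seq[self.kmer_index..self.kmer_index + self.ksize];
        self.kmer_index += 1;
        // Toy FNV-1a hash, used only so the example is self-contained.
        let mut h: u64 = 0xcbf29ce484222325;
        for &b in kmer {
            h ^= b as u64;
            h = h.wrapping_mul(0x100000001b3);
        }
        Some(h)
    }
}

/// Pair each hash with the slice the hasher was positioned at when it
/// produced that hash, instead of keeping a separate loop counter.
pub fn kmers_with_hashes(seq: &str, ksize: usize) -> Vec<(String, u64)> {
    let bytes = seq.as_bytes();
    let mut hasher = ToyHasher::new(bytes, ksize);
    let mut out = Vec::new();
    loop {
        // Read the position before advancing the hasher.
        let start = hasher.get_kmer_index();
        match hasher.next() {
            Some(h) => {
                let kmer = std::str::from_utf8(&bytes[start..start + ksize])
                    .expect("sketch assumes ASCII input");
                out.push((kmer.to_string(), h));
            }
            None => break,
        }
    }
    out
}
```

The slice offset comes from the hasher itself, so the caller does not maintain its own counter that could drift from the hasher's position.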


ctb · 2024-09-25T13:55:27Z

Ready for review! The code works and is well tested, I think, but could be improved in a few ways. Still, it's at a good resting point, so I'd like to suggest merging.

Adamtaranto

Looks good.

I'm in favour of outsourcing this to sourmash as a variant of SeqToHashes (SeqToHashesAndKmers?), so that we aren't doubling the seq copies in memory or looping over the seq twice.

Adamtaranto · 2024-09-26T04:23:52Z

src/lib.rs

+        let seq = seq.to_ascii_uppercase();
+        let seqb = seq.as_bytes();


Is the full contig/chromosome in memory twice at this point?

ctb · 2024-09-26

No, it's a view (via the reference, &; see as_bytes).

If you calculate it, the memory consumption of the underlying string is minimal compared to catastrophic expansion of it into k-mers, even for human-size chromosomes.
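A small sketch of the two points in this thread, under stated assumptions. The helper names `as_bytes_is_view` and `expanded_kmer_bytes` are invented for this illustration. As far as the standard library goes, `str::to_ascii_uppercase` returns a newly allocated `String`, while `as_bytes` only borrows that `String`'s buffer. The arithmetic counts only the k payload bytes per stored k-mer and ignores `String`/`Vec` overhead, so it is a lower bound.

```rust
/// Returns true when `as_bytes` points at the same buffer as the String
/// it was called on, i.e. it is a borrowed view and not a second copy.
pub fn as_bytes_is_view(seq: &str) -> bool {
    let upper = seq.to_ascii_uppercase(); // allocates one upper-cased String
    let bytes = upper.as_bytes(); // borrows `upper`; no new allocation
    bytes.as_ptr() == upper.as_ptr() && bytes.len() == upper.len()
}

/// Lower bound on the bytes needed to store every k-mer of a sequence as
/// its own owned string: (n - k + 1) k-mers of k bytes each.
pub fn expanded_kmer_bytes(seq_len: usize, ksize: usize) -> usize {
    if ksize == 0 || seq_len < ksize {
        return 0;
    }
    (seq_len - ksize + 1) * ksize
}
```

For a sequence much longer than k, the stored k-mers take roughly k times the space of the sequence itself, which is the expansion the reply refers to.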

Adamtaranto · 2024-09-26T04:32:15Z

src/lib.rs

+            let substr_b_rc = revcomp(&seqb[start..start + ksize]);
+            let substr_rc =
+                std::str::from_utf8(&substr_b_rc).expect("invalid utf-8 sequence for rev comp");
+            let hashval = hasher.next().expect("should not run out of hashes");
+
+            // Three options:
+            // * good kmer, all is well, store canonical k-mer and hashval;
+            // * bad k-mer allowed by skip_bad_kmers (signaled by
+            //   hashval == 0): return empty string & 0;
+            // * bad k-mer not allowed, raise error
+            if let Ok(hashval) = hashval {
+                if hashval > 0 {
+                    let canonical_kmer = if substr < substr_rc {
+                        substr
+                    } else {
+                        substr_rc
+                    };
+                    v.push((canonical_kmer.to_string(), hashval));


Does sourmash have a canonicalize function we can recycle?

Else pop out as own function to make this cleaner?

ctb · 2024-09-26

I didn't see anything in SeqToHashes, but will look again. Since that struct is old, and deals with more than just DNA k-mers, it may not have anything built in (and/or may not make it public).
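If the canonicalization were popped out into its own function, it might look like the sketch below. The name `canonical_kmer` is hypothetical, and the `revcomp` here is a self-contained stand-in written for this example (it maps any byte other than A, C, G, T to `N`); it is not necessarily the `revcomp` the PR calls. The comparison mirrors the `substr < substr_rc` logic in the diff above.

```rust
/// Reverse complement of a DNA sequence given as uppercase ASCII bytes.
/// Stand-in for illustration: bytes other than A, C, G, T become b'N'.
pub fn revcomp(seq: &[u8]) -> Vec<u8> {
    seq.iter()
        .rev()
        .map(|b| match b {
            b'A' => b'T',
            b'C' => b'G',
            b'G' => b'C',
            b'T' => b'A',
            _ => b'N',
        })
        .collect()
}

/// Return the lexicographically smaller of a k-mer and its reverse
/// complement, matching the `substr < substr_rc` choice in the diff.
pub fn canonical_kmer(kmer: &str) -> String {
    let rc = String::from_utf8(revcomp(kmer.as_bytes()))
        .expect("revcomp only emits ASCII bytes");
    if kmer < rc.as_str() {
        kmer.to_string()
    } else {
        rc
    }
}
```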

Adamtaranto · 2024-09-26T04:35:06Z

src/lib.rs

+                    };
+                    v.push((canonical_kmer.to_string(), hashval));
+                } else {
+                    v.push(("".to_owned(), 0));


It works, but I am uneasy about hoping that hasher output and seq slices stay in sync.

ctb · 2024-09-26

Agreed, but also, that's why we have tests :). With the way Rust behaves, and with reading the code, it's hard for me to find a situation where this misbehaves.

Medium term => definitely want to expose a better API. If we get the nicer code working here (iterator, in particular) it is easier to transplant to sourmash.

(and to be clear, this is something I've wanted for sourmash for a while - right now we extract the hashes and kmers in Python, which is unbearably slow! So definitely want this in the sourmash library. Just need it to be pretty general, e.g. including proteins; and that codebase is also more complex to modify and requires approval on the PR, which takes time.)
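One way the iterator idea discussed above could be sketched: zip the k-mer windows with the hasher output so that each pair is produced in a single step. The function name `kmers_and_hashes_iter` and the `Result<u64, String>` item type are assumptions made for this example; the real hasher's error type will differ. Canonicalization is left out to keep the sketch short. It follows the three cases listed in the code comment under review: a good hash, a skipped bad k-mer signaled by 0, and an error.

```rust
/// Lazily pair each k-mer window with the next hasher result.
/// `hashes` is any iterator of hash results (hypothetical error type: String).
pub fn kmers_and_hashes_iter<'a, H>(
    seq: &'a str,
    ksize: usize,
    hashes: H,
) -> impl Iterator<Item = Result<(&'a str, u64), String>> + 'a
where
    H: Iterator<Item = Result<u64, String>> + 'a,
{
    // `windows` panics on a size of 0, so fail early with a clear message.
    assert!(ksize > 0, "ksize must be at least 1");
    seq.as_bytes()
        .windows(ksize)
        .zip(hashes)
        .map(|(window, hashval)| match hashval {
            // Bad k-mer that was skipped: empty string and 0.
            Ok(0) => Ok(("", 0)),
            // Good k-mer: return the window as a &str with its hash.
            Ok(h) => std::str::from_utf8(window)
                .map(|kmer| (kmer, h))
                .map_err(|e| e.to_string()),
            // Bad k-mer that was not allowed: pass the error through.
            Err(e) => Err(e),
        })
}
```

Because `zip` advances both sides together, the window and the hash cannot get out of step within the adapter. It does stop silently at the shorter of the two inputs, so a length check would still be worth testing.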

Adamtaranto · 2024-09-26T04:42:26Z

I will try to add the following:

* Make kmers_and_hashes iterable
* Add hash2kmers hashmap
* Opt to track kmers
* Update consume to store kmers
* Update count to store kmers
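A rough sketch of what opt-in k-mer tracking with a hash-to-kmer map could look like. `ToyCounter`, `track_kmers`, and `kmer_for` are hypothetical names invented for this example and do not reflect oxli's actual structs or method signatures.

```rust
use std::collections::HashMap;

/// Hypothetical counter with optional hash -> k-mer tracking.
pub struct ToyCounter {
    counts: HashMap<u64, u64>,
    // None when tracking is off, so no k-mer strings are stored.
    hash2kmer: Option<HashMap<u64, String>>,
}

impl ToyCounter {
    pub fn new(track_kmers: bool) -> Self {
        ToyCounter {
            counts: HashMap::new(),
            hash2kmer: if track_kmers { Some(HashMap::new()) } else { None },
        }
    }

    /// Increment the count for `hashval`; remember the k-mer if tracking.
    pub fn count(&mut self, canonical_kmer: &str, hashval: u64) -> u64 {
        if let Some(map) = self.hash2kmer.as_mut() {
            // Only allocate the String the first time a hash is seen.
            map.entry(hashval)
                .or_insert_with(|| canonical_kmer.to_string());
        }
        let n = self.counts.entry(hashval).or_insert(0);
        *n += 1;
        *n
    }

    /// Look up the k-mer recorded for a hash, if tracking was enabled.
    pub fn kmer_for(&self, hashval: u64) -> Option<&str> {
        self.hash2kmer.as_ref()?.get(&hashval).map(|s| s.as_str())
    }
}
```

Keeping the map behind an `Option` means callers who only need counts pay no extra memory for k-mer strings.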

ctb · 2024-09-26T13:22:49Z

If you are OK with the function signature, I'd suggest merging and then refactoring internally; generally my preference is to get the public API working and tested, and only then care about speed :). But your call!

ctb · 2024-09-26T14:02:42Z

ref sourmash-bio/sourmash#3339

ctb added 2 commits September 18, 2024 08:10

fn for kmers_and_hashes

e779695

finish basic implementation

8770d9e

ctb mentioned this pull request Sep 18, 2024

Store hash to kmer map #21

Closed

Adamtaranto reviewed Sep 19, 2024

src/lib.rs (resolved)

Adamtaranto reviewed Sep 19, 2024

src/lib.rs (outdated, resolved)

Adamtaranto reviewed Sep 19, 2024

src/lib.rs (resolved)

Adamtaranto and others added 4 commits September 23, 2024 19:47

Merge branch 'main' into add_kmers_and_hashes

39af0c8

cargo fmt

0f2c535

Merge branch 'main' of github.com:dib-lab/oxli into add_kmers_and_hashes

3e3125d

Merge branch 'main' into add_kmers_and_hashes

55d278b

ctb and others added 12 commits September 24, 2024 12:11

basic tests

bf83c04

much improved code

49bb3b6

Merge branch 'main' into add_kmers_and_hashes

dabd3a5

Merge branch 'add_kmers_and_hashes' of github.com:dib-lab/oxli into add_kmers_and_hashes

a8a07be

Style fixes by Ruff

95a5965

Merge branch 'main' of github.com:dib-lab/oxli into add_kmers_and_hashes

0833c1f

rename allow_bad_kmers to skip_bad_kmers

12ab55b

expand & test error handling

2472659

Style fixes by Ruff

ec51e8b

add more explicit test for kmers <=> hash vals in kmers_and_hashes

91a1260

Merge branch 'add_kmers_and_hashes' of github.com:dib-lab/oxli into add_kmers_and_hashes

529a3a5

cargo fmt

6ad6e99

ctb changed the title ~~WIP: add code to get kmers + hashes~~ MRG: add kmers_and_hashes method to get canonical kmers + hashes Sep 25, 2024

ctb requested a review from Adamtaranto September 25, 2024 13:51

ctb mentioned this pull request Sep 25, 2024

propose updates to sourmash to make kmers_and_hashes code less cumbersome #66

Open

Adamtaranto reviewed Sep 26, 2024

Adamtaranto self-requested a review September 26, 2024 13:46

Adamtaranto approved these changes Sep 26, 2024

ctb merged commit eab7a5e into main Sep 26, 2024
18 checks passed

ctb deleted the add_kmers_and_hashes branch September 26, 2024 13:51

ctb mentioned this pull request Sep 26, 2024

returning more & richer information from rust SeqToHash struct/iterator sourmash-bio/sourmash#3339

Open
