-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MRG: add kmers_and_hashes
method to get canonical kmers + hashes
#40
Conversation
@Adamtaranto I am not attached to this code or implementation. Please feel free to munge at will. |
@ctb what about making the something like: pub fn get_kmer_index(&self) -> usize {
self.kmer_index
} That way we wouldn't have to worry about keeping seq slices and hashes in sync, we could just extract the seq slice from wherever the hasher is up to. |
…dd_kmers_and_hashes
…dd_kmers_and_hashes
kmers_and_hashes
method to get canonical kmers + hashes
Ready for review! The code works and is well tested, I think, but could be improved in a few ways. Still it's at a good resting point so would like to suggest merging. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
I'm in favour of outsourcing this to sourmash as a variant of SeqToHashes
(SeqToHashesAndKmers
?) So that we aren't doubling the seq copies in memory / looping over the seq twice.
let seq = seq.to_ascii_uppercase(); | ||
let seqb = seq.as_bytes(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the full contig/chromosome in memory twice at this point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, it's a view (via the reference, &
, see as_bytes).
If you calculate it, the memory consumption of the underlying string is minimal compared to catastrophic expansion of it into k-mers, even for human-size chromosomes.
let substr_b_rc = revcomp(&seqb[start..start + ksize]); | ||
let substr_rc = | ||
std::str::from_utf8(&substr_b_rc).expect("invalid utf-8 sequence for rev comp"); | ||
let hashval = hasher.next().expect("should not run out of hashes"); | ||
|
||
// Three options: | ||
// * good kmer, all is well, store canonical k-mer and hashval; | ||
// * bad k-mer allowed by skip_bad_kmers, and signaled by | ||
// hashval == 0): return empty string & 0; | ||
// * bad k-mer not allowed, raise error | ||
if let Ok(hashval) = hashval { | ||
if hashval > 0 { | ||
let canonical_kmer = if substr < substr_rc { | ||
substr | ||
} else { | ||
substr_rc | ||
}; | ||
v.push((canonical_kmer.to_string(), hashval)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does sourmash have a canonicalize function we can recycle?
Else pop out as own function to make this cleaner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't see anything in SeqToHashes
, but will look again. Since that struct is old, and deals with more than just DNA k-mers, it may not have anything built in (and/or may not make it public).
}; | ||
v.push((canonical_kmer.to_string(), hashval)); | ||
} else { | ||
v.push(("".to_owned(), 0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It works, but I am uneasy about hoping that hasher output and seq slices stay in sync.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, but also, that's why we have tests :). With the way Rust behaves, and with reading the code, it's hard for me to find a situation where this misbehaves.
Medium term => definitely want to expose a better API. If we get the nicer code working here (iterator, in particular) it is easier to transplant to sourmash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(and to be clear, this is something I've wanted for sourmash for a while - right now we extract the hashes and kmers in Python, which is unbearably slow! So definitely want this in the sourmash library. Just need it to be pretty general, e.g. including proteins; and that codebase is also more complex to modify and requires approval on the PR, which takes time.)
I will try to add the following:
|
If you are OK with the function signature, I'd suggest merging and then refactoring internally; generally my preference is to get the public API working and tested, and only then care about speed :). But your call! |
Tackles #21.
This PR:
hash_kmer
accessible to Python;kmers_and_hashes
that returns a list of(canonical_kmer, hashval)
tuples without modifying the HashMap;