Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MRG: add kmers_and_hashes method to get canonical kmers + hashes #40

Merged
merged 18 commits into from
Sep 26, 2024

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Sep 18, 2024

Tackles #21.

This PR:

  • makes hash_kmer accessible to Python;
  • provides a method kmers_and_hashes that returns a list of (canonical_kmer, hashval) tuples without modifying the HashMap;
  • adds appropriate tests.

@ctb ctb mentioned this pull request Sep 18, 2024
@ctb
Copy link
Contributor Author

ctb commented Sep 18, 2024

@Adamtaranto I am not attached to this code or implementation. Please feel free to munge at will.

src/lib.rs Outdated Show resolved Hide resolved
@Adamtaranto
Copy link
Collaborator

@ctb what about making the kmer_index attribute of SeqToHashes accessible via SeqToHashes.get_kmer_index ?

something like:

pub fn get_kmer_index(&self) -> usize {
        self.kmer_index
    }

That way we wouldn't have to worry about keeping seq slices and hashes in sync, we could just extract the seq slice from wherever the hasher is up to.

@ctb ctb changed the title WIP: add code to get kmers + hashes MRG: add kmers_and_hashes method to get canonical kmers + hashes Sep 25, 2024
@ctb ctb requested a review from Adamtaranto September 25, 2024 13:51
@ctb
Copy link
Contributor Author

ctb commented Sep 25, 2024

Ready for review! The code works and is well tested, I think, but could be improved in a few ways. Still it's at a good resting point so would like to suggest merging.

Copy link
Collaborator

@Adamtaranto Adamtaranto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

I'm in favour of outsourcing this to sourmash as a variant of SeqToHashes (SeqToHashesAndKmers?) So that we aren't doubling the seq copies in memory / looping over the seq twice.

Comment on lines +514 to +515
let seq = seq.to_ascii_uppercase();
let seqb = seq.as_bytes();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the full contig/chromosome in memory twice at this point?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, it's a view (via the reference, &, see as_bytes).

If you calculate it, the memory consumption of the underlying string is minimal compared to catastrophic expansion of it into k-mers, even for human-size chromosomes.

Comment on lines +535 to +552
let substr_b_rc = revcomp(&seqb[start..start + ksize]);
let substr_rc =
std::str::from_utf8(&substr_b_rc).expect("invalid utf-8 sequence for rev comp");
let hashval = hasher.next().expect("should not run out of hashes");

// Three options:
// * good kmer, all is well, store canonical k-mer and hashval;
// * bad k-mer allowed by skip_bad_kmers, and signaled by
// hashval == 0): return empty string & 0;
// * bad k-mer not allowed, raise error
if let Ok(hashval) = hashval {
if hashval > 0 {
let canonical_kmer = if substr < substr_rc {
substr
} else {
substr_rc
};
v.push((canonical_kmer.to_string(), hashval));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does sourmash have a canonicalize function we can recycle?

Else pop out as own function to make this cleaner?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see anything in SeqToHashes, but will look again. Since that struct is old, and deals with more than just DNA k-mers, it may not have anything built in (and/or may not make it public).

};
v.push((canonical_kmer.to_string(), hashval));
} else {
v.push(("".to_owned(), 0));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works, but I am uneasy about hoping that hasher output and seq slices stay in sync.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, but also, that's why we have tests :). With the way Rust behaves, and with reading the code, it's hard for me to find a situation where this misbehaves.

Medium term => definitely want to expose a better API. If we get the nicer code working here (iterator, in particular) it is easier to transplant to sourmash.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(and to be clear, this is something I've wanted for sourmash for a while - right now we extract the hashes and kmers in Python, which is unbearably slow! So definitely want this in the sourmash library. Just need it to be pretty general, e.g. including proteins; and that codebase is also more complex to modify and requires approval on the PR, which takes time.)

@Adamtaranto
Copy link
Collaborator

I will try to add the following:

  • Make kmers_and_hashes iterable
  • Add hash2kmers hashmap
  • opt to track kmers
  • update consume to store kmers
  • update count to store kmers

@ctb
Copy link
Contributor Author

ctb commented Sep 26, 2024

If you are OK with the function signature, I'd suggest merging and then refactoring internally; generally my preference is to get the public API working and tested, and only then care about speed :). But your call!

@Adamtaranto Adamtaranto self-requested a review September 26, 2024 13:46
@ctb ctb merged commit eab7a5e into main Sep 26, 2024
18 checks passed
@ctb ctb deleted the add_kmers_and_hashes branch September 26, 2024 13:51
@ctb
Copy link
Contributor Author

ctb commented Sep 26, 2024

ref sourmash-bio/sourmash#3339

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants