Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store hash to kmer map #21

Closed
Adamtaranto opened this issue Sep 10, 2024 · 9 comments · Fixed by #70
Closed

Store hash to kmer map #21

Adamtaranto opened this issue Sep 10, 2024 · 9 comments · Fixed by #70
Assignees
Labels
enhancement New feature or request

Comments

@Adamtaranto
Copy link
Collaborator

A common use of kmer counters is to dump kmers and their counts to a text file that can be used in other tools. To do this we need to track hash to kmer pairs.

Suggested solution:

Init KmerCountTable objects with a second hashmap that maps hashes to their corresponding canonical kmer sequence.
Add new hash:kmer pairs as kmers are consumed.

To save time and space we could disable this feature by default.

# Enable kmer tracking
new_table = oxli.KmerCountTable(ksize=3, raw=True)

Alternative solutions:

  • Could we just use the canonical kmers as keys?

Issues:

  • If a kmer is added with .count_hash then the kmer sequence is unknown. (maybe just report unpaired hashed as warning when retrieving kmers?)
  • If adding KmerCountTables together we would need to also add hash:kmer pairs to the new object (or just add all new hashes into the output table?)
  • Tables might get large if consuming large noisy datasets with a high kmer length. (Make kmer tracking optional?)

@ctb thoughts?

@Adamtaranto Adamtaranto added the enhancement New feature or request label Sep 10, 2024
@Adamtaranto
Copy link
Collaborator Author

Kmer extraction from strings in the Consume and count functions is handled by sourmash::signature::SeqToHashes.

To avoid duplication in kmer extraction, it might be better to pull that function into this project and modify it to return a (kmer,hash) pair. Then we can optionally store the hash:kmer mapping as a second table in the KmerCountTable class.

@ctb if that sounds sensible, would you mind stripping down the signature module from sourmash so it just has the bits needed for SeqToHashes? I think I can handle updating the output and storing the pairs in Oxli.

@ctb
Copy link
Contributor

ctb commented Sep 13, 2024

What about having a second class that handles this, and wraps the first class invisibly? We could support the same API and provide the same tests for many situations. Then users could choose.

It should (in my experience) be a completely optional and generally discouraged feature, because it will consume a lot of memory and people think they need it far more than they do 😜

I am somewhat against the code duplication involved in extracting code from sourmash, largely because that project is well tested and well maintained and I don't want to duplicate that work ;). Can always revisit in future if needs become divergent (although then I might advocate for adding necessary functionality to sourmash...)

Note that sourmash also supports various translated and protein k-mer hashing, which we aren't making use of yet, but could.

@Adamtaranto
Copy link
Collaborator Author

I like the idea of wrapping SeqToHashes in a second class and tacking the kmer on to the end. I can see how to do that for single kmers (i.e. in hash_kmer()) but not sure how it would work when processing long sequences with consume().

Agree that kmer tracking should be disabled by default.

@ctb
Copy link
Contributor

ctb commented Sep 18, 2024

kmers_and_hashes here, https://github.com/sourmash-bio/sourmash/blob/26b50f3e3566006fd6356a4f8b4d47c5e381aeec/src/sourmash/minhash.py#L393, shows how to build the associated set of k-mers from a sequence.

I'd like to suggest we implement that in Rust in oxli, and then I can backport it into sourmash. I think my rust fu is up to it, finally ;). Thoughts?

@ctb
Copy link
Contributor

ctb commented Sep 18, 2024

rough draft in #40.

Note that because of the DNA reverse-comp canonicalization performed by the hashing, we only have L-K+1 hashes and hence the current code only emits "forward" k-mers, from the top strand. There is probably value in supporting reverse complements too. The code to do so is in there, just commented out.

The least confusing thing to do is to leave it as-is, but we might want to support other use cases.

One option: support revcomp via a bool option, so that L-K+1 tuples of (kmer_rc, hashval) are returned.

Another option: optionally or always return both forward and reverse complement k-mers, with the corresponding canonical hashes.

Another option: support non-canonical k-mer calculation in this method. I suspect that will be confusing to users.

Thoughts?

@Adamtaranto Adamtaranto self-assigned this Sep 19, 2024
@Adamtaranto
Copy link
Collaborator Author

I'd expect output to be (canonical kmer, hash).

Use something like this:

# Import oxli
from oxli import KmerCountTable

# New table
kct = KmerCountTable(ksize=3, raw=True)

# Check hash for  AAA / TTT
kct.hash_kmer("TTT")
#>>>10679328328772601858
kct.hash_kmer("AAA")
#>>>10679328328772601858

# Count kmer
kct.count("TTT")
#>>> 1

# Lookup kmer using hash, return canonical kmer
kct.unhash(10679328328772601858)
#>>> "AAA"

@ctb
Copy link
Contributor

ctb commented Sep 19, 2024

perfect, thanks!

@ctb
Copy link
Contributor

ctb commented Sep 25, 2024

#40 is up for review - it doesn't actually implement the full hash to kmer map, but it provides the necessary code in a method kmers_and_hashes.

I think next steps could/would include:

  • turn the Rust code into an iterator so it can be reused for this purpose;
  • making the code faster by only computing revcomp once;

but, for now, I feel like #40 is a solid step forward and could usefully be merged.

@Adamtaranto
Copy link
Collaborator Author

Cool, I will try make it more iteratory.

Will also see about integrating it into the count and consume functions to populate a hash:kmer map if raw option is set when creating the KmerCountTable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants