-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
implement a kmers()
method
#74
Comments
Hi Joe, A kmer-tracking feature is in the works in PR #70 that will store the canonical k-mers corresponding to hashes. I've got the functionality down but it is in need of optimisation (currently much slower than just counting with hashes) - so khmer may be faster atm. Usage would be something like: from oxli import KmerCountTable
def getKmersThatAppearOnce(seq:str, k:int) -> set[str]:
"""gets all the canonical k-mers that appear exactly once in a sequence
Args:
seq (str): the sequence to get kmers for
k (int): the kmer length
Returns:
set[str]: a set of all canonical k-mers that appear exactly once in the sequence
"""
# Init new count table with k-mer tracking enabled
kct = KmerCountTable(ksize=k, store_kmers=True)
# Consume sequence string
kct.consume(seq)
# drop all records with count > 1
kct.maxcut(1)
# dump_kmers() will return a list of (kmer,count) tuples
return { x for x, y in kct.dump_kmers(file=None)} Would this do the job? |
Oh fantastic! I'll try it out tomorrow. Looking forward to the optimized version. |
Cool, let me know how the speed compares to khmer. We still need to make consume() multithreaded as well. Probably only practical for bacterial genomes atm. |
I hope to get a chance to work on oxli this sunday. In the meantime, I would be surprised if oxli weren't faster than khmer when compiled in release mode (e.g. a wheel). khmer was not particularly optimized IIRC. But we can and should measure, and of course multithread! Memory usage is a different issue... I would expect oxli to be nowhere near as memory efficient as khmer was. For now, at least. |
I tried to use it using the There is no method called Update: I have also tried using the
|
I think you might have left out the First, install oxli from the "integrate_kmer_tracking" feature branch. # Clone this repo
git clone [email protected]:oxli-bio/oxli.git && cd oxli
# Create a dev env
conda env create -f environment.yml -n oxli
# Activate the new env
conda activate oxli
# Fetch list of remote branches
git fetch
# View all available branches
git branch -v -a
# Switch to the branch with kmer tracking features
git switch integrate_kmer_tracking
# Local install with testing dependancies
pip install -e '.[test]' Test out the # Create new table with kmer tracking enabled
kct = KmerCountTable(ksize=4, store_kmers=True)
seq = "AAAAAT"
# Consume sequence string
kct.consume(seq)
# Inspect kmers and counts
print(kct.dump_kmers(file=None, sortcounts=False, sortkeys=False))
# drop all records with count > 1
kct.maxcut(1)
#Inspect remaining kmers
print(kct.dump_kmers(file=None, sortcounts=False, sortkeys=False)) Note: There was a bug in Error:
Let me know if the above doesn't work and instead try something similar to your original solution: kct = KmerCountTable(ksize=4, store_kmers=True)
seq = "AAAAAT"
# Consume sequence string
kct.consume(seq)
{ x for x,y, in kct.dump_kmers() if y == 1} |
To the original request, do we want to add a function to report all the canonical kmers in the table? @ctb An equivalent to Atm I haven't been editing the Using |
I was able to test the runtime for getting kmers. It did raise an error using from khmer import Countgraph
from oxli import KmerCountTable
def khmer_test(seq:str, k:int) -> set[str]:
"""runs khmer
Args:
seq (str): sequence to get kmers
k (int): the kmer length
Returns:
set[str]: a set of kmers that appear once
"""
# create the count graph
kh = Countgraph(k, 1e7, 1)
# consume the sequences so kmers can be counted
kh.consume(seq)
# extract the kmers that appear once
out = {x for x in kh.get_kmers(seq) if kh.get(x) == 1}
def oxli_test(seq:str, k:int) -> set[str]:
"""runs oxli in a similar manner to how i use khmer
Args:
seq (str): sequence to get kmers
k (int): the kmer length
Returns:
set[str]: a set of kmers that appear once
"""
# create the count table
kt = KmerCountTable(k, store_kmers=True)
# consume the sequence
kt.consume(seq)
# get the kmers that appear once
out = {x for x,y in kt.dump_kmers(file=None) if y == 1} I wrote another program (not included) to time runs on three different E. coli genomes using these two methods. Here are the genomes I used:
Here are the results I obtained. Times in
Do you expect the |
Something I noticed is that I have deleted my previous comment because it was unnecessarily complicated. Here is a simple explanation of what I am seeing. from Bio.Seq import Seq
from khmer import Countgraph
from oxli import KmerCountTable
# set values to work with
k = 20
fwd = 'ATTCCGGAACCTTGAGAGATTCACGG'
rev = str(Seq(fwd).reverse_complement())
# make graphs
fcg = Countgraph(k, 1e7, 1)
rcg = Countgraph(k, 1e7, 1)
kct = KmerCountTable(k, store_kmers=True)
# consume sequences; khmer requires both
fcg.consume(fwd)
rcg.consume(rev)
kct.consume(fwd)
# get the forward and reverse kmers using khmer
f_kmers = [x for x in fcg.get_kmers(fwd) if fcg.get(x) == 1]
r_kmers = [x for x in rcg.get_kmers(rev) if rcg.get(x) == 1]
# get the cannonical kmers using oxli
o_kmers = [x for x,y in kct.dump_kmers(file=None) if y == 1] Here are the contents of the lists of kmers:
Is it possible to get all kmers instead of just cannonical kmers? In my use case, it would be more useful to get all 14 kmers instead of just the 7 cannonical ones. |
It might be a little faster to do maxcut with oxli if you you want to grab the latest version of that branch. I think the extra time is in calculating the revcomp and then choosing the canonical kmer in oxli. This should be addressed in #70. Only storing canonical kmers is pretty well baked into oxli, so I think the best path is to revcomp the kmers from |
get_kmers()
methodkmers()
method
I see. If that is the case, then I don't think The way I currently use # pseudocode
fwd = importSeq(fn)
rev = fwd.reverse_complement()
# use khmer to get kmers that appear once for each strand
fwd_kmers = getKmersThatAppearOnce(str(fwd))
rev_kmers = getKmersThatAppearOnce(str(rev))
# get all kmers that appear once across both strands
all_kmers = fwd_kmers.symmetric_difference(rev_kmers) To do the same with from Bio.Seq import Seq
# pseudocode
fwd = importSeq(fn)
# use oxli to get kmers that appear once
cannon_kmers = getKmersThatAppearOnce(str(fwd))
# get all kmers that appear once across both strands
all_kmers = cannon_kmers.union({str(Seq(x).reverse_complement()) for x in cannon_kmers} when i apply these strategies to real sequence data (E. coli genome), i do not get equivalent sets. In my test, the number of kmers I got from the can you think of a way for me to resolve this issue? do you think i am approaching this the wrong way? |
quick update as well. i ran the two different
|
Hi Joe, thanks for the update. Re your example code, below. When you use oxli to get When you reverse complement the cannonical kmer set and then take the union, you will add in the rc for all the non-palindromic kmers. Palindromes will necessarily only occur once in a set.
This should only really end up giving you fewer kmers than the khmer method, as khmer is also allowing kmers that occur once on the fwd and once on the reverse. Do you see the same final count with both the Could you provide a minimal sequence that replicates this issue? I will try to figure out where the difference is coming from. |
yes i see the same using something weird is definitely going on. from oxli import KmerCountTable
from khmer import Countgraph
from Bio.Seq import Seq
# constant
K = 20
# pseudo code
fwd = importSeq(fn)
# get oxli kmers
kct = KmerCountTable(K, store_kmers=True)
kct.consume(fwd)
kct.maxcut(1)
oxli_kmers = {x for x,y in kct.dump_kmers(file=None)}
# get khmer countgraph
fcg = Countgraph(K, 1e7, 1)
fcg.consume(fwd)
# save sets of kmers for each count
counts = dict()
for kmer in oxli_kmers:
# get the count of the kmer in the forward strand using the `khmer` countgraph
c = fcg.get(kmer)
# save the kmer for this count; create an empty setif need be
counts[c] = counts.get(c, set())
counts[c].add(kmer)
# print
for k in sorted(counts.keys()):
print(f'{k}:\t{len(counts[k])}') When I printed the results, I had 37 different values for counts:
i do not yet have a minimal dataset for you to evaluate. |
here is a sequence and code that causes some sort of issue. from Bio.Seq import Seq
from khmer import Countgraph
from oxli import KmerCountTable
# constants; 10,000bp sequences
FWD = 'GTATACAGATCGTGCGATCTACTGTGGATAACTCTGTCAGGAAGCTTGGATCAACCGGTAGTTATCCAAAGAACAACCGTTGTTCAGTTTTTGAGTTGTGTATAACCCCTCATTCTGATCCCAGCTTATACGGTCCAGGATCACCGATCATTCACAGTTAATGATCCTTTCCAGGTTGTTGATCTTAAAAGCCGGATCCTTGTTATCCACAGGGCAGTGCGATCCTAATAAGAGATAACAATAGAACAGATCTCTAAATAAATAGATCTTCTTTTTAATACCCAGGATCCCAGGTCTTTCTCAAGCCGACAAAGTTGAGTAGAATCCACGGCCCGGGCTTCAATCCATTTTCATACCGCGTTATGCGAGGCAATCACCATGTTTTATCCGGATCCTTTTGACGTCATCATCATTGGCGGGGGTCATGCAGGCACCGAGGCCGCGATGGCTGCGGCGCGTATGGGTCAACAGACTCTGCTTTTGACACACAATATCGACACTCTGGGGCAGATGAGCTGCAACCCGGCGATCGGCGGTATTGGGAAGGGACATCTGGTAAAAGAAGTGGATGCACTCGGCGGTCTGATGGCGAAAGCGATCGATCAGGCGGGTATCCAGTTTAGGATACTAAACGCAAGCAAGGGACCGGCAGTTCGCGCTACCCGAGCTCAGGCGGATCGTGTGCTCTACCGTCAGGCGGTACGTACGGCGCTGGAGAACCAACCGAACCTGATGATCTTCCAGCAGGCGGTTGAAGATCTTATTGTCGAAAACGATCGTGTGGTCGGTGCTGTTACCCAAATGGGACTGAAGTTCCGTGCCAAAGCCGTCGTGCTCACCGTTGGGACGTTCCTCGACGGTAAAATTCATATCGGTCTGGATAACTACAGCGGTGGCCGTGCTGGTGATCCGCCGTCCATTCCGCTTTCTCGCCGTTTGCGTGAACTTCCGCTGCGCGTTGGTCGTCTGAAAACCGGGACACCACCGCGTATTGATGCTCGAACCATCGACTTTAGCGTGCTTGCGCAACAGCATGGCGATAACCCAATGCCGGTATTCTCTTTTATGGGCAATGCGTCCCAGCATCCACAGCAGGTGCCGTGTTATATCACCCATACCAACGAGAAAACCCATGATGTGATCCGCAGTAACCTCGATCGTAGCCCAATGTACGCAGGGGTGATCGAAGGTGTCGGCCCACGCTACTGCCCGTCGATCGAAGACAAAGTCATGCGCTTCGCCGACAGAAATCAGCATCAGATCTTCCTTGAACCGGAAGGGCTGACTTCTAACGAAATTTATCCGAACGGTATCTCCACCAGCCTGCCGTTCGATGTGCAGATGCAAATCGTCCGCTCTATGCAGGGGATGGAAAACGCGAAGATCGTGCGTCCGGGTTATGCCATTGAGTATGACTTCTTCGATCCTCGCGACCTGAAACCGACGCTGGAGAGCAAGTTTATCCAGGGGCTGTTCTTTGCTGGTCAGATTAACGGCACTACCGGTTACGAAGAAGCCGCTGCGCAAGGTTTGCTGGCCGGTCTTAACGCTGCCCGTCTGTCTGATGACAAAGAAGGTTGGGCTCCGGCGCGTTCTCAGGCCTATCTCGGCGTACTGGTTGATGACCTGTGCACTTTAGGAACCAAAGAACCGTATCGTATGTTTACCTCGCGCGCAGAATATCGTCTGATGCTGCGCGAAGATAATGCGGATCTGCGTTTGACTGAAATCGGTCGTGAACTGGGCCTGGTGGATGACGAACGTTGGGCGCGCTTTAACGAGAAACTTGAGAATATCGAGCGTGAGCGTCAGCGTCTGAAATCGACCTGGGTAACCCCGTCGGCGGAAGCTGCAGCCGAAGTGAATGCTCACCTGACTGCGCCGCTTTCCCGTGAAGCCAGTGGTGAAGATCTGCTGCGTCGTCCGGAAATGACTTATGAAAAATTAACCACGCTGACGCCGTTTGCCCCTGCGTTGACAGACGAACAGGCGGCGGAACAGGTTGAGATTCAGGTTAAATACGAAGGTTATATCGCGCGCCAGCAAGATGAGATCGAAAAGCAGCTGCGTAACGAGAACACCTTGCTACCCGCGACACTGGATTACCGCCAGGTATCCGGTCTTTCTAACGAAGTGATCGCCAAACTTAACGATCACAAACCGGCCTCTATCGGCCAGGCTTCACGTATTTCTGGCGTCACGCCTGCGGCCATCTCCATTCTGCTGGTGTGGCTGAAAAAACAGGGTATGCTGCGTCGTAGCGCATAACGCATTAAAAATGCCTGGTAAGCCCCCGCTTACCAGGCAACGCATCAAGAACAGGTAATCACCGTGCTCAACAAACTCTCCTTACTGCTGAAAGACGCAGGTATTTCGCTTACCGATCACCAGAAAAACCAGCTTATTGCCTACGTGAATATGCTGCATAAATGGAACAAAGCGTACAACCTGACTTCGGTCCGCGATCCTAATGAGATGCTGGTACGCCATATTCTCGATAGCATTGTGGTGGCACCGTATCTGCAAGGTGAACGGTTTATCGATGTCGGCACCGGTCCTGGACTGCCTGGCATTCCACTCTCTATCGTGCGTCCTGAAGCCCATTTCACTCTGTTGGATAGCCTTGGTAAACGCGTGCGTTTCCTTCGTCAGGTGCAACATGAGCTTAAACTGGAGAATATTGAGCCAGTACAGAGCAGGGTAGAAGAGTTTCCTTCAGAGCCGCCATTTGATGGCGTAATTAGCCGCGCTTTTGCCTCTCTGAACGATATGGTGAGCTGGTGCCACCATCTTCCTGGTGAGCAAGGCCGTTTCTACGCGCTGAAAGGGCAAATGCCGGAAGATGAAATCGCTTTGTTGCCCGAAGAATATCAGGTCGAATCAGTGGTTAAACTTCAGGTTCCAGCCCTGGATGGCGAACGTCATCTGGTGGTGATTAAAGCAAATAAAATTTAATTTTTATCAAAAAAATCATAAAAAATTGACCGGTTAGACTGTTAACAACAACCAGGTTTTCTACTGATATAACTGGTTACATTTAACGCCACGTTCACTCTTTTGCATCAACAAGATAACGTGGCTTTTTTTGGTAAGCAGAAAATAAGTCATTAGTGAAAATATCAGTCTGCTAAAAATCGGCGCTAAGAACCATCATTGGCTGTTAAAACATTATTAAAAATGTCAATGGGTGGTTTTTGTTGTGTAAATGTCATTTATTAAAACAGTATCTGTTTTTAGACTGAAATATCATAAACTTGCAAAGGCATCATTTGCCAAGTAAATAAATATGCTGTGCGCGAACATGCGCAATATGTGATCTGAAGCACGCTTTATCACCAGTGTTTACGCGTTATTTACAGTTTTTCATGATCGAACAGGGTTAGTAGAAAAGTCGCAATTGTATGCACTGGAAAAATATTTAAACATTTATTCACCTTTTGGCTACTTATTGTTTGAAATCACGGGGGCGCACCGTATAATTTGACCGCTTTTTGATGCTTGACTCTAAGCCTTAAAGAAAGTTTTATACGACACGCGGCATACCTCGAAGGGAGCAGGAGTGAAAAACGTGATGTCTGTGTCGCTCGTGAGTCGAAACGTTGCTCGGAAGCTTCTGCTCGTTCAGTTACTGGTGGTGATAGCAAGTGGATTGCTGTTCAGCCTCAAAGACCCCTTCTGGGGCGTCTCTGCAATAAGCGGGGGCCTGGCAGTCTTTCTGCCTAACGTTTTGTTTATGATATTTGCCTGGCGTCACCAGGCGCATACACCAGCGAAAGGCCGGGTGGCCTGGACATTCGCATTTGGTGAAGCTTTCAAAGTTCTGGCGATGTTGGTGTTACTGGTGGTGGCGTTGGCGGTTTTAAAGGCGGTATTCTTGCCGCTGATCGTTACGTGGGTTTTGGTGCTGGTGGTTCAGATACTGGCACCGGCTGTAATTAACAACAAAGGGTAAAAGGCATCATGGCTTCAGAAAATATGACGCCGCAGGATTACATAGGACACCACCTGAATAACCTTCAGCTGGACCTGCGTACATTCTCGCTGGTGGATCCACAAAACCCCCCAGCCACCTTCTGGACAATCAATATTGACTCCATGTTCTTCTCGGTGGTGCTGGGTCTGTTGTTCCTGGTTTTATTCCGTAGCGTAGCCAAAAAGGCGACCAGCGGTGTGCCAGGTAAGTTTCAGACCGCGATTGAGCTGGTGATCGGCTTTGTTAATGGTAGCGTGAAAGACATGTACCATGGCAAAAGCAAGCTGATTGCTCCGCTGGCCCTGACGATCTTCGTCTGGGTATTCCTGATGAACCTGATGGATTTACTGCCTATCGACCTGCTGCCGTACATTGCTGAACATGTACTGGGTCTGCCTGCACTGCGTGTGGTTCCGTCTGCGGACGTGAACGTAACGCTGTCTATGGCACTGGGCGTATTTATCCTGATTCTGTTCTACAGCATCAAAATGAAAGGCATCGGCGGCTTCACGAAAGAGTTGACGCTGCAGCCGTTCAATCACTGGGCGTTCATTCCTGTCAACTTAATCCTTGAAGGGGTAAGCCTGCTGTCCAAACCAGTTTCACTCGGTTTGCGACTGTTCGGTAACATGTATGCCGGTGAGCTGATTTTCATTCTGATTGCTGGTCTGTTGCCGTGGTGGTCACAGTGGATCCTGAATGTGCCGTGGGCCATTTTCCACATCCTGATCATTACGCTGCAAGCCTTCATCTTCATGGTTCTGACGATCGTCTATCTGTCGATGGCGTCTGAAGAACATTAATTTACCAACACTACTACGTTTTAACTGAAACAAACTGGAGACTGTCATGGAAAACCTGAATATGGATCTGCTGTACATGGCTGCCGCTGTGATGATGGGTCTGGCGGCAATCGGTGCTGCGATCGGTATCGGCATCCTCGGGGGTAAATTCCTGGAAGGCGCAGCGCGTCAACCTGATCTGATTCCTCTGCTGCGTACTCAGTTCTTTATCGTTATGGGTCTGGTGGATGCTATCCCGATGATCGCTGTAGGTCTGGGTCTGTACGTGATGTTCGCTGTCGCGTAGTAAGCGTTGCTTTTATTTAAAGAGCAATATCAGAACGTTAACTAAATAGAGGCATTGTGCTGTGAATCTTAACGCAACAATCCTCGGCCAGGCCATCGCGTTTGTCCTGTTCGTTCTGTTCTGCATGAAGTACGTATGGCCGCCATTAATGGCAGCCATCGAAAAACGTCAAAAAGAAATTGCTGACGGCCTTGCTTCCGCAGAACGAGCACATAAGGACCTTGACCTTGCAAAGGCCAGCGCGACCGACCAGCTGAAAAAAGCGAAAGCGGAAGCCCAGGTAATCATCGAGCAGGCGAACAAACGCCGCTCGCAGATTCTGGACGAAGCGAAAGCTGAGGCAGAACAGGAACGTACTAAAATCGTGGCCCAGGCGCAGGCGGAAATTGAAGCCGAGCGTAAACGTGCCCGTGAAGAGCTGCGTAAGCAAGTTGCTATCCTGGCTGTTGCTGGCGCCGAGAAGATCATCGAACGTTCCGTGGATGAAGCTGCTAACAGCGACATCGTGGATAAACTTGTCGCTGAACTGTAAGGAGGGAGGGGCTGATGTCTGAATTTATTACGGTAGCTCGCCCCTACGCCAAAGCAGCTTTTGACTTTGCCGTCGAACACCAAAGTGTAGAACGCTGGCAGGACATGCTGGCGTTTGCCGCCGAGGTAACCAAAAACGAACAAATGGCAGAGCTTCTCTCTGGCGCGCTTGCGCCAGAAACGCTTGCCGAGTCGTTTATCGCAGTTTGTGGTGAGCAACTGGACGAAAACGGTCAGAACCTTATTCGGGTAATGGCTGAAAATGGTCGTCTTAACGCGCTCCCGGATGTTTTGGAGCAGTTTATTCACCTTCGTGCCGTGAGTGAGGCTACCGCTGAGGTAGACGTCATTTCCGCTGCCGCACTGAGTGAACAACAGCTCGCGAAAATTTCTGCTGCGATGGAAAAACGTCTGTCACGCAAAGTTAAGCTGAATTGCAAAATCGATAAGTCTGTAATGGCAGGCGTTATCATCCGAGCGGGTGATATGGTCATTGATGGCAGCGTACGCGGTCGTCTTGAGCGCCTTGCAGACGTCTTGCAGTCTTAAGGGGACTGGAGCATGCAACTGAATTCCACCGAAATCAGCGAACTGATCAAGCAGCGCATTGCTCAGTTCAATGTTGTGAGTGAAGCTCACAACGAAGGTACTATTGTTTCTGTAAGTGACGGTGTTATCCGCATTCACGGCCTGGCCGATTGTATGCAGGGTGAAATGATCTCCCTGCCGGGTAACCGTTACGCTATCGCACTGAACCTCGAGCGCGACTCTGTTGGTGCGGTTGTTATGGGTCCGTATGCTGACCTTGCCGAAGGCATGAAAGTTAAGTGTACTGGCCGTATCCTGGAAGTTCCGGTTGGCCGTGGCCTGCTGGGCCGTGTGGTTAACACTCTGGGTGCACCAATCGACGGTAAAGGTCCGCTGGATCACGACGGCTTCTCTGCTGTAGAAGCAATCGCTCCGGGCGTTATCGAACGTCAGTCCGTAGATCAGCCGGTACAGACCGGTTATAAAGCCGTTGACTCCATGATCCCAATCGGTCGTGGTCAGCGTGAATTGATCATCGGTGACCGTCAGACCGGTAAAACCGCACTGGCTATCGATGCCATCATCAACCAGCGCGATTCCGGTATCAAATGTATCTATGTCGCTATCGGCCAGAAAGCGTCCACCATTTCTAACGTGGTACGTAAACTGGAAGAGCACGGCGCACTGGCTAACACCATCGTTGTGGTAGCAACCGCGTCTGAATCCGCTGCACTGCAATACCTGGCACCGTATGCCGGTTGCGCAATGGGCGAATACTTCCGTGACCGCGGTGAAGATGCGCTGATCATTTACGATGATCTGTCTAAACAGGCTGTTGCTTACCGTCAGATCTCGCTGCTGCTCCGTCGTCCGCCAGGACGTGAAGCATTCCCGGGCGACGTATTCTACCTCCACTCTCGTCTGCTGGAGCGTGCTGCGCGTGTTAACGCCGAATACGTTGAAGCTTTCACCAAAGGTGAAGTGAAAGGGAAAACCGGTTCACTGACCGCGCTGCCGATTATCGAAACTCAAGCGGGTGACGTTTCTGCGTTCGTTCCGACCAACGTAATCTCCATTACCGATGGTCAGATCTTCCTGGAAACCAACCTGTTCAACGCCGGTATTCGTCCTGCGGTTAACCCGGGTATTTCCGTATCCCGTGTTGGTGGTGCAGCACAGACCAAGATCATGAAAAAACTGTCCGGTGGTATCCGTACCGCTCTGGCACAGTATCGTGAACTGGCAGCGTTCTCTCAGTTTGCATCCGACCTTGACGATGCAACACGTAAGCAGCTTGACCACGGTCAAAAAGTGACCGAACTGCTGAAACAGAAACAGTATGCGCCGATGTCCGTTGCGCAGCAGTCTCTGGTTCTGTTCGCAGCAGAACGTGGTTACCTGGCGGATGTTGAACTGTCGAAAATCGGCAGCTTCGAAGCCGCTCTGCTGGCTTACGTCGACCGTGATCACGCTCCGTTGATGCAAGAGATCAACCAGACCGGTGGCTACAACGACGAAATCGAAGGCAAGCTGAAAGGCATCCTCGATTCCTTCAAAGCAACCCAATCCTGGTAACGTCTGGCGGCTTGCCTTAGGGCAGGCCGCAAGGCATTGAGGAGAAGCTCATGGCCGGCGCAAAAGAGATACGTAGTAAGATCGCAAGCGTCCAGAACACGCAAAAGATCACTAAAGCGATGGAGATGGTCGCCGCTTCCAAAATGCGTAAATCGCAGGATCGCATGGCGGCCAGCCGTCCTTATGCAGAAACCATGCGCAAAGTGATTGGTCACCTTGCACACGGTAATCTGGAATATAAGCACCCTTACCTGGAAGACCGCGACGTTAAACGCGTGGGCTACCTGGTGGTGTCGACCGACCGTGGTTTGTGCGGTGGTTTGAACATTAACCTGTTCAAAAAACTGCTGGCGGAAATGAAGACCTGGACCGACAAAGGCGTTCAATGCGACCTCGCAATGATCGGCTCGAAAGGCGTGTCGTTCTTCAACTCCGTGGGCGGCAATGTTGTTGCCCAGGTCACCGGCATGGGGGATAACCCTTCCCTGTCCGAACTGATCGGTCCGGTAAAAGTGATGTTGCAGGCCTACGACGAAGGCCGTCTGGACAAGCTTTACATTGTCAGCAACAAATTTATTAACACCATGTCTCAGGTTCCGACCATCAGCCAGCTGCTGCCGTTACCGGCATCAGATGATGATGATCTGAAACATAAATCCTGGGATTACCTGTACGAACCCGATCCGAAGGCGTTGCTGGATACCCTGCTGCGTCGTTATGTCGAATCTCAGGTTTATCAGGGCGTGGTTGAAAACCTGGCCAGCGAGCAGGCCGCCCGTATGGTGGCGATGAAAGCCGCGACCGACAATGGCGGCAGCCTGATTAAAGAGCTGCAGTTGGTATACAACAAAGCTCGTCAGGCCAGCATTACTCAGGAACTCACCGAGATCGTCTCGGGGGCCGCCGCGGTTTAAACAGGTTATTTCGTAGAGGATTTAAGATGGCTACTGGAAAGATTGTCCAGGTAATCGGCGCCGTAGTTGACGTCGAATTCCCTCAGGATGCCGTACCGCGCGTGTACGATGCTCTTGAGGTGCAAAATGGTAATGAGCGTCTGGTGCTGGAAGTTCAGCAGCAGCTCGGCGGCGGTATCGTGCGTACCATCGCAATGGGTTCCTCCGACGGTCTGCGTCGCGGTCTGGATGTAAAAGACCTCGAACACCCGATCGAAGTCCCGGTAGGTAAAGCGACTCTGGGCCGTATCATGAACGTACTGGGTGAACCGGTCGACATGAAAGGCGAGATCGGTGAAGAAGAGCGTTGGGCGATTCACCGCGCAGCACCTTCCTACGAAGAGCTGTCAAACTCTCAGGAACTGCTGGAAACCGGTATCAAAGTTATCGACCTGATGTGTCCGTTCGCTAAGGGCGGTAAAGTTGGTCTGTTCGGTGGTGCGGGTGTAGGTAAAACCGTAAACATGATGGAGCTCATTCGTAACATCGCGATCGAGCACTCCGGTTACTCTGTGTTTGCGGGCGTAGGTGAACGTACTCGTGAGGGGAACGACTTCTACCACGAAATGACCGACTCCAACGTTATCGATAAAGTATCCCTGGTGTATGGCCAGATGAACGAGCCGCCGGGAAACCGTCTGCGCGTAGCTCTGACCGGTCTGACCATGGCTGAGAAATTCCGTGACGAAGGTCGTGACGTTCTGCTGTTCGTTGACAACATCTATCGTTACACCCTGGCCGGTACGGAAGTATCCGCACTGCTGGGCCGTATGCCTTCAGCGGTAGGTTATCAGCCGACCCTGGCGGAAGAGATGGGCGTTCTGCAGGAACGTATCACCTCCACCAAAACCGGTTCTATCACCTCCGTACAGGCAGTATACGTACCTGCGGATGACTTGACTGACCCGTCTCCGGCAACCACCTTTGCGCACCTTGACGCAACCGTGGTACTGAGCCGTCAGATCGCGTCTCTGGGTATCTACCCGGCCGTTGACCCGCTGGACTCCACCAGCCGTCAGCTGGACCCGCTGGTGGTTGGTCAGGAACACTACGACACCGCGCGTGGCGTTCAGTCCATCCTGCAACGTTATCAGGAACTGAAAGACATCATCGCCATCCTGGGTATGGATGAACTGTCTGAAGAAGACAAACTGGTGGTAGCGCGTGCTCGTAAGATCCAGCGCTTCCTGTCCCAGCCGTTCTTCGTGGCAGAAGTATTCACCGGTTCTCCGGGTAAATACGTCTCCCTGAAAGACACCATCCGTGGCTTTAAAGGCATCATGGAAGGCGAATACGATCACCTGCCGGAGCAGGCGTTCTACATGGTCGGTTCCATCGAAGAAGCTGTGGA'
REV = str(Seq(FWD).reverse_complement())
K = 20
# functions
def get_khmer_kmers(fwd:str, rev:str, k:int) -> set[str]:
"""gets the kmers that appear once in the genome
Args:
fwd (str): the forward sequence
rev (str): the reverse sequence
k (int): the kmer length
Returns:
set[str]: kmers that appear once in the genome
"""
# make countgraphs
fcg = Countgraph(k, 1e7, 1)
rcg = Countgraph(k, 1e7, 1)
# consume sequences
fcg.consume(fwd)
rcg.consume(rev)
# get kmers that appear once on each strand; ignore chimeras made at junctions
out = {x for x in fcg.get_kmers(fwd) if fcg.get(x) == 1}
rks = {x for x in rcg.get_kmers(rev) if rcg.get(x) == 1}
# only keep kmers that appear once across both strands
out.symmetric_difference_update(rks)
return out
def get_oxli_kmers(seq:str, k:int) -> set[str]:
"""gets the kmers that appears once in the genome
Args:
seq (str): the sequence
k (int): the kmer length
Returns:
set[str]: kmers that appear once in the genome
"""
# build count table
kct = KmerCountTable(k, store_kmers=True)
kct.consume(seq)
# only keep kmers that appear once
kct.maxcut(1)
# get the kmers and add reverse complements to the set
out = {x for x,y in kct.dump_kmers(file=None)}
out.update({str(Seq(x).reverse_complement()) for x in out})
return out
# test case
khmer_kmers = get_khmer_kmers(FWD, REV, K)
oxli_kmers = get_oxli_kmers(FWD, K) Examining the sets reveals the following information:
I hope this helps. Notes:
|
@dr-joe-wirth thanks for preparing this example. @ctb I'm not familiar with the workings of khmer, can you take a look at this? Happy to jump in and chase down bugs if you have a lead. |
The first thing that comes to mind is that khmer has false positives - the countgraph will in some cases overcount things. @dr-joe-wirth if you change the 1e7 to 1e8, you should see fewer overcounts. Reading through the rest of the comments now ;) |
More comments: regarding choosing the canonical k-mer, this is implemented on the oxli side, not in the hashing mechanism we're taking from sourmash. So I think it's not that difficult to expose all k-mers, not just the canonical k-mers. The speed is a different problem. I'm surprised (and unhappy) that we're slower than khmer! |
so I may start there... with the speed... |
wonderful. i am really excited for this package. |
@ctb I can work on supporting an "all-kmers" mode if you want to open a new issue. |
Hello,
I use
khmer
inprimerForge
to retrieve kmers that appear exactly once in the genome My current workflow is this:I am very interested in replacing
khmer
in my program withoxli
, but currentlyoxli
does not have aget_kmers
method.I'd really like if you could implement this. Or at least if you could add how to get all the kmers in a sequence to the documentation if this is not something you wish to implement in
oxli
.Thanks so much for building this!
edit: I simplified the code to more clearly demonstrate my use case
The text was updated successfully, but these errors were encountered: