Strand aware kmer counting #95

Open
Adamtaranto opened this issue Nov 17, 2024 · 2 comments
Assignees
Labels
enhancement New feature or request

Comments

@Adamtaranto
Collaborator

Some users may wish to count kmers from the forward and reverse strands separately. Discussed in #74.

Propose adding an option (stranded/strand_aware?) to make a KmerCountTable store fwd/rev kmers separately.

Option 1: Would disable canonical kmer selection and instead store +1 for both the fwd and reverse strands.
Option 2: Count only from fwd strand. Do not canonicalise.

@dr-joe-wirth would Option 1 suit your use case? Or do you need the counts in separate tables (Option 2), i.e. count the fwd strand only, then revcomp the sequence and count again in another table?
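
To make the difference between the two options concrete, here is a minimal pure-Python sketch. It does not use the KmerCountTable API; the helper names (`revcomp`, `kmers`, `count_canonical`, `count_both_strands`, `count_forward_only`) are hypothetical, and it assumes the canonical kmer is the lexicographically smaller of a kmer and its reverse complement.

```python
from collections import Counter

# Translation table for complementing uppercase DNA (assumes only A, C, G, T).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """List every k-length substring of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_canonical(seq, k):
    """Canonical counting: each kmer is collapsed with its reverse complement."""
    return Counter(min(km, revcomp(km)) for km in kmers(seq, k))


def count_both_strands(seq, k):
    """Option 1 (sketch): no canonicalisation, +1 for fwd and rev strand kmers."""
    return Counter(kmers(seq, k)) + Counter(kmers(revcomp(seq), k))


def count_forward_only(seq, k):
    """Option 2 (sketch): count kmers from the fwd strand only."""
    return Counter(kmers(seq, k))
```

Under Option 2, the reverse-strand counts would come from a second call on the reverse-complemented sequence, e.g. `count_forward_only(revcomp(seq), k)`, kept in a separate table.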

@dr-joe-wirth

Let me make sure I understand the options:

  1. store all the kmers for both the forward and reverse strands instead of storing all the canonical kmers for both the forward and reverse strands.
  2. store all the kmers from the forward strand only.

For my purposes, I want to get the kmers that appear exactly once. Currently, I use khmer like this:

  1. get kmers from the forward strand that appear once
  2. flip the sequence, then get the kmers that appear once in the flipped sequence (reverse strand kmers)
  3. keep only the kmers that are not shared on the two strands (symmetric difference of sets)
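
The three steps above can be sketched in plain Python as follows. This is an illustration only, not the khmer code actually used: it assumes "flip" means reverse-complement, and the function names (`revcomp`, `unique_kmers`, `strand_specific_unique`) are made up for the example.

```python
from collections import Counter

# Translation table for complementing uppercase DNA (assumes only A, C, G, T).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def unique_kmers(seq, k):
    """Set of kmers that appear exactly once on the given strand."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer for kmer, n in counts.items() if n == 1}


def strand_specific_unique(seq, k):
    """Single-copy kmers found on one strand but not the other."""
    fwd = unique_kmers(seq, k)            # step 1: forward strand
    rev = unique_kmers(revcomp(seq), k)   # step 2: flipped sequence
    return fwd ^ rev                      # step 3: symmetric difference
```

A kmer that is single-copy on both strands, such as a reverse-palindrome like `ACGT`, is present in both sets and therefore dropped by the symmetric difference.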

So option 1 can work for my purposes if the count method reports the number of times a kmer appears in total, not just the number of times it appears on one strand. If that is not feasible, then I would prefer option 2.

Please let me know if that makes sense.

@Adamtaranto Adamtaranto self-assigned this Nov 19, 2024
@Adamtaranto Adamtaranto added the enhancement New feature or request label Nov 19, 2024
@Adamtaranto
Collaborator Author

Adamtaranto commented Nov 19, 2024

| dsDNA | All kmers | Fwd strand only | Canonical |
|---|---|---|---|
| 5' GGTA 3' | GGT, GTA | GGT, GTA | - , GTA |
| 3' CCAT 5' | ACC, TAC | - , - | ACC , - |

Ok, I think option 2 is probably best here. Keeps things consistent with khmer and allows for cases where users just want the fwd strand kmers. If you want "all kmers", revcomp the sequence and consume that too.
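
A rough sketch of what "consume the revcomp too" would look like, using a toy stand-in class rather than the real KmerCountTable. `ForwardKmerTable`, its `consume`/`get` methods, and `revcomp` are hypothetical names for this example, and it assumes uppercase A/C/G/T input only.

```python
from collections import Counter

# Translation table for complementing uppercase DNA (assumes only A, C, G, T).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


class ForwardKmerTable:
    """Toy table that counts kmers exactly as given, with no canonicalisation."""

    def __init__(self, ksize):
        self.ksize = ksize
        self.counts = Counter()

    def consume(self, seq):
        """Count every kmer in seq as written; return the number consumed."""
        n = max(len(seq) - self.ksize + 1, 0)
        for i in range(n):
            self.counts[seq[i:i + self.ksize]] += 1
        return n

    def get(self, kmer):
        """Return the count for kmer (0 if it was never seen)."""
        return self.counts[kmer]


# Fwd strand only: consume the sequence once.
fwd_table = ForwardKmerTable(ksize=3)
fwd_table.consume("GGTA")

# "All kmers": consume the sequence and its reverse complement.
all_table = ForwardKmerTable(ksize=3)
all_table.consume("GGTA")
all_table.consume(revcomp("GGTA"))
```

For the `GGTA` example in the table, `fwd_table` ends up holding GGT and GTA, while `all_table` holds GGT, GTA, TAC and ACC, each with a count of 1.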

@ctb I will add this after the PRs that change kmer counting/hashing behaviour are wrapped up (#10, #83, #87).


2 participants