Implement a technique to remove artificially duplicated reads from within khmer #259

Open
ctb opened this issue Jan 19, 2014 · 1 comment
@ctb
Member

ctb commented Jan 19, 2014

On Dec 17, 2013 2:06 PM, "C. Titus Brown" [email protected] wrote:

Haven't figured out an approach for khmer yet, but:

what about,

for each read:
    check whether its first 32-mer has been seen before
    if it has, discard the read
    otherwise, store the first 32-mer and keep the read

I can think of a modification to enable this for transcriptomes/high
coverage metagenomes, too.

It only catches reads whose first 32 bases match exactly, so some tuning of
the prefix length (16? 20? 32?) might be useful. We could also key on some
fraction of the first 3 k-mers, etc.

Michael:

That's how I would do it. Doesn't need to be perfect.
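
The per-read check proposed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not khmer code: the name `dedup_by_prefix` is hypothetical, a plain Python `set` stands in for whatever structure khmer would actually use to remember prefixes, prefixes are upper-cased before comparison, and reads shorter than `k` are kept because they have no full prefix to compare.

```python
from typing import Iterable, Iterator, Tuple


def dedup_by_prefix(
    reads: Iterable[Tuple[str, str]], k: int = 32
) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs whose first k bases have not been seen yet.

    Hypothetical helper for illustration; not part of khmer.
    """
    seen = set()  # assumption: an exact in-memory set of prefixes
    for name, seq in reads:
        if len(seq) < k:
            # Assumption: reads too short to have a full prefix are kept.
            yield name, seq
            continue
        prefix = seq[:k].upper()
        if prefix in seen:
            # First k bases already stored: discard the read.
            continue
        # New prefix: store it and keep the read.
        seen.add(prefix)
        yield name, seq


reads = [
    ("r1", "ACGT" * 10),
    ("r2", "ACGT" * 10),              # exact duplicate of r1
    ("r3", "ACGT" * 8 + "TTTTTTTT"),  # same first 32 bases, different tail
    ("r4", "TTGCA" * 8),              # different prefix
    ("r5", "ACGT"),                   # shorter than k
]
kept = [name for name, _ in dedup_by_prefix(reads, k=32)]
print(kept)  # ['r1', 'r4', 'r5']
```

Note that `r3` is dropped even though it differs from `r1` after base 32; only the prefix is compared. Lowering `k` makes the filter more aggressive, which is the tuning question raised above.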

@camillescott
Member

This is a bad idea IMO. We already have known problems with diginorm removing branches from the de Bruijn graph; if we have two branches, say each with coverage ~5, the chances of throwing away an entire branch with this approach are very high.

@ctb changed the title from "Implement a technique to remote artificially duplicated reads from within khmer" to "Implement a technique to remove artificially duplicated reads from within khmer" on Jun 12, 2015