-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handling adapter sequences #101
Comments
I've added an additional bounds check in self_qc to the read position vs. the start of the left primer and the end of the right primer that should fix this as long as the adapter sequence doesn't extend too far beyond the primer match distance threshold (otherwise we have the unfortunate such as in issue #99) Can you point me to some fastqs where this appears? I'd like to test the solution |
@iqbal-lab I understand that there are a number of real world samples where this has produced artefacts, would it be possible for me to see them? |
Sure, can you confirm first that this works on the simulated data? |
Yes, I am happy about the bounds checking fix for this one example. Since this is a synthetic test case that deviates from our QC model of real-world data in a relevant way (many nanopore reads of exactly matching lengths and adapter noise) I would like to confirm that the fix properly satisfies the real-world conditions. |
Anyway, here's a sample ERR8959196 from the oxford validation set which has loads of adapters. |
Thanks for the sample. The instances of this artefact seem a bit marginal to me:
The last column is the count of bases attributed to this artefact. I do not see a difference in any calls made in the final VCF. Do you have a list of the positions where you have observed this artefact? |
No, I am sorry but I do not feel comfortable with committing untested code. In my experience this can lead to confusion where it can be tricky to separate issues raised by experimental results from true-positive bugs.
We have a quality control model that is designed to help with adapter artefacts but it does not expect to see a lot of nanopore reads that have the same exact lengths. One way to avoid speculation about this stuff is to look at some examples. Is it the case that we have observed instances of this issue in real world data? If not I would be happy to move the more open-ended data exploration to a different venue. |
OK this issue is on hold until Friday |
We need to handle adapter sequences.
Observed behaviour:
Desired behaviour: the adapter sequence is masked. I think self-qc could do this. An alternative would be to remove adapters from the reads earlier on in the pipeline so they are never seen again.
Example is the same as in #99. Please also see #100 for more detail on the amplicon in question.
The adapter sequence is included in the consensus at the end of amplicon
SARS-CoV-2_76
. All the reads there end either with the primer, or with the primer plus a few bp of adapter. None of them should contribute toClean.Tot.cons
at the positions of the primer or past the end of the primer.First columns of
all_stats.tsv
at the end of the primer -- start of adapter -- start of dropped amplicon:The text was updated successfully, but these errors were encountered: