Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

empty SEQ fields in collapsed.sam reads #23

Open
C4t3 opened this issue Nov 30, 2016 · 3 comments
Open

empty SEQ fields in collapsed.sam reads #23

C4t3 opened this issue Nov 30, 2016 · 3 comments

Comments

@C4t3
Copy link

C4t3 commented Nov 30, 2016

Hi,
I'm having some problems with the samcollapser. In the unique.sam file I got some reads with empty CIGAR and SEQUENCE fields. I couldn't figure out why. The reads looks fine within the indexed.sam before collapsing.

@augustboyle
Copy link
Member

augustboyle commented Nov 30, 2016 via email

@kzkedzierska
Copy link

kzkedzierska commented Sep 30, 2019

I have the same problem. The issue is that I am using mipgen_smmip_collapser.py in a snakemake pipeline and analyzing quite a lot of samples. Removing the reads by hand is far from ideal solution ;)

Below my original issue, which I closed after noticing this one:

I tried to run the mipgen_smmip_collapser.py on some of my samples. Unfortunately, the output (collapsed.all_reads.unique.sam) is corrupted. I managed to trace it to (at least) one read being reported with CIGAR, SEQ and QUAL fields empty, and the POS different that the same read' POS in the input file. I run ValidateSamFile on the input and it seems to be ok (except not having a read group, but that had not been a problem with other files). Any idea what is causing this? Or how to further debug it?

I would very much appreciate any suggestions.

@darichter87
Copy link

Hi Katarzyna,

I actually had the same problem a few weeks ago.
You're already very close to the solution!

The empty reads in the "collapsed.all_reads.unique.sam" output are caused by a wrong behaviour of the filter for softclipped reads within the genome_sam_collapser.
This erroneous output is only observed for paired-end sequencing data.
The paired-end softclipping filter only accounts for reads with flags 163 or 99 being softclipped at the start and reads with flags 83 or 147 being softclipped at the end.
I am not really sure about the rational behind it, but considering all 4 flags for softclippings at the beginning as well as at the end of your reads will fix this problem!

Hence, you can fix this by altering line 719 within the "genome_sam_collapser.pyx" from
if options.single_end or (re.search("^\d+S", current_read.cigar) and (current_read.flag == 163 or current_read.flag == 99)) or (re.search("S$", current_read.cigar) and (current_read.flag == 83 or current_read.flag == 147)):
to
if options.single_end or (re.search("^\d+S", current_read.cigar) and (current_read.flag == 163 or current_read.flag == 99 or current_read.flag == 83 or current_read.flag == 147)) or (re.search("S$", current_read.cigar) and (current_read.flag == 83 or current_read.flag == 147 or current_read.flag == 163 or current_read.flag == 99)):
(Note: This code is not aesthetically pleasing but helps to understand what the problem was)

After changing this line you also need to recompile the script by
'python setup.py build_ext --inplace'

I hope this helps, in case you still had the problem.

I would be happy if one of the developers could comment on the rational behind the original softclipping filter logic. :)

Best,

Daniel

P.S.:
The source for these reads are "self-circularized" smMIPs, that were circularized without actually capturing the target sequence.
We have seen these reads a few times in our data and think they are derived from too strong molecular crowding (too high MIP concentrations) within the hybridization reaction and/or interactions between certain smMIPs in the pool.
Consequently, you will find reads in your data that mainly consist of both MIP-arms one after another followed by your adapter-sequence.
These reads might also contain short "off-target" sequences between the arms, leading to only one arm being mapped and the rest of the read being softclipped.
In my data, softclipping can occur on both ends of the reads for all legit read-flags (flags 83, 99, 147, 163), but as stated above only some combinations get filtered out by the softclipping-filter.
The hybridization arms of these unfiltered reads are then trimmed by default when MIPgen collapses these.
So, if the arms were the only sequence that could be mapped you end up with empty SEQ, CIGAR and QUAL fields in your output.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants