empty SEQ fields in collapsed.sam reads #23

C4t3 · 2016-11-30T15:11:41Z

Hi,
I'm having some problems with the samcollapser. In the unique.sam file I got some reads with empty CIGAR and SEQUENCE fields. I couldn't figure out why. The reads look fine in the indexed.sam before collapsing.

augustboyle · 2016-11-30T17:02:25Z

I can try to look at the failing reads if you send me your MIPs design file and a SAM of the problem reads. My boylee _at_ uw.edu address is still online. I have seen this happen in the past, but I don't think I was able to find a consistent cause. If you don't have too many of them, it shouldn't be too burdensome to remove them, but I agree that understanding why it happens is ideal.

Evan

kzkedzierska · 2019-09-30T14:46:14Z

I have the same problem. The issue is that I am using mipgen_smmip_collapser.py in a snakemake pipeline and analyzing quite a lot of samples. Removing the reads by hand is far from an ideal solution ;)

Below my original issue, which I closed after noticing this one:

I tried to run mipgen_smmip_collapser.py on some of my samples. Unfortunately, the output (collapsed.all_reads.unique.sam) is corrupted. I managed to trace it to (at least) one read being reported with empty CIGAR, SEQ and QUAL fields, and with a POS different from the same read's POS in the input file. I ran ValidateSamFile on the input and it seems to be OK (except for the missing read group, but that has not been a problem with other files). Any idea what is causing this? Or how to debug it further?

I would very much appreciate any suggestions.

darichter87 · 2019-10-11T11:57:01Z

Hi Katarzyna,

I actually had the same problem a few weeks ago.
You're already very close to the solution!

The empty reads in the "collapsed.all_reads.unique.sam" output are caused by incorrect behaviour of the filter for softclipped reads within the genome_sam_collapser.
This erroneous output is only observed for paired-end sequencing data.
The paired-end softclipping filter only accounts for reads with flags 163 or 99 that are softclipped at the start and reads with flags 83 or 147 that are softclipped at the end.
I am not really sure about the rationale behind it, but considering all 4 flags for softclipping at the beginning as well as at the end of your reads will fix this problem!
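
For reference, here is a small sketch that decodes the four flags using the bit definitions from the SAM specification. The helper name `describe_flag` is made up for illustration and is not part of MIPgen. It shows that 99 and 163 are forward-strand reads while 83 and 147 are reverse-strand reads, which is possibly what the original filter was keyed on, though that is only a guess.

```python
# SAM FLAG bits as defined in the SAM specification.
READ_REVERSE = 0x10    # read is mapped to the reverse strand
FIRST_IN_PAIR = 0x40   # read 1 of the pair
SECOND_IN_PAIR = 0x80  # read 2 of the pair


def describe_flag(flag):
    """Return (strand, mate) for a SAM flag (hypothetical helper)."""
    strand = "reverse" if flag & READ_REVERSE else "forward"
    if flag & FIRST_IN_PAIR:
        mate = "read1"
    elif flag & SECOND_IN_PAIR:
        mate = "read2"
    else:
        mate = "unpaired"
    return strand, mate


for flag in (99, 163, 83, 147):
    print(flag, describe_flag(flag))
```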

The fix is to change line 719 of "genome_sam_collapser.pyx" from
if options.single_end or (re.search("^\d+S", current_read.cigar) and (current_read.flag == 163 or current_read.flag == 99)) or (re.search("S$", current_read.cigar) and (current_read.flag == 83 or current_read.flag == 147)):
to
if options.single_end or (re.search("^\d+S", current_read.cigar) and (current_read.flag == 163 or current_read.flag == 99 or current_read.flag == 83 or current_read.flag == 147)) or (re.search("S$", current_read.cigar) and (current_read.flag == 83 or current_read.flag == 147 or current_read.flag == 163 or current_read.flag == 99)):
(Note: This code is not aesthetically pleasing but helps to understand what the problem was)
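
The two conditions can also be compared outside of MIPgen. The following is a minimal standalone sketch, not the actual collapser code: the function names and the `PAIRED_FLAGS` set are assumptions for illustration, and only the boolean logic is taken from the two lines quoted above.

```python
import re

# The four flags discussed in this thread (illustrative constant, not in MIPgen).
PAIRED_FLAGS = {83, 99, 147, 163}


def original_condition(cigar, flag, single_end=False):
    """Boolean logic of line 719 as quoted before the change."""
    clipped_start = re.search(r"^\d+S", cigar) is not None
    clipped_end = re.search(r"S$", cigar) is not None
    return bool(
        single_end
        or (clipped_start and flag in (163, 99))
        or (clipped_end and flag in (83, 147))
    )


def patched_condition(cigar, flag, single_end=False):
    """Same logic after the change: all four flags count at either end."""
    clipped_start = re.search(r"^\d+S", cigar) is not None
    clipped_end = re.search(r"S$", cigar) is not None
    return bool(
        single_end
        or ((clipped_start or clipped_end) and flag in PAIRED_FLAGS)
    )


# Reads clipped at the "unexpected" end are only matched by the patched version.
for cigar, flag in [("20S56M", 83), ("56M20S", 99), ("20S56M", 99)]:
    print(cigar, flag, original_condition(cigar, flag), patched_condition(cigar, flag))
```

Writing the flags as a set also shortens the long chain of `==` comparisons without changing what the patched line matches.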

After changing this line you also need to recompile the script with
'python setup.py build_ext --inplace'

I hope this helps, in case you still have the problem.

I would be happy if one of the developers could comment on the rationale behind the original softclipping filter logic. :)

Best,

Daniel

P.S.:
The source of these reads is "self-circularized" smMIPs, i.e. smMIPs that were circularized without actually capturing the target sequence.
We have seen these reads a few times in our data and think they result from excessive molecular crowding (too high MIP concentrations) in the hybridization reaction and/or from interactions between certain smMIPs in the pool.
Consequently, you will find reads in your data that mainly consist of both MIP arms one after another, followed by your adapter sequence.
These reads might also contain short "off-target" sequences between the arms, leading to only one arm being mapped and the rest of the read being softclipped.
In my data, softclipping can occur on both ends of the reads for all legitimate read flags (83, 99, 147, 163), but as stated above only some combinations get filtered out by the softclipping filter.
The hybridization arms of these unfiltered reads are then trimmed by default when MIPgen collapses them.
So, if the arms were the only sequence that could be mapped, you end up with empty SEQ, CIGAR and QUAL fields in your output.
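
If patching and recompiling is not an option in a pipeline, the affected records could in principle be dropped after collapsing. This is a rough sketch under stated assumptions: the function names are hypothetical, "empty" is taken to mean a zero-length CIGAR, SEQ or QUAL column (a literal `*` is left alone), and the column positions follow the standard SAM layout.

```python
def has_empty_alignment_fields(sam_line):
    """True if CIGAR, SEQ or QUAL is a zero-length string (assumed meaning of "empty")."""
    if sam_line.startswith("@"):
        return False  # header lines are never flagged
    # Strip only the line ending so trailing empty columns survive the split.
    fields = sam_line.rstrip("\r\n").split("\t")
    if len(fields) < 11:
        return True  # fewer than the 11 mandatory SAM columns
    cigar, seq, qual = fields[5], fields[9], fields[10]
    return cigar == "" or seq == "" or qual == ""


def drop_empty_records(lines):
    """Keep header lines and records whose CIGAR, SEQ and QUAL are non-empty."""
    return [line for line in lines if not has_empty_alignment_fields(line)]


# Made-up example records, not taken from real data.
header = "@HD\tVN:1.4\n"
good = "r1\t99\tchr1\t100\t60\t5M\t=\t200\t105\tACGTA\tIIIII\n"
bad = "r2\t83\tchr1\t150\t60\t\t=\t100\t-50\t\t\n"
print(len(drop_empty_records([header, good, bad])))
```

This only hides the symptom; the mates of the dropped reads stay in the file, so the filter change above is the more complete fix.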

JulieHachey mentioned this issue Jul 6, 2017

no masked feature file found #26

Closed
