MD:Z SAM file tag error: Reference sequence cross-contamination #90

meyermj · 2018-05-10T21:53:01Z

This bug is in VN:v0.5.2 compiled on Nov 15 2017 at 11:37:33.

The MD:Z tag in the SAM output can be a very helpful supplement to the CIGAR string field because it details which bases have been inserted, deleted, and mismatched, whereas the CIGAR string only details the positions of indels. Because of the extra information supplied in the MD:Z field, one can re-create full alignments of reads to reference from the SAM file only (without having to re-read sequences from the reference genome). I however, I have noticed some irregularities in the MD:Z field produced by graphmap. Below I will detail how to reproduce the error I'm seeing.

I created a reference file of two unrelated, random sequences.

>ref0
TACGTACGTAGCTAGCTAGCTAGCTAGCTAGCTAGCTACGATCGATCGATCGATCGTTATTCGAGCGCGCGCGAAACGTATCGATCGTGCGCATCGTAGCTAGCTAGCTACGTACGTAC
>ref1
CGATGCTAGCTGATCGATCAAATTCACGTAGCTAGCTAGCTAGCTACTCGTAGCTACTGACGACTGACGTACTGACTGATCGATCGATCGAGCATCGATCGATCGATCGATCGATCGAC

I then created a single simulated read from ref1, with a single deletion of the 5th base (G).

>myRead
CGATCTAGCTGATCGATCAAATTCACGTAGCTAGCTAGCTAGCTACTCGTAGCTACTGACGACTGACGTACTGACTGATCGATCGATCGAGCATCGATCGATCGATCGATCGATCGAC

I used the following command to align myRead to the reference file containing both ref1 and ref2.

graphmap align -r refs.fasta -d read.fasta -o out.sam

Here is the SAM output:

myRead	0	ref1	1	40	4M1D114M	*	0	0	CGATCTAGCTGATCGATCAAATTCACGTAGCTAGCTAGCTAGCTACTCGTAGCTACTGACGACTGACGTACTGACTGATCGATCGATCGAGCATCGATCGATCGATCGATCGATCGAC	*	MD:Z:4^T114	NM:i:1	AS:i:582	H0:i:1	ZE:f:1.20556e-38	ZF:f:0.827464	ZQ:i:118	ZR:i:119

Of importance, you'll notice that the CIGAR string is correct, and the MD:Z field has the correct base counts (they both indicate 4 matches, 1 deletion, 114 matches), however, the base indicated as deleted in the MD:Z field is T, when it should be G. Coincidentally, T happens to be the 5th base of ref0, which made me suspect that the MD:Z is being constructed using the wrong reference. And in-fact, when I change the 5th base of ref0 to A, the MD:Z string changes to MD:Z:4^A114 to match ref0 (even though the read is mapped to ref1). Also, if I completely eliminate ref0 from my library, the correct MD:Z string is produced. Again, the composition or presence of ref0 in my library should have no bearing on the MD:Z field for my read, as it mapped to ref1. So, it appears that graphmap is using the wrong reference sequences to determine the bases that have been deleted for the MD:Z field.

The text was updated successfully, but these errors were encountered:

mdcao added a commit to mdcao/graphmap that referenced this issue May 10, 2018

fix bug with generating MD string on isovic#90

14a6859

mdcao mentioned this issue May 11, 2018

fix bug with generating MD string #91

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MD:Z SAM file tag error: Reference sequence cross-contamination #90

MD:Z SAM file tag error: Reference sequence cross-contamination #90

meyermj commented May 10, 2018

MD:Z SAM file tag error: Reference sequence cross-contamination #90

MD:Z SAM file tag error: Reference sequence cross-contamination #90

Comments

meyermj commented May 10, 2018