Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MD:Z SAM file tag error: Reference sequence cross-contamination #90

Open
meyermj opened this issue May 10, 2018 · 0 comments
Open

MD:Z SAM file tag error: Reference sequence cross-contamination #90

meyermj opened this issue May 10, 2018 · 0 comments

Comments

@meyermj
Copy link

meyermj commented May 10, 2018

This bug is in VN:v0.5.2 compiled on Nov 15 2017 at 11:37:33.

The MD:Z tag in the SAM output can be a very helpful supplement to the CIGAR string field because it details which bases have been inserted, deleted, and mismatched, whereas the CIGAR string only details the positions of indels. Because of the extra information supplied in the MD:Z field, one can re-create full alignments of reads to reference from the SAM file only (without having to re-read sequences from the reference genome). I however, I have noticed some irregularities in the MD:Z field produced by graphmap. Below I will detail how to reproduce the error I'm seeing.

I created a reference file of two unrelated, random sequences.

>ref0
TACGTACGTAGCTAGCTAGCTAGCTAGCTAGCTAGCTACGATCGATCGATCGATCGTTATTCGAGCGCGCGCGAAACGTATCGATCGTGCGCATCGTAGCTAGCTAGCTACGTACGTAC
>ref1
CGATGCTAGCTGATCGATCAAATTCACGTAGCTAGCTAGCTAGCTACTCGTAGCTACTGACGACTGACGTACTGACTGATCGATCGATCGAGCATCGATCGATCGATCGATCGATCGAC

I then created a single simulated read from ref1, with a single deletion of the 5th base (G).

>myRead
CGATCTAGCTGATCGATCAAATTCACGTAGCTAGCTAGCTAGCTACTCGTAGCTACTGACGACTGACGTACTGACTGATCGATCGATCGAGCATCGATCGATCGATCGATCGATCGAC

I used the following command to align myRead to the reference file containing both ref1 and ref2.

graphmap align -r refs.fasta -d read.fasta -o out.sam

Here is the SAM output:

myRead	0	ref1	1	40	4M1D114M	*	0	0	CGATCTAGCTGATCGATCAAATTCACGTAGCTAGCTAGCTAGCTACTCGTAGCTACTGACGACTGACGTACTGACTGATCGATCGATCGAGCATCGATCGATCGATCGATCGATCGAC	*	MD:Z:4^T114	NM:i:1	AS:i:582	H0:i:1	ZE:f:1.20556e-38	ZF:f:0.827464	ZQ:i:118	ZR:i:119

Of importance, you'll notice that the CIGAR string is correct, and the MD:Z field has the correct base counts (they both indicate 4 matches, 1 deletion, 114 matches), however, the base indicated as deleted in the MD:Z field is T, when it should be G. Coincidentally, T happens to be the 5th base of ref0, which made me suspect that the MD:Z is being constructed using the wrong reference. And in-fact, when I change the 5th base of ref0 to A, the MD:Z string changes to MD:Z:4^A114 to match ref0 (even though the read is mapped to ref1). Also, if I completely eliminate ref0 from my library, the correct MD:Z string is produced. Again, the composition or presence of ref0 in my library should have no bearing on the MD:Z field for my read, as it mapped to ref1. So, it appears that graphmap is using the wrong reference sequences to determine the bases that have been deleted for the MD:Z field.

mdcao added a commit to mdcao/graphmap that referenced this issue May 10, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant