Fix vcf parsing #25

Open
wants to merge 1 commit into
base: main

xfengnefx

Hi,

Phased variants are read from the VCF file by searching for a "1|0" or "0|1" substring in each VCF record. This check should be applied only to the last column of a record (in single-sample VCF files), not to the whole record. The same applies when replacing the phasing.
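A minimal sketch of the column-restricted check, not the code in this PR: the function names (`split_sample_column`, `genotype`, `is_phased_het`, `set_genotype`) are hypothetical, and it assumes tab-delimited, single-sample records with exactly 10 columns and a `GT` key in FORMAT.

```python
def split_sample_column(record: str):
    """Split a single-sample VCF data line into (first nine fields, sample column)."""
    fields = record.rstrip("\n").split("\t")
    if len(fields) != 10:
        # Assumption: 8 fixed columns + FORMAT + exactly one sample.
        raise ValueError(f"expected 10 columns, got {len(fields)}")
    return fields[:-1], fields[-1]


def genotype(record: str) -> str:
    """Return the GT value from the sample column, located via the FORMAT keys."""
    fields, sample = split_sample_column(record)
    keys = fields[8].split(":")
    values = sample.split(":")
    # Raises ValueError if FORMAT has no GT key.
    return values[keys.index("GT")]


def is_phased_het(record: str) -> bool:
    """True only if the genotype itself is 0|1 or 1|0; INFO is never inspected."""
    return genotype(record) in ("0|1", "1|0")


def set_genotype(record: str, new_gt: str) -> str:
    """Replace GT in the sample column only, leaving every other field untouched."""
    fields, sample = split_sample_column(record)
    keys = fields[8].split(":")
    values = sample.split(":")
    values[keys.index("GT")] = new_gt
    return "\t".join(fields + [":".join(values)])


# Shortened, made-up record: INFO contains "0|1", but the genotype is unphased 1/1.
record = "\t".join([
    "chr6", "145913508", ".", "G", "A", "25.38", "PASS",
    "ANN=A|x|4308/4980|1436/1659||",
    "GT:GQ:PS", "1/1:25:.",
])
naive_hit = "0|1" in record          # whole-record search is fooled by INFO
column_hit = is_phased_het(record)   # last-column check is not
```

Here `naive_hit` is true while `column_hit` is false, which is the mismatch described above. Looking up `GT` through the FORMAT keys rather than a substring search also keeps the replacement from touching pipe-delimited annotations.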

Example: The following line is from an epi2me-labs/wf-human-variation + hapcut2 v1.3.1 run on this BAM. The variant is unphased (GT is 1/1), but the line contains a "0|1" inside its annotation field (in `4308/4980|1436/1659`), which crashes the run by calling int() on a string:

chr6 145913508 . G A 25.38 PASS P;ANN=A|synonymous_variant|LOW|SHPRH|SHPRH|transcript|XM_017010691.2|protein_coding|24/30|c.4296C>T|p.Cys1432Cys|4457/5527|4296/5235|1432/1744||,A|synonymous_variant|LOW|SHPRH|SHPRH|transcript|XM_006715439.4|protein_coding|24/31|c.4296C>T|p.Cys1432Cys|4457/11423|4296/5124|1432/1707||,A|synonymous_variant|LOW|SHPRH|SHPRH|transcript|XM_006715443.4|protein_coding|24/26|c.4296C>T|p.Cys1432Cys|4457/4780|4296/4524|1432/1507||,A|synonymous_variant|LOW|SHPRH|SHPRH|transcript|XM_017010693.2|protein_coding|24/31|c.4296C>T|p.Cys1432Cys|4457/5304|4296/5073|1432/1690||,A|synonymous_variant|LOW|SHPRH|SHPRH|transcript|XM_017010696.2|protein_coding|25/31|c.2853C>T|p.Cys951Cys|4074/5145|2853/3792|951/1263||,A|synonymous_variant|LOW|SHPRH|SHPRH|transcript|XM_024446394.1|protein_coding|25/31|c.2853C>T|p.Cys951Cys|4374/5445|2853/3792|951/1263||,A|synonymous_variant|LOW|SHPRH|SHPRH|transcript|XM_017010692.1|protein_coding|24/30|c.4296C>T|p.Cys1432Cys|4695/5765|4296/5235|1432/1744||,A|synonymous_variant|LOW|SHPRH|SHPRH|transcript|XM_024446393.1|protein_coding|25/31|c.3354C>T|p.Cys1118Cys|4590/5660|3354/4293|1118/1430||,A|synonymous_variant|LOW|SHPRH|SHPRH|transcript|XM_011535719.3|protein_coding|24/30|c.4296C>T|p.Cys1432Cys|4457/7072|4296/5034|1432/1677||,A|synonymous_variant|LOW|SHPRH|SHPRH|transcript|NM_001042683.3|protein_coding|24/30|c.4296C>T|p.Cys1432Cys|4956/7596|4296/5052|1432/1683||,A|synonymous_variant|LOW|SHPRH|SHPRH|transcript|NM_001370327.1|protein_coding|24/30|c.4296C>T|p.Cys1432Cys|4530/7170|4296/5052|1432/1683||,A|synonymous_variant|LOW|SHPRH|SHPRH|transcript|NM_001370328.1|protein_coding|26/32|c.2853C>T|p.Cys951Cys|4114/6754|2853/3609|951/1202||,A|synonymous_variant|LOW|SHPRH|SHPRH|transcript|NM_173082.4|protein_coding|24/30|c.4308C>T|p.Cys1436Cys|4968/7261|4308/4980|1436/1659||,A|downstream_gene_variant|MODIFIER|SHPRH|SHPRH|transcript|XR_002956273.1|pseudogene||n.*4666C>T|||||4666|,A|non_coding_transcript_exon_variant|MODIFIER|SHPRH|SHPRH|transcript|XR_942391.3|pseudogene|24/29|n.4457C>T||||||,A|non_coding_transcript_exon_variant|MODIFIER|SHPRH|SHPRH|transcript|XR_942393.3|pseudogene|24/29|n.4457C>T||||||,A|non_coding_transcript_exon_variant|MODIFIER|SHPRH|SHPRH|transcript|XR_942392.3|pseudogene|24/29|n.4457C>T||||||,A|non_coding_transcript_exon_variant|MODIFIER|SHPRH|SHPRH|transcript|XR_942390.3|pseudogene|24/29|n.4457C>T|||||| GT:GQ:DP:AD:AF:PS 1/1:25:89:0,86:0.9663:.

Thanks!

Phased variants are read from the VCF file by searching for a
"1|0" or "0|1" substring in each VCF record.
This should be done only on the last column.
This fix still assumes the input is a single-sample VCF.
@Fu-Yilei
Collaborator

Thank you for catching this bug. If you have tested the modified code, I can merge the PR.
