Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Malformed VCF files #43

Open
lsantuari opened this issue Nov 25, 2020 · 3 comments
Open

Malformed VCF files #43

lsantuari opened this issue Nov 25, 2020 · 3 comments
Assignees
Labels
data Problems with test data

Comments

@lsantuari
Copy link
Collaborator

SURVIVOR simSV generates a malformed VCF file, in particular for translocations (TRA).
We need to include first a VCF->BEDPE conversion with this script and a BEDPE->VCF conversion with this script.

@lsantuari lsantuari added bug Something isn't working data Problems with test data and removed bug Something isn't working labels Nov 25, 2020
@arnikz
Copy link
Contributor

arnikz commented Nov 25, 2020

Please provide a concrete example(s). I would opt for a single script (e.g., in R) that performs both steps (VCF->BEDPE-> VCF). We could add VariantAnnotation:writeVcf() to the former script.

@lsantuari
Copy link
Collaborator Author

lsantuari commented Nov 26, 2020

We follow the VCF specification version v4.3.

The example here is the htz-sv.vcf file generated with sv-gen and used as test data in sv-channels.
The reference sequence contains 2Mb of chromosome 12 ( [44000000-46000000] and 2Mb of chromosome 22 ([44000000-46000000]).

VCF header

htz-sv.vcf

##fileformat=VCFv4.2
##source=Sniffles
##fileDate=20200325
##contig=<ID=12,length=2274011>
##contig=<ID=22,length=2263095>
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=INVDUP,Description="InvertedDUP with unknown boundaries">
##ALT=<ID=TRA,Description="Translocation">
##ALT=<ID=INS,Description="Insertion">
##INFO=<ID=CHR2,Number=1,Type=String,Description="Chromosome for END coordinate in case of a translocation">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the structural variant">
##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Length of the SV">
##INFO=<ID=SVMETHOD,Number=1,Type=String,Description="Type of approach used to detect SV">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=AF,Number=.,Type=Integer,Description="Allele Frequency.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT

It should be:

##fileformat=VCFv4.3
##source=SURVIVOR
##fileDate=20201126
##FILTER=<ID=PASS,Description="All filters passed">
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=INS,Description="Novel sequence insertion">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=DUP:TANDEM,Description="Tandem duplication">
##ALT=<ID=BND,Description="Inter-chromosomal translocation">
##INFO=<ID=PRECISE,Number=0,Type=Flag,Description="Precise structural variation">
##INFO=<ID=MATEID,Number=1,Type=String,Description="ID of partner breakend">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=SVMETHOD,Number=1,Type=String,Description="Type of approach used to detect SV">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=12,length=2000000>
##contig=<ID=22,length=2000000>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	HTZ
  • POS is the position before the SV event
  • REF is the reference base at position POS

SV calls

DEL

htz-sv

12	1828176	DEL1003SURVIVOR	N	<DEL>	.	LowQual	PRECISE;SVTYPE=DEL;SVMETHOD=SURVIVOR_sim;CHR2=12;END=1828344;SVLEN=168	GT:GL:GQ:FT:RC:DR:DV:RR:RV	1/1

It should be:

12	1828176	HTZ_1004	C	<DEL>	.	PASS	PRECISE;SVTYPE=DEL;SVMETHOD=SURVIVOR_simSV;END=1828344;SVLEN=168	GT	0/1

INS

htz-sv

22	840803	INS1000SURVIVOR	N	<INS>	.	LowQual	PRECISE;SVTYPE=INS;SVMETHOD=SURVIVOR_sim;CHR2=22;END=840965;SVLEN=162	GT:GL:GQ:FT:RC:DR:DV:RR:RV	1/1

It should be:

22	840803	HTZ_1001	T	<INS>	.	PASS	PRECISE;SVTYPE=INS;SVMETHOD=SURVIVOR_simSV;END=840803;SVLEN=162	GT	0/1

to complete:

INV

htz-sv

22	749910	INV3000SURVIVOR	N	<INV>	.	LowQual	PRECISE;SVTYPE=INV;SVMETHOD=SURVIVOR_sim;CHR2=22;END=750143;SVLEN=233	GT:GL:GQ:FT:RC:DR:DV:RR:RV	1/1

It should be:

22	749910	HTZ_5001	T	<INV>	.	PASS	PRECISE;SVTYPE=INV;END=750143;SVLEN=234	GT	0/1

DUP:TANDEM

htz-sv

12	151198	DUP0SURVIVOR	N	<DUP>	.	LowQual	PRECISE;SVTYPE=DUP;SVMETHOD=SURVIVOR_sim;CHR2=12;END=151405;SVLEN=207	GT:GL:GQ:FT:RC:DR:DV:RR:RV	1/1

It should be:

12	151197	HTZ_1	A	<DUP:TANDEM>	.	PASS	PRECISE;SVTYPE=DUP;END=151404;SVLEN=207	GT	0/1

BND (breakend notation for TRA)

htz-sv

22	455127	TRA4000SURVIVOR	N	<TRA>	.	LowQual	PRECISE;SVTYPE=TRA;SVMETHOD=SURVIVOR_sim;CHR2=12;END=90798;SVLEN=-364329	GT:GL:GQ:FT:RC:DR:DV:RR:RV	1/1
22	455473	TRA4000SURVIVOR	N	<TRA>	.	LowQual	PRECISE;SVTYPE=TRA;SVMETHOD=SURVIVOR_sim;CHR2=12;END=91144;SVLEN=-364329	GT:GL:GQ:FT:RC:DR:DV:RR:RV	1/1

It should be:

12	90797	HTZ_3975_1	C	C[22:455126[	1	PASS	SVTYPE=BND;MATEID=HTZ_3975_4	GT	0/1
12	90797	HTZ_3975_2	G	]22:455126]G	1	PASS	SVTYPE=BND;MATEID=HTZ_3975_3	GT	0/1
22	455126	HTZ_3975_3	A	A[12:90797[	1	PASS	SVTYPE=BND;MATEID=HTZ_3975_2	GT	0/1
22	455126	HTZ_3975_4	T	]12:90797]C	1	PASS	SVTYPE=BND;MATEID=HTZ_3975_1	GT	0/1
12	91144	HTZ_3976_1	C	C[22:455473[	1	PASS	SVTYPE=BND;MATEID=HTZ_3976_4	GT	0/1
12	91144	HTZ_3976_2	G	]22:455473]G	1	PASS	SVTYPE=BND;MATEID=HTZ_3976_3	GT	0/1
22	455473	HTZ_3976_3	T	T[12:91144[	1	PASS	SVTYPE=BND;MATEID=HTZ_3976_2	GT	0/1
22	455473	HTZ_3976_4	A	]12:91144]C	1	PASS	SVTYPE=BND;MATEID=HTZ_3976_1	GT	0/1

The positions should be ordered first by chromosome (as listed in the ##contig list in the header) and then by position.
The GT field is:

  • 0/1 for unphased heterozygous (HTZ-SV)
  • 1/1 for unphased homozygous (HMZ-SV)

We could opt to encode all SVs in the BND notation as it is done in GRIDSS.

@arnikz
Copy link
Contributor

arnikz commented Nov 26, 2020

We need to include first a VCF->BEDPE conversion with this script and a BEDPE->VCF conversion with this script.
We follow the VCF specification version v4.3.

The latter script refers to VCF v4.2 (but v4.3 in the original script).

We could opt to encode all SVs in the BND notation as it is done in GRIDSS

Good point. Could you add a reference to the code snippet that does this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data Problems with test data
Projects
None yet
Development

No branches or pull requests

2 participants