Truth VCF file
A "truth" VCF file is required. It lists the expected variants relative to the reference genome. This page describes the required format of the variants in the file. It assumes you are already familiar with VCF files.
Basic requirements:
- The VCF file must match the reference FASTA file. In particular, the name in the CHROM column must be the same as in the FASTA file. By default (if you do not provide your own FASTA file), the SARS-CoV-2 reference MN908947.3 is used, which means putting "MN908947.3" in the CHROM column of the truth VCF file.
- It must be a "single sample" VCF file, i.e. there is only one sample in the file, so all (non-header) lines have 10 columns.
- The GT field must be present in the FORMAT column (and the value in the sample column) in every line.
- The information in the ID and QUAL columns is not used (put what you like in there!).
- No two (or more) records can have their REF alleles overlap, since this is ambiguous and it would be impossible to decide what the truth is in this case.
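The basic requirements above can be checked before running an evaluation. The following is a minimal sketch only, not the tool's own validation: the function name `check_truth_records` is hypothetical, fields are split on any whitespace (real VCF files are tab-delimited), and every record is included in the overlap check, with no special handling of DROPPED_AMP or UNSURE records.

```python
def check_truth_records(lines, ref_name="MN908947.3"):
    """Return a list of problems found in the record lines of a truth VCF."""
    problems = []
    spans = []  # (chrom, start, end, line number); 1-based inclusive REF span
    for number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # header or blank line
        fields = line.split()  # assumption: no field contains whitespace
        if len(fields) != 10:
            problems.append(f"line {number}: expected 10 columns, found {len(fields)}")
            continue
        chrom, pos, ref, fmt = fields[0], int(fields[1]), fields[3], fields[8]
        if chrom != ref_name:
            problems.append(f"line {number}: CHROM {chrom!r} is not {ref_name!r}")
        if "GT" not in fmt.split(":"):
            problems.append(f"line {number}: GT missing from FORMAT")
        spans.append((chrom, pos, pos + len(ref) - 1, number))

    # Sort by start position and remember the furthest REF end seen so far,
    # so that a long REF allele is compared against every later record.
    spans.sort()
    current = None  # (chrom, furthest end so far, line number of that record)
    for chrom, start, end, number in spans:
        if current and current[0] == chrom and start <= current[1]:
            problems.append(f"lines {current[2]} and {number}: REF alleles overlap")
        if not current or current[0] != chrom or end > current[1]:
            current = (chrom, end, number)
    return problems
```

An empty list means none of these particular checks failed. It does not guarantee that the file is accepted.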
The types of "truth" variants are listed below, with example VCF file lines.
Notes:
- It is assumed that any position that is not in the truth VCF file is the same as the reference genome. The VCF file should have all of the expected variants.
- The FILTER column is used to mark dropped amplicons or unsure regions (see examples below). Any other filters are ignored (in particular, you do not need PASS in the FILTER column).
A "normal" SNP, saying that the truth is a T at position 100 instead of a C:
MN908947.3 100 . C T . . . GT 1/1
In most cases, the GT field must be used to indicate the truth allele.
If the GT is 0/0 then the line of the VCF file will be ignored, e.g.:
MN908947.3 100 . C T . . . GT 0/0
SNPs are allowed to be heterozygous, which is inferred from the GT value. Examples:
MN908947.3 100 . C T . . . GT 0/1
MN908947.3 200 . T C,G . . . GT 1/2
A "normal" insertion or deletion:
MN908947.3 100 . CG C . . . GT 1/1
MN908947.3 200 . T TAA . . . GT 1/1
Indels are not allowed to be heterozygous. A heterozygous indel causes an error. Examples of records that are rejected:
MN908947.3 100 . CG C . . . GT 0/1
MN908947.3 200 . T TAA . . . GT 0/1
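The GT rules described so far (0/0 is ignored, SNPs may be heterozygous, indels may not) can be summarised in a small function. This is a sketch under stated assumptions, not the tool's implementation: the name `classify_by_genotype` is hypothetical, a SNP is taken to mean that REF and every ALT allele are a single base, "heterozygous" is taken to mean that the GT alleles differ, and missing genotypes such as `./.` are not handled.

```python
def classify_by_genotype(ref, alt, gt):
    """Return 'ignored', 'homozygous' or 'heterozygous' for one record.

    Raises ValueError for a heterozygous indel.
    """
    alleles = gt.replace("|", "/").split("/")  # accept phased or unphased GT
    if all(allele == "0" for allele in alleles):
        return "ignored"  # GT 0/0: same as the reference, so nothing to do
    if len(set(alleles)) == 1:
        return "homozygous"
    # Assumption: a SNP has single-base REF and single-base ALT alleles.
    is_snp = len(ref) == 1 and all(len(a) == 1 for a in alt.split(","))
    if not is_snp:
        raise ValueError(f"heterozygous indel not allowed: {ref}>{alt} GT={gt}")
    return "heterozygous"
```

For example, `classify_by_genotype("T", "C,G", "1/2")` returns `"heterozygous"`, while `classify_by_genotype("CG", "C", "0/1")` raises `ValueError`.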
To mark a dropped amplicon, use a record with POS at the starting position of the dropped amplicon. Put DROPPED_AMP in the FILTER column, and put AMP_START and AMP_END key/values into the INFO column. These are the start and end coordinates of the dropped region. They must be zero-based, and AMP_END should be the end position (NOT one past the end!). Example, to drop an amplicon at zero-based positions 320-725 (POS is 1-based as usual, hence 321):
MN908947.3 321 . C N . DROPPED_AMP AMP_START=320;AMP_END=725 GT 1/1
The REF and ALT strings do not matter: the AMP_START and AMP_END entries dictate the region defined as dropped.
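Because POS is 1-based while AMP_START and AMP_END are zero-based and inclusive, it is easy to be off by one when writing these records by hand. The sketch below builds such a line from zero-based coordinates. The function name `dropped_amplicon_line` is hypothetical, `POS = AMP_START + 1` is inferred from the example above, and the REF base is only a placeholder since REF and ALT are not used for these records.

```python
def dropped_amplicon_line(amp_start, amp_end, chrom="MN908947.3", ref="C"):
    """Build a truth VCF line for an amplicon dropped over the zero-based,
    inclusive interval [amp_start, amp_end]."""
    if amp_start < 0 or amp_end < amp_start:
        raise ValueError("need 0 <= amp_start <= amp_end")
    pos = amp_start + 1  # VCF POS is 1-based
    info = f"AMP_START={amp_start};AMP_END={amp_end}"
    fields = [chrom, str(pos), ".", ref, "N", ".", "DROPPED_AMP", info, "GT", "1/1"]
    return "\t".join(fields)


def dropped_length(amp_start, amp_end):
    """Number of bases in the dropped region (the end is inclusive)."""
    return amp_end - amp_start + 1
```

Calling `dropped_amplicon_line(320, 725)` reproduces the example record above (tab-separated), and `dropped_length(320, 725)` gives 406 bases.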
To mark a region as "unsure", put UNSURE in the FILTER column:
MN908947.3 100 . CGG C . UNSURE . GT 1/1
When UNSURE is in the FILTER column, the GT and ALT columns are ignored.
All of the positions in the REF allele (in this case the CGG at 100-102) will be flagged as unsure (they get replaced with N in the output files).
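To make the affected coordinates concrete, the sketch below works out which positions an UNSURE record covers and masks them in a sequence string. It only illustrates the coordinate arithmetic. The function names are hypothetical, and how the tool itself writes its output files is not shown here.

```python
def unsure_positions(pos, ref):
    """Return the 1-based positions covered by the REF allele of an UNSURE record."""
    return list(range(pos, pos + len(ref)))


def mask_unsure(sequence, pos, ref):
    """Return a copy of sequence with the REF span (1-based POS) replaced by N."""
    start = pos - 1  # convert 1-based POS to a zero-based string index
    end = start + len(ref)
    return sequence[:start] + "N" * len(ref) + sequence[end:]
```

For the example record, `unsure_positions(100, "CGG")` is `[100, 101, 102]`.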