Truth VCF file

martinghunt edited this page Mar 3, 2022 · 3 revisions

A "truth" VCF file is required. It lists the variants that are expected relative to the reference genome. This page describes the required format of the variants in the file. It assumes you are already familiar with VCF files.

Basic requirements:

  • The VCF file must match the reference FASTA file. In particular, the name in the CHROM column must be the same as the sequence name in the FASTA file. By default (if you do not provide your own FASTA file), the SARS-CoV-2 reference MN908947.3 is used, which means putting "MN908947.3" in the CHROM column of the truth VCF file.
  • It must be a "single sample" VCF file, i.e. there is only one sample in the file, so all (non-header) lines have 10 columns.
  • The GT field must be present in the FORMAT column (and the value in the sample column) in every line.
  • The information in the ID and QUAL columns is not used (put what you like in there!)
  • The REF alleles of different records must not overlap. Overlapping records are ambiguous, and it would be impossible to decide what the truth is in this case.
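The structural requirements above can be checked before running anything else. The following is a minimal sketch, not the tool's own validation: the function name `check_truth_records` is hypothetical, lines are split on any whitespace so that the examples on this page can be pasted in (real VCF files are tab-separated), and the overlap check looks only at POS and the length of REF.

```python
def check_truth_records(lines):
    """Return a list of problems found in the non-header lines of a truth VCF.

    Illustrative only: checks the column count, GT in FORMAT, and REF overlaps.
    """
    problems = []
    spans = []  # (chrom, start, end, line_number); 1-based inclusive REF span
    for number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.split()  # assumption: whitespace-separated for this sketch
        if len(fields) != 10:
            problems.append(f"line {number}: expected 10 columns, got {len(fields)}")
            continue
        chrom, pos, ref, fmt = fields[0], int(fields[1]), fields[3], fields[8]
        if "GT" not in fmt.split(":"):
            problems.append(f"line {number}: GT missing from FORMAT")
        spans.append((chrom, pos, pos + len(ref) - 1, number))

    # Sort by chromosome and start, then compare each record with the
    # earlier record on the same chromosome that reaches furthest right.
    spans.sort()
    furthest = None
    for span in spans:
        if furthest and furthest[0] == span[0] and span[1] <= furthest[2]:
            problems.append(f"lines {furthest[3]} and {span[3]}: REF alleles overlap")
        if furthest is None or furthest[0] != span[0] or span[2] > furthest[2]:
            furthest = span
    return problems
```

An empty list means none of these three checks failed; it does not guarantee that the file is otherwise acceptable.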

The types of "truth" variants are listed below, with example VCF file lines.

Variants in the VCF file

Notes:

  • It is assumed that any position that is not in the truth VCF file is the same as the reference genome. The VCF file should have all of the expected variants.
  • The FILTER column is used to mark dropped amplicons or unsure regions (see examples below). Any other filters are ignored (in particular, you do not need PASS in the FILTER column).

SNPs

A "normal" SNP, saying that the truth is a T at position 100 instead of a C:

MN908947.3  100  .  C  T  .  .  .  GT  1/1

In most cases, the GT field must be used to indicate the truth allele. If the GT is 0/0, then the line of the VCF file is ignored, e.g.:

MN908947.3  100  .  C  T  .  .  .  GT  0/0

SNPs are allowed to be heterozygous, inferred from the GT value. Examples:

MN908947.3  100  .  C  T    .  .  .  GT  0/1
MN908947.3  200  .  T  C,G  .  .  .  GT  1/2

Indels

A "normal" insertion or deletion:

MN908947.3  100  .  CG C   .  .  .  GT  1/1
MN908947.3  200  .  T  TAA .  .  .  GT  1/1

Indels are not allowed to be heterozygous. A heterozygous indel causes an error to be thrown. Examples of records that are not allowed:

MN908947.3  100  .  CG C    .  .  .  GT  0/1
MN908947.3  200  .  T  TAA  .  .  .  GT  0/1
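The GT rules for SNPs and indels can be summarised in a few lines of code. This is a sketch under stated assumptions rather than the tool's actual logic: the names `called_alleles` and `classify` are hypothetical, a record counts as an indel here when any called allele differs in length from REF, and missing genotypes (such as `./.`) are not handled.

```python
import re


def called_alleles(ref, alt, gt):
    """Return the set of allele strings selected by a GT value such as 0/1."""
    alleles = [ref] + alt.split(",")  # index 0 is REF, 1.. are the ALTs
    return {alleles[int(index)] for index in re.split(r"[/|]", gt)}


def classify(ref, alt, gt):
    """Classify a record as 'ignored', 'hom' or 'het'.

    Raises ValueError for a heterozygous indel, which is not allowed.
    """
    called = called_alleles(ref, alt, gt)
    if called == {ref}:
        return "ignored"  # GT 0/0: same as the reference
    is_het = len(called) > 1
    # Assumption for this sketch: any length change relative to REF is an indel.
    is_indel = any(len(allele) != len(ref) for allele in called)
    if is_het and is_indel:
        raise ValueError("heterozygous indels are not allowed")
    return "het" if is_het else "hom"
```

With the example records on this page, `classify("C", "T", "1/1")` gives `"hom"`, `classify("T", "C,G", "1/2")` gives `"het"`, and `classify("CG", "C", "0/1")` raises `ValueError`.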

Dropped amplicon

To mark a dropped amplicon, use a record with POS at the starting position of the dropped amplicon. Put DROPPED_AMP in the FILTER column, and put the AMP_START and AMP_END key/value pairs in the INFO column. These are the start and end coordinates of the dropped region. They must be zero-based, and AMP_END is the last position of the region (NOT one past the end!). For example, to drop an amplicon at zero-based positions 320-725 (POS is 321 because the VCF POS column is 1-based):

MN908947.3  321  .  C  N  .  DROPPED_AMP  AMP_START=320;AMP_END=725  GT  1/1

The REF or ALT strings do not matter - the AMP_START and AMP_END entries dictate the region defined as dropped.
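Because AMP_START and AMP_END are zero-based while POS is 1-based, it is easy to be off by one when writing these records by hand. The helper below is a hypothetical sketch that builds such a line from zero-based, inclusive coordinates. The rule POS = AMP_START + 1 is inferred from the example above, and the REF, ALT and GT values are simply copied from it.

```python
def dropped_amplicon_record(chrom, amp_start, amp_end, ref_base):
    """Build a DROPPED_AMP line from zero-based, inclusive coordinates.

    ref_base goes in the REF column; its value does not affect the region.
    """
    if amp_end < amp_start:
        raise ValueError("AMP_END must not be before AMP_START")
    pos = amp_start + 1  # assumption: convert the zero-based start to 1-based POS
    info = f"AMP_START={amp_start};AMP_END={amp_end}"
    fields = [chrom, str(pos), ".", ref_base, "N", ".", "DROPPED_AMP", info, "GT", "1/1"]
    return "\t".join(fields)


def dropped_length(amp_start, amp_end):
    """Number of positions covered, since AMP_END is inclusive."""
    return amp_end - amp_start + 1
```

Calling `dropped_amplicon_record("MN908947.3", 320, 725, "C")` reproduces the example record, tab-separated, and `dropped_length(320, 725)` is 406.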

Unsure

To mark a region as "unsure", put UNSURE in the FILTER column:

MN908947.3  100  .  CGG  C  .  UNSURE  .  GT  1/1

When UNSURE is in the FILTER column, the GT and ALT columns are ignored. All of the positions in the REF allele (in this case the CGG at 100-102) will be flagged as unsure (they get replaced with N in the output files).
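The set of positions covered by an UNSURE record follows directly from POS and the length of REF. The sketch below illustrates that arithmetic and the replacement with N on a plain string. Both function names are hypothetical, and this is not how the tool itself writes its output files.

```python
def unsure_positions(pos, ref):
    """1-based positions flagged as unsure: every base of the REF allele."""
    return list(range(pos, pos + len(ref)))


def mask_unsure(sequence, pos, ref):
    """Return a copy of sequence with the REF span replaced by N.

    pos is the 1-based VCF POS; sequence is a plain string of bases.
    """
    start = pos - 1  # convert to a zero-based string index
    return sequence[:start] + "N" * len(ref) + sequence[start + len(ref):]
```

For the example record, `unsure_positions(100, "CGG")` gives `[100, 101, 102]`. On a toy sequence, `mask_unsure("ACGGT", 2, "CGG")` gives `"ANNNT"`.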
