
Number of variants in VCF and HTML summary do not match

Pablo Cingolani edited this page Aug 11, 2017 · 5 revisions

First of all, SnpEff is probably giving you the right numbers. The mismatch is most likely not a bug, but a simple interpretation issue.

How to count variants / annotations properly

It is important to remember that the VCF format specification allows having multiple variants in a single line. Also, a single variant can have more than one annotation, due to:

  • Multiple transcripts (isoforms) of a gene.
  • Multiple (overlapping) genes in the genomic location of the variant.
  • The variant spanning multiple genes (e.g. a translocation, large deletion, etc.)

When you count the number of variants, you must keep all these in mind to count them properly. Obviously, SnpEff does take all this into account when counting the variants for the summary HTML.
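As a rough illustration of the first point (several variants on one line), the sketch below compares the number of data lines with the number of ALT alleles. The function names are hypothetical, and it assumes a tab-delimited VCF whose fifth column (ALT) lists multiple alleles separated by commas. It is not SnpEff's own counting logic, only a way to see why the two numbers can differ.

```shell
# Hypothetical helpers, not part of SnpEff.

# Count data lines (everything that is not a '#' header line).
count_data_lines() {
    grep -vc '^#' "$1"
}

# Count ALT alleles: a line with ALT "A,T" contributes 2.
# Assumes tab-separated columns with ALT in column 5; "." means no ALT.
count_alt_alleles() {
    awk -F'\t' '!/^#/ && $5 != "." { n += split($5, alleles, ",") } END { print n + 0 }' "$1"
}
```

For example, `count_data_lines file.vcf` and `count_alt_alleles file.vcf` give different numbers as soon as the file contains at least one multi-allelic line.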

Typical counting mistake

Many people who report a mismatch between the number of variants in the summary (HTML) file and the number of variants in the VCF file are simply miscounting, because they overlook one or more of the points above.

A typical example is people "counting missense variants" using something like this:

grep missense file.vcf | wc -l

This counts "lines in a VCF file that have at least one missense variant", as opposed to "missense annotations". As mentioned previously, the number of lines in a VCF file is not the same as the number of annotations or the number of variants.
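To make the difference concrete, here is a minimal sketch that counts both quantities. The function names are hypothetical, and the pattern assumes annotations in the `ANN` format, where the effect is written as the Sequence Ontology term `missense_variant`. Header lines are skipped so that only data lines are considered. The result is not guaranteed to match the HTML summary exactly, since SnpEff's own counting may apply further rules.

```shell
# Hypothetical helpers, not part of SnpEff.
# Assumption: annotations use the term "missense_variant" (ANN format).

# Number of data lines with at least one missense annotation.
count_missense_lines() {
    grep -v '^#' "$1" | grep -c 'missense_variant'
}

# Number of missense annotations: every occurrence is counted,
# so a line annotated for two transcripts contributes 2.
count_missense_annotations() {
    grep -v '^#' "$1" | grep -o 'missense_variant' | wc -l | tr -d ' '
}
```

Running `count_missense_lines file.vcf` and `count_missense_annotations file.vcf` on the same file typically gives a larger number for the second, because a single line can carry several missense annotations.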