vcf-isec

A simple Python implementation of Variant Call Format (VCF) intersection and complements.

Background

Bioinformaticians store variants identified by next-generation sequencing in a VCF file. The VCF specification was originally maintained by the 1000 Genomes Project; it is now maintained by the Global Alliance for Genomics and Health Data Working Group file format team.

Specifications for VCF v4.1 can be found here.

Essentially, each variant occupies its own line in the VCF. The chromosome, position, reference base(s), and alternate base(s) identified at that position are found in columns 1, 2, 4, and 5 (CHROM, POS, REF, and ALT), respectively. Additional information about the variant is listed in the remaining fields.
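As an illustration of the column layout described above, here is a minimal sketch that pulls the four identifying fields out of one data line. The function name `variant_key` and the example line are hypothetical and are not taken from this repository's script; the sketch assumes a tab-separated line, as the VCF specification requires.

```python
def variant_key(line):
    """Return (chrom, pos, ref, alt) for one tab-separated VCF data line.

    Hypothetical helper for illustration only.
    """
    fields = line.rstrip("\n").split("\t")
    # 1-based columns 1, 2, 4, 5 are CHROM, POS, REF, ALT,
    # i.e. 0-based indexes 0, 1, 3, 4.
    return fields[0], int(fields[1]), fields[3], fields[4]


# Made-up data line: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO
example = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14\n"
key = variant_key(example)
print(key)
```

Using the four fields together as a tuple gives a hashable key, which is one straightforward way to decide whether two lines describe the same variant.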

Task

A common task for bioinformaticians is comparing variants, whether across VCF files generated by different analytical pipelines or between related individuals.

This script takes two VCFs as input and compares the variants found in each file. It outputs three VCFs: one with the variants shared by both inputs, and one for each input containing the variants unique to it.
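The comparison can be sketched roughly as follows. This is not the repository's actual implementation: the names `variant_key`, `index_variants`, and `isec` are hypothetical, the inputs are in-memory lists of lines rather than files, and two variants are assumed to match when their chromosome, position, reference, and alternate fields are all identical.

```python
def variant_key(line):
    """Identify a variant by its CHROM, POS, REF, ALT columns (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], fields[1], fields[3], fields[4]


def index_variants(lines):
    """Map variant key -> original line, skipping blank and header ('#') lines."""
    return {
        variant_key(line): line
        for line in lines
        if line.strip() and not line.startswith("#")
    }


def isec(lines_a, lines_b):
    """Return (shared, only_a, only_b) as lists of VCF data lines."""
    a = index_variants(lines_a)
    b = index_variants(lines_b)
    shared = [a[k] for k in a if k in b]      # intersection, lines taken from A
    only_a = [a[k] for k in a if k not in b]  # complement: in A but not in B
    only_b = [b[k] for k in b if k not in a]  # complement: in B but not in A
    return shared, only_a, only_b


# Made-up inputs for illustration
header = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
]
vcf_a = header + [
    "1\t100\t.\tA\tG\t50\tPASS\t.\n",
    "1\t200\t.\tC\tT\t50\tPASS\t.\n",
]
vcf_b = header + [
    "1\t200\t.\tC\tT\t60\tPASS\t.\n",
    "2\t300\t.\tG\tGA\t60\tPASS\t.\n",
]

shared, only_a, only_b = isec(vcf_a, vcf_b)
print(len(shared), len(only_a), len(only_b))
```

In this sketch both inputs are indexed in memory and only the four key columns are compared, so two lines that differ in QUAL or INFO still count as the same variant. Writing each returned list to its own file, preceded by the appropriate header lines, would produce the three output VCFs.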

NOTE: An example VCF is provided at tests/resources/sample.vcf. VCFs can be large: in the case of whole-genome sequencing, a single file can contain up to 4 million variants.