Clone the repo:
git clone [email protected]:sschmeier/vcfcompile.git
- Python 3
- Otherwise nothing special. Uses only standard libs for now.
Simple script to read a bunch of vcf-files with SNPs and find the list of unique SNPs. For each variant, extract for each file the variant's QD or QUAL value and put in table form. Gives an overview of overlapping variants and quality values in different samples.
Prints to standard out. Some stats go to standard error.
Untested on very large vcf-files. In the future need to implement cyvcf for speed. Right now its used for filtered SNPs.
python vcfcompile.py --snpeff data/*.vcf(.gz) > table.txt
CHROM | POS | ID | REF | ALT | GENES | FILE1.vcf.gz | FILE2.vcf.gz | ... |
---|---|---|---|---|---|---|---|---|
chr17 | 16382069 | rs1060079 | T | C | UBB:HIGH;UBB:LOW | 2.99 | 3.64 | ... |
... |
For a vcf-file that was compiled with gatk3 CombineVariants
.
The idea here is that the same sample was processed by different callers and the vcf is the combined file, we can use this script to investigate numbers number of variants called by any combination of callers.
python vcfSetStats.py file.vcf.gz > table.tsv
A table with caller combination, number of callers, number of variants called, pct of variants called.
- Make use of cyvcf (https://github.com/arq5x/cyvcf) for speed.
MIT, 2018-2019, copyright Sebastian Schmeier [email protected] // https://www.sschmeier.com