Pypgen provides utilities for estimating standard genetic diversity measures, including Gst, G'st, G''st, and Jost's D, from large genomic datasets (Hedrick, 2005; Jost, 2008; Nei, 1973; Nei & Chesser, 1983). Pypgen operates on individual SNPs as well as on user-defined regions (e.g., five-kilobase windows tiled across each chromosome). For the windowed analyses, pypgen estimates the multi-locus version of each estimator.
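
To make the estimators above concrete, here is a minimal single-locus sketch based on the textbook definitions. It is not pypgen's own code: the function name `differentiation_stats` is hypothetical, allele frequencies are assumed to be known exactly, and populations are weighted equally. The sample-size corrections of Nei & Chesser (1983) and the multi-locus versions are deliberately left out.

```python
def differentiation_stats(pop_freqs):
    """Single-locus Gst, G'st, G''st and Jost's D (illustrative sketch only).

    pop_freqs: one list of allele frequencies per population; each list
    covers the same alleles in the same order and sums to 1.
    Assumes exact frequencies (no sample-size correction).
    """
    k = len(pop_freqs)
    if k < 2:
        raise ValueError("need at least two populations")
    n_alleles = len(pop_freqs[0])

    # Hs: mean expected heterozygosity within populations.
    hs = sum(1.0 - sum(p * p for p in freqs) for freqs in pop_freqs) / k

    # Ht: expected heterozygosity of the pooled (unweighted mean) frequencies.
    mean_freqs = [sum(freqs[i] for freqs in pop_freqs) / float(k)
                  for i in range(n_alleles)]
    ht = 1.0 - sum(p * p for p in mean_freqs)

    if ht == 0.0:
        # Monomorphic locus: the ratios below are undefined.
        return None

    gst = (ht - hs) / ht                                    # Nei (1973)
    g_prime = gst * (k - 1 + hs) / ((k - 1) * (1.0 - hs))   # Hedrick (2005)
    g_dprime = k * (ht - hs) / ((k * ht - hs) * (1.0 - hs))
    jost_d = ((ht - hs) / (1.0 - hs)) * (k / (k - 1.0))     # Jost (2008)
    return {"Gst": gst, "G'st": g_prime, "G''st": g_dprime, "D": jost_d}


# Two populations fixed for different alleles: complete differentiation.
fixed = differentiation_stats([[1.0, 0.0], [0.0, 1.0]])

# Two populations with identical frequencies: no differentiation.
same = differentiation_stats([[0.5, 0.5], [0.5, 0.5]])
```

In this sketch all four statistics equal 1 for the `fixed` example and 0 for the `same` example; intermediate cases are where the estimators diverge from one another.
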
- Handles multiallelic SNP calls
- Allows a single VCF file to contain multiple populations
- Operates on standard VCF (Variant Call Format) formatted SNP calls
- Uses bgzipped input for fast random access
- Takes advantage of multiple processor cores
- Calculates additional metrics:
    - SNP count per window
    - mean read depth (+/- standard deviation) per window
    - populations with fixed alleles per SNP
    - more as I think of them
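
As a rough illustration of the per-window metrics listed above, the sketch below tiles fixed-size windows across a chromosome and summarizes the SNPs that fall in each one. The helper names `tile_windows` and `window_summary` are hypothetical, and the choices made here (1-based inclusive coordinates, sample standard deviation, SNPs given as `(position, read_depth)` tuples) are assumptions rather than a description of how pypgen itself does it.

```python
import math


def tile_windows(chrom_length, window_size=5000):
    """Yield 1-based, inclusive (start, stop) windows tiled across a chromosome."""
    start = 1
    while start <= chrom_length:
        # The last window is truncated at the chromosome end.
        stop = min(start + window_size - 1, chrom_length)
        yield (start, stop)
        start = stop + 1


def window_summary(snps, start, stop):
    """Summarize SNPs within [start, stop].

    snps: iterable of (position, read_depth) tuples (assumed input format).
    Returns SNP count, mean depth, and sample standard deviation of depth.
    """
    depths = [float(depth) for pos, depth in snps if start <= pos <= stop]
    n = len(depths)
    if n == 0:
        return {"snp_count": 0, "mean_depth": None, "depth_stdev": None}
    mean = sum(depths) / n
    # The sample standard deviation is undefined for a single observation.
    stdev = None
    if n > 1:
        stdev = math.sqrt(sum((d - mean) ** 2 for d in depths) / (n - 1))
    return {"snp_count": n, "mean_depth": mean, "depth_stdev": stdev}


# Toy data: three SNPs on a 12 kb chromosome.
snps = [(100, 10), (200, 14), (6000, 8)]
windows = list(tile_windows(12000))
summaries = [window_summary(snps, start, stop) for start, stop in windows]
```

Because each window is summarized independently, a loop like this is also the natural unit of work to hand out to multiple processor cores.
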
PYPGEN IS STILL IN ACTIVE DEVELOPMENT AND ALMOST CERTAINLY CONTAINS BUGS. If you find a bug, please file a report in the issues section of the GitHub repository and I'll address it as soon as I can.
Included scripts:

- Sliding window analysis (vcf_sliding_window.py)
- Per SNP analysis (vcf_snpwise_fstats.py)

Requirements:

- OS X or Linux
- Python 2.7
- pysam and samtools
Detailed documentation is available on ReadTheDocs. It includes a tutorial and installation instructions.