Whole-genome re-sequencing to examine genetic changes in a population of Ithaca, NY honeybees using samples collected in 1977 and 2011
This repo contains the part of the analysis that was performed on the cluster. Downstream analyses done in R and plotting done in Python have not yet been added.
Genomes were sequenced on an Illumina HiSeq, using genomic libraries prepared without PCR. In addition to the Ithaca samples, bees were included from populations in Arizona and Chiapas (Africanized) and from Hawaii, Korea and Japan (non-Africanized).
Some of the steps are parallelized on an SGE cluster.
The first step was to align the reads to the reference genome using bowtie2, and then to realign reads around indels using GATK.
- create a file of limits for GATK, corresponding to the 16 major chromosomes
- this was piped to data/scaffolds_long.txt
- perform base quality recalibration using known SNP sites from NCBI and validated sites kindly provided by Greg Hunt
- starting with mapped fragments, call genotypes for all samples
- perform variant quality score recalibration to filter low-quality SNPs
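
As a rough illustration of the interval-file step above, here is a minimal Python sketch that builds one interval per chromosome from a FASTA index (`.fai`). It is not the script used in this repo. The function name `fai_to_intervals` and the sequence names (`Group1`, `GroupUn1`, ...) are hypothetical; the only assumptions about formats are that the `.fai` file is tab-delimited with the sequence name and length in the first two columns, and that intervals are written in the `name:start-stop` form.

```python
from typing import Iterable, List, Set


def fai_to_intervals(fai_lines: Iterable[str], keep: Set[str]) -> List[str]:
    """Return 'name:1-length' intervals for the sequences named in `keep`.

    `fai_lines` are lines of a FASTA index: tab-delimited, with the
    sequence name in column 1 and its length in column 2.
    """
    intervals = []
    for line in fai_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        name, length = fields[0], int(fields[1])
        if name in keep:
            intervals.append(f"{name}:1-{length}")
    return intervals


if __name__ == "__main__":
    # Hypothetical index contents; real names and lengths come from the reference.
    example_fai = [
        "Group1\t1000\t8\t60\t61",
        "GroupUn1\t250\t1033\t60\t61",
        "Group2\t800\t1296\t60\t61",
    ]
    for interval in fai_to_intervals(example_fai, {"Group1", "Group2"}):
        print(interval)
```

Passing the set of chromosome names explicitly keeps the choice of which scaffolds count as "major" outside the function.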
SNP frequency measurement using ANGSD
- compute minor allele frequencies for old and modern populations, and conduct likelihood ratio tests for significant changes
- intersect minor allele frequency files for old and modern populations
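
The intersection step above can be sketched in Python as follows. This is an illustration only, not the code used in this repo. It assumes each frequency file is tab-delimited with a header containing `chromo` and `position` columns and a frequency column (here assumed to be called `knownEM`); adjust the column names to match the actual files. The function names are hypothetical.

```python
import csv
import io
from typing import Dict, List, TextIO, Tuple

Site = Tuple[str, int]  # (chromosome, position)


def read_sites(handle: TextIO, freq_column: str = "knownEM") -> Dict[Site, float]:
    """Map each (chromosome, position) in a tab-delimited file to its frequency."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        (row["chromo"], int(row["position"])): float(row[freq_column])
        for row in reader
    }


def intersect_sites(
    old: Dict[Site, float], new: Dict[Site, float]
) -> List[Tuple[str, int, float, float]]:
    """Return sites present in both populations with both frequencies.

    Output is sorted by chromosome name (as a string) and then by position.
    """
    shared = sorted(old.keys() & new.keys())
    return [(chrom, pos, old[(chrom, pos)], new[(chrom, pos)]) for chrom, pos in shared]


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    old_text = "chromo\tposition\tknownEM\nGroup1\t10\t0.10\nGroup1\t20\t0.30\n"
    new_text = "chromo\tposition\tknownEM\nGroup1\t20\t0.45\nGroup2\t5\t0.20\n"
    old = read_sites(io.StringIO(old_text))
    new = read_sites(io.StringIO(new_text))
    for row in intersect_sites(old, new):
        print(*row, sep="\t")
```

Sites called in only one population are dropped, so only positions with a frequency estimate in both time points remain for comparison.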
Imputation and association testing using BEAGLE
- convert GATK VCF to BEAGLE format
- phase genotypes and impute missing values
- association testing on imputed haplotypes, looking for evidence of selection between old and modern populations
- This is a parallel analysis to likelihood ratio testing with ANGSD
- extract haplotypes from BEAGLE results
- calculate Fst between populations with European and African ancestry using vcftools
- note: output files manually moved into the data directory
- attempt to compute Fst using ngsutils
- this approach has not worked, because the number of SNP calls differs between samples
- I have given up on this for now, focusing instead on the vcftools analysis
- generate BEAGLE-formatted data from ngs count data
- use NGSadmix to infer ancestral population clusters
- compute covariance matrix using posterior probabilities of genotypes computed by angsd.sh
- intersect BEAGLE and ANGSD results
- iEHH (using rehh package in R)
- visualize data
- look at genes in BEAGLE haplotype blocks
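
Looking up genes in haplotype blocks amounts to an interval-overlap query. The sketch below shows one simple way to do it in Python; it is not the code used in this repo. The `Interval` type, the function name, and the example coordinates are hypothetical, and coordinates are assumed to be 1-based and inclusive at both ends.

```python
from typing import List, NamedTuple, Tuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    name: str


def genes_in_blocks(
    genes: List[Interval], blocks: List[Interval]
) -> List[Tuple[str, str]]:
    """Return (block name, gene name) pairs for every gene overlapping a block.

    Two closed intervals on the same chromosome overlap when each one
    starts no later than the other one ends.
    """
    hits = []
    for block in blocks:
        for gene in genes:
            same_chrom = gene.chrom == block.chrom
            if same_chrom and gene.start <= block.end and block.start <= gene.end:
                hits.append((block.name, gene.name))
    return hits


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    blocks = [Interval("Group1", 100, 200, "block1")]
    genes = [
        Interval("Group1", 150, 300, "geneA"),  # overlaps block1
        Interval("Group1", 201, 250, "geneB"),  # starts just after block1
        Interval("Group2", 150, 160, "geneC"),  # different chromosome
    ]
    print(genes_in_blocks(genes, blocks))
```

The nested loop is quadratic in the number of intervals, which is fine for a handful of blocks; for genome-wide queries a sorted sweep or a dedicated interval tool would be a better fit.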