Code used to analyze whole-genome sequencing data of giraffe in Coimbra et al. (2023):
- Coimbra RTF, Winter S, Muneza A, Fennessy S, Otiende M, Mijele D, Masiaine S, Stacy-Dawes J, Fennessy J, Janke A (2023) Genomic analysis reveals limited hybridization among three giraffe species in Kenya. BMC Biology, 21:215. DOI: https://doi.org/10.1186/s12915-023-01722-y
Note: The plotting scripts do not necessarily reproduce the figures exactly as shown in the paper. In some cases, I used a free image editing software, namely Krita, to assemble independent plots and add or correct plot annotations.
workflow.txt
: describes the steps and the context in which the scripts described below were used for processing and analyzing the whole-genome sequencing data of giraffe.
trim_reads_v2.sh
andtrim_reads_v2-sra.sh
: trim paired-end reads with fastp.
map_reads_v2.sh
: map reads against the Masai giraffe assembly with BWA and sort the output BAMs with samtools.mark_duplicates.sh
: mark PCR/optical duplicate reads with Picard MarkDuplicates.mapping_flagstats.sh
: calculate mapping statistics with samtools.realigner_target_creator.sh
: create a list of target intervals for indel realignment with GATK.indel_realigner.sh
: perform local realignment around indels with GATK.clean_bams_v3.sh
: remove bad reads (flags 4, 256, 512, or 1024) from BAM files and keep only properly paired reads (flag 2) mapped to non-repetitive regions in autosomes with samtools.
site_depth_sample.sh
: calculate the mean site depth per sample with mosdepth.site_depth_global.sh
: calculate the global site depth for multiple giraffe BAMs with sambamba.site_depth_stats_v2.py
: calculate summary statistics from the global site depth distribution (i.e. 5th and 95th percentiles, median, and median absolute deviation) with Python (NumPy and SciPy).
snp_calling_v2.sh
: estimate genotype likelihoods with ANGSD.genotype_calling.sh
: call genotypes with bcftools.
calc_pairwise_ld.sh
: calculate pairwise linkage disequilibrium with ngsLD.ld_decay_multi.sh
: fit LD decay curves for multiple species with fit_LDdecay.R.ld_pruning.sh
: prune linked sites with prune_graph.pl and extract unlinked sites from ANGSD's genotype likelihoods file.process_pruned_snps.sh
: combine pruned SNPs of all species, split them into separate files by autosome, and index resulting files for ANGSD.
infer_relatedness.sh
: infer pairwise relatedness among individuals based with NGSadmix and NGSremix.plot_figureS1.R
: plot the relatedness coefficient as a heatmap and find the number of individuals to exclude (and identify them) for increasing thresholds of relatedness. Note: this script was co-developed with Emma Vinson.
pcangsd_hwe.sh
: estimate covariance matrix and perform a Hardy-Weinberg equilibrium test with PCAngsd.ngsadmix.sh
: estimate admixture proportions with NGSadmix.evaladmix.sh
: calculate a matrix of pairwise correlation of residuals between individuals using evalAdmix to test the goodness of fit of the data to an admixture model.
generate_snp_phylo.sh
: infer a phylogeny based on a random subset of SNPs from a BCF file using bcftools, vcflib, vcf2phylip, and IQ-TREE.
- See
workflow.txt
.
infer_admixture_graphs.sh
: converts a VCF file into TreeMix input with PLINK andplink2treemix
, estimates admixture graphs with either TreeMix or OrientAGraph allowing for a range of migration edges (m) with multiple bootstrap replicates, and finds the replicate with the highest likelihood per m value.estimate_fbranch.sh
: calculate Patterson’s D, f4-ratio, and f-branch statistics with Dsuite.
run_ba3-snps.sh
: estimate contemporary migration rates based on a random subset of SNPs from a BCF file using bcftools, vcflib, stacks, stacksStr2immanc.pl, BA3-SNPS-autotune and BA3-SNPS.
create_fasta.sh
: create genome consensus sequence in FASTA format with sambamba and ANGSD.estimate_sfs.sh
: estimate the unfolded site frequency spectrum (SFS) with ANGSD and realSFS.
- Figure 1: QGIS +
plot_figure1b-e.R
+visFuns_modified.R
(script slightly modified fromvisFuns.R
from evalAdmix) - Figure 2:
plot_figure2.R
- Figure 3:
plot_figure3a.R
+estimate_fbranch.sh
+plot_figure3c.R
- Figure 4:
plot_figure4.R
- Figure S1:
plot_figureS1.R
- Figure S2:
plot_figureS2.R
- Figure S3:
plot_figureS3.R
- Figure S4:
plot_figureS4.R
+visFuns_modified.R
(script slightly modified fromvisFuns.R
from evalAdmix) - Figure S5:
plot_figureS5.R
- Figure S6:
plot_figureS6.R
- Figure S7:
ld_decay_multi.sh
- Figure S8:
plot_figureS8.R