Online resource "The first steps toward a global pandemic: Reconstructing the demographic history of parasite host switches in its native range"
This repository aims at providing an online resource for the preprint
Using whole-genome sequencing, (i) we compared sympatric populations genetic diversity, and (ii) estimated the demographic parameters of independent host switches in the Apis honey bee ectoparasites mites: Varroa destructor and V. jacobsoni. The following sections described how the bioinformatics analysis were processed and how to replicate them with available genome data reads, Snakemake pipeline and other file formats or codes necessary.
- Varroa mite sequencing reads are available under the DDBJ DNA Data Bank of Japan BioProject PRJDB9195
- GCF_002443255.1 Vdes_3.0 Reference genome can be directly downloaded from NCBI developed by Techer et al. 2020. Comms. Biology
- Honey bees and Varroa mites fasta sequences and Bowtie2 index build are available in the
ref2020
folder on DRYAD DOI TO ADD
- Genomics analysis are summarized into a Snakemake pipeline is available in the file
Snakefile
, along with parameters filecluster.json
and launchersnakemake.slurm
. - All scripts called in
Snakefile
are present in thescripts
folder. - Interactive visualization of sampling distribution and details is available here.
- R markdown including the code lines used to generate the interactive maps, PCAs analysis and bootstraping for the demographic inferences can be found in
R_data
.
- Variant calling files (Raw, filtered and LD_pruned) as well as some lists (samples ID, SNP list, ) DRYAD DOI to add
- Fastsimcoal2 input files for demographic scenarios and SFS subsets are available in
demography
. - FASTA files generated for mitochondrial analysis (COX1 and COX1-COX3-ATP6-CYTB) and compared to Varroa reference sequences are available in
alignment_mtDNA
Samtools
: version samtools/1.3.1 used here.
Mashtree
Bowtie2
: version bowtie2/2.2 used here.
NextGenMap
: version NextGenMap/0.5.0 used here.
VariantBam
: version VariantBam/1.4.3 used here.
Freebayes
: version freebayes/1.1.0 used here.
VCFtools
: version vcftools/0.1.12b used here.
vcflib
: version vcflib/1.0.0-rc1 used here.
PLINK
: version plink/1.90b3.36 used here.
R
: version R/3.4.2 used here.
NGSadmix
: version NGSadmix/32 used here.
The first 72 lines of the Snakemake
file are setting in which environment we ran our analysis (i.e., which folders the raw reads data can be found, and the paths of future generated resulting analysis data). We computed most analysis on Sango, OIST Super Computer Center.
You can change the targets of rule all
depending on which particular steps of the pipeline you want to proceed following Snakemake's documentation.
For example, if you want to generate the mapped read .bam files from the original .fastq files downloaded on NCBI:
- Place the fastq files in a
reads
folder - Samples name will be recognized according to line 45 for which fastq R1 files end with
{sample}-R1_001.fastq.gz
- Apply rule all as
expand(outDir + "/alignments/ngm/{sample}.bam", sample = SAMPLES)
Or, if you want to generate the variant call file from NextGenMap reads apply:
outDir + "/var/ngm/filtered.vcf"
for which outDir
is the path of your working folder.
All files are available in the DRYAD repository.
We designed six evolutionary scenario using fastsimcoal2
to choose the most likely and estimate demographic parameters such as the size of founding population at time of independent host switch events by V. destructor and V. jacobsoni from A. cerana to A. mellifera.
Additionally to the complete manual available for fastsimcoal2
written by Laurent Excoffier, we recommend the helpful active google group forum.
An excellent tutorial on fastsimcoal2 usage on genomics data can be found here, hosted on the speciationgenomics Github page by Mark Ravinet & Joana Meier.
easySFS was used to generate SFS input files from our vcf files (developed by Isaac Overcast).
SFS-scripts were used to plot observed and expected 2D-SFS under each scenario model (developed by David Marques).
Here, we described the command lines used for one demographic model (templates_vjac_MUTINB/14_mig_botgwt/mut_8e-10
) with the SFS generated from all sites subset for V. destructor (VDES32by22_folded_50kbvcf2sfs.txt
). The mig_botgwt_TEMPLATE_jointMAFpop1_0.obs
file should be place in the same folder as the same named .tpl and .est files.
Briefly the mig_botgwt_TEMPLATE.tpl
file draw the scenario with :
- A two populations model between V. destructor mites on original host A. cerana with a current effective population size
NVAC1
andNVAM0
for the novel host A. mellifera. - Observed SFS Data were projected using easySFS with 17 haploid genomes for both A. mellifera (population 0) and A. cerana (population 1) mites.
- A growth rate GAM since the host switch event only for the novel host (expansion biologically known).
- Two migration matrices were given for the population split and after.
- We considered a single historical event
TJUMP
ending atTBOTEND
from which a number of haploid mitesNBOTAM
splited from A. cerana population to found the new A. mellifera population. - The mutation rate was proposed following preliminary de novo mutations estimations.
In the mig_botgwt_TEMPLATE.est
file, most parameters were sampled in a uniform distribution (NB: upper limit does not constitute a maximum bound for fsc26, see manual).
We copied-named mig_botgwt_TEMPLATE_XXX.est
, mig_botgwt_TEMPLATE_XXX.tpl
, mig_botgwt_TEMPLATE_XXX_jointMAFpop1_0.obs
100 times for which XXX is in {1..100}. Using an array bash script we then ran the following command for each replicate.
#!/bin/bash
#SBATCH --job-name=vd-migbotgwt
#SBATCH --partition=XXX
#SBATCH --mem=5G
#SBATCH -c 10
#SBATCH --time=3:00:00
number=$SLURM_ARRAY_TASK_ID
[PATH_TO_FASTSIMCOAL2]/fsc26 --tplfile mig_botgwt_TEMPLATE_"$number".tpl --estfile mig_botgwt_TEMPLATE_"$number".est -m --numsims 1000000 --maxlhood 0.001 --minnumloops 20 --numloops 100 -c 10
Then to extract the best results from each run, we simply apply the following command lines:
cat mig_botgwt_TEMPLATE_1/mig_botgwt_TEMPLATE_1.bestlhoods >> scenario_migbotgwt.txt
for i in {2..100}; do sed -n 2p mig_botgwt_TEMPLATE_"$i"/mig_botgwt_TEMPLATE_"$i".bestlhoods >> scenario_migbotgwt.txt; done
for i in {1..100}; do cat mig_botgwt_TEMPLATE_"$i"/mig_botgwt_TEMPLATE_"$i".bestlhoods >> scenario_migbotgwt.txt; done
cat scenario_migbotgwt.txt | wc -l
# to check that we have the results for the 100 replicates
cat scenario_migbotgwt.txt | sort -k 10nr
# the first line is then the scenatio with the lowest MaxEstLhood
For the best scenario, bootstraps from the best replicate were performed following the tutorial in the manual (page 56-57). We simulated 100 SFS datasets from the best output .par
file after modifying it to generate a DNA sequence data, using:
[PATH_TO_FASTSIMCOAL2]/NUMBER=YYY
# where NUMBER is the replicate number with the lowest MaxEstLhood
fsc26 -i mig_botgwt_TEMPLATE_${NUMBER}_boot.par -n100 -j -m -s0 -x -I -q
We then repeat the parameters estimation 100 times for each of the 100 simulated SFS mig_botgwt_TEMPLATE_${NUMBER}_boot_jointMAFpop1_0.obs
using the same initial command lines.
The best replicate for each simulated SFS run is then used to obtain the confidence interval of NVAM0
, NVAC1
, TJUMP
, NBOTAM
, GAM
and migration rates.
Questions about the data or scripts? please contact either:
Maeva Techer, JSPS Postdoctoral Fellow, Ecology and Evolution lab @OIST
OIST Email: [email protected]
Gmail: [email protected]
Alexander (Sasha) Mikheyev, Adjunct Professor Ecology and Evolution @OIST, Associate Professor Evolutionary Genomics@ANU
OIST Email: [email protected]
ANU Email: [email protected]