Collection of evaluation workflows for the rad-seq-stacks snakemake workflow based on the Stacks software. Note that this evaluation was performed for my PhD thesis. For the most recent version of the analysis workflow, please refer to the rad-seq-stacks workflow that is now part of the Snakemake Workflows project.
The pipelines require a snakemake version of 5.10.0 or above and an installation of the conda package manager to run (we recommend using the miniconda Python 3.7 installer).
Lower versions result in a crash due to a WorkflowError.
Assuming conda is already installed (for installation instructions please refer to the (mini)conda website), the required version of snakemake can be installed into a conda environment as follows:
$ conda create -n snakemake-env "snakemake>=5.10.0" -c bioconda -c conda-forge
After installation, activate the environment and check the installed version:
$ conda activate snakemake-env
(snakemake-env)$ snakemake --version
5.10.0
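The version requirement can also be checked programmatically. The following is a minimal sketch, not part of the repository: the helper names `parse_version` and `is_supported` are hypothetical, and the only fact taken from this README is the minimum version 5.10.0. Versions are compared as integer tuples, because a plain string comparison would wrongly rank "5.9.1" above "5.10.0".

```python
import re

# Minimum snakemake version stated in this README.
MINIMUM = (5, 10, 0)


def parse_version(text):
    """Turn a version string such as '5.10.0' into a tuple of three ints.

    Hypothetical helper: only the first three numeric fields are used,
    and missing fields are padded with zeros ('5.10' -> (5, 10, 0)).
    """
    numbers = [int(n) for n in re.findall(r"\d+", text.strip())[:3]]
    numbers += [0] * (3 - len(numbers))
    return tuple(numbers)


def is_supported(version_text, minimum=MINIMUM):
    """Return True if the given version is at least the required minimum."""
    return parse_version(version_text) >= minimum


if __name__ == "__main__":
    # Example inputs; in practice, pass the output of `snakemake --version`.
    for candidate in ("5.9.1", "5.10.0"):
        status = "ok" if is_supported(candidate) else "too old"
        print(candidate, status)
```
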
Each folder contains a snakemake workflow that calls two instances of the rad-seq-stacks pipeline (each itself a snakemake workflow) on a simulated dataset. To perform an evaluation, navigate into the corresponding folder and call
(snakemake-env)$ snakemake --use-conda --jobs 6
to run the pipeline.
All subworkflows are restricted to 3 cores using a parameter in the config.yaml file in the respective folder.
To increase the number of cores for each subworkflow, change the value
cores_per_subworkflow: 3
and call the workflow with --jobs set to twice this value, so that each of the two subworkflows can use half of the total assigned cores.
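The relation between the config value and the --jobs argument can be sketched as follows. This is an illustrative helper, not code from the repository: the function names are hypothetical, and the factor of two reflects the two rad-seq-stacks instances that each evaluation workflow calls.

```python
def total_jobs(cores_per_subworkflow, subworkflows=2):
    """Total number of jobs so that each subworkflow gets its full share of cores."""
    return cores_per_subworkflow * subworkflows


def snakemake_command(cores_per_subworkflow):
    """Build the snakemake call matching a given cores_per_subworkflow value."""
    jobs = total_jobs(cores_per_subworkflow)
    return f"snakemake --use-conda --jobs {jobs}"


if __name__ == "__main__":
    # The default of 3 cores per subworkflow yields the call shown above.
    print(snakemake_command(3))
    # With 8 cores per subworkflow, 16 jobs are needed.
    print(snakemake_command(8))
```
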
Read data is simulated using ddRAGE. To change the number of simulated loci (or other parameters of the simulation), change the respective values in the config file:
ddrage:
    loci:
        10000
    individuals:
        # max number possible with standard BC set
        24
Note that smaller instances (fewer loci, fewer individuals, or lower coverage) will execute faster; however, some effects observed in the evaluation might not be visible with these parameters.
This repository contains five evaluations. The first four use data simulated with our ddRADseq read simulator ddRAGE and can be executed as is. For the fifth evaluation workflow we used an unpublished in-house dataset. Hence, this workflow cannot be executed without access to the data set. Once the data is published, we will add the link.
Illustrates the influence of PCR deduplication on the workflow.
Simulates a low-coverage dataset to analyze the impact of the minimum reads per locus parameter.
Simulates a high-diversity dataset to analyze the impact of locus distance parameters.
Simulates a data set with an increased number of mutations (all SNPs) to analyze the performance of different parameter sets for SNP detection.
Analysis workflow for a dataset of 315 individuals of Gammarus fossarum with a total size of 103 GB of gzipped FASTQ reads.
This dataset is not included, since it is not yet published by its owners.
Once it is published, we will provide a link to it here.
The individual names (id), barcoding information (p5 and p7 barcodes), spacer lengths, and file names are listed in the respective individuals.tsv and units.tsv files. The RADseq enzymes used are documented in the config.yaml file. The DBR sequence used is NNNNNNMMGGACG.
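To illustrate what the DBR pattern allows, here is a minimal sketch that assumes the sequence is written in IUPAC nucleotide codes (N = any base, M = A or C). That interpretation is an assumption, and the helpers `dbr_regex` and `count_variants` are hypothetical and not part of the workflow.

```python
import re

# IUPAC codes needed for this pattern (assumed notation): each code maps
# to the set of bases it allows.
IUPAC = {
    "A": "A",
    "C": "C",
    "G": "G",
    "T": "T",
    "M": "AC",
    "N": "ACGT",
}

DBR = "NNNNNNMMGGACG"  # DBR sequence given in this README


def dbr_regex(pattern):
    """Compile a regular expression that matches every instance of the pattern."""
    parts = []
    for code in pattern:
        bases = IUPAC[code]
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return re.compile("".join(parts))


def count_variants(pattern):
    """Number of distinct sequences the pattern can produce."""
    total = 1
    for code in pattern:
        total *= len(IUPAC[code])
    return total


if __name__ == "__main__":
    regex = dbr_regex(DBR)
    print(regex.pattern)
    print(count_variants(DBR))  # 4**6 * 2**2 distinct DBRs
    print(bool(regex.fullmatch("ACGTACACGGACG")))  # valid: MM positions are A, C
    print(bool(regex.fullmatch("ACGTACGTGGACG")))  # invalid: MM positions are G, T
```

Under this reading, six free positions and two two-way positions give 4^6 * 2^2 = 16384 distinguishable DBRs.
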