2. Requirements & pipeline configuration
This pipeline analyzes whole-genome re-sequencing data and requires a Linux system with at least 16 cores (at least 6 GB RAM each, i.e. 96 GB RAM in total) to analyze data for a species with a genome size of around 3 Gbp. GenErode has been developed to run on high-performance compute (HPC) clusters.
- Apptainer (formerly Singularity) and conda have to be installed on the system to run the pipeline.
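A quick way to verify that both tools are installed and on your `PATH`:

```bash
# Both commands should print a version string if the tools are installed.
apptainer --version
conda --version
```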
- Make sure there is enough storage space available, as the pipeline produces many large files:
- In a test run with four Sumatran rhinoceros samples, the final size of the results directory after running the optional steps BAM file subsampling, CpG filtering and all downstream analyses except the GERP step was about 750 GB, even though many intermediate files had already been automatically deleted by the pipeline during the run. However, the space required reached > 1 TB at some points during this pipeline run, so storage space flexibility is required.
- When running the GERP step, several TB of storage space will be required for intermediate files. In the first part of the GERP step, the required storage space scales linearly with the number of outgroup genomes, and in the second part with the number of re-sequenced samples.
- Moreover, the number of intermediate files in the GERP step scales linearly with the level of fragmentation (i.e. the number of contigs/scaffolds) of the reference genome; it is therefore recommended to use a reference genome that is as contiguous as possible.
- Even though GenErode removes many intermediate files during the pipeline run, not all of them are removed, and they may take up a lot of space. Once all analyses have finished and no re-runs are planned, files that are no longer required must therefore be deleted manually to free up space.
- Clone this repository to the directory where you want to run the pipeline (hereafter called `path/to/your/snakemake_run`), and rename it to suit your project. All output files will be automatically written to subdirectories therein (under `data/` and `results/`), except for files related to the reference genome and annotation, which are placed in their respective directories.
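For example, assuming the public GenErode repository on GitHub (adjust the URL if you work from a fork):

```bash
# Clone the pipeline and rename the directory for your project.
git clone https://github.com/NBISweden/GenErode.git path/to/your/snakemake_run
```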
- Create a conda environment from the yaml file `path/to/your/snakemake_run/environment.yml` (e.g. see the conda documentation). This environment contains Snakemake version 8.14.0 (Köster & Rahmann 2012), the required python3 version and all python modules required to run the pipeline, as well as the Snakemake executor plugin for slurm.

```bash
conda env create -n generode -f environment.yml
```

If you want to create the conda environment in a different location than your home directory, you can provide a path to a directory for the environment to be installed in, and run the following command instead of the one above:

```bash
conda env create -f environment.yml -p /path/to/your/project/directory/generode
```
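Activate the environment before starting the pipeline:

```bash
conda activate generode
# or, if the environment was installed to a custom location:
conda activate /path/to/your/project/directory/generode
```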
- This pipeline has been tested on HPC clusters using slurm, with the Snakemake executor plugin for slurm (https://snakemake.github.io/snakemake-plugin-catalog/plugins/executor/slurm.html). Please copy one of the example configuration files `config/slurm/profile/config_plugin_rackham.yaml` or `config/slurm/profile/config_plugin_dardel.yaml` to `slurm/config.yaml` and adjust it according to your workload manager or your slurm system. This file specifies compute resources for each rule (i.e. for each slurm job). Note that some rules are grouped into group jobs and are referred to via their group name in the configuration files. Any rule or group job that requires more resources than specified under `default-resources` is listed under `set-threads` or `set-resources`. If any rule or group job fails due to too little run time or memory, its compute resources can be increased by updating or adding entries under `set-threads` or `set-resources`. You might have to adjust computational requirements according to the size of your reference genome and the sequencing depth of your samples, since the example configuration files were set up for a mammalian reference genome and re-sequencing depths of around 10X coverage. Details on how to configure and run GenErode on Dardel are provided in `config/slurm/profile/README.md`.
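For example, to start from the Rackham template (file paths as named above):

```bash
# Copy the template closest to your cluster, then edit it for your system.
cp config/slurm/profile/config_plugin_rackham.yaml slurm/config.yaml
```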
- Prepare one metadata file for historical samples and one for modern samples as text files, following these examples (a hypothetical example file is sketched after this list). You will find example metadata files in the directory `path/to/your/snakemake_run/config/` that you can use as templates. The pipeline can also be run with only historical or only modern samples (minimum one sample).
- Columns in the metadata table file must be separated by whitespace.
- The file has to contain a header line ("samplename_index_lane readgroup_id readgroup_platform path_to_R1_fastq_file path_to_R2_fastq_file").
- Each row contains the information for one sequencing library, i.e. a sample sequenced with several sequencing libraries will be described on several rows (see the next bullet point).
- The first column provides the naming scheme for all files produced by the pipeline and has to follow the pattern `samplename_index_lane`. `samplename` should be the lab ID of the sample to make sure it is unique (e.g. the initials of the person who generated the data plus a running number, e.g. `JS001`); it is only allowed to contain alphanumeric characters. `index` stands for the PCR index, in case several reactions or library preparations were run for a given sample. `lane` is the lane number on which the library was sequenced. The delimiter between `samplename`, `index`, and `lane` has to be an underscore (`_`).
- The naming pattern for `readgroup_id` in the second column is `FLOWCELLID.LANE.INDEX` and has to be unique, too. In case two samples, each with only one PCR, were sequenced on the same lane (for example for some modern samples), the values for `index` should be different. This information will be used to create read groups for BAM files.
- The sequencing platform has to be specified in column three under `readgroup_platform` (e.g. `illumina`, `hiseq2000`, `novaseq` or `bgi`).
- In columns four and five, the full paths to the raw forward (`path_to_R1_fastq_file`) and reverse (`path_to_R2_fastq_file`) fastq files have to be provided for each sample (`samplename_index_lane`). Allowed file name extensions are `*.fastq.gz` or `*.fq.gz`. The first step of the pipeline will create symbolic links to these paths, based on the sample naming scheme in the first column.
- Save the tables, for example in the directory `path/to/your/snakemake_run/config/`.
- Whenever the pipeline is executed, the metadata files are compared to a schema (specified in `workflow/schemas/metadata.schema.yaml`) and Snakemake gives an error message for violations of the schema.
- In case samples are added to the same dataset later, just append them to the end of the existing files without renaming the metadata files and rerun the pipeline.
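As referenced above, a minimal, hypothetical metadata file for one sample sequenced on two lanes could be written like this (all IDs and paths below are made up for illustration; replace them with your own):

```bash
# Write a hypothetical metadata file; columns are separated by whitespace.
cat > config/historical_samples.txt << 'EOF'
samplename_index_lane readgroup_id readgroup_platform path_to_R1_fastq_file path_to_R2_fastq_file
JS001_P1_L1 HVKGMDSXX.1.P1 illumina /path/to/raw/JS001_P1_L1_R1.fastq.gz /path/to/raw/JS001_P1_L1_R2.fastq.gz
JS001_P1_L2 HVKGMDSXX.2.P1 illumina /path/to/raw/JS001_P1_L2_R1.fastq.gz /path/to/raw/JS001_P1_L2_R2.fastq.gz
EOF
```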
- Some programs will crash if the fasta file headers, i.e. the scaffold or contig names, are too long and/or contain special characters other than letters, numbers and underscores. Reference genomes downloaded from repositories like NCBI or ENA often have headers composed of an accession number followed by a whitespace and a long string, which will also cause some programs to crash. Therefore, check your reference genome fasta file, remove or replace any special characters, and cut the scaffold names after the accession number or after at most 10-15 characters before using the fasta file in the pipeline.
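One common way to cut headers at the first whitespace (a sketch, not part of the pipeline; check afterwards that the resulting names are unique and free of special characters):

```bash
# Keep only the part of each fasta header before the first whitespace.
awk '/^>/ {print $1; next} {print}' reference.fasta > reference_renamed.fasta
```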
- For snpEff, an annotation file of the reference genome is required in GTF2.2 format. Make sure the GTF file uses the same scaffold or contig names as the corrected fasta file. You can read more about tools for file format conversions from GFF to GTF here.
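One such conversion tool (an example, not a requirement of the pipeline) is gffread:

```bash
# Convert a GFF3 annotation to GTF format with gffread (-T selects GTF output).
gffread annotation.gff3 -T -o annotation.gtf
```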
- For snpEff annotations and for the GERP step, the reference genome should be carefully chosen to avoid a reference genome bias in comparisons of populations (through time and/or space). It is recommended to use the genome assembly from a closely related outgroup that is equally diverged from all populations under comparison and to avoid using a genome assembly of an individual from the same population. For mlRho, runs of homozygosity (ROH) and PCA, this bias is not as strong, so that a reference genome from the same species or population can be used. For GERP, a highly contiguous, scaffolded assembly should be used as reference genome if possible as intermediate file numbers scale with the number of contigs/scaffolds.
The pipeline requires the configuration file to have the file name `config/config.yaml`.
- Enter the full path to the reference genome fasta file. The reference genome can be located in a different directory than the pipeline code. Note that the file name extension has to be `.fasta` or `.fa`, and that the file name (without the extension) will be reused in output paths (subdirectory names) and file names of many analyses, so it should be short and meaningful.
- OPTIONAL: The path to a text file listing sex-chromosomal scaffolds/contigs to be excluded from mlRho, VCF files (and downstream analyses), and GERP score calculations can be provided (`sexchromosomes`).
- Enter the relative paths to the metadata tables within the pipeline directory, e.g. `config/historical_samples.txt`. If exclusively modern or exclusively historical samples are analyzed, create only the necessary table and leave the variable string for the other file empty (`""`).
- Specify which step of the pipeline to run by setting the respective step to `True` or `False` in the config file and fill out any required parameters (see this wiki page for details about each step). All analyses within one step ("rules") are automatically run when the step is set to `True`. If your system uses slurm, each rule can be submitted as a separate slurm job to the cluster by Snakemake if you run the pipeline using the slurm profile or the cluster configuration (see here).
- Each step of the pipeline depends on most steps before it, so if you set a downstream step to `True`, all required steps prior to the desired one are run automatically.
- You can avoid warnings when running Snakemake by setting only one step at a time to `True`.
- Each step generates many output files and statistics about data quality for each sample. It is therefore recommended to run the pipeline step by step and to check all output files before proceeding to the next step, to assess data quality and to detect errors more easily.
- The pipeline can be run partially if not all output files are desired (for example, only repeat identification, or only BAM file processing).
Please note that although minimum and maximum depth thresholds can be chosen in the configuration file (see this wiki section), the pipeline applies a hard-coded (absolute) minimum depth threshold of three reads per site and sample for mlRho and VCF file filtering that cannot be changed in the configuration file. An average genome-wide depth of at least 6X per sample is recommended to have enough statistical power to detect heterozygous sites.
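Once `config/config.yaml` and the slurm profile are set up, a typical launch with the slurm executor plugin might look like the following (illustrative; the exact flags depend on your setup):

```bash
# Dry run first to inspect the planned jobs, then submit via the slurm profile.
snakemake --profile slurm --dry-run
snakemake --profile slurm
```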