
2. Requirements & pipeline configuration


Requirements

This pipeline analyzes whole-genome re-sequencing data and requires a Linux system with at least 16 cores (min. 6 GB RAM each, i.e. 96 GB RAM in total) to analyze data for a species with a genome size of around 3 Gbp. GenErode has been developed to run on high-performance computing (HPC) clusters.

  1. Apptainer (formerly Singularity) and conda have to be installed on the system to run the pipeline.

  2. Make sure there is storage space available as the pipeline produces a lot of large files:

  • In a test run with four Sumatran rhinoceros samples, the final size of the results directory was about 750 GB after running the optional steps BAM file subsampling and CpG filtering plus all downstream analyses except the GERP step, even though the pipeline had already deleted many intermediate files automatically during the run. However, disk usage peaked above 1 TB at some points during this pipeline run, so some flexibility in storage space is required.
  • When running the GERP step, several TB of storage space will be required for intermediate files. In the first part of the GERP step, the required storage space scales linearly with the number of outgroup genomes, and in the second part with the number of re-sequenced samples.
  • Moreover, the number of intermediate files in the GERP step scales linearly with the level of fragmentation (i.e. the number of contigs/scaffolds) of the reference genome; it is therefore recommended to use a reference genome that is as contiguous as possible.
  • Even though GenErode removes many intermediate files during the pipeline run, not all of them are removed, and the remaining files may take up a lot of space. Once all analyses have been run and no re-runs are planned, manually delete all files that are no longer required to free up space.
  3. Clone this repository to the directory where you want to run the pipeline (hereafter called path/to/your/snakemake_run) and rename it to suit your project (see the example command below). All output files will be automatically written to subdirectories therein (under data/ and results/), except for files related to the reference genome and annotation, which are placed in their respective directories.
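
For example, a minimal clone command. The URL below is an assumption based on the GenErode project on GitHub (adjust it if you use a fork or a specific release); the target directory name doubles as the project-specific rename:

git clone https://github.com/NBISweden/GenErode.git path/to/your/snakemake_run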

  4. Create a conda environment from the yaml file path/to/your/snakemake_run/environment.yml (e.g. see the conda documentation). This environment contains Snakemake version 8.14.0 (Köster & Rahmann 2012), the required python3 version and all required python modules to run the pipeline, as well as the Snakemake executor plugin for slurm.

conda env create -n generode -f environment.yml

If you want to create the conda environment in a different location than your home directory, you can provide a path to a directory for the environment to be installed in, and run the following command instead of the command above:

conda env create -f environment.yml -p /path/to/your/project/directory/generode
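
Once created, the environment has to be activated before running the pipeline; this is standard conda usage (use conda activate /path/to/your/project/directory/generode instead if you installed the environment under a custom path):

conda activate generode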

  5. This pipeline has been tested on HPC clusters using slurm, via the Snakemake executor plugin for slurm (https://snakemake.github.io/snakemake-plugin-catalog/plugins/executor/slurm.html). Please copy one of the example configuration files config/slurm/profile/config_plugin_rackham.yaml or config/slurm/profile/config_plugin_dardel.yaml to slurm/config.yaml and adjust it to your slurm system. This file specifies compute resources for each rule (i.e. for each slurm job). Note that some rules are grouped into group jobs and are referred to via their group name in the configuration files. Any rule or group job that requires more resources than specified under default-resources is listed under set-threads or set-resources. If any rule or group job fails due to too little run time or memory, its compute resources can be increased by updating or adding entries under set-threads or set-resources (a minimal profile sketch follows the note below). You may have to adjust the computational requirements to the size of your reference genome and the sequencing depth of your samples, since the example configuration files were set up for a mammalian reference genome and re-sequencing depths of around 10X coverage.

Details on how to configure and run GenErode on Dardel are provided in config/slurm/profile/README.md.
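
As an illustration, a minimal profile sketch in the format of the Snakemake slurm executor plugin. The account and the rule name map_historical are hypothetical placeholders; take the real rule names, group names, and resource values from the example configuration files:

executor: slurm
jobs: 100
default-resources:
  slurm_account: "your_account"   # hypothetical slurm account
  runtime: 300                    # minutes
  mem_mb: 6400
set-threads:
  map_historical: 8               # hypothetical rule name
set-resources:
  map_historical:
    runtime: 1440
    mem_mb: 51200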

Configure the pipeline

1) Prepare metadata files of samples

  • Prepare one metadata file for historical samples and one for modern samples as plain text files. You will find example metadata files in the directory path/to/your/snakemake_run/config/ that you can use as templates (a minimal example also follows this list). The pipeline can also be run with only historical or only modern samples (minimum one sample).
  • Columns in the metadata table file must be separated by whitespace.
  • The file has to contain a header line ("samplename_index_lane readgroup_id readgroup_platform path_to_R1_fastq_file path_to_R2_fastq_file").
  • Each row contains the information for one sequencing library, i.e. a sample sequenced with several sequencing libraries is described on several rows (see the next bullet point).
  • The first column provides the naming scheme for all files produced by the pipeline and has to follow the pattern samplename_index_lane. samplename should be the lab ID of the sample to make sure it is unique (e.g. the initials of the person who generated the data plus a running number, such as JS001); it may only contain alphanumeric characters. index stands for the PCR index, in case several reactions or library preparations were run for a given sample. lane is the lane number on which the library was sequenced. The delimiter between samplename, index, and lane has to be an underscore (_).
  • The naming pattern for readgroup_id in the second column is FLOWCELLID.LANE.INDEX and has to be unique, too. If two samples for which only one PCR each was performed were sequenced on the same lane (for example, some modern samples), their values for index should still differ. This information will be used to create read groups for BAM files.
  • The sequencing platform has to be specified in column three under readgroup_platform (e.g. illumina, hiseq2000, novaseq or bgi).
  • In columns four and five, the full paths to the raw forward (path_to_R1_fastq_file) and reverse (path_to_R2_fastq_file) fastq-files have to be provided for each sample (samplename_index_lane). Allowed file name extensions are *.fastq.gz or *.fq.gz. The first step of the pipeline will create symbolic links to these paths, based on the sample naming scheme in the first column.
  • Save the tables, for example in the directory path/to/your/snakemake_run/config/.
  • Whenever the pipeline is executed, the metadata files are compared to a schema (specified in workflow/schemas/metadata.schema.yaml) and Snakemake gives an error message for violations of the schema.
  • In case samples are added later to the same dataset, just append them to the end of the existing files without renaming the metadata files and rerun the pipeline.
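
A minimal, hypothetical metadata file following the rules above (all values are placeholders; columns are whitespace-separated; the two rows describe one sample sequenced on two lanes):

samplename_index_lane readgroup_id readgroup_platform path_to_R1_fastq_file path_to_R2_fastq_file
JS001_i5_2 HGJLNDSX.2.i5 illumina /path/to/raw/JS001_i5_L002_R1.fastq.gz /path/to/raw/JS001_i5_L002_R2.fastq.gz
JS001_i5_3 HGJLNDSX.3.i5 illumina /path/to/raw/JS001_i5_L003_R1.fastq.gz /path/to/raw/JS001_i5_L003_R2.fastq.gz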

2) Check the reference genome fasta file and the GTF file

  • Some programs will crash if the fasta file headers, i.e. the scaffold or contig names, are too long and/or contain any special characters other than letters, numbers and underscores. Reference genomes downloaded from repositories like NCBI or ENA often have headers composed of an accession number followed by a whitespace and a long string, which will also cause some programs to crash. Therefore, check your reference genome fasta file, remove or replace any special characters, and truncate the scaffold names after the accession number or after at most 10-15 characters before using the fasta file in the pipeline (see the example commands after this list).
  • For snpEff, an annotation file of the reference genome is required in GTF2.2 format. Make sure the GTF file uses the same scaffold or contig names as the corrected fasta file. Several tools are available for file format conversion from GFF to GTF.
  • For snpEff annotations and for the GERP step, the reference genome should be carefully chosen to avoid a reference genome bias in comparisons of populations (through time and/or space). It is recommended to use the genome assembly from a closely related outgroup that is equally diverged from all populations under comparison and to avoid using a genome assembly of an individual from the same population. For mlRho, runs of homozygosity (ROH) and PCA, this bias is not as strong, so that a reference genome from the same species or population can be used. For GERP, a highly contiguous, scaffolded assembly should be used as reference genome if possible as intermediate file numbers scale with the number of contigs/scaffolds.
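
As an illustration, example commands for these two checks (file names are placeholders; gffread is one of several tools that can convert GFF3 to GTF):

# truncate fasta headers at the first whitespace, keeping e.g. the accession number
sed '/^>/ s/ .*//' reference.fasta > reference_fixed.fasta

# convert a GFF3 annotation to GTF (-T selects GTF output)
gffread annotation.gff3 -T -o annotation.gtf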

3) Edit the file "config/config.yaml"

The pipeline requires the configuration file to have the file name config/config.yaml.

  • Enter the full path to the reference genome fasta file. The reference genome can be located in a different directory than the pipeline code. Note that the file name extension has to be .fasta or .fa, and that the file name (without the extension) will be reused in output paths (subdirectory names) and file names of many analyses, so it should be short and meaningful.
  • OPTIONAL: The path to a text file listing sex-chromosomal scaffolds/contigs to be excluded from mlRho, VCF files (and downstream analyses), and GERP score calculations can be provided (sexchromosomes).
  • Enter the relative paths to the metadata tables within the pipeline directory, e.g. config/historical_samples.txt. If exclusively modern or exclusively historical samples are analyzed, create only the necessary table and leave the variable string for the other file empty ("").
  • Specify which step of the pipeline to run by setting the respective step to True or False in the config file and fill out any required parameters (see this wiki page for details about each step; a sketch of typical config entries follows this list). All analyses within one step ("rules") are automatically run when the step is set to True. If your system uses slurm, Snakemake can submit each rule as a separate slurm job to the cluster if you run the pipeline using the slurm profile or the cluster configuration (see here).
  • Each step of the pipeline depends on most of the preceding steps, so if you set a downstream step to True, all required steps prior to the desired one are run automatically.
  • You can avoid warnings when running Snakemake by setting only one step at a time to True.
  • Each step generates a lot of output files and statistics about data quality for each sample. It is therefore recommended to run the pipeline step by step and to check all output files before proceeding to the next step to assess data quality and to detect errors more easily.
  • The pipeline can be run partially if not all output files are desired (for example, only repeat identification, or only BAM file processing).
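
As an illustration, a hypothetical excerpt from config/config.yaml (all key names and step flags shown here are placeholders; the authoritative keys and their documentation are the comments inside config/config.yaml itself):

# reference genome fasta file (extension must be .fasta or .fa)
ref_path: "/path/to/reference/genome.fasta"

# relative paths to the metadata tables within the pipeline directory ("" if unused)
historical_samples: "config/historical_samples.txt"
modern_samples: ""

# pipeline steps are switched on and off with True/False
fastq_processing: True   # hypothetical step flag
mapping: False           # hypothetical step flag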

Please note that although minimum and maximum depth thresholds can be chosen in the configuration file (see this wiki section), the pipeline applies a hard-coded (absolute) minimum depth threshold of three reads per site and sample for mlRho and VCF file filtering; this threshold cannot be changed in the configuration file. An average genome-wide depth of at least 6X per sample is recommended to provide enough statistical power to detect heterozygous sites.