3. Detailed documentation of the different analysis steps of the pipeline
Note that software versions for the different GenErode versions are listed on the page https://github.com/NBISweden/GenErode/wiki/9.-Changelog.
Reference genome preparation

Required for most downstream steps and automatically run when activating downstream steps, i.e. it does not have to be separately activated in the config file.
- Indexes the reference genome assembly with bwa (using `-a bwtsw` for large genomes).
- Indexes the reference genome assembly with samtools faidx.
- Creates a FASTA dict for the reference genome assembly using Picard CreateSequenceDictionary.
- Generates a genome file for later use in BEDTools using custom code.
- Generates a BED file with the coordinates of all contigs/scaffolds of the reference genome from the samtools index file using custom code.
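For orientation, the indexing steps above correspond roughly to the following commands (a minimal sketch, assuming a reference file named `reference.fasta` and the bioconda `picard` wrapper; the exact invocations are defined in the pipeline's Snakemake rules):

```bash
# bwa index of the reference (bwtsw algorithm for large genomes)
bwa index -a bwtsw reference.fasta

# samtools FASTA index; the .fai file is also the basis for the
# BEDTools genome file and the contig BED file below
samtools faidx reference.fasta

# sequence dictionary for Picard/GATK
picard CreateSequenceDictionary R=reference.fasta O=reference.dict

# BEDTools genome file (contig name and length) and BED file of all contigs
cut -f1,2 reference.fasta.fai > reference.genome
awk -v OFS='\t' '{print $1, 0, $2}' reference.fasta.fai > reference.bed
```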
Output file locations:
- Original directory containing the reference genome FASTA file as specified in the config file. That way this step is run only once for a given reference FASTA file.
Repeat element identification

Required for filtering of BAM and VCF files. This step is automatically run when any step requiring repeat BED files for repeat filtering is set to `True` (see below), but it can also be run independently from the rest of the pipeline.
- Converts all lowercase bases in the reference assembly FASTA file to uppercase using custom code.
- Runs RepeatModeler with the flag `-quick` on the uppercase reference assembly FASTA file to identify repeat elements.
- Runs RepeatMasker on the uppercase reference assembly FASTA file using the RepeatModeler output library of de novo detected repeat elements and the DFAM blast database.
- Generates a sorted BED file of repeat elements from the RepeatMasker `*.out` file using custom code, i.e. containing all sites of the genome within repeats (`*.repeats.bed`).
- Subtracts the BED file of repeats from the BED file of the reference genome in BEDTools to generate a repeat-masked BED file of the genome, i.e. containing all sites of the genome excluding repeats (`*.repma.bed`).
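The BED file arithmetic of the last two steps can be sketched as follows (hypothetical file names; the conversion of the RepeatMasker `*.out` file to BED is done by custom code in the pipeline):

```bash
# sort the BED file of repeat elements
sort -k1,1 -k2,2n reference.repeats.bed > reference.repeats.sorted.bed

# subtract repeats from the whole-genome BED file to obtain
# the repeat-masked BED file (sites outside repeats)
bedtools subtract -a reference.bed -b reference.repeats.sorted.bed > reference.repma.bed
```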
Output file locations:
- Original directory containing the reference genome FASTA file as specified in the config file. That way this step is run only once for a given reference FASTA file.
- RepeatModeler: `repeatmodeler/[reference_filename]` within the reference FASTA file directory. A detailed log of the analysis can be found in `repeatmodeler/[reference_filename]/RM.log` and the final results file is written to `repeatmodeler/[reference_filename]/RM_raw.out/consensi.fa.classified`.
- RepeatMasker: `repeatmasker/[reference_filename]` within the reference FASTA file directory.
Data that can be deleted after the pipeline has finished successfully:
- The directory `repeatmodeler/[reference_filename]/RM_[ID].[DATE]/` has to be deleted manually as soon as RepeatModeler has finished, as it contains many large intermediate files and is not automatically detected by Snakemake.
FASTQ file processing: trimming, merging, quality control

Required for all downstream steps.
- Creates symbolic links to the original raw short read data, based on the metadata files provided by the user.
- Runs FastQC on raw reads.
- Runs adapter trimming and read merging for historical samples using fastp (FASTQ files containing unmerged reads are marked as temporary files and are automatically deleted by the pipeline as soon as they are no longer required). Fastp is run with merging of overlapping paired-end reads (`--overlap_len_require 15 --overlap_diff_limit 1`), automatic adapter detection and removal, default quality trimming (Phred score 15), and automatic detection of NovaSeq and NextSeq samples with poly-G tail removal for those samples.
- Runs adapter trimming for modern samples using fastp, with automatic adapter detection and removal, default quality trimming (Phred score 15), and automatic detection of NovaSeq and NextSeq samples with poly-G tail removal for those samples.
- Runs FastQC on trimmed (and merged) reads (for historical samples, FastQC will also be run on unmerged reads).
- Summarizes FastQC results (before and after trimming) and fastp reports using MultiQC.
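A minimal sketch of the fastp call for a historical sample, assuming paired-end files `sample_R1.fastq.gz`/`sample_R2.fastq.gz` and the default minimum read length of 30 bp (the pipeline builds the actual command from the config file and the metadata tables):

```bash
fastp \
  --in1 sample_R1.fastq.gz --in2 sample_R2.fastq.gz \
  --merge --merged_out sample_merged.fastq.gz \
  --out1 sample_unmerged_R1.fastq.gz --out2 sample_unmerged_R2.fastq.gz \
  --overlap_len_require 15 --overlap_diff_limit 1 \
  --length_required 30 \
  --json sample.fastp.json --html sample.fastp.html
```

For modern samples, the corresponding call is run without the merging options.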
Config file:
- Adjust minimum read length for fastp if desired (default: 30 bp), separately for historical and modern samples.
Output file locations:
- Symbolic links to raw data: `data/raw_reads_symlinks/historical/` and `data/raw_reads_symlinks/modern/`.
- Statistics on raw data: `data/raw_reads_symlinks/historical/stats/` and `data/raw_reads_symlinks/modern/stats/` and therein.
- Trimmed and merged reads from historical samples (FASTQ format): `results/historical/trimming/`.
- Trimmed reads from modern samples (FASTQ format): `results/modern/trimming/`.
- Statistics on trimming: `results/historical/trimming/stats/` and `results/modern/trimming/stats/` and therein.
Mapping to mitochondrial genomes

The output files of this step are not required for any of the downstream steps, so this analysis is not automatically included when running any of the following steps of the pipeline. It has to be set to `True` separately from the downstream steps in the config file, and the next step (mapping) can only be started once this step is finished. The output from this step can be used to assess whether any of the ancient/historical samples shows elevated read depth on one of the included mitochondrial genomes relative to the mitochondrial genome of the target species, which could indicate contamination. All BAM files of merged reads mapped to the mitochondrial genome of the target species are merged into one BAM file for optional analysis outside of the pipeline. Mitochondrial genomes of species for comparative analysis are provided in the directory `data/mitogenomes/` (chicken Gallus gallus: NC_001323.1, cow Bos taurus: NC_006853.1, human Homo sapiens: NC_012920.1, mouse Mus musculus: NC_005089.1, pig Sus scrofa: NC_000845.1).
- Creates a symbolic link to the mitochondrial genome FASTA file of the species under investigation.
- Indexes the mitochondrial genomes with bwa.
- Maps merged reads of historical samples to the mitochondrial genome of the species under investigation, and to the human, chicken, cow, pig and mouse mitochondrial genomes, using ancient-DNA-specific parameters in bwa aln (`-l 16500 -n 0.01 -o 2`; Palkopoulou et al. 2015) and samtools, and sorts the BAM files (temporary files, automatically deleted by the pipeline as soon as they are no longer required).
- OPTIONAL: Maps unmerged reads of historical samples using the same approach as for merged reads (temporary files, automatically deleted by the pipeline as soon as they are no longer required).
- Merges all BAM files of merged reads mapped to the mitochondrial genome of the target species into one BAM file using samtools for downstream analyses outside the pipeline.
- Calculates basic summary statistics on the BAM files using samtools.
- Calculates ratios of reads mapped to the human, chicken, cow, pig and mouse mitochondrial genomes relative to reads mapped to the mitochondrial genome of the target species, to normalize the read counts, using custom code. Note that if a BAM file does not contain any mapped reads, `+1` is added to the number of mapped reads to avoid division by zero.
- Runs QualiMap on all BAM files that contain at least 100 reads (to avoid QualiMap crashing).
- Summarizes samtools flagstat and QualiMap statistics using MultiQC.
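The mapping of merged reads to a mitochondrial genome follows the classic bwa aln/samse pattern; a sketch with hypothetical file names:

```bash
# align merged historical reads with the aDNA-specific settings
# (Palkopoulou et al. 2015)
bwa aln -l 16500 -n 0.01 -o 2 mito.fasta sample_merged.fastq.gz > sample.mito.sai

# convert to a sorted, indexed BAM file and collect flagstat statistics
bwa samse mito.fasta sample.mito.sai sample_merged.fastq.gz | \
  samtools sort -o sample.mito.sorted.bam -
samtools index sample.mito.sorted.bam
samtools flagstat sample.mito.sorted.bam > sample.mito.sorted.bam.flagstat.txt
```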
Config file:
- Enter the full path (incl. file name) to the mitochondrial genome FASTA file of the species under investigation. The file name will be reused in the pipeline and can have the file name extension `*.fasta`, `*.fa` or `*.fna`. The variable can be left empty (`""`) if this optional analysis is not run.
- Specify if you also want to map the unmerged reads to the mitochondrial genomes by setting `map_unmerged_reads` to `True`.
Output file locations:
- BAM files: `results/historical/mitogenomes_mapping/`.
- Statistics on mapping: `results/historical/mitogenomes_mapping/stats/` and therein.
Mapping to the reference genome

Required for all downstream steps.
- Extracts read group information from metadata files and formats it using custom code (temporary files, automatically deleted by the pipeline as soon as they are no longer required).
- Maps merged and trimmed short read data of historical samples to the reference assembly using bwa aln with ancient-DNA-specific parameters (`-l 16500 -n 0.01 -o 2`; Palkopoulou et al. 2015), creating SAI files (temporary files, automatically deleted by the pipeline as soon as they are no longer required).
- Converts SAI files of historical samples into BAM files using bwa samse and samtools. During this step, read group information based on the metadata file is added to each BAM file, and the file is sorted.
- Maps trimmed short read data of modern samples to the reference assembly using bwa mem with default settings, marking shorter split hits as secondary for Picard compatibility (parameter `-M`). During this step, read group information based on the metadata file is added to each BAM file, and the file is sorted.
- Indexes the BAM files with samtools.
- Calculates basic summary statistics on the BAM files using samtools.
- Runs QualiMap on the BAM files.
- Summarizes samtools flagstat and QualiMap statistics using MultiQC, separately for modern and historical samples.
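For modern samples, the bwa mem call looks roughly like this (a sketch; the read group string is built by the pipeline from the metadata file):

```bash
bwa mem -M \
  -R '@RG\tID:sample1_index1_lane1\tSM:sample1\tLB:index1\tPL:ILLUMINA' \
  reference.fasta sample1_R1.trimmed.fastq.gz sample1_R2.trimmed.fastq.gz | \
  samtools sort -o sample1.sorted.bam -
samtools index sample1.sorted.bam
```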
Output file locations:
- BAM files: `results/historical/mapping/[reference_filename]/` and `results/modern/mapping/[reference_filename]/`.
- Statistics on BAM files: `results/historical/mapping/[reference_filename]/stats/bams_sorted/` and `results/modern/mapping/[reference_filename]/stats/bams_sorted/` and therein.
BAM file processing: merging per index, duplicate removal, merging per sample, indel realignment, BAM statistics
Required for all downstream steps.
- Merges sorted BAM files from the same sample and the same PCR/index but from different lanes into one BAM file per sample and PCR/index using samtools. If the sample was sequenced on only one lane, the pipeline creates a copy of the corresponding sorted BAM file and renames it to match the output file name scheme of merged BAM files using custom code (temporary files, automatically deleted by the pipeline as soon as they are no longer required).
- Removes duplicates from BAM files of historical samples using samtools and a Python script provided by Pontus Skoglund that checks both ends of a read (`workflow/scripts/samremovedup.py`). The script was modified for GenErode to also print unmapped reads to the output (temporary files, automatically deleted by the pipeline as soon as they are no longer required).
- Marks duplicates in BAM files of modern samples using MarkDuplicates in Picard (temporary files, automatically deleted by the pipeline as soon as they are no longer required).
- Merges BAM files after duplicate removal/marking from the same sample but from several PCRs/indices into one BAM file per sample using samtools. If only one PCR/index was sequenced for the sample, the pipeline creates a copy of the corresponding BAM file and renames it to match the output file name scheme of merged BAM files using custom code (temporary files, automatically deleted by the pipeline as soon as they are no longer required).
- Creates a realignment target list for indel (insertion-deletion) realignment in GATK.
- Realigns indels in GATK.
- Calculates the arithmetic mean of the genome-wide depth for each realigned BAM file using samtools and custom code, excluding repeat elements and restricted to reads of minimum mapping quality 30 and bases of minimum base quality 30. Sites with zero coverage (missing data) can be included or excluded, as specified in the config file. Minimum and maximum depth thresholds are then calculated from the genome-wide mean depth for each sample, based on factors provided in the config file by which the mean depth is multiplied. These thresholds are used as input for depth filtering in downstream analyses.
- Plots a histogram of the depth of coverage for each sample using a custom script, with vertical lines marking the mean genome-wide depth and the minimum and maximum depth thresholds calculated for each sample; the plot is included in the pipeline report.
- Indexes each of the BAM files using Picard.
- Indexes each of the BAM files using samtools.
- Calculates basic summary statistics on each of the BAM files using samtools.
- Runs QualiMap on each of the BAM files.
- Runs FastQC on the realigned BAM files.
- Summarizes samtools flagstat and QualiMap (and FastQC) statistics using MultiQC, separately for modern and historical samples, at each filtering level.
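The processing chain per sample can be sketched as follows (hypothetical file names; GATK 3.x syntax via the bioconda `gatk3` wrapper is assumed for the indel realignment, matching the RealignerTargetCreator/IndelRealigner workflow described above):

```bash
# merge per-index BAM files into one BAM file per sample
samtools merge sample.merged.bam sample_index1.rmdup.bam sample_index2.rmdup.bam

# indel realignment (GATK 3.x)
gatk3 -T RealignerTargetCreator -R reference.fasta -I sample.merged.bam -o sample.intervals
gatk3 -T IndelRealigner -R reference.fasta -I sample.merged.bam \
  -targetIntervals sample.intervals -o sample.realn.bam

# mean genome-wide depth outside repeats (mapping quality >= 30, base quality >= 30);
# with 'samtools depth -a', zero-coverage sites are included in the mean
samtools depth -a -b reference.repma.bed -Q 30 -q 30 sample.realn.bam | \
  awk '{sum += $3; n++} END {print sum / n}'
```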
Config file:
- Enter parameters for depth filtering of BAM files for mlRho analyses and of VCF files for all other downstream analyses. Check the depth histograms carefully before choosing appropriate parameters.
- Set `zerocoverage` to `True` if sites with missing data (i.e. sites with zero coverage) should be included in the calculation of the mean genome-wide depth, and to `False` if these sites should be excluded. The mean genome-wide depth is used to calculate depth thresholds, but also to calculate the target depth proportion for BAM file subsampling (see below).
- Enter the factor by which the mean genome-wide depth should be multiplied to calculate a minimum depth threshold. A factor of 0.33 (i.e. a minimum of 1/3 of the mean genome-wide depth) has been used in many projects; for example, a mean depth of 12X and a factor of 0.33 give a minimum threshold of ca. 4X. In mlRho and VCF file filtering, the minimum depth filter is automatically set to 3X (i.e. 3 reads per site) for samples of low coverage.
- Enter the factor by which the mean genome-wide depth should be multiplied to calculate a maximum depth threshold. A factor of 2 (i.e. a maximum of twice the mean genome-wide depth) has been used in some projects, but this might remove too much data in other projects. The default setting of the pipeline is 10.
Output file locations:
- BAM files (indels realigned): `results/historical/mapping/[reference_filename]/` and `results/modern/mapping/[reference_filename]/`. All BAM files except the BAM files after indel realignment are marked as temporary files and are automatically deleted by the pipeline once they are no longer required.
- Statistics on merged BAM files per index: `results/historical/mapping/[reference_filename]/stats/bams_merged_index/` and `results/modern/mapping/[reference_filename]/stats/bams_merged_index/` and therein.
- Statistics on BAM files after duplicate removal: `results/historical/mapping/[reference_filename]/stats/bams_rmdup/` and `results/modern/mapping/[reference_filename]/stats/bams_rmdup/` and therein.
- Statistics on merged BAM files per sample: `results/historical/mapping/[reference_filename]/stats/bams_merged_sample/` and `results/modern/mapping/[reference_filename]/stats/bams_merged_sample/` and therein.
- Statistics on BAM files after indel realignment, incl. depth thresholds: `results/historical/mapping/[reference_filename]/stats/bams_indels_realigned/` and `results/modern/mapping/[reference_filename]/stats/bams_indels_realigned/` and therein (depth histograms: `results/historical/mapping/[reference_filename]/stats/bams_indels_realigned/*.bam.dp.hist.pdf` and `results/modern/mapping/[reference_filename]/stats/bams_indels_realigned/*.bam.dp.hist.pdf`).
- GenErode report: `GenErode_pipeline_report.html` in the pipeline directory. Note that the MultiQC html files embedded in the pipeline report can only be accessed through some browsers due to a bug in Snakemake.
Checkpoint: Please have a close look at the MultiQC reports and the depth histograms before proceeding with the pipeline, and consider making appropriate adjustments to the depth parameters in the config file. Depth histograms and MultiQC reports can be found in the pipeline report that is automatically created when setting this or any of the downstream analysis pipeline steps to `True`. Note that MultiQC reports round average genome-wide depth estimates to an integer; for more precise depth estimates, check the QualiMap reports and the depth statistics calculated with samtools and custom code (`*.merged.rmdup.merged.realn.repma.Q30.bam.dpstats.txt`). Aim for a minimum depth of 6X per sample to ensure sufficiently high quality of variant calls at heterozygous sites.
Note that some ancient DNA samples may require a second round of PCR duplicate removal after merging BAM files per sample. This option has not been implemented in the current pipeline version and thus needs to be run outside the pipeline if required.
Base quality rescaling with mapDamage2

This step can be run for historical samples that were not or only partially treated with uracil-DNA glycosylase (UDG) in the lab. Note that technical biases have been reported when modern samples were compared to historical samples that had been rescaled in mapDamage2.
- Runs mapDamage on BAM files of historical samples to rescale base qualities of probably damaged sites. Copies rescaled BAM files to the directory `results/historical/mapping/` and deletes the original rescaled BAM files (the latter are temporary files, automatically deleted by the pipeline as soon as they are no longer required).
- Calculates basic summary statistics on the rescaled BAM files using samtools.
- Runs QualiMap on the rescaled BAM files.
- Runs FastQC on rescaled BAM files.
- Summarizes samtools flagstat, QualiMap and FastQC statistics using MultiQC.
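A sketch of the rescaling call (mapDamage2, hypothetical file names; the rescaled BAM file is written to the output folder):

```bash
mapDamage -i sample.realn.bam -r reference.fasta --rescale \
  --folder results/historical/mapDamage/sample/
```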
Config file:
- List of historical or ancient samples for which bases have to be rescaled. Enter sample names as provided in the metadata table, without index or lane, in quotation marks and separated by commas (e.g. `["JS08", "JS09"]`). Leave empty (`[]`) if this optional analysis is not run.
- Keep the sample list during the entire run of the pipeline if rescaled BAM files are supposed to be used in downstream steps.
Output file locations:
- Rescaled BAM files: `results/historical/mapping/[reference_filename]/`.
- Statistics on rescaled BAM files: `results/historical/mapping/[reference_filename]/stats/bams_rescaled/` and therein.
BAM file subsampling

This step is recommended for datasets with uneven mean genome-wide depth across samples.
- Removes unmapped reads and reads with mapping quality < 30 from the BAM files using samtools (temporary files, automatically deleted by the pipeline as soon as they are no longer required).
- Subsamples BAM files by sampling a specific proportion of reads per site using samtools with the parameter `-s FRAC`. The proportion (FRAC) is calculated per BAM file from the target depth per site as specified in the config file and from the mean genome-wide depth in the indel-realigned BAM file, using custom code. Note that sampling a proportion of reads per site can lead to slight deviations of the resulting (rounded) mean genome-wide depth from the target depth specified in the config file.
- Calculates basic summary statistics on the subsampled BAM files using samtools.
- Runs QualiMap on the subsampled BAM files.
- Calculates the arithmetic mean of the genome-wide depth for each subsampled BAM file using samtools, excluding repeat elements and restricted to reads of minimum mapping quality 30 and bases of minimum base quality 30. Sites with zero coverage (missing data) can be included or excluded, as specified in the config file. Minimum and maximum depth thresholds are re-calculated from the genome-wide mean depth for each sample, based on the factors provided in the config file, and used as input for depth filtering in downstream analyses. The minimum depth filter is automatically set to 3X (i.e. 3 reads per site) for samples of low coverage. Parameters are provided at the step "BAM file processing" as described above.
- Summarizes samtools flagstat and QualiMap statistics using MultiQC.
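The proportion passed to samtools is the target depth divided by the sample's mean genome-wide depth; a sketch with hypothetical values:

```bash
TARGET_DP=6
MEAN_DP=14.8   # from the *.dpstats.txt file of the indel-realigned BAM file
FRAC=$(awk -v t="$TARGET_DP" -v m="$MEAN_DP" 'BEGIN {printf "%.4f", t / m}')

# keep mapped reads with mapping quality >= 30, then subsample the fraction FRAC
samtools view -b -F 4 -q 30 sample.realn.bam > sample.mapped_q30.bam
samtools view -b -s "$FRAC" sample.mapped_q30.bam > sample.subsampled.bam
```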
Config file:
- Target depth to which BAM files should be subsampled (as an integer or floating point number). This depth has to be lower than or equal to the mean genome-wide depth of the samples specified in the list of samples to be subsampled. Set to `False` if this optional analysis is not run.
- List of modern and/or historical samples that should be subsampled to a target depth. Do not enter samples with a mean genome-wide depth lower than the target depth. Enter sample names as provided in the metadata table, without index or lane, in quotation marks and separated by commas (e.g. `["JS01", "JS02", "JS08"]`). Leave empty (`[]`) if this optional analysis is not run.
- Keep the target depth and sample list during the entire run of the pipeline so that subsampled BAM files are used in downstream steps.
Output file locations:
- Subsampled BAM files: `results/historical/mapping/[reference_filename]/` and `results/modern/mapping/[reference_filename]/`.
- Statistics on subsampled BAM files, incl. re-calculated depth thresholds: `results/historical/mapping/[reference_filename]/stats/bams_subsampled/` and `results/modern/mapping/[reference_filename]/stats/bams_subsampled/` and therein. Please note that MultiQC reports round average genome-wide depth to an integer. For more precise depth estimates, check the QualiMap reports and the depth statistics calculated with samtools and custom code (`*.merged.rmdup.merged.*.mapped_q30.subs_dp*.repma.Q30.bam.dpstats.txt`).
Variant calling

Required for all downstream steps.
- Calls variants using bcftools for each sample individually. In bcftools mpileup, minimum mapping quality and minimum base quality are set to 30 with `-q` and `-Q`, respectively, and `-B` is set to reduce false SNPs due to misalignments. The output from mpileup is piped to bcftools call, which is run with `-c` (samtools/bcftools original calling method), and all sites are written to the output BCF file (i.e. including monomorphic sites) (temporary files, automatically deleted by the pipeline as soon as they are no longer required).
- Sorts the BCF files using bcftools.
- Indexes sorted BCF files using bcftools.
- Calculates basic summary statistics on sorted BCF files using bcftools.
- Summarizes the statistics of sorted BCF files separately for historical and modern samples using MultiQC.
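The genotyping corresponds roughly to the following per-sample commands (a sketch with hypothetical file names):

```bash
bcftools mpileup -f reference.fasta -q 30 -Q 30 -B -Ou sample.realn.bam | \
  bcftools call -c -Ob -o sample.bcf   # -c: original calling method; no -v, so all sites are kept
bcftools sort -Ob -o sample.sorted.bcf sample.bcf
bcftools index sample.sorted.bcf
bcftools stats sample.sorted.bcf > sample.sorted.bcf.stats.txt
```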
Output file locations:
- Sorted BCF files: `results/historical/vcf/[reference_filename]/` and `results/modern/vcf/[reference_filename]/`.
- Statistics on sorted BCF files: `results/historical/vcf/[reference_filename]/stats/vcf_sorted/` and `results/modern/vcf/[reference_filename]/stats/vcf_sorted/` and therein.
CpG site identification

Required for CpG filtering of VCF files and for mlRho analyses excluding CpG sites. Recommended for any dataset that includes historical or ancient samples, to reduce the number of damaged sites resulting from DNA degradation. Choose one of the three available methods in the config file for a given GenErode pipeline run.
- CpG sites from genotypes (`CpG_from_vcf`):
  - Converts the sorted BCF files of samples specified in the sample list into gzipped VCF format using bcftools (temporary files, automatically deleted by the pipeline as soon as they are no longer required).
  - Makes BED files of CpG sites from each of the gzipped VCF files using a custom script (temporary files, automatically deleted by the pipeline as soon as they are no longer required).
  - Merges all BED files of CpG sites from VCF files of modern and historical/ancient samples into one BED file.
- CpG sites from the reference genome (`CpG_from_reference`):
  - Makes a BED file of CpG sites from the reference genome using a modified script from Tom van der Valk.
- CpG sites from genotypes and the reference genome (`CpG_from_vcf_and_reference`):
  - Merges BED files of CpG sites from VCF files and from the reference genome into one BED file. Runs methods 1) and 2) automatically if set to `True`.
Applied to CpG BED files regardless of the method:
- Merges the final CpG BED file (file name endings: `.CpG_vcf.bed`, `.CpG_ref.bed`, or `.CpG_vcfref.bed`) with a BED file of repeat elements identified in the reference genome (file name endings: `.CpG_vcf.repeats.bed`, `.CpG_ref.repeats.bed`, or `.CpG_vcfref.repeats.bed`) using BEDTools.
- Subtracts the CpG BED file and the CpG-repeats BED file from the reference genome BED file to generate BED files of CpG-masked (file name endings: `.noCpG_vcf.bed`, `.noCpG_ref.bed`, or `.noCpG_vcfref.bed`) and CpG- and repeat-masked regions (file name endings: `.noCpG_vcf.repma.bed`, `.noCpG_ref.repma.bed`, or `.noCpG_vcfref.repma.bed`) using BEDTools.
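The BED file arithmetic can be sketched as follows (hypothetical file names, using the `CpG_vcf` file name endings as an example):

```bash
# merge the CpG BED file with the repeats BED file
cat reference.CpG_vcf.bed reference.repeats.bed | sort -k1,1 -k2,2n | \
  bedtools merge > reference.CpG_vcf.repeats.bed

# subtract from the genome BED file to obtain CpG-masked
# and CpG- and repeat-masked regions
bedtools subtract -a reference.bed -b reference.CpG_vcf.bed > reference.noCpG_vcf.bed
bedtools subtract -a reference.bed -b reference.CpG_vcf.repeats.bed > reference.noCpG_vcf.repma.bed
```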
Config file:
- Set method 1, 2 or 3 to `True`.
- Create a list of modern and/or historical samples in which CpG sites should be identified with method 1 or 3 and from which CpG sites should be removed in downstream analyses (methods 1, 2, 3). Enter sample names as provided in the metadata table, without index or lane, in quotation marks and separated by commas (e.g. `["JS01", "JS02", "JS08", "JS09"]`). Leave empty (`[]`) if this optional analysis is not run. Note that in order to compare the same set of sites across samples, CpG sites have to be removed from all samples.
- Keep the desired method set to `True` and the sample list during the entire run of the pipeline so that the correct CpG BED files are used for data filtering in downstream analyses.
Output file locations:
- BED files from the reference genome and merged BED files: `results/[reference_filename]/`.
mlRho: genome-wide heterozygosity

Runs mlRho to calculate genome-wide heterozygosity (or heterozygosity for autosomes vs. sex chromosomes, or for autosomes only).
- Converts BAM files to the pro format using samtools (v1.9) and mlRho (temporary files, automatically deleted by the pipeline as soon as they are no longer required). BED files with repeat-masked regions (default filtering), optionally with repeat- and CpG-masked regions, and/or optionally split into autosomes and sex chromosomes, are automatically generated and used to specify which genome regions to analyze with mlRho. Depth thresholds are either obtained from realigned or from subsampled BAM files (see depth threshold calculation from BAM files in the BAM file processing step above).
- Formats the pro files and runs mlRho on each pro file with default settings, using the same minimum depth threshold as chosen for the BAM to pro file format conversion.
- Combines all results tables and creates a plot of theta estimates (including confidence intervals) per sample, split into historical and modern samples, and optionally into autosomes and sex chromosomes; the plot is included in the pipeline report.
Config file:
- If mlRho should be run on all contigs/scaffolds and/or the identity of sex-chromosomal contigs/scaffolds is unknown, set `mlRho_autosomes_sexchromosomes` to `False` and do not provide the path to the scaffold/contig text file with the reference genome (`sexchromosomes`) when running mlRho.
- For mlRho analyses of autosomes and sex chromosomes separately from each other, set `mlRho_autosomes_sexchromosomes` to `True` and provide the path to the scaffold/contig text file with the reference genome (`sexchromosomes`) when running mlRho.
- If sex-chromosomal contigs/scaffolds (or other contigs/scaffolds such as unplaced or short scaffolds) should be excluded from the analysis, i.e. if only autosomes should be analyzed, set `mlRho_autosomes_sexchromosomes` to `False` and provide the path to the scaffold/contig text file with the reference genome (`sexchromosomes`) when running mlRho.
Output file locations:
- mlRho output files: `results/historical/mlRho/[reference_filename]/` and `results/modern/mlRho/[reference_filename]/`.
- BED files of sites to be included in the analysis: `results/[reference_filename]/`.
- GenErode report: `GenErode_pipeline_report.html` in the pipeline directory.
CpG filtering of VCF files

Will be run on the list of samples specified in the config file, using the method selected to identify CpG sites under `CpG_identification`.
- Converts the BCF files into gzipped VCF files using bcftools (temporary files, automatically deleted by the pipeline as soon as they are no longer required).
- Removes CpG sites from the gzipped VCF files using BEDTools (temporary files, automatically deleted by the pipeline as soon as they are no longer required).
- Converts the filtered gzipped VCF files into BCF format using bcftools.
- Indexes and calculates basic summary statistics for each filtered BCF file using bcftools.
- Summarizes the statistics using MultiQC.
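A sketch of the per-sample filtering (hypothetical file names; the exact BEDTools call in the pipeline may differ):

```bash
bcftools view -Oz -o sample.vcf.gz sample.sorted.bcf
bedtools subtract -header -a sample.vcf.gz -b reference.CpG_vcf.bed | \
  bcftools view -Ob -o sample.noCpG.bcf
bcftools index sample.noCpG.bcf
bcftools stats sample.noCpG.bcf > sample.noCpG.bcf.stats.txt
```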
Config file:
- Keep the sample list and the method selected under `CpG_identification`.
Output file locations:
- CpG-filtered BCF files: `results/historical/vcf/[reference_filename]/` and `results/modern/vcf/[reference_filename]/`.
- Statistics on BCF files: `results/historical/vcf/stats/vcf_CpG_filtered/[reference_filename]/` and `results/modern/vcf/stats/vcf_CpG_filtered/[reference_filename]/` and therein.
VCF file quality filtering

Required for all downstream steps.
- Removes SNPs within 5 bp up- and downstream of indels using bcftools (temporary files, automatically deleted by the pipeline as soon as they are no longer required).
- Removes indels, sites with a quality score (`QUAL`) below 30, and sites outside the depth thresholds from each individual BCF file using bcftools (temporary files, automatically deleted by the pipeline as soon as they are no longer required). Depth thresholds are obtained from realigned or from subsampled BAM files (see depth threshold calculation from BAM files in the BAM file processing step above).
- Removes heterozygous sites with allelic imbalance, which could be due to contamination, sequencing or mapping errors, from each BCF file using bcftools (temporary files, automatically deleted by the pipeline as soon as they are no longer required). Thresholds of 0.2 and 0.8 for reference and alternative alleles are used for this filter.
- Converts filtered BCF files into gzipped VCF files using bcftools (temporary files, automatically deleted by the pipeline as soon as they are no longer required).
- Removes repeat regions from the gzipped VCF files using BEDTools (temporary files, automatically deleted by the pipeline as soon as they are no longer required).
- Converts the repeat filtered gzipped VCF files into BCF format using bcftools.
- Indexes and calculates basic summary statistics for each BCF file using bcftools.
- Summarizes the BCF file statistics at each filtering level using MultiQC.
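The bcftools part of this filtering chain might look like the following sketch (hypothetical file names; the depth thresholds are read by the pipeline from the per-sample depth statistics, and the allelic imbalance expression shown here is one way to express the 0.2/0.8 rule, not necessarily the pipeline's exact filter):

```bash
MIN_DP=4     # per-sample minimum depth threshold
MAX_DP=120   # per-sample maximum depth threshold

# remove SNPs within 5 bp of indels (-g 5), then drop indels,
# low-quality sites and sites outside the depth thresholds
bcftools filter -g 5 -Ou sample.noCpG.bcf | \
  bcftools view -V indels -e "QUAL < 30 || INFO/DP < $MIN_DP || INFO/DP > $MAX_DP" \
    -Ob -o sample.qualfilt.bcf

# remove heterozygous sites with allelic imbalance
# (reference allele fraction below 0.2 or above 0.8, based on INFO/DP4)
bcftools view -e 'GT="het" && ((DP4[0]+DP4[1])/(DP4[0]+DP4[1]+DP4[2]+DP4[3]) < 0.2 || (DP4[0]+DP4[1])/(DP4[0]+DP4[1]+DP4[2]+DP4[3]) > 0.8)' \
  -Ob -o sample.qualfilt.ab.bcf sample.qualfilt.bcf
```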
Output file locations:
- BCF files (after repeat filtering): `results/historical/vcf/[reference_filename]/` and `results/modern/vcf/[reference_filename]/`. All BCF files except the files after repeat filtering are marked as temporary files and are automatically deleted by the pipeline once they are no longer required.
- Statistics on quality-filtered BCF files: `results/historical/vcf/stats/vcf_qual_filtered/[reference_filename]/` and `results/modern/vcf/stats/vcf_qual_filtered/[reference_filename]/` and therein.
- Statistics on repeat-filtered BCF files: `results/historical/vcf/stats/vcf_repmasked/[reference_filename]/` and `results/modern/vcf/stats/vcf_repmasked/[reference_filename]/` and therein.
Merging and filtering of VCF files

After merging individual VCF files, biallelic sites and sites fulfilling missingness criteria are identified and extracted from the merged and individual VCF files. Optionally, sites on sex-chromosomal contigs/scaffolds can be excluded.
- Merges all BCF files of historical and modern samples into one BCF file using bcftools.
- Removes all sites that are not biallelic from the merged BCF file using bcftools. Note that this filter keeps sites where all samples are homozygous for the alternative allele.
- Removes all sites with a larger proportion of missing genotypes than specified in the config file (`f_missing`) using bcftools.
- OPTIONAL: If a scaffold/contig text file is provided in the config file along with the reference genome (`sexchromosomes`), these scaffolds/contigs are removed with bcftools.
- Creates a BED file from the filtered VCF file with the genomic locations of all sites that passed all previous filters as well as the biallelic SNP and missingness filters, optionally excluding sex-chromosomal scaffolds/contigs, using custom code.
- Extracts all historical samples in BCF format using bcftools.
- Extracts all modern samples in BCF format using bcftools.
- Extracts sites that passed all filters and are listed in the BED file from individual BCF files for downstream analyses using BEDTools.
- Indexes and calculates basic summary statistics for each BCF file generated in this step using bcftools.
- Summarizes the BCF file statistics at each filtering level using MultiQC.
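A sketch of the merging and site filtering (hypothetical file names and `f_missing` value; `F_MISSING` requires bcftools >= 1.9):

```bash
# merge all per-sample BCF files (all input files must be indexed)
bcftools merge -Ob -o all.merged.bcf hist1.bcf hist2.bcf mod1.bcf mod2.bcf

# keep only biallelic sites; sites where all samples are homozygous
# for the alternative allele still carry two alleles (REF + ALT) and are kept
bcftools view -m2 -M2 -Ob -o all.merged.biallelic.bcf all.merged.bcf

# remove sites with a larger fraction of missing genotypes than f_missing
bcftools view -e 'F_MISSING > 0.1' -Ob -o all.merged.biallelic.fmissing.bcf \
  all.merged.biallelic.bcf
```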
Output file locations:
- Merged BCF files: `results/all/vcf/[reference_filename]/`, `results/historical/vcf/[reference_filename]/` and `results/modern/vcf/[reference_filename]/`. Note that the merged BCF file names do not contain any information about filters applied to individual BAM and BCF files.
- Statistics on merged BCF files after filtering for biallelic sites: `results/historical/vcf/[reference_filename]/stats/vcf_merged_biallelic/` and `results/modern/vcf/[reference_filename]/stats/vcf_merged_biallelic/` and therein.
- Statistics on merged BCF files after applying the missingness filter (if sex-chromosomal scaffolds/contigs are provided, genome-wide and for autosomes): `results/historical/vcf/[reference_filename]/stats/vcf_merged_missing/` and `results/modern/vcf/[reference_filename]/stats/vcf_merged_missing/` and therein.
- Individual BCF files (after filtering for biallelic sites, missingness and optionally for sex-chromosomal scaffolds/contigs): `results/historical/vcf/[reference_filename]/` and `results/modern/vcf/[reference_filename]/`.
- Statistics on filtered individual BCF files (genome-wide): `results/historical/vcf/stats/vcf_biallelic_missing_genome/[reference_filename]/` and `results/modern/vcf/stats/vcf_biallelic_missing_genome/[reference_filename]/` and therein.
- Statistics on filtered individual BCF files (autosomes only): `results/historical/vcf/stats/vcf_biallelic_missing_autos/[reference_filename]/` and `results/modern/vcf/stats/vcf_biallelic_missing_autos/[reference_filename]/` and therein.
PCA

Plots PCAs of all historical, all modern, and all modern and historical samples combined to check for outliers, technical artifacts and signals of population structure.
- Converts the merged and filtered BCF files to plink format using plink (temporary files, automatically deleted by the pipeline as soon as they are no longer required).
- Calculates PCAs for each file using plink.
- Plots PCs 1 vs. 2 for each file using a custom script; the plot is included in the pipeline report.
- Plots PCs 1 vs. 3 for each file using a custom script; the plot is included in the pipeline report (for fewer than four individuals, an empty PDF file is generated).
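A sketch of the plink steps (hypothetical file names; `--allow-extra-chr` is needed for non-standard chromosome names):

```bash
# convert the merged and filtered BCF file to plink format
plink --bcf all.merged.biallelic.fmissing.bcf --allow-extra-chr --make-bed --out all_pca

# compute the PCA (writes .eigenvec and .eigenval files used for plotting)
plink --bfile all_pca --allow-extra-chr --pca --out all_pca
```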
Output file locations:
- PCA output files incl. plots: `results/all/pca/[reference_filename]/`, `results/historical/pca/[reference_filename]/` and `results/modern/pca/[reference_filename]/`.
- GenErode report: `GenErode_pipeline_report.html` in the pipeline directory.
Runs of homozygosity (ROH)

Calculates runs of homozygosity for each sample in the merged and filtered BCF files using plink.
- Filters the BCF files to remove sites out of Hardy-Weinberg equilibrium (`--hwe 0.05`) using vcftools (temporary files, automatically deleted by the pipeline as soon as they are no longer required).
- Converts the filtered gzipped VCF files to plink format using plink.
- Calculates runs of homozygosity using plink. The following parameter values are hard-coded and cannot be changed in the config file: `--homozyg --homozyg-window-threshold 0.05 --allow-extra-chr`. Check below which parameters have to be specified in the config file (see also the example command after the parameter list).
- Creates a barplot of FROH (the proportion of the genome in runs of homozygosity) for runs of homozygosity larger than 2 Mb for each sample, divided into historical and modern samples; the plot is included in the pipeline report.
Config file:
- Specify the following parameters:
  - `homozyg-snp`: Minimum SNP count. For example, 10 or 25. Abbreviation in file name: `homsnp`.
  - `homozyg-kb`: Minimum size of a ROH in kilobases. For example, 500 or 1000. Abbreviation in file name: `homkb`.
  - `homozyg-window-snp`: Window size for ROH estimation. For example, 20, 50, 100 or 1000. Abbreviation in file name: `homwinsnp`.
  - `homozyg-window-het`: Maximum number of heterozygous sites per window. For example, 1 for a stringent analysis, 3 as a relaxed setting. Abbreviation in file name: `homwinhet`.
  - `homozyg-window-missing`: Maximum number of missing sites per window. For example, 1 or 5 for a stringent analysis, 10 for intermediate and 15 for relaxed filtering. Abbreviation in file name: `homwinmis`.
  - `homozyg-het`: Maximum number of heterozygous sites per ROH. For example, 1 for a stringent analysis, 3 as a relaxed setting. Disable this parameter by setting it to 999. Abbreviation in file name: `homhet`.
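Putting the hard-coded and the configurable parameters together, the plink call corresponds roughly to this sketch (hypothetical parameter values and file names):

```bash
plink --bfile all_roh --allow-extra-chr \
  --homozyg --homozyg-window-threshold 0.05 \
  --homozyg-snp 25 --homozyg-kb 500 \
  --homozyg-window-snp 50 --homozyg-window-het 3 --homozyg-window-missing 10 \
  --homozyg-het 999 \
  --out all_roh_homsnp25_homkb500_homwinsnp50_homwinhet3_homwinmis10_homhet999
```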
Output file locations:
- ROH output files: `results/historical/ROH/[reference_filename]/` and `results/modern/ROH/[reference_filename]/`. Note that the parameters specified in the config file are included in the output file names, which makes it possible to keep the results from several ROH runs with different parameter settings.
- FROH plots: `results/historical/ROH/[reference_filename]/`, `results/modern/ROH/[reference_filename]/`, and `results/all/ROH/[reference_filename]/`.
- GenErode report: `GenErode_pipeline_report.html` in the pipeline directory.
SNP annotation with snpEff

Annotates SNPs in coding regions in each individual BCF file using snpEff. Requires a GTF file (GTF version 2.2) with gene model predictions for the reference genome. Read more about file conversions between the GTF and GFF format here.
Note that it is recommended to use the genome assembly from a closely related species or population as a reference genome for mapping, variant calling and snpEff. This population or species should be equally distantly related to all populations under comparison.
- Creates a directory structure with symbolic links to the genome assembly (FASTA format) and annotation (GTF2.2 format) using custom code.
- Updates the snpEff config file with the genome assembly name using custom code.
- Builds a snpEff database for the genome assembly and annotation using snpEff.
- Annotates individual (i.e. per-sample) filtered VCF files with snpEff.
- Summarizes statistics files generated by snpEff using MultiQC (included in the pipeline report).
- Creates a table listing the numbers of SNPs with high, moderate, low and modifier effects for modern and historical samples using custom code; the table is included in the pipeline report.
- Creates a barplot of the numbers of SNPs with high, moderate, low and modifier effects for modern and historical samples; the plot is included in the pipeline report.
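A sketch of the database build and annotation steps (the genome name under which the assembly is registered in the snpEff config file stands in for `[reference_filename]`):

```bash
# build the snpEff database from the GTF (version 2.2) annotation
snpEff build -gtf22 -v [reference_filename]

# annotate a per-sample filtered VCF file
snpEff ann -csvStats sample.snpEff.csv [reference_filename] sample.filtered.vcf \
  > sample.annotated.vcf
```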
Config file:
- Provide the full path to the GTF file (version 2.2) with protein-coding gene model predictions for the reference genome. For format conversions from GFF3 to GTF, e.g. the tool agat_convert_sp_gff2gtf.pl can be used with the parameter `--gtf_version 2.2`.
Output file locations:
- snpEff database files: `snpEff/data/[reference_filename]/` within the original directory of the GTF file as specified in the config file. That way this step is run only once for a given GTF file.
- snpEff output files: `results/all/snpEff/[reference_filename]/`, `results/historical/snpEff/[reference_filename]/` and `results/modern/snpEff/[reference_filename]/` and therein.
- GenErode report: `GenErode_pipeline_report.html` in the pipeline directory. Note that the MultiQC html files embedded in the pipeline report can only be accessed through some browsers due to a bug in Snakemake.
GERP scores and relative mutational load

Calculates GERP scores reflecting evolutionary constraint for each base in the reference genome, requiring genome assemblies of at least 30 outgroup species and a dated phylogeny in NEWICK format scaled in millions of years. If whole-genome resequencing data is available, calculates a measure of relative mutational load from derived alleles and GERP scores.
This step is based on a workflow and scripts from Tom van der Valk.
Please note that this analysis is computationally demanding as it creates a large number of intermediate files and requires several TB of storage space during the pipeline run. These depend on the number of outgroup species, the number of historical/ancient and modern samples, and the level of fragmentation of the reference genome (i.e. the number of contigs/scaffolds).
Note that it is recommended to use the genome assembly from a closely related species or population as a reference genome for GERP. This population or species should be equally distantly related to all populations under comparison.
- Splits the reference genome BED file into up to 200 chunks of contigs for parallelization (implemented as Python code, which is executed the first time this step is run in dry or in main mode).
- Renames genome assemblies so that all files have the same file name extension (temporary files, automatically deleted by the pipeline as soon as they are no longer required).
- Converts the outgroup genome assemblies from FASTA to FASTQ format (read length: 35 bp) using a custom script (temporary files, automatically deleted by the pipeline as soon as they are no longer required).
- Calculates basic summary statistics on the FASTQ files using FastQC and summarizes the FastQC reports using MultiQC.
- Maps the reads from each outgroup to the reference genome of the target species using bwa mem and converts the output to BAM format using samtools, keeping only uniquely mapped reads (temporary files, automatically deleted by the pipeline as soon as they are no longer required).
- Indexes the BAM files using samtools (temporary files, automatically deleted by the pipeline as soon as they are no longer required).
- Calculates basic summary statistics on the BAM files using samtools and summarizes the BAM file statistics using MultiQC.
- The following steps are run per genome chunk, i.e. for several contigs per rule (or Slurm job if submitted to a cluster). Output directories are marked as temporary and are automatically deleted by the pipeline as soon as their content is no longer required:
- Converts each BAM file to FASTA format using samtools and a custom script. The output is one FASTA file per contig and outgroup species.
- Splits the reference genome assembly into one FASTA file per contig using seqtk.
- Concatenates FASTA files from all outgroup species and the reference genome assembly per contig using custom code.
- Computes GERP scores per contig using gerpcol from GERP++, scaling the GERP scores from millions of years (i.e. the scale of the input tree) to billions of years (`-s 0.001`).
- Converts GERP scores to the correct genome coordinates of the reference genome of the target species using a custom script.
- Estimates the ancestral state of each position in the reference genome of the target species using a custom script.
- Merges GERP scores with ancestral states into one file per contig using custom code.
- Merges results per contig into one output file per genome chunk using custom code.
- Merges results per genome chunk into one output file using custom code.
- Creates a histogram of GERP scores across the genome that is included in the pipeline report.
- Splits each individual (i.e. per-sample) filtered VCF file into chunks using BEDTools to run the subsequent steps in parallel.
- Splits each chunk BED file into windows of 10 Mb each using BEDTools.
- Calculates the number of derived alleles per site and sample and adds them to the GERP++ results using a custom script, run per chunk and within each chunk per 10 Mb window. Homozygous derived alleles are counted as 2 derived alleles, heterozygous derived alleles as 1 derived allele per site and individual.
- Merges results per 10 Mb window into one file per genome chunk and sample using custom code.
- Merges results per genome chunk into one output file per sample using custom code.
- Calculates the relative mutational load per sample using a custom script, with minimum and maximum GERP scores as specified in the config file. For all sites with a GERP score above the minimum and below the maximum, the relative mutational load is defined as: (sum of GERP scores at sites with derived alleles, counted twice for homozygous derived sites) / (total number of derived alleles, with homozygous derived sites counted as 2).
- Merges all relative mutational load results per sample into one table that is included in the pipeline report.
- Creates a barplot of the relative mutational load per sample, separated into modern and historical samples, that is included in the pipeline report.
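Assuming a merged per-sample file with columns chromosome, position, ancestral state, GERP score and derived allele count of 0, 1 or 2 (the column layout is an assumption for illustration; the pipeline uses its own custom script), the definition above corresponds to this awk sketch:

```bash
MIN_GERP=0    # minimum GERP score from the config file
MAX_GERP=10   # maximum GERP score from the config file

zcat sample.ancestral.rates.derived.alleles.gz | \
  awk -v min="$MIN_GERP" -v max="$MAX_GERP" '
    $4 > min && $4 < max && $5 > 0 {load += $4 * $5; alleles += $5}
    END {print load / alleles}'
```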
Config file:
- Provide the full path to a directory containing gzipped genome assemblies of outgroup species. The file name extensions of these FASTA files can be ".fa.gz", ".fasta.gz" or ".fna.gz". A minimum of ca. 30 species should be included in the analysis to obtain reliable estimates, and these outgroup species should be genetically diverse, e.g. include species from all over the tree of mammals when studying a primate (see Davydov et al. 2010).
- Provide the full path to a phylogenetic tree of all species included in the analysis in NEWICK format and including divergence time estimates in millions of years. The species names in the tree have to be identical to the FASTA file names of the corresponding genome assemblies, and the name of the target species has to be replaced with the name of the reference genome assembly of the target species (without file name extensions like ".fasta" or ".fasta.gz"). A tree in millions of years can be obtained e.g. from www.timetree.org.
- If only GERP scores (incl. a histogram) should be calculated, the GERP step can be run without specifying historical or modern samples in the configuration file.
- Sex-chromosomal scaffolds/contigs are automatically excluded from GERP score calculations and relative mutational load calculations if the path to a scaffold/contig text file is provided with the reference genome (`sexchromosomes`).
Output file locations:
- BED files containing chunks of the genome and 10 Mb windows for parallelization: original directory containing the reference genome FASTA file as specified in the config file, so that this step is run only once for a given reference FASTA file (`gerp/[reference_filename]/split_bed_files/` within the reference file directory).
- FASTQ file statistics: `results/gerp/fastq_files/stats/`.
- BAM file statistics: `results/gerp/alignment/[reference_filename]/stats/`.
- GERP scores: `results/gerp/[reference_filename].ancestral.rates.gz`, with columns chromosome, position (1-based), ancestral state of the reference base, and GERP++ score (scaled from millions to billions of years).
- GERP scores plus numbers of derived alleles per site and per sample: `results/gerp/historical/[reference_filename]/*.ancestral.rates.derived.alleles.gz` and `results/gerp/modern/[reference_filename]/*.ancestral.rates.derived.alleles.gz`.
- GenErode report: `GenErode_pipeline_report.html` in the pipeline directory.
Intermediate files/directories:
- Subdirectories within `results/gerp/chunks/[reference_filename]/` are automatically deleted by the pipeline. Please check whether they are still present after a finished pipeline run and remove them manually if necessary, as they contain a large number of intermediate files.