Reference files are essential at various steps of the pipeline, including joint-genotyping, VQSR, the Variant Effect Predictor (VEP), and exomiser.
These files must be correctly downloaded and specified through pipeline parameters. This document provides a comprehensive list of the required reference files and explains how to set the pipeline parameters appropriately.
The `referenceGenome` parameter specifies the directory containing the reference genome files.
This directory should contain the following files:
- The reference genome FASTA file (e.g., `Homo_sapiens_assembly38.fasta`). This filename must be specified with the `referenceGenomeFasta` parameter.
- The reference genome FASTA file index (e.g., `Homo_sapiens_assembly38.fasta.fai`). Its location is derived automatically by appending `.fai` to the `referenceGenomeFasta` parameter.
- The reference genome dictionary file (e.g., `Homo_sapiens_assembly38.dict`). Its location is derived automatically by replacing the `.fasta` file extension of the `referenceGenomeFasta` parameter with `.dict`.
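
The derivation rules above can be sketched in a few lines of Python. This is only an illustration of the naming convention, not the pipeline's own code: the function names `reference_paths` and `missing_reference_files` are hypothetical, and the sketch assumes the FASTA filename ends in `.fasta`.

```python
from pathlib import Path


def reference_paths(reference_genome: str, reference_genome_fasta: str) -> dict:
    """Return the FASTA, index, and dictionary paths implied by the two parameters."""
    fasta = Path(reference_genome) / reference_genome_fasta
    return {
        "fasta": fasta,
        # Index: append ".fai" to the full FASTA filename.
        "fai": fasta.with_name(fasta.name + ".fai"),
        # Dictionary: replace the final extension (assumed to be ".fasta") with ".dict".
        "dict": fasta.with_suffix(".dict"),
    }


def missing_reference_files(reference_genome: str, reference_genome_fasta: str) -> list:
    """List the expected reference files that do not exist on disk."""
    paths = reference_paths(reference_genome, reference_genome_fasta)
    return [str(p) for p in paths.values() if not p.is_file()]
```

Running `missing_reference_files` with your own `referenceGenome` and `referenceGenomeFasta` values before launching the pipeline gives a quick list of files that still need to be downloaded or generated.
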
The `broad` parameter specifies the directory containing the reference data files for VQSR. We chose the name `broad` because this data comes from the Broad Institute, a collaborative research institution known for its contributions to genomics and biomedical research.
Files can be downloaded from the GATK Resource Bundle.
The `broad` directory must contain the following files:
- Intervals file: the genomic interval(s) over which the pipeline operates (WES, WGS, or targeted sequencing). The filename of this list must be defined with the `intervalsFile` parameter (e.g., `wgs_calling_regions.hg38.interval_list`). For more details, see the GATK documentation.
- Highly validated variant resources required by VQSR. These are currently hard coded in the pipeline:
  - HapMap file: `hapmap_3.3.hg38.vcf.gz`
  - 1000G omni2.5 file: `1000G_omni2.5.hg38.vcf.gz`
  - 1000G reference file: `1000G_phase1.snps.high_confidence.hg38.vcf.gz`
  - SNP database: `Homo_sapiens_assembly38.dbsnp138.vcf.gz`

Extra settings (e.g., resource prior probabilities, tranches) required to run the different VQSR steps are injected through pipeline parameters or hard coded in the VQSR modules. The values chosen for these settings are based on NIH Biowulf.
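
As a rough pre-flight check, the file list above can be verified with a short script. This is a sketch under stated assumptions: the function names are hypothetical, the resource filenames are copied from the list above, and the check covers only the files named in this document. Whether the pipeline also expects index files next to the VCFs is not stated here, so they are not checked.

```python
from pathlib import Path

# Resource filenames listed above as hard coded in the pipeline.
VQSR_RESOURCES = [
    "hapmap_3.3.hg38.vcf.gz",
    "1000G_omni2.5.hg38.vcf.gz",
    "1000G_phase1.snps.high_confidence.hg38.vcf.gz",
    "Homo_sapiens_assembly38.dbsnp138.vcf.gz",
]


def expected_broad_files(broad: str, intervals_file: str) -> list:
    """Return the intervals file followed by the VQSR resources, all inside `broad`."""
    base = Path(broad)
    return [base / intervals_file] + [base / name for name in VQSR_RESOURCES]


def missing_broad_files(broad: str, intervals_file: str) -> list:
    """List the expected files that are absent from the `broad` directory."""
    return [str(p) for p in expected_broad_files(broad, intervals_file) if not p.is_file()]
```

An empty list from `missing_broad_files` means every file named above is present in the directory.
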
The `vepCache` parameter specifies the directory for the VEP cache. It is only required if `vep` is specified via the `tools` parameter.
The VEP cache is not automatically populated by the pipeline; it must be downloaded beforehand. You can obtain a copy of the data by following the VEP installation procedure. Generally, only the human files obtainable from Ensembl are needed.
The exomiser reference data is only required if `exomiser` is specified via the `tools` parameter.
The `exomiser_data_dir` parameter specifies the path to the directory containing the exomiser reference files.
This directory is passed to the exomiser tool via the exomiser option `--exomiser.data-directory`.
Its content should look like this:

    2402_hg19/
    2402_hg38/
    2402_phenotype/
    remm/
        ReMM.v0.3.1.post1.hg38.tsv.gz
        ReMM.v0.3.1.post1.hg38.tsv.gz.tbi
    cadd/1.7/
        gnomad.genomes.r4.0.indel.tsv.gz
        gnomad.genomes.r4.0.indel.tsv.gz.tbi
        whole_genome_SNVs.tsv.gz
        whole_genome_SNVs.tsv.gz.tbi

- `2402_hg19/` and `2402_hg38/`: These folders contain data associated with the `hg19` and `hg38` genome assemblies, respectively. The number `2402` corresponds to the exomiser data version.
- `remm/`: This folder is required only if REMM is used as a pathogenicity source in the exomiser analysis. In this case, additional parameters must be provided to specify the REMM data version (here `0.3.1.post1`) and the name of the `.tsv.gz` file to be used within this folder. See below.
- `cadd/`: This folder is required only if CADD is used as a pathogenicity source in the exomiser analysis. Here `1.7` is the CADD data version. As for REMM, additional parameters must be provided. See below.

To prepare the exomiser data directory, follow the instructions in the exomiser installation documentation.
Together with the `exomiser_data_dir` parameter, the following parameters must be provided to exomiser and should match the available reference data:
- `exomiser_genome`: The genome assembly version to be used by exomiser. Accepted values are `hg38` or `hg19`.
- `exomiser_data_version`: The exomiser data version. Example: `2402`.
- `exomiser_cadd_version`: The version of the CADD data to be used by exomiser (optional). Example: `1.7`.
- `exomiser_cadd_indel_filename`: The filename of the exomiser CADD indel data file (optional). Example: `gnomad.genomes.r4.0.indel.tsv.gz`.
- `exomiser_cadd_snv_filename`: The filename of the exomiser CADD SNV data file (optional). Example: `whole_genome_SNVs.tsv.gz`.
- `exomiser_remm_version`: The version of the REMM data to be used by exomiser (optional). Example: `0.3.1.post1`.
- `exomiser_remm_filename`: The filename of the exomiser REMM data file (optional). Example: `ReMM.v0.3.1.post1.hg38.tsv.gz`.
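
To see how these parameters map onto the directory layout shown earlier, the following sketch computes the paths that a given set of parameter values implies. It is an illustration only: the function name `expected_exomiser_paths` is hypothetical, the layout is inferred from the example listing in this document, and treating the `<version>_phenotype/` folder as always required is an assumption.

```python
from pathlib import Path


def expected_exomiser_paths(params: dict) -> list:
    """Return the data paths implied by the exomiser_* parameters."""
    data_dir = Path(params["exomiser_data_dir"])
    version = params["exomiser_data_version"]
    genome = params["exomiser_genome"]
    if genome not in ("hg19", "hg38"):
        raise ValueError(f"exomiser_genome must be hg19 or hg38, got {genome!r}")

    # Assembly-specific and phenotype folders, e.g. 2402_hg38/ and 2402_phenotype/.
    paths = [data_dir / f"{version}_{genome}", data_dir / f"{version}_phenotype"]

    # REMM data is optional: the file sits directly inside remm/.
    if params.get("exomiser_remm_version"):
        paths.append(data_dir / "remm" / params["exomiser_remm_filename"])

    # CADD data is optional: files sit inside cadd/<version>/.
    if params.get("exomiser_cadd_version"):
        cadd_dir = data_dir / "cadd" / params["exomiser_cadd_version"]
        paths.append(cadd_dir / params["exomiser_cadd_indel_filename"])
        paths.append(cadd_dir / params["exomiser_cadd_snv_filename"])
    return paths
```

Comparing the returned paths against the actual directory content is a quick way to catch a mismatch between parameter values and downloaded data, such as a data version that does not match the folder names.
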
In addition to the reference data, exomiser requires an analysis file (.yml/.json) that contains, among other things, the variant frequency sources for prioritization of rare variants, the variant pathogenicity sources to consider, and the list of filters and prioritisers to apply.
Typically, different analysis settings are used for whole exome sequencing (WES) and whole genome sequencing (WGS) data. Default analysis files are provided for each sequencing type in the assets folder:
- assets/exomiser/default_exomiser_WES_analysis.yml
- assets/exomiser/default_exomiser_WGS_analysis.yml
You can override these defaults and provide your own analysis file(s) via the parameters `exomiser_analysis_wes` and `exomiser_analysis_wgs`.
Note that the default analysis files do not include REMM or CADD pathogenicity sources.
The exomiser analysis file format follows the `phenopacket` standard and is described in detail in the exomiser documentation.
An analysis file typically has multiple sections. To be compatible with the way the pipeline runs the exomiser command, your analysis file should contain only the `analysis` section.
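
A custom analysis file can be screened for extra sections with a rough check like the one below. This sketch is not a YAML parser: it only looks for keys that start in the first column of a block-style `.yml` file, so it does not handle JSON or flow-style YAML. The function names are hypothetical.

```python
import re


def top_level_keys(yaml_text: str) -> list:
    """Collect keys that start in column 0 of a block-style YAML document."""
    keys = []
    for line in yaml_text.splitlines():
        # Indented lines, comments, and '---' markers do not match this pattern.
        match = re.match(r"^([A-Za-z_][\w-]*)\s*:", line)
        if match:
            keys.append(match.group(1))
    return keys


def has_only_analysis_section(yaml_text: str) -> bool:
    """True when 'analysis' is the only top-level section in the file."""
    return top_level_keys(yaml_text) == ["analysis"]
```

For example, a file whose top-level keys are `analysis` and `outputOptions` would fail this check, and the second section would need to be removed before the file is passed to the pipeline.
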
Parameter name | Required? | Description |
---|---|---|
`referenceGenome` | Required | Path to the directory containing the reference genome data |
`referenceGenomeFasta` | Required | Filename of the reference genome .fasta file, within the specified `referenceGenome` directory |
`broad` | Required | Path to the directory containing Broad reference data |
`intervalsFile` | Required | Filename of the genome intervals list, within the specified `broad` directory |
`vepCache` | Optional | Path to the VEP cache data directory |
`exomiser_data_dir` | Optional | Path to the exomiser reference data directory |
`exomiser_genome` | Optional | Genome assembly version to be used by exomiser (`hg19` or `hg38`) |
`exomiser_data_version` | Optional | Exomiser data version (e.g., `2402`) |
`exomiser_cadd_version` | Optional | Version of the CADD data to be used by exomiser (e.g., `1.7`) |
`exomiser_cadd_indel_filename` | Optional | Filename of the exomiser CADD indel data file (e.g., `gnomad.genomes.r4.0.indel.tsv.gz`) |
`exomiser_cadd_snv_filename` | Optional | Filename of the exomiser CADD SNV data file (e.g., `whole_genome_SNVs.tsv.gz`) |
`exomiser_remm_version` | Optional | Version of the REMM data to be used by exomiser (e.g., `0.3.1.post1`) |
`exomiser_remm_filename` | Optional | Filename of the exomiser REMM data file (e.g., `ReMM.v0.3.1.post1.hg38.tsv.gz`) |
`exomiser_analysis_wes` | Optional | Path to the exomiser analysis file for WES data, if different from the default |
`exomiser_analysis_wgs` | Optional | Path to the exomiser analysis file for WGS data, if different from the default |