Commit
It includes the following changes:
- fix linter warnings
- introducing conditional parameter validation logic for exomiser and vep
- using a dedicated exomiser_genome parameter
- utility functions to check if a tool is present
- make exomiser stub output files identical to real output files
- infer exomiser version from exomiser banner file in container
- standardize exomiser process outputs
- introducing per sequencing type analysis file
- use process input instead of params to pass configuration information
- update README.md, OUTPUT.md and USAGE.md
1 parent
9e74994
commit 860ac7f
Showing 13 changed files with 432 additions and 282 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,93 @@ | ||
# Ferlab-Ste-Justine/Post-processing-Pipeline: Reference Data | ||
|
||
Reference files are essential at various steps of the pipeline, including joint-genotyping, VQSR, the Variant Effect Predictor (VEP), and exomiser. | ||
|
||
These files must be correctly downloaded and specified through pipeline parameters. This document provides a comprehensive list of the required reference files and explains how to set the pipeline parameters appropriately. | ||
|
||
## Broad reference data (VQSR) | ||
The `broad` parameter specifies the directory containing the reference data files for VQSR. We chose the name `broad` because | ||
this data is from the [Broad Institute](https://www.broadinstitute.org/), a collaborative research institution known for its contributions to genomics and biomedical research. | ||
|
||
The broad directory must contain the following files: | ||
- *Intervals File*: The genomic interval(s) over which we operate. The filename of this list must be defined with the `intervalsFile` parameter (e.g., "interval_long_local.list"). | ||
- Highly validated variant resources currently required by VQSR. ***These are currently hard coded in the pipeline***: | ||
  - HapMap file: `hapmap_3.3.hg38.vcf.gz` | ||
  - 1000G omni2.5 file: `1000G_omni2.5.hg38.vcf.gz` | ||
  - 1000G reference file: `1000G_phase1.snps.high_confidence.hg38.vcf.gz` | ||
  - SNP database: `Homo_sapiens_assembly38.dbsnp138.vcf.gz` | ||
|
||
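As a pre-flight check, the list above can be turned into a small script that reports which expected files are absent from the `broad` directory. This is only a sketch: the function name `missing_broad_files` is hypothetical and not part of the pipeline, and it checks file presence only (it does not look for index files or validate contents).

```python
from pathlib import Path

# File names listed above as hard coded in the pipeline.
VQSR_RESOURCES = [
    "hapmap_3.3.hg38.vcf.gz",
    "1000G_omni2.5.hg38.vcf.gz",
    "1000G_phase1.snps.high_confidence.hg38.vcf.gz",
    "Homo_sapiens_assembly38.dbsnp138.vcf.gz",
]


def missing_broad_files(broad_dir, intervals_file):
    """Return the expected file names that are absent from broad_dir.

    broad_dir corresponds to the `broad` parameter and intervals_file
    to the `intervalsFile` parameter.
    """
    root = Path(broad_dir)
    expected = [intervals_file, *VQSR_RESOURCES]
    return [name for name in expected if not (root / name).is_file()]
```

For example, `missing_broad_files("/path/to/broad", "interval_long_local.list")` returns an empty list when the directory is complete (the path here is a placeholder).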
## Reference Genome | ||
|
||
The `referenceGenome` parameter specifies the directory containing the reference genome files. | ||
|
||
This directory should contain the following files: | ||
- The reference genome FASTA file (e.g., `Homo_sapiens_assembly38.fasta`). This filename must be specified with the `referenceGenomeFasta` parameter. | ||
- The reference genome FASTA file index (e.g., `Homo_sapiens_assembly38.fasta.fai`). Its location will be automatically derived by appending `.fai` to the `referenceGenomeFasta` parameter. | ||
- The reference genome dictionary file (e.g., `Homo_sapiens_assembly38.dict`). Its location will be automatically derived by replacing the `.fasta` file extension of the `referenceGenomeFasta` parameter with `.dict`. | ||
|
||
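The two derivation rules above (append `.fai`, replace `.fasta` with `.dict`) can be illustrated with a short sketch. The helper name `derived_reference_paths` is hypothetical, and the pipeline's own implementation may differ; the sketch assumes the FASTA filename ends in `.fasta`.

```python
from pathlib import Path


def derived_reference_paths(reference_genome, reference_genome_fasta):
    """Derive the index and dictionary paths from the FASTA location.

    reference_genome is the directory (`referenceGenome` parameter) and
    reference_genome_fasta the filename (`referenceGenomeFasta` parameter).
    """
    fasta = Path(reference_genome) / reference_genome_fasta
    return {
        "fasta": fasta,
        # Index: the FASTA path with ".fai" appended.
        "fai": Path(str(fasta) + ".fai"),
        # Dictionary: the final extension (assumed ".fasta") replaced by ".dict".
        "dict": fasta.with_suffix(".dict"),
    }
```

With `referenceGenomeFasta` set to `Homo_sapiens_assembly38.fasta`, the derived names are `Homo_sapiens_assembly38.fasta.fai` and `Homo_sapiens_assembly38.dict`, matching the examples in the list above.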
|
||
## VEP Cache Directory | ||
The `vepCache` parameter specifies the directory for the vep cache. It is only required if `vep` is specified via the | ||
`tools` parameter. | ||
|
||
The vep cache is not automatically populated by the pipeline. It must be pre-downloaded. You can obtain a copy of the | ||
data by following the [vep installation procedure](https://github.com/Ensembl/ensembl-vep). Generally, we only need the human files obtainable from [Ensembl](https://ftp.ensembl.org/pub/release-112/variation/vep/homo_sapiens_vep_112_GRCh38.tar.gz). | ||
|
||
## Exomiser reference data | ||
The exomiser reference data is only required if `exomiser` is specified via the `tools` parameter. | ||
|
||
The `exomiser_data_dir` parameter specifies the path to the directory containing the exomiser reference files. | ||
This directory will be passed to the exomiser tool via the exomiser option `--exomiser.data-directory`. | ||
|
||
Its content should look like this: | ||
``` | ||
2402_hg19/ | ||
2402_hg38/ | ||
2402_phenotype/ | ||
remm/ | ||
cadd/ | ||
``` | ||
|
||
- *2402_hg19/* and *2402_hg38/*: These folders contain data associated with the `hg19` and `hg38` genome assemblies, respectively. The number `2402` corresponds to the exomiser data version. | ||
- *remm/* and *cadd/*: These folders are necessary if REMM and CADD are used as pathogenicity sources in the exomiser analysis file. The files and subdirectories within these folders must follow a specific structure, and exomiser needs to know the genome assembly (hg19 or hg38) and the versions of REMM and CADD being used to infer file locations. | ||
|
||
To prepare the exomiser data directory, follow the instructions in the [exomiser installation documentation](https://exomiser.readthedocs.io/en/latest/installation.html#linux-install). | ||
|
||
Together with the `exomiser_data_dir` parameter, the following parameters must be provided to exomiser and should match the available reference data: | ||
- `exomiser_genome`: The genome assembly version to be used by exomiser. Accepted values are `hg38` or `hg19`. | ||
- `exomiser_data_version`: The exomiser data version. Example: `2402`. | ||
- `exomiser_cadd_version`: The version of the CADD data to be used by exomiser (optional). Example: `1.3`. | ||
- `exomiser_remm_version`: The version of the REMM data to be used by exomiser (optional). Example: `0.3.1`. | ||
|
||
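To see how these parameters relate to the directory layout shown earlier, the sketch below lists the top-level folders one would expect under `exomiser_data_dir` for a given set of values. The function names are hypothetical. The sketch assumes that the `<version>_phenotype` folder is always needed and that `cadd/` and `remm/` are needed only when the corresponding version is set. It does not check the internal structure of those folders.

```python
from pathlib import Path


def expected_exomiser_dirs(data_version, genome, cadd_version=None, remm_version=None):
    """List the top-level folders expected in the exomiser data directory."""
    if genome not in ("hg19", "hg38"):
        raise ValueError(f"exomiser_genome must be 'hg19' or 'hg38', got {genome!r}")
    # Folder names follow the "<data version>_<suffix>" pattern shown above.
    dirs = [f"{data_version}_{genome}", f"{data_version}_phenotype"]
    if cadd_version:
        dirs.append("cadd")
    if remm_version:
        dirs.append("remm")
    return dirs


def missing_exomiser_dirs(data_dir, data_version, genome, cadd_version=None, remm_version=None):
    """Return the expected folders that do not exist under data_dir."""
    root = Path(data_dir)
    expected = expected_exomiser_dirs(data_version, genome, cadd_version, remm_version)
    return [name for name in expected if not (root / name).is_dir()]
```

For instance, `expected_exomiser_dirs("2402", "hg38", remm_version="0.3.1")` yields `2402_hg38`, `2402_phenotype` and `remm`.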
## Exomiser analysis files | ||
In addition to the reference data, exomiser requires an analysis file (.yml/.json) that contains, among other | ||
things, the variant frequency sources for prioritization of rare variants, the variant pathogenicity sources to consider, and the list of filters and prioritisers to apply. | ||
|
||
Typically, different analysis settings are used for whole exome sequencing (WES) and whole genome sequencing (WGS) data. | ||
Default analysis files are provided for each sequencing type in the assets folder: | ||
- assets/exomiser/default_exomiser_WES_analysis.yml | ||
- assets/exomiser/default_exomiser_WGS_analysis.yml | ||
|
||
You can override these defaults and provide your own analysis file(s) via the parameters `exomiser_analysis_wes` and `exomiser_analysis_wgs`. | ||
|
||
The exomiser analysis file format follows the `phenopacket` standard and is described in detail [here](https://exomiser.readthedocs.io/en/latest/advanced_analysis.html#analysis). | ||
There are typically multiple sections in the analysis file. To be compatible with the way we run the exomiser command, your | ||
analysis file should contain only the `analysis` section. | ||
|
||
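A custom analysis file can be checked against this constraint before it is passed to the pipeline. The sketch below assumes that "only the `analysis` section" means a single top-level key named `analysis`; this is an interpretation, not a confirmed rule of the pipeline. The function name `analysis_file_problems` is hypothetical. It takes an already parsed document, so a JSON file can be loaded with the standard `json` module and a YAML file with any YAML parser.

```python
def analysis_file_problems(document):
    """Return a list of problems found in a parsed analysis document.

    An empty list means the document has exactly one top-level
    section, named "analysis".
    """
    if not isinstance(document, dict):
        return ["top level is not a mapping"]
    problems = []
    if "analysis" not in document:
        problems.append("missing section: analysis")
    # Any other top-level key is reported as an unexpected section.
    for key in sorted(document, key=str):
        if key != "analysis":
            problems.append(f"unexpected section: {key}")
    return problems
```

For a JSON analysis file, `analysis_file_problems(json.load(open(path)))` would report any additional top-level section found next to `analysis`.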
## Reference data parameters summary | ||
|
||
| Parameter name | Required? | Description | | ||
| --- | --- | --- | | ||
| `referenceGenome` | _Required_ | Path to the directory containing the reference genome data | | ||
| `referenceGenomeFasta` | _Required_ | Filename of the reference genome .fasta file, within the specified `referenceGenome` directory | | ||
| `broad` | _Required_ | Path to the directory containing Broad reference data | | ||
| `intervalsFile` | _Required_ | Filename of the genome intervals list, within the specified `broad` directory | | ||
| `vepCache` | _Optional_ | Path to the vep cache data directory | | ||
| `exomiser_data_dir` | _Optional_ | Path to the exomiser reference data directory | | ||
| `exomiser_genome` | _Optional_ | Genome assembly version to be used by exomiser (`hg19` or `hg38`) | | ||
| `exomiser_data_version` | _Optional_ | Exomiser data version (e.g., `2402`) | | ||
| `exomiser_cadd_version` | _Optional_ | Version of the CADD data to be used by exomiser (e.g., `1.7`) | | ||
| `exomiser_remm_version` | _Optional_ | Version of the REMM data to be used by exomiser (e.g., `0.3.1`) | | ||
| `exomiser_analysis_wes` | _Optional_ | Path to the exomiser analysis file for WES data, if different from the default | | ||
| `exomiser_analysis_wgs` | _Optional_ | Path to the exomiser analysis file for WGS data, if different from the default | | ||
|
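The conditional requirements described in this document (VEP and exomiser parameters are needed only when the corresponding tool is selected) can be summarised in a short sketch. This is not the pipeline's validation code. The function name `missing_params` is hypothetical, and representing `tools` as a list of tool names is an assumption about how the `tools` parameter is parsed.

```python
# Parameters marked as required in the summary table.
ALWAYS_REQUIRED = ["referenceGenome", "referenceGenomeFasta", "broad", "intervalsFile"]

# Parameters that become required only when a tool is selected.
REQUIRED_BY_TOOL = {
    "vep": ["vepCache"],
    "exomiser": ["exomiser_data_dir", "exomiser_genome", "exomiser_data_version"],
}


def missing_params(params, tools=()):
    """Return the names of required parameters that are unset or empty."""
    required = list(ALWAYS_REQUIRED)
    for tool in tools:
        required += REQUIRED_BY_TOOL.get(tool, [])
    return [name for name in required if not params.get(name)]
```

Calling `missing_params(params, tools=["vep"])` with a `params` dictionary that lacks `vepCache` returns `["vepCache"]`, while the same dictionary passes when no tool is selected.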