Lysiane changes

It includes the following changes:
- fix linter warnings
- introducing conditional parameter validation logic for exomiser and vep
- using a dedicated exomiser_genome parameter
- utility functions to check if a tool is present
- make exomiser stub output files identical to real output files
- infer exomiser version from exomiser banner file in container
- standardize exomiser process outputs
- introducing per sequencing type analysis file
- use process input instead of params to pass configuration information
- update README.md, OUTPUT.md and USAGE.md
Ferlab-Ste-Justine · Sep 23, 2024 · 860ac7f
1 parent 9e74994

Showing 13 changed files with 432 additions and 282 deletions.
diff --git a/README.md b/README.md
@@ -1,16 +1,18 @@
 [![nf-test](https://img.shields.io/badge/unit_tests-nf--test-337ab7.svg)](https://www.nf-test.com)
 
 [![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A523.10.1-23aa62.svg)](https://www.nextflow.io/)
-[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
 [![run with docker](https://img.shields.io/badge/run%20with-docker-0db7ed?labelColor=000000&logo=docker)](https://www.docker.com/)
 [![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg?labelColor=000000)](https://sylabs.io/docs/)
-[![Launch on Seqera Platform](https://img.shields.io/badge/Launch%20%F0%9F%9A%80-Seqera%20Platform-%234256e7)](https://cloud.seqera.io/launch?pipeline=https://github.com/ferlab/postprocessing)
+
+<!-- HIDDEN BECAUSE NOT SUPPORTED YET
+[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
+-->
 
 ## Introduction
 
-**ferlab/postprocessing** is a bioinformatics pipeline that takes GVCFs from several samples to combine, perform joint genotyping, tag low quality variant and annotate a final vcf version.
+**Ferlab-Ste-Justine/Post-processing-Pipeline** is a bioinformatics pipeline designed for family-based analysis of GVCFs from multiple samples. 
+It performs joint genotyping, tags low-quality variants, and optionally annotates the final vcf data using vep and/or exomiser.
 
-<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->
 ###  Summary:
 1. Remove MNPs using bcftools 
 2. Normalize .gvcf
@@ -19,104 +21,56 @@
 5. Tag false positive variants with either:
   - For whole genome sequencing data: [Variant quality score recalibration (VQSR)](https://gatk.broadinstitute.org/hc/en-us/articles/360036510892-VariantRecalibrator)
   - For whole exome sequencing data: [Hard-Filtering](https://gatk.broadinstitute.org/hc/en-us/articles/360036733451-VariantFiltration)
-6. Annotate variants with [Variant effect predictor (VEP)](https://useast.ensembl.org/info/docs/tools/vep/index.html)
+6. Optionally annotate variants with [Variant effect predictor (VEP)](https://useast.ensembl.org/info/docs/tools/vep/index.html)
+7. Optionally integrate phenotype data to annotate, filter and prioritise variants likely to be disease-causing with [exomiser](https://www.sanger.ac.uk/tool/exomiser/)
 
+<!-- TODO: UPDATE THIS DIAGRAM -->
 ![PostProcessingDiagram](assets/PostProcessingImage.png?raw=true)
 
 ## Usage
 
-> [!NOTE]
-> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.
-
-### Samples
-The workflow will accept sample data separated by commas (CSV format). The path to the sample file must be specified with the "**input**" parameter. The column names are : familyId,sample,sequencingType,file. The sequencing type must be either WES (Whole Exome Sequencing) or WGS (Whole Genome Sequencing).
-
-**sample.csv**
-```csv
-**familyId**,**sample**,**sequencingType**,**file**
-CONGE-XXX,01,WES,CONGE-XXX-01.hard-filtered.gvcf.gz
-CONGE-XXX,02,WES,CONGE-XXX-02.hard-filtered.gvcf.gz
-CONGE-XXX,03,WES,CONGE-XXX-03.hard-filtered.gvcf.gz
-CONGE-YYY,01,WGS,CONGE-YYY-01.hard-filtered.gvcf.gz
-CONGE-YYY,02,WGS,CONGE-YYY-02.hard-filtered.gvcf.gz
-CONGE-YYY,03,WGS,CONGE-YYY-03.hard-filtered.gvcf.gz
-```
-
-
-> [!NOTE]
-> The sequencing type also determines the type of variant filtering the pipeline will use.
-> 
-> In the case of Whole Genome Sequencing, VQSR (Variant Quality Score Recalibration) is used (preferred method).
-> 
-> In the case of Whole Exome Sequencing, Hard-filtering needs to be used.
-
-Now, you can run the pipeline using:
-
-<!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->
+Here is an example nextflow command to run the pipeline:
 
 ```bash
-nextflow run ferlab/postprocessing \
-   -profile <docker/singularity/.../> \
+nextflow run -c cluster.config Ferlab-Ste-Justine/Post-processing-Pipeline -r "v2.0.0" \
+    -params-file params.json  \
    --input samplesheet.csv \
-   --outdir <OUTDIR>
+   --outdir results/dir \
+   --tools vep,exomiser
 ```
 
+> [!NOTE]
+> If you are new to nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up nextflow.
+
 > [!WARNING]
-> Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_;
+> Please provide pipeline parameters via the CLI or nextflow `-params-file` option. Custom config files including those provided by the `-c` nextflow option can be used to provide any configuration _**except for parameters**_;
 > see [docs](https://nf-co.re/usage/configuration#custom-configuration-files).
 
-### References
-Reference files are necessary at multiple steps of the workflow, notably for joint-genotyping,the variant effect predictor (VEP) and VQSR. 
-Using igenome, we can retrieve the relevant files for the desired version of the human genome.
-Specifically, we specifiy the igenome version with the **genome** parameter. Most likely this value will be *'GRCh38'*
-
 
-Next, we also need broader references, which are contained in a path defined by the **broad** parameter.
+For more details, see [docs/usage.md](docs/usage.md) and [docs/reference_data.md](docs/reference_data.md).
 
-The broad directory must contain the following files:
 
-- The interval list which determines the genomic interval(s) over which we operate: filename of this list must be defined with the **intervalsFile** parameter
-- Highly validated variance ressources currently required by VQSR. ***These are currently hard coded in the pipeline!***
-  - HapMap file : hapmap_3.3.hg38.vcf.gz
-  - 1000G omni2.5 file : 1000G_omni2.5.hg38.vcf.gz
-  - 1000G reference file : 1000G_phase1.snps.high_confidence.hg38.vcf.gz
-  - SNP database : Homo_sapiens_assembly38.dbsnp138.vcf.gz
+### Stub mode and quick tests
 
-
-Finally, the vep cache directory must be specified with **vepCache**, which is usually created by vep itself on first installation.
-Generally, we only need the human files obtainable from https://ftp.ensembl.org/pub/release-112/variation/vep/homo_sapiens_vep_112_GRCh38.tar.gz
+The `-stub` (or `-stub-run`) option can be added to run the "stub" block of processes instead of the "script" block. This can be helpful for testing.
 
-### Stub run
-The -stub-run option can be added to run the "stub" block of processes instead of the "script" block. This can be helpful for testing.
 
-🚧
-
-Parameters summary
------
+To test your setup in stub mode, simply run `nextflow run Ferlab-Ste-Justine/Post-processing-Pipeline -profile test,docker -stub`. 
 
-| Parameter name | Required? | Accepted input |
-| --- | --- | --- |
-| `input` | _Required_ | file |
-| `outdir` | _Required_ | path |
-| `genome` | _Required_ | igenome version, ie 'GRCh38'|
-| `broad` | _Required_ | path |
-| `intervalsFile` | _Required_ | list of genome intervals |
-| `vepCache` | _Required_ | path |
+For tests with real data, see the documentation in the [test configuration profile](conf/test.config).
 
 
 Pipeline Output
 -----
-Path to output directory must be specified in **outdir** parameter.
-🚧
+Path to output directory must be specified via the `outdir` parameter.
 
+See [docs/output.md](docs/output.md) for more details about pipeline outputs.
 
-## Credits
 
-ferlab/postprocessing was originally written by Damien Geneste, David Morais, Felix-Antoine Le Sieur, Jeremy Costanza, Lysiane Bouchard.
+## Credits
 
-We thank the following people for their extensive assistance in the development of this pipeline:
+Ferlab-Ste-Justine/Post-processing-Pipeline was originally written by Damien Geneste, David Morais, Felix-Antoine Le Sieur, Jeremy Costanza, Lysiane Bouchard.
 
-<!-- TODO nf-core: If applicable, make list of people who have also contributed -->
 
 ## Contributions and Support
 
@@ -140,11 +94,10 @@ The documentation of the various tools used in this workflow are available here:
 
 [VEP](https://useast.ensembl.org/info/docs/tools/vep/script/vep_options.html)
 
-## Citations
+[EXOMISER](https://exomiser.readthedocs.io/en/latest/)
 
-<!-- TODO nf-core: Add bibliography of tools and data used in your pipeline -->
 
-An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.
+## Citations
 
 This pipeline uses code and infrastructure developed and maintained by the [nf-core](https://nf-co.re) community, reused here under the [MIT license](https://github.com/nf-core/tools/blob/master/LICENSE).
 

diff --git a/conf/test.config b/conf/test.config
@@ -50,8 +50,9 @@ params {
     tools = "vep,exomiser"
 
     // Exomiser parameters
-    exomiser_analysis = "assets/exomiser/test_exomiser_analysis.yml"
+    exomiser_analysis_wes = "assets/exomiser/test_exomiser_analysis.yml"
+    exomiser_analysis_wgs = "assets/exomiser/test_exomiser_analysis.yml"
     exomiser_data_dir = "data-test/reference/exomiser"
     exomiser_data_version = "2402"
-    genome = "hg38"
+    exomiser_genome = "hg38"
 }
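
The conditional validation this commit introduces for exomiser and vep can be illustrated with a short Python sketch. It is not the pipeline's actual (Groovy) implementation. The helper names `parse_tools` and `missing_params` are hypothetical, and the list of parameters required per tool is an assumption drawn from docs/reference_data.md rather than from the pipeline code.

```python
# Hypothetical sketch of conditional parameter validation: a parameter is
# only checked when the tool that needs it is listed in `tools`.

# ASSUMPTION: parameters required per tool, inferred from the documentation.
REQUIRED_BY_TOOL = {
    "vep": ["vepCache"],
    "exomiser": ["exomiser_data_dir", "exomiser_genome", "exomiser_data_version"],
}


def parse_tools(tools):
    """Split a comma-separated `tools` value such as "vep,exomiser" into a set."""
    if not tools:
        return set()
    # ASSUMPTION: tool names are compared case-insensitively.
    return {name.strip().lower() for name in tools.split(",") if name.strip()}


def missing_params(params):
    """Return the names of parameters that are required but absent or empty."""
    missing = []
    for tool in sorted(parse_tools(params.get("tools"))):
        for name in REQUIRED_BY_TOOL.get(tool, []):
            if not params.get(name):
                missing.append(name)
    return missing


if __name__ == "__main__":
    # Values taken from the exomiser block of conf/test.config shown above.
    test_params = {
        "tools": "vep,exomiser",
        "exomiser_data_dir": "data-test/reference/exomiser",
        "exomiser_data_version": "2402",
        "exomiser_genome": "hg38",
    }
    print(missing_params(test_params))
```

With only the exomiser values above, the sketch reports `vepCache` as missing because `vep` is also requested. With an empty `tools` value, nothing is required.
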
diff --git a/docs/output.md b/docs/output.md
@@ -3,9 +3,8 @@
 ## Introduction
 
 This document describes the output produced by the pipeline.
-The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
+The directories listed below will be created in the output directory after the pipeline has finished. All paths are relative to the top-level output directory.
 
-<!-- TODO nf-core: Write this documentation describing your workflow's output -->
 
 ## Pipeline overview
 
@@ -20,7 +19,11 @@ The directories listed below will be created in the results directory after the
   - A copy of the nextflow log file: `nextflow.log`. Note that it will miss logs written after the workflow.onComplete handler is run.
   - Copies of the configuration files used: `config/*.config`. This includes the default `nextflow.config` file as well as any additional configuration files passed as parameters.
   - Other metadata relevant for reproducibility: `metadata.txt` . It contains information such as the original command line, the name of the branch and revision used, the username of the person who submitted the job, a list of configuration files passed, the nextflow work directory, etc.
-
+- `splitmultiallelics/`: pipeline output before running the tools specified via the `tools` parameter.
+- `vep/`: vep output
+- `exomiser/results`: exomiser output
+
+You might see other folders named after different pipeline processes. These are considered intermediate pipeline outputs.
 
 </details>
 

diff --git a/docs/reference_data.md b/docs/reference_data.md
@@ -0,0 +1,93 @@
+# Ferlab-Ste-Justine/Post-processing-Pipeline: Reference Data
+
+Reference files are essential at various steps of the pipeline, including joint-genotyping, VQSR, the Variant Effect Predictor (VEP), and exomiser. 
+
+These files must be correctly downloaded and specified through pipeline parameters. This document provides a comprehensive list of the required reference files and explains how to set the pipeline parameters appropriately.
+
+## Broad reference data (VQSR)
+The `broad` parameter specifies the directory containing the reference data files for VQSR. We chose the name `broad` because
+this data is from the [Broad Institute](https://www.broadinstitute.org/), a collaborative research institution known for its contributions to genomics and biomedical research.
+
+The broad directory must contain the following files:
+-  *Intervals File*: The genomic interval(s) over which we operate. The filename of this list must be defined with the `intervalsFile` parameter (e.g., "interval_long_local.list"). 
+- Highly validated variant resources currently required by VQSR. ***These are currently hard coded in the pipeline***:
+  - HapMap file : hapmap_3.3.hg38.vcf.gz
+  - 1000G omni2.5 file : 1000G_omni2.5.hg38.vcf.gz
+  - 1000G reference file : 1000G_phase1.snps.high_confidence.hg38.vcf.gz
+  - SNP database : Homo_sapiens_assembly38.dbsnp138.vcf.gz
+
+## Reference Genome
+
+The `referenceGenome` parameter specifies the directory containing the reference genome files.
+
+This directory should contain the following files:
+- The reference genome FASTA file (e.g., `Homo_sapiens_assembly38.fasta`). This filename must be specified with the `referenceGenomeFasta` parameter.
+- The reference genome FASTA file index (e.g., `Homo_sapiens_assembly38.fasta.fai`). Its location will be automatically derived by appending `.fai` to the `referenceGenomeFasta` parameter.
+- The reference genome dictionary file (e.g., `Homo_sapiens_assembly38.dict`). Its location will be automatically derived by replacing the `.fasta` file extension of the `referenceGenomeFasta` parameter with `.dict`.
+
+
+## VEP Cache Directory
+The `vepCache` parameter specifies the directory for the vep cache. It is only required if `vep` is specified via the
+`tools` parameter.
+
+The vep cache is not automatically populated by the pipeline. It must be pre-downloaded. You can obtain a copy of the 
+data by following the [vep installation procedure](https://github.com/Ensembl/ensembl-vep). Generally, we only need the human files obtainable from [Ensembl](https://ftp.ensembl.org/pub/release-112/variation/vep/homo_sapiens_vep_112_GRCh38.tar.gz).
+
+## Exomiser reference data
+The exomiser reference data is only required if `exomiser` is specified via the `tools` parameter.
+
+The `exomiser_data_dir` parameter specifies the path to the directory containing the exomiser reference files.
+This directory will be passed to the exomiser tool via the exomiser option `--exomiser.data-directory`.
+
+Its content should look like this:
+```
+2402_hg19/
+2402_hg38/
+2402_phenotype/
+remm/
+cadd/
+```
+
+- *2402_hg19/* and *2402_hg38/*: These folders contain data associated with the `hg19` and `hg38` genome assemblies, respectively. The number `2402` corresponds to the exomiser data version.
+- *remm/* and *cadd/*: These folders are necessary if REMM and CADD are used as pathogenicity sources in the exomiser analysis file. The files and subdirectories within these folders must follow a specific structure, and exomiser will need to know the genome assembly (hg19 or hg38) and the versions of REMM and CADD being used to infer file locations.
+
+To prepare the exomiser data directory, follow the instructions in the [exomiser installation documentation](https://exomiser.readthedocs.io/en/latest/installation.html#linux-install)
+
+Together with the `exomiser_data_dir` parameter, the following parameters must be provided to exomiser and should match the available reference data:
+- `exomiser_genome`: The genome assembly version to be used by exomiser. Accepted values are `hg38` or `hg19`.
+- `exomiser_data_version`: The exomiser data version. Example: `2402`.
+- `exomiser_cadd_version`: The version of the CADD data to be used by exomiser (optional). Example: `1.3`.  
+- `exomiser_remm_version`: The version of the REMM data to be used by exomiser (optional). Example: `0.3.1`.
+
+## Exomiser analysis files
+In addition to the reference data, exomiser requires an analysis file (.yml/.json) that contains, among other
+things, the variant frequency sources for prioritization of rare variants, variant pathogenicity sources to consider, the list of filters and prioritisers to apply, etc.
+
+Typically, different analysis settings are used for whole exome sequencing (WES) and whole genome sequencing (WGS) data.
+Default analysis files are provided for each sequencing type in the assets folder:
+- assets/exomiser/default_exomiser_WES_analysis.yml
+- assets/exomiser/default_exomiser_WGS_analysis.yml
+
+You can override these defaults and provide your own analysis file(s) via the parameters `exomiser_analysis_wes` and `exomiser_analysis_wgs`.
+
+The exomiser analysis file format follows the `phenopacket` standard and is described in detail [here](https://exomiser.readthedocs.io/en/latest/advanced_analysis.html#analysis).
+There are typically multiple sections in the analysis file. To be compatible with the way we run the exomiser command, your 
+analysis file should contain only the `analysis` section.
+
+## Reference data parameters summary
+
+| Parameter name | Required? | Description |
+| --- | --- | --- |
+| `referenceGenome` |  _Required_ | Path to the directory containing the reference genome data |
+| `referenceGenomeFasta` | _Required_ | Filename of the reference genome .fasta file, within the specified `referenceGenome` directory |
+| `broad` | _Required_ | Path to the directory containing Broad reference data |
+| `intervalsFile` | _Required_ | Filename of the genome intervals list, within the specified `broad` directory |
+| `vepCache` | _Optional_ | Path to the vep cache data directory |
+| `exomiser_data_dir` | _Optional_ | Path to the exomiser reference data directory |
+| `exomiser_genome` | _Optional_ | Genome assembly version to be used by exomiser (`hg19` or `hg38`) |
+| `exomiser_data_version` | _Optional_ | Exomiser data version (e.g., `2402`)|
+| `exomiser_cadd_version` | _Optional_ | Version of the CADD data to be used by exomiser (e.g., `1.7`)|
+| `exomiser_remm_version` | _Optional_ | Version of the REMM data to be used by exomiser (e.g., `0.3.1`)|
+| `exomiser_analysis_wes` | _Optional_ | Path to the exomiser analysis file for WES data, if different from the default |
+| `exomiser_analysis_wgs` | _Optional_ | Path to the exomiser analysis file for WGS data, if different from the default |
+
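
Two conventions described in docs/reference_data.md are easy to misread, so here is a minimal Python sketch of both: how the `.fai` and `.dict` locations follow from `referenceGenomeFasta`, and how a per sequencing type analysis file falls back to the documented default. The function names `genome_files` and `analysis_file` are hypothetical, and the pipeline itself implements this logic in Nextflow/Groovy, which may differ in detail.

```python
from pathlib import PurePosixPath

# Default analysis files listed in docs/reference_data.md.
DEFAULT_ANALYSIS = {
    "WES": "assets/exomiser/default_exomiser_WES_analysis.yml",
    "WGS": "assets/exomiser/default_exomiser_WGS_analysis.yml",
}


def genome_files(reference_genome, fasta_name):
    """Derive index and dictionary paths from the FASTA filename (hypothetical helper)."""
    fasta = PurePosixPath(reference_genome) / fasta_name
    return {
        "fasta": str(fasta),
        # The index is found by appending `.fai` to the FASTA path.
        "fai": str(fasta) + ".fai",
        # The dictionary is found by replacing the final extension with `.dict`.
        "dict": str(fasta.with_suffix(".dict")),
    }


def analysis_file(sequencing_type, params):
    """Pick the user-supplied analysis file for WES/WGS, else the default."""
    key = "exomiser_analysis_" + sequencing_type.lower()
    return params.get(key) or DEFAULT_ANALYSIS[sequencing_type.upper()]


if __name__ == "__main__":
    # "/ref" is a placeholder directory used only for illustration.
    print(genome_files("/ref", "Homo_sapiens_assembly38.fasta"))
    print(analysis_file("WES", {}))
```

Note that replacing the extension only gives the expected `.dict` name when the FASTA filename ends in a single extension such as `.fasta`. A compressed name such as `.fasta.gz` would need separate handling.
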