Commit

Lysiane changes
It includes the following changes:
- fix linter warnings
- introducing conditional parameter validation logic for exomiser and vep
- using a dedicated exomiser_genome parameter
- utility functions to check if a tool is present
- make exomiser stub output files identical to real output files
- infer exomiser version from exomiser banner file in container
- standardize exomiser process outputs
- introducing per sequencing type analysis file
- use process input instead of params to pass configuration information
- update README.md, OUTPUT.md and USAGE.md
LysianeBouchard committed Sep 23, 2024
1 parent 9e74994 commit 860ac7f
Showing 13 changed files with 432 additions and 282 deletions.
105 changes: 29 additions & 76 deletions README.md
@@ -1,16 +1,18 @@
[![nf-test](https://img.shields.io/badge/unit_tests-nf--test-337ab7.svg)](https://www.nf-test.com)

[![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A523.10.1-23aa62.svg)](https://www.nextflow.io/)
[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
[![run with docker](https://img.shields.io/badge/run%20with-docker-0db7ed?labelColor=000000&logo=docker)](https://www.docker.com/)
[![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg?labelColor=000000)](https://sylabs.io/docs/)
[![Launch on Seqera Platform](https://img.shields.io/badge/Launch%20%F0%9F%9A%80-Seqera%20Platform-%234256e7)](https://cloud.seqera.io/launch?pipeline=https://github.com/ferlab/postprocessing)

<!-- HIDING BECAUSE NOT SUPPORTED YET
[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
-->

## Introduction

**ferlab/postprocessing** is a bioinformatics pipeline that takes GVCFs from several samples to combine, perform joint genotyping, tag low quality variant and annotate a final vcf version.
**Ferlab-Ste-Justine/Post-processing-Pipeline** is a bioinformatics pipeline designed for family-based analysis of GVCFs from multiple samples.
It performs joint genotyping, tags low-quality variants, and optionally annotates the final vcf data using vep and/or exomiser.

<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->
### Summary:
1. Remove MNPs using bcftools
2. Normalize .gvcf
@@ -19,104 +21,56 @@
5. Tag false positive variants with either:
- For whole genome sequencing data: [Variant quality score recalibration (VQSR)](https://gatk.broadinstitute.org/hc/en-us/articles/360036510892-VariantRecalibrator)
- For whole exome sequencing data: [Hard-Filtering](https://gatk.broadinstitute.org/hc/en-us/articles/360036733451-VariantFiltration)
6. Annotate variants with [Variant effect predictor (VEP)](https://useast.ensembl.org/info/docs/tools/vep/index.html)
6. Optionally annotate variants with [Variant effect predictor (VEP)](https://useast.ensembl.org/info/docs/tools/vep/index.html)
7. Optionally integrate phenotype data to annotate, filter and prioritise variants likely to be disease-causing with [exomiser](https://www.sanger.ac.uk/tool/exomiser/)

<!-- TODO: UPDATE THIS DIAGRAM -->
![PostProcessingDiagram](assets/PostProcessingImage.png?raw=true)

## Usage

> [!NOTE]
> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.
### Samples
The workflow accepts sample data in comma-separated (CSV) format. The path to the sample file must be specified with the "**input**" parameter. The column names are: `familyId`, `sample`, `sequencingType`, `file`. The sequencing type must be either WES (Whole Exome Sequencing) or WGS (Whole Genome Sequencing).

**sample.csv**
```csv
**familyId**,**sample**,**sequencingType**,**file**
CONGE-XXX,01,WES,CONGE-XXX-01.hard-filtered.gvcf.gz
CONGE-XXX,02,WES,CONGE-XXX-02.hard-filtered.gvcf.gz
CONGE-XXX,03,WES,CONGE-XXX-03.hard-filtered.gvcf.gz
CONGE-YYY,01,WGS,CONGE-YYY-01.hard-filtered.gvcf.gz
CONGE-YYY,02,WGS,CONGE-YYY-02.hard-filtered.gvcf.gz
CONGE-YYY,03,WGS,CONGE-YYY-03.hard-filtered.gvcf.gz
```
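
The sample sheet rules above can be checked before launching a run. The following is a minimal sketch, not the pipeline's own validation: the function name `validate_samplesheet` is hypothetical, and it assumes the header is written plainly as `familyId,sample,sequencingType,file` in that order.

```python
import csv
import io

# Column names and accepted sequencing types as described in the README.
EXPECTED_COLUMNS = ["familyId", "sample", "sequencingType", "file"]
VALID_SEQUENCING_TYPES = {"WES", "WGS"}


def validate_samplesheet(text):
    """Return a list of human-readable problems found in the sample sheet text."""
    reader = csv.DictReader(io.StringIO(text))
    problems = []
    # Assumption: the header must match the documented columns exactly.
    if reader.fieldnames != EXPECTED_COLUMNS:
        problems.append(f"unexpected header: {reader.fieldnames}")
        return problems
    # Data rows start at line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        if row["sequencingType"] not in VALID_SEQUENCING_TYPES:
            problems.append(
                f"line {line_number}: sequencingType must be WES or WGS, "
                f"got {row['sequencingType']!r}"
            )
    return problems


example = """familyId,sample,sequencingType,file
CONGE-XXX,01,WES,CONGE-XXX-01.hard-filtered.gvcf.gz
CONGE-YYY,01,WGX,CONGE-YYY-01.hard-filtered.gvcf.gz
"""
print(validate_samplesheet(example))
```

An empty list means no problem was found. The check only covers the header and the sequencing type; it does not verify that the listed files exist.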


> [!NOTE]
> The sequencing type also determines the type of variant filtering the pipeline will use.
>
> In the case of Whole Genome Sequencing, VQSR (Variant Quality Score Recalibration) is used (preferred method).
>
> In the case of Whole Exome Sequencing, Hard-filtering needs to be used.
Now, you can run the pipeline using:

<!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->
Here is an example nextflow command to run the pipeline:

```bash
nextflow run ferlab/postprocessing \
-profile <docker/singularity/.../> \
nextflow run -c cluster.config Ferlab-Ste-Justine/Post-processing-Pipeline -r "v2.0.0" \
-params-file params.json \
--input samplesheet.csv \
--outdir <OUTDIR>
--outdir results/dir \
--tools vep,exomiser
```

> [!NOTE]
> If you are new to nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up nextflow.
> [!WARNING]
> Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_;
> Please provide pipeline parameters via the CLI or nextflow `-params-file` option. Custom config files including those provided by the `-c` nextflow option can be used to provide any configuration _**except for parameters**_;
> see [docs](https://nf-co.re/usage/configuration#custom-configuration-files).
### References
Reference files are necessary at multiple steps of the workflow, notably for joint-genotyping, the variant effect predictor (VEP) and VQSR.
Using igenome, we can retrieve the relevant files for the desired version of the human genome.
Specifically, we specify the igenome version with the **genome** parameter. Most likely this value will be *'GRCh38'*.


Next, we also need broader references, which are contained in a path defined by the **broad** parameter.
For more details, see [docs/usage.md](docs/usage.md) and [docs/reference_data.md](docs/reference_data.md).

The broad directory must contain the following files:

- The interval list which determines the genomic interval(s) over which we operate: filename of this list must be defined with the **intervalsFile** parameter
- Highly validated variant resources currently required by VQSR. ***These are currently hard coded in the pipeline!***
- HapMap file : hapmap_3.3.hg38.vcf.gz
- 1000G omni2.5 file : 1000G_omni2.5.hg38.vcf.gz
- 1000G reference file : 1000G_phase1.snps.high_confidence.hg38.vcf.gz
- SNP database : Homo_sapiens_assembly38.dbsnp138.vcf.gz
### Stub mode and quick tests


Finally, the vep cache directory must be specified with **vepCache**, which is usually created by vep itself on first installation.
Generally, we only need the human files obtainable from https://ftp.ensembl.org/pub/release-112/variation/vep/homo_sapiens_vep_112_GRCh38.tar.gz
The `-stub` (or `-stub-run`) option can be added to run the "stub" block of processes instead of the "script" block. This can be helpful for testing.

### Stub run
The -stub-run option can be added to run the "stub" block of processes instead of the "script" block. This can be helpful for testing.

🚧

Parameters summary
-----
To test your setup in stub mode, simply run `nextflow run Ferlab-Ste-Justine/Post-processing-Pipeline -profile test,docker -stub`.

| Parameter name | Required? | Accepted input |
| --- | --- | --- |
| `input` | _Required_ | file |
| `outdir` | _Required_ | path |
| `genome` | _Required_ | igenome version, ie 'GRCh38'|
| `broad` | _Required_ | path |
| `intervalsFile` | _Required_ | list of genome intervals |
| `vepCache` | _Required_ | path |
For tests with real data, see the documentation in the [test configuration profile](conf/test.config).


Pipeline Output
-----
Path to output directory must be specified in **outdir** parameter.
🚧
Path to output directory must be specified via the `outdir` parameter.

See [docs/output.md](docs/output.md) for more details about pipeline outputs.

## Credits

ferlab/postprocessing was originally written by Damien Geneste, David Morais, Felix-Antoine Le Sieur, Jeremy Costanza, Lysiane Bouchard.
## Credits

We thank the following people for their extensive assistance in the development of this pipeline:
Ferlab-Ste-Justine/Post-processing-Pipeline was originally written by Damien Geneste, David Morais, Felix-Antoine Le Sieur, Jeremy Costanza, Lysiane Bouchard.

<!-- TODO nf-core: If applicable, make list of people who have also contributed -->

## Contributions and Support

@@ -140,11 +94,10 @@ The documentation of the various tools used in this workflow are available here:

[VEP](https://useast.ensembl.org/info/docs/tools/vep/script/vep_options.html)

## Citations
[EXOMISER](https://exomiser.readthedocs.io/en/latest/)

<!-- TODO nf-core: Add bibliography of tools and data used in your pipeline -->

An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.
## Citations

This pipeline uses code and infrastructure developed and maintained by the [nf-core](https://nf-co.re) community, reused here under the [MIT license](https://github.com/nf-core/tools/blob/master/LICENSE).

5 changes: 3 additions & 2 deletions conf/test.config
@@ -50,8 +50,9 @@ params {
tools = "vep,exomiser"

// Exomiser parameters
exomiser_analysis = "assets/exomiser/test_exomiser_analysis.yml"
exomiser_analysis_wes = "assets/exomiser/test_exomiser_analysis.yml"
exomiser_analysis_wgs = "assets/exomiser/test_exomiser_analysis.yml"
exomiser_data_dir = "data-test/reference/exomiser"
exomiser_data_version = "2402"
genome = "hg38"
exomiser_genome = "hg38"
}
9 changes: 6 additions & 3 deletions docs/output.md
@@ -3,9 +3,8 @@
## Introduction

This document describes the output produced by the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
The directories listed below will be created in the output directory after the pipeline has finished. All paths are relative to the top-level output directory.

<!-- TODO nf-core: Write this documentation describing your workflow's output -->

## Pipeline overview

@@ -20,7 +19,11 @@ The directories listed below will be created in the results directory after the
- A copy of the nextflow log file: `nextflow.log`. Note that it will miss logs written after the workflow.onComplete handler is run.
- Copies of the configuration files used: `config/*.config`. This includes the default `nextflow.config` file as well as any additional configuration files passed as parameters.
- Other metadata relevant for reproducibility: `metadata.txt` . It contains information such as the original command line, the name of the branch and revision used, the username of the person who submitted the job, a list of configuration files passed, the nextflow work directory, etc.

- `splitmultiallelics/`: pipeline output before running the tools specified via the `tools` parameter.
- `vep/`: vep output
- `exomiser/results`: exomiser output

You might see other folders named after different pipeline processes. These are considered intermediate pipeline outputs.

</details>

93 changes: 93 additions & 0 deletions docs/reference_data.md
@@ -0,0 +1,93 @@
# Ferlab-Ste-Justine/Post-processing-Pipeline: Reference Data

Reference files are essential at various steps of the pipeline, including joint-genotyping, VQSR, the Variant Effect Predictor (VEP), and exomiser.

These files must be correctly downloaded and specified through pipeline parameters. This document provides a comprehensive list of the required reference files and explains how to set the pipeline parameters appropriately.

## Broad reference data (VQSR)
The `broad` parameter specifies the directory containing the reference data files for VQSR. We chose the name `broad` because
this data is from the [Broad Institute](https://www.broadinstitute.org/), a collaborative research institution known for its contributions to genomics and biomedical research.

The broad directory must contain the following files:
- *Intervals File*: The genomic interval(s) over which we operate. The filename of this list must be defined with the `intervalsFile` parameter (e.g., "interval_long_local.list").
- Highly validated variant resources currently required by VQSR. ***These are currently hard coded in the pipeline***:
- HapMap file : hapmap_3.3.hg38.vcf.gz
- 1000G omni2.5 file : 1000G_omni2.5.hg38.vcf.gz
- 1000G reference file : 1000G_phase1.snps.high_confidence.hg38.vcf.gz
- SNP database : Homo_sapiens_assembly38.dbsnp138.vcf.gz

## Reference Genome

The `referenceGenome` parameter specifies the directory containing the reference genome files.

This directory should contain the following files:
- The reference genome FASTA file (e.g., `Homo_sapiens_assembly38.fasta`). This filename must be specified with the `referenceGenomeFasta` parameter.
- The reference genome FASTA file index (e.g., `Homo_sapiens_assembly38.fasta.fai`). Its location will be automatically derived by appending `.fai` to the `referenceGenomeFasta` parameter.
- The reference genome dictionary file (e.g., `Homo_sapiens_assembly38.dict`). Its location will be automatically derived by replacing the `.fasta` file extension of the `referenceGenomeFasta` parameter with `.dict`.
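
The derivation of the index and dictionary locations described above can be sketched as follows. This is an illustration only, not the pipeline's code: the function name `reference_genome_files` is hypothetical, and it assumes the FASTA filename ends with a single extension such as `.fasta`.

```python
from pathlib import PurePosixPath


def reference_genome_files(reference_genome, reference_genome_fasta):
    """Build the FASTA, index and dictionary paths from the two parameters."""
    fasta = PurePosixPath(reference_genome) / reference_genome_fasta
    return {
        "fasta": str(fasta),
        # The index path is the FASTA path with ".fai" appended.
        "fai": str(fasta) + ".fai",
        # The dictionary path replaces the final extension with ".dict".
        "dict": str(fasta.with_suffix(".dict")),
    }


files = reference_genome_files("/ref/genome", "Homo_sapiens_assembly38.fasta")
print(files["fai"])
print(files["dict"])
```

Here `/ref/genome` is a placeholder directory. All three files are expected to sit in the same `referenceGenome` directory.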


## VEP Cache Directory
The `vepCache` parameter specifies the directory for the vep cache. It is only required if `vep` is specified via the
`tools` parameter.

The vep cache is not automatically populated by the pipeline. It must be pre-downloaded. You can obtain a copy of the
data by following the [vep installation procedure](https://github.com/Ensembl/ensembl-vep). Generally, we only need the human files obtainable from [Ensembl](https://ftp.ensembl.org/pub/release-112/variation/vep/homo_sapiens_vep_112_GRCh38.tar.gz).

## Exomiser reference data
The exomiser reference data is only required if `exomiser` is specified via the `tools` parameter.

The `exomiser_data_dir` parameter specifies the path to the directory containing the exomiser reference files.
This directory will be passed to the exomiser tool via the exomiser option `--exomiser.data-directory`.

Its content should look like this:
```
2402_hg19/
2402_hg38/
2402_phenotype/
remm/
cadd/
```

- *2402_hg19/* and *2402_hg38/*: These folders contain data associated with the `hg19` and `hg38` genome assemblies, respectively. The number `2402` corresponds to the exomiser data version.
- *remm/* and *cadd/*: These folders are necessary if REMM and CADD are used as pathogenicity sources in the exomiser analysis file. The files and subdirectories within these folders must follow a specific structure, and exomiser will need to know the genome assembly (hg19 or hg38) and the versions of REMM and CADD being used to infer files locations.

To prepare the exomiser data directory, follow the instructions in the [exomiser installation documentation](https://exomiser.readthedocs.io/en/latest/installation.html#linux-install).

Together with the `exomiser_data_dir` parameter, the following parameters must be provided to exomiser and should match the available reference data:
- `exomiser_genome`: The genome assembly version to be used by exomiser. Accepted values are `hg38` or `hg19`.
- `exomiser_data_version`: The exomiser data version. Example: `2402`.
- `exomiser_cadd_version`: The version of the CADD data to be used by exomiser (optional). Example: `1.3`.
- `exomiser_remm_version`: The version of the REMM data to be used by exomiser (optional). Example: `0.3.1`.
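
To see how these parameters relate to the directory layout shown earlier, here is a minimal sketch that lists the folders one would expect under `exomiser_data_dir`. The helper name `expected_exomiser_dirs` is hypothetical. It assumes the phenotype folder uses the same version prefix as the genome folder, and that `cadd/` and `remm/` are needed only when the corresponding version parameter is set.

```python
def expected_exomiser_dirs(data_version, genome, cadd_version=None, remm_version=None):
    """Return the folder names expected inside the exomiser data directory."""
    if genome not in ("hg19", "hg38"):
        raise ValueError(f"exomiser_genome must be hg19 or hg38, got {genome!r}")
    # Assumption: genome and phenotype data share the same data version prefix.
    dirs = [f"{data_version}_{genome}", f"{data_version}_phenotype"]
    # Assumption: these folders matter only when the matching version is given.
    if cadd_version is not None:
        dirs.append("cadd")
    if remm_version is not None:
        dirs.append("remm")
    return dirs


print(expected_exomiser_dirs("2402", "hg38"))
print(expected_exomiser_dirs("2402", "hg19", cadd_version="1.3", remm_version="0.3.1"))
```

Comparing this list with the actual content of the data directory is a quick way to spot a mismatch between the parameters and the downloaded reference data.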

## Exomiser analysis files
In addition to the reference data, exomiser requires an analysis file (.yml/.json) that contains, among other
things, the variant frequency sources for prioritization of rare variants, the variant pathogenicity sources to consider, and the list of filters and prioritisers to apply.

Typically, different analysis settings are used for whole exome sequencing (WES) and whole genome sequencing (WGS) data.
Default analysis files are provided for each sequencing type in the assets folder:
- assets/exomiser/default_exomiser_WES_analysis.yml
- assets/exomiser/default_exomiser_WGS_analysis.yml

You can override these defaults and provide your own analysis file(s) via the parameters `exomiser_analysis_wes` and `exomiser_analysis_wgs`.

The exomiser analysis file format follows the `phenopacket` standard and is described in detail [here](https://exomiser.readthedocs.io/en/latest/advanced_analysis.html#analysis).
There are typically multiple sections in the analysis file. To be compatible with the way we run the exomiser command, your
analysis file should contain only the `analysis` section.
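
A custom analysis file can be checked for extra top-level sections before it is passed to the pipeline. The sketch below is only an illustration: the function name `extra_top_level_sections` is hypothetical, the example document is made up, and it uses the JSON form of the file so that no third-party YAML parser is needed.

```python
import json


def extra_top_level_sections(analysis_document):
    """Return the top-level keys other than 'analysis', sorted for stable output."""
    return sorted(key for key in analysis_document if key != "analysis")


# Illustrative document only: it carries one section besides "analysis".
document = json.loads('{"analysis": {"analysisMode": "PASS_ONLY"}, "outputOptions": {}}')
print(extra_top_level_sections(document))
```

An empty list means the file contains nothing beyond the `analysis` section. Any other section it reports would need to be removed.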

## Reference data parameters summary

| Parameter name | Required? | Description |
| --- | --- | --- |
| `referenceGenome` | _Required_ | Path to the directory containing the reference genome data |
| `referenceGenomeFasta` | _Required_ | Filename of the reference genome .fasta file, within the specified `referenceGenome` directory |
| `broad` | _Required_ | Path to the directory containing Broad reference data |
| `intervalsFile` | _Required_ | Filename of the genome intervals list, within the specified `broad` directory |
| `vepCache` | _Optional_ | Path to the vep cache data directory |
| `exomiser_data_dir` | _Optional_ | Path to the exomiser reference data directory |
| `exomiser_genome` | _Optional_ | Genome assembly version to be used by exomiser (`hg19` or `hg38`) |
| `exomiser_data_version` | _Optional_ | Exomiser data version (e.g., `2402`)|
| `exomiser_cadd_version` | _Optional_ | Version of the CADD data to be used by exomiser (e.g., `1.7`)|
| `exomiser_remm_version` | _Optional_ | Version of the REMM data to be used by exomiser (e.g., `0.3.1`)|
| `exomiser_analysis_wes` | _Optional_ | Path to the exomiser analysis file for WES data, if different from the default |
| `exomiser_analysis_wgs` | _Optional_ | Path to the exomiser analysis file for WGS data, if different from the default |
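
The way the "optional" parameters become required depending on the `tools` parameter can be sketched as below. This is not the validation logic introduced by the commit, which is not shown here. The function name `missing_parameters` is hypothetical, and the grouping of parameters per tool is an assumption drawn from the descriptions above.

```python
# Parameters marked as required in the summary table.
ALWAYS_REQUIRED = ["referenceGenome", "referenceGenomeFasta", "broad", "intervalsFile"]

# Assumption: parameters needed only when the tool is listed in `tools`.
REQUIRED_BY_TOOL = {
    "vep": ["vepCache"],
    "exomiser": ["exomiser_data_dir", "exomiser_genome", "exomiser_data_version"],
}


def missing_parameters(params):
    """Return the names of required parameters that are absent or empty."""
    tools = [t.strip() for t in (params.get("tools") or "").split(",") if t.strip()]
    required = list(ALWAYS_REQUIRED)
    for tool in tools:
        required.extend(REQUIRED_BY_TOOL.get(tool, []))
    return [name for name in required if not params.get(name)]


params = {
    "referenceGenome": "/ref/genome",
    "referenceGenomeFasta": "Homo_sapiens_assembly38.fasta",
    "broad": "/ref/broad",
    "intervalsFile": "interval_long_local.list",
    "tools": "vep,exomiser",
    "exomiser_data_dir": "/ref/exomiser",
    "exomiser_genome": "hg38",
}
print(missing_parameters(params))
```

The paths in `params` are placeholders. With `tools` unset, only the four always-required parameters are checked.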
