Merge branch 'docs-update' into 'dev'

Docs update [CW-3102] See merge request epi2melabs/workflows/wf-amplicon!51
epi2me-labs · Dec 6, 2023 · 3f60edc
2 parents d91a2dc + a9edc36
commit 3f60edc
Showing 28 changed files with 680 additions and 376 deletions.
diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml
@@ -5,6 +5,7 @@ include:
 
 variables:
   NF_WORKFLOW_OPTS: >
+    -executor.\$$local.memory 16GB
     --fastq test_data/fastq
     --reference test_data/reference.fasta
     --threads 2
@@ -41,6 +42,7 @@ docker-run:
     - if: $MATRIX_NAME == "ref-sample_sheet"
       variables:
         NF_WORKFLOW_OPTS: >
+          -executor.\$$local.memory 16GB
           --fastq test_data/fastq
           --reference test_data/reference.fasta
           --sample_sheet test_data/sample_sheet.csv
@@ -54,6 +56,7 @@ docker-run:
     - if: $MATRIX_NAME == "filter-all"
       variables:
         NF_WORKFLOW_OPTS: >
+          -executor.\$$local.memory 16GB
           --fastq test_data/fastq
           --reference test_data/reference.fasta
           --min_read_qual 20
@@ -69,6 +72,7 @@ docker-run:
     - if: $MATRIX_NAME == "de-novo"
       variables:
         NF_WORKFLOW_OPTS: >
+          -executor.\$$local.memory 16GB
           --fastq test_data/fastq-denovo
           --drop_frac_longest_reads 0.05
         NF_PROCESS_FILES: >

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -1,22 +1,14 @@
 repos:
   - repo: local
     hooks:
-      - id: docs_schema
-        name: docs_schema
-        entry: parse_docs -p docs -e .md -s intro links -oj nextflow_schema.json
-        language: python
-        always_run: true
-        pass_filenames: false
-        additional_dependencies:
-          - epi2melabs
       - id: docs_readme
         name: docs_readme
-        entry: parse_docs -p docs -e .md -s header intro quickstart links -ot README.md
+        entry: parse_docs -p docs -e .md -s 01_brief_description 02_introduction 03_compute_requirements 04_install_and_run 05_related_protocols 06_inputs 07_outputs 08_pipeline_overview 09_troubleshooting 10_FAQ 11_other -ot README.md -od output_definition.json -ns nextflow_schema.json
         language: python
         always_run: true
         pass_filenames: false
         additional_dependencies:
-          - epi2melabs
+          - epi2melabs>=0.0.50
       - id: build_models
         name: build_models
         entry: datamodel-codegen --strict-nullable --base-class workflow_glue.results_schema_helpers.BaseModel --use-schema-description --disable-timestamp --input results_schema.yml --input-file-type openapi --output bin/workflow_glue/results_schema.py

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,6 +4,13 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [v1.0.0]
+### Added
+- Memory requirements for each process.
+
+### Changed
+- Reworked docs to follow new layout.
+
 ## [v0.6.2]
 ### Fixed
 - The de-novo QC stage failing when not a single input read re-aligns against the draft consensus.

diff --git a/README.md b/README.md
diff --git a/bin/workflow_glue/subset_reads.py b/bin/workflow_glue/subset_reads.py
@@ -12,9 +12,12 @@ def main(args):
     logger = get_named_logger("subsetReads")
 
     logger.info("Read per-read stats and sort lengths.")
-    sorted_lengths = pd.read_csv(args.per_read_stats, sep="\t", index_col=0)[
-        "read_length"
-    ].sort_values(ascending=False)
+    sorted_lengths = pd.read_csv(
+        args.per_read_stats,
+        sep="\t",
+        index_col="read_id",
+        usecols=["read_id", "read_length"],
+    ).squeeze().sort_values(ascending=False)
 
     drop_longest_n = 0
     if args.drop_longest_frac:
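
A minimal sketch of what the reworked `read_csv` call in this hunk does, assuming a small made-up per-read stats table (the `filename` and `mean_quality` columns and all values are hypothetical; only `read_id` and `read_length` come from the diff). It also shows one edge case that may be worth checking: `squeeze()` without an axis collapses a one-row, one-column frame to a scalar, whereas `squeeze("columns")` keeps a Series.

```python
import io

import pandas as pd

# Hypothetical per-read stats table; the real file likely has other columns.
tsv = (
    "read_id\tfilename\tread_length\tmean_quality\n"
    "r1\ta.fastq\t500\t12.1\n"
    "r2\ta.fastq\t1500\t14.3\n"
    "r3\ta.fastq\t900\t11.0\n"
)

# Only the two needed columns are parsed. With `read_id` as the index the
# result is a single-column DataFrame, which `squeeze()` turns into a Series.
sorted_lengths = pd.read_csv(
    io.StringIO(tsv),
    sep="\t",
    index_col="read_id",
    usecols=["read_id", "read_length"],
).squeeze().sort_values(ascending=False)

# Edge case: a table holding a single read.
one_row = pd.DataFrame(
    {"read_length": [500]}, index=pd.Index(["r1"], name="read_id")
)
as_scalar = one_row.squeeze()           # collapses both axes
as_series = one_row.squeeze("columns")  # collapses the column axis only

print(sorted_lengths)
print(type(as_scalar).__name__, type(as_series).__name__)
```

Selecting columns by name rather than by position makes the call independent of the column order in the stats file, which is presumably the motivation for the change.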

diff --git a/docs/01_brief_description.md b/docs/01_brief_description.md
@@ -0,0 +1 @@
+Nextflow workflow for analysing Oxford Nanopore reads created by amplicon sequencing.
diff --git a/docs/intro.md → docs/02_introduction.md b/docs/intro.md → docs/02_introduction.md
@@ -1,14 +1,11 @@
-## Introduction
-
 This [Nextflow](https://www.nextflow.io/) workflow provides a simple way to
 analyse Oxford Nanopore reads generated from amplicons.
 
 The workflow requires raw reads in FASTQ format and can be run in two modes:
 * Variant calling mode: Trigger this mode by passing a reference FASTA file.
   After initial filtering (based on read length and quality) and adapter
   trimming, [minimap2](https://github.com/lh3/minimap2) is used to align the
-  reads to the reference (please note that the reference should only contain the
-  expected sequences of the individual amplicons). Variants are then called with
+  reads to the reference. Variants are then called with
   [Medaka](https://github.com/nanoporetech/medaka). This mode allows for
   multiple amplicons per barcode (for details on how to map specific target
   amplicons to individual samples / barcodes, see below).
@@ -19,6 +16,6 @@ The workflow requires raw reads in FASTQ format and can be run in two modes:
   [Medaka](https://github.com/nanoporetech/medaka). Please note that only one
   amplicon per barcode is supported in de-novo consensus mode.
 
-The results of the workflow include an interactive HTML report, FASTA files with
-the consensus sequences of the amplicons, BAM files with the alignments, and VCF
-files containing the variants (if run in variant calling mode).
+> Note: This workflow is *not* intended for marker gene sequencing of mixtures / communities of different organisms (e.g. 16S sequencing).
+> In de-novo consensus mode it expects a single amplicon per barcode.
+> When running in variant calling mode, multiple amplicons per barcode can be processed, but their sequences need to be sufficiently different from each other so that most reads only align to one of the provided references.
diff --git a/docs/03_compute_requirements.md b/docs/03_compute_requirements.md
@@ -0,0 +1,13 @@
+Recommended requirements:
+
++ CPUs = 12
++ Memory = 32GB
+
+Minimum requirements:
+
++ CPUs = 6
++ Memory = 16GB
+
+Approximate run time: 0.5-5 minutes per sample (depending on number of reads, length of amplicons, and available compute).
+
+ARM processor support: True
diff --git a/docs/04_install_and_run.md b/docs/04_install_and_run.md
@@ -0,0 +1,35 @@
+These are instructions to install and run the workflow on the command line. You can also access the workflow via the [EPI2ME application](https://labs.epi2me.io/downloads/).
+
+The workflow uses [Nextflow](https://www.nextflow.io/) to manage compute and software resources, therefore Nextflow will need to be installed before attempting to run the workflow.
+
+The workflow can currently be run using either [Docker](https://www.docker.com/products/docker-desktop) or
+[Singularity](https://docs.sylabs.io/guides/3.0/user-guide/index.html) to provide isolation of
+the required software. Both methods are automated out-of-the-box provided
+either Docker or Singularity is installed. This is controlled by the [`-profile`](https://www.nextflow.io/docs/latest/config.html#config-profiles) parameter as exemplified below.
+
+It is not required to clone or download the git repository in order to run the workflow.
+More information on running EPI2ME workflows can be found on our [website](https://labs.epi2me.io/wfindex).
+
+The following command can be used to obtain the workflow. This will pull the repository into the assets folder of Nextflow and provide a list of all parameters available for the workflow as well as an example command:
+
+```
+nextflow run epi2me-labs/wf-amplicon --help
+```
+
+A demo dataset is provided for testing of the workflow. It can be downloaded using:
+
+```
+wget https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-amplicon/wf-amplicon-demo.tar.gz
+tar -xzvf wf-amplicon-demo.tar.gz
+```
+
+The workflow can be run with the demo data using:
+
+```
+nextflow run epi2me-labs/wf-amplicon \
+    --fastq wf-amplicon-demo/fastq \
+    --reference wf-amplicon-demo/reference.fa \
+    -profile standard
+```
+
+For further information about running a workflow on the command line see https://labs.epi2me.io/wfquickstart/
diff --git a/docs/05_related_protocols.md b/docs/05_related_protocols.md
@@ -0,0 +1,3 @@
+This workflow is designed to take input sequences that have been produced from [Oxford Nanopore Technologies](https://nanoporetech.com/) devices.
+
+Find related protocols in the [Nanopore community](https://community.nanoporetech.com/docs/).
diff --git a/docs/06_inputs.md b/docs/06_inputs.md
@@ -0,0 +1,73 @@
+### Input Options
+
+| Nextflow parameter name  | Type | Description | Help | Default |
+|--------------------------|------|-------------|------|---------|
+| fastq | string | FASTQ files to use in the analysis. | This accepts one of three cases: (i) the path to a single FASTQ file; (ii) the path to a top-level directory containing FASTQ files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ files. In the first and second case, a sample name can be supplied with `--sample`. In the last case, the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with `--sample_sheet`. |  |
+| analyse_unclassified | boolean | Analyse unclassified reads from input directory. By default the workflow will not process reads in the unclassified directory. | If selected and if the input is a multiplex directory the workflow will also process the unclassified directory. | False |
+| reference | string | Path to a reference FASTA file. | The reference file should contain one sequence per amplicon. |  |
+
+
+### Sample Options
+
+| Nextflow parameter name  | Type | Description | Help | Default |
+|--------------------------|------|-------------|------|---------|
+| sample_sheet | string | A CSV file used to map barcodes to sample aliases. The sample sheet can be provided when the input data is a directory containing sub-directories with FASTQ files. | The sample sheet is a CSV file with, minimally, columns named `barcode` and `alias`. Extra columns are allowed. A `type` column is required for certain workflows and should have the following values; `test_sample`, `positive_control`, `negative_control`, `no_template_control`. |  |
+| sample | string | A single sample name for non-multiplexed data. Permissible if passing a single .fastq(.gz) file or directory of .fastq(.gz) files. |  |  |
+
+
+### Pre-processing Options
+
+| Nextflow parameter name  | Type | Description | Help | Default |
+|--------------------------|------|-------------|------|---------|
+| min_read_length | integer | Shorter reads will be removed. |  | 300 |
+| max_read_length | integer | Longer reads will be removed. |  |  |
+| min_read_qual | number | Reads with a lower mean quality will be removed. |  | 10 |
+| drop_frac_longest_reads | number | Drop fraction of longest reads. | The very longest reads might be concatemers or contain other artifacts. In many cases removing them simplifies de novo consensus generation. | 0.05 |
+| take_longest_remaining_reads | boolean | Whether to use the longest (remaining) reads. | With this option, reads are not randomly selected during downsampling (potentially after the longest reads have been removed), but instead the longest remaining reads are taken. This generally improves performance on long amplicons. | True |
+| reads_downsampling_size | integer | Downsample to this number of reads per sample. | Downsampling is performed after filtering. Set to 0 to disable downsampling. | 0 |
+| min_n_reads | number | Samples / barcodes with fewer reads will not be processed. |  | 40 |
+
+
+### Variant Calling Options
+
+| Nextflow parameter name  | Type | Description | Help | Default |
+|--------------------------|------|-------------|------|---------|
+| min_coverage | integer | Minimum coverage for variants to keep. | Only variants covered by more than this number of reads are reported in the resulting VCF file. | 20 |
+| basecaller_cfg | string | Name of the basecaller model that processed the signal data; used to select an appropriate Medaka model. | The basecaller configuration is used to automatically select the appropriate Medaka model. The automatic selection can be overridden with the 'medaka_model' parameters. Available models are: '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', 'dna_r10.4.1_e8.2_400bps_hac_prom', 'dna_r9.4.1_450bps_hac_prom', 'dna_r10.3_450bps_hac', 'dna_r10.3_450bps_hac_prom', 'dna_r10.4.1_e8.2_260bps_hac', 'dna_r10.4.1_e8.2_260bps_hac_prom', 'dna_r10.4.1_e8.2_400bps_hac', 'dna_r9.4.1_450bps_hac', 'dna_r9.4.1_e8.1_hac', 'dna_r9.4.1_e8.1_hac_prom'. | [email protected] |
+| medaka_model | string | The name of the Medaka model to use. This will override the model automatically chosen based on the provided basecaller configuration. | The workflow will attempt to map the basecaller model (provided with 'basecaller_cfg') used to a suitable Medaka model. You can override this by providing a model with this option instead. |  |
+
+
+### De-novo Consensus Options
+
+| Nextflow parameter name  | Type | Description | Help | Default |
+|--------------------------|------|-------------|------|---------|
+| force_spoa_length_threshold | integer | Consensus length below which to force SPOA consensus generation. | If the consensus generated by `miniasm` is shorter than this value, force consensus generation with SPOA (regardless of whether the sequence produced by `miniasm` passed QC or not). The rationale for this parameter is that `miniasm` sometimes gives slightly truncated assemblies for short amplicons from RBK data, whereas SPOA tends to be more robust in this regard. | 2000 |
+| spoa_minimum_relative_coverage | number | Minimum coverage (relative to the number of reads per sample after filtering) when constructing the consensus with SPOA. | Needs to be a number between 0.0 and 1.0. The result of multiplying this number with the number of reads for the corresponding sample (after filtering) is passed on to SPOA's `--min-coverage` option. | 0.15 |
+| minimum_mean_depth | integer | Mean depth threshold to pass consensus quality control. | Draft consensus sequences with a lower average depth of coverage after re-aligning the input reads will fail QC. | 30 |
+| primary_alignments_threshold | number | Fraction of primary alignments to pass quality control. | Draft consensus sequences with a lower fraction of primary alignments after re-aligning the input reads will fail QC. | 0.7 |
+
+
+### Output Options
+
+| Nextflow parameter name  | Type | Description | Help | Default |
+|--------------------------|------|-------------|------|---------|
+| out_dir | string | Directory for output of all workflow results. |  | output |
+| combine_results | boolean | Whether to merge per-sample results into a single BAM / VCF file. | Per default, results are grouped per sample. With this option, an additional BAM and VCF file are produced which contain the alignments / variants for all samples and amplicons. | False |
+
+
+### Advanced Options
+
+| Nextflow parameter name  | Type | Description | Help | Default |
+|--------------------------|------|-------------|------|---------|
+| number_depth_windows | integer | Number of windows used during depth of coverage calculations. | Depth of coverage is calculated for each sample across each amplicon split into this number of windows. A higher number will produce more fine-grained plots at the expense of run time. | 100 |
+| medaka_target_depth_per_strand | integer | Downsample each amplicon to this per-strand depth before running Medaka. | Medaka performs best with even strand coverage and depths between 80X and 400X. To avoid too high coverage, the workflow downsamples the reads for each amplicon to this per-strand depth before running Medaka. Changing this value is discouraged as it might cause decreased performance. | 75 |
+
+
+### Miscellaneous Options
+
+| Nextflow parameter name  | Type | Description | Help | Default |
+|--------------------------|------|-------------|------|---------|
+| threads | integer | Maximum number of CPU threads to use per workflow task. | Several tasks in this workflow benefit from using multiple CPU threads. This option sets the maximum number of CPU threads for such processes. The total CPU resources used by the workflow are constrained by the executor configuration in `nextflow.config`. | 4 |
+| disable_ping | boolean | Enable to prevent sending a workflow ping. |  | False |
+
+
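
The sample sheet description above (a CSV with, minimally, `barcode` and `alias` columns; extra columns allowed) can be sketched as a quick pre-flight check. This is an illustrative helper under stated assumptions, not the workflow's own validation: the function name `check_sample_sheet` and the example rows are hypothetical.

```python
import csv
import io

# Columns the docs name as the minimum for a sample sheet.
REQUIRED_COLUMNS = {"barcode", "alias"}


def check_sample_sheet(handle):
    """Return the sheet's rows as dicts; raise if a required column is missing."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Sample sheet lacks column(s): {sorted(missing)}")
    return list(reader)


# Hypothetical sheet with an extra `type` column, which is allowed.
sheet = io.StringIO(
    "barcode,alias,type\n"
    "barcode01,sample_a,test_sample\n"
    "barcode02,sample_b,negative_control\n"
)
rows = check_sample_sheet(sheet)
print([row["alias"] for row in rows])
```

Running a check like this before launching the workflow surfaces a misnamed header early, instead of after the run has started.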
diff --git a/docs/07_outputs.md b/docs/07_outputs.md
@@ -0,0 +1,12 @@
+Output files may be aggregated (including information for all samples) or provided per sample. Per-sample files are prefixed with the respective alias, represented below as {{ alias }}.
+
+| Title | File path | Description | Per sample or aggregated |
+|-------|-----------|-------------|--------------------------|
+| Workflow report | ./wf-amplicon-report.html | Report for all samples. | aggregated |
+| Sanitized reference file | ./reference_sanitized_seqIDs.fasta | Some programs used by the workflow don't like special characters (like colons) in the sequence IDs in the reference FASTA file. The reference is thus "sanitized" by replacing these characters with underscores. This file is only generated when the workflow is run in variant calling mode. | aggregated |
+| Alignments BAM file | ./{{ alias }}/alignments/aligned.sorted.bam | BAM file with alignments of input reads against the references (in variant calling mode) or the created consensus (in de-novo consensus mode). | per-sample |
+| Alignments index file | ./{{ alias }}/alignments/aligned.sorted.bam.bai | Index for alignments BAM file. | per-sample |
+| De-novo consensus FASTQ file | ./{{ alias }}/consensus/consensus.fastq | Consensus file generated by de-novo consensus pipeline. | per-sample |
+| Consensus FASTA file | ./{{ alias }}/consensus/consensus.fasta | Consensus file generated by the variant calling pipeline. | per-sample |
+| Variants VCF file | ./{{ alias }}/variants/medaka.annotated.vcf.gz | VCF file of variants detected against the provided reference. | per-sample |
+| Variants index file | ./{{ alias }}/variants/medaka.annotated.vcf.gz.csi | Index for variants VCF file. | per-sample |
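
The per-sample layout in the table above can be sketched as a small path builder, which may be handy when collecting results after a run. The relative paths are taken from the table; the function name `expected_outputs`, the dictionary keys, and the alias `sample_a` are hypothetical, and which files actually exist depends on the mode the workflow ran in.

```python
from pathlib import Path

# Relative per-sample paths as listed in the outputs table.
PER_SAMPLE_FILES = {
    "alignments": "alignments/aligned.sorted.bam",
    "alignments_index": "alignments/aligned.sorted.bam.bai",
    "denovo_consensus": "consensus/consensus.fastq",  # de-novo consensus mode
    "consensus": "consensus/consensus.fasta",         # variant calling mode
    "variants": "variants/medaka.annotated.vcf.gz",   # variant calling mode
    "variants_index": "variants/medaka.annotated.vcf.gz.csi",
}


def expected_outputs(out_dir, alias):
    """Map each output title to its expected path for one sample alias."""
    base = Path(out_dir) / alias
    return {name: base / rel for name, rel in PER_SAMPLE_FILES.items()}


paths = expected_outputs("output", "sample_a")
for name, path in paths.items():
    print(f"{name}: {path}")
```

Here `output` matches the documented default of `out_dir`; pass the directory actually used for the run.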