# 🦠 SporeFlow: 16S and ITS metataxonomics pipeline
SporeFlow (Snakemake Pipeline For Metataxonomics Workflows) is a pipeline for metataxonomic analysis of fungal ITS and bacterial 16S amplicons using QIIME 2 and Snakemake. It accounts for the particularities of the indel-rich ITS region.
🐍 This workflow uses Snakemake 7.32.4. Newer versions (8+) contain backward-incompatible changes that may prevent this pipeline from working on a Slurm HPC queue system.
What SporeFlow does:

- Run FastQC on the raw FASTQ files (rule `fastqc_before`)
- Run Cutadapt on the raw FASTQ files (rule `cutadapt`)
- Run FastQC on the trimmed FASTQ files (rule `fastqc_after`)
- Aggregate QC results (FastQC before trimming, Cutadapt, FastQC after trimming) with MultiQC (rule `multiqc`)
- Create a manifest file for QIIME 2 (rule `create_manifest`)
- Import FASTQ files into QIIME 2 (rule `import_fastq`)
- Trim ITS sequences in QIIME 2 with the ITSxpress plugin (rule `itsxpress`)
- Denoise, dereplicate, remove chimeras and merge sequences in QIIME 2 with the DADA2 plugin (rule `dada2`)
- Perform taxonomic classification in QIIME 2 with the feature-classifier plugin (rule `taxonomy`)
- Perform diversity analysis in QIIME 2 with the diversity plugin (rule `diversity`)
- Perform differential abundance analysis in QIIME 2 with the composition plugin (rule `abundance`)
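To make the QC-to-QIIME 2 handoff concrete: for paired-end data, a QIIME 2 manifest is a tab-separated file with `sample-id`, `forward-absolute-filepath` and `reverse-absolute-filepath` columns. The sketch below shows how such a file can be assembled; the `trimmed/` directory and the `<sample>_R1.fastq.gz` naming are assumptions for illustration, not SporeFlow's actual layout or code.

```shell
# Demo setup only (illustrative file layout, not SporeFlow's):
mkdir -p trimmed
touch trimmed/sampleA_R1.fastq.gz trimmed/sampleA_R2.fastq.gz

# QIIME 2 paired-end manifest: three tab-separated columns, absolute paths.
printf 'sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n' > manifest.tsv
for r1 in "$PWD"/trimmed/*_R1.fastq.gz; do
  r2="${r1%_R1.fastq.gz}_R2.fastq.gz"          # matching reverse-read file
  sample="$(basename "$r1" _R1.fastq.gz)"      # sample id from the file name
  printf '%s\t%s\t%s\n' "$sample" "$r1" "$r2" >> manifest.tsv
done
```

A manifest like this is what `qiime tools import` consumes when given `--input-format PairedEndFastqManifestPhred33V2`.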
There are some additional helper steps that adapt results between the main steps; they are not covered here.
The only prerequisite is having Conda installed. We highly recommend installing Miniconda and then Mamba (used by Snakemake by default) for a lightweight and fast experience.
## Usage

1. Clone the repository.
2. Create a Screen session (see section Immediate submit and Screen).
3. Run the following command to download (if needed) and activate the SporeFlow environment, and to set aliases for the main functions: `source init_sporeflow.sh`
4. Edit `config/config.yml` with your experiment details. Variables annotated with `#cluster#` must also be updated in `config/cluster_config.yml`.
5. If needed, modify the `time`, `ncpus` and `memory` variables in `config/cluster_config.yml`.
6. Download a UNITE classifier in QIIME 2 format from https://github.com/colinbrislawn/unite-train/releases. We recommend using one of the following (remember to change the name accordingly in `config/config.yml`):
   - `unite_ver10_dynamic_all_04.04.2024-Q2-2024.2.qza`
   - `unite_ver10_99_all_04.04.2024-Q2-2024.2.qza`
7. Run `sf_run` to start the workflow.
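As an illustration of the `time`, `ncpus` and `memory` variables mentioned above, `config/cluster_config.yml` might look roughly like the fragment below. The `__default__`/per-rule layout and all values here are assumptions; check the file shipped with the repository for the real structure.

```yaml
# Illustrative sketch only — the real keys live in config/cluster_config.yml
__default__:
  time: "02:00:00"   # wall-clock limit per job
  ncpus: 4           # CPUs requested from Slurm
  memory: "8G"       # memory requested from Slurm
dada2:
  time: "12:00:00"   # denoising is typically the longest step
  ncpus: 16
  memory: "32G"
```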
## Immediate submit and Screen

SporeFlow includes a command, `sf_immediate`, that submits all jobs to Slurm at once, correctly queued according to their dependencies. This is desirable when, for example, the allowed runtime on the cluster login machine is very short and Snakemake could otherwise be killed in the middle of the workflow. If your HPC queue system only allows a limited number of jobs submitted at once, change that number in `init_sporeflow.sh` and source it again (this also applies to `sf_run`).
Please note that if the number of simultaneous jobs accepted by the queue system is smaller than the total number of jobs you need to submit, the workflow will fail. In such cases, we highly recommend not using `sf_immediate`. Instead, use `sf_run` inside a Screen session. Screen is a terminal multiplexer that lets you create multiple virtual terminal sessions, and it is installed by default on most Linux HPC systems.
To create a screen session, use `screen -S sporeflow`, then follow the Usage section there. You can detach the screen with `Ctrl+a` and then `d`, and reattach it with `screen -r sporeflow`. For more details about Screen usage, please check this Gist.
Since SporeFlow is built on Snakemake, you can generate DAGs, rule graphs and file graphs of the workflow. We provide three commands for this: `sf_draw_dag`, `sf_draw_rulegraph` and `sf_draw_filegraph`.
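Under the hood these presumably wrap Snakemake's built-in graph exports rendered with Graphviz; that the aliases do exactly this is an assumption, but the underlying commands are standard Snakemake CLI:

```shell
# Guarded so the sketch is a no-op where Snakemake/Graphviz are not installed.
if command -v snakemake >/dev/null 2>&1 && command -v dot >/dev/null 2>&1; then
  snakemake --dag       | dot -Tsvg > dag.svg        # full DAG, one node per job
  snakemake --rulegraph | dot -Tsvg > rulegraph.svg  # condensed, one node per rule
  snakemake --filegraph | dot -Tsvg > filegraph.svg  # rules with input/output files
fi
```

The rule graph is usually the most readable for documentation, since the full DAG grows with the number of samples.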