FUSARIUM-ID Naive Bayes classifiers for QIIME 2
🚧 Work in progress 🚧
This pipeline and the pre-computed classifiers will be available soon!
A Snakemake workflow to train QIIME 2 taxonomic Naive Bayes classifiers for the FUSARIUM-ID database. This database contains sequences of the Translation Elongation Factor 1 alpha (TEF1), which serves as a considerably better marker for species identification in the filamentous fungal genus Fusarium than ITS, the standard marker for all Fungi.
This workflow uses Snakemake 7.32.4. Newer versions (8+) contain backwards-incompatible changes that may prevent this pipeline from working in a Slurm HPC queue system.
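A quick way to confirm the installed version before launching is sketched below. It assumes only that `snakemake --version` prints a plain version string such as 7.32.4; the function name `is_snakemake_7` is a hypothetical helper and not part of this repository.

```shell
# Hypothetical helper: succeeds only for Snakemake 7.x version strings.
is_snakemake_7() {
  case "$1" in
    7.*) return 0 ;;
    *)   return 1 ;;
  esac
}

# Only query the version when Snakemake is actually on the PATH.
if command -v snakemake >/dev/null 2>&1; then
  if is_snakemake_7 "$(snakemake --version)"; then
    echo "Snakemake 7.x detected"
  else
    echo "Warning: this workflow targets Snakemake 7.32.4" >&2
  fi
fi
```

The check only looks at the major version, so it accepts any 7.x release rather than 7.32.4 exactly.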
This pipeline:
- Parses the multi-FASTA headers for metadata and saves it as a TSV file. You can read about how FUSARIUM-ID stores metadata in this manual (Spanish version here) and in the FUSARIUM-ID publication.
- Formats metadata to match SILVA and UNITE taxonomy style.
- Imports taxonomy and sequences into QIIME 2.
- More coming soon...
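The first two steps can be pictured with a small sketch. The header layout used here (`>id|species=...|complex=...`) is hypothetical and is not the real FUSARIUM-ID format, which is described in the manual; the function name is also made up for illustration. The rank prefixes (`k__`, `p__`, ..., `s__`) follow the UNITE convention, and the ranks above species are fixed because every sequence in the database belongs to the genus Fusarium.

```shell
# Sketch only: assumes a HYPOTHETICAL header layout ">id|key=value|key=value".
# Prints one "id<TAB>taxonomy" row per FASTA record.
headers_to_taxonomy_tsv() {
  awk -F'|' '
    /^>/ {
      id = substr($1, 2)              # drop the leading ">"
      species = "unidentified"        # fallback when no species key is present
      for (i = 2; i <= NF; i++) {
        split($i, kv, "=")
        if (kv[1] == "species") species = kv[2]
      }
      gsub(/ /, "_", species)         # UNITE-style names use underscores
      printf "%s\tk__Fungi;p__Ascomycota;c__Sordariomycetes;o__Hypocreales;f__Nectriaceae;g__Fusarium;s__%s\n", id, species
    }
  ' "$1"
}

# Demo on a throwaway file with a made-up record.
demo_fasta="$(mktemp)"
printf '>FID001|species=Fusarium oxysporum|complex=FOSC\nATGC\n' > "$demo_fasta"
headers_to_taxonomy_tsv "$demo_fasta"
rm -f "$demo_fasta"
```

A two-column table of this shape (feature ID, semicolon-separated taxonomy) is the kind of file QIIME 2 accepts as taxonomy input; the actual workflow may build it differently.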
The only prerequisite is Conda. We highly recommend installing Miniconda and then Mamba (used by default by Snakemake) for a lightweight and fast setup.
- Clone the repository.
- Create a Screen (see the section Immediate submit and Screen).
- Run the following command to download (if needed) and activate the FUSARIUM-ID-train environment, and to set aliases for the main functions:
  source init_fusariumid_train.sh
- Edit config/config.yml with your specific requirements. Variables annotated with #cluster# must also be updated in config/cluster_config.yml.
- If needed, modify the time, ncpus and memory variables in config/cluster_config.yml.
- Download the FUSARIUM-ID v3.0 FASTA file (FUSARIUMID_v.3.0_TEF1.fas) from https://github.com/fusariumid/fusariumid.
- Run the following command to start the workflow:
  fidt_run
FUSARIUM-ID-train includes a command, fidt_immediate, that automatically sends all jobs to Slurm, correctly queued according to their dependencies. This is useful, for example, when the cluster login machine enforces a very short runtime limit that could kill Snakemake in the middle of the workflow. If your HPC queue system only allows a limited number of jobs to be submitted at once, change that number in init_fusariumid_train.sh and source it again (this also applies to fidt_run).
Please note that if the number of simultaneous jobs accepted by the queue system is less than the total number of jobs you need to submit, the workflow will fail. In such cases, we highly recommend not using fidt_immediate. Instead, use fidt_run inside a Screen. Screen is a terminal multiplexer that lets you create multiple virtual terminal sessions. It is installed by default on most Linux HPC systems.
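The rule above comes down to one comparison. A minimal sketch, assuming you already know both numbers (the total job count and your queue's submission limit); `pick_launcher` is a hypothetical helper and not part of FUSARIUM-ID-train:

```shell
# pick_launcher TOTAL_JOBS QUEUE_LIMIT
# Hypothetical helper: prints which FUSARIUM-ID-train command suits the queue.
pick_launcher() {
  total_jobs="$1"
  queue_limit="$2"
  if [ "$queue_limit" -lt "$total_jobs" ]; then
    # The queue cannot hold every job at once, so immediate submission would fail.
    echo "fidt_run"
  else
    echo "fidt_immediate"
  fi
}

pick_launcher 120 50   # limit below total: use fidt_run inside a Screen
pick_launcher 30 50    # all jobs fit in the queue: fidt_immediate is an option
```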
To create a Screen session, use screen -S fusariumid_train. Then follow the steps in the usage section inside that session. You can detach from the session with Ctrl+a followed by d, and reattach to it later with screen -r fusariumid_train. For more details about Screen usage, please check this Gist.
Since FUSARIUM-ID-train is built on Snakemake, you can generate DAGs, rule graphs and file graphs of the workflow. We provide three commands for this: fidt_draw_dag, fidt_draw_rulegraph and fidt_draw_filegraph.
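For reference, plain Snakemake can emit these graphs in Graphviz DOT format through its `--dag`, `--rulegraph` and `--filegraph` options. The sketch below assumes Graphviz's `dot` is installed and that the functions are called from the workflow directory. The function names and output file names are hypothetical, and whether the fidt_draw_* commands wrap exactly these calls is an assumption.

```shell
# Hypothetical wrappers around Snakemake's graph options.
# Each one pipes the DOT output through Graphviz to produce an SVG file.
draw_dag_sketch() {
  snakemake --dag | dot -Tsvg > dag.svg
}

draw_rulegraph_sketch() {
  snakemake --rulegraph | dot -Tsvg > rulegraph.svg
}

draw_filegraph_sketch() {
  snakemake --filegraph | dot -Tsvg > filegraph.svg
}
```

The rule graph is usually the most readable of the three for larger workflows, since it shows one node per rule instead of one node per job.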