This script provides a convenient way to run a Nextflow workflow with various configurations and parameters.
- Nextflow DSL2
- Python >= 3.10
Single inputs
python ete_build.py --mode local --input data/NUP62.aa.fa --output result/ --aligner mafft --trimmer trimal --tree_builder fasttree
Multitple inputs, which will fetch all *fa
, *faa
, *fasta
files.
python ete_build.py --mode local --input data/ --output result/ --aligner mafft --trimmer trimal --tree_builder fasttree
--mode
: Execution mode. Choose betweenlocal
andslurm
. This is a required argument.--input
: Path to the input fasta file or directory. This is a required argument.--output
: Path to the output directory where results will be saved. This is a required argument.--aligner
: Alignment tool to use. Default ismafft
.--trimmer
: Trimming tool to use. Default istrimal
.--tree_builder
: Tree building tool to use. Default isfasttree
.
If you choose slurm
as the execution mode, you need to provide additional SLURM-specific arguments:
--slurm-partition
: SLURM partition name. This is required if--mode
isslurm
.--time
: Time limit for SLURM jobs. Default is1h
.--memory
: Memory allocation for SLURM jobs. Default is4GB
.--cpus
: Number of CPUs for SLURM jobs. Default is4
.
The pipeline utilizes several bioinformatics tools categorized into three main groups: aligners, trimmers, and tree builders. Below are the tools used, and you can check their versions with the conda list
command.
Aligners are used to align sequences before trimming or building trees. The following aligners are supported:
- MAFFT: Multiple sequence alignment tool for amino acid or nucleotide sequences.
- MUSCLE: Multiple sequence alignment with high accuracy and throughput.
- T-Coffee: A sequence alignment package that computes multiple alignments of protein sequences.
- Clustal Omega (Clustalo): A fast and scalable multiple sequence alignment program.
- FAMSA: Fast and accurate multiple sequence alignment.
Trimmers are used to trim alignments, removing low-quality or ambiguous regions:
- TrimAl: A tool for automated trimming of multiple sequence alignments.
- ClipKit: A tool that enables automated selection of conserved sites in alignments.
- TrimAlgV2: A custom alignment trimming tool in
./bin/
.
Tree builders are used to infer phylogenetic trees from sequence alignments:
- FastTree: A tool to compute approximately maximum likelihood phylogenetic trees from alignments.
- PhyML: A software package that estimates maximum likelihood phylogenies from nucleotide or amino acid sequences.
- RAxML: A tool for maximum likelihood-based inference of large phylogenies.
- IQ-TREE: A fast and effective stochastic algorithm for inferring maximum likelihood trees.
- MrBayes: A program for Bayesian inference of phylogeny using Markov chain Monte Carlo methods.
The script first generates a configuration file for Nextflow based on the provided execution mode and other parameters. It then runs the specified Nextflow workflow script (ete_build_dsl2.nf
by default) with the given parameters.
The workflow processes the input fasta files to perform alignment, trimming, and tree building based on the specified tools.
In general, given the input fasta files the pipeline will output the alignment, trimmed alignment and phylogeneies in the given output folder. By default, program will generate phylogenetic tree for each given alignment. For example
- Build gene tree of NUP62
python ete_build.py --mode local --input data/NUP62.aa.fa --output result/ --aligner mafft --trimmer trimal --tree_builder fasttree
- check result
once it finished, we can check in the output folder, here is in
result/
:
ll -a result/
-rw-r--r-- 1 deng huerta 12149 Sep 29 17:21 .nextflow.log
drwxr-xr-x 2 deng huerta 9 Sep 29 17:21 NUP62.aa-mafft-trimal-fasttree
-rw-r--r-- 1 deng huerta 262 Sep 29 17:21 used_config.cfg
-rw-r--r-- 1 deng huerta 171 Sep 29 17:21 used_config.json
drwxr-xr-x 7 deng huerta 5 Sep 29 17:21 work
.nextflow.log
file contains detailed logs generated by Nextflow during the execution of the pipelineused_config.cfg
file contains the configurations used for the pipeline run, specifying the alignment, trimming, and tree-building tools with their associated parameters. It can be used to replicate the analysis or for transparency about which p arameters were applied during the run.used_config.json
This file is a JSON representation of the same configurations present in used_config.cfg. It provides a structured format that is easy to parse programmatically, documenting the tools and settings used in the pipeline run.work/
intermediate files created during the execution of each process in the pipeline. It includes working data for alignment, trimming, and tree-building steps, as well as temporary files used by Nextflow.NUP62.aa-mafft-trimal-fasttree/
output directory contains the output files related to the analysis of the NUP62 gene using the selected tools: MAFFT for alignment, Trimal for trimming, and FastTree for tree building. It holds all intermediate and final results of the pipeline for this gene, including alignment files, trimmed alignments, and the final phylogenetic tree.
Check output files to fetch target alignment NUP62.aa.aln.faa
, trimmed alignment NUP62.aa.clean.alg.faa
and tree NUP62.aa.output.tree
ls NUP62.aa-mafft-trimal-fasttree/
total 38
-rw-r--r-- 1 deng huerta 2830 Sep 29 17:21 align.err
-rw-r--r-- 1 deng huerta 1677 Sep 29 17:21 build.err
-rw-r--r-- 1 deng huerta 15526 Sep 29 17:21 NUP62.aa.aln.faa
-rw-r--r-- 1 deng huerta 38 Sep 29 17:21 NUP62.aa_aln_timing.log
-rw-r--r-- 1 deng huerta 14626 Sep 29 17:21 NUP62.aa.clean.alg.faa
-rw-r--r-- 1 deng huerta 1147 Sep 29 17:21 NUP62.aa.output.tree
-rw-r--r-- 1 deng huerta 39 Sep 29 17:21 NUP62.aa_treebuild_timing.log
-rw-r--r-- 1 deng huerta 0 Sep 29 17:21 trim.err
-rw-r--r-- 1 deng huerta 0 Sep 29 17:21 trim.out
You can easily compose your own workflow by specifying the desired tools for each step. For instance, if you want to use muscle
for alignment, clipkit
for trimming, and iqtree
for tree building, you would run:
python ete_build.py --mode local --input data/NUP62.aa.fa --output result/ --aligner muscle --trimmer clipkit --tree_builder iqtree
check in result, now result folder will be stored in NUP62.aa-muscle-clipkit-iqtree/
ll result/
drwxr-xr-x 2 deng huerta 16 Sep 29 18:05 NUP62.aa-muscle-clipkit-iqtree
drwxr-xr-x 11 deng huerta 9 Sep 29 18:04 work
-rw-r--r-- 1 deng huerta 313 Sep 29 18:03 used_config.cfg
-rw-r--r-- 1 deng huerta 226 Sep 29 18:03 used_config.json
drwxr-xr-x 2 deng huerta 9 Sep 29 17:21 NUP62.aa-mafft-trimal-fasttree
In addition to specifying aligners, trimmers, and tree builders on the command line, you can also customize your workflow further by providing a configuration file, such as nf_build.cfg
. This file allows you to define tool-specific parameters and options, giving you more control over how each tool operates within your workflow.
The nf_build.cfg
file contains settings for each tool used in the workflow. These settings override default values, allowing you to fine-tune the behavior of the tools (e.g., alignment precision, trimming thresholds, or tree-building models).
Here is an example of how the configuration file (nf_build.cfg
) may look:
[mafft_linsi]
_desc = 'Mafft with linsi method'
_app = mafft
op = 1.53 # Gap opening penalty at group-to-group alignment. Default: 1.53
ep = 0.123 # Offset value, which works like gap extension penalty, for group-to-group alignment. Default: 0.123
maxiterate = 1000 # number cycles of iterative refinement are performed. Default: 0
matrix = "" # "", blosum or pam
blosum_coefficient = 62 # Default: 62
pam_coefficient = 80
localpair = True
retree = 2 # Number of iterative refinement is performed. Default: 2
[trimal_default]
_desc = 'trimal alignment cleaning with default parameters'
_app = trimal
gt = 0.01
[fasttree_full]
_desc = 'Fasttree with full mode'
_app = fasttree
aa_model = JTT # Evolutionary model LG, WAG or JTT for amino acid sequences
nt_model = JC # Evolutionary model GTR or JC for nucleotide sequences
gamma = False # Non-uniformity of evolutionary rates among sites may be modeled by using a discrete Gamma distribution
bootstrap = 1000 # Number of bootstrap replicates, 0 to disable support values
pseudo = True # Pseudo-likelihood support values
spr = 4 # the number of rounds of minimum-evolution SPR moves
mlacc = 2 # the number of rate categories for the ML model of rate heterogeneity
slownni = True # Use slow NNI moves
....
To apply a custom configuration in your workflow, you can pass the path to the configuration file (nf_build.cfg
) using the --config
argument. The pipeline will then apply the settings specified in this file for the selected tools. Noted if you using custom config, the tool's name should be the same as written in []
from nf_build.cfg
python ete_build.py --mode local --input data/NUP62.aa.fa --output result/ --aligner mafft_linsi --trimmer trimal_default --tree_builder fasttree_full --config nf_build.cfg
The workflow can also be configured to infer a species tree from individual gene alignments using a supermatrix approach or a coalescent-based method, such as ASTRAL. The species tree mode allows you to summarize phylogenetic relationships across multiple genes into a single species tree, providing a comprehensive view of evolutionary relationships.
To enable species tree mode, you can use the --supermatrix_mode
or --coalescent_mode
flags depending on the approach you wish to take:
- Supermatrix Mode: This method concatenates all individual gene alignments into a supermatrix, which is then used to build the species tree.
- Coalescent Mode: This method constructs individual gene trees and then combines them to infer the species tree using a coalescent-based approach (e.g., ASTRAL).
In supermatrix mode, the workflow concatenates multiple gene alignments into a single large alignment file (supermatrix) before performing tree inference. To enable this mode, you need to provide a list of target species in a separate file. Here we use example in data/sp_tree
ls data/sp_tree
-rw-r--r-- 1 deng huerta 8401 Sep 19 16:36 gene1.fa
-rw-r--r-- 1 deng huerta 6147 Sep 19 16:36 gene2.fa
-rw-r--r-- 1 deng huerta 5002 Sep 19 16:36 gene3.fa
-rw-r--r-- 1 deng huerta 7219 Sep 19 16:36 gene4.fa
-rw-r--r-- 1 deng huerta 3671 Sep 19 16:36 gene5.fa
-rw-r--r-- 1 deng huerta 62 Sep 19 16:36 target_species.txt
grep ">" data/sp_tree/gene1.fa|head -5
>1073090_gene1
>105351_gene1
>1073089_gene1
>1034303_gene1
>1034304_gene1
head data/sp_tree/target_species.txt
1073090
105351
1073089
1034303
python ete_build.py --mode local --input data/sp_tree/ --output result/ --aligner mafft --trimmer none --tree_builder fasttree --supermatrix --target-species data/sp_tree/target_species.txt --spname-delimiter _ --spname-field 0
check result of each gene and the final species tree from supermatrix
ll result/
drwxr-xr-x 2 deng huerta 4 Sep 29 19:44 gene1-mafft-none-fasttree
drwxr-xr-x 2 deng huerta 4 Sep 29 19:44 gene2-mafft-none-fasttree
drwxr-xr-x 2 deng huerta 4 Sep 29 19:44 gene3-mafft-none-fasttree
drwxr-xr-x 2 deng huerta 4 Sep 29 19:44 gene4-mafft-none-fasttree
drwxr-xr-x 2 deng huerta 4 Sep 29 19:44 gene5-mafft-none-fasttree
drwxr-xr-x 2 deng huerta 5 Sep 29 19:44 supermatrix-mafft-none-fasttree
-rw-r--r-- 1 deng huerta 182 Sep 29 19:44 used_config.cfg
-rw-r--r-- 1 deng huerta 123 Sep 29 19:44 used_config.json
drwxr-xr-x 77 deng huerta 75 Sep 29 19:44 work
In coalescent mode, individual gene trees are first built, and then a species tree is inferred using the coalescent-based method (e.g., ASTRAL). This mode is particularly useful when different genes may have conflicting phylogenies, as it attempts to infer the species tree by accounting for such conflicts.
python ete_build.py --mode local --input data/sp_tree/ --output result/ --aligner mafft --trimmer none --tree_builder fasttree --spname-delimiter _ --spname-field 0 --coalescent
show result
ll result/
total 6
drwxr-xr-x 2 deng huerta 3 Sep 29 19:56 astral-mafft-none-fasttree
drwxr-xr-x 2 deng huerta 7 Sep 29 19:56 gene1-mafft-none-fasttree
drwxr-xr-x 2 deng huerta 7 Sep 29 19:56 gene2-mafft-none-fasttree
drwxr-xr-x 2 deng huerta 7 Sep 29 19:56 gene3-mafft-none-fasttree
drwxr-xr-x 2 deng huerta 7 Sep 29 19:56 gene4-mafft-none-fasttree
drwxr-xr-x 2 deng huerta 7 Sep 29 19:56 gene5-mafft-none-fasttree