metashot/prok-snp is a workflow for the identification of SNVs (in closely related organisms) and phylogenetic tree inference from prokaryotic isolates.
- Input: single-end, paired-end (also interleaved) Illumina sequences (gzip compressed FASTA/FASTQ also supported);
- Variant calling and core genome alignment using snippy;
- Recombination prediction and filtering using Gubbins, optional;
- Phylogenetic tree inference using RAxML, optional.
- Install Docker (or Singularity) and Nextflow (see Dependences);
- Start running the analysis:
nextflow run metashot/prok-snp \
--reads '*_R{1,2}.fastq.gz' \
--ref reference.fa \
--outdir results
See the file nextflow.config for the complete list of parameters.
The files and directories listed below will be created in the results
directory after the pipeline has finished.
- core_aln.fa: the core SNP alignment in FASTA format (snippy-core output);
- full_aln.fa: the whole genome SNP alignment in FASTA format (snippy-core output);
- core.vcf: multi-sample VCF file with genotype GT tags for all discovered alleles (snippy-core output);
- tree.tree: the best-scoring ML tree of a thorough ML analysis (RAxML output);
- tree_support.tree: the best-scoring ML tree with the BS support values (from 0 to 100, RAxML output when --raxml_mode rbs).
- raw_reads_stats: base frequency, quality scores, GC content, average quality and length for each input sample;
- snippy: the snippy output for each input sample;
- snippy_core: the snippy-core output;
- gubbins: Gubbins output (when --skip_gubbins false);
- raxml: RAxML output (when --skip_raxml false).
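A quick way to confirm that a run produced the core files listed above is to test for them after the pipeline finishes. The following is a minimal sketch, not part of the workflow: the function name check_core_outputs is hypothetical, and it assumes the three snippy-core files sit directly in the output directory, as described above.

```shell
# Sketch: report which snippy-core result files are missing or empty.
# check_core_outputs is a hypothetical helper, not provided by the workflow.
check_core_outputs() {
    outdir="$1"
    missing=0
    for f in core_aln.fa full_aln.fa core.vcf; do
        if [ ! -s "$outdir/$f" ]; then
            echo "missing or empty: $outdir/$f" >&2
            missing=1
        fi
    done
    # Exit status 0 when all three files are present and non-empty.
    return "$missing"
}

# Example use, only if a results directory is present.
if [ -d results ]; then
    check_core_outputs results && echo "core outputs present"
fi
```

The tree files are left out of the check because they depend on the RAxML options chosen for the run.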
Since the input alignments are from SNP data, the ascertainment bias correction is applied to the likelihood calculations [1] (RAxML option -m ASC_GTRCAT) and the rate heterogeneity among sites model is disabled (RAxML option -V). Two modes are available:
- default mode: construct a maximum likelihood (ML) tree. This mode runs the default RAxML tree search algorithm [2] and performs multiple searches for the best tree (10 distinct randomized MP trees by default, see the parameter --raxml_nsearch). The following RAxML parameters will be used: -f d -m ASC_GTRCAT -V --asc-corr=lewis -N [RAXML_NSEARCH]
- rbs mode: assess the robustness of inference and construct an ML tree. This mode runs the rapid bootstrapping full analysis [3]. The bootstrap convergence criterion or the number of bootstrap searches can be specified with the parameter --raxml_nboot. The following RAxML parameters will be used: -f a -m ASC_GTRCAT -V --asc-corr=lewis -N [RAXML_NBOOT]
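The two modes map onto different workflow parameters, which the sketch below makes explicit by printing the command line for a chosen mode instead of running it. It uses only parameters named in this document. The helper name build_prok_snp_cmd is hypothetical, the value 100 is a placeholder, and passing `default` to --raxml_mode is an assumption (only `rbs` is confirmed here); check nextflow.config for the accepted values.

```shell
# Sketch: print the prok-snp command for a given RAxML mode.
# build_prok_snp_cmd is a hypothetical helper, not provided by the workflow.
build_prok_snp_cmd() {
    mode="$1"     # "default" (assumed value name) or "rbs"
    count="$2"    # tree searches (default mode) or bootstraps (rbs mode)
    case "$mode" in
        default) count_opt="--raxml_nsearch" ;;
        rbs)     count_opt="--raxml_nboot" ;;
        *)       echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    # The read pattern stays single-quoted so the shell does not expand it.
    printf '%s\n' "nextflow run metashot/prok-snp --reads '*_R{1,2}.fastq.gz' --ref reference.fa --outdir results --raxml_mode $mode $count_opt $count"
}

# Print the command for a rapid bootstrap run with 100 replicates (placeholder).
build_prok_snp_cmd rbs 100
```

Printing the command first makes it easy to review the options before starting a long run.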
Please refer to System requirements for the complete list of system requirements options.
For each GB of input data the workflow requires approximately 1 GB for the final output and 1 GB for the working directory.
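The rule of thumb above can be turned into a rough pre-flight estimate. This is a minimal sketch under that approximation only (about 1 GB of output plus 1 GB of working directory per GB of input); the function names and the `reads` directory are hypothetical.

```shell
# Sketch: estimate disk space (in GB) needed for a run.
# Both helpers are hypothetical and only apply the rule of thumb above.
estimate_disk_gb() {
    input_gb="$1"
    # Roughly 1 GB of final output plus 1 GB of working directory per input GB.
    echo $(( input_gb * 2 ))
}

# Size of a directory in whole GB, rounded up (du -sk reports 1024-byte units).
input_gb_of_dir() {
    kb=$(du -sk "$1" | cut -f1)
    echo $(( (kb + 1048575) / 1048576 ))
}

# Example use, only if a directory named "reads" (hypothetical) is present.
if [ -d reads ]; then
    echo "estimated space needed: $(estimate_disk_gb "$(input_gb_of_dir reads)") GB"
fi
```

The estimate does not include the input data itself and should be treated as a lower bound when planning storage.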
1: Tamuri A., Goldman N. Avoiding ascertainment bias in the maximum likelihood inference of phylogenies based on truncated data. bioRxiv 186478, Link.
2: Stamatakis A., Blagojevic F., Nikolopoulos D.S. et al. Exploring New Search Algorithms and Hardware for Phylogenetics: RAxML Meets the IBM Cell. J VLSI Sign Process Syst Sign Im 48, 271–286 (2007). Link.
3: Stamatakis A., Hoover P., Rougemont J. A Rapid Bootstrap Algorithm for the RAxML Web Servers. Systematic Biology, Volume 57, Issue 5, October 2008, Pages 758–771, Link.