This repository contains an easy-to-use program to assign O1 Vibrio cholerae genomes to canonical lineages using phylogenetic placement.
Genomic surveillance of cholera has identified at least three waves of global transmission from Asia to Africa during the seventh pandemic (7PET), and at least 17 independent introductions of 7PET into Africa (designated the T1-T17 lineages). These lineages have been used to connect apparently disparate outbreaks, characterize regional transmission patterns, and suggest possible transmission routes of cholera within Africa. In addition, genomic differences between lineages might explain differences in severity and transmissibility observed between outbreaks. Vibecheck enables the rapid assignment of sequences to these canonical lineages, as an alternative to a lengthy and computationally intensive reconstruction of the global phylogeny.
Note
The sequence-based classification performed by Vibecheck is essentially a fork of Pangolin (and accompanying paper), in which the QC and hashing steps are removed and an O1 Vibrio cholerae global phylogeny is used. We therefore thank Áine O'Toole, Verity Hill, JT McCrone, Emily Scher and Andrew Rambaut for creating such a great, open-source tool.
Vibecheck is a tool that identifies which canonical O1 Vibrio cholerae lineage a sequence belongs to. It processes both:
- Consensus genomes (fasta format)
- Raw sequencing reads (fastq format)
The tool automatically selects the appropriate analysis workflow based on your input type.
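To make the idea of input-based workflow selection concrete, here is a minimal sketch that picks a workflow from file names alone. The function names, the suffix lists, and the workflow labels are all hypothetical; Vibecheck's actual detection logic may differ.

```python
from pathlib import Path
from typing import Sequence

# Hypothetical suffix lists; the real tool may recognize others.
FASTA_SUFFIXES = {".fa", ".fasta", ".fna"}
FASTQ_SUFFIXES = {".fq", ".fastq"}


def base_suffix(filename: str) -> str:
    """Return the lowercase file suffix, ignoring a trailing .gz."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    if suffixes and suffixes[-1] == ".gz":
        suffixes.pop()
    return suffixes[-1] if suffixes else ""


def choose_workflow(queries: Sequence[str]) -> str:
    """Pick "sequence" for one FASTA file or "reads" for a pair of FASTQ files."""
    kinds = [base_suffix(q) for q in queries]
    if len(kinds) == 1 and kinds[0] in FASTA_SUFFIXES:
        return "sequence"
    if len(kinds) == 2 and all(kind in FASTQ_SUFFIXES for kind in kinds):
        return "reads"
    raise ValueError(f"Cannot infer a workflow from: {list(queries)}")


print(choose_workflow(["genomes.fasta"]))
print(choose_workflow(["sample_R1.fastq.gz", "sample_R2.fastq.gz"]))
```

The two calls print `sequence` and `reads`; any other combination of inputs raises a `ValueError` in this sketch.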
When provided with a multi-sequence fasta containing whole genome sequences, generated either by reference-based or de novo assembly, Vibecheck:
- Aligns sequences to a reference O1 V. cholerae genome using minimap2 and converts the mapping to a multi-sequence FASTA using gofasta.
- Performs quality control by calculating the proportion of ambiguous characters in each sequence. Sequences failing QC are excluded from lineage assignment.
- Identifies variants (SNPs) between each sequence and the reference, generating a VCF file using UCSC's faToVcf.
- Places sequences into a lineage-annotated phylogeny using UShER, and records whether the placement is contained within a lineage.
- Analyzes UShER output to calculate lineage assignment confidence and generates a final report.
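The quality control step above can be sketched in a few lines of Python. This is an illustration only, not Vibecheck's implementation: the function names are hypothetical, every character other than A, C, G, or T (including gaps) is counted as ambiguous, and a sequence is assumed to pass when its ambiguous fraction is at or below the threshold.

```python
import io
from typing import Dict, Iterator, Tuple

UNAMBIGUOUS = set("ACGT")


def read_fasta(handle) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def ambiguity_fraction(sequence: str) -> float:
    """Fraction of characters that are not A, C, G, or T (case-insensitive)."""
    if not sequence:
        return 1.0  # assumption: treat an empty sequence as fully ambiguous
    ambiguous = sum(1 for base in sequence.upper() if base not in UNAMBIGUOUS)
    return ambiguous / len(sequence)


def qc_filter(handle, max_ambiguity: float = 0.3) -> Dict[str, Tuple[str, float]]:
    """Map each sequence name to ("pass" or "fail", ambiguous fraction)."""
    results = {}
    for name, seq in read_fasta(handle):
        fraction = ambiguity_fraction(seq)
        status = "pass" if fraction <= max_ambiguity else "fail"
        results[name] = (status, fraction)
    return results


# Two toy sequences: 20% and 60% ambiguous characters, respectively.
example = io.StringIO(">good\nACGTACGTNN\n>bad\nNNNNNNACGT\n")
report = qc_filter(example)
```

With the default threshold of 0.3, `good` passes and `bad` fails in this toy input.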
When provided with a pair of fastq files representing the raw sequencing data for a sample, Vibecheck:
- Subsamples 20% of reads using seqtk to reduce computational demands and speed up the analysis.
- Aligns reads to a reference O1 V. cholerae genome using minimap2.
- Calls variants from the alignment using BCFtools.
- Estimates lineage abundances from variant frequencies using Freyja.
- Analyzes Freyja output to calculate confidence scores and generates a final report.
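To illustrate the subsampling step, the sketch below keeps a random fraction of read pairs while keeping mates together. It is a pure-Python stand-in for illustration, not how seqtk works internally; the function names, the seed value, and the per-pair coin flip are assumptions.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, ...]  # the four lines of one FASTQ record


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Group FASTQ lines into 4-line records (header, sequence, '+', qualities)."""
    buffer: List[str] = []
    for line in lines:
        buffer.append(line.rstrip("\n"))
        if len(buffer) == 4:
            yield tuple(buffer)
            buffer = []


def subsample_pairs(
    r1: Iterable[str], r2: Iterable[str], fraction: float = 0.2, seed: int = 42
) -> List[Tuple[Record, Record]]:
    """Keep each read pair with probability `fraction`; mates stay together."""
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    kept = []
    for rec1, rec2 in zip(parse_fastq(r1), parse_fastq(r2)):
        if rng.random() < fraction:
            kept.append((rec1, rec2))
    return kept


# Toy input: 1000 synthetic read pairs held in memory.
r1 = [l for i in range(1000) for l in (f"@read{i}/1\n", "ACGT\n", "+\n", "IIII\n")]
r2 = [l for i in range(1000) for l in (f"@read{i}/2\n", "TGCA\n", "+\n", "IIII\n")]
pairs = subsample_pairs(r1, r2, fraction=0.2)
```

Because a single decision is made per pair, the forward and reverse reads never fall out of sync, which is the property that matters for the downstream alignment.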
Vibecheck achieves >98% accuracy in recapitulating lineages of sequences excluded from the guide tree. For detailed validation of speed and accuracy, see the calculate_accuracy notebook.
For information on the empirical basis of our default maximum ambiguity parameter, see the ambiguity_thresholding notebook.
- Install `mamba` by running the following two commands:
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
bash Mambaforge-$(uname)-$(uname -m).sh
- Clone the vibecheck repository:
git clone https://github.com/watronfire/vibecheck.git
- Move to the repository:
cd vibecheck/
- Create and activate the vibecheck conda environment:
mamba env create -f environment.yaml
mamba activate vibecheck
- Install the `vibecheck` command:
pip install .
- Test the installation:
vibecheck -h
vibecheck -v
These commands should print the help message and the version of the program. Please create an issue if this is not the case.
- Navigate to the directory where you cloned the vibecheck repository on the command line:
cd vibecheck/
- Activate the vibecheck conda environment:
mamba activate vibecheck
- Pull the latest changes from GitHub:
git pull
- Update the vibecheck conda environment:
mamba env update -f environment.yaml
- Reinstall the `vibecheck` command:
pip install .
- Activate the vibecheck conda environment:
mamba activate vibecheck
- Run
vibecheck [query ...]
Where `[query ...]` is the name of your input fasta file or a pair of fastq files.
A query fasta file can contain as many sequences as you would like to be classified, while paired fastq files should
only contain the data for one sample.
This command will generate a CSV file (`lineage_report.csv`) containing the estimated lineage for each sequence in the query file.
See Output for a complete description of this file.
usage: vibecheck [-h] [-o OUTDIR] [--outfile OUTFILE] [--tempdir TEMPDIR] [--no-temp] [-t THREADS] [-v] [-u USHER_TREE] [-m MAX_AMBIGUITY] [-b BARCODES] [-s SUBSAMPLE] [--no-subsample] [query ...]
██╗ ██╗██╗██████╗ ███████╗ ██████╗██╗ ██╗███████╗ ██████╗██╗ ██╗
██║ ██║██║██╔══██╗██╔════╝██╔════╝██║ ██║██╔════╝██╔════╝██║ ██╔╝
██║ ██║██║██████╔╝█████╗ ██║ ███████║█████╗ ██║ █████╔╝
╚██╗ ██╔╝██║██╔══██╗██╔══╝ ██║ ██╔══██║██╔══╝ ██║ ██╔═██╗
╚████╔╝ ██║██████╔╝███████╗╚██████╗██║ ██║███████╗╚██████╗██║ ██╗
╚═══╝ ╚═╝╚═════╝ ╚══════╝ ╚═════╝╚═╝ ╚═╝╚══════╝ ╚═════╝╚═╝ ╚═╝
Rapid classification of O1 Vibrio cholerae lineages.
positional arguments:
query Query sequences to classify
options:
-h, --help show this help message and exit
-o OUTDIR, --outdir OUTDIR
Output directory. Default: current working directory
--outfile OUTFILE Optional output file name. Default: lineage_report.csv
--tempdir TEMPDIR Specify where you want the temp stuff to go. Default: $TMPDIR
--no-temp Output all intermediate files, for dev purposes.
-t THREADS, --threads THREADS
Number of threads to use when possible. Default: all available cores, 4 detected on this machine
-v, --version Prints the version of Vibecheck and exits.
Sequence-based classification:
-u USHER_TREE, --usher-tree USHER_TREE
UShER Mutation Annotated Tree protobuf file to use instead of default tree
-m MAX_AMBIGUITY, --max-ambiguity MAX_AMBIGUITY
Maximum number of ambiguous bases a sequence can have before its filtered from the analysis. Default: 0.3
Read-based classification:
-b BARCODES, --barcodes BARCODES
Feather formatted lineage barcodes to use instead of default O1 barcodes
-s SUBSAMPLE, --subsample SUBSAMPLE
Fraction of reads to use in classification. Default: 0.2
--no-subsample Do not subsample reads. Default: False
A successful run of Vibecheck will output a CSV file, named `lineage_report.csv` by default.
This output file contains 6 columns with a row for each sequence found in the query input file.
- The `sequence_id` column contains the name of each provided sequence.
- The `qc_status` column indicates whether a sequence passed or failed quality control.
- The `qc_notes` column summarizes the results of the quality control process.
- The `lineage` column contains the most likely lineage assigned to a sequence.
- The `confidence` column contains a value reflecting how confident the assignment of a sequence is. A value of 1 indicates that, given the current phylogenetic tree, every parsimonious placement of the sequence falls within a single lineage, while a value below 1 is the fraction of placements that fall within the assigned lineage.
- The `classification_notes` column summarizes the placement(s) of a sequence.
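As a rough illustration of how the report might be consumed downstream, the sketch below lists sequences that deserve a closer look. It assumes the column names described above and follows the convention of the example report in this README, where a confidence of 1 means all placements agree. The function name and the handling of sequences that failed QC are hypothetical.

```python
import csv
import io
from typing import List


def flag_for_review(handle, min_confidence: float = 1.0) -> List[str]:
    """Return IDs of sequences that failed QC or fall below the confidence cutoff."""
    flagged = []
    for row in csv.DictReader(handle):
        if row["qc_status"] != "pass":
            # Assumption: failed sequences may have no lineage or confidence value.
            flagged.append(row["sequence_id"])
        elif float(row["confidence"]) < min_confidence:
            flagged.append(row["sequence_id"])
    return flagged


# Toy report held in memory; a real run would open lineage_report.csv instead.
toy_report = io.StringIO(
    "sequence_id,qc_status,lineage,confidence\n"
    "SequenceA,pass,T13,1.0\n"
    "SequenceD,pass,T13,0.6666\n"
    "SequenceE,fail,,\n"
)
needs_review = flag_for_review(toy_report)
```

Here `needs_review` contains `SequenceD` (split placements) and `SequenceE` (failed QC), while `SequenceA` is left alone.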
Note
The assignment of a sequence is sensitive to missing data at key sites, recombination, and other factors. Therefore, the results of Vibecheck should be interpreted with caution. All results should be confirmed with a complete phylogenetic reconstruction that includes quality and completeness filtering and recombination masking. We recommend the bacpage phylogeny pipeline (also available on Terra) for this.
sequence_id | qc_status | qc_notes | lineage | confidence | usher_note |
---|---|---|---|---|---|
SequenceA | pass | Ambiguous_content:0.01% | T13 | 1.0 | Usher placements: T13(1/1) |
SequenceB | pass | Ambiguous_content:0.03% | T15 | 1.0 | Usher placements: T15(1/1) |
SequenceC | pass | Ambiguous_content:0.02% | T12 | 1.0 | Usher placements: T12(8/8) |
SequenceD | pass | Ambiguous_content:0.13% | T13 | 0.6666 | Usher placements: T13(2/3) UNDEFINED(1/3) |
In the example above, SequenceA and SequenceB each have a single parsimonious placement in the phylogeny and are therefore assigned T13 and T15, respectively, with a confidence value of 1 indicating low uncertainty.
SequenceC has eight parsimonious placements in the phylogeny (as indicated by the `(8/8)` in the `classification_notes` column).
However, all of these placements are in the T12 lineage. Therefore, SequenceC is assigned the lineage T12 with a confidence value of 1 indicating high certainty.
SequenceD has three parsimonious placements in the phylogeny, two of which fall in the T13 lineage and one of which falls into non-African diversity.
SequenceD is therefore assigned to T13 because it is the most frequent placement, but its confidence value is less than 1, indicating an uncertain assignment.
The quality and completeness of this sequence should be checked, and a complete phylogenetic reconstruction should be performed to confirm the lineage assignment.
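The reasoning in this example can be reproduced with a short sketch that parses a placement note and derives the most frequent lineage and the fraction of placements supporting it. The note format is taken from the example table above; the regular expression and function names are assumptions, and Vibecheck's own calculation may differ in its details.

```python
import re
from typing import Dict, Tuple

# Matches entries such as "T13(2/3)": lineage name, placement count, total.
PLACEMENT = re.compile(r"([^\s()]+)\((\d+)/(\d+)\)")


def parse_placements(note: str) -> Dict[str, int]:
    """Map each lineage in a placement note to its number of placements."""
    return {lineage: int(count) for lineage, count, _total in PLACEMENT.findall(note)}


def summarize(note: str) -> Tuple[str, float]:
    """Return the most frequent lineage and its share of all placements."""
    counts = parse_placements(note)
    if not counts:
        raise ValueError(f"No placements found in: {note!r}")
    total = sum(counts.values())
    lineage = max(counts, key=counts.get)
    return lineage, counts[lineage] / total


lineage_c, confidence_c = summarize("Usher placements: T12(8/8)")
lineage_d, confidence_d = summarize("Usher placements: T13(2/3) UNDEFINED(1/3)")
```

For SequenceC all eight placements agree, so the share is 1.0; for SequenceD two of three placements support T13, giving a share of about 0.67.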