An easy-to-use program to assign O1 Vibrio cholerae genomes to canonical lineages using phylogenetic placement.

watronfire/Vibecheck

vibecheck

This repository contains an easy-to-use program to assign O1 Vibrio cholerae genomes to canonical lineages using phylogenetic placement.

Table of Contents

  1. Introduction
  2. How it Works
  3. Installation
  4. Updating
  5. Usage
  6. Output

Introduction

Genomic surveillance of cholera has identified at least three waves of global transmission from Asia to Africa during the seventh pandemic (7PET), and at least 17 independent introductions of 7PET into Africa (termed the T1-T17 lineages). These lineages have been used to connect apparently disparate outbreaks, characterize regional transmission patterns, and suggest possible transmission routes of cholera within Africa. In addition, genomic differences between lineages might explain differences in severity and transmissibility observed between outbreaks. Vibecheck enables the rapid assignment of sequences to these canonical lineages, as an alternative to a lengthy and computationally intensive reconstruction of the global phylogeny.

How it works

Note

The sequence-based classification performed by Vibecheck is basically a fork of Pangolin (and accompanying paper), in which the QC and hashing steps are removed, and an O1 Vibrio cholerae global phylogeny is used. Therefore, we'd like to thank Áine O'Toole, Verity Hill, JT McCrone, Emily Scher and Andrew Rambaut for creating such a great and open-source tool.

Vibecheck is a tool that identifies which canonical O1 Vibrio cholerae lineage a sequence belongs to. It processes both:

  • Consensus genomes (fasta format)
  • Raw sequencing reads (fastq format)

The tool automatically selects the appropriate analysis workflow based on your input type.
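As a rough illustration of this kind of dispatch, the sketch below picks a workflow from the file extensions of the query. It is only an assumption about how such a choice could be made, not a description of Vibecheck's internals; the function names and the accepted suffixes are hypothetical.

```python
from pathlib import Path

# Hypothetical suffix sets; the extensions Vibecheck actually accepts may differ.
FASTA_SUFFIXES = {".fasta", ".fa", ".fna"}
FASTQ_SUFFIXES = {".fastq", ".fq"}


def base_suffix(path):
    """Return the lowercase file suffix, ignoring a trailing .gz."""
    suffixes = [s.lower() for s in Path(path).suffixes]
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]
    return suffixes[-1] if suffixes else ""


def choose_workflow(queries):
    """Return "sequence" for one fasta file, "read" for a pair of fastq files."""
    kinds = {base_suffix(q) for q in queries}
    if len(queries) == 1 and kinds <= FASTA_SUFFIXES:
        return "sequence"
    if len(queries) == 2 and kinds <= FASTQ_SUFFIXES:
        return "read"
    raise ValueError("Expected one fasta file or a pair of fastq files")


print(choose_workflow(["genomes.fasta"]))                             # sequence
print(choose_workflow(["sample_R1.fastq.gz", "sample_R2.fastq.gz"]))  # read
```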

Sequence-based classification

When provided with a multi-sequence fasta containing whole genome sequences, generated either by reference-based or de novo assembly, Vibecheck:

  1. Aligns sequences to a reference O1 V. cholerae genome using minimap2 and converts the mapping to a multi-sequence FASTA using gofasta.
  2. Performs quality control by calculating the proportion of ambiguous characters in each sequence. Sequences failing QC are excluded from lineage assignment.
  3. Identifies variants (SNPs) between each sequence and the reference, generating a VCF file using UCSC's faToVcf.
  4. Places sequences into a lineage-annotated phylogeny using UShER, and records whether the placement is contained within a lineage.
  5. Analyzes UShER output to calculate lineage assignment confidence and generates a final report.
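The quality-control step in the list above can be sketched in a few lines of Python. This is a minimal illustration, not Vibecheck's own code: it assumes that every character other than A, C, G, or T (including gaps) counts as ambiguous, and that a sequence passes when its ambiguous proportion is at or below the `--max-ambiguity` default of 0.3. The function names are hypothetical.

```python
def ambiguous_fraction(sequence):
    """Proportion of characters that are not an unambiguous A, C, G, or T."""
    seq = sequence.upper()
    if not seq:
        return 1.0  # treat an empty sequence as fully ambiguous
    unambiguous = sum(seq.count(base) for base in "ACGT")
    return 1 - unambiguous / len(seq)


def passes_qc(sequence, max_ambiguity=0.3):
    """Assumed rule: pass when the ambiguous proportion is at or below the threshold."""
    return ambiguous_fraction(sequence) <= max_ambiguity


print(passes_qc("ACGTACGTNN"))  # True: 2 of 10 characters are ambiguous
print(passes_qc("ACGTNNNN"))    # False: 4 of 8 characters are ambiguous
```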

Read-based classification

When provided with a pair of fastq files representing the raw sequencing data for a sample, Vibecheck:

  1. Subsamples 20% of reads (by default) using seqtk to ease computational demands and speed up the analysis.
  2. Aligns reads to a reference O1 V. cholerae genome using minimap2.
  3. Calls variants from the alignment using BCFtools.
  4. Estimates lineage abundances from variant frequencies using Freyja.
  5. Analyzes Freyja output to calculate confidence scores and generates a final report.
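The subsampling step matters for paired reads because both files must keep the same read pairs. The sketch below shows the idea in plain Python, using one seeded random decision per pair; it is an illustration under that assumption, not a reimplementation of seqtk, and the function name and seed are hypothetical.

```python
import random


def subsample_pairs(reads_1, reads_2, fraction=0.2, seed=42):
    """Keep roughly `fraction` of read pairs, making one random decision per pair."""
    if len(reads_1) != len(reads_2):
        raise ValueError("Paired read lists must be the same length")
    rng = random.Random(seed)  # fixed seed makes the selection reproducible
    kept_1, kept_2 = [], []
    for read_1, read_2 in zip(reads_1, reads_2):
        if rng.random() < fraction:
            kept_1.append(read_1)
            kept_2.append(read_2)
    return kept_1, kept_2


# Toy read names standing in for fastq records.
forward = [f"read{i}/1" for i in range(1000)]
reverse = [f"read{i}/2" for i in range(1000)]
kept_forward, kept_reverse = subsample_pairs(forward, reverse)
print(len(kept_forward), len(kept_reverse))  # equal counts, near 200 of 1000
```

Because a single decision is made for each pair, mates are never separated, which is what the downstream alignment step needs.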

Validation

Vibecheck achieves >98% accuracy in recapitulating lineages of sequences excluded from the guide tree. For detailed validation of speed and accuracy, see the calculate_accuracy notebook.

For information on the empirical basis of our default maximum ambiguity parameter, see the ambiguity_thresholding notebook.

Installation

  1. Install mamba by running the following two commands:
       curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
       bash Mambaforge-$(uname)-$(uname -m).sh
  2. Clone the vibecheck repository:
       git clone https://github.com/watronfire/vibecheck.git
  3. Move to the repository:
       cd vibecheck/
  4. Create and activate the vibecheck conda environment:
       mamba env create -f environment.yaml
       mamba activate vibecheck
  5. Install the vibecheck command:
       pip install .
  6. Test the installation:
       vibecheck -h
       vibecheck -v

These commands should print the help message and the version of the program, respectively. Please create an issue if this is not the case.

Updating

  1. Navigate to the directory where you cloned the vibecheck repository on the command line:
       cd vibecheck/
  2. Activate the vibecheck conda environment:
       mamba activate vibecheck
  3. Pull the latest changes from GitHub:
       git pull
  4. Update the vibecheck conda environment:
       mamba env update -f environment.yaml
  5. Reinstall the vibecheck command:
       pip install .

Usage

  1. Activate the vibecheck conda environment:
       mamba activate vibecheck
  2. Run Vibecheck:
       vibecheck [query ...]

Where [query ...] is the name of your input fasta file or a pair of fastq files. A query fasta file can contain as many sequences as you would like to be classified, while paired fastq files should only contain the data for one sample.

This command will generate a CSV file (lineage_report.csv) containing the estimated lineages for each sequence in the query file. See Output for a complete description of this file.
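If you want to launch Vibecheck from a script rather than the shell, a small wrapper along these lines is one option. It is a sketch only: `build_vibecheck_command` is a hypothetical helper, and it uses just the `--outdir`, `--outfile`, and `--threads` options listed in the full usage below. The file and directory names are placeholders.

```python
import shlex


def build_vibecheck_command(queries, outdir=None, outfile=None, threads=None):
    """Assemble a vibecheck command line from documented options only."""
    command = ["vibecheck"]
    if outdir is not None:
        command += ["--outdir", str(outdir)]
    if outfile is not None:
        command += ["--outfile", str(outfile)]
    if threads is not None:
        command += ["--threads", str(threads)]
    command += [str(query) for query in queries]
    return command


command = build_vibecheck_command(["genomes.fasta"], outdir="results", threads=4)
print(shlex.join(command))  # vibecheck --outdir results --threads 4 genomes.fasta

# To actually run it (requires the vibecheck environment to be active):
# import subprocess
# subprocess.run(command, check=True)
```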

Full usage options

usage: vibecheck [-h] [-o OUTDIR] [--outfile OUTFILE] [--tempdir TEMPDIR] [--no-temp] [-t THREADS] [-v] [-u USHER_TREE] [-m MAX_AMBIGUITY] [-b BARCODES] [-s SUBSAMPLE] [--no-subsample] [query ...]

██╗   ██╗██╗██████╗ ███████╗ ██████╗██╗  ██╗███████╗ ██████╗██╗  ██╗
██║   ██║██║██╔══██╗██╔════╝██╔════╝██║  ██║██╔════╝██╔════╝██║ ██╔╝
██║   ██║██║██████╔╝█████╗  ██║     ███████║█████╗  ██║     █████╔╝ 
╚██╗ ██╔╝██║██╔══██╗██╔══╝  ██║     ██╔══██║██╔══╝  ██║     ██╔═██╗ 
 ╚████╔╝ ██║██████╔╝███████╗╚██████╗██║  ██║███████╗╚██████╗██║  ██╗
  ╚═══╝  ╚═╝╚═════╝ ╚══════╝ ╚═════╝╚═╝  ╚═╝╚══════╝ ╚═════╝╚═╝  ╚═╝

        Rapid classification of O1 Vibrio cholerae lineages.

positional arguments:
  query                 Query sequences to classify

options:
  -h, --help            show this help message and exit
  -o OUTDIR, --outdir OUTDIR
                        Output directory. Default: current working directory
  --outfile OUTFILE     Optional output file name. Default: lineage_report.csv
  --tempdir TEMPDIR     Specify where you want the temp stuff to go. Default: $TMPDIR
  --no-temp             Output all intermediate files, for dev purposes.
  -t THREADS, --threads THREADS
                        Number of threads to use when possible. Default: all available cores, 4 detected on this machine
  -v, --version         Prints the version of Vibecheck and exits.

Sequence-based classification:
  -u USHER_TREE, --usher-tree USHER_TREE
                        UShER Mutation Annotated Tree protobuf file to use instead of default tree
  -m MAX_AMBIGUITY, --max-ambiguity MAX_AMBIGUITY
                        Maximum number of ambiguous bases a sequence can have before its filtered from the analysis. Default: 0.3

Read-based classification:
  -b BARCODES, --barcodes BARCODES
                        Feather formatted lineage barcodes to use instead of default O1 barcodes
  -s SUBSAMPLE, --subsample SUBSAMPLE
                        Fraction of reads to use in classification. Default: 0.2
  --no-subsample        Do not subsample reads. Default: False

Output

A successful run of Vibecheck will output a CSV file, named by default lineage_report.csv.

This output file contains 6 columns with a row for each sequence found in the query input file.

  • The sequence_id column contains the name of each provided sequence.
  • The qc_status column indicates whether a sequence passed or failed quality control.
  • The qc_notes column summarizes the results of the quality control process.
  • The lineage column contains the most likely lineage assigned to a sequence.
  • The confidence column contains a value reflecting how confident the assignment of a sequence is. As in the example output below, it is the proportion of equally parsimonious placements that fall within the assigned lineage: a value of 1 indicates that, given the current phylogenetic tree, there is only a single lineage that the sequence could be assigned to, while a value below 1 indicates that some placements fall outside the assigned lineage.
  • The classification_notes column summarizes the placement(s) of a sequence.
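A report like this is easy to screen programmatically. The sketch below lists the sequences that deserve a closer look, assuming the file is comma-delimited, uses the column names given above, and (as in the example output further down) reports a confidence of 1 when all placements agree. The function name is hypothetical, the inline report is a trimmed stand-in for a real lineage_report.csv, and SequenceE is an invented failing row.

```python
import csv
import io


def needs_follow_up(report_text):
    """Return IDs of sequences that failed QC or have a confidence below 1."""
    flagged = []
    for row in csv.DictReader(io.StringIO(report_text)):
        # The confidence is only parsed for sequences that passed QC.
        if row["qc_status"] != "pass" or float(row["confidence"]) < 1.0:
            flagged.append(row["sequence_id"])
    return flagged


example_report = """sequence_id,qc_status,lineage,confidence
SequenceA,pass,T13,1.0
SequenceD,pass,T13,0.6666
SequenceE,fail,,
"""

print(needs_follow_up(example_report))  # ['SequenceD', 'SequenceE']
```

To screen a real report, read the file contents first, for example with `open("lineage_report.csv").read()`, and pass the text to the function.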

Note

The assignment of a sequence is sensitive to missing data at key sites, recombination, and other factors. Therefore, caution should be taken in interpreting the results of Vibecheck. All results should be confirmed with a complete phylogenetic reconstruction involving quality and completeness filtering, and recombination masking. We recommend the bacpage phylogeny pipeline (also available on Terra) for this.

Example output

| sequence_id | qc_status | qc_notes | lineage | confidence | usher_note |
| --- | --- | --- | --- | --- | --- |
| SequenceA | pass | Ambiguous_content:0.01% | T13 | 1.0 | Usher placements: T13(1/1) |
| SequenceB | pass | Ambiguous_content:0.03% | T15 | 1.0 | Usher placements: T15(1/1) |
| SequenceC | pass | Ambiguous_content:0.02% | T12 | 1.0 | Usher placements: T12(8/8) |
| SequenceD | pass | Ambiguous_content:0.13% | T13 | 0.6666 | Usher placements: T13(2/3) UNDEFINED(1/3) |

In the example above, SequenceA and SequenceB each have a single parsimonious placement in the phylogeny and are therefore assigned T13 and T15, respectively, with a confidence value of 1 indicating high certainty. SequenceC has eight parsimonious placements in the phylogeny (as indicated by the (8/8) in the last column). However, all of these placements are in the T12 lineage, so SequenceC is assigned the lineage T12, again with a confidence value of 1. SequenceD has three parsimonious placements in the phylogeny, two of which fall in the T13 lineage and one of which falls into non-African diversity. SequenceD is therefore assigned T13 because it is the most frequent assignment, but its confidence value is less than 1, indicating an uncertain assignment. The quality and completeness of this sequence should be checked, and a complete phylogenetic reconstruction should be performed to confirm the lineage assignment.
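The relationship between the placement notes and the confidence value in this example can be reproduced with a short script. The sketch assumes the confidence is simply the share of placements that fall in the most frequent lineage, which matches the four rows above but is inferred from them rather than taken from Vibecheck's source. The function name and the pattern for lineage names are hypothetical.

```python
import re

# Matches entries such as "T13(2/3)"; assumes lineage names are alphanumeric.
PLACEMENT = re.compile(r"(\w+)\((\d+)/(\d+)\)")


def summarize_placements(note):
    """Return the most frequent lineage and its share of all placements."""
    counts = {lineage: int(n) for lineage, n, _total in PLACEMENT.findall(note)}
    if not counts:
        raise ValueError(f"No placements found in: {note!r}")
    total = sum(counts.values())
    lineage = max(counts, key=counts.get)
    return lineage, counts[lineage] / total


print(summarize_placements("Usher placements: T12(8/8)"))
print(summarize_placements("Usher placements: T13(2/3) UNDEFINED(1/3)"))
```

For SequenceC this gives T12 with a share of 1.0, and for SequenceD it gives T13 with a share of two thirds, in line with the 0.6666 shown in the report.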
