#### Table of Contents
- [Overview](#overview)
- [Installation](#installation)
- [Tutorial](#tutorial)
- [User Guide](#user-guide)
- [Recommended workflow](#recommended-workflow)
### Overview

Cue is a deep learning framework for SV calling and genotyping. At a high level, Cue operates in the following stages:
- sequence alignments are converted into images that capture multiple alignment signals across two genome intervals,
- a trained neural network is used to generate Gaussian response confidence maps for each image, which encode the location, type, and genotype of the SVs in this image, and
- the high-confidence SV predictions are refined and mapped back from image to genome coordinates.
The current version of Cue can be used to detect and genotype the following SV types (larger than 5kbp): deletions (DELs), tandem duplications (DUPs), inversions (INVs), deletion-flanked inversions (INVDELs), and inverted duplications (INVDUPs).
For more information, please see the associated preprint and video.
### Installation

- Clone the repository:
  ```
  git clone git@github.com:PopicLab/cue.git
  ```
- Create the virtual environment (in the `env` directory):
  ```
  $> python3.7 -m venv env
  ```
- Activate the environment:
  ```
  $> source env/bin/activate
  ```
- Install all the required packages in the virtual environment (this should take a few minutes):
  ```
  $> pip --no-cache-dir install -r install/requirements.txt
  ```
  Packages can also be installed individually using the versions provided in the `install/requirements.txt` file; for example:
  ```
  $> pip install numpy==1.18.5
  ```
- Set the `PYTHONPATH` as follows:
  ```
  export PYTHONPATH=${PYTHONPATH}:/path/to/cue
  ```

To deactivate the environment: `$> deactivate`
Pre-trained Cue models are stored in the public `cue-models` Google Cloud Storage bucket. To download the latest model into the `data/models` directory:

```
wget --directory-prefix=data/models/ https://storage.googleapis.com/cue-models/latest/cue.v2.pt
```

Synthetic training and benchmark data is available in a public Google Cloud Storage datasets bucket.
### Tutorial

We recommend trying the provided demo Jupyter notebook to ensure that the software was properly installed and to experiment with running Cue. For convenience, Jupyter is already included in the installation requirements above, or can be installed separately. In this demo we use Cue to discover variants in a small BAM file (the associated YAML config files needed to execute this workflow are provided in the `data/demo/config` directory).
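To launch the demo locally (a minimal sketch; the exact notebook location within the repository may differ in your checkout):

```
$> source env/bin/activate
$> jupyter notebook   # then open the demo notebook from the Jupyter file browser
```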
### User Guide

In addition to the functionality to call structural variants, the framework can be used to execute custom model training, evaluation, and image generation. The `engine` directory contains the following key high-level scripts to train/evaluate the model and generate image datasets:

- `call.py`: calls structural variants given a pre-trained model and an input BAM/CRAM file (can be executed on multiple GPUs or CPUs)
- `train.py`: trains a deep learning model (currently, this is a stacked hourglass network architecture) to detect SV keypoints in images
- `generate.py`: creates an annotated image dataset from alignments (BAM/CRAM file(s))
- `view.py`: plots images annotated with SVs from a VCF/BED file given genome alignments (BAM/CRAM format); can be used to visualize model predictions or ground truth SVs
Each script accepts as input one or multiple YAML config files, which encode a variety of parameters. Template config files with key parameters are provided in the `config` directory. The `config/custom` directory contains template config files with additional parameters that can be useful when generating custom models.
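For example, SV calling could be launched as follows (an illustrative sketch: the `--data_config` and `--model_config` flag names are assumptions based on the two YAML files described below; check each script's `--help` output for the actual interface):

```
$> python engine/call.py --data_config data.yaml --model_config model.yaml
```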
The key required and optional YAML parameters for each Cue command are listed below.
`call.py` (data YAML):

- `bam` [required] path to the alignments file (BAM/CRAM format)
- `fai` [required] path to the reference FASTA FAI file
- `chr_names` [optional] list of chromosomes to process: null (all) or a specific list e.g. ["chr1", "chr21"] (default: null)
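A minimal data config might look like this (an illustrative sketch; keys follow the parameter list above and all paths are placeholders):

```yaml
# data.yaml: inputs for call.py (paths are placeholders)
bam: /path/to/sample.bam        # alignments (BAM/CRAM)
fai: /path/to/ref.fa.fai        # reference FASTA index
chr_names: ["chr21", "chr22"]   # restrict calling to these chromosomes (null = all)
```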
`call.py` (model YAML):

- `model_path` [required] path to the pre-trained Cue model (recommended: the latest available model)
- `gpu_ids` [optional] list of GPU ids to use for calling (default: CPU(s) will be used if empty)
- `n_jobs_per_gpu` [optional] number of parallel jobs to launch on the same GPU (default: 1)
- `n_cpus` [optional] number of CPUs to use for calling if no GPUs are listed (default: 1)
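A corresponding model config sketch (values are illustrative placeholders):

```yaml
# model.yaml: model/runtime settings for call.py (values are placeholders)
model_path: data/models/cue.v2.pt  # pre-trained model downloaded above
gpu_ids: [0, 1]                    # call on GPUs 0 and 1; leave empty to use CPU(s)
n_jobs_per_gpu: 2                  # two parallel jobs per GPU
n_cpus: 1                          # CPUs to use when gpu_ids is empty
```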
`train.py`:

- `dataset_dirs` [required] list of annotated image sets to use for training
- `gpu_ids` [optional] GPU id to use for training (a CPU will be used if empty)
- `report_interval` [optional] frequency (in number of batches) for reporting training stats and image predictions (default: 50)
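A training config sketch under the same assumptions (paths are placeholders):

```yaml
# train.yaml: settings for train.py (paths are placeholders)
dataset_dirs: ["/path/to/imageset1", "/path/to/imageset2"]  # annotated image sets
gpu_ids: [0]          # train on GPU 0; leave empty to use a CPU
report_interval: 50   # report stats/predictions every 50 batches
```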
`generate.py`:

- `bam` [required] path to the alignments file (BAM/CRAM format)
- `bed` [required] path to the ground truth BED or VCF file
- `fai` [required] path to the reference FASTA FAI file
- `n_cpus` [optional] number of CPUs to use for image generation (parallelized by chromosome) (default: 1)
- `chr_names` [optional] list of chromosomes to process: null (all) or a specific list e.g. ["chr1", "chr21"] (default: null)
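An image-generation config sketch (paths are placeholders):

```yaml
# generate.yaml: inputs for generate.py (paths are placeholders)
bam: /path/to/sample.bam    # alignments (BAM/CRAM)
bed: /path/to/truth.vcf     # ground truth SVs (BED or VCF)
fai: /path/to/ref.fa.fai    # reference FASTA index
n_cpus: 4                   # parallelize image generation by chromosome
chr_names: null             # process all chromosomes
```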
`view.py`:

- `bam` [required] path to the alignments file (BAM/CRAM format)
- `bed` [required] path to the BED or VCF file with SVs to visualize
- `fai` [required] path to the reference FASTA FAI file
- `n_cpus` [optional] number of CPUs (parallelized by chromosome) (default: 1)
- `chr_names` [optional] list of chromosomes to process: null (all) or a specific list e.g. ["chr1", "chr21"] (default: null)
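Similarly, a visualization config sketch (keys follow the parameter list above; all values are placeholders):

```yaml
# view.yaml: inputs for view.py (paths are placeholders)
bam: /path/to/sample.bam    # alignments (BAM/CRAM)
bed: /path/to/calls.vcf     # SVs to visualize (e.g. Cue's output calls)
fai: /path/to/ref.fa.fai    # reference FASTA index
n_cpus: 1
chr_names: ["chr21"]        # only plot SVs on chr21
```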
#### Recommended workflow

- Create a new directory.
- Place YAML config file(s) in this directory (see the provided templates).
- Populate the YAML config file(s) with the parameters specific to this experiment.
- Execute the appropriate `engine` script, providing the path to the newly configured YAML file(s). The engine scripts will automatically create auxiliary directories with results in the folder where the config YAML files are located.
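For example, a calling experiment might be set up as follows (a minimal sketch; the directory, template file names, and CLI flags are placeholders/assumptions):

```
$> mkdir sv-experiment && cd sv-experiment
$> cp /path/to/cue/config/data.yaml /path/to/cue/config/model.yaml .
# ...edit the YAML files to point at your BAM/FAI and the downloaded model...
$> python /path/to/cue/engine/call.py --data_config data.yaml --model_config model.yaml
```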
### Authors

Victoria Popic ([email protected])

### Feedback and technical support

For questions, suggestions, or technical assistance, please create an issue on the Cue GitHub issues page or reach out to Victoria Popic at [email protected].