npore
is a read realigner which recalculates each read's fine-grained alignment in order to more accurately align ''n-polymers'' such as homopolymers (n=1) and tandem repeats (2 ≤ n ≤ 6). In other words, given an input BAM, it adjusts each read's CIGAR string to more accurately model the most likely sequencing errors and actual variants. Traditional affine gap penalties are context-agnostic, and do not model the higher likelihood of INDELs in low-complexity regions (particularly n-polymers), leading to poor or inconsistent alignments. We find that npore
improves pileup concordance across reads and results in slightly better variant calling performance.
Please cite the following pre-print if you use npore
:
[bioRxiv] nPoRe: n-Polymer Realigner for improved pileup variant calling
@article {dunn-npore, author = {Dunn, Tim and Blaauw, David and Das, Reetuparna and Narayanasamy, Satish}, title = {nPoRe: n-Polymer Realigner for improved pileup variant calling}, elocation-id = {2022.02.15.480561}, year = {2022}, doi = {10.1101/2022.02.15.480561}, publisher = {Cold Spring Harbor Laboratory}, URL = {https://www.biorxiv.org/content/early/2022/02/18/2022.02.15.480561}, eprint = {https://www.biorxiv.org/content/early/2022/02/18/2022.02.15.480561.full.pdf}, journal = {bioRxiv} }
First, clone the repository:
git clone https://github.com/timd1/npore && cd npore
Next, set up a virtual environment, activate it, and install the required packages.
python3 -m venv venv3 --prompt "npore"
source ./venv3/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install -r requirements.txt
Please ensure this environment is activated when building or running npore
.
Lastly, build npore
and verify that it has succeeded.
make
python3 ./src/realign.py --help
A pre-built Docker image can be downloaded from here using:
sudo docker pull timd1/npore
sudo docker run -it timd1/npore:latest python3 realign.py --help
This may take some time to re-build the image; the previous option should be preferred in most cases.
git clone https://github.com/TimD1/npore && cd npore
sudo docker build -f ./Dockerfile -t timd1/npore:latest .
sudo docker run -it timd1/npore:latest python3 realign.py --help
All input BAMs are required to:
- Have
MD
tag annotations (forpysam
read to ref mapping) - Be indexed (have an associated
.bam.bai
file)
You can prepare your input BAM using samtools
:
samtools calmd -@ `nproc` -b -Q orig_reads.bam ref.fasta > reads.bam
samtools index reads.bam
Here's an example usage of the main realign.py
program, which will store results in realigned.sam
.
export NPORE="$HOME/npore"
export DATA="$NPORE/test/data"
. $NPORE/venv3/bin/activate
python3 $NPORE/src/realign.py \
--bam $DATA/reads.bam \
--ref $DATA/ref.fasta \
--out_prefix $DATA/realigned \
--stats_dir $NPORE/guppy5_stats
For additional options, run python3 realign.py --help
.
Here's how to call the Docker container with the same arguments as above:
export NPORE="$HOME/npore"
export DATA="$NPORE/test/data"
sudo docker run \
-v $DATA:$DATA \
-v $NPORE/guppy5_stats:$NPORE/guppy5_stats \
timd1/npore:v0.1.0 \
python3 realign.py \
--bam $DATA/reads.bam \
--ref $DATA/ref.fasta \
--out_prefix $DATA/realigned \
--stats_dir $NPORE/guppy5_stats
src/ |
nPoRe source code |
---|---|
realign.py |
Module for realigning a BAM file. |
standardize_vcf.py |
Module for standardizing a ground-truth VCF file to report variants in the same manner that nPoRe would align the reads. |
bed.py |
Module for computing n-polymer BED regions. |
purity.py |
Module for computing a BAM pileup's Gini purity, for measuring read concordance. |
filter.py |
Simple module for filtering overlapping variants. |
cfg.py |
Contains global variables and configuration. |
All other src/
files (aln.pyx
, bam.pyx
, cig.pyx
, vcf.py
, util.py
) contain functions used in the above modules.
scripts/ |
Helper scripts used during evaluation |
---|---|
realign_pipeline.sh |
Main Clair3 retraining pipeline. |
happy.sh |
Runs hap.py evaluation of all configurations/regions. |
plot_results.py |
Plots final precision/recall graphs. |
plot_sankey.py |
Generates Sankey plot of actual/error INDELs by n-polymer BED region. |
calc_beds.sh |
Calculates n-polymer BEDs, running bed.py . |
sankey.py |
Custom Sankey plot library, extended from pySankey . |
purity.sh |
Calculates Gini purity. |
align.sh |
Aligns reads to a reference, allowing multiple input formats. |
tag_unphased.py |
Tags unphased reads with HP:i:0 . |
test/ |
Testing directory |
---|---|
align.py |
Tests align() kernel. |
get_np_info.py |
Tests n-polymer info generation. |
realign.sh |
Tests full read realignment. |
test_std_vcf.sh |
Tests VCF standardization. |
profile_alignment.ipynb |
Line-by-line profiling of align() kernel. |
*stats/ |
Directory storing cached confusion matrices |
---|
The Genome In A Bottle GRCh38 v4.1 ground truth VCF and benchmarking regions were downloaded from here. The GRCh38 human reference sequence and R9.4.1 reads basecalled with the Guppy 5.0.6 super-accuracy model were downloaded from ONT Open Datasets.
We would like to thank the developers of samtools
, minimap2
, pysam
, clair3
, pepper-deepvariant
, pysankey
, igv
, and swalign
. We would also like to thank GIAB and ONT for making their data available publicly.