Construction still in progress
A Snakemake workflow for Highly accurate Single Nucleotide polymorphisms calling and Implication of haplotypes in Probe-capture based long-Read Nanopore sequencing from raw FastQ to phased VCF. It is especially designed to process amplicon sequencing data for non-model organisms such as Plasmodium spp.
- Authors
- Software
- Methods
- Features
- Usage
- Installation
- Configuration
- Execution
- Examples Dataset
- Result
- Benchmarking
- Haplotype phasing
- Tips & FAQs
- Limitation
- Reporting Issue
- Link
This project is written based on the following software
Software | Reference (DOI) |
---|---|
BCFtools | https://doi.org/10.1093/gigascience/giab008 |
bedtools | https://doi.org/10.1093/bioinformatics/btq033 |
Freebayes | https://doi.org/10.48550/arXiv.1207.3907 |
GATK-4 | https://doi.org/10.1038/ng.806 |
longshot | https://doi.org/10.1038/s41467-019-12493-y |
minimap2 | https://doi:10.1093/bioinformatics/btab705 |
NanoSim | https://doi.org/10.1093/gigascience/gix010 |
PEPPER | https://doi.org/10.5281/zenodo.5275510 |
R-tidyverse | https://doi.org/10.21105/joss.01686 |
samtools | https://doi.org/10.1093/bioinformatics/btp352 |
Snakemake | https://doi.org/10.12688/f1000research.29032.2 |
VCFtools | https://doi.org/10.1093/bioinformatics/btr330 |
WhatsHap | https://doi.org/10.1186/s13059-020-02158-1 |
The following software are required to install prior to running sm-SNIPER
- Conda >= 4.12.0
- Singularity >= 3.8.5
- graphviz (optional)
- minimap2: mapping to the reference genome
- Quality control: Read-depth and coverage calculation with samtools
- Variant calling with longshot: Support only SNVs and restricted to high quality variants from high coverage regions (at least 10X)
- Variant calling with PEPPER which is run on singularity container: Support SNVs and Indels including variants from low quality and low coverage regions
- Merging SVNs supported by both callers: highly confidents SNPs from high quality regions
- Classification of variants (SVNs) by Support Vector Machine based on in-house reference database (it is genereated by sequencing of reference strains using STAR-seq protocol): rescues variants that failed the first filter using reference database
- Final variant call: final variant call
- Generation of only primary alignment BAM files: to use for haplotype phasing
- Haplotype phasing with WhatsHap: Haplotype phasing to final variant callset
Note on in-house reference database: The data includes true and false variants from the targeted amplicon panel (detected by variant calling methods - longshot or PEPPER) for known reference strains (3D7 and BB12) at 100X coverage. Two clinical samples are also sequenced at 200X coverage and variants detected by each variant method are cross-checked with variants from Naung et al., 2022, and concensus variants are assumed as true variants from clinical samples.
Disclaimer: These samples are not used in the following benchmarking to avoid overfitting.
The usage of this workflow is described in the Snakemake Workflow Catalog.
- Install snakemake, which requires conda & mamba, according to the documentation
- Clone/download this repository (e.g.
git clone https://github.com/myonaung/sm-SNIPER.git
)
-
Input (FastQ)
files must be annotated with sample name, and thus(sample_name).fastq
. -
Based on the analyses, the following parameters in the
workflow/config/config.yaml
file and resource files inworkflow/resources/
are to be adjustedreference
- name of the target reference genome along with index .fai file from workflow/resources/ folder (e.g. resources/2. PlasmoDB-46_Pfalciparum3D7_Genome.fasta)bed
- bed coordinate files for region of interestdata
- file path to folder that contains fastq files (e.g. desktop/fastq)ont_chemistry
- the chemistry of flowcell used for sequencing (default is R9 flowcell that isont_r9_guppy5_sup
, other options includeont_r10_q20
for R10 chemistry orhifi
(for Hifi).min_coverage
: minimum coverage used for variant callingmax_coverage
: maximum coverage used for variant callingmin_alt_frac
: specification of a potential SNV (or minor clones in the case of malaria multiclonal infection) to have at least this fraction of alternate allele observations
-
Based on nature of data to be analysed, it is recommended to change
svm_training_longshot.txt
andsvm_training_pepper.txt
fromworkflow/resources/
folder. However, in the absence of reference dataset, thetraining dataset
fromworkflow/resources/
should suffice. -
Failed log from downstream
PEPPER
variant calling steps are to be ignored at the moment since they are not relying for the pipeline.
It is recommended to execute always from within top level of the pipeline directory (i.e sm-SNIPER/
). Firstly, conda environment that includes all the core software has to be installed upon the first run of the workflow. It might take several minutes.
###envname can be replaced by any name
git clone https://github.com/myonaung/sm-SNIPER.git
conda env create --name envname --file=workflow/envs/default.yml
conda activate envname
Singularity image for PEPPER variant calling step has to be downloaded, and placed it under workflow/envs
folder.
cd sm-SNIPER
singularity pull docker://kishwars/pepper_deepvariant:r0.7
mv pepper_deepvariant_r0.7.sif workflow/envs
Relevent folders to run sm-SNIPER is created using init.sh
. The path to data (i.e. fastq files) has to be added to the init.sh
file.
sh init.sh
Checking the pipeline with dry-run options. It is to print a summary of the DAG of jobs
cd workflow
snakemake -p -n
Command for execution with two cores
cd workflow
snakemake -p -c2 -k
We can invoke snakemake with the profile to automatically submit to the cluster with sbatch
as follow:
snakemake --profile slurm/
Parameters can be customised by editing to config.yaml
file from workflow/slurm
folder. Details customisation can be found at this repository.
To ensure reproducibility of results and to make the pipeline easy-to-replicate, we provide all required reference data for the analysis on Zendodo:
- nanopore simulated BAM
- nanopore amplicon-seq BAM file from 3D7 reference strain mock infection
- nanopore ama1 BAM file from synthetic mocked infection of 3D7 and BB12 mixture
The final variant output can be found in the sm-SNIPER/workflow/results/{SAMPLE}/out/final_vcf_SNIPER
.
Benchmarking of the reference strain shown in the following example was done based on the amplicons from Naung et al., 2023. Genomic coordinates of these can be found in the bed file.
Information on read-depth and coverage can be found at workflow/results/{SAMPLE}/out/QC
.
- _amplicon_depth.txt represents per base-pair sequenced read-depth.
- _mapping_summary.txt represents mapping summary.
- _amplicon_total_depth.txt represents average read-depth for each amplicon specified in bed file.
- _mean_depth.txt represents mean read-depth of the amplicon panel.
To quantify baseline error rates for sm-SNIPER in the context of STAR-seq, we mapped raw amplicon sequencing data from probe-capture based 3D7 mocked infection (with high human dna background) to the publicly available P. falciparum 3D7 reference genome (version 3). The sequencing was done an average of 250X coverage. The SNVs identification of sm-SNIPER with popular variant calling methods are evaluated.
Method | coverage (X) | length (bps) | No. expected SNVs | No. observed SNVs |
---|---|---|---|---|
Freebayes v1.3.6 | 250 | 140629 | 0 | 75 |
BCFtools v1.15.1 | 250 | 140629 | 0 | 16 |
PEPPER | 250 | 140629 | 0 | 8 |
Longshot | 250 | 140629 | 0 | 1 |
sm-SNIPER | 250 | 140629 | 0 | 0 |
sm-SNIPER | 50 | 140629 | 0 | 0 |
sm-SNIPER | 30 | 140629 | 0 | 0 |
sm-SNIPER | 15 | 140629 | 0 | 0 |
sm-SNIPER | 5 | 140629 | 0 | 10 |
Freebayes was executed as follows:
freebayes -f reference.fasta -t bed.bed -p 2 -m 20 -q 7 --limit-coverage 3000 --min-coverage 10 -C 10 --throw-away-indel-obs bam.bam > out.vcf
BCFtools was executed as follows:
bcftools mpileup -d 3000 -Q 7 -Ou -I -f reference.fasta -R bed.bed bam.bam | bcftools call -mv -Ov -o out_raw.vcf
bcftools filter -sLowQual -e'%QUAL<20 & MQ < 10 | AC<=3 & DP<5 | %MAX(AC)/%MAX(DP) < 0.01' out_raw.vcf > out_filtered.vcf
Execution of longshot and PEPPER was similar to the ones used in sm-SNIPER.
We used synthetic mocked infection of 3D7 and BB12 mixture to evaluate true variant discovery with sm-SNIPER, and minor clone is BB12 strains. We used highly polymorphic and repeat-free ama1 (PF3D7_1133400) sequenced at 200X coverage with STAR-seq as a benchmarking amplicon.
Mocked infection | Minor clone prop | Gene ID | bps | No. expected SNVs | No. detected true SNVs | No. detected false SNVs | FDR | Precision | Recall | F1 score |
---|---|---|---|---|---|---|---|---|---|---|
mixed-1 | 0.33 | PF3D7_1133400 | 1869 | 32 | 31 | 0 | 0 | 1.00 | 0.97 | 0.98 |
mixed-2 | < 0.1 | PF3D7_1133400 | 1869 | 32 | 14 | 0 | 0 | 1.00 | 0.43 | 0.60 |
To understand the extent of baseline error and SNV detection at lower coverage, we subsampled the alignments file (mixed-1) to 30X, 15X, and 5X coverage with seqtk v1.3. Variant calls were evaluated at each respective coverage.
#subset 10000 reads from fastq
seqtk sample sample.fastq 10000 > subsample.fastq
Coverage | Gene ID | bps | No. expected SNVs | No. detected true SNVs | No. detected false SNVs | Precision | Recall | F1 score |
---|---|---|---|---|---|---|---|---|
50X | PF3D7_1133400 | 1869 | 32 | 30 | 0 | 1 | 0.94 | 0.97 |
30X | PF3D7_1133400 | 1869 | 32 | 30 | 0 | 1 | 0.94 | 0.97 |
15X | PF3D7_1133400 | 1869 | 32 | 30 | 0 | 1 | 0.94 | 0.97 |
5X | PF3D7_1133400 | 1869 | 32 | 13 | 1 | 0.93 | 0.41 | 0.57 |
Haplotype phasing method used in sm-SNIPER
was based on longshot
and WhatsHap, and thus haplotype switch error may still present in the cases where there are more than 2 infected strains in a single sample. Therefore, for the best results in haplotype phasing, we recommend to use SHAPEIT4 which uses phase information (value of phase tag (PS) block) from WhatsHap to extract phase information along with our filtered bam file. See here for example workflow of Plasmodium falciparum.
See phased results on the example vcf files at ama1_final_SNIPER.vcf, where major clone is the reference strain and thus clear seperation of 0|1
is expected for all variants in genotype (GT)field.
To reconstruct the fasta files from a phased VCF, the bcftools consensus
command can be used.
bgzip phased.vcf
tabix phased.vcf.gz
bcftools consensus -H 1 -f reference.fasta phased.vcf.gz > haplotype1.fasta
bcftools consensus -H 2 -f reference.fasta phased.vcf.gz > haplotype2.fasta
Here are some tips for troubleshooting & FAQs:
- always first perform a dry-run with option
-n
- always run the pipeline with
-k
options to complete independent steps if an upstream step fails - in case the pipeline crashes, manually cancel the pipeline as follow
snakemake --unlock
snakemake -k -c8 --rerun-incomplete
- command for generating the directed acyclic graph (DAG) of all jobs with current resultsration (installation of graphviz will be required)
snakemake --dag --forceall | dot -Tsvg > workflow/dags/all_DAG.svg
Our sm-SNIPER pipeline cannot perform indel discovery and SNV calling is restricted to MOI = 2.
Please create issue here for any problem developed from using sm-SNIPER
or to request a new features.