Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add lofreq #17

Merged
merged 21 commits into from
Feb 15, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,10 @@

* `pear`: Paired-end read merger (PR #10).

* `lofreq/call`: Call variants from a BAM file (PR #17).

* `lofreq/indelqual`: Insert indel qualities into BAM file (PR #17).

## MAJOR CHANGES

## MINOR CHANGES
Expand Down
251 changes: 251 additions & 0 deletions src/lofreq/call/config.vsh.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,251 @@
functionality:
name: lofreq_call
namespace: lofreq
description: |
Call variants from a BAM file.

LoFreq* (i.e. LoFreq version 2) is a fast and sensitive variant-caller for inferring SNVs and indels from next-generation sequencing data. It makes full use of base-call qualities and other sources of errors inherent in sequencing (e.g. mapping or base/indel alignment uncertainty), which are usually ignored by other methods or only used for filtering.

LoFreq* can run on almost any type of aligned sequencing data (e.g. Illumina, IonTorrent or Pacbio) since no machine- or sequencing-technology dependent thresholds are used. It automatically adapts to changes in coverage and sequencing quality and can therefore be applied to a variety of data-sets e.g. viral/quasispecies, bacterial, metagenomics or somatic data.

LoFreq* is very sensitive; most notably, it is able to predict variants below the average base-call quality (i.e. sequencing error rate). Each variant call is assigned a p-value which allows for rigorous false positive control. Even though it uses no approximations or heuristics, it is very efficient due to several runtime optimizations and also provides a (pseudo-)parallel implementation. LoFreq* is generic and fast enough to be applied to high-coverage data and large genomes. On a single processor it takes a minute to analyze Dengue genome sequencing data with nearly 4000X coverage, roughly one hour to call SNVs on a 600X coverage E.coli genome and also roughly an hour to run on a 100X coverage human exome dataset.
info:
keywords: [ "variant calling", "low frequancy variant calling", "lofreq", "lofreq/call"]
links:
homepage: https://csb5.github.io/lofreq/
documentation: https://csb5.github.io/lofreq/commands/
reference:
doi: 10.1093/nar/gks918
license: "MIT"
requirements:
commands: [ lofreq ]
argument_groups:
- name: Inputs
arguments:
- name: --input
type: file
description: |
Input BAM file.
required: true
example: "normal.bam"
- name: --input_bai
type: file
description: |
Index file for the input BAM file.
required: true
example: "normal.bai"
- name: --ref
alternatives: -f
type: file
description: |
Indexed reference fasta file (gzip supported). Default: none.
required: true
example: "reference.fasta"
- name: Outputs
arguments:
- name: --out
alternatives: -o
type: file
description: |
Vcf output file. Default: stdout.
required: true
direction: output
example: "output.vcf"
- name: Arguments
arguments:
- name: --region
alternatives: -r
type: string
description: |
Limit calls to this region (chrom:start-end). Default: none.
required: false
example: "chr1:1000-2000"
- name: --bed
alternatives: -l
type: file
description: |
List of positions (chr pos) or regions (BED). Default: none.
required: false
example: "regions.bed"
- name: --min_bq
alternatives: -q
type: integer
description: |
Skip any base with baseQ smaller than INT. Default: 6.
required: false
example: 6
- name: --min_alt_bq
alternatives: -Q
type: integer
description: |
Skip alternate bases with baseQ smaller than INT. Default: 6.
required: false
example: 6
- name: --def_alt_bq
alternatives: -R
type: integer
description: |
Overwrite baseQs of alternate bases (that passed bq filter) with this value (-1: use median ref-bq; 0: keep). Default: 0.
required: false
example: 0
- name: --min_jq
alternatives: -j
type: integer
description: |
Skip any base with joinedQ smaller than INT. Default: 0.
example: 0
- name: --min_alt_jq
alternatives: -J
type: integer
description: |
Skip alternate bases with joinedQ smaller than INT. Default: 0.
required: false
example: 0
- name: --def_alt_jq
alternatives: -K
type: integer
description: |
Overwrite joinedQs of alternate bases (that passed jq filter) with this value (-1: use median ref-bq; 0: keep). Default: 0.
required: false
example: 0
- name: --no_baq
alternatives: -B
type: boolean_true
description: |
Disable use of base-alignment quality (BAQ).
- name: --no_idaq
alternatives: -A
type: boolean_true
description: |
Don't use IDAQ values (NOT recommended under ANY circumstances other than debugging).
- name: --del_baq
alternatives: -D
type: boolean_true
description: |
Delete pre-existing BAQ values, i.e. compute even if already present in BAM.
- name: --no_ext_baq
alternatives: -e
type: boolean_true
description: |
Use 'normal' BAQ (samtools default) instead of extended BAQ (both computed on the fly if not already present in lb tag).
- name: --min_mq
alternatives: -m
type: integer
description: |
Skip reads with mapping quality smaller than INT. Default: 0.
required: false
example: 0
- name: --max_mq
alternatives: -M
type: integer
description: |
Cap mapping quality at INT. Default: 255.
required: false
example: 255
- name: --no_mq
alternatives: -N
type: boolean_true
description: |
Don't merge mapping quality in LoFreq's model.
- name: --call_indels
type: boolean_true
description: |
Enable indel calls (note: preprocess your file to include indel alignment qualities!).
- name: --only_indels
type: boolean_true
description: |
Only call indels; no SNVs.
- name: --src_qual
alternatives: -s
type: boolean_true
description: |
Enable computation of source quality.
- name: --ign_vcf
alternatives: -S
type: file
description: |
Ignore variants in this vcf file for source quality computation. Multiple files can be given separated by commas.
required: false
example: "variants.vcf"
- name: --def_nm_q
alternatives: -T
type: integer
description: |
If >= 0, then replace non-match base qualities with this default value. Default: -1.
required: false
example: -1
- name: --sig
alternatives: -a
type: double
description: |
P-Value cutoff / significance level. Default: 0.010000.
required: false
example: 0.01
- name: --bonf
alternatives: -b
type: string
description: |
Bonferroni factor. 'dynamic' (increase per actually performed test) or INT. Default: Dynamic.
required: false
example: "dynamic"
- name: --min_cov
alternatives: -C
type: integer
description: |
Test only positions having at least this coverage. Default: 1.
(note: without --no-default-filter default filters (incl. coverage) kick in after predictions are done).
required: false
example: 1
- name: --max_depth
alternatives: -d
type: integer
description: |
Cap coverage at this depth. Default: 1000000.
required: false
example: 1000000
- name: --illumina_13
type: boolean_true
description: |
Assume the quality is Illumina-1.3-1.7/ASCII+64 encoded.
- name: --use_orphan
type: boolean_true
description: |
Count anomalous read pairs (i.e. where mate is not aligned properly).
- name: --plp_summary_only
type: boolean_true
description: |
No variant calling. Just output pileup summary per column.
- name: --no_default_filter
type: boolean_true
description: |
Don't run default 'lofreq filter' automatically after calling variants.
- name: --force_overwrite
type: boolean_true
description: |
Overwrite any existing output.
- name: --verbose
type: boolean_true
description: |
Be verbose.
- name: --debug
type: boolean_true
description: |
Enable debugging.
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- type: file
path: test_data
platforms:
- type: docker
image: quay.io/biocontainers/lofreq:2.1.5--py38h794fc9e_10
setup:
- type: docker
run: |
version=$(lofreq version | grep 'version' | sed 's/version: //') && \
echo "lofreq: $version" > /var/software_versions.txt
- type: nextflow

49 changes: 49 additions & 0 deletions src/lofreq/call/help.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
lofreq call: call variants from BAM file

Usage: lofreq call [options] in.bam

Options:
- Reference:
-f | --ref FILE Indexed reference fasta file (gzip supported) [null]
- Output:
-o | --out FILE Vcf output file [- = stdout]
- Regions:
-r | --region STR Limit calls to this region (chrom:start-end) [null]
-l | --bed FILE List of positions (chr pos) or regions (BED) [null]
- Base-call quality:
-q | --min-bq INT Skip any base with baseQ smaller than INT [6]
-Q | --min-alt-bq INT Skip alternate bases with baseQ smaller than INT [6]
-R | --def-alt-bq INT Overwrite baseQs of alternate bases (that passed bq filter) with this value (-1: use median ref-bq; 0: keep) [0]
-j | --min-jq INT Skip any base with joinedQ smaller than INT [0]
-J | --min-alt-jq INT Skip alternate bases with joinedQ smaller than INT [0]
-K | --def-alt-jq INT Overwrite joinedQs of alternate bases (that passed jq filter) with this value (-1: use median ref-bq; 0: keep) [0]
- Base-alignment (BAQ) and indel-aligment (IDAQ) qualities:
-B | --no-baq Disable use of base-alignment quality (BAQ)
-A | --no-idaq Don't use IDAQ values (NOT recommended under ANY circumstances other than debugging)
-D | --del-baq Delete pre-existing BAQ values, i.e. compute even if already present in BAM
-e | --no-ext-baq Use 'normal' BAQ (samtools default) instead of extended BAQ (both computed on the fly if not already present in lb tag)
- Mapping quality:
-m | --min-mq INT Skip reads with mapping quality smaller than INT [0]
-M | --max-mq INT Cap mapping quality at INT [255]
-N | --no-mq Don't merge mapping quality in LoFreq's model
- Indels:
--call-indels Enable indel calls (note: preprocess your file to include indel alignment qualities!)
--only-indels Only call indels; no SNVs
- Source quality:
-s | --src-qual Enable computation of source quality
-S | --ign-vcf FILE Ignore variants in this vcf file for source quality computation. Multiple files can be given separated by commas
-T | --def-nm-q INT If >= 0, then replace non-match base qualities with this default value [-1]
- P-values:
-a | --sig P-Value cutoff / significance level [0.010000]
-b | --bonf Bonferroni factor. 'dynamic' (increase per actually performed test) or INT ['dynamic']
- Misc.:
-C | --min-cov INT Test only positions having at least this coverage [1]
(note: without --no-default-filter default filters (incl. coverage) kick in after predictions are done)
-d | --max-depth INT Cap coverage at this depth [1000000]
--illumina-1.3 Assume the quality is Illumina-1.3-1.7/ASCII+64 encoded
--use-orphan Count anomalous read pairs (i.e. where mate is not aligned properly)
--plp-summary-only No variant calling. Just output pileup summary per column
--no-default-filter Don't run default 'lofreq filter' automatically after calling variants
--force-overwrite Overwrite any existing output
--verbose Be verbose
--debug Enable debugging
57 changes: 57 additions & 0 deletions src/lofreq/call/script.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
#!/bin/bash

## VIASH START
## VIASH END

# Unset all parameters that are set to "false"
[[ "$par_no_baq" == "false" ]] && unset par_no_baq
[[ "$par_no_idaq" == "false" ]] && unset par_no_idaq
[[ "$par_del_baq" == "false" ]] && unset par_del_baq
[[ "$par_no_ext_baq" == "false" ]] && unset par_no_ext_baq
[[ "$par_no_mq" == "false" ]] && unset par_no_mq
[[ "$par_call_indels" == "false" ]] && unset par_call_indels
[[ "$par_only_indels" == "false" ]] && unset par_only_indels
[[ "$par_src_qual" == "false" ]] && unset par_src_qual
[[ "$par_illumina_13" == "false" ]] && unset par_illumina_13
[[ "$par_use_orphan" == "false" ]] && unset par_use_orphan
[[ "$par_plp_summary_only" == "false" ]] && unset par_plp_summary_only
[[ "$par_no_default_filter" == "false" ]] && unset par_no_default_filter
[[ "$par_force_overwrite" == "false" ]] && unset par_force_overwrite
[[ "$par_verbose" == "false" ]] && unset par_verbose
[[ "$par_debug" == "false" ]] && unset par_debug

# Run lofreq call
lofreq call \
-f "$par_ref" \
-o "$par_out" \
${par_region:+-r "${par_region}"} \
${par_bed:+-l "${par_bed}"} \
${par_min_bq:+-q "${par_min_bq}"} \
${par_min_alt_bq:+-Q "${par_min_alt_bq}"} \
${par_def_alt_bq:+-R "${par_def_alt_bq}"} \
${par_min_jq:+-j "${par_min_jq}"} \
${par_alt_jq:+-K "${par_alt_jq}"} \
${par_no_baq:+-B} \
${par_no_idaq:+-A} \
${par_del_baq:+-D} \
${par_no_ext_baq:+-e} \
${par_min_mq:+-m "${par_min_mq}"} \
${par_max_mq:+-M "${par_max_mq}"} \
${par_no_mq:+-N} \
${par_call_indels:+--call-indels} \
${par_only_indels:+--only-indels} \
${par_src_qual:+-s} \
${par_ign_vcf:+-S "${par_ign_vcf}"} \
${par_def_nm_q:+-T "${par_def_nm_q}"} \
${par_sig:+-a "${par_sig}"} \
${par_bonf:+-b "${par_bonf}"} \
${par_min_cov:+-C "${par_min_cov}"} \
${par_max_depth:+-d "${par_max_depth}"} \
${par_illumina_13:+--illumina-1.3} \
${par_use_orphan:+--use-orphan} \
${par_plp_summary_only:+--plp-summary-only} \
${par_no_default_filter:+--no-default-filter} \
${par_force_overwrite:+--force-overwrite} \
${par_verbose:+--verbose} \
${par_debug:+--debug} \
"$par_input"
20 changes: 20 additions & 0 deletions src/lofreq/call/test.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
#!/bin/bash

set -e

dir_in="${meta_resources_dir%/}/test_data"

echo "> Run lofreq call"
"$meta_executable" \
--input "$dir_in/a.bam" \
--input_bai "$dir_in/a.bai" \
--ref "$dir_in/genome.fasta" \
--out "output.vcf" \

echo ">> Checking output"
[ ! -f "output.vcf" ] && echo "Output file output.vcf does not exist" && exit 1

echo ">> Check if output is empty"
[ ! -s "output.vcf" ] && echo "Output file output.vcf is empty" && exit 1

echo "> Test successful"
Binary file added src/lofreq/call/test_data/a.bai
Binary file not shown.
Binary file added src/lofreq/call/test_data/a.bam
Binary file not shown.
Loading
Loading