Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bcftools Norm #144

Merged
merged 14 commits into from
Sep 10, 2024
Merged
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,7 @@
* `rsem/rsem_prepare_reference`: Prepare transcript references for RSEM (PR #89).

* `bcftools`:
- `bcftools_norm`: Left-align and normalize indels, check if REF alleles match the reference, split multiallelic sites into multiple rows; recover multiallelics from multiple rows (PR #144).
- `bcftools_annotate`: Add or remove annotations from a VCF/BCF file (PR #143).
- `bcftools/bcftools_stats`: Parses VCF or BCF and produces a txt stats file which can be plotted using plot-vcfstats (PR #142).
- `bcftools/bcftools_sort`: Sorts BCF/VCF files by position and other criteria (PR #141).
Expand Down
194 changes: 194 additions & 0 deletions src/bcftools/bcftools_norm/config.vsh.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,194 @@
name: bcftools_norm
namespace: bcftools
description: |
Left-align and normalize indels, check if REF alleles match the reference, split multiallelic sites into multiple rows;
recover multiallelics from multiple rows.
keywords: [Normalize, VCF, BCF]
links:
homepage: https://samtools.github.io/bcftools/
documentation: https://samtools.github.io/bcftools/bcftools.html#norm
repository: https://github.com/samtools/bcftools
issue_tracker: https://github.com/samtools/bcftools/issues
references:
doi: https://doi.org/10.1093/gigascience/giab008
license: MIT/Expat, GNU
requirements:
commands: [bcftools]
authors:
- __merge__: /src/_authors/theodoro_gasperin.yaml
roles: [author]

argument_groups:
- name: Inputs
arguments:
- name: --input
alternatives: -i
type: file
description: Input VCF/BCF file.
required: true

- name: Outputs
arguments:
- name: --output
alternatives: -o
direction: output
type: file
description: Output normalized VCF/BCF file.
required: true

- name: Options
arguments:

- name: --atomize
alternatives: -a
type: boolean_true
description: |
Decompose complex variants (e.g., MNVs become consecutive SNVs).

- name: --atom_overlaps
type: string
choices: [".", "*"]
description: |
Use the star allele (*) for overlapping alleles or set to missing (.).

- name: --check_ref
alternatives: -c
type: string
choices: ['e', 'w', 'x', 's']
description: |
Check REF alleles and exit (e), warn (w), exclude (x), or set (s) bad sites.

- name: --remove_duplicates
alternatives: -d
type: string
choices: ['snps', 'indels', 'both', 'all', 'exact', 'none']
description: Remove duplicate snps, indels, both, all, exact matches, or none (old -D option).

- name: --fasta_ref
alternatives: -f
type: file
description: Reference fasta sequence file.

- name: --force
type: boolean_true
description: |
Try to proceed even if malformed tags are encountered.
Experimental, use at your own risk.

- name: --keep_sum
type: string
description: |
Keep vector sum constant when splitting multiallelics (see github issue #360).

- name: --multiallelics
alternatives: -m
type: string
choices: ['+snps', '+indels', '+both', '+any', '-snps', '-indels', '-both', '-any']
description: |
Split multiallelics (-) or join biallelics (+), type: snps, indels, both, any [default: both].

- name: --no_version
type: boolean_true
description: Do not append version and command line information to the header.

- name: --do_not_normalize
alternatives: -N
type: boolean_true
description: Do not normalize indels (with -m or -c s).

- name: --output_type
alternatives: --O
type: string
choices: ['u', 'z', 'b', 'v']
description: |
Output type:
u: uncompressed BCF
z: compressed VCF
b: compressed BCF
v: uncompressed VCF

- name: --old_rec_tag
type: string
description: Annotate modified records with INFO/STR indicating the original variant.

- name: --regions
alternatives: --r
type: string
description: |
Restrict to comma-separated list of regions.
Following formats are supported: chr|chr:pos|chr:beg-end|chr:beg-[,…​].
example: '20:1000000-2000000'

- name: --regions_file
alternatives: --R
type: file
description: |
Restrict to regions listed in a file.
Regions can be specified either on a VCF, BED, or tab-delimited file (the default).
For more information check manual.

- name: --regions_overlap
type: string
choices: ['pos', 'record', 'variant', '0', '1', '2']
description: |
This option controls how overlapping records are determined:
set to 'pos' or '0' if the VCF record has to have POS inside a region (this corresponds to the default behavior of -t/-T);
set to 'record' or '1' if also overlapping records with POS outside a region should be included (this is the default behavior of -r/-R,
and includes indels with POS at the end of a region, which are technically outside the region);
or set to 'variant' or '2' to include only true overlapping variation (compare the full VCF representation "TA>T-" vs the true sequence variation "A>-").

- name: --site_win
alternatives: -w
type: integer
description: |
Buffer for sorting lines that changed position during realignment.

- name: --strict_filter
alternatives: -s
type: boolean_true
description: When merging (-m+), merged site is PASS only if all sites being merged PASS.

- name: --targets
alternatives: -t
type: string
description: Similar to --regions but streams rather than index-jumps.
example: '20:1000000-2000000'

- name: --targets_file
alternatives: -T
type: file
description: Similar to --regions_file but streams rather than index-jumps.

- name: --targets_overlap
type: string
choices: ['pos', 'record', 'variant', '0', '1', '2']
description: |
Include if POS in the region (0), record overlaps (1), variant overlaps (2).
Similar to --regions_overlap.

resources:
- type: bash_script
path: script.sh

test_resources:
- type: bash_script
path: test.sh

engines:
- type: docker
image: debian:stable-slim
setup:
- type: apt
packages: [bcftools, procps]
- type: docker
run: |
echo "bcftools: \"$(bcftools --version | grep 'bcftools' | sed -n 's/^bcftools //p')\"" > /var/software_versions.txt
test_setup:
- type: apt
packages: [tabix]

runners:
- type: executable
- type: nextflow


41 changes: 41 additions & 0 deletions src/bcftools/bcftools_norm/help.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
```
bcftools norm -h
```

About: Left-align and normalize indels; check if REF alleles match the reference;
split multiallelic sites into multiple rows; recover multiallelics from
multiple rows.
Usage: bcftools norm [options] <in.vcf.gz>

Options:
-a, --atomize Decompose complex variants (e.g. MNVs become consecutive SNVs)
--atom-overlaps '*'|. Use the star allele (*) for overlapping alleles or set to missing (.) [*]
-c, --check-ref e|w|x|s Check REF alleles and exit (e), warn (w), exclude (x), or set (s) bad sites [e]
-D, --remove-duplicates Remove duplicate lines of the same type.
-d, --rm-dup TYPE Remove duplicate snps|indels|both|all|exact
-f, --fasta-ref FILE Reference sequence
--force Try to proceed even if malformed tags are encountered. Experimental, use at your own risk
--keep-sum TAG,.. Keep vector sum constant when splitting multiallelics (see github issue #360)
-m, --multiallelics -|+TYPE Split multiallelics (-) or join biallelics (+), type: snps|indels|both|any [both]
--no-version Do not append version and command line to the header
-N, --do-not-normalize Do not normalize indels (with -m or -c s)
--old-rec-tag STR Annotate modified records with INFO/STR indicating the original variant
-o, --output FILE Write output to a file [standard output]
-O, --output-type u|b|v|z[0-9] u/b: un/compressed BCF, v/z: un/compressed VCF, 0-9: compression level [v]
-r, --regions REGION Restrict to comma-separated list of regions
-R, --regions-file FILE Restrict to regions listed in a file
--regions-overlap 0|1|2 Include if POS in the region (0), record overlaps (1), variant overlaps (2) [1]
-s, --strict-filter When merging (-m+), merged site is PASS only if all sites being merged PASS
-t, --targets REGION Similar to -r but streams rather than index-jumps
-T, --targets-file FILE Similar to -R but streams rather than index-jumps
--targets-overlap 0|1|2 Include if POS in the region (0), record overlaps (1), variant overlaps (2) [0]
--threads INT Use multithreading with <int> worker threads [0]
-w, --site-win INT Buffer for sorting lines which changed position during realignment [1000]

Examples:
# normalize and left-align indels
bcftools norm -f ref.fa in.vcf

# split multi-allelic sites
bcftools norm -m- in.vcf

49 changes: 49 additions & 0 deletions src/bcftools/bcftools_norm/script.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
#!/bin/bash

## VIASH START
## VIASH END

# Exit on error
set -eo pipefail

# Unset parameters
unset_if_false=(
par_atomize
par_remove_duplicates
par_force
par_no_version
par_do_not_normalize
par_strict_filter
)

for par in ${unset_if_false[@]}; do
test_val="${!par}"
[[ "$test_val" == "false" ]] && unset $par
done

# Execute bcftools norm with the provided arguments
bcftools norm \
${par_atomize:+--atomize} \
${par_atom_overlaps:+--atom-overlaps "$par_atom_overlaps"} \
${par_check_ref:+-c "$par_check_ref"} \
${par_remove_duplicates:+-d "$par_remove_duplicates"} \
${par_fasta_ref:+-f "$par_fasta_ref"} \
${par_force:+--force} \
${par_keep_sum:+--keep-sum "$par_keep_sum"} \
${par_multiallelics:+-m "$par_multiallelics"} \
${par_no_version:+--no-version} \
${par_do_not_normalize:+-N} \
${par_old_rec_tag:+--old-rec-tag "$par_old_rec_tag"} \
${par_regions:+-r "$par_regions"} \
${par_regions_file:+-R "$par_regions_file"} \
${par_regions_overlap:+--regions-overlap "$par_regions_overlap"} \
${par_site_win:+-w "$par_site_win"} \
${par_strict_filter:+-s} \
${par_targets:+-t "$par_targets"} \
${par_targets_file:+-T "$par_targets_file"} \
${par_targets_overlap:+--targets-overlap "$par_targets_overlap"} \
${meta_cpus:+--threads "$meta_cpus"} \
${par_output_type:+-O "$par_output_type"} \
-o $par_output \
$par_input

Loading