Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bcftools stats #142

Merged
merged 15 commits into from
Sep 10, 2024
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -41,6 +41,7 @@
* `rsem/rsem_prepare_reference`: Prepare transcript references for RSEM (PR #89).

* `bcftools`:
- `bcftools/bcftools_stats`: Parses VCF or BCF and produces a txt stats file which can be plotted using plot-vcfstats (PR #142).
- `bcftools/bcftools_sort`: Sorts BCF/VCF files by position and other criteria (PR #141).

## MINOR CHANGES
Expand Down
240 changes: 240 additions & 0 deletions src/bcftools/bcftools_stats/config.vsh.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,240 @@
name: bcftools_stats
namespace: bcftools
description: |
Parses VCF or BCF and produces a txt stats file which can be plotted using plot-vcfstats.
When two files are given, the program generates separate stats for intersection
and the complements. By default only sites are compared, -s/-S must given to include
also sample columns.
keywords: [Stats, VCF, BCF]
links:
homepage: https://samtools.github.io/bcftools/
documentation: https://samtools.github.io/bcftools/bcftools.html#stats
repository: https://github.com/samtools/bcftools
issue_tracker: https://github.com/samtools/bcftools/issues
references:
doi: https://doi.org/10.1093/gigascience/giab008
license: MIT/Expat, GNU
requirements:
commands: [bcftools]
authors:
- __merge__: /src/_authors/theodoro_gasperin.yaml
roles: [ author ]

argument_groups:
- name: Inputs
arguments:
- name: --input
alternatives: -i
type: file
multiple: true
description: Input VCF/BCF file. Maximum of two files.
required: true

- name: Outputs
arguments:
- name: --output
alternatives: -o
direction: output
type: file
description: Output txt statistics file.
required: true

- name: Options
arguments:

- name: --allele_frequency_bins
alternatives: --af_bins
type: string
description: |
Allele frequency bins, a list of bin values (0.1,0.5,1).
example: 0.1,0.5,1

- name: --allele_frequency_bins_file
alternatives: --af_bins_file
type: file
description: |
Same as allele_frequency_bins, but in a file.
Format of file is one value per line.
e.g.
0.1
0.5
1

- name: --allele_frequency_tag
alternatives: --af_tag
type: string
description: |
Allele frequency tag to use, by default estimated from AN,AC or GT.

- name: --first_allele_only
alternatives: --first_only
type: boolean_true
description: |
Include only 1st allele at multiallelic sites.

- name: --collapse
alternatives: --c
type: string
choices: [ snps, indels, both, all, some, none ]
description: |
Treat as identical records with <snps|indels|both|all|some|none>.
See https://samtools.github.io/bcftools/bcftools.html#common_options for details.

- name: --depth
alternatives: --d
type: string
description: |
Depth distribution: min,max,bin size.
example: 0,500,1

- name: --exclude
alternatives: --e
type: string
description: |
Exclude sites for which the expression is true.
See https://samtools.github.io/bcftools/bcftools.html#expressions for details.
example: 'QUAL < 30 && DP < 10'

- name: --exons
alternatives: --E
type: file
description: |
tab-delimited file with exons for indel frameshifts statistics.
The columns of the file are CHR, FROM, TO, with 1-based, inclusive, positions.
The file is BGZF-compressed and indexed with tabix (e.g. tabix -s1 -b2 -e3 file.gz).

- name: --apply_filters
alternatives: --f
type: string
description: |
Require at least one of the listed FILTER strings (e.g. "PASS,.").

- name: --fasta_reference
alternatives: --F
type: file
description: |
Faidx indexed reference sequence file to determine INDEL context.

- name: --include
alternatives: --i
type: string
description: |
Select sites for which the expression is true.
See https://samtools.github.io/bcftools/bcftools.html#expressions for details.
example: 'QUAL >= 30 && DP >= 10'

- name: --split_by_ID
alternatives: --I
type: boolean_true
description: |
Collect stats for sites with ID separately (known vs novel).

- name: --regions
alternatives: --r
type: string
description: |
Restrict to comma-separated list of regions.
Following formats are supported: chr|chr:pos|chr:beg-end|chr:beg-[,…​].
example: '20:1000000-2000000'

- name: --regions_file
alternatives: --R
type: file
description: |
Restrict to regions listed in a file.
Regions can be specified either on a VCF, BED, or tab-delimited file (the default).
For more information check manual.

- name: --regions_overlap
type: string
choices: ['pos', 'record', 'variant', '0', '1', '2']
description: |
This option controls how overlapping records are determined:
set to 'pos' or '0' if the VCF record has to have POS inside a region (this corresponds to the default behavior of -t/-T);
set to 'record' or '1' if also overlapping records with POS outside a region should be included (this is the default behavior of -r/-R,
and includes indels with POS at the end of a region, which are technically outside the region);
or set to 'variant' or '2' to include only true overlapping variation (compare the full VCF representation "TA>T-" vs the true sequence variation "A>-").

- name: --samples
alternatives: --s
type: string
description: |
List of samples for sample stats, "-" to include all samples.

- name: --samples_file
alternatives: --S
type: file
description: |
File of samples to include.
tgaspe marked this conversation as resolved.
Show resolved Hide resolved
e.g.
sample1 1
sample2 2
sample3 2

- name: --targets
alternatives: --t
type: string
description: |
Similar as -r, --regions, but the next position is accessed by streaming the whole VCF/BCF
rather than using the tbi/csi index. Both -r and -t options can be applied simultaneously: -r uses the
index to jump to a region and -t discards positions which are not in the targets. Unlike -r, targets
can be prefixed with "^" to request logical complement. For example, "^X,Y,MT" indicates that
sequences X, Y and MT should be skipped. Yet another difference between the -t/-T and -r/-R is
that -r/-R checks for proper overlaps and considers both POS and the end position of an indel,
while -t/-T considers the POS coordinate only (by default; see also --regions-overlap and --targets-overlap).
Note that -t cannot be used in combination with -T.
tgaspe marked this conversation as resolved.
Show resolved Hide resolved
Following formats are supported: chr|chr:pos|chr:beg-end|chr:beg-[,…​].
example: '20:1000000-2000000'

- name: --targets_file
alternatives: --T
type: file
description: |
Similar to --regions_file option but streams rather than index-jumps.

- name: --targets_overlaps
type: string
choices: ['pos', 'record', 'variant', '0', '1', '2']
description: |
Include if POS in the region (0), record overlaps (1), variant overlaps (2).

- name: --user_tstv
alternatives: --u
type: string
description: |
Collect Ts/Tv stats for any tag using the given binning [0:1:100].
Format is <TAG[:min:max:n]>.
A subfield can be selected as e.g. 'PV4[0]', here the first value of the PV4 tag.


- name: --verbose
alternatives: --v
type: boolean_true
description: |
Produce verbose per-site and per-sample output.

resources:
- type: bash_script
path: script.sh

test_resources:
- type: bash_script
path: test.sh

engines:
- type: docker
image: debian:stable-slim
setup:
- type: apt
packages: [bcftools, procps]
- type: docker
run: |
echo "bcftools: \"$(bcftools --version | grep 'bcftools' | sed -n 's/^bcftools //p')\"" > /var/software_versions.txt
test_setup:
- type: apt
packages: [tabix]

runners:
- type: executable
- type: nextflow

35 changes: 35 additions & 0 deletions src/bcftools/bcftools_stats/help.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
```
bcftools stats -h
```

About: Parses VCF or BCF and produces stats which can be plotted using plot-vcfstats.
When two files are given, the program generates separate stats for intersection
and the complements. By default only sites are compared, -s/-S must given to include
also sample columns.
Usage: bcftools stats [options] <A.vcf.gz> [<B.vcf.gz>]

Options:
--af-bins LIST Allele frequency bins, a list (0.1,0.5,1) or a file (0.1\n0.5\n1)
--af-tag STRING Allele frequency tag to use, by default estimated from AN,AC or GT
-1, --1st-allele-only Include only 1st allele at multiallelic sites
-c, --collapse STRING Treat as identical records with <snps|indels|both|all|some|none>, see man page for details [none]
-d, --depth INT,INT,INT Depth distribution: min,max,bin size [0,500,1]
-e, --exclude EXPR Exclude sites for which the expression is true (see man page for details)
-E, --exons FILE.gz Tab-delimited file with exons for indel frameshifts (chr,beg,end; 1-based, inclusive, bgzip compressed)
-f, --apply-filters LIST Require at least one of the listed FILTER strings (e.g. "PASS,.")
-F, --fasta-ref FILE Faidx indexed reference sequence file to determine INDEL context
-i, --include EXPR Select sites for which the expression is true (see man page for details)
-I, --split-by-ID Collect stats for sites with ID separately (known vs novel)
-r, --regions REGION Restrict to comma-separated list of regions
-R, --regions-file FILE Restrict to regions listed in a file
--regions-overlap 0|1|2 Include if POS in the region (0), record overlaps (1), variant overlaps (2) [1]
-s, --samples LIST List of samples for sample stats, "-" to include all samples
-S, --samples-file FILE File of samples to include
-t, --targets REGION Similar to -r but streams rather than index-jumps
-T, --targets-file FILE Similar to -R but streams rather than index-jumps
--targets-overlap 0|1|2 Include if POS in the region (0), record overlaps (1), variant overlaps (2) [0]
-u, --user-tstv TAG[:min:max:n] Collect Ts/Tv stats for any tag using the given binning [0:1:100]
A subfield can be selected as e.g. 'PV4[0]', here the first value of the PV4 tag
--threads INT Use multithreading with <int> worker threads [0]
-v, --verbose Produce verbose per-site and per-sample output

56 changes: 56 additions & 0 deletions src/bcftools/bcftools_stats/script.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
#!/bin/bash

## VIASH START
## VIASH END

# Exit on error
set -eo pipefail

# Unset parameters
unset_if_false=(
par_first_allele_only
par_split_by_ID
par_verbose
)

for par in ${unset_if_false[@]}; do
test_val="${!par}"
[[ "$test_val" == "false" ]] && unset $par
done

# Create input array
IFS=";" read -ra input <<< $par_input

# Check the size of the input array
if [[ ${#input[@]} -gt 2 ]]; then
echo "Error: --input only takes a max of two files!"
exit 1
fi

# Execute bcftools stats with the provided arguments
bcftools stats \
${par_first_allele_only:+--1st-allele-only} \
${par_split_by_ID:+--split-by-ID} \
${par_verbose:+--verbose} \
${par_allele_frequency_bins:+--af-bins "${par_allele_frequency_bins}"} \
${par_allele_frequency_bins_file:+--af-bins "${par_allele_frequency_bins_file}"} \
${par_allele_frequency_tag:+--af-tag "${par_allele_frequency_tag}"} \
${par_collapse:+-c "${par_collapse}"} \
${par_depth:+-d "${par_depth}"} \
${par_exclude:+-e "${par_exclude}"} \
${par_exons:+-E "${par_exons}"} \
${par_apply_filters:+-f "${par_apply_filters}"} \
${par_fasta_reference:+-F "${par_fasta_reference}"} \
${par_include:+-i "${par_include}"} \
${par_regions:+-r "${par_regions}"} \
${par_regions_file:+-R "${par_regions_file}"} \
${par_regions_overlap:+--regions-overlap "${par_regions_overlap}"} \
${par_samples:+-s "${par_samples}"} \
${par_samples_file:+-S "${par_samples_file}"} \
${par_targets:+-t "${par_targets}"} \
${par_targets_file:+-T "${par_targets_file}"} \
${par_targets_overlaps:+--targets-overlap "${par_targets_overlaps}"} \
${par_user_tstv:+-u "${par_user_tstv}"} \
"${input[@]}" \
> $par_output

Loading