Bedtools GroupBY #123

Merged: 18 commits, Sep 2, 2024
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -26,7 +26,7 @@
* `bedtools`:
- `bedtools/bedtools_intersect`: Allows one to screen for overlaps between two sets of genomic features (PR #94).
- `bedtools/bedtools_sort`: Sorts a feature file (bed/gff/vcf) by chromosome and other criteria (PR #98).

- `bedtools/bedtools_groupby`: Summarizes a dataset column based upon common column groupings. Akin to the SQL "group by" command (PR #123).

## MINOR CHANGES

155 changes: 155 additions & 0 deletions src/bedtools/bedtools_groupby/config.vsh.yaml
@@ -0,0 +1,155 @@
name: bedtools_groupby
namespace: bedtools
description: |
Summarizes a dataset column based upon common column groupings.
Akin to the SQL "group by" command.
keywords: [groupby, BED]
links:
documentation: https://bedtools.readthedocs.io/en/latest/content/tools/groupby.html
repository: https://github.com/arq5x/bedtools2
homepage: https://bedtools.readthedocs.io/en/latest/#
issue_tracker: https://github.com/arq5x/bedtools2/issues
references:
doi: 10.1093/bioinformatics/btq033
license: MIT
requirements:
commands: [bedtools]
authors:
- __merge__: /src/_authors/theodoro_gasperin.yaml
roles: [ author, maintainer ]

argument_groups:
- name: Inputs
arguments:
- name: --input
alternatives: -i
type: file
direction: input
description: |
The input BED file to be used.
required: true
example: input_a.bed

- name: Outputs
arguments:
- name: --output
type: file
direction: output
description: |
The output groupby BED file.
required: true
example: output.bed

- name: Options
arguments:
- name: --groupby
alternatives: [-g, -grp]
type: string
description: |
Specify the columns (1-based) for the grouping.
The columns must be comma separated.
- Default: 1,2,3
required: true

- name: --column
alternatives: [-c, -opCols]
type: string
description: |
Specify the column (1-based) that should be summarized.
required: true

- name: --operation
alternatives: [-o, -ops]
type: string
description: |
Specify the operation that should be applied to opCol.
Valid operations:
sum, count, count_distinct, min, max,
mean, median, mode, antimode,
stdev, sstdev (sample standard dev.),
collapse (i.e., print a comma separated list (duplicates allowed)),
distinct (i.e., print a comma separated list (NO duplicates allowed)),
distinct_sort_num (as distinct, but sorted numerically, ascending),
distinct_sort_num_desc (as distinct, but sorted numerically, descending),
concat (i.e., merge values into a single, non-delimited string),
freqdesc (i.e., print desc. list of values:freq)
freqasc (i.e., print asc. list of values:freq)
first (i.e., print first value)
last (i.e., print last value)

Default value: sum

If there is only one column, but multiple operations, all operations will be
applied on that column. Likewise, if there is only one operation, but
multiple columns, that operation will be applied to all columns.
Otherwise, the number of columns must match the number of operations,
and will be applied in respective order.
E.g., "-c 5,4,6 -o sum,mean,count" will give the sum of column 5,
the mean of column 4, and the count of column 6.
The order of output columns will match the ordering given in the command.

- name: --full
type: boolean_true
description: |
Print all columns from input file. The first line in the group is used.
Default: print only grouped columns.

- name: --inheader
type: boolean_true
description: |
Input file has a header line - the first line will be ignored.

- name: --outheader
type: boolean_true
description: |
Print header line in the output, detailing the column names.
If the input file has headers (-inheader), the output file
will use the input's column names.
If the input file has no headers, the output file
will use "col_1", "col_2", etc. as the column names.

- name: --header
type: boolean_true
description: same as '-inheader -outheader'.

- name: --ignorecase
type: boolean_true
description: |
Group values regardless of upper/lower case.

- name: --precision
alternatives: -prec
type: integer
description: |
Sets the decimal precision for output.
default: 5

- name: --delimiter
alternatives: -delim
type: string
description: |
Specify a custom delimiter for the collapse operations.
example: "|"
default: ","

resources:
- type: bash_script
path: script.sh

test_resources:
- type: bash_script
path: test.sh

engines:
- type: docker
image: debian:stable-slim
setup:
- type: apt
packages: [bedtools, procps]
- type: docker
run: |
echo "bedtools: \"$(bedtools --version | sed -n 's/^bedtools //p')\"" > /var/software_versions.txt

runners:
- type: executable
- type: nextflow
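
The pairing rule between `--column` and `--operation` described in the config above can be sketched in plain bash. This is only an illustration of the documented rule, not code from bedtools or from this PR; the helper name `pair_cols_ops` and its `column:operation` output format are hypothetical.

```bash
#!/bin/bash
# Hypothetical helper (not part of bedtools or this PR): expands
# comma-separated column and operation lists into column:operation pairs,
# following the rule stated in the --operation description.
pair_cols_ops() {
  local -a cols ops
  IFS=',' read -r -a cols <<< "$1"
  IFS=',' read -r -a ops <<< "$2"
  local n_cols=${#cols[@]} n_ops=${#ops[@]} i

  if (( n_cols == 1 )); then
    # One column, one or more operations: every operation uses that column.
    for (( i = 0; i < n_ops; i++ )); do echo "${cols[0]}:${ops[i]}"; done
  elif (( n_ops == 1 )); then
    # Several columns, one operation: the operation is applied to each column.
    for (( i = 0; i < n_cols; i++ )); do echo "${cols[i]}:${ops[0]}"; done
  elif (( n_cols == n_ops )); then
    # Equal counts: pair them up in the order given.
    for (( i = 0; i < n_cols; i++ )); do echo "${cols[i]}:${ops[i]}"; done
  else
    echo "number of columns ($n_cols) and operations ($n_ops) must match" >&2
    return 1
  fi
}

# Mirrors the "-c 5,4,6 -o sum,mean,count" example from the description.
pair_cols_ops "5,4,6" "sum,mean,count"
```

For the example call, the sketch pairs column 5 with `sum`, column 4 with `mean`, and column 6 with `count`, in that order, which matches the output ordering the description promises.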
93 changes: 93 additions & 0 deletions src/bedtools/bedtools_groupby/help.txt
@@ -0,0 +1,93 @@
```bash
bedtools groupby
```

Tool: bedtools groupby
Version: v2.30.0
Summary: Summarizes a dataset column based upon
common column groupings. Akin to the SQL "group by" command.

Usage: bedtools groupby -g [group_column(s)] -c [op_column(s)] -o [ops]
cat [FILE] | bedtools groupby -g [group_column(s)] -c [op_column(s)] -o [ops]

Options:
-i Input file. Assumes "stdin" if omitted.

-g -grp Specify the columns (1-based) for the grouping.
The columns must be comma separated.
- Default: 1,2,3

-c -opCols Specify the column (1-based) that should be summarized.
- Required.

-o -ops Specify the operation that should be applied to opCol.
Valid operations:
sum, count, count_distinct, min, max,
mean, median, mode, antimode,
stdev, sstdev (sample standard dev.),
collapse (i.e., print a comma separated list (duplicates allowed)),
distinct (i.e., print a comma separated list (NO duplicates allowed)),
distinct_sort_num (as distinct, but sorted numerically, ascending),
distinct_sort_num_desc (as distinct, but sorted numerically, descending),
concat (i.e., merge values into a single, non-delimited string),
freqdesc (i.e., print desc. list of values:freq)
freqasc (i.e., print asc. list of values:freq)
first (i.e., print first value)
last (i.e., print last value)
- Default: sum

If there is only column, but multiple operations, all operations will be
applied on that column. Likewise, if there is only one operation, but
multiple columns, that operation will be applied to all columns.
Otherwise, the number of columns must match the the number of operations,
and will be applied in respective order.
E.g., "-c 5,4,6 -o sum,mean,count" will give the sum of column 5,
the mean of column 4, and the count of column 6.
The order of output columns will match the ordering given in the command.


-full Print all columns from input file. The first line in the group is used.
Default: print only grouped columns.

-inheader Input file has a header line - the first line will be ignored.

-outheader Print header line in the output, detailing the column names.
If the input file has headers (-inheader), the output file
will use the input's column names.
If the input file has no headers, the output file
will use "col_1", "col_2", etc. as the column names.

-header same as '-inheader -outheader'

-ignorecase Group values regardless of upper/lower case.

-prec Sets the decimal precision for output (Default: 5)

-delim Specify a custom delimiter for the collapse operations.
- Example: -delim "|"
- Default: ",".

Examples:
$ cat ex1.out
chr1 10 20 A chr1 15 25 B.1 1000 ATAT
chr1 10 20 A chr1 25 35 B.2 10000 CGCG

$ groupBy -i ex1.out -g 1,2,3,4 -c 9 -o sum
chr1 10 20 A 11000

$ groupBy -i ex1.out -grp 1,2,3,4 -opCols 9,9 -ops sum,max
chr1 10 20 A 11000 10000

$ groupBy -i ex1.out -g 1,2,3,4 -c 8,9 -o collapse,mean
chr1 10 20 A B.1,B.2, 5500

$ cat ex1.out | groupBy -g 1,2,3,4 -c 8,9 -o collapse,mean
chr1 10 20 A B.1,B.2, 5500

$ cat ex1.out | groupBy -g 1,2,3,4 -c 10 -o concat
chr1 10 20 A ATATCGCG

Notes:
(1) The input file/stream should be sorted/grouped by the -grp. columns
(2) If -i is unspecified, input is assumed to come from stdin.
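
Note (1) matters in practice. The sketch below uses a small `awk` stand-in rather than bedtools itself, under the assumption that grouping only merges consecutive lines sharing the same key, which is what the sorting requirement suggests. The function name `sum_consecutive` and the two-column sample data are hypothetical.

```bash
#!/bin/bash
# Stand-in for something like "groupby -g 1 -c 2 -o sum" (an assumption made
# for illustration, not bedtools code): it only merges *consecutive* lines
# that share the same value in column 1.
sum_consecutive() {
  awk -F'\t' -v OFS='\t' '
    NR > 1 && $1 != key { print key, total; total = 0 }
    { key = $1; total += $2 }
    END { if (NR > 0) print key, total }
  '
}

# Hypothetical input where the two chr1 lines are not adjacent.
unsorted=$'chr1\t10\nchr2\t5\nchr1\t20'

echo "unsorted input:"
printf '%s\n' "$unsorted" | sum_consecutive

echo "sorted on the grouping column first:"
printf '%s\n' "$unsorted" | sort -k1,1 | sum_consecutive
```

With the unsorted input, `chr1` is reported twice because its lines are separated by `chr2`; after `sort -k1,1` it collapses into a single group.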

36 changes: 36 additions & 0 deletions src/bedtools/bedtools_groupby/script.sh
@@ -0,0 +1,36 @@
#!/bin/bash

## VIASH START
## VIASH END

# Exit on error
set -eo pipefail

# Unset boolean flags that are "false" so that ${par:+...} expands to nothing
unset_if_false=(
par_full
par_inheader
par_outheader
par_header
par_ignorecase
)

for par in "${unset_if_false[@]}"; do
test_val="${!par}"
[[ "$test_val" == "false" ]] && unset "$par"
done

bedtools groupby \
${par_full:+-full} \
${par_inheader:+-inheader} \
${par_outheader:+-outheader} \
${par_header:+-header} \
${par_ignorecase:+-ignorecase} \
${par_precision:+-prec "$par_precision"} \
${par_delimiter:+-delim "$par_delimiter"} \
-i "$par_input" \
-g "$par_groupby" \
-c "$par_column" \
${par_operation:+-o "$par_operation"} \
> "$par_output"

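
The flag handling in `script.sh` relies on two bash features: indirect expansion (`${!par}`) and the `${var:+word}` form, which expands to `word` only when `var` is set and non-empty. A minimal sketch of that pattern follows, with hypothetical `par_*` values standing in for what Viash would inject and with the arguments printed instead of passed to bedtools.

```bash
#!/bin/bash
# Hypothetical stand-in values; in the real component Viash sets these.
par_full="true"
par_header="false"
par_delimiter="|"

# Same idea as the script: unset flags that are "false" so that
# ${var:+...} expands to nothing for them.
for par in par_full par_header; do
  [[ "${!par}" == "false" ]] && unset "$par"
done

# Collect the arguments in an array so a value such as "|" stays one word.
args=(
  ${par_full:+-full}
  ${par_header:+-header}
  ${par_delimiter:+-delim "$par_delimiter"}
)

# Print one argument per line instead of calling bedtools.
printf '%s\n' "${args[@]}"
```

Here `-header` is dropped because `par_header` was unset, while `-full` and `-delim` with its value survive. Building an array first is a design choice of this sketch; the script in the PR places the expansions directly on the `bedtools groupby` command line.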