Bedtools merge #118

Merged: 13 commits, Sep 2, 2024
Changes from 11 commits
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -26,6 +26,7 @@
* `bedtools`:
    - `bedtools/bedtools_intersect`: Allows one to screen for overlaps between two sets of genomic features (PR #94).
    - `bedtools/bedtools_sort`: Sorts a feature file (bed/gff/vcf) by chromosome and other criteria (PR #98).
    - `bedtools/bedtools_merge`: Merges overlapping BED/GFF/VCF entries into a single interval (PR #118).


## MINOR CHANGES
166 changes: 166 additions & 0 deletions src/bedtools/bedtools_merge/config.vsh.yaml
@@ -0,0 +1,166 @@
name: bedtools_merge
namespace: bedtools
description: |
  Merges overlapping BED/GFF/VCF entries into a single interval.
links:
  documentation: https://bedtools.readthedocs.io/en/latest/content/tools/merge.html
  repository: https://github.com/arq5x/bedtools2
  homepage: https://bedtools.readthedocs.io/en/latest/#
  issue_tracker: https://github.com/arq5x/bedtools2/issues
references:
  doi: 10.1093/bioinformatics/btq033
license: MIT
requirements:
  commands: [bedtools]
authors:
  - __merge__: /src/_authors/theodoro_gasperin.yaml
    roles: [ author, maintainer ]

argument_groups:
  - name: Inputs
    arguments:
      - name: --input
        alternatives: -i
        type: file
        description: Input file (BED/GFF/VCF) to be merged.
        required: true

  - name: Outputs
    arguments:
      - name: --output
        type: file
        direction: output
        description: Output merged BED file to be written.

  - name: Options
    arguments:
      - name: --strand
        alternatives: -s
        type: boolean_true
        description: |
          Force strandedness. That is, only merge features
          that are on the same strand.
          - By default, merging is done without respect to strand.

      - name: --specific_strand
        alternatives: -S
        type: string
        choices: ["+", "-"]
        description: |
          Force merge for one specific strand only.
          Follow with + or - to force merge from only
          the forward or reverse strand, respectively.
          - By default, merging is done without respect to strand.

      - name: --distance
        alternatives: -d
        type: integer
        description: |
          Maximum distance between features allowed for features
          to be merged.
          - Default: 0. That is, overlapping and book-ended features are merged.
          - (INTEGER)
          - Note: negative values enforce the number of base pairs required for overlap.

      - name: --columns
        alternatives: -c
        type: string
        description: |
          Specify columns from the input file to operate upon (see --operation).
          Default: 5.
          Multiple columns can be specified in a comma-delimited list.

      - name: --operation
        alternatives: -o
        type: string
        description: |
          Specify the operation that should be applied to -c.
          Valid operations:
            sum, min, max, absmin, absmax,
            mean, median, mode, antimode
            stdev, sstdev
            collapse (i.e., print a delimited list (duplicates allowed)),
            distinct (i.e., print a delimited list (NO duplicates allowed)),
            distinct_sort_num (as distinct, sorted numerically, ascending),
            distinct_sort_num_desc (as distinct, sorted numerically, descending),
            distinct_only (delimited list of only unique values),
            count
            count_distinct (i.e., a count of the unique values in the column),
            first (i.e., just the first value in the column),
            last (i.e., just the last value in the column),
          Default: sum
          Multiple operations can be specified in a comma-delimited list.

          If there is only one column, but multiple operations, all operations will be
          applied on that column. Likewise, if there is only one operation, but
          multiple columns, that operation will be applied to all columns.
          Otherwise, the number of columns must match the number of operations,
          and will be applied in respective order.
          E.g., "-c 5,4,6 -o sum,mean,count" will give the sum of column 5,
          the mean of column 4, and the count of column 6.
          The order of output columns will match the ordering given in the command.

      - name: --delimiter
        alternatives: -delim
        type: string
        description: |
          Specify a custom delimiter for the collapse operations.
        example: "|"
        default: ","

      - name: --precision
        alternatives: -prec
        type: integer
        description: |
          Sets the decimal precision for output (Default: 5).

      - name: --bed
        type: boolean_true
        description: |
          If using BAM input, write output as BED.

      - name: --header
        type: boolean_true
        description: |
          Print the header from the input file prior to results.

      - name: --no_buffer
        alternatives: -nobuf
        type: boolean_true
        description: |
          Disable buffered output. Using this option will cause each line
          of output to be printed as it is generated, rather than saved
          in a buffer. This will make printing large output files
          noticeably slower, but can be useful in conjunction with
          other software tools and scripts that need to process one
          line of bedtools output at a time.

      # - name: --io_buffer
      #   type: boolean_true
      #   description: |
      #     Specify amount of memory to use for input buffer.
      #     Takes an integer argument. Optional suffixes K/M/G supported.
      #     Note: currently has no effect with compressed files.

resources:
  - type: bash_script
    path: script.sh

test_resources:
  - type: bash_script
    path: test.sh
  - path: test_data

engines:
  - type: docker
    image: debian:stable-slim
    setup:
      - type: apt
        packages: [bedtools, procps]
      - type: docker
        run: |
          echo "bedtools: \"$(bedtools --version | sed -n 's/^bedtools //p')\"" > /var/software_versions.txt

runners:
  - type: executable
  - type: nextflow
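
The `--operation` description in this config states how `-c` columns and `-o` operations are paired: one column with many operations, one operation with many columns, or equal-length lists applied in order. The following is a minimal bash sketch of that pairing rule only. It is not taken from bedtools or from this PR, and the function name `pair_columns_ops` is hypothetical.

```bash
#!/bin/bash

# pair_columns_ops COLS OPS
# Hypothetical helper: prints one "column:operation" pair per line,
# following the pairing rule described for -c and -o.
pair_columns_ops() {
    local -a cols ops
    local i
    IFS=',' read -r -a cols <<< "$1"
    IFS=',' read -r -a ops <<< "$2"
    local n_cols=${#cols[@]} n_ops=${#ops[@]}

    if (( n_cols == 1 )); then
        # One column, several operations: apply every operation to it.
        for (( i = 0; i < n_ops; i++ )); do echo "${cols[0]}:${ops[i]}"; done
    elif (( n_ops == 1 )); then
        # One operation, several columns: apply it to every column.
        for (( i = 0; i < n_cols; i++ )); do echo "${cols[i]}:${ops[0]}"; done
    elif (( n_cols == n_ops )); then
        # Equal counts: pair them in the order given.
        for (( i = 0; i < n_cols; i++ )); do echo "${cols[i]}:${ops[i]}"; done
    else
        echo "number of columns must match the number of operations" >&2
        return 1
    fi
}

pair_columns_ops "5,4,6" "sum,mean,count"
```

For the example from the description, `-c 5,4,6 -o sum,mean,count`, the sketch prints `5:sum`, `4:mean` and `6:count`, one per line, matching the stated output column order.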
85 changes: 85 additions & 0 deletions src/bedtools/bedtools_merge/help.txt
@@ -0,0 +1,85 @@
```bash
bedtools merge
```

Tool: bedtools merge (aka mergeBed)
Version: v2.30.0
Summary: Merges overlapping BED/GFF/VCF entries into a single interval.

Usage: bedtools merge [OPTIONS] -i <bed/gff/vcf>

Options:
-s Force strandedness. That is, only merge features
that are on the same strand.
- By default, merging is done without respect to strand.

-S Force merge for one specific strand only.
Follow with + or - to force merge from only
the forward or reverse strand, respectively.
- By default, merging is done without respect to strand.

-d Maximum distance between features allowed for features
to be merged.
- Def. 0. That is, overlapping & book-ended features are merged.
- (INTEGER)
- Note: negative values enforce the number of b.p. required for overlap.

-c Specify columns from the B file to map onto intervals in A.
Default: 5.
Multiple columns can be specified in a comma-delimited list.

-o Specify the operation that should be applied to -c.
Valid operations:
sum, min, max, absmin, absmax,
mean, median, mode, antimode
stdev, sstdev
collapse (i.e., print a delimited list (duplicates allowed)),
distinct (i.e., print a delimited list (NO duplicates allowed)),
distinct_sort_num (as distinct, sorted numerically, ascending),
distinct_sort_num_desc (as distinct, sorted numerically, desscending),
distinct_only (delimited list of only unique values),
count
count_distinct (i.e., a count of the unique values in the column),
first (i.e., just the first value in the column),
last (i.e., just the last value in the column),
Default: sum
Multiple operations can be specified in a comma-delimited list.

If there is only column, but multiple operations, all operations will be
applied on that column. Likewise, if there is only one operation, but
multiple columns, that operation will be applied to all columns.
Otherwise, the number of columns must match the the number of operations,
and will be applied in respective order.
E.g., "-c 5,4,6 -o sum,mean,count" will give the sum of column 5,
the mean of column 4, and the count of column 6.
The order of output columns will match the ordering given in the command.


-delim Specify a custom delimiter for the collapse operations.
- Example: -delim "|"
- Default: ",".

-prec Sets the decimal precision for output (Default: 5)

-bed If using BAM input, write output as BED.

-header Print the header from the A file prior to results.

-nobuf Disable buffered output. Using this option will cause each line
of output to be printed as it is generated, rather than saved
in a buffer. This will make printing large output files
noticeably slower, but can be useful in conjunction with
other software tools and scripts that need to process one
line of bedtools output at a time.

-iobuf Specify amount of memory to use for input buffer.
Takes an integer argument. Optional suffixes K/M/G supported.
Note: currently has no effect with compressed files.

Notes:
(1) The input file (-i) file must be sorted by chrom, then start.




***** ERROR: No input file given. Exiting. *****
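
Note (1) in the help text requires input sorted by chromosome, then start, and `-d` sets the maximum gap between features that still get merged. The sketch below illustrates both on made-up toy intervals without calling bedtools. The `merge_sorted` function is a hypothetical awk re-implementation written only for illustration. It assumes 3-column BED input, a non-negative distance, and that two features merge when the next start is at most the current end plus the distance. It is not bedtools' own algorithm.

```bash
#!/bin/bash
set -eo pipefail

# merge_sorted D: read sorted 3-column BED from stdin and merge features on
# the same chromosome whose gap is at most D (illustrative sketch, D >= 0).
merge_sorted() {
    awk -v d="$1" 'BEGIN { OFS = "\t" }
    NR == 1 { chrom = $1; start = $2 + 0; end = $3 + 0; next }
    $1 == chrom && $2 + 0 <= end + d { if ($3 + 0 > end) end = $3 + 0; next }
    { print chrom, start, end; chrom = $1; start = $2 + 0; end = $3 + 0 }
    END { if (NR > 0) print chrom, start, end }'
}

# Toy, deliberately unsorted intervals (made up for this example).
unsorted=$'chr1\t250\t500\nchr1\t100\t200\nchr1\t560\t600\nchr1\t180\t250\nchr2\t10\t20'

# Sort by chromosome, then numerically by start, as the note requires.
sorted=$(printf '%s\n' "$unsorted" | sort -k1,1 -k2,2n)

merged_d0=$(printf '%s\n' "$sorted" | merge_sorted 0)
merged_d100=$(printf '%s\n' "$sorted" | merge_sorted 100)

echo "distance 0:";   echo "$merged_d0"
echo "distance 100:"; echo "$merged_d100"
```

With distance 0 the overlapping and book-ended chr1 features collapse into `chr1 100 500` while `chr1 560 600` stays separate. With distance 100 the 60 bp gap is bridged and chr1 becomes a single interval `chr1 100 600`.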
35 changes: 35 additions & 0 deletions src/bedtools/bedtools_merge/script.sh
@@ -0,0 +1,35 @@
#!/bin/bash

## VIASH START
## VIASH END

# Exit on error
set -eo pipefail

# Unset parameters
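unset_if_false=(
    par_strand
    par_bed
    par_header
    par_no_buffer
)

# Boolean flags arrive as the string "false" when not set; unset them so
# that the ${par:+...} expansions below emit nothing for them.
for par in "${unset_if_false[@]}"; do
    test_val="${!par}"
    [[ "$test_val" == "false" ]] && unset "$par"
done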

# Execute bedtools merge with the provided arguments
bedtools merge \
    ${par_strand:+-s} \
    ${par_specific_strand:+-S "$par_specific_strand"} \
    ${par_bed:+-bed} \
    ${par_header:+-header} \
    ${par_no_buffer:+-nobuf} \
    ${par_distance:+-d "$par_distance"} \
    ${par_columns:+-c "$par_columns"} \
    ${par_operation:+-o "$par_operation"} \
    ${par_delimiter:+-delim "$par_delimiter"} \
    ${par_precision:+-prec "$par_precision"} \
    -i "$par_input" \
    > "$par_output"
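
The script relies on two bash features: unsetting boolean parameters whose value is the string `false`, and the `${var:+word}` expansion, which yields `word` only when `var` is set and non-empty. The sketch below shows how the two combine to build the argument list. The `par_*` values are assumptions standing in for what Viash would inject, and the command is only printed, so it runs without bedtools installed.

```bash
#!/bin/bash
set -eo pipefail

# Assumed example values (in the real component these come from Viash).
par_strand="false"
par_header="true"
par_distance="10"
par_input="input.bed"

# Unset boolean parameters that are "false" so ${par:+...} expands to nothing.
unset_if_false=(par_strand par_header)
for par in "${unset_if_false[@]}"; do
    test_val="${!par}"
    [[ "$test_val" == "false" ]] && unset "$par"
done

# Collect the arguments in an array instead of invoking bedtools.
# par_columns is never set here, so its expansion disappears as well.
args=(
    merge
    ${par_strand:+-s}
    ${par_header:+-header}
    ${par_distance:+-d "$par_distance"}
    ${par_columns:+-c "$par_columns"}
    -i "$par_input"
)

cmd="bedtools ${args[*]}"
echo "$cmd"
```

This prints `bedtools merge -header -d 10 -i input.bed`: `-s` is dropped because `par_strand` was unset, and `-c` is dropped because `par_columns` was never defined.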