Skip to content

Commit

Permalink
Add pear (#10)
Browse files Browse the repository at this point in the history
* WIP

* update arguments in config

* add script code

* Add test

* update CHANGELOG

* add requirements

* Apply suggestions from code review

Co-authored-by: Robrecht Cannoodt <[email protected]>

* Update with review comments

* update test_data

* update outputs

* update with latest review

* Apply suggestions from code review

---------

Co-authored-by: Robrecht Cannoodt <[email protected]>
  • Loading branch information
KaiWaldrant and rcannood authored Feb 2, 2024
1 parent 16d7fd3 commit f54af0a
Show file tree
Hide file tree
Showing 8 changed files with 359 additions and 2 deletions.
7 changes: 5 additions & 2 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,13 +5,16 @@
## NEW FEATURES

* `arriba`: Detect gene fusions from RNA-seq data (PR #1).
* `busco`: Assess genome assembly and annotation completeness with single copy orthologs (PR #6).

* `fastp`: An ultra-fast all-in-one FASTQ preprocessor (PR #3).

* `busco`: Assess genome assembly and annotation completeness with single copy orthologs (PR #6).

* `featurecounts`: Assign sequence reads to genomic features (PR #11).

* `bgzip`: Add bgzip functionality to compress and decompress files (PR #13).

* `featurecounts`: Assign sequence reads to genomic features (PR #11).
* `pear`: Paired-end read merger (PR #10).

## MAJOR CHANGES

Expand Down
159 changes: 159 additions & 0 deletions src/pear/config.vsh.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,159 @@
functionality:
name: pear
description: |
PEAR is an ultrafast, memory-efficient and highly accurate pair-end read merger. It is fully parallelized and can run with as low as just a few kilobytes of memory.
PEAR evaluates all possible paired-end read overlaps and without requiring the target fragment size as input. In addition, it implements a statistical test for minimizing false-positive results. Together with a highly optimized implementation, it can merge millions of paired end reads within a couple of minutes on a standard desktop computer.
info:
keywords: [ "pair-end", "read", "merge" ]
homepage: https://cme.h-its.org/exelixis/web/software/pear
repository: https://github.com/tseemann/PEAR
documentation: https://cme.h-its.org/exelixis/web/software/pear/doc.html
reference: doi:10.1093/bioinformatics/btt593
licence: "CC-BY-NC-SA-3.0"
requirements:
commands: [ pear , gzip ]
argument_groups:
- name: Inputs
arguments:
- name: --forward_fastq
alternatives: -f
type: file
description: Forward paired-end FASTQ file
required: true
example: "forward.fastq"
- name: --reverse_fastq
alternatives: -r
type: file
description: Reverse paired-end FASTQ file
required: true
example: "reverse.fastq"
- name: Outputs
arguments:
- name: --assembled
type: file
description: The output file containing assembled reads. Can be compressed with gzip.
required: true
direction: output
- name: --unassembled_forward
type: file
description: The output file containing forward reads that could not be assembled. Can be compressed with gzip.
required: true
direction: output
- name: --unassembled_reverse
type: file
description: The output file containing reverse reads that could not be assembled. Can be compressed with gzip.
required: true
direction: output
- name: --discarded
type: file
description: The output file containing reads that were discarded due to too low quality or too many uncalled bases. Can be compressed with gzip.
required: true
direction: output
- name: Arguments
arguments:
- name: --p_value
alternatives: -p
type: double
description: |
Specify a p-value for the statistical test. If the computed p-value of a possible assembly exceeds the specified p-value then paired-end read will not be assembled. Valid options are: 0.0001, 0.001, 0.01, 0.05 and 1.0. Setting 1.0 disables the test.
example: 0.01
required: false
- name: --min_overlap
alternatives: -v
type: integer
description: |
Specify the minimum overlap size. The minimum overlap may be set to 1 when the statistical test is used. However, further restricting the minimum overlap size to a proper value may reduce false-positive assembles.
required: false
example: 10
- name: --max_assembly_length
alternatives: -m
type: integer
description: |
Specify the maximum possible length of the assembled sequences. Setting this value to 0 disables the restriction and assembled sequences may be arbitrary long.
required: false
example: 0
- name: --min_assembly_length
alternatives: -n
type: integer
description: |
Specify the minimum possible length of the assembled sequences. Setting this value to 0 disables the restriction and assembled sequences may be arbitrary short.
required: false
example: 0
- name: --min_trim_length
alternatives: -t
type: integer
description: |
Specify the minimum length of reads after trimming the low quality part (see option -q)
required: false
example: 1
- name: --quality_threshold
alternatives: -q
type: integer
description: |
Specify the quality threshold for trimming the low quality part of a read. If the quality scores of two consecutive bases are strictly less than the specified threshold, the rest of the read will be trimmed.
required: false
example: 0
- name: --max_uncalled_base
alternatives: -u
type: double
description: |
Specify the maximal proportion of uncalled bases in a read. Setting this value to 0 will cause PEAR to discard all reads containing uncalled bases. The other extreme setting is 1 which causes PEAR to process all reads independent on the number of uncalled bases.
example: 1.0
required: false
- name: --test_method
alternatives: -g
type: integer
description: |
Specify the type of statistical test. Two options are available. 1: Given the minimum allowed overlap, test using the highest OES. Note that due to its discrete nature, this test usually yields a lower p-value for the assembled read than the cut- off (specified by -p). For example, setting the cut-off to 0.05 using this test, the assembled reads might have an actual p-value of 0.02.
2. Use the acceptance probability (m.a.p). This test methods computes the same probability as test method 1. However, it assumes that the minimal overlap is the observed overlap with the highest OES, instead of the one specified by -v. Therefore, this is not a valid statistical test and the 'p-value' is in fact the maximal probability for accepting the assembly. Nevertheless, we observed in practice that for the case the actual overlap sizes are relatively small, test 2 can correctly assemble more reads with only slightly higher false-positive rate.
required: false
example: 1
- name: --emperical_freqs
alternatives: -e
type: boolean_true
description: |
Disable empirical base frequencies.
- name: --score_method
alternatives: -s
type: integer
description: |
Specify the scoring method. 1. OES with +1 for match and -1 for mismatch. 2: Assembly score (AS). Use +1 for match and -1 for mismatch multiplied by base quality scores. 3: Ignore quality scores and use +1 for a match and -1 for a mismatch.
required: false
example: 2
- name: --phred_base
alternatives: -b
type: integer
description: |
Base PHRED quality score.
required: false
example: 33
- name: --cap
alternatives: -c
type: integer
description: |
Specify the upper bound for the resulting quality score. If set to zero, capping is disabled.
required: false
example: 40
- name: --nbase
alternatives: -z
type: boolean_true
description: |
When merging a base-pair that consists of two non-equal bases out of which none is degenerate, set the merged base to N and use the highest quality score of the two bases
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- type: file
path: test_data
platforms:
- type: docker
image: quay.io/biocontainers/pear:0.9.6--h9d449c0_10
setup:
- type: docker
run: |
version=$(pear -h | grep 'PEAR v' | sed 's/PEAR v//' | sed 's/ .*//') && \
echo "pear: $version" > /var/software_versions.txt
- type: nextflow
91 changes: 91 additions & 0 deletions src/pear/help.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,91 @@
```bash
pear -h
```

____ _____ _ ____
| _ \| ____| / \ | _ \
| |_) | _| / _ \ | |_) |
| __/| |___ / ___ \| _ <
|_| |_____/_/ \_\_| \_\
PEAR v0.9.6 [January 15, 2015] - [+bzlib +zlib]

Citation - PEAR: a fast and accurate Illumina Paired-End reAd mergeR
Zhang et al (2014) Bioinformatics 30(5): 614-620 | doi:10.1093/bioinformatics/btt593

License: Creative Commons Licence
Bug-reports and requests to: [email protected] and [email protected]


Usage: pear <options>
Standard (mandatory):
-f, --forward-fastq <str> Forward paired-end FASTQ file.
-r, --reverse-fastq <str> Reverse paired-end FASTQ file.
-o, --output <str> Output filename.
Optional:
-p, --p-value <float> Specify a p-value for the statistical test. If the computed
p-value of a possible assembly exceeds the specified p-value
then paired-end read will not be assembled. Valid options
are: 0.0001, 0.001, 0.01, 0.05 and 1.0. Setting 1.0 disables
the test. (default: 0.01)
-v, --min-overlap <int> Specify the minimum overlap size. The minimum overlap may be
set to 1 when the statistical test is used. However, further
restricting the minimum overlap size to a proper value may
reduce false-positive assembles. (default: 10)
-m, --max-assembly-length <int> Specify the maximum possible length of the assembled
sequences. Setting this value to 0 disables the restriction
and assembled sequences may be arbitrary long. (default: 0)
-n, --min-assembly-length <int> Specify the minimum possible length of the assembled
sequences. Setting this value to 0 disables the restriction
and assembled sequences may be arbitrary short. (default:
50)
-t, --min-trim-length <int> Specify the minimum length of reads after trimming the low
quality part (see option -q). (default: 1)
-q, --quality-threshold <int> Specify the quality score threshold for trimming the low
quality part of a read. If the quality scores of two
consecutive bases are strictly less than the specified
threshold, the rest of the read will be trimmed. (default:
0)
-u, --max-uncalled-base <float> Specify the maximal proportion of uncalled bases in a read.
Setting this value to 0 will cause PEAR to discard all reads
containing uncalled bases. The other extreme setting is 1
which causes PEAR to process all reads independent on the
number of uncalled bases. (default: 1)
-g, --test-method <int> Specify the type of statistical test. Two options are
available. (default: 1)
1: Given the minimum allowed overlap, test using the highest
OES. Note that due to its discrete nature, this test usually
yields a lower p-value for the assembled read than the cut-
off (specified by -p). For example, setting the cut-off to
0.05 using this test, the assembled reads might have an
actual p-value of 0.02.

2. Use the acceptance probability (m.a.p). This test methods
computes the same probability as test method 1. However, it
assumes that the minimal overlap is the observed overlap
with the highest OES, instead of the one specified by -v.
Therefore, this is not a valid statistical test and the
'p-value' is in fact the maximal probability for accepting
the assembly. Nevertheless, we observed in practice that for
the case the actual overlap sizes are relatively small, test
2 can correctly assemble more reads with only slightly
higher false-positive rate.
-e, --empirical-freqs Disable empirical base frequencies. (default: use empirical
base frequencies)
-s, --score-method <int> Specify the scoring method. (default: 2)
1. OES with +1 for match and -1 for mismatch.
2: Assembly score (AS). Use +1 for match and -1 for mismatch
multiplied by base quality scores.
3: Ignore quality scores and use +1 for a match and -1 for a
mismatch.
-b, --phred-base <int> Base PHRED quality score. (default: 33)
-y, --memory <str> Specify the amount of memory to be used. The number may be
followed by one of the letters K, M, or G denoting
Kilobytes, Megabytes and Gigabytes, respectively. Bytes are
assumed in case no letter is specified.
-c, --cap <int> Specify the upper bound for the resulting quality score. If
set to zero, capping is disabled. (default: 40)
-j, --threads <int> Number of threads to use
-z, --nbase When merging a base-pair that consists of two non-equal
bases out of which none is degenerate, set the merged base
to N and use the highest quality score of the two bases
-h, --help This help screen.
63 changes: 63 additions & 0 deletions src/pear/script.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,63 @@
#!/bin/bash

## VIASH START
## VIASH END

[[ "$par_emperical_freqs" == "false" ]] && unset par_emperical_freqs
[[ "$par_nbase" == "false" ]] && unset par_nbase

if [[ "${par_forward_fastq##*.}" == "gz" ]]; then
gunzip $par_forward_fastq
par_forward_fastq=${par_forward_fastq%.*}
fi
if [[ "${par_reverse_fastq##*.}" == "gz" ]]; then
gunzip $par_reverse_fastq
par_reverse_fastq=${par_reverse_fastq%.*}
fi

output_dir=$(mktemp -d -p "$meta_temp_dir" "pear.XXXXXX")

pear \
-f "$par_forward_fastq" \
-r "$par_reverse_fastq" \
-o "$output_dir" \
${par_p_value:+-p "${par_p_value}"} \
${par_min_overlap:+-v "${par_min_overlap}"} \
${par_max_assembly_length:+-m "${par_max_assembly_length}"} \
${par_min_assembly_length:+-n "${par_min_assembly_length}"} \
${par_min_trim_length:+-t "${par_min_trim_length}"} \
${par_quality_threshold:+-q "${par_quality_threshold}"} \
${par_max_uncalled_base:+-u "${par_max_uncalled_base}"} \
${par_test_method:+-g "${par_test_method}"} \
${par_score_method:+-s "${par_score_method}"} \
${par_phred_base:+-b "${par_phred_base}"} \
${meta_memory_mb:+--memory "${meta_memory_mb}M"} \
${par_cap:+-c "${par_cap}"} \
${meta_cpus:+-j "${meta_cpus}"} \
${par_emperical_freqs:+-e} \
${par_nbase:+-z}


if [[ "${par_assembled##*.}" == "gz" ]]; then
gzip -9 -c ${output_dir}.assembled.fastq > ${par_assembled}
else
mv ${output_dir}.assembled.fastq ${par_assembled}
fi

if [[ "${par_unassembled_forward##*.}" == "gz" ]]; then
gzip -9 -c ${output_dir}.unassembled.forward.fastq > ${par_unassembled_forward}
else
mv ${output_dir}.unassembled.forward.fastq ${par_unassembled_forward}
fi

if [[ "${par_unassembled_reverse##*.}" == "gz" ]]; then
gzip -9 -c ${output_dir}.unassembled.reverse.fastq > ${par_unassembled_reverse}
else
mv ${output_dir}.unassembled.reverse.fastq ${par_unassembled_reverse}
fi

if [[ "${par_discarded##*.}" == "gz" ]]; then
gzip -9 -c ${output_dir}.discarded.fastq > ${par_discarded}
else
mv ${output_dir}.discarded.fastq ${par_discarded}
fi
23 changes: 23 additions & 0 deletions src/pear/test.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
#!/bin/bash

set -e

dir_in="${meta_resources_dir%/}/test_data"

echo "> Run PEAR"
"$meta_executable" \
--forward_fastq "$dir_in/a.1.fastq" \
--reverse_fastq "$dir_in/a.2.fastq" \
--assembled "test.assembled.fastq.gz" \
--unassembled_forward "test.unassembled.forward.fastq.gz" \
--unassembled_reverse "test.unassembled.reverse.fastq.gz" \
--discarded "test.discarded.fastq.gz" \
--p_value 0.01

echo ">> Checking output"
[ ! -f "test.assembled.fastq.gz" ] && echo "Output file test.assembled.fastq.gz does not exist" && exit 1
[ ! -f "test.unassembled.forward.fastq.gz" ] && echo "Output file test.unassembled.forward.fastq.gz does not exist" && exit 1
[ ! -f "test.unassembled.reverse.fastq.gz" ] && echo "Output file test.unassembled.reverse.fastq.gz does not exist" && exit 1
[ ! -f "test.discarded.fastq.gz" ] && echo "Output file ftest.discarded.fastq.gz does not exist" && exit 1

echo "> Test successful"
4 changes: 4 additions & 0 deletions src/pear/test_data/a.1.fastq
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
@1
ACGGCAT
+
!!!!!!!
4 changes: 4 additions & 0 deletions src/pear/test_data/a.2.fastq
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
@1
ACGGCAT
+
!!!!!!!
10 changes: 10 additions & 0 deletions src/pear/test_data/script.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
# pear test data

# Test data was obtained from https://github.com/snakemake/snakemake-wrappers/tree/master/bio/fastp/test

if [ ! -d /tmp/snakemake-wrappers ]; then
git clone --depth 1 --single-branch --branch master https://github.com/snakemake/snakemake-wrappers /tmp/snakemake-wrappers
fi

cp -r /tmp/snakemake-wrappers/bio/fastp/test/reads/pe/* src/pear/test_data

0 comments on commit f54af0a

Please sign in to comment.