Fq subsample #147

Merged
merged 14 commits into from
Sep 9, 2024
Changes from 13 commits
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -131,6 +131,8 @@
- `bedtools_getfasta`: extract sequences from a FASTA file for each of the
intervals defined in a BED/GFF/VCF file (PR #59).

* `fq_subsample`: Sample a subset of records from single or paired FASTQ files (PR #147).

## MINOR CHANGES

* Uniformize component metadata (PR #23).
73 changes: 73 additions & 0 deletions src/fq_subsample/config.vsh.yaml
@@ -0,0 +1,73 @@
name: fq_subsample
description: fq subsample outputs a subset of records from single or paired FASTQ files.
keywords: [fastq, subsample, subset]
links:
  homepage: https://github.com/stjude-rust-labs/fq/blob/master/README.md
  documentation: https://github.com/stjude-rust-labs/fq/blob/master/README.md
  repository: https://github.com/stjude-rust-labs/fq
license: MIT

argument_groups:
  - name: "Input"
    arguments:
      - name: "--input_1"
        type: file
        required: true
        description: First input FASTQ file to subsample. Accepts both raw and gzipped FASTQ inputs.
      - name: "--input_2"
        type: file
        description: Second input FASTQ file to subsample. Accepts both raw and gzipped FASTQ inputs.

  - name: "Output"
    arguments:
      - name: "--output_1"
        type: file
        direction: output
        description: Sampled read 1 FASTQ file. Output will be gzipped if the filename ends in `.gz`.
      - name: "--output_2"
        type: file
        direction: output
        description: Sampled read 2 FASTQ file. Output will be gzipped if the filename ends in `.gz`.

  - name: "Options"
    arguments:
      - name: "--probability"
        type: double
        description: The probability that a record is kept, as a fraction in the range (0.0, 1.0). Cannot be used with `--record_count`.
      - name: "--record_count"
        type: integer
        description: The exact number of records to keep. Cannot be used with `--probability`.
      - name: "--seed"
        type: integer
        description: Seed to use for the random number generator.

resources:
  - type: bash_script
    path: script.sh

test_resources:
  - type: bash_script
    path: test.sh
  - path: test_data

engines:
  - type: docker
    image: ubuntu:22.04
    setup:
      - type: docker
        env:
          - TZ=Europe/Brussels
        run: |
          ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone && \
          apt-get update && \
          apt-get install -y --no-install-recommends build-essential git-all curl && \
          curl https://sh.rustup.rs -sSf | sh -s -- -y && \
          . "$HOME/.cargo/env" && \
          git clone --depth 1 --branch v0.12.0 https://github.com/stjude-rust-labs/fq.git && \
          mv fq /usr/local/ && cd /usr/local/fq && \
          cargo install --locked --path . && \
          mv /usr/local/fq/target/release/fq /usr/local/bin/

runners:
  - type: executable
  - type: nextflow
20 changes: 20 additions & 0 deletions src/fq_subsample/help.txt
@@ -0,0 +1,20 @@
```
fq subsample -h
```

Outputs a subset of records

Usage: fq subsample [OPTIONS] --r1-dst <R1_DST> <--probability <PROBABILITY>|--record-count <RECORD_COUNT>> <R1_SRC> [R2_SRC]

Arguments:
<R1_SRC> Read 1 source. Accepts both raw and gzipped FASTQ inputs
[R2_SRC] Read 2 source. Accepts both raw and gzipped FASTQ inputs

Options:
-p, --probability <PROBABILITY> The probability a record is kept, as a percentage (0.0, 1.0). Cannot be used with `record-count`
-n, --record-count <RECORD_COUNT> The exact number of records to keep. Cannot be used with `probability`
-s, --seed <SEED> Seed to use for the random number generator
--r1-dst <R1_DST> Read 1 destination. Output will be gzipped if ends in `.gz`
--r2-dst <R2_DST> Read 2 destination. Output will be gzipped if ends in `.gz`
-h, --help Print help
-V, --version
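
The usage line above requires exactly one of `--probability` or `--record-count`, and the option text gives the probability range as (0.0, 1.0). A wrapper could check both conditions before calling `fq`. The following is a minimal sketch of such a check, not something `fq` itself provides: the function name `check_sampling_mode` is hypothetical, and treating both bounds as exclusive is an assumption read from the interval notation.

```bash
#!/bin/bash
# Hypothetical pre-flight check (not part of fq).
# Arguments: $1 = probability (may be empty), $2 = record count (may be empty).
# Succeeds only when exactly one of the two is given and, for a probability,
# the value lies strictly between 0 and 1 (assumed exclusive bounds).
check_sampling_mode() {
  local probability=$1 record_count=$2

  # Exactly one of the two must be non-empty.
  if [[ -n $probability && -n $record_count ]] || [[ -z $probability && -z $record_count ]]; then
    echo "specify exactly one of --probability or --record-count" >&2
    return 1
  fi

  if [[ -n $probability ]]; then
    # bash arithmetic is integer-only, so awk does the floating-point comparison.
    awk -v p="$probability" 'BEGIN { exit !(p + 0 > 0 && p + 0 < 1) }' || {
      echo "--probability must lie between 0.0 and 1.0" >&2
      return 1
    }
  fi
}
```

For example, `check_sampling_mode 0.25 ""` succeeds, while `check_sampling_mode 0.25 3` and `check_sampling_mode 1.5 ""` both fail with a message on stderr.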
26 changes: 26 additions & 0 deletions src/fq_subsample/script.sh
@@ -0,0 +1,26 @@
#!/bin/bash

## VIASH START
## VIASH END

set -eo pipefail


required_args=("-p" "--probability" "-n" "--record_count")

# exclusive OR for required arguments $par_probability and $par_record_count
if [[ -n $par_probability && -n $par_record_count ]] || [[ -z $par_probability && -z $par_record_count ]]; then
  echo "FQ/SUBSAMPLE requires exactly one of --probability or --record_count to be specified" >&2
  exit 1
fi


fq subsample \
  ${par_output_1:+--r1-dst "${par_output_1}"} \
  ${par_output_2:+--r2-dst "${par_output_2}"} \
  ${par_probability:+--probability "${par_probability}"} \
  ${par_record_count:+--record-count "${par_record_count}"} \
  ${par_seed:+--seed "${par_seed}"} \
  "${par_input_1}" \
  ${par_input_2:+"${par_input_2}"}
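
One alternative to the inline `${var:+...}` expansions in the script is to collect the arguments in a bash array, which keeps each value a single word even when a path contains spaces. The following is a sketch only, not the approach taken in this PR: the helper name `build_fq_args` and the global array `fq_args` are hypothetical, and the `par_*` variables are assumed to be set the way Viash sets them for this component.

```bash
#!/bin/bash
# Hypothetical helper (not part of the PR): fill the global array fq_args
# from the par_* variables of this component.
build_fq_args() {
  fq_args=()
  # Optional flags are added only when the corresponding variable is non-empty.
  [[ -n ${par_output_1:-} ]] && fq_args+=(--r1-dst "$par_output_1")
  [[ -n ${par_output_2:-} ]] && fq_args+=(--r2-dst "$par_output_2")
  [[ -n ${par_probability:-} ]] && fq_args+=(--probability "$par_probability")
  [[ -n ${par_record_count:-} ]] && fq_args+=(--record-count "$par_record_count")
  [[ -n ${par_seed:-} ]] && fq_args+=(--seed "$par_seed")
  # Inputs are positional; the second one is only present for paired-end data.
  fq_args+=("$par_input_1")
  [[ -n ${par_input_2:-} ]] && fq_args+=("$par_input_2")
  # Return success even when the last optional test above was false.
  return 0
}
```

After calling `build_fq_args`, the command would be run as `fq subsample "${fq_args[@]}"`.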

36 changes: 36 additions & 0 deletions src/fq_subsample/test.sh
@@ -0,0 +1,36 @@
#!/bin/bash

echo ">>> Testing $meta_executable"

echo ">>> Testing for paired-end reads"
"$meta_executable" \
  --input_1 "$meta_resources_dir/test_data/a.3.fastq.gz" \
  --input_2 "$meta_resources_dir/test_data/a.4.fastq.gz" \
  --record_count 3 \
  --seed 1 \
  --output_1 a.1.subsampled.fastq \
  --output_2 a.2.subsampled.fastq

echo ">> Checking if the correct files are present"
[ ! -f "a.1.subsampled.fastq" ] && echo "Subsampled FASTQ file for read 1 is missing" && exit 1
[ "$(wc -l < a.1.subsampled.fastq)" -ne 12 ] && echo "Subsampled FASTQ file for read 1 does not contain the expected number of records" && exit 1
[ ! -f "a.2.subsampled.fastq" ] && echo "Subsampled FASTQ file for read 2 is missing" && exit 1
[ "$(wc -l < a.2.subsampled.fastq)" -ne 12 ] && echo "Subsampled FASTQ file for read 2 does not contain the expected number of records" && exit 1

rm a.1.subsampled.fastq a.2.subsampled.fastq

echo ">>> Testing for single-end reads"
"$meta_executable" \
  --input_1 "$meta_resources_dir/test_data/a.3.fastq.gz" \
  --record_count 3 \
  --seed 1 \
  --output_1 a.1.subsampled.fastq


echo ">> Checking if the correct files are present"
[ ! -f "a.1.subsampled.fastq" ] && echo "Subsampled FASTQ file is missing" && exit 1
[ "$(wc -l < a.1.subsampled.fastq)" -ne 12 ] && echo "Subsampled FASTQ file does not contain the expected number of records" && exit 1

echo ">>> Tests finished successfully"
exit 0
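
The checks in this test compare against 12 lines because `--record_count 3` yields three records of four lines each. A small helper could express the expectation in records instead. The following is a sketch under assumptions: the function name `count_fastq_records` is hypothetical, and it assumes the four-lines-per-record FASTQ layout (no wrapped sequence lines) and that `gzip` is available.

```bash
#!/bin/bash
# Hypothetical helper (not part of the PR): print the number of records in a
# FASTQ file, assuming four lines per record. Files ending in .gz are
# decompressed on the fly.
count_fastq_records() {
  local file=$1 lines
  if [[ $file == *.gz ]]; then
    lines=$(gzip -cd "$file" | wc -l)
  else
    lines=$(wc -l < "$file")
  fi
  echo $(( lines / 4 ))
}
```

With such a helper, a check could read `[ "$(count_fastq_records a.1.subsampled.fastq)" -ne 3 ]`, and the same call would also work if the output name ended in `.gz`.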

Binary file added src/fq_subsample/test_data/a.3.fastq.gz
Binary file not shown.
Binary file added src/fq_subsample/test_data/a.4.fastq.gz
Binary file not shown.