Skip to content

Commit

Permalink
Seqtk sample (#68)
Browse files Browse the repository at this point in the history
* tests added

* tests extended

* changelog entry added

* reorganized seqtk namespace + added seqtk subseq config and script

* added subseq help.txt

* revert to seqtk sample only

* remove subseq

* updated tests, added tags

* Update two_pass_mode

Co-authored-by: Robrecht Cannoodt <[email protected]>

* author added to config

---------

Co-authored-by: Robrecht Cannoodt <[email protected]>
  • Loading branch information
jakubmajercik and rcannood authored Jul 17, 2024
1 parent 13c5439 commit e615d2a
Show file tree
Hide file tree
Showing 12 changed files with 207 additions and 0 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -93,6 +93,8 @@

* `falco`: A C++ drop-in replacement of FastQC to assess the quality of sequence read data (PR #43).

* `seqtk/seqtk_sample`: Sample sequences from FASTA/Q(.gz) files to FASTA/Q (PR #68).

* `umitools`:
- `umitools_dedup`: Deduplicate reads based on the mapping co-ordinate and the UMI attached to the read (PR #54).

Expand Down
10 changes: 10 additions & 0 deletions src/_authors/jakub_majercik.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
name: Jakub Majercik
info:
links:
email: [email protected]
github: jakubmajercik
linkedin: jakubmajercik
organizations:
- name: Data Intuitive
href: https://www.data-intuitive.com
role: Bioinformatics Engineer
57 changes: 57 additions & 0 deletions src/seqtk/seqtk_sample/config.vsh.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
name: seqtk_sample
namespace: seqtk
description: Subsamples sequences from FASTA/Q files.
keywords: [sample, FASTA, FASTQ]
links:
repository: https://github.com/lh3/seqtk/tree/v1.4
license: MIT
authors:
- __merge__: /src/_authors/jakub_majercik.yaml
roles: [ author, maintainer ]

argument_groups:
- name: Inputs
arguments:
- name: --input
type: file
description: The input FASTA/Q file.
required: true

- name: Outputs
arguments:
- name: --output
type: file
description: The output FASTA/Q file.
required: true
direction: output

- name: Options
arguments:
- name: --seed
type: integer
description: Seed for random generator.
example: 42
- name: --fraction_number
type: double
description: Fraction or number of sequences to sample.
required: true
example: 0.1
- name: --two_pass_mode
type: boolean_true
description: Twice as slow but with much reduced memory

resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- type: file
path: ../test_data

engines:
- type: docker
image: quay.io/biocontainers/seqtk:1.4--he4a0461_2
runners:
- type: executable
- type: nextflow
9 changes: 9 additions & 0 deletions src/seqtk/seqtk_sample/help.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
```
seqtk_subseq
```
Usage: seqtk subseq [options] <in.fa> <in.bed>|<name.list>
Options:
-t TAB delimited output
-s strand aware
-l INT sequence line length [0]
Note: Use 'samtools faidx' if only a few regions are intended.
11 changes: 11 additions & 0 deletions src/seqtk/seqtk_sample/script.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
#!/bin/bash

## VIASH START
## VIASH END

seqtk sample \
${par_two_pass_mode:+-2} \
${par_seed:+-s "$par_seed"} \
"$par_input" \
"$par_fraction_number" \
> "$par_output"
104 changes: 104 additions & 0 deletions src/seqtk/seqtk_sample/test.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,104 @@
#!/bin/bash

set -e

## VIASH START
meta_executable="target/executable/seqtk/seqtk_sample"
meta_resources_dir="src/seqtk"
## VIASH END

#########################################################################################
mkdir seqtk_sample_se
cd seqtk_sample_se

echo "> Run seqtk_sample on fastq SE"
"$meta_executable" \
--input "$meta_resources_dir/test_data/reads/a.1.fastq.gz" \
--seed 42 \
--fraction_number 3 \
--output "sampled.fastq"

echo ">> Check if output exists"
if [ ! -f "sampled.fastq" ]; then
echo ">> sampled.fastq does not exist"
exit 1
fi

echo ">> Count number of samples"
num_samples=$(grep -c '^@' sampled.fastq)
if [ "$num_samples" -ne 3 ]; then
echo ">> sampled.fastq does not contain 3 samples"
exit 1
fi

#########################################################################################
cd ..
mkdir seqtk_sample_pe_number
cd seqtk_sample_pe_number

echo ">> Run seqtk_sample on fastq.gz PE with number of reads"
"$meta_executable" \
--input "$meta_resources_dir/test_data/reads/a.1.fastq.gz" \
--seed 42 \
--fraction_number 3 \
--output "sampled_1.fastq"

"$meta_executable" \
--input "$meta_resources_dir/test_data/reads/a.2.fastq.gz" \
--seed 42 \
--fraction_number 3 \
--output "sampled_2.fastq"

echo ">> Check if output exists"
if [ ! -f "sampled_1.fastq" ] || [ ! -f "sampled_2.fastq" ]; then
echo ">> One or both output files do not exist"
exit 1
fi

echo ">> Compare reads"
# Extract headers
headers1=$(grep '^@' sampled_1.fastq | sed -e's/ 1$//' | sort)
headers2=$(grep '^@' sampled_2.fastq | sed -e 's/ 2$//' | sort)

# Compare headers
diff <(echo "$headers1") <(echo "$headers2") || { echo "Mismatch detected"; exit 1; }

echo ">> Count number of samples"
num_headers=$(echo "$headers1" | wc -l)
if [ "$num_headers" -ne 3 ]; then
echo ">> sampled_1.fastq does not contain 3 headers"
exit 1
fi

#########################################################################################
cd ..
mkdir seqtk_sample_pe_fraction
cd seqtk_sample_pe_fraction

echo ">> Run seqtk_sample on fastq.gz PE with fraction of reads"
"$meta_executable" \
--input "$meta_resources_dir/test_data/reads/a.1.fastq.gz" \
--seed 42 \
--fraction_number 0.5 \
--output "sampled_1.fastq"

"$meta_executable" \
--input "$meta_resources_dir/test_data/reads/a.2.fastq.gz" \
--seed 42 \
--fraction_number 0.5 \
--output "sampled_2.fastq"

echo ">> Check if output exists"
if [ ! -f "sampled_1.fastq" ] || [ ! -f "sampled_2.fastq" ]; then
echo ">> One or both output files do not exist"
exit 1
fi

echo ">> Compare reads"
# Extract headers
headers1=$(grep '^@' sampled_1.fastq | sed -e's/ 1$//' | sort)
headers2=$(grep '^@' sampled_2.fastq | sed -e 's/ 2$//' | sort)

# Compare headers
diff <(echo "$headers1") <(echo "$headers2") || { echo "Mismatch detected"; exit 1; }

Binary file added src/seqtk/test_data/reads/a.1.fastq.gz
Binary file not shown.
Binary file added src/seqtk/test_data/reads/a.2.fastq.gz
Binary file not shown.
4 changes: 4 additions & 0 deletions src/seqtk/test_data/reads/a.fastq
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
@1
ACGGCAT
+
!!!!!!!
Binary file added src/seqtk/test_data/reads/a.fastq.gz
Binary file not shown.
1 change: 1 addition & 0 deletions src/seqtk/test_data/reads/id.list
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
1
9 changes: 9 additions & 0 deletions src/seqtk/test_data/script.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
# clone repo
if [ ! -d /tmp/snakemake-wrappers ]; then
git clone --depth 1 --single-branch --branch master https://github.com/snakemake/snakemake-wrappers /tmp/snakemake-wrappers
fi

# copy test data
cp -r /tmp/snakemake-wrappers/bio/seqtk/test/* src/seqtk/test_data

rm src/seqtk/test_data/Snakefile

0 comments on commit e615d2a

Please sign in to comment.