Seqtk sample #68

Merged 12 commits on Jul 17, 2024
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -50,6 +50,9 @@

* `falco`: A C++ drop-in replacement of FastQC to assess the quality of sequence read data (PR #43).

* `seqtk/seqtk_sample`: Sample sequences from FASTA/Q(.gz) files
to FASTA/Q (PR #68).


## MAJOR CHANGES

54 changes: 54 additions & 0 deletions src/seqtk/seqtk_sample/config.vsh.yaml
@@ -0,0 +1,54 @@
name: seqtk_sample
namespace: seqtk
description: Subsamples sequences from FASTA/Q files.
keywords: [sample, FASTA, FASTQ]
links:
  repository: https://github.com/lh3/seqtk/tree/v1.4
license: MIT

argument_groups:
  - name: Inputs
    arguments:
      - name: --input
        type: file
        description: The input FASTA/Q file.
        required: true

  - name: Outputs
    arguments:
      - name: --output
        type: file
        description: The output FASTA/Q file.
        required: true
        direction: output

  - name: Options
    arguments:
      - name: --seed
        type: integer
        description: Seed for the random number generator.
        default: 42
      - name: --fraction_number
        type: double
        description: Fraction or number of sequences to sample.
        default: 0.1
      - name: --two_pass_mode
        type: boolean
        description: Use 2-pass mode, which is twice as slow but uses much less memory.
        default: false

resources:
  - type: bash_script
    path: script.sh
test_resources:
  - type: bash_script
    path: test.sh
  - type: file
    path: ../test_data

engines:
  - type: docker
    image: quay.io/biocontainers/seqtk:1.4--he4a0461_2
runners:
  - type: executable
  - type: nextflow
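Because `--two_pass_mode` defaults to `false`, `par_two_pass_mode` is expected to hold the non-empty string `false` unless the caller overrides it, and a `${par_two_pass_mode:+-2}` expansion fires for any non-empty value. The sketch below illustrates that pitfall and one possible guard, assuming the default reaches the script as the literal string `false`. `naive_flag_for` and `flag_for` are hypothetical helpers for illustration, not part of this PR.

```shell
# Hypothetical helpers for illustration; not part of this PR.

# Naive form: any non-empty value, including "false", yields the flag.
naive_flag_for() {
  value=$1
  printf '%s\n' "${value:+-2}"
}

# Guarded form: the literal string "false" is treated as "not set".
flag_for() {
  value=$1
  [ "$value" = "false" ] && unset value
  printf '%s\n' "${value:+-2}"
}

naive_flag_for "false"
flag_for "false"
flag_for "true"
```

The first call shows the pitfall: the naive form emits `-2` even though the value is `false`. The guarded form emits the flag only for `true`.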
7 changes: 7 additions & 0 deletions src/seqtk/seqtk_sample/help.txt
@@ -0,0 +1,7 @@
```
seqtk_sample
```
Usage: seqtk sample [-2] [-s seed=11] <in.fa> <frac>|<number> > <out.fa>

Options: -s INT RNG seed [11]
         -2     2-pass mode: twice as slow but with much reduced memory
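`--fraction_number` is handed to `seqtk sample` as a single positional value that can mean either a fraction or a count, and the tests in this PR exercise both (`3` and `0.5`). A minimal sketch of how such a value might be classified, assuming values greater than 1 are read as an absolute count and values of at most 1 as a fraction. `sample_mode` is a hypothetical helper, not part of this PR or of seqtk.

```shell
# Hypothetical helper for illustration; not part of this PR or of seqtk.
# Assumption: values > 1 are counts, values <= 1 are fractions.
sample_mode() {
  awk -v v="$1" 'BEGIN { if (v + 0 > 1) print "count"; else print "fraction" }'
}

sample_mode 3
sample_mode 0.5
```

A wrapper could use a check like this to validate user input before calling the tool, for example to reject negative values.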
14 changes: 14 additions & 0 deletions src/seqtk/seqtk_sample/script.sh
@@ -0,0 +1,14 @@
#!/bin/bash

## VIASH START
## VIASH END

# A boolean left at its default arrives as the string "false"; unset it so ${var:+...} expands to nothing.
[[ "$par_two_pass_mode" == "false" ]] && unset par_two_pass_mode

seqtk sample \
  ${par_two_pass_mode:+-2} \
  ${par_seed:+-s "$par_seed"} \
  "$par_input" \
  "$par_fraction_number" \
  > "$par_output"
90 changes: 90 additions & 0 deletions src/seqtk/seqtk_sample/test.sh
@@ -0,0 +1,90 @@
#!/bin/bash

set -e

## VIASH START
meta_executable="target/executable/seqtk/seqtk_sample"
meta_resources_dir="src/seqtk"
## VIASH END

#########################################################################################
mkdir seqtk_sample_se
cd seqtk_sample_se

echo ">> Run seqtk_sample on fastq SE"
"$meta_executable" \
--input "$meta_resources_dir/test_data/reads/a.fastq" \
--seed 42 \
--fraction_number 3 \
--output "sampled.fastq"

echo ">> Check if output exists"
if [ ! -f "sampled.fastq" ]; then
echo ">> sampled.fastq does not exist"
exit 1
fi

#########################################################################################
cd ..
mkdir seqtk_sample_pe_number
cd seqtk_sample_pe_number

echo ">> Run seqtk_sample on fastq.gz PE with number of reads"
"$meta_executable" \
--input "$meta_resources_dir/test_data/reads/a.1.fastq.gz" \
--seed 42 \
--fraction_number 3 \
--output "sampled_1.fastq"

"$meta_executable" \
--input "$meta_resources_dir/test_data/reads/a.2.fastq.gz" \
--seed 42 \
--fraction_number 3 \
--output "sampled_2.fastq"

echo ">> Check if output exists"
if [ ! -f "sampled_1.fastq" ] || [ ! -f "sampled_2.fastq" ]; then
echo ">> One or both output files do not exist"
exit 1
fi

echo ">> Compare reads"
# Extract headers (first line of each 4-line record; quality lines may also start with '@')
headers1=$(awk 'NR % 4 == 1' sampled_1.fastq | sed -e 's/ 1$//' | sort)
headers2=$(awk 'NR % 4 == 1' sampled_2.fastq | sed -e 's/ 2$//' | sort)

# Compare headers
diff <(echo "$headers1") <(echo "$headers2") || { echo "Mismatch detected"; exit 1; }

#########################################################################################
cd ..
mkdir seqtk_sample_pe_fraction
cd seqtk_sample_pe_fraction

echo ">> Run seqtk_sample on fastq.gz PE with fraction of reads"
"$meta_executable" \
--input "$meta_resources_dir/test_data/reads/a.1.fastq.gz" \
--seed 42 \
--fraction_number 0.5 \
--output "sampled_1.fastq"

"$meta_executable" \
--input "$meta_resources_dir/test_data/reads/a.2.fastq.gz" \
--seed 42 \
--fraction_number 0.5 \
--output "sampled_2.fastq"

echo ">> Check if output exists"
if [ ! -f "sampled_1.fastq" ] || [ ! -f "sampled_2.fastq" ]; then
echo ">> One or both output files do not exist"
exit 1
fi

echo ">> Compare reads"
# Extract headers (first line of each 4-line record; quality lines may also start with '@')
headers1=$(awk 'NR % 4 == 1' sampled_1.fastq | sed -e 's/ 1$//' | sort)
headers2=$(awk 'NR % 4 == 1' sampled_2.fastq | sed -e 's/ 2$//' | sort)

# Compare headers
diff <(echo "$headers1") <(echo "$headers2") || { echo "Mismatch detected"; exit 1; }
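The checks above confirm that output files exist and that paired headers agree, but not how many reads were sampled. A minimal sketch of two helpers that could support such checks, assuming unwrapped 4-line FASTQ records. `fastq_headers` and `fastq_count` are hypothetical names, and the two small files are made-up data for illustration, not the PR's test data.

```shell
# Hypothetical helpers for illustration; not part of this PR.
# Both assume unwrapped FASTQ, i.e. exactly four lines per record.

# Print sorted headers with a trailing " 1" or " 2" mate suffix removed.
fastq_headers() {
  awk 'NR % 4 == 1' "$1" | sed -e 's/ [12]$//' | sort
}

# Print the number of records in the file.
fastq_count() {
  awk 'END { print NR / 4 }' "$1"
}

# Made-up paired data; note the quality lines that start with '@'.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
printf '@r1 1\nACGT\n+\n@!!!\n@r2 1\nTTGA\n+\n!!!!\n' > "$tmpdir/s1.fastq"
printf '@r2 2\nCCAT\n+\n!!!!\n@r1 2\nGGTA\n+\n!@!!\n' > "$tmpdir/s2.fastq"

fastq_count "$tmpdir/s1.fastq"
if [ "$(fastq_headers "$tmpdir/s1.fastq")" = "$(fastq_headers "$tmpdir/s2.fastq")" ]; then
  echo "headers match"
else
  echo "Mismatch detected"
fi
```

Selecting every fourth line avoids miscounting quality strings that begin with `@`, which a plain `grep '^@'` would include.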

60 changes: 60 additions & 0 deletions src/seqtk/seqtk_subseq/config.vsh.yaml
@@ -0,0 +1,60 @@
name: seqtk_subseq
namespace: seqtk
description: Extracts subsequences from FASTA/Q files using a BED file or a list of sequence names.
keywords: [subseq, FASTA, FASTQ]
links:
  repository: https://github.com/lh3/seqtk/tree/v1.4
license: MIT

argument_groups:
  - name: Inputs
    arguments:
      - name: --input
        type: file
        description: The input FASTA/Q file.
        required: true
      - name: "--regions_file"
        type: file
        description: |
          File with regions to extract. Can be either a list file
          with one sequence name per line or a bed file.
        required: true

  - name: Outputs
    arguments:
      - name: --output
        type: file
        description: The output FASTA/Q file.
        required: true
        direction: output

  - name: Options
    arguments:
      - name: "--tab"
        type: boolean
        description: Output in tab-delimited format.
        default: false
      - name: "--strand_aware"
        type: boolean
        description: Strand-aware mode.
        default: false
      - name: "--line_length"
        type: integer
        description: Number of bases per line.
        default: 60

resources:
  - type: bash_script
    path: script.sh
test_resources:
  - type: bash_script
    path: test.sh
  - type: file
    path: ../test_data

engines:
  - type: docker
    image: quay.io/biocontainers/seqtk:1.4--he4a0461_2
runners:
  - type: executable
  - type: nextflow
9 changes: 9 additions & 0 deletions src/seqtk/seqtk_subseq/help.txt
@@ -0,0 +1,9 @@
```
seqtk_subseq
```
Usage: seqtk subseq [options] <in.fa> <in.bed>|<name.list>
Options:
  -t       TAB delimited output
  -s       strand aware
  -l INT   sequence line length [0]
Note: Use 'samtools faidx' if only a few regions are intended.
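16 changes: 16 additions & 0 deletions src/seqtk/seqtk_subseq/script.sh
@@ -0,0 +1,16 @@
#!/bin/bash

## VIASH START
## VIASH END

# Booleans left at their default arrive as the string "false"; unset them so ${var:+...} expands to nothing.
[[ "$par_tab" == "false" ]] && unset par_tab
[[ "$par_strand_aware" == "false" ]] && unset par_strand_aware

seqtk subseq \
  ${par_tab:+-t} \
  ${par_strand_aware:+-s} \
  ${par_line_length:+-l "$par_line_length"} \
  "$par_input" \
  "$par_regions_file" \
  > "$par_output"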
Binary file added src/seqtk/test_data/reads/a.1.fastq.gz
Binary file not shown.
Binary file added src/seqtk/test_data/reads/a.2.fastq.gz
Binary file not shown.
4 changes: 4 additions & 0 deletions src/seqtk/test_data/reads/a.fastq
@@ -0,0 +1,4 @@
@1
ACGGCAT
+
!!!!!!!
Binary file added src/seqtk/test_data/reads/a.fastq.gz
Binary file not shown.
1 change: 1 addition & 0 deletions src/seqtk/test_data/reads/id.list
@@ -0,0 +1 @@
1
9 changes: 9 additions & 0 deletions src/seqtk/test_data/script.sh
@@ -0,0 +1,9 @@
# clone repo
if [ ! -d /tmp/snakemake-wrappers ]; then
git clone --depth 1 --single-branch --branch master https://github.com/snakemake/snakemake-wrappers /tmp/snakemake-wrappers
fi

# copy test data
cp -r /tmp/snakemake-wrappers/bio/seqtk/test/* src/seqtk/test_data

rm src/seqtk/test_data/Snakefile