Seqtk subseq #85

Merged Jul 18, 2024 (30 commits). The diff below shows changes from 11 of those commits.

Commits
1b5d696
Config file and help.txt file
tgaspe Jul 15, 2024
54b23ff
Added script.sh
tgaspe Jul 15, 2024
4157f4b
Created test.sh
tgaspe Jul 15, 2024
9ca7071
Update on test.sh
tgaspe Jul 16, 2024
999c582
update
tgaspe Jul 16, 2024
628a335
Bug fixes
tgaspe Jul 16, 2024
1337b0f
Update test
tgaspe Jul 16, 2024
5b7f207
Update CHANGELOG.md
tgaspe Jul 16, 2024
84f714e
Improvement on test.sh
tgaspe Jul 16, 2024
88cf6d7
Added more test
tgaspe Jul 16, 2024
bd5de0a
Update on tests
tgaspe Jul 17, 2024
cc6746e
Bug fixed
tgaspe Jul 17, 2024
c782e2a
Update CHANGELOG.md
tgaspe Jul 17, 2024
06e1fe8
Fixed Tabular test bug
tgaspe Jul 17, 2024
20ac10a
Strand Aware Test
tgaspe Jul 17, 2024
0059ede
Input validation for list file
tgaspe Jul 17, 2024
aab5679
Sugested Changes
tgaspe Jul 17, 2024
32c084d
Added author info
tgaspe Jul 17, 2024
3dfc028
Merge branch 'main' into seqtk_subseq
rcannood Jul 18, 2024
a842777
Update CHANGELOG.md
rcannood Jul 18, 2024
819bd9b
Update theodoro_gasperin.yaml
rcannood Jul 18, 2024
d05cbf0
add newline
rcannood Jul 18, 2024
c22b383
add newline
rcannood Jul 18, 2024
dfc0a65
Update src/seqtk/seqtk_subseq/config.vsh.yaml
tgaspe Jul 18, 2024
0e9060a
Version Fix
tgaspe Jul 18, 2024
c8e5bb2
Merge branch 'seqtk_subseq' of https://github.com/tgaspe/biobox into …
tgaspe Jul 18, 2024
3a27502
Update on config
tgaspe Jul 18, 2024
d00c944
Helper bed.sh
tgaspe Jul 18, 2024
a276b9e
Deleted _helpers
tgaspe Jul 18, 2024
590e138
don't forget exit when a test fails
rcannood Jul 18, 2024
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -80,6 +80,8 @@
- `bedtools_getfasta`: extract sequences from a FASTA file for each of the
intervals defined in a BED/GFF/VCF file (PR #59).

* `seqtk subseq`: (PR #)

## MINOR CHANGES

* Uniformize component metadata (PR #23).
Empty file added myout.fa
Empty file.
75 changes: 75 additions & 0 deletions src/seqtk/seqtk_subseq/config.vsh.yaml
@@ -0,0 +1,75 @@
name: seqtk_subseq
namespace: seqtk
description: |
  Extract subsequences from FASTA/Q files. Takes as input a FASTA/Q file and a name.lst (sequence IDs file) or a reg.bed (genomic regions file).
keywords: [subseq, FASTA, FASTQ]
links:
  repository: https://github.com/lh3/seqtk/tree/v1.4
license: MIT

argument_groups:
  - name: Inputs
    arguments:
      - name: "--input"
        type: file
        direction: input
        description: The input FASTA/Q file.
        required: true
        example: input.fa

      - name: "--name_list"
        type: file
        direction: input
        description: |
          List of sequence names (name.lst) or genomic regions (reg.bed) to extract.
        required: true
        example: list.lst

  - name: Outputs
    arguments:
      - name: "--output"
        alternatives: -o
        type: file
        direction: output
        description: The output FASTA/Q file.
        required: true
        default: output.fa

  - name: Options
    arguments:
      - name: "--tab"
        alternatives: -t
        type: boolean_true
        description: TAB delimited output.

      - name: "--strand_aware"
        alternatives: -s
        type: boolean_true
        description: Strand aware.

      - name: "--sequence_line_length"
        alternatives: -l
        type: integer
        description: Sequence line length of the output FASTA file.
        example: 16

resources:
  - type: bash_script
    path: script.sh
test_resources:
  - type: bash_script
    path: test1.sh
  - type: file
    path: test_data

engines:
  - type: docker
    image: quay.io/biocontainers/seqtk:1.4--he4a0461_2
    setup:
      - type: docker
        run: |
          echo "xxx: \"0.1.0\"" > /var/software_versions.txt
runners:
  - type: executable
  - type: nextflow
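
To make the `--sequence_line_length` option concrete, here is a minimal sketch of the wrapping it requests. It uses the POSIX `fold` utility as a stand-in rather than seqtk itself, and `wrap_sequence` is a hypothetical helper name that does not appear in this PR.

```bash
#!/bin/bash
# Hypothetical helper: wrap a sequence string at a fixed width.
# fold is only a stand-in; the component itself delegates to `seqtk subseq -l`.
wrap_sequence() {
  local seq="$1" width="$2"
  printf '%s\n' "$seq" | fold -w "$width"
}

# The 10-base region used in the PR's tests, wrapped at 5 bases per line.
wrap_sequence "AGTGTTCGAG" 5
```

This prints `AGTGT` and `TCGAG` on separate lines, which is the shape the line-length test in test1.sh expects for `--sequence_line_length 5`.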
9 changes: 9 additions & 0 deletions src/seqtk/seqtk_subseq/help.txt
@@ -0,0 +1,9 @@
```bash
seqtk subseq
```
Usage: seqtk subseq [options] <in.fa> <in.bed>|<name.list>
Options:
-t TAB delimited output
-s strand aware
-l INT sequence line length [0]
Note: Use 'samtools faidx' if only a few regions are intended.
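
When the second argument is a BED file, intervals use a 0-based start and an exclusive end, while the headers in the expected test output (for example `KU562861.1:11-20`) are 1-based and inclusive. The sketch below reproduces that mapping in plain bash. `extract_region` is a hypothetical helper, and the interval `10 20` is an assumed reg.bed entry inferred from the expected output, not taken from the test data itself.

```bash
#!/bin/bash
# Hypothetical helper: extract a BED interval (0-based start, exclusive end)
# from a sequence string and print it with a 1-based, inclusive label.
extract_region() {
  local name="$1" seq="$2" start="$3" end="$4"
  printf '>%s:%d-%d\n' "$name" "$((start + 1))" "$end"
  printf '%s\n' "${seq:$start:$((end - start))}"
}

# First 27 bases of KU562861.1 as quoted in the PR's expected output;
# the interval 10..20 is an assumed reg.bed entry.
extract_region "KU562861.1" "GGAGCAGGAGAGTGTTCGAGTTCAGAG" 10 20
```

Under these assumptions the call prints `>KU562861.1:11-20` followed by `AGTGTTCGAG`.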
12 changes: 12 additions & 0 deletions src/seqtk/seqtk_subseq/script.sh
@@ -0,0 +1,12 @@
#!/bin/bash

## VIASH START
## VIASH END

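# Viash is assumed to pass unset boolean_true flags as the string "false";
# unset them so the ${var:+...} expansions below emit nothing in that case.
[[ "$par_tab" == "false" ]] && unset par_tab
[[ "$par_strand_aware" == "false" ]] && unset par_strand_aware

seqtk subseq \
  ${par_tab:+-t} \
  ${par_strand_aware:+-s} \
  ${par_sequence_line_length:+-l "$par_sequence_line_length"} \
  "$par_input" \
  "$par_name_list" \
  > "$par_output"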
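
The `${par_tab:+-t}` expansion emits `-t` whenever the variable is set and non-empty, and that includes the string `false`. The sketch below contrasts the unguarded expansion with one that unsets the variable first. It assumes that Viash exports an omitted `boolean_true` argument as `false`. The function names are hypothetical.

```bash
#!/bin/bash
# Naive: mirrors the ${par_tab:+-t} expansion with no guard.
naive_flag() {
  local par_tab="$1"
  printf '%s\n' "${par_tab:+-t}"
}

# Guarded: drop the variable first when it holds the string "false"
# (assumed to be how Viash represents an omitted boolean_true flag).
guarded_flag() {
  local par_tab="$1"
  if [[ "$par_tab" == "false" ]]; then unset par_tab; fi
  printf '%s\n' "${par_tab:+-t}"
}

naive_flag "false"     # prints -t (the flag leaks through)
guarded_flag "false"   # prints an empty line
guarded_flag "true"    # prints -t
```

The same guard applies to any other boolean flag forwarded with `${var:+...}`, such as `par_strand_aware`.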
156 changes: 156 additions & 0 deletions src/seqtk/seqtk_subseq/test.sh
@@ -0,0 +1,156 @@
#!/bin/bash

# exit on error
set -e

## VIASH START
meta_executable="target/executable/seqtk/seqtk_subseq"
meta_resources_dir="src/seqtk"
## VIASH END

#########################################################################################
mkdir test1
cd test1

echo "> Run seqtk_subseq on FASTA/Q file"
"$meta_executable" \
--input "$meta_resources_dir/test_data/a.1.fastq" \
--name_list "$meta_resources_dir/test_data/id.list" \
--output "sub_sample.fq"

echo ">> Check if output exists"
if [ ! -f "sub_sample.fq" ]; then
  echo ">> sub_sample.fq does not exist"
  exit 1
fi

echo ">> Check number of lines in output"
n_lines=$(wc -l < sub_sample.fq)
n_lines=$(echo "$n_lines" | awk '{print $1}')

if [ "$n_lines" -ne 2 ]; then
  echo ">> sub_sample.fq does not contain exactly two lines"
  exit 1
fi

echo ">> Check content in output"
result=$(sed -n '2p' sub_sample.fq)
expected=$(sed -n '2p' "$meta_resources_dir/test_data/a.1.fastq")
if [ "$result" == "$expected" ]; then
  echo "--> content is equal"
else
  echo "--> content is not equal"
  exit 1
fi

########################################################################################
#-- tab option --
cd ..
mkdir test2
cd test2

echo "> Run seqtk_subseq with TAB option"
"$meta_executable" \
--tab \
--input "$meta_resources_dir/test_data/input.fasta" \
--name_list "$meta_resources_dir/test_data/ids.txt" \
--output "sub_sample.fq"

echo ">> Check if output exists"
if [ ! -f "sub_sample.fq" ]; then
  echo ">> sub_sample.fq does not exist"
  exit 1
fi

echo ">> Check number of lines in output"
n_lines=$(wc -l < sub_sample.fq)
n_lines=$(echo "$n_lines" | awk '{print $1}')

if [ "$n_lines" -ne 2 ]; then
  echo ">> sub_sample.fq does not contain exactly two lines"
  exit 1
fi

echo ">> Check content in output"
result=$(sed -n '2p' sub_sample.fq)
expected=$(sed -n '2p' "$meta_resources_dir/test_data/a.1.fastq")
if [ "$result" == "$expected" ]; then
  echo "--> content is equal"
else
  echo "--> content is not equal"
  exit 1
fi

cat sub_sample.fq

########################################################################################
#-- strand aware option --
cd ..
mkdir test3
cd test3
echo "> Run seqtk_subseq with Strand Aware option"

"$meta_executable" \
--strand_aware \
--input "$meta_resources_dir/test_data/a.1.fastq" \
--name_list "$meta_resources_dir/test_data/id.list" \
--output "sub_sample.fq"

echo ">> Check if output exists"
if [ ! -f "sub_sample.fq" ]; then
  echo ">> sub_sample.fq does not exist"
  exit 1
fi

echo ">> Check number of lines in output"
n_lines=$(wc -l < sub_sample.fq)
n_lines=$(echo "$n_lines" | awk '{print $1}')

if [ "$n_lines" -ne 2 ]; then
  echo ">> sub_sample.fq does not contain exactly two lines"
  exit 1
fi

echo ">> Check content in output"
result=$(sed -n '2p' sub_sample.fq)
expected=$(sed -n '2p' "$meta_resources_dir/test_data/a.1.fastq")
if [ "$result" == "$expected" ]; then
  echo "--> content is equal"
else
  echo "--> content is not equal"
  exit 1
fi

########################################################################################
#-- sequence line length option --
cd ..
mkdir test4
cd test4

echo "> Run seqtk_subseq with line length option"
"$meta_executable" \
--sequence_line_length 10 \
--input "$meta_resources_dir/test_data/a.1.fastq" \
--name_list "$meta_resources_dir/test_data/id.list" \
--output "sub_sample.fq"

echo ">> Check if output exists"
if [ ! -f "sub_sample.fq" ]; then
  echo ">> sub_sample.fq does not exist"
  exit 1
fi

echo ">> Check number of lines in output"
n_lines=$(wc -l < sub_sample.fq)
n_lines=$(echo "$n_lines" | awk '{print $1}')

if [ "$n_lines" -ne 2 ]; then
  echo ">> sub_sample.fq does not contain exactly two lines"
  exit 1
fi

echo ">> Check content in output"
result=$(sed -n '2p' sub_sample.fq)
expected=$(sed -n '2p' "$meta_resources_dir/test_data/a.1.fastq")
if [ "$result" == "$expected" ]; then
  echo "--> content is equal"
else
  echo "--> content is not equal"
  exit 1
fi
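
The four test sections in test.sh repeat the same existence, line-count, and content checks. One possible way to factor them out is sketched below. The helper names are hypothetical and not part of this PR. Each helper returns non-zero on failure, so under `set -e` a failing check stops the script.

```bash
#!/bin/bash
# Hypothetical helpers; names are illustrative and not part of this PR.
assert_file_exists() {
  [ -f "$1" ] || { echo "File '$1' does not exist" >&2; return 1; }
}

assert_line_count() {
  local file="$1" expected="$2" actual
  # Strip the padding some wc implementations add around the count.
  actual=$(wc -l < "$file" | tr -d '[:space:]')
  [ "$actual" -eq "$expected" ] || {
    echo "'$file' has $actual lines, expected $expected" >&2
    return 1
  }
}

assert_same_line() {
  local lineno="$1" file_a="$2" file_b="$3"
  [ "$(sed -n "${lineno}p" "$file_a")" == "$(sed -n "${lineno}p" "$file_b")" ] || {
    echo "Line $lineno differs between '$file_a' and '$file_b'" >&2
    return 1
  }
}
```

A test section would then reduce to calls such as `assert_file_exists sub_sample.fq` and `assert_line_count sub_sample.fq 2` after the component invocation.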
122 changes: 122 additions & 0 deletions src/seqtk/seqtk_subseq/test1.sh
@@ -0,0 +1,122 @@
#!/bin/bash

# exit on error
set -e

## VIASH START
meta_executable="target/executable/seqtk/seqtk_subseq"
meta_resources_dir="src/seqtk"
## VIASH END

#########################################################################################
# Run basic test
mkdir test1
cd test1

echo "> Run seqtk_subseq on FASTA/Q file"
"$meta_executable" \
--input "$meta_resources_dir/test_data/input.fasta" \
--name_list "$meta_resources_dir/test_data/id.list" \
--output "sub_sample.fq"

expected_output_basic=">KU562861.1
GGAGCAGGAGAGTGTTCGAGTTCAGAGATGTCCATGGCGCCGTACGAGAAGGTGATGGATGACCTGGCCAAGGGGCAGCAGTTCGCGACGCAGCTGCAGGGCCTCCTCCGGGACTCCCCCAAGGCCGGCCACATCATGGA
>MH150936.1
TAGAAGCTAATGAAAACTTTTCCTTTACTAAAAACCGTCAAACACGGTAAGAAACGCTTTTAATCATTTCAAAAGCAATCCCAATAGTGGTTACATCCAAACAAAACCCATTTCTTATATTTTCTCAAAAACAGTGAGAG"
output_basic=$(cat sub_sample.fq)

if [ "$output_basic" == "$expected_output_basic" ]; then
  echo "Basic test passed"
else
  echo "Basic test failed"
  echo "Expected:"
  echo "$expected_output_basic"
  echo "Got:"
  echo "$output_basic"
  exit 1
fi

#########################################################################################
# Run reg.bed as name list input test
cd ..
mkdir test2
cd test2

echo "> Run seqtk_subseq on FASTA/Q file with BED file as name list"
"$meta_executable" \
--input "$meta_resources_dir/test_data/input.fasta" \
--name_list "$meta_resources_dir/test_data/reg.bed" \
--output "sub_sample.fq"

expected_output_basic=">KU562861.1:11-20
AGTGTTCGAG
>MH150936.1:11-20
TGAAAACTTT"
output_basic=$(cat sub_sample.fq)

if [ "$output_basic" == "$expected_output_basic" ]; then
  echo "Test passed!"
else
  echo "Test failed!"
  echo "Expected:"
  echo "$expected_output_basic"
  echo "Got:"
  echo "$output_basic"
  exit 1
fi

#########################################################################################
# Run tab option output test
cd ..
mkdir test3
cd test3

echo "> Run seqtk_subseq with TAB option"
"$meta_executable" \
--tab \
--input "$meta_resources_dir/test_data/input.fasta" \
--name_list "$meta_resources_dir/test_data/reg.bed" \
--output "sub_sample.fq"

# ANSI-C quoting ($'...') so that \t is a real TAB rather than a literal backslash-t
expected_output_tabular=$'KU562861.1\t11\tAGTGTTCGAG
MH150936.1\t11\tTGAAAACTTT'
output_tabular=$(cat sub_sample.fq)

if [ "$output_tabular" == "$expected_output_tabular" ]; then
  echo "Tabular output test passed"
else
  echo "Tabular output test failed"
  echo "Expected:"
  echo "$expected_output_tabular"
  echo "Got:"
  echo "$output_tabular"
  exit 1
fi

#########################################################################################
# Run line option output test
cd ..
mkdir test4
cd test4

echo "> Run seqtk_subseq with line length option"
"$meta_executable" \
--sequence_line_length 5 \
--input "$meta_resources_dir/test_data/input.fasta" \
--name_list "$meta_resources_dir/test_data/reg.bed" \
--output "sub_sample.fq"

expected_output_wrapped=">KU562861.1:11-20
AGTGT
TCGAG
>MH150936.1:11-20
TGAAA
ACTTT"
output_wrapped=$(cat sub_sample.fq)

if [ "$output_wrapped" == "$expected_output_wrapped" ]; then
  echo "Line-wrapped output test passed"
else
  echo "Line-wrapped output test failed"
  echo "Expected:"
  echo "$expected_output_wrapped"
  echo "Got:"
  echo "$output_wrapped"
  exit 1
fi
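
A note on expected strings for the `--tab` test: inside double quotes, bash keeps `\t` as two literal characters, so an expected value written that way cannot equal output that contains real tabs. The minimal sketch below shows the difference. The variable names are illustrative only.

```bash
#!/bin/bash
literal="a\tb"     # double quotes: backslash and 't' stay as two characters
ansi=$'a\tb'       # ANSI-C quoting: \t becomes a real TAB character
from_printf=$(printf 'a\tb')   # what a tool emitting tabs would produce

echo "${#literal}"   # 4
echo "${#ansi}"      # 3
[ "$ansi" == "$from_printf" ] && echo "ANSI-C string matches real tab output"
[ "$literal" == "$from_printf" ] || echo "double-quoted string does not match"
```

Building the expected value with `$'...'` or `printf` keeps the comparison byte-for-byte against tab-delimited output.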