Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Seqtk subseq #85

Merged
merged 30 commits into from
Jul 18, 2024
Merged
Show file tree
Hide file tree
Changes from 16 commits
Commits
Show all changes
30 commits
Select commit Hold shift + click to select a range
1b5d696
Config file and help.txt file
tgaspe Jul 15, 2024
54b23ff
Added script.sh
tgaspe Jul 15, 2024
4157f4b
Created test.sh
tgaspe Jul 15, 2024
9ca7071
Update on test.sh
tgaspe Jul 16, 2024
999c582
update
tgaspe Jul 16, 2024
628a335
Bug fixes
tgaspe Jul 16, 2024
1337b0f
Update test
tgaspe Jul 16, 2024
5b7f207
Update CHANGELOG.md
tgaspe Jul 16, 2024
84f714e
Improvement on test.sh
tgaspe Jul 16, 2024
88cf6d7
Added more test
tgaspe Jul 16, 2024
bd5de0a
Update on tests
tgaspe Jul 17, 2024
cc6746e
Bug fixed
tgaspe Jul 17, 2024
c782e2a
Update CHANGELOG.md
tgaspe Jul 17, 2024
06e1fe8
Fixed Tabular test bug
tgaspe Jul 17, 2024
20ac10a
Strand Aware Test
tgaspe Jul 17, 2024
0059ede
Input validation for list file
tgaspe Jul 17, 2024
aab5679
Sugested Changes
tgaspe Jul 17, 2024
32c084d
Added author info
tgaspe Jul 17, 2024
3dfc028
Merge branch 'main' into seqtk_subseq
rcannood Jul 18, 2024
a842777
Update CHANGELOG.md
rcannood Jul 18, 2024
819bd9b
Update theodoro_gasperin.yaml
rcannood Jul 18, 2024
d05cbf0
add newline
rcannood Jul 18, 2024
c22b383
add newline
rcannood Jul 18, 2024
dfc0a65
Update src/seqtk/seqtk_subseq/config.vsh.yaml
tgaspe Jul 18, 2024
0e9060a
Version Fix
tgaspe Jul 18, 2024
c8e5bb2
Merge branch 'seqtk_subseq' of https://github.com/tgaspe/biobox into …
tgaspe Jul 18, 2024
3a27502
Update on config
tgaspe Jul 18, 2024
d00c944
Helper bed.sh
tgaspe Jul 18, 2024
a276b9e
Deleted _helpers
tgaspe Jul 18, 2024
590e138
don't forget exit when a test fails
rcannood Jul 18, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -80,6 +80,10 @@
- `bedtools_getfasta`: extract sequences from a FASTA file for each of the
intervals defined in a BED/GFF/VCF file (PR #59).

* `seqtk`:
- `subseq`: Extract the sequences (complete or subsequence) from the FASTA/FASTQ files
based on a provided sequence IDs or region coordinates file (PR #85).

## MINOR CHANGES

* Uniformize component metadata (PR #23).
Expand Down
76 changes: 76 additions & 0 deletions src/seqtk/seqtk_subseq/config.vsh.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,76 @@
name: seqtk_subseq
namespace: seqtk
description: |
Extract subsequences from FASTA/Q files. Takes as input a FASTA/Q file and a name.lst (sequence ids file) or a reg.bed (genomic regions file).
keywords: [subseq, FASTA, FASTQ]
links:
repository: https://github.com/lh3/seqtk/tree/v1.4
license: MIT

tgaspe marked this conversation as resolved.
Show resolved Hide resolved
argument_groups:
- name: Inputs
arguments:
- name: "--input"
type: file
direction: input
description: The input FASTA/Q file.
required: true
example: input.fa

- name: "--name_list"
type: file
direction: input
description: |
List of sequence names (name.lst) or genomic regions (reg.bed) to extract.
required: true
example: list.lst

- name: Outputs
arguments:
- name: "--output"
alternatives: -o
type: file
direction: output
description: The output FASTA/Q file.
required: true
default: output.fa

- name: Options
arguments:
- name: "--tab"
alternatives: -t
type: boolean_true
description: TAB delimited output.

- name: "--strand_aware"
alternatives: -s
type: boolean_true
description: Strand aware.

- name: "--sequence_line_length"
alternatives: -l
type: integer
description: Sequence line length of input fasta file.
example: 16
tgaspe marked this conversation as resolved.
Show resolved Hide resolved


resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- type: file
path: test_data

engines:
- type: docker
image: quay.io/biocontainers/seqtk:1.4--he4a0461_2
setup:
- type: docker
run: |
echo "seqtk version: 1.4 (r122)" > /var/software_versions.txt
tgaspe marked this conversation as resolved.
Show resolved Hide resolved

runners:
- type: executable
- type: nextflow
9 changes: 9 additions & 0 deletions src/seqtk/seqtk_subseq/help.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
```bash
seqtk subseq
```
Usage: seqtk subseq [options] <in.fa> <in.bed>|<name.list>
Options:
-t TAB delimited output
-s strand aware
-l INT sequence line length [0]
Note: Use 'samtools faidx' if only a few regions are intended.
91 changes: 91 additions & 0 deletions src/seqtk/seqtk_subseq/script.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,91 @@
#!/bin/bash

## VIASH START
## VIASH END

# Function to check if a file is a valid BED file
check_bed_file() {
local file="$1"
num_columns=$(head -n 1 "$file" | cut -f 1- | tr '\t' '\n' | wc -l)

# Check if the file exists
if [[ ! -f "$file" ]]; then
echo "Error: The specified file does not exist."
return 1
fi

# Check if the file is non-empty
if [[ ! -s "$file" ]]; then
echo "Error: The specified file is empty."
return 1
fi

# Check if the file is a valid BED file (minimum 3 tab-separated columns)
if [[ $num_columns -lt 3 ]]; then
echo "The specified file is not a valid BED file. It should have at least three tab-separated columns."
return 1
fi

# Additional check: Ensure that the 6th column (if present) is either + or -
if [[ $num_columns -eq 6 ]]; then
while IFS=$'\t' read -r -a columns; do
columns[5]=${columns[5]%$'\n'}
if [[ ${columns[5]} == "+" || ${columns[5]} == "-" ]]; then
return 0
else
echo "Error: The 6th column of the specified BED file should be either + or -."
echo "Offending line: ${columns[5]}"
return 1
fi
done < "$file"
fi

return 0
}

# Function to check if a file is a valid list of FASTA file IDs
check_fasta_id_list() {
local file="$1"

# Check if the file exists
if [[ ! -f "$file" ]]; then
echo "Error: The specified file does not exist."
return 1
fi

# Check if the file is non-empty
if [[ ! -s "$file" ]]; then
echo "Error: The specified file is empty."
return 1
fi

# Additional check: Ensure that each line contains only one word (FASTA ID)
if ! awk 'NF != 1 { exit 1 }' "$file"; then
return 1
fi

return 0
}

# Check if the par_name_list is given and validate accordingly
if [[ -n "$par_name_list" ]]; then
if check_fasta_id_list "$par_name_list"; then
echo "The specified file is a valid list of FASTA IDs."
elif check_bed_file "$par_name_list"; then
echo "The specified file is a valid BED file."
else
echo "Error: The specified file is neither a valid BED file nor a valid list of FASTA IDs."
exit 1
fi
fi
jakubmajercik marked this conversation as resolved.
Show resolved Hide resolved

[[ "$par_tab" == "false" ]] && unset par_tab
[[ "$par_strand_aware" == "false" ]] && unset par_strand_aware

seqtk subseq \
${par_tab:+-t} \
${par_strand_aware:+-s} \
${par_sequence_line_length:+-l "$par_sequence_line_length"} \
"$par_input" \
"$par_name_list" \
> "$par_output"
154 changes: 154 additions & 0 deletions src/seqtk/seqtk_subseq/test.sh
tgaspe marked this conversation as resolved.
Show resolved Hide resolved
tgaspe marked this conversation as resolved.
Show resolved Hide resolved
Original file line number Diff line number Diff line change
@@ -0,0 +1,154 @@
#!/bin/bash

# exit on error
set -e

## VIASH START
meta_executable="target/executable/seqtk/seqtk_subseq"
meta_resources_dir="src/seqtk"
## VIASH END

# TODO:
# - Fix Tab option test
# - Add strand aware test (create new fasta file with right configuration)

#########################################################################################
# Run basic test
mkdir test1
cd test1

echo "> Run seqtk_subseq on FASTA/Q file"
"$meta_executable" \
--input "$meta_resources_dir/test_data/input.fasta" \
--name_list "$meta_resources_dir/test_data/id.list" \
--output "sub_sample.fq"

expected_output_basic=">KU562861.1
GGAGCAGGAGAGTGTTCGAGTTCAGAGATGTCCATGGCGCCGTACGAGAAGGTGATGGATGACCTGGCCAAGGGGCAGCAGTTCGCGACGCAGCTGCAGGGCCTCCTCCGGGACTCCCCCAAGGCCGGCCACATCATGGA
>MH150936.1
TAGAAGCTAATGAAAACTTTTCCTTTACTAAAAACCGTCAAACACGGTAAGAAACGCTTTTAATCATTTCAAAAGCAATCCCAATAGTGGTTACATCCAAACAAAACCCATTTCTTATATTTTCTCAAAAACAGTGAGAG"
output_basic=$(cat sub_sample.fq)

if [ "$output_basic" == "$expected_output_basic" ]; then
echo "Test passed"
else
echo "Test failed"
echo "Expected:"
echo "$expected_output_basic"
echo "Got:"
echo "$output_basic"
fi

#########################################################################################
# Run reg.bed as name list input test
cd ..
mkdir test2
cd test2

echo "> Run seqtk_subseq on FASTA/Q file with BED file as name list"
"$meta_executable" \
--input "$meta_resources_dir/test_data/input.fasta" \
--name_list "$meta_resources_dir/test_data/reg.bed" \
--output "sub_sample.fq"

expected_output_basic=">KU562861.1:11-20
AGTGTTCGAG
>MH150936.1:11-20
TGAAAACTTT"
output_basic=$(cat sub_sample.fq)

if [ "$output_basic" == "$expected_output_basic" ]; then
echo "Test passed"
else
echo "Test failed"
echo "Expected:"
echo "$expected_output_basic"
echo "Got:"
echo "$output_basic"
fi

#########################################################################################
# Run tab option output test
cd ..
mkdir test3
cd test3

echo "> Run seqtk_subseq with TAB option"
"$meta_executable" \
--tab \
--input "$meta_resources_dir/test_data/input.fasta" \
--name_list "$meta_resources_dir/test_data/reg.bed" \
--output "sub_sample.fq"

expected_output_tabular=$'KU562861.1\t11\tAGTGTTCGAG\nMH150936.1\t11\tTGAAAACTTT'
output_tabular=$(cat sub_sample.fq)

if [ "$output_tabular" == "$expected_output_tabular" ]; then
echo "Test passed"
else
echo "Test failed"
echo "Expected:"
echo "$expected_output_tabular"
echo "Got:"
echo "$output_tabular"
fi

#########################################################################################
# Run line option output test
cd ..
mkdir test4
cd test4

echo "> Run seqtk_subseq with line length option"
"$meta_executable" \
--sequence_line_length 5 \
--input "$meta_resources_dir/test_data/input.fasta" \
--name_list "$meta_resources_dir/test_data/reg.bed" \
--output "sub_sample.fq"

expected_output_wrapped=">KU562861.1:11-20
AGTGT
TCGAG
>MH150936.1:11-20
TGAAA
ACTTT"
output_wrapped=$(cat sub_sample.fq)

if [ "$output_wrapped" == "$expected_output_wrapped" ]; then
echo "Test passed"
else
echo "Test failed"
echo "Expected:"
echo "$expected_output_wrapped"
echo "Got:"
echo "$output_wrapped"
fi

#########################################################################################
# Run Strand Aware option output test
cd ..
mkdir test5
cd test5

echo "> Run seqtk_subseq with strand aware option"
"$meta_executable" \
--strand_aware \
--input "$meta_resources_dir/test_data/input.fasta" \
--name_list "$meta_resources_dir/test_data/reg.bed" \
--output "sub_sample.fq"

expected_output_wrapped=">KU562861.1:11-20
AGTGTTCGAG
>MH150936.1:11-20
AAAGTTTTCA"
output_wrapped=$(cat sub_sample.fq)

if [ "$output_wrapped" == "$expected_output_wrapped" ]; then
echo "Test passed"
else
echo "Test failed"
echo "Expected:"
echo "$expected_output_wrapped"
echo "Got:"
echo "$output_wrapped"
fi
2 changes: 2 additions & 0 deletions src/seqtk/seqtk_subseq/test_data/id.list
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
KU562861.1
MH150936.1
15 changes: 15 additions & 0 deletions src/seqtk/seqtk_subseq/test_data/input.fasta
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
>KU562861.1
GGAGCAGGAGAGTGTTCGAGTTCAGAGATGTCCATGGCGCCGTACGAGAAGGTGATGGATGACCTGGCCA
AGGGGCAGCAGTTCGCGACGCAGCTGCAGGGCCTCCTCCGGGACTCCCCCAAGGCCGGCCACATCATGGA
>GU056837.1
CTAATTTTATTTTTTTATAATAATTATTGGAGGAACTAAAACATTAATGAAATAATAATTATCATAATTA
TTAATTACATATTTATTAGGTATAATATTTAAGGAAAAATATATTTTATGTTAATTGTAATAATTAGAAC
>CP097510.1
CGATTTAGATCGGTGTAGTCAACACACATCCTCCACTTCCATTAGGCTTCTTGACGAGGACTACATTGAC
AGCCACCGAGGGAACCGACCTCCTCAATGAAGTCAGACGCCAAGAGCCTATCAACTTCCTTCTGCACAGC
>JAMFTS010000002.1
CCTAAACCCTAAACCCTAAACCCCCTACAAACCTTACCCTAAACCCTAAACCCTAAACCCTAAACCCTAA
ACCCGAAACCCTATACCCTAAACCCTAAACCCTAAACCCTAAACCCTAACCCAAACCTAATCCCTAAACC
>MH150936.1
TAGAAGCTAATGAAAACTTTTCCTTTACTAAAAACCGTCAAACACGGTAAGAAACGCTTTTAATCATTTC
AAAAGCAATCCCAATAGTGGTTACATCCAAACAAAACCCATTTCTTATATTTTCTCAAAAACAGTGAGAG
3 changes: 3 additions & 0 deletions src/seqtk/seqtk_subseq/test_data/reg.bed
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
KU562861.1 10 20 region 0 +
MH150936.1 10 20 region 0 -