Skip to content

Commit

Permalink
Seqtk subseq (#85)
Browse files Browse the repository at this point in the history
* Config file and help.txt file

Created:
- config.vsh.yaml
- help.txt

* Added script.sh

File added:
- script.sh

* Created test.sh

updates:
- changes to config.vsh.yaml
- created test.sh
- created some test files

Problems:
- there is some error in config file that is preventing me from running the component and testing

* Update on test.sh

* update

* Bug fixes

- config required: false bug

* Update test

* Update CHANGELOG.md

* Improvement on test.sh

* Added more test

I tried out different option of the command with different fasta and fastq files and different list, but the output does not seem to change.

* Update on tests

- got unstuck
- I need to create a docker image with the lastest version of seqtk
-

* Bug fixed

- removed some test files
- fixed bug with the help of Toni
- added correct software_versions.txt to config

Still Needs:
- add one more test to strand aware
- fix tab test

* Update CHANGELOG.md

* Fixed Tabular test bug

* Strand Aware Test

- implementation of strand aware test
- change of format for reg.bed file

* Input validation for list file

- input validation for name_list parameter

* Sugested Changes

- removed test_data dir
- removed input validation

* Added author info

* Update CHANGELOG.md

* Update theodoro_gasperin.yaml

* add newline

* add newline

* Update src/seqtk/seqtk_subseq/config.vsh.yaml

Co-authored-by: Robrecht Cannoodt <[email protected]>

* Version Fix

* Update on config

* Helper bed.sh

* Deleted _helpers

* don't forget exit when a test fails

---------

Co-authored-by: Robrecht Cannoodt <[email protected]>
  • Loading branch information
tgaspe and rcannood authored Jul 18, 2024
1 parent e8b82b5 commit 2f9c7f7
Show file tree
Hide file tree
Showing 6 changed files with 305 additions and 19 deletions.
30 changes: 11 additions & 19 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,9 +2,16 @@

## NEW FEATURES

* `bd_rhapsody`:
* `bd_rhapsody/bd_rhapsody_make_reference`: Create a reference for the BD Rhapsody pipeline (PR #75).

- `bd_rhapsody/bd_rhapsody_make_reference`: Create a reference for the BD Rhapsody pipeline (PR #75).
* `umitools/umitools_dedup`: Deduplicate reads based on the mapping co-ordinate and the UMI attached to the read (PR #54).

* `seqtk`:
- `seqtk/seqtk_sample`: Subsamples sequences from FASTA/Q files (PR #68).
- `seqtk/seqtk_subseq`: Extract the sequences (complete or subsequence) from the FASTA/FASTQ files
based on a provided sequence IDs or region coordinates file (PR #85).

* `agat/agat_convert_sp_gff2gtf`: convert any GTF/GFF file into a proper GTF file (PR #76).

## MINOR CHANGES

Expand Down Expand Up @@ -38,14 +45,8 @@

* `multiqc`: update multiple separator to `;` (PR #81).

# biobox 0.1.0

## BREAKING CHANGES

* Change default `multiple_sep` to `;` (PR #25). This aligns with an upcoming breaking change in
Viash 0.9.0 in order to avoid issues with the current default separator `:` unintentionally
splitting up certain file paths.

# biobox 0.1.0

## NEW FEATURES

Expand Down Expand Up @@ -94,21 +95,12 @@
- `samtools/samtools_fastq`: Converts a SAM/BAM/CRAM file to FASTQ (PR #52).
- `samtools/samtools_fastq`: Converts a SAM/BAM/CRAM file to FASTA (PR #53).


* `falco`: A C++ drop-in replacement of FastQC to assess the quality of sequence read data (PR #43).

* `seqtk/seqtk_sample`: Sample sequences from FASTA/Q(.gz) files to FASTA/Q (PR #68).

* `umitools`:
- `umitools_dedup`: Deduplicate reads based on the mapping co-ordinate and the UMI attached to the read (PR #54).

* `bedtools`:
- `bedtools_getfasta`: extract sequences from a FASTA file for each of the
intervals defined in a BED/GFF/VCF file (PR #59).

* `agat`:
- `agat_convert_sp_gff2gtf`: convert any GTF/GFF file into a proper GTF file (PR #76).

## MINOR CHANGES

* Uniformize component metadata (PR #23).
Expand All @@ -129,4 +121,4 @@

* Add escaping character before leading hashtag in the description field of the config file (PR #50).

* Format URL in biobase/bcl_convert description (PR #55).
* Format URL in biobase/bcl_convert description (PR #55).
10 changes: 10 additions & 0 deletions src/_authors/theodoro_gasperin.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
name: Theodoro Gasperin Terra Camargo
info:
links:
email: [email protected]
github: tgaspe
linkedin: theodoro-gasperin-terra-camargo
organizations:
- name: Data Intuitive
href: https://www.data-intuitive.com
role: Bioinformatician
78 changes: 78 additions & 0 deletions src/seqtk/seqtk_subseq/config.vsh.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,78 @@
name: seqtk_subseq
namespace: seqtk
description: |
Extract subsequences from FASTA/Q files. Takes as input a FASTA/Q file and a name.lst (sequence ids file) or a reg.bed (genomic regions file).
keywords: [subseq, FASTA, FASTQ]
links:
repository: https://github.com/lh3/seqtk/tree/v1.4
license: MIT
authors:
- __merge__: /src/_authors/theodoro_gasperin.yaml
roles: [ author, maintainer ]

argument_groups:
- name: Inputs
arguments:
- name: "--input"
type: file
direction: input
description: The input FASTA/Q file.
required: true
example: input.fa

- name: "--name_list"
type: file
direction: input
description: |
List of sequence names (name.lst) or genomic regions (reg.bed) to extract.
required: true
example: list.lst

- name: Outputs
arguments:
- name: "--output"
alternatives: -o
type: file
direction: output
description: The output FASTA/Q file.
required: true
default: output.fa

- name: Options
arguments:
- name: "--tab"
alternatives: -t
type: boolean_true
description: TAB delimited output.

- name: "--strand_aware"
alternatives: -s
type: boolean_true
description: Strand aware.

- name: "--sequence_line_length"
alternatives: -l
type: integer
description: |
Sequence line length of input fasta file. Default: 0.
example: 0


resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh

engines:
- type: docker
image: quay.io/biocontainers/seqtk:1.4--he4a0461_2
setup:
- type: docker
run: |
echo $(echo $(seqtk 2>&1) | sed -n 's/.*\(Version: [^ ]*\).*/\1/p') > /var/software_versions.txt
runners:
- type: executable
- type: nextflow
9 changes: 9 additions & 0 deletions src/seqtk/seqtk_subseq/help.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
```bash
seqtk subseq
```
Usage: seqtk subseq [options] <in.fa> <in.bed>|<name.list>
Options:
-t TAB delimited output
-s strand aware
-l INT sequence line length [0]
Note: Use 'samtools faidx' if only a few regions are intended.
15 changes: 15 additions & 0 deletions src/seqtk/seqtk_subseq/script.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
#!/bin/bash

## VIASH START
## VIASH END

[[ "$par_tab" == "false" ]] && unset par_tab
[[ "$par_strand_aware" == "false" ]] && unset par_strand_aware

seqtk subseq \
${par_tab:+-t} \
${par_strand_aware:+-s} \
${par_sequence_line_length:+-l "$par_sequence_line_length"} \
"$par_input" \
"$par_name_list" \
> "$par_output"
182 changes: 182 additions & 0 deletions src/seqtk/seqtk_subseq/test.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,182 @@
#!/bin/bash

# exit on error
set -e

## VIASH START
meta_executable="target/executable/seqtk/seqtk_subseq"
meta_resources_dir="src/seqtk"
## VIASH END

# Create directories for tests
echo "Creating Test Data..."
mkdir test_data

# Create and populate input.fasta
cat > "test_data/input.fasta" <<EOL
>KU562861.1
GGAGCAGGAGAGTGTTCGAGTTCAGAGATGTCCATGGCGCCGTACGAGAAGGTGATGGATGACCTGGCCA
AGGGGCAGCAGTTCGCGACGCAGCTGCAGGGCCTCCTCCGGGACTCCCCCAAGGCCGGCCACATCATGGA
>GU056837.1
CTAATTTTATTTTTTTATAATAATTATTGGAGGAACTAAAACATTAATGAAATAATAATTATCATAATTA
TTAATTACATATTTATTAGGTATAATATTTAAGGAAAAATATATTTTATGTTAATTGTAATAATTAGAAC
>CP097510.1
CGATTTAGATCGGTGTAGTCAACACACATCCTCCACTTCCATTAGGCTTCTTGACGAGGACTACATTGAC
AGCCACCGAGGGAACCGACCTCCTCAATGAAGTCAGACGCCAAGAGCCTATCAACTTCCTTCTGCACAGC
>JAMFTS010000002.1
CCTAAACCCTAAACCCTAAACCCCCTACAAACCTTACCCTAAACCCTAAACCCTAAACCCTAAACCCTAA
ACCCGAAACCCTATACCCTAAACCCTAAACCCTAAACCCTAAACCCTAACCCAAACCTAATCCCTAAACC
>MH150936.1
TAGAAGCTAATGAAAACTTTTCCTTTACTAAAAACCGTCAAACACGGTAAGAAACGCTTTTAATCATTTC
AAAAGCAATCCCAATAGTGGTTACATCCAAACAAAACCCATTTCTTATATTTTCTCAAAAACAGTGAGAG
EOL

# Update id.list with new entries
cat > "test_data/id.list" <<EOL
KU562861.1
MH150936.1
EOL

# Create and populate reg.bed
cat > "test_data/reg.bed" <<EOL
KU562861.1$(echo -e "\t")10$(echo -e "\t")20$(echo -e "\t")region$(echo -e "\t")0$(echo -e "\t")+$(echo -e "\n")
MH150936.1$(echo -e "\t")10$(echo -e "\t")20$(echo -e "\t")region$(echo -e "\t")0$(echo -e "\t")-
EOL

#########################################################################################
# Run basic test
mkdir test1
cd test1

echo "> Run seqtk_subseq on FASTA/Q file"
"$meta_executable" \
--input "../test_data/input.fasta" \
--name_list "../test_data/id.list" \
--output "sub_sample.fq"

expected_output_basic=">KU562861.1
GGAGCAGGAGAGTGTTCGAGTTCAGAGATGTCCATGGCGCCGTACGAGAAGGTGATGGATGACCTGGCCAAGGGGCAGCAGTTCGCGACGCAGCTGCAGGGCCTCCTCCGGGACTCCCCCAAGGCCGGCCACATCATGGA
>MH150936.1
TAGAAGCTAATGAAAACTTTTCCTTTACTAAAAACCGTCAAACACGGTAAGAAACGCTTTTAATCATTTCAAAAGCAATCCCAATAGTGGTTACATCCAAACAAAACCCATTTCTTATATTTTCTCAAAAACAGTGAGAG"
output_basic=$(cat sub_sample.fq)

if [ "$output_basic" != "$expected_output_basic" ]; then
echo "Test failed"
echo "Expected:"
echo "$expected_output_basic"
echo "Got:"
echo "$output_basic"
exit 1
fi

#########################################################################################
# Run reg.bed as name list input test
cd ..
mkdir test2
cd test2

echo "> Run seqtk_subseq on FASTA/Q file with BED file as name list"
"$meta_executable" \
--input "../test_data/input.fasta" \
--name_list "../test_data/reg.bed" \
--output "sub_sample.fq"

expected_output_basic=">KU562861.1:11-20
AGTGTTCGAG
>MH150936.1:11-20
TGAAAACTTT"
output_basic=$(cat sub_sample.fq)

if [ "$output_basic" != "$expected_output_basic" ]; then
echo "Test failed"
echo "Expected:"
echo "$expected_output_basic"
echo "Got:"
echo "$output_basic"
exit 1
fi

#########################################################################################
# Run tab option output test
cd ..
mkdir test3
cd test3

echo "> Run seqtk_subseq with TAB option"
"$meta_executable" \
--tab \
--input "../test_data/input.fasta" \
--name_list "../test_data/reg.bed" \
--output "sub_sample.fq"

expected_output_tabular=$'KU562861.1\t11\tAGTGTTCGAG\nMH150936.1\t11\tTGAAAACTTT'
output_tabular=$(cat sub_sample.fq)

if [ "$output_tabular" != "$expected_output_tabular" ]; then
echo "Test failed"
echo "Expected:"
echo "$expected_output_tabular"
echo "Got:"
echo "$output_tabular"
exit 1
fi

#########################################################################################
# Run line option output test
cd ..
mkdir test4
cd test4

echo "> Run seqtk_subseq with line length option"
"$meta_executable" \
--sequence_line_length 5 \
--input "../test_data/input.fasta" \
--name_list "../test_data/reg.bed" \
--output "sub_sample.fq"

expected_output_wrapped=">KU562861.1:11-20
AGTGT
TCGAG
>MH150936.1:11-20
TGAAA
ACTTT"
output_wrapped=$(cat sub_sample.fq)

if [ "$output_wrapped" != "$expected_output_wrapped" ]; then
echo "Test failed"
echo "Expected:"
echo "$expected_output_wrapped"
echo "Got:"
echo "$output_wrapped"
exit 1
fi

#########################################################################################
# Run Strand Aware option output test
cd ..
mkdir test5
cd test5

echo "> Run seqtk_subseq with strand aware option"
"$meta_executable" \
--strand_aware \
--input "../test_data/input.fasta" \
--name_list "../test_data/reg.bed" \
--output "sub_sample.fq"

expected_output_wrapped=">KU562861.1:11-20
AGTGTTCGAG
>MH150936.1:11-20
AAAGTTTTCA"
output_wrapped=$(cat sub_sample.fq)

if [ "$output_wrapped" != "$expected_output_wrapped" ]; then
echo "Test failed"
echo "Expected:"
echo "$expected_output_wrapped"
echo "Got:"
echo "$output_wrapped"
exit 1
fi

echo "All tests succeeded!"

0 comments on commit 2f9c7f7

Please sign in to comment.