Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Seqtk subseq #85

Merged
merged 30 commits into from
Jul 18, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
30 commits
Select commit Hold shift + click to select a range
1b5d696
Config file and help.txt file
tgaspe Jul 15, 2024
54b23ff
Added script.sh
tgaspe Jul 15, 2024
4157f4b
Created test.sh
tgaspe Jul 15, 2024
9ca7071
Update on test.sh
tgaspe Jul 16, 2024
999c582
update
tgaspe Jul 16, 2024
628a335
Bug fixes
tgaspe Jul 16, 2024
1337b0f
Update test
tgaspe Jul 16, 2024
5b7f207
Update CHANGELOG.md
tgaspe Jul 16, 2024
84f714e
Improvement on test.sh
tgaspe Jul 16, 2024
88cf6d7
Added more test
tgaspe Jul 16, 2024
bd5de0a
Update on tests
tgaspe Jul 17, 2024
cc6746e
Bug fixed
tgaspe Jul 17, 2024
c782e2a
Update CHANGELOG.md
tgaspe Jul 17, 2024
06e1fe8
Fixed Tabular test bug
tgaspe Jul 17, 2024
20ac10a
Strand Aware Test
tgaspe Jul 17, 2024
0059ede
Input validation for list file
tgaspe Jul 17, 2024
aab5679
Sugested Changes
tgaspe Jul 17, 2024
32c084d
Added author info
tgaspe Jul 17, 2024
3dfc028
Merge branch 'main' into seqtk_subseq
rcannood Jul 18, 2024
a842777
Update CHANGELOG.md
rcannood Jul 18, 2024
819bd9b
Update theodoro_gasperin.yaml
rcannood Jul 18, 2024
d05cbf0
add newline
rcannood Jul 18, 2024
c22b383
add newline
rcannood Jul 18, 2024
dfc0a65
Update src/seqtk/seqtk_subseq/config.vsh.yaml
tgaspe Jul 18, 2024
0e9060a
Version Fix
tgaspe Jul 18, 2024
c8e5bb2
Merge branch 'seqtk_subseq' of https://github.com/tgaspe/biobox into …
tgaspe Jul 18, 2024
3a27502
Update on config
tgaspe Jul 18, 2024
d00c944
Helper bed.sh
tgaspe Jul 18, 2024
a276b9e
Deleted _helpers
tgaspe Jul 18, 2024
590e138
don't forget exit when a test fails
rcannood Jul 18, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
30 changes: 11 additions & 19 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,9 +2,16 @@

## NEW FEATURES

* `bd_rhapsody`:
* `bd_rhapsody/bd_rhapsody_make_reference`: Create a reference for the BD Rhapsody pipeline (PR #75).

- `bd_rhapsody/bd_rhapsody_make_reference`: Create a reference for the BD Rhapsody pipeline (PR #75).
* `umitools/umitools_dedup`: Deduplicate reads based on the mapping co-ordinate and the UMI attached to the read (PR #54).

* `seqtk`:
- `seqtk/seqtk_sample`: Subsamples sequences from FASTA/Q files (PR #68).
- `seqtk/seqtk_subseq`: Extract the sequences (complete or subsequence) from the FASTA/FASTQ files
based on a provided sequence IDs or region coordinates file (PR #85).

* `agat/agat_convert_sp_gff2gtf`: convert any GTF/GFF file into a proper GTF file (PR #76).

## MINOR CHANGES

Expand Down Expand Up @@ -38,14 +45,8 @@

* `multiqc`: update multiple separator to `;` (PR #81).

# biobox 0.1.0

## BREAKING CHANGES

* Change default `multiple_sep` to `;` (PR #25). This aligns with an upcoming breaking change in
Viash 0.9.0 in order to avoid issues with the current default separator `:` unintentionally
splitting up certain file paths.

# biobox 0.1.0

## NEW FEATURES

Expand Down Expand Up @@ -94,21 +95,12 @@
- `samtools/samtools_fastq`: Converts a SAM/BAM/CRAM file to FASTQ (PR #52).
- `samtools/samtools_fastq`: Converts a SAM/BAM/CRAM file to FASTA (PR #53).


* `falco`: A C++ drop-in replacement of FastQC to assess the quality of sequence read data (PR #43).

* `seqtk/seqtk_sample`: Sample sequences from FASTA/Q(.gz) files to FASTA/Q (PR #68).

* `umitools`:
- `umitools_dedup`: Deduplicate reads based on the mapping co-ordinate and the UMI attached to the read (PR #54).

* `bedtools`:
- `bedtools_getfasta`: extract sequences from a FASTA file for each of the
intervals defined in a BED/GFF/VCF file (PR #59).

* `agat`:
- `agat_convert_sp_gff2gtf`: convert any GTF/GFF file into a proper GTF file (PR #76).

## MINOR CHANGES

* Uniformize component metadata (PR #23).
Expand All @@ -129,4 +121,4 @@

* Add escaping character before leading hashtag in the description field of the config file (PR #50).

* Format URL in biobase/bcl_convert description (PR #55).
* Format URL in biobase/bcl_convert description (PR #55).
10 changes: 10 additions & 0 deletions src/_authors/theodoro_gasperin.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
name: Theodoro Gasperin Terra Camargo
info:
links:
email: [email protected]
github: tgaspe
linkedin: theodoro-gasperin-terra-camargo
organizations:
- name: Data Intuitive
href: https://www.data-intuitive.com
role: Bioinformatician
78 changes: 78 additions & 0 deletions src/seqtk/seqtk_subseq/config.vsh.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,78 @@
name: seqtk_subseq
namespace: seqtk
description: |
Extract subsequences from FASTA/Q files. Takes as input a FASTA/Q file and a name.lst (sequence ids file) or a reg.bed (genomic regions file).
keywords: [subseq, FASTA, FASTQ]
links:
repository: https://github.com/lh3/seqtk/tree/v1.4
license: MIT
authors:
- __merge__: /src/_authors/theodoro_gasperin.yaml
roles: [ author, maintainer ]

tgaspe marked this conversation as resolved.
Show resolved Hide resolved
argument_groups:
- name: Inputs
arguments:
- name: "--input"
type: file
direction: input
description: The input FASTA/Q file.
required: true
example: input.fa

- name: "--name_list"
type: file
direction: input
description: |
List of sequence names (name.lst) or genomic regions (reg.bed) to extract.
required: true
example: list.lst

- name: Outputs
arguments:
- name: "--output"
alternatives: -o
type: file
direction: output
description: The output FASTA/Q file.
required: true
default: output.fa

- name: Options
arguments:
- name: "--tab"
alternatives: -t
type: boolean_true
description: TAB delimited output.

- name: "--strand_aware"
alternatives: -s
type: boolean_true
description: Strand aware.

- name: "--sequence_line_length"
alternatives: -l
type: integer
description: |
Sequence line length of input fasta file. Default: 0.
example: 0


resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh

engines:
- type: docker
image: quay.io/biocontainers/seqtk:1.4--he4a0461_2
setup:
- type: docker
run: |
echo $(echo $(seqtk 2>&1) | sed -n 's/.*\(Version: [^ ]*\).*/\1/p') > /var/software_versions.txt

runners:
- type: executable
- type: nextflow
9 changes: 9 additions & 0 deletions src/seqtk/seqtk_subseq/help.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
```bash
seqtk subseq
```
Usage: seqtk subseq [options] <in.fa> <in.bed>|<name.list>
Options:
-t TAB delimited output
-s strand aware
-l INT sequence line length [0]
Note: Use 'samtools faidx' if only a few regions are intended.
15 changes: 15 additions & 0 deletions src/seqtk/seqtk_subseq/script.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
#!/bin/bash

## VIASH START
## VIASH END

[[ "$par_tab" == "false" ]] && unset par_tab
[[ "$par_strand_aware" == "false" ]] && unset par_strand_aware

seqtk subseq \
${par_tab:+-t} \
${par_strand_aware:+-s} \
${par_sequence_line_length:+-l "$par_sequence_line_length"} \
"$par_input" \
"$par_name_list" \
> "$par_output"
182 changes: 182 additions & 0 deletions src/seqtk/seqtk_subseq/test.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,182 @@
#!/bin/bash

# exit on error
set -e

## VIASH START
meta_executable="target/executable/seqtk/seqtk_subseq"
meta_resources_dir="src/seqtk"
## VIASH END

# Create directories for tests
echo "Creating Test Data..."
mkdir test_data

# Create and populate input.fasta
cat > "test_data/input.fasta" <<EOL
>KU562861.1
GGAGCAGGAGAGTGTTCGAGTTCAGAGATGTCCATGGCGCCGTACGAGAAGGTGATGGATGACCTGGCCA
AGGGGCAGCAGTTCGCGACGCAGCTGCAGGGCCTCCTCCGGGACTCCCCCAAGGCCGGCCACATCATGGA
>GU056837.1
CTAATTTTATTTTTTTATAATAATTATTGGAGGAACTAAAACATTAATGAAATAATAATTATCATAATTA
TTAATTACATATTTATTAGGTATAATATTTAAGGAAAAATATATTTTATGTTAATTGTAATAATTAGAAC
>CP097510.1
CGATTTAGATCGGTGTAGTCAACACACATCCTCCACTTCCATTAGGCTTCTTGACGAGGACTACATTGAC
AGCCACCGAGGGAACCGACCTCCTCAATGAAGTCAGACGCCAAGAGCCTATCAACTTCCTTCTGCACAGC
>JAMFTS010000002.1
CCTAAACCCTAAACCCTAAACCCCCTACAAACCTTACCCTAAACCCTAAACCCTAAACCCTAAACCCTAA
ACCCGAAACCCTATACCCTAAACCCTAAACCCTAAACCCTAAACCCTAACCCAAACCTAATCCCTAAACC
>MH150936.1
TAGAAGCTAATGAAAACTTTTCCTTTACTAAAAACCGTCAAACACGGTAAGAAACGCTTTTAATCATTTC
AAAAGCAATCCCAATAGTGGTTACATCCAAACAAAACCCATTTCTTATATTTTCTCAAAAACAGTGAGAG
EOL

# Update id.list with new entries
cat > "test_data/id.list" <<EOL
KU562861.1
MH150936.1
EOL

# Create and populate reg.bed
cat > "test_data/reg.bed" <<EOL
KU562861.1$(echo -e "\t")10$(echo -e "\t")20$(echo -e "\t")region$(echo -e "\t")0$(echo -e "\t")+$(echo -e "\n")
MH150936.1$(echo -e "\t")10$(echo -e "\t")20$(echo -e "\t")region$(echo -e "\t")0$(echo -e "\t")-
EOL

#########################################################################################
# Run basic test
mkdir test1
cd test1

echo "> Run seqtk_subseq on FASTA/Q file"
"$meta_executable" \
--input "../test_data/input.fasta" \
--name_list "../test_data/id.list" \
--output "sub_sample.fq"

expected_output_basic=">KU562861.1
GGAGCAGGAGAGTGTTCGAGTTCAGAGATGTCCATGGCGCCGTACGAGAAGGTGATGGATGACCTGGCCAAGGGGCAGCAGTTCGCGACGCAGCTGCAGGGCCTCCTCCGGGACTCCCCCAAGGCCGGCCACATCATGGA
>MH150936.1
TAGAAGCTAATGAAAACTTTTCCTTTACTAAAAACCGTCAAACACGGTAAGAAACGCTTTTAATCATTTCAAAAGCAATCCCAATAGTGGTTACATCCAAACAAAACCCATTTCTTATATTTTCTCAAAAACAGTGAGAG"
output_basic=$(cat sub_sample.fq)

if [ "$output_basic" != "$expected_output_basic" ]; then
echo "Test failed"
echo "Expected:"
echo "$expected_output_basic"
echo "Got:"
echo "$output_basic"
exit 1
fi

#########################################################################################
# Run reg.bed as name list input test
cd ..
mkdir test2
cd test2

echo "> Run seqtk_subseq on FASTA/Q file with BED file as name list"
"$meta_executable" \
--input "../test_data/input.fasta" \
--name_list "../test_data/reg.bed" \
--output "sub_sample.fq"

expected_output_basic=">KU562861.1:11-20
AGTGTTCGAG
>MH150936.1:11-20
TGAAAACTTT"
output_basic=$(cat sub_sample.fq)

if [ "$output_basic" != "$expected_output_basic" ]; then
echo "Test failed"
echo "Expected:"
echo "$expected_output_basic"
echo "Got:"
echo "$output_basic"
exit 1
fi

#########################################################################################
# Run tab option output test
cd ..
mkdir test3
cd test3

echo "> Run seqtk_subseq with TAB option"
"$meta_executable" \
--tab \
--input "../test_data/input.fasta" \
--name_list "../test_data/reg.bed" \
--output "sub_sample.fq"

expected_output_tabular=$'KU562861.1\t11\tAGTGTTCGAG\nMH150936.1\t11\tTGAAAACTTT'
output_tabular=$(cat sub_sample.fq)

if [ "$output_tabular" != "$expected_output_tabular" ]; then
echo "Test failed"
echo "Expected:"
echo "$expected_output_tabular"
echo "Got:"
echo "$output_tabular"
exit 1
fi

#########################################################################################
# Run line option output test
cd ..
mkdir test4
cd test4

echo "> Run seqtk_subseq with line length option"
"$meta_executable" \
--sequence_line_length 5 \
--input "../test_data/input.fasta" \
--name_list "../test_data/reg.bed" \
--output "sub_sample.fq"

expected_output_wrapped=">KU562861.1:11-20
AGTGT
TCGAG
>MH150936.1:11-20
TGAAA
ACTTT"
output_wrapped=$(cat sub_sample.fq)

if [ "$output_wrapped" != "$expected_output_wrapped" ]; then
echo "Test failed"
echo "Expected:"
echo "$expected_output_wrapped"
echo "Got:"
echo "$output_wrapped"
exit 1
fi

#########################################################################################
# Run Strand Aware option output test
cd ..
mkdir test5
cd test5

echo "> Run seqtk_subseq with strand aware option"
"$meta_executable" \
--strand_aware \
--input "../test_data/input.fasta" \
--name_list "../test_data/reg.bed" \
--output "sub_sample.fq"

expected_output_wrapped=">KU562861.1:11-20
AGTGTTCGAG
>MH150936.1:11-20
AAAGTTTTCA"
output_wrapped=$(cat sub_sample.fq)

if [ "$output_wrapped" != "$expected_output_wrapped" ]; then
echo "Test failed"
echo "Expected:"
echo "$expected_output_wrapped"
echo "Got:"
echo "$output_wrapped"
exit 1
fi

echo "All tests succeeded!"