Skip to content

Commit

Permalink
Samtools fasta (viash-hub#53)
Browse files Browse the repository at this point in the history
* initial commit dedup

* Revert "initial commit dedup"

This reverts commit 38f586b.

* Fasta component

* change script resource to samtools_fastq script, with dummy argument to specify the command

* add dummy argument to samtools_fastq to share the script with samtools_fasta

* fix path to script in config

* Update src/samtools/samtools_fastq/script.sh

Co-authored-by: Robrecht Cannoodt <[email protected]>

* Change default fields to examples

* Two more default fields changed to examples

* Minor formatting changes

* Markdown formatting changes in configs

---------

Co-authored-by: Robrecht Cannoodt <[email protected]>
  • Loading branch information
emmarousseau and rcannood authored Jul 5, 2024
1 parent ec69d9a commit 7544a75
Show file tree
Hide file tree
Showing 13 changed files with 450 additions and 23 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -66,6 +66,8 @@
- `samtools/samtools_collate`: Shuffles and groups reads in SAM/BAM/CRAM files together by their names (PR #42).
- `samtools/samtools_view`: Views and converts SAM/BAM/CRAM files (PR #48).
- `samtools/samtools_fastq`: Converts a SAM/BAM/CRAM file to FASTQ (PR #52).
- `samtools/samtools_fastq`: Converts a SAM/BAM/CRAM file to FASTA (PR #53).


* `falco`: A C++ drop-in replacement of FastQC to assess the quality of sequence read data (PR #43).

Expand Down
191 changes: 191 additions & 0 deletions src/samtools/samtools_fasta/config.vsh.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,191 @@
name: samtools_fasta
namespace: samtools
description: Converts a SAM, BAM or CRAM to FASTA format.
keywords: [fasta, bam, sam, cram]
links:
homepage: https://www.htslib.org/
documentation: https://www.htslib.org/doc/samtools-fasta.html
repository: https://github.com/samtools/samtools
references:
doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
license: MIT/Expat

argument_groups:
- name: Inputs
arguments:
- name: --input
type: file
description: input SAM/BAM/CRAM file
required: true
- name: Outputs
arguments:
- name: --output
type: file
description: output FASTA file
required: true
direction: output
- name: Options
arguments:
- name: --no_suffix
alternatives: -n
type: boolean_true
description: |
By default, either '/1' or '/2' is added to the end of read names where the corresponding
READ1 or READ2 FLAG bit is set. Using -n causes read names to be left as they are.
- name: --suffix
alternatives: -N
type: boolean_true
description: |
Always add either '/1' or '/2' to the end of read names even when put into different files.
- name: --use_oq
alternatives: -O
type: boolean_true
description: |
Use quality values from OQ tags in preference to standard quality string if available.
- name: --singleton
alternatives: -s
type: file
description: write singleton reads to FILE.
- name: --copy_tags
alternatives: -t
type: boolean_true
description: |
Copy RG, BC and QT tags to the FASTA header line, if they exist.
- name: --copy_tags_list
alternatives: -T
type: string
description: |
Specify a comma-separated list of tags to copy to the FASTA header line, if they exist.
TAGLIST can be blank or `*` to indicate all tags should be copied to the output. If using `*`,
be careful to quote it to avoid unwanted shell expansion.
- name: --read1
alternatives: -1
type: file
description: |
Write reads with the READ1 FLAG set (and READ2 not set) to FILE instead of outputting them.
If the -s option is used, only paired reads will be written to this file.
direction: output
- name: --read2
alternatives: -2
type: file
description: |
Write reads with the READ2 FLAG set (and READ1 not set) to FILE instead of outputting them.
If the -s option is used, only paired reads will be written to this file.
direction: output
- name: --output_reads
alternatives: -o
type: file
description: |
Write reads with either READ1 FLAG or READ2 flag set to FILE instead of outputting them to stdout.
This is equivalent to -1 FILE -2 FILE.
direction: output
- name: --output_reads_both
alternatives: -0
type: file
description: |
Write reads where the READ1 and READ2 FLAG bits set are either both set or both unset to FILE
instead of outputting them.
direction: output
- name: --filter_flags
alternatives: -f
type: integer
description: |
Only output alignments with all bits set in INT present in the FLAG field. INT can be specified
in hex by beginning with '0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with '0'
(i.e. /^0[0-7]+/). Default: `0`.
example: 0
- name: --excl_flags
alternatives: -F
type: string
description: |
Do not output alignments with any bits set in INT present in the FLAG field. INT can be specified
in hex by beginning with '0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with '0'
(i.e. /^0[0-7]+/). This defaults to 0x900 representing filtering of secondary and
supplementary alignments. Default: `0x900`.
example: "0x900"
- name: --incl_flags
alternatives: --rf
type: string
description: |
Only output alignments with any bits set in INT present in the FLAG field. INT can be specified
in hex by beginning with '0x' (i.e. /^0x[0-9A-F]+/), in octal by beginning with '0'
(i.e. /^0[0-7]+/), as a decimal number not beginning with '0' or as a comma-separated list of
flag names. Default: `0`.
example: 0
- name: --excl_flags_all
alternatives: -G
type: integer
description: |
Only EXCLUDE reads with all of the bits set in INT present in the FLAG field. INT can be specified
in hex by beginning with '0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with '0' (i.e. /^0[0-7]+/).
Default: `0`.
example: 0
- name: --aux_tag
alternatives: -d
type: string
description: |
Only output alignments containing an auxiliary tag matching both TAG and VAL. If VAL is omitted
then any value is accepted. The tag types supported are i, f, Z, A and H. "B" arrays are not
supported. This is comparable to the method used in samtools view --tag. The option may be specified
multiple times and is equivalent to using the --aux_tag_file option.
- name: --aux_tag_file
alternatives: -D
type: string
description: |
Only output alignments containing an auxiliary tag matching TAG and having a value listed in FILE.
The format of the file is one line per value. This is equivalent to specifying --aux_tag multiple times.
- name: --casava
alternatives: -i
type: boolean_true
description: add Illumina Casava 1.8 format entry to header (eg 1:N:0:ATCACG)
- name: --compression
alternatives: -c
type: integer
description: set compression level when writing gz or bgzf fasta files.
example: 0
- name: --index1
alternatives: --i1
type: file
description: write first index reads to FILE.
- name: --index2
alternatives: --i2
type: file
description: write second index reads to FILE.
- name: --barcode_tag
type: string
description: |
Auxiliary tag to find index reads in. Default: `BC`.
example: "BC"
- name: --quality_tag
type: string
description: |
Auxiliary tag to find index quality in. Default: `QT`.
example: "QT"
- name: --index_format
type: string
description: |
string to describe how to parse the barcode and quality tags. For example:
* `i14i8`: the first 14 characters are index 1, the next 8 characters are index 2.
* `n8i14`: ignore the first 8 characters, and use the next 14 characters for index 1.
If the tag contains a separator, then the numeric part can be replaced with`*` to mean
'read until the separator or end of tag', for example: `n*i*`.
resources:
- type: bash_script
path: ../samtools_fastq/script.sh
test_resources:
- type: bash_script
path: test.sh
- type: file
path: test_data
engines:
- type: docker
image: quay.io/biocontainers/samtools:1.19.2--h50ea8bc_1
setup:
- type: docker
run: |
samtools --version 2>&1 | grep -E '^(samtools|Using htslib)' | \
sed 's#Using ##;s# \([0-9\.]*\)$#: \1#' > /var/software_versions.txt
runners:
- type: executable
- type: nextflow
80 changes: 80 additions & 0 deletions src/samtools/samtools_fasta/help.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,80 @@
```
samtools fastq
```

Usage: samtools fastq [options...] <in.bam>

Description:
Converts a SAM, BAM or CRAM to FASTQ format.

Options:
-0 FILE write reads designated READ_OTHER to FILE
-1 FILE write reads designated READ1 to FILE
-2 FILE write reads designated READ2 to FILE
-o FILE write reads designated READ1 or READ2 to FILE
note: if a singleton file is specified with -s, only
paired reads will be written to the -1 and -2 files.
-d, --tag TAG[:VAL]
only include reads containing TAG, optionally with value VAL
-f, --require-flags INT
only include reads with all of the FLAGs in INT present [0]
-F, --excl[ude]-flags INT
only include reads with none of the FLAGs in INT present [0x900]
--rf, --incl[ude]-flags INT
only include reads with any of the FLAGs in INT present [0]
-G INT only EXCLUDE reads with all of the FLAGs in INT present [0]
-n don't append /1 and /2 to the read name
-N always append /1 and /2 to the read name
-O output quality in the OQ tag if present
-s FILE write singleton reads designated READ1 or READ2 to FILE
-t copy RG, BC and QT tags to the FASTQ header line
-T TAGLIST copy arbitrary tags to the FASTQ header line, '*' for all
-v INT default quality score if not given in file [1]
-i add Illumina Casava 1.8 format entry to header (eg 1:N:0:ATCACG)
-c INT compression level [0..9] to use when writing bgzf files [1]
--i1 FILE write first index reads to FILE
--i2 FILE write second index reads to FILE
--barcode-tag TAG
Barcode tag [BC]
--quality-tag TAG
Quality tag [QT]
--index-format STR
How to parse barcode and quality tags

--input-fmt-option OPT[=VAL]
Specify a single input file format option in the form
of OPTION or OPTION=VALUE
--reference FILE
Reference sequence FASTA FILE [null]
-@, --threads INT
Number of additional threads to use [0]
--verbosity INT
Set level of verbosity

The files will be automatically compressed if the file names have a .gz
or .bgzf extension. The input to this program must be collated by name.
Run 'samtools collate' or 'samtools sort -n' to achieve this.

Reads are designated READ1 if FLAG READ1 is set and READ2 is not set.
Reads are designated READ2 if FLAG READ1 is not set and READ2 is set.
Otherwise reads are designated READ_OTHER (both flags set or both flags unset).
Run 'samtools flags' for more information on flag codes and meanings.

The index-format string describes how to parse the barcode and quality tags.
It is made up of 'i' or 'n' followed by a length or '*'. For example:
i14i8 The first 14 characters are index 1, the next 8 are index 2
n8i14 Ignore the first 8 characters, and use the next 14 for index 1

If the tag contains a separator, then the numeric part can be replaced with
'*' to mean 'read until the separator or end of tag', for example:
i*i* Break the tag at the separator into index 1 and index 2
n*i* Ignore the left part of the tag until the separator,
then use the second part of the tag as index 1

Examples:
To get just the paired reads in separate files, use:
samtools fastq -1 pair1.fq -2 pair2.fq -0 /dev/null -s /dev/null -n in.bam

To get all non-supplementary/secondary reads in a single file, redirect
the output:
samtools fastq in.bam > all_reads.fq
96 changes: 96 additions & 0 deletions src/samtools/samtools_fasta/test.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,96 @@
#!/bin/bash

test_dir="${meta_resources_dir}/test_data"
out_dir="${meta_resources_dir}/out_data"

############################################################################################

echo ">>> Test 1: Convert all reads from a bam file to fasta format"
"$meta_executable" \
--input "$test_dir/a.bam" \
--output "$out_dir/a.fa"

echo ">>> Check if output file exists"
[ ! -f "$out_dir/a.fa" ] && echo "Output file a.fa does not exist" && exit 1

echo ">>> Check if output is empty"
[ ! -s "$out_dir/a.fa" ] && echo "Output file a.fa is empty" && exit 1

echo ">>> Check if output matches expected output"
diff "$out_dir/a.fa" "$test_dir/a.fa" ||
(echo "Output file a.fa does not match expected output" && exit 1)

rm "$out_dir/a.fa"

############################################################################################

echo ">>> Test 2: Convert all reads from a sam file to fasta format"
"$meta_executable" \
--input "$test_dir/a.sam" \
--output "$out_dir/a.fa"

echo ">>> Check if output file exists"
[ ! -f "$out_dir/a.fa" ] && echo "Output file a.fa does not exist" && exit 1

echo ">>> Check if output is empty"
[ ! -s "$out_dir/a.fa" ] && echo "Output file a.fa is empty" && exit 1

echo ">>> Check if output matches expected output"
diff "$out_dir/a.fa" "$test_dir/a.fa" ||
(echo "Output file a.fa does not match expected output" && exit 1)

rm "$out_dir/a.fa"

############################################################################################

echo ">>> Test 3: Output reads from bam file to separate files"

"$meta_executable" \
--input "$test_dir/a.bam" \
--read1 "$out_dir/a.1.fa" \
--read2 "$out_dir/a.2.fa" \
--output "$out_dir/a.fa"

echo ">>> Check if output files exist"
[ ! -f "$out_dir/a.1.fa" ] && echo "Output file a.1.fa does not exist" && exit 1
[ ! -f "$out_dir/a.2.fa" ] && echo "Output file a.2.fa does not exist" && exit 1
[ ! -f "$out_dir/a.fa" ] && echo "Output file a.fa does not exist" && exit 1

echo ">>> Check if output files are empty"
[ ! -s "$out_dir/a.1.fa" ] && echo "Output file a.1.fa is empty" && exit 1
[ ! -s "$out_dir/a.2.fa" ] && echo "Output file a.2.fa is empty" && exit 1
# output should be empty since input has no singleton reads

echo ">>> Check if output files match expected output"
diff "$out_dir/a.1.fa" "$test_dir/a.1.fa" ||
(echo "Output file a.1.fa does not match expected output" && exit 1)
diff "$out_dir/a.2.fa" "$test_dir/a.2.fa" ||
(echo "Output file a.2.fa does not match expected output" && exit 1)

rm "$out_dir/a.1.fa" "$out_dir/a.2.fa" "$out_dir/a.fa"

############################################################################################

echo ">>> Test 4: Output only forward reads from bam file to fasta format"

"$meta_executable" \
--input "$test_dir/a.sam" \
--excl_flags "0x80" \
--output "$out_dir/half.fa"

echo ">>> Check if output file exists"
[ ! -f "$out_dir/half.fa" ] && echo "Output file half.fa does not exist" && exit 1

echo ">>> Check if output is empty"
[ ! -s "$out_dir/half.fa" ] && echo "Output file half.fa is empty" && exit 1

echo ">>> Check if output matches expected output"
diff "$out_dir/half.fa" "$test_dir/half.fa" ||
(echo "Output file half.fa does not match expected output" && exit 1)

rm "$out_dir/half.fa"

############################################################################################

echo "All tests succeeded!"
exit 0
6 changes: 6 additions & 0 deletions src/samtools/samtools_fasta/test_data/a.1.fa
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
>a1
AAAAAAAAAA
>b1
AAAAAAAAAA
>c1
AAAAAAAAAA
6 changes: 6 additions & 0 deletions src/samtools/samtools_fasta/test_data/a.2.fa
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
>a1
AAAAAAAAAA
>b1
AAAAAAAAAA
>c1
AAAAAAAAAA
Binary file added src/samtools/samtools_fasta/test_data/a.bam
Binary file not shown.
Loading

0 comments on commit 7544a75

Please sign in to comment.