Samtools sort (viash-hub#36)
* Initial version of samtools sort, no tests

* Add tests, final touches

* Update changelog

* Update src/samtools/samtools_sort/config.vsh.yaml

Remove "must_exist: false" since that is the default value

Co-authored-by: Robrecht Cannoodt <[email protected]>

* Clean up test script, update changelog

* Minor changes, paths, config and script

---------

Co-authored-by: Robrecht Cannoodt <[email protected]>
emmarousseau and rcannood committed Apr 13, 2024
1 parent 8935a78 commit d3b2053
Showing 13 changed files with 338 additions and 0 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -39,6 +39,7 @@
- `samtools/flagstat`: Counts the number of alignments in SAM/BAM/CRAM files for each FLAG type (PR #31).
- `samtools/idxstats`: Reports alignment summary statistics for a SAM/BAM/CRAM file (PR #32).
- `samtools/samtools_index`: Index SAM/BAM/CRAM files (PR #35).
- `samtools/samtools_sort`: Sort SAM/BAM/CRAM files (PR #36).
- `samtools/samtools_stats`: Reports alignment summary statistics for a BAM file (PR #39).

## MAJOR CHANGES
149 changes: 149 additions & 0 deletions src/samtools/samtools_sort/config.vsh.yaml
@@ -0,0 +1,149 @@
name: samtools_sort
namespace: samtools
description: Sort SAM/BAM/CRAM file.
keywords: [sort, bam, sam, cram]
links:
  homepage: https://www.htslib.org/
  documentation: https://www.htslib.org/doc/samtools-sort.html
  repository: https://github.com/samtools/samtools
references:
  doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008]
license: MIT/Expat

argument_groups:
  - name: Inputs
    arguments:
      - name: --input
        type: file
        description: SAM/BAM/CRAM input file.
        required: true
        must_exist: true
  - name: Outputs
    arguments:
      - name: --output
        type: file
        description: |
          Write final output to file.
        required: true
        direction: output
        example: out.bam
      - name: --output_fmt
        alternatives: -O
        type: string
        description: |
          Specify output format (SAM, BAM, CRAM).
        example: BAM
      - name: --output_fmt_option
        type: string
        description: |
          Specify a single output file format option in the form
          of OPTION or OPTION=VALUE.
      - name: --reference
        type: file
        description: |
          Reference sequence FASTA FILE.
        example: ref.fa
      - name: --write_index
        type: boolean_true
        description: |
          Automatically index the output files.
      - name: --prefix
        alternatives: -T
        type: string
        description: |
          Write temporary files to PREFIX.nnnn.bam.
      - name: --no_PG
        type: boolean_true
        description: |
          Do not add a PG line.
      - name: --template_coordinate
        type: boolean_true
        description: |
          Sort by template-coordinate.
      - name: --input_fmt_option
        type: string
        description: |
          Specify a single input file format option in the form
          of OPTION or OPTION=VALUE.
  - name: Options
    arguments:
      - name: --compression
        alternatives: -l
        type: integer
        description: |
          Set compression level, from 0 (uncompressed) to 9 (best).
        default: 0
      - name: --uncompressed
        alternatives: -u
        type: boolean_true
        description: |
          Output uncompressed data (equivalent to --compression 0).
      - name: --minimiser
        alternatives: -M
        type: boolean_true
        description: |
          Use minimiser for clustering unaligned/unplaced reads.
      - name: --not_reverse
        alternatives: -R
        type: boolean_true
        description: |
          Do not use reverse strand (only compatible with --minimiser).
      - name: --kmer_size
        alternatives: -K
        type: integer
        description: |
          Kmer size to use for minimiser.
        example: 20
      - name: --order
        alternatives: -I
        type: file
        description: |
          Order minimisers by their position in FILE FASTA.
        example: ref.fa
      - name: --window
        alternatives: -w
        type: integer
        description: |
          Window size for minimiser indexing via --order ref.fa.
        example: 100
      - name: --homopolymers
        alternatives: -H
        type: boolean_true
        description: |
          Squash homopolymers when computing minimiser.
      - name: --natural_sort
        alternatives: -n
        type: boolean_true
        description: |
          Sort by read name (natural): cannot be used with samtools index.
      - name: --ascii_sort
        alternatives: -N
        type: boolean_true
        description: |
          Sort by read name (ASCII): cannot be used with samtools index.
      - name: --tag
        alternatives: -t
        type: string
        description: |
          Sort by value of TAG. Uses position as secondary index
          (or read name if --natural_sort is set).
resources:
  - type: bash_script
    path: script.sh
test_resources:
  - type: bash_script
    path: test.sh
  - type: file
    path: test_data
engines:
  - type: docker
    image: quay.io/biocontainers/samtools:1.19.2--h50ea8bc_1
    setup:
      - type: docker
        run: |
          samtools --version 2>&1 | grep -E '^(samtools|Using htslib)' | \
          sed 's#Using ##;s# \([0-9\.]*\)$#: \1#' > /var/software_versions.txt
runners:
  - type: executable
  - type: nextflow
40 changes: 40 additions & 0 deletions src/samtools/samtools_sort/help.txt
@@ -0,0 +1,40 @@
```
samtools sort
```

Usage: samtools sort [options...] [in.bam]
Options:
-l INT Set compression level, from 0 (uncompressed) to 9 (best)
-u Output uncompressed data (equivalent to -l 0)
-m INT Set maximum memory per thread; suffix K/M/G recognized [768M]
-M Use minimiser for clustering unaligned/unplaced reads
-R Do not use reverse strand (only compatible with -M)
-K INT Kmer size to use for minimiser [20]
-I FILE Order minimisers by their position in FILE FASTA
-w INT Window size for minimiser indexing via -I ref.fa [100]
-H Squash homopolymers when computing minimiser
-n Sort by read name (natural): cannot be used with samtools index
-N Sort by read name (ASCII): cannot be used with samtools index
-t TAG Sort by value of TAG. Uses position as secondary index (or read name if -n is set)
-o FILE Write final output to FILE rather than standard output
-T PREFIX Write temporary files to PREFIX.nnnn.bam
--no-PG
Do not add a PG line
--template-coordinate
Sort by template-coordinate
--input-fmt-option OPT[=VAL]
Specify a single input file format option in the form
of OPTION or OPTION=VALUE
-O, --output-fmt FORMAT[,OPT[=VAL]]...
Specify output format (SAM, BAM, CRAM)
--output-fmt-option OPT[=VAL]
Specify a single output file format option in the form
of OPTION or OPTION=VALUE
--reference FILE
Reference sequence FASTA FILE [null]
-@, --threads INT
Number of additional threads to use [0]
--write-index
Automatically index the output files [off]
--verbosity INT
Set level of verbosity
43 changes: 43 additions & 0 deletions src/samtools/samtools_sort/script.sh
@@ -0,0 +1,43 @@
#!/bin/bash

## VIASH START
## VIASH END

set -e

[[ "$par_uncompressed" == "false" ]] && unset par_uncompressed
[[ "$par_minimiser" == "false" ]] && unset par_minimiser
[[ "$par_not_reverse" == "false" ]] && unset par_not_reverse
[[ "$par_homopolymers" == "false" ]] && unset par_homopolymers
[[ "$par_natural_sort" == "false" ]] && unset par_natural_sort
[[ "$par_ascii_sort" == "false" ]] && unset par_ascii_sort
[[ "$par_template_coordinate" == "false" ]] && unset par_template_coordinate
[[ "$par_write_index" == "false" ]] && unset par_write_index
[[ "$par_no_PG" == "false" ]] && unset par_no_PG


samtools sort \
${par_compression:+-l "$par_compression"} \
${par_uncompressed:+-u} \
${par_minimiser:+-M} \
${par_not_reverse:+-R} \
${par_kmer_size:+-K "$par_kmer_size"} \
${par_order:+-I "$par_order"} \
${par_window:+-w "$par_window"} \
${par_homopolymers:+-H} \
${par_natural_sort:+-n} \
${par_ascii_sort:+-N} \
${par_tag:+-t "$par_tag"} \
${par_input_fmt_option:+--input-fmt-option "$par_input_fmt_option"} \
${par_template_coordinate:+--template-coordinate} \
${par_write_index:+--write-index} \
${par_prefix:+-T "$par_prefix"} \
${par_no_PG:+--no-PG} \
${par_output_fmt:+-O "$par_output_fmt"} \
${par_output_fmt_option:+--output-fmt-option "$par_output_fmt_option"} \
${par_reference:+--reference "$par_reference"} \
-o "$par_output" \
"$par_input"

# save text files containing the output of samtools view for later comparison
samtools view "$par_output" -o "$par_output".txt
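Review note on the wrapper above: it relies on bash's `${var:+word}` expansion, which emits a flag only when its variable is set and non-empty. That is why boolean parameters holding the string `false` are unset before the `samtools sort` call. The sketch below isolates the idiom so it can be tried without samtools; `build_args` and its two inputs are hypothetical names for illustration and are not part of the commit.

```bash
#!/bin/bash
# Hypothetical helper (not part of the commit): collects the flags that the
# ${var:+word} idiom would pass on, so they can be inspected without samtools.
build_args() {
  local par_compression="$1" par_ascii_sort="$2"
  # The wrapper receives unset booleans as the string "false"; unsetting the
  # variable makes the :+ expansion below drop the flag entirely.
  [[ "$par_ascii_sort" == "false" ]] && unset par_ascii_sort
  local args=(
    ${par_compression:+-l "$par_compression"}
    ${par_ascii_sort:+-N}
  )
  printf '%s\n' "${args[*]}"
}

build_args 5 true    # prints: -l 5 -N
build_args "" false  # prints an empty line: no flags are emitted
```

Collecting the flags in an array is a choice made for this sketch only; the wrapper expands them directly on the command line, which behaves the same way.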
79 changes: 79 additions & 0 deletions src/samtools/samtools_sort/test.sh
@@ -0,0 +1,79 @@
#!/bin/bash

test_dir="${meta_resources_dir}/test_data"
out_dir="${meta_resources_dir}/test_data/text"

# Files are compared using the "samtools view" output.
############################################################################################

echo ">>> Test 1: Sorting a BAM file"

"$meta_executable" \
--input "$test_dir/a.bam" \
--output "$test_dir/a.sorted.bam"

echo ">>> Check if output file exists"
[ ! -f "$test_dir/a.sorted.bam" ] \
&& echo "Output file a.sorted.bam does not exist" && exit 1

echo ">>> Check if output is empty"
[ ! -s "$test_dir/a.sorted.bam" ] \
&& echo "Output file a.sorted.bam is empty" && exit 1

echo ">>> Check if output matches expected output"
diff -a "$test_dir/a.sorted.bam.txt" "$out_dir/a_ref.sorted.txt" \
|| { echo "Output file a.sorted.bam does not match expected output"; exit 1; }

rm "$test_dir/a.sorted.bam" "$test_dir/a.sorted.bam.txt"

############################################################################################

echo ">>> Test 2: Sorting a BAM file according to ascii order"

"$meta_executable" \
--input "$test_dir/a.bam" \
--ascii_sort \
--output "$test_dir/ascii.sorted.bam"

echo ">>> Check if output file exists"
[ ! -f "$test_dir/ascii.sorted.bam" ] \
&& echo "Output file ascii.sorted.bam does not exist" && exit 1

echo ">>> Check if output is empty"
[ ! -s "$test_dir/ascii.sorted.bam" ] \
&& echo "Output file ascii.sorted.bam is empty" && exit 1

echo ">>> Check if output matches expected output"
diff -a "$test_dir/ascii.sorted.bam.txt" "$out_dir/ascii_ref.sorted.txt" \
|| { echo "Output file ascii.sorted.bam does not match expected output"; exit 1; }

rm "$test_dir/ascii.sorted.bam" "$test_dir/ascii.sorted.bam.txt"

############################################################################################

echo ">>> Test 3: Sorting a BAM file with compression"

"$meta_executable" \
--input "$test_dir/a.bam" \
--compression 5 \
--output "$test_dir/compressed.sorted.bam"

echo ">>> Check if output file exists"
[ ! -f "$test_dir/compressed.sorted.bam" ] \
&& echo "Output file compressed.sorted.bam does not exist" && exit 1

echo ">>> Check if output is empty"
[ ! -s "$test_dir/compressed.sorted.bam" ] \
&& echo "Output file compressed.sorted.bam is empty" && exit 1

echo ">>> Check if output matches expected output"
diff "$test_dir/compressed.sorted.bam.txt" "$out_dir/compressed_ref.sorted.txt" \
|| { echo "Output file compressed.sorted.bam does not match expected output"; exit 1; }

rm "$test_dir/compressed.sorted.bam" "$test_dir/compressed.sorted.bam.txt"

############################################################################################


echo "All tests succeeded!"
exit 0
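Review note on failure handling in test scripts of this shape: when a check is written as `cmd || (echo ... && exit 1)`, the `exit` runs inside a subshell and ends only that subshell. Unless `set -e` is active, the calling script carries on and can still reach its final `exit 0`. Grouping with braces keeps `exit` in the current shell. A minimal sketch of the difference follows; `with_parens` and `with_braces` are hypothetical names, and each body runs in a separate `bash -c` process so the demo itself is not terminated.

```bash
#!/bin/bash
# Hypothetical demo functions; each runs its snippet in a child bash process
# so that `exit` cannot terminate the caller.

# Parentheses: `exit 1` only leaves the subshell, so execution continues.
with_parens() {
  bash -c 'false || (echo "mismatch" && exit 1); echo "still running"'
}

# Braces: `exit 1` runs in the current shell and stops the script.
with_braces() {
  bash -c 'false || { echo "mismatch"; exit 1; }; echo "still running"'
}

with_parens && echo "parens variant: script kept going and returned 0"
with_braces || echo "braces variant: script stopped with status $?"
```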
Binary file added src/samtools/samtools_sort/test_data/a.bam
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
8 changes: 8 additions & 0 deletions src/samtools/samtools_sort/test_data/script.sh
@@ -0,0 +1,8 @@
#!/bin/bash

# download test data from snakemake wrapper
if [ ! -d /tmp/sort_source ]; then
git clone --depth 1 --single-branch --branch master https://github.com/snakemake/snakemake-wrappers.git /tmp/sort_source
fi

cp -r /tmp/sort_source/bio/samtools/sort/test/mapped/* src/samtools/samtools_sort/test_data
6 changes: 6 additions & 0 deletions src/samtools/samtools_sort/test_data/text/a_ref.sorted.txt
@@ -0,0 +1,6 @@
a1 99 xx 1 1 10M = 11 20 AAAAAAAAAA **********
b1 99 xx 1 1 10M = 11 20 AAAAAAAAAA **********
c1 99 xx 1 1 10M = 11 20 AAAAAAAAAA **********
a1 147 xx 11 1 10M = 1 -20 TTTTTTTTTT **********
b1 147 xx 11 1 10M = 1 -20 TTTTTTTTTT **********
c1 147 xx 11 1 10M = 1 -20 TTTTTTTTTT **********
6 changes: 6 additions & 0 deletions src/samtools/samtools_sort/test_data/text/ascii_ref.sorted.txt
@@ -0,0 +1,6 @@
a1 99 xx 1 1 10M = 11 20 AAAAAAAAAA **********
a1 147 xx 11 1 10M = 1 -20 TTTTTTTTTT **********
b1 99 xx 1 1 10M = 11 20 AAAAAAAAAA **********
b1 147 xx 11 1 10M = 1 -20 TTTTTTTTTT **********
c1 99 xx 1 1 10M = 11 20 AAAAAAAAAA **********
c1 147 xx 11 1 10M = 1 -20 TTTTTTTTTT **********
6 changes: 6 additions & 0 deletions src/samtools/samtools_sort/test_data/text/compressed_ref.sorted.txt
@@ -0,0 +1,6 @@
a1 99 xx 1 1 10M = 11 20 AAAAAAAAAA **********
b1 99 xx 1 1 10M = 11 20 AAAAAAAAAA **********
c1 99 xx 1 1 10M = 11 20 AAAAAAAAAA **********
a1 147 xx 11 1 10M = 1 -20 TTTTTTTTTT **********
b1 147 xx 11 1 10M = 1 -20 TTTTTTTTTT **********
c1 147 xx 11 1 10M = 1 -20 TTTTTTTTTT **********
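Review note on the reference files: they contain the same six records and differ only in order. The coordinate-sorted outputs list all position-1 reads before the position-11 reads, while the read-name-sorted output keeps each pair together. As a rough sanity check, an ordering like this can be verified with `sort -c`. The sketch below uses a hypothetical `is_sorted` helper and only the first four SAM columns (QNAME, FLAG, RNAME, POS) of the records above. It assumes a single reference sequence; with several references, samtools follows the header order rather than lexical order, so this check would no longer apply as written.

```bash
#!/bin/bash
# Hypothetical helper: succeeds when tab-separated SAM text on stdin is
# ordered by the given sort(1) keys. -s disables the whole-line tie-break,
# so only the listed keys are compared.
is_sorted() {
  LC_ALL=C sort -c -s -t "$(printf '\t')" "$@" 2>/dev/null
}

# First four SAM columns (QNAME, FLAG, RNAME, POS) of the reference records.
coordinate_order=$'a1\t99\txx\t1\nb1\t99\txx\t1\nc1\t99\txx\t1\na1\t147\txx\t11\nb1\t147\txx\t11\nc1\t147\txx\t11\n'
name_order=$'a1\t99\txx\t1\na1\t147\txx\t11\nb1\t99\txx\t1\nb1\t147\txx\t11\nc1\t99\txx\t1\nc1\t147\txx\t11\n'

# Coordinate sort: reference name (column 3), then numeric position (column 4).
printf '%s' "$coordinate_order" | is_sorted -k3,3 -k4,4n && echo "coordinate order OK"
# Name sort: read name (column 1) only.
printf '%s' "$name_order" | is_sorted -k1,1 && echo "name order OK"
```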
