Add salmon (viash-hub#24)
* add salmon index and quant

* add test resources

* add help text

* script and config

* add test

* update script and test

* add salmon quant

* update CHANGELOG.md

* update to viash 0.9 format

* remove echo statement

* output the main salmon output file separately

* check if output file has the right columns

* check if correct output files were generated

* fix doi

Co-authored-by: Robrecht Cannoodt <[email protected]>

* use the default multiple separator

* rename components

* remove print statements

* check info.json output

* reduce size of fastq files and generate index in test script

* use smaller (manual) test data

* set A as the default lib_type

* Merge branch 'add_salmon' of https://github.com/viash-hub/biobase into add_salmon

* add test to check content of output index

* add more detailed description about libType

* add test data

* delete test data

* move to salmon_index and salmon_quant

---------

Co-authored-by: Robrecht Cannoodt <[email protected]>
sainirmayi and rcannood authored Mar 28, 2024
1 parent dc62023 commit d9e7fdf
Showing 9 changed files with 2,140 additions and 1 deletion.
6 changes: 5 additions & 1 deletion CHANGELOG.md
@@ -29,7 +29,11 @@

* `star/star_align_reads`: Align reads to a reference genome (PR #22).

* `gffread`: Validate, filter, convert and perform other operations on GFF files (PR #29).

* `salmon`:
  - `salmon/salmon_index`: Create a salmon index for the transcriptome to use Salmon in the mapping-based mode (PR #24).
  - `salmon/salmon_quant`: Transcript quantification from RNA-seq data (PR #24).

## MAJOR CHANGES

113 changes: 113 additions & 0 deletions src/salmon/salmon_index/config.vsh.yaml
@@ -0,0 +1,113 @@
name: salmon_index
namespace: salmon
description: |
  Salmon is a tool for wicked-fast transcript quantification from RNA-seq data. It can either make use of pre-computed alignments (in the form of a SAM/BAM file) to the transcripts rather than the raw reads, or can be run in the mapping-based mode. This component creates a salmon index for the transcriptome to use Salmon in the mapping-based mode. It is generally recommended that you build a decoy-aware transcriptome file. This is done using the entire genome of the organism as the decoy sequence by concatenating the genome to the end of the transcriptome to be indexed and populating the decoys.txt file with the chromosome names.
keywords: ["Transcriptome", "Index"]
links:
  homepage: https://salmon.readthedocs.io/en/latest/salmon.html
  documentation: https://salmon.readthedocs.io/en/latest/salmon.html
  repository: https://github.com/COMBINE-lab/salmon
references:
  doi: 10.1038/nmeth.4197
license: GPL-3.0
requirements:
  commands: [ salmon ]

argument_groups:
  - name: Inputs
    arguments:
      - name: --genome
        type: file
        description: |
          Genome of the organism, used to prepare the set of decoy sequences. Required to build a decoy-aware transcriptome.
        required: false
        example: genome.fasta
      - name: --transcripts
        alternatives: ["-t"]
        type: file
        description: |
          Transcript fasta file.
        required: true
        example: transcriptome.fasta
      - name: --kmer_len
        alternatives: ["-k"]
        type: integer
        description: |
          The size of k-mers that should be used for the quasi index.
        required: false
        example: 31
      - name: --gencode
        type: boolean_true
        description: |
          This flag will expect the input transcript fasta to be in GENCODE format, and will split the transcript name at the first '|' character. These reduced names will be used in the output and when looking for these transcripts in a gene to transcript GTF.
      - name: --features
        type: boolean_true
        description: |
          This flag will expect the input reference to be in the tsv file format, and will split the feature name at the first 'tab' character. These reduced names will be used in the output and when looking for the sequence of the features.GTF.
      - name: --keep_duplicates
        type: boolean_true
        description: |
          This flag will disable the default indexing behavior of discarding sequence-identical duplicate transcripts. If this flag is passed, then duplicate transcripts that appear in the input will be retained and quantified separately.
      - name: --keep_fixed_fasta
        type: boolean_true
        description: |
          Retain the fixed fasta file (without short transcripts and duplicates, clipped, etc.) generated during indexing.
      - name: --filter_size
        alternatives: ["-f"]
        type: integer
        description: |
          The size of the Bloom filter that will be used by TwoPaCo during indexing. The filter will be of size 2^{filter_size}. The default value of -1 means that the filter size will be automatically set based on the number of distinct k-mers in the input, as estimated by nthll.
        required: false
        example: -1
      - name: --sparse
        type: boolean_true
        description: |
          Build the index using a sparse sampling of k-mer positions. This will require less memory (especially during quantification), but will take longer to construct and can slow down mapping / alignment.
      - name: --decoys
        alternatives: ["-d"]
        type: file
        description: |
          Treat these sequence IDs from the reference as the decoys that may have sequence homologous to some known transcript. For example, in the case of the genome, provide a list of chromosome names (one per line).
        required: false
        example: decoys.txt
      - name: --no_clip
        type: boolean_true
        description: |
          Don't clip poly-A tails from the ends of target sequences.
      - name: --type
        alternatives: ["-n"]
        type: string
        description: |
          The type of index to build; the only option is "puff" in this version of salmon.
        required: false
        example: puff

  - name: Output
    arguments:
      - name: --index
        alternatives: ["-i"]
        type: file
        direction: output
        description: |
          Salmon index
        required: true
        example: Salmon_index

resources:
  - type: bash_script
    path: script.sh

test_resources:
  - type: bash_script
    path: test.sh

engines:
  - type: docker
    image: quay.io/biocontainers/salmon:1.10.2--hecfa306_0
    setup:
      - type: docker
        run: |
          salmon index -v 2>&1 | sed 's/salmon \([0-9.]*\)/salmon: \1/' > /var/software_versions.txt
runners:
  - type: executable
  - type: nextflow
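
The decoy-aware transcriptome recommended in the component description can be prepared with plain shell tools. The following is a minimal sketch under stated assumptions, not the component's own script: the file names (`genome.fa`, `transcripts.fa`, `decoys.txt`, `gentrome.fa`) and the toy sequences are made up for illustration, and the final `salmon index` call is shown only as a comment because the toy sequences are shorter than the default k-mer size.

```bash
#!/bin/bash
set -euo pipefail

# Work in a scratch directory so nothing in the current directory is overwritten.
workdir="$(mktemp -d)"
cd "$workdir"

# Toy inputs (hypothetical sequences, for illustration only).
printf '>chr1 test chromosome\nACGTACGTACGT\n>chr2\nTTGGCCAATTGG\n' > genome.fa
printf '>tx1\nACGTACGT\n' > transcripts.fa

# Decoy names are the genome sequence IDs: take the header lines,
# keep the first word, and strip the leading '>'.
grep '^>' genome.fa | cut -d ' ' -f 1 | sed 's/^>//' > decoys.txt

# The decoy sequences are appended after the transcripts.
cat transcripts.fa genome.fa > gentrome.fa

# Show the decoy list (one chromosome name per line).
cat decoys.txt

# With real data, the two files would then be passed to salmon, for example:
#   salmon index -t gentrome.fa -d decoys.txt -i salmon_index
```

The order of the concatenation matters here: the description above places the genome at the end of the transcriptome, so the transcripts come first in `gentrome.fa`.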
66 changes: 66 additions & 0 deletions src/salmon/salmon_index/help.txt
@@ -0,0 +1,66 @@
```bash
salmon index -h
```

Version Info: This is the most recent version of salmon.

Index
==========
Creates a salmon index.

Command Line Options:
  -v [ --version ]              print version string
  -h [ --help ]                 produce help message
  -t [ --transcripts ] arg      Transcript fasta file.
  -k [ --kmerLen ] arg (=31)    The size of k-mers that should be used for the
                                quasi index.
  -i [ --index ] arg            salmon index.
  --gencode                     This flag will expect the input transcript
                                fasta to be in GENCODE format, and will split
                                the transcript name at the first '|' character.
                                These reduced names will be used in the output
                                and when looking for these transcripts in a
                                gene to transcript GTF.
  --features                    This flag will expect the input reference to be
                                in the tsv file format, and will split the
                                feature name at the first 'tab' character.
                                These reduced names will be used in the output
                                and when looking for the sequence of the
                                features.GTF.
  --keepDuplicates              This flag will disable the default indexing
                                behavior of discarding sequence-identical
                                duplicate transcripts. If this flag is passed,
                                then duplicate transcripts that appear in the
                                input will be retained and quantified
                                separately.
  -p [ --threads ] arg (=2)     Number of threads to use during indexing.
  --keepFixedFasta              Retain the fixed fasta file (without short
                                transcripts and duplicates, clipped, etc.)
                                generated during indexing
  -f [ --filterSize ] arg (=-1) The size of the Bloom filter that will be used
                                by TwoPaCo during indexing. The filter will be
                                of size 2^{filterSize}. The default value of -1
                                means that the filter size will be
                                automatically set based on the number of
                                distinct k-mers in the input, as estimated by
                                nthll.
  --tmpdir arg                  The directory location that will be used for
                                TwoPaCo temporary files; it will be created if
                                need be and be removed prior to indexing
                                completion. The default value will cause a
                                (temporary) subdirectory of the salmon index
                                directory to be used for this purpose.
  --sparse                      Build the index using a sparse sampling of
                                k-mer positions This will require less memory
                                (especially during quantification), but will
                                take longer to construct and can slow down
                                mapping / alignment
  -d [ --decoys ] arg           Treat these sequences ids from the reference as
                                the decoys that may have sequence homologous to
                                some known transcript. for example in case of
                                the genome, provide a list of chromosome name
                                --- one per line
  -n [ --no-clip ]              Don't clip poly-A tails from the ends of target
                                sequences
  --type arg (=puff)            The type of index to build; the only option is
                                "puff" in this version of salmon.
49 changes: 49 additions & 0 deletions src/salmon/salmon_index/script.sh
@@ -0,0 +1,49 @@
#!/bin/bash

set -e

## VIASH START
## VIASH END

[[ "$par_gencode" == "false" ]] && unset par_gencode
[[ "$par_features" == "false" ]] && unset par_features
[[ "$par_keep_duplicates" == "false" ]] && unset par_keep_duplicates
[[ "$par_keep_fixed_fasta" == "false" ]] && unset par_keep_fixed_fasta
[[ "$par_sparse" == "false" ]] && unset par_sparse
[[ "$par_no_clip" == "false" ]] && unset par_no_clip

tmp_dir=$(mktemp -d -p "$meta_temp_dir" "${meta_functionality_name}_XXXXXX")
mkdir -p "$tmp_dir/temp"

# If a genome is given without an explicit decoys file, derive the decoy
# names from the genome headers and append the genome to the transcriptome.
if [[ -f "$par_genome" ]] && [[ ! "$par_decoys" ]]; then
  filename="$(basename -- "$par_genome")"
  decoys="decoys.txt"
  if [ "${filename##*.}" == "gz" ]; then
    grep '^>' <(gunzip -c "$par_genome") | cut -d ' ' -f 1 > "$decoys"
    gentrome="gentrome.fa.gz"
  else
    grep '^>' "$par_genome" | cut -d ' ' -f 1 > "$decoys"
    gentrome="gentrome.fa"
  fi
  sed -i.bak -e 's/>//g' "$decoys"
  cat "$par_transcripts" "$par_genome" > "$gentrome"
else
  gentrome="$par_transcripts"
  decoys="$par_decoys"
fi

salmon index \
  -t "$gentrome" \
  --tmpdir "$tmp_dir/temp" \
  ${meta_cpus:+--threads "${meta_cpus}"} \
  -i "$par_index" \
  ${par_kmer_len:+-k "${par_kmer_len}"} \
  ${par_gencode:+--gencode} \
  ${par_features:+--features} \
  ${par_keep_duplicates:+--keepDuplicates} \
  ${par_keep_fixed_fasta:+--keepFixedFasta} \
  ${par_filter_size:+-f "${par_filter_size}"} \
  ${par_sparse:+--sparse} \
  ${decoys:+-d "${decoys}"} \
  ${par_no_clip:+--no-clip} \
  ${par_type:+--type "${par_type}"}
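
The script above relies on the `${var:+word}` parameter expansion, which yields `word` only when `var` is set and non-empty. The following sketch isolates that idiom; the variable values are hypothetical stand-ins for what Viash would export, and the arguments are collected in an array (an illustrative choice, not what the script does) so they can be inspected without running salmon.

```bash
#!/bin/bash

# Hypothetical parameter values, standing in for Viash-provided variables.
par_kmer_len=31
par_sparse="false"
unset par_filter_size

# A boolean flag set to "false" is unset, so that ${par_sparse:+...}
# expands to nothing instead of passing the flag.
[[ "$par_sparse" == "false" ]] && unset par_sparse

# Each expansion contributes its words only if the variable is set and non-empty.
args=(
  ${par_kmer_len:+-k "${par_kmer_len}"}
  ${par_sparse:+--sparse}
  ${par_filter_size:+-f "${par_filter_size}"}
)

echo "salmon index ${args[*]}"
# prints: salmon index -k 31
```

This is why the script first unsets every boolean parameter whose value is `false`: without that step, the non-empty string `false` would still trigger the expansion and the flag would be passed.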
35 changes: 35 additions & 0 deletions src/salmon/salmon_index/test.sh
@@ -0,0 +1,35 @@
#!/bin/bash

set -e

echo "> Prepare test data"

dir_in="test_data"
mkdir -p "$dir_in"

cat > "$dir_in/transcriptome.fasta" <<'EOF'
>contig1
AGCTCCAGATTCGCTCAGGCCCTTGATCATCAGTCGTCGTCGTCTTCGATTTGCCAGAGG
AGTTTAGATGAAGAATGTCAAGGATGTTCCTCCCTGCCCTCCCATCTAGCCAAGAACATT
TCCAAGAAGATAAAACTGTCACTGAGACAGGTCTGGATGCGCCCTAGGGGCAAATAGAGA
>contig2
AGGCCTTTACCACATTGCTGCTGGCTATAGGAAGTCCCAGGTACTAGCCTGAAACAGCTG
ATATTTGGGGCTGTCACAGACAATATGGCCACCCCTTGGTCTTTATGCATGAAGATTATG
TAAAGGTTTTTATTAAAAAATATATATATATATATAAATGATCTAGATTATTTTCCTCTT
TCTGAAGTACTTTCTTAAAAAAATAAAATTAAATGTTTATAGTATTCCCGGT
EOF

printf ">>> Run salmon_index\n"
"$meta_executable" \
  --transcripts "$dir_in/transcriptome.fasta" \
  --index index \
  --kmer_len 31

printf ">>> Checking whether output exists\n"
[ ! -d "index" ] && echo "'index' does not exist!" && exit 1
[ -z "$(ls -A 'index')" ] && echo "'index' is empty!" && exit 1
[ ! -f "index/info.json" ] && echo "Salmon index does not contain 'info.json'! Not all files were generated correctly!" && exit 1
# Compare the recorded k-mer size with the requested one; the quoted substitution keeps the test valid even if grep finds nothing.
[ "$(grep '"k": [0-9]*' index/info.json | cut -d':' -f 2 | tr -d ' ')" != '31,' ] && printf "The generated Salmon index seems to be incorrect!\n" && exit 1

echo "All tests succeeded!"
exit 0
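
The k-mer check in the test above depends on the exact text after the colon, including the trailing comma. A slightly sturdier variant can extract only the digits. This is a sketch under assumptions: the `info.json` written here is a made-up two-field stand-in, not real salmon output, and the `sed` pattern assumes the `"k"` key and its value sit on a single line.

```bash
#!/bin/bash

# Hypothetical stand-in for index/info.json (real files contain more fields).
tmp="$(mktemp -d)"
cat > "$tmp/info.json" <<'EOF'
{
    "example_field": "placeholder",
    "k": 31
}
EOF

# Print only the digits that follow the "k" key; prints nothing if the key is absent.
k="$(sed -n 's/.*"k": *\([0-9][0-9]*\).*/\1/p' "$tmp/info.json")"

if [ "$k" != "31" ]; then
  echo "Unexpected k-mer size in info.json: '$k'"
  exit 1
fi
echo "k=$k"
```

Because the comparison is against the bare number, the check does not change depending on whether `"k"` is followed by a comma.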