Commit
commit d9e7fdf
Author: Sai Nirmayi Yasa <[email protected]>
Date: Thu Mar 28 20:08:30 2024 +0530

Add salmon (#24)

* add salmon index and quant
* add test resources
* add help text
* script and config
* add test
* update script and test
* add salmon quant
* update CHANGELOG.md
* update to viash 0.9 format
* remove echo ststement
* output the main salmon output file separately
* check if output file has the right columns
* check if correct output files were generated
* fix doi
Co-authored-by: Robrecht Cannoodt <[email protected]>
* use the default multiple separator
* rename components
* remove print statements
* check info.json output
* reduce size of fastq files and generate index in test script
* use smaller (manual) test data
* set A as the default lib_type
* Merge branch 'add_salmon' of https://github.com/viash-hub/biobase into add_salmon
* add test to check content of output index
* add more detailed description about libType
* add test data
* delete test data
* move to salmon_index and salmon_quant
---------
Co-authored-by: Robrecht Cannoodt <[email protected]>
1 parent bb52de0, commit 39790f8
Showing 9 changed files with 2,140 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,113 @@ | ||
name: salmon_index | ||
namespace: salmon | ||
description: | | ||
Salmon is a tool for wicked-fast transcript quantification from RNA-seq data. It can either make use of pre-computed alignments (in the form of a SAM/BAM file) to the transcripts rather than the raw reads, or can be run in the mapping-based mode. This component creates a salmon index for the transcriptome to use Salmon in the mapping-based mode. It is generally recommended that you build a decoy-aware transcriptome file. This is done using the entire genome of the organism as the decoy sequence by concatenating the genome to the end of the transcriptome to be indexed and populating the decoys.txt file with the chromosome names. | ||
keywords: ["Transcriptome", "Index"] | ||
links: | ||
homepage: https://salmon.readthedocs.io/en/latest/salmon.html | ||
documentation: https://salmon.readthedocs.io/en/latest/salmon.html | ||
repository: https://github.com/COMBINE-lab/salmon | ||
references: | ||
doi: 10.1038/nmeth.4197 | ||
license: GPL-3.0 | ||
requirements: | ||
commands: [ salmon ] | ||
|
||
argument_groups: | ||
- name: Inputs | ||
arguments: | ||
- name: --genome | ||
type: file | ||
description: | | ||
Genome of the organism, used to prepare the set of decoy sequences. Required to build a decoy-aware transcriptome. | ||
required: false | ||
example: genome.fasta | ||
- name: --transcripts | ||
alternatives: ["-t"] | ||
type: file | ||
description: | | ||
Transcript fasta file. | ||
required: true | ||
example: transcriptome.fasta | ||
- name: --kmer_len | ||
alternatives: ["-k"] | ||
type: integer | ||
description: | | ||
The size of k-mers that should be used for the quasi index. | ||
required: false | ||
example: 31 | ||
- name: --gencode | ||
type: boolean_true | ||
description: | | ||
This flag will expect the input transcript fasta to be in GENCODE format, and will split the transcript name at the first '|' character. These reduced names will be used in the output and when looking for these transcripts in a gene to transcript GTF. | ||
- name: --features | ||
type: boolean_true | ||
description: | | ||
This flag will expect the input reference to be in the tsv file format, and will split the feature name at the first 'tab' character. These reduced names will be used in the output and when looking for the sequence of the features.GTF. | ||
- name: --keep_duplicates | ||
type: boolean_true | ||
description: | | ||
This flag will disable the default indexing behavior of discarding sequence-identical duplicate transcripts. If this flag is passed, then duplicate transcripts that appear in the input will be retained and quantified separately. | ||
- name: --keep_fixed_fasta | ||
type: boolean_true | ||
description: | | ||
Retain the fixed fasta file (without short transcripts and duplicates, clipped, etc.) generated during indexing. | ||
- name: --filter_size | ||
alternatives: ["-f"] | ||
type: integer | ||
description: | | ||
The size of the Bloom filter that will be used by TwoPaCo during indexing. The filter will be of size 2^{filter_size}. The default value of -1 means that the filter size will be automatically set based on the number of distinct k-mers in the input, as estimated by nthll. | ||
required: false | ||
example: -1 | ||
- name: --sparse | ||
type: boolean_true | ||
description: | | ||
Build the index using a sparse sampling of k-mer positions. This will require less memory (especially during quantification), but will take longer to construct and can slow down mapping / alignment. | ||
- name: --decoys | ||
alternatives: ["-d"] | ||
type: file | ||
description: | | ||
Treat these sequence IDs from the reference as decoys that may have sequence homologous to some known transcript. For example, in the case of the genome, provide a list of chromosome names (one per line). | ||
required: false | ||
example: decoys.txt | ||
- name: --no_clip | ||
alternatives: ["-n"] | ||
type: boolean_true | ||
description: | | ||
Don't clip poly-A tails from the ends of target sequences. | ||
- name: --type | ||
type: string | ||
description: | | ||
The type of index to build; the only option is "puff" in this version of salmon. | ||
required: false | ||
example: puff | ||
|
||
- name: Output | ||
arguments: | ||
- name: --index | ||
alternatives: ["-i"] | ||
type: file | ||
direction: output | ||
description: | | ||
Salmon index | ||
required: true | ||
example: Salmon_index | ||
|
||
resources: | ||
- type: bash_script | ||
path: script.sh | ||
|
||
test_resources: | ||
- type: bash_script | ||
path: test.sh | ||
|
||
engines: | ||
- type: docker | ||
image: quay.io/biocontainers/salmon:1.10.2--hecfa306_0 | ||
setup: | ||
- type: docker | ||
run: | | ||
salmon index -v 2>&1 | sed 's/salmon \([0-9.]*\)/salmon: \1/' > /var/software_versions.txt | ||
runners: | ||
- type: executable | ||
- type: nextflow |
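
The `setup` step above rewrites the version string before writing it to `/var/software_versions.txt`. The following minimal sketch shows what that `sed` expression does, assuming the tool prints a line of the form `salmon <version>`; the sample line is a hypothetical stand-in, not captured output.

```bash
# Hypothetical stand-in for the version line printed by salmon.
version_line="salmon 1.10.2"

# Same substitution as in the config: turn "salmon X.Y.Z" into "salmon: X.Y.Z".
normalized="$(printf '%s\n' "$version_line" | sed 's/salmon \([0-9.]*\)/salmon: \1/')"

echo "$normalized"
```

The resulting `name: version` form keeps the file easy to parse line by line.
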
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,66 @@ | ||
```bash | ||
salmon index -h | ||
``` | ||
|
||
Version Info: This is the most recent version of salmon. | ||
|
||
Index | ||
========== | ||
Creates a salmon index. | ||
|
||
Command Line Options: | ||
-v [ --version ] print version string | ||
-h [ --help ] produce help message | ||
-t [ --transcripts ] arg Transcript fasta file. | ||
-k [ --kmerLen ] arg (=31) The size of k-mers that should be used for the | ||
quasi index. | ||
-i [ --index ] arg salmon index. | ||
--gencode This flag will expect the input transcript | ||
fasta to be in GENCODE format, and will split | ||
the transcript name at the first '|' character. | ||
These reduced names will be used in the output | ||
and when looking for these transcripts in a | ||
gene to transcript GTF. | ||
--features This flag will expect the input reference to be | ||
in the tsv file format, and will split the | ||
feature name at the first 'tab' character. | ||
These reduced names will be used in the output | ||
and when looking for the sequence of the | ||
features.GTF. | ||
--keepDuplicates This flag will disable the default indexing | ||
behavior of discarding sequence-identical | ||
duplicate transcripts. If this flag is passed, | ||
then duplicate transcripts that appear in the | ||
input will be retained and quantified | ||
separately. | ||
-p [ --threads ] arg (=2) Number of threads to use during indexing. | ||
--keepFixedFasta Retain the fixed fasta file (without short | ||
transcripts and duplicates, clipped, etc.) | ||
generated during indexing | ||
-f [ --filterSize ] arg (=-1) The size of the Bloom filter that will be used | ||
by TwoPaCo during indexing. The filter will be | ||
of size 2^{filterSize}. The default value of -1 | ||
means that the filter size will be | ||
automatically set based on the number of | ||
distinct k-mers in the input, as estimated by | ||
nthll. | ||
--tmpdir arg The directory location that will be used for | ||
TwoPaCo temporary files; it will be created if | ||
need be and be removed prior to indexing | ||
completion. The default value will cause a | ||
(temporary) subdirectory of the salmon index | ||
directory to be used for this purpose. | ||
--sparse Build the index using a sparse sampling of | ||
k-mer positions This will require less memory | ||
(especially during quantification), but will | ||
take longer to construct and can slow down | ||
mapping / alignment | ||
-d [ --decoys ] arg Treat these sequences ids from the reference as | ||
the decoys that may have sequence homologous to | ||
some known transcript. for example in case of | ||
the genome, provide a list of chromosome name | ||
--- one per line | ||
-n [ --no-clip ] Don't clip poly-A tails from the ends of target | ||
sequences | ||
--type arg (=puff) The type of index to build; the only option is | ||
"puff" in this version of salmon. |
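
The `--decoys` option expects one sequence name per line. As a rough illustration of how such a list and the matching concatenated reference could be prepared, here is a sketch on toy data; the file names and sequences are invented for the example, and the header parsing assumes names end at the first space.

```bash
work_dir="$(mktemp -d)"

# Toy genome and transcriptome (hypothetical data, for illustration only).
cat > "$work_dir/genome.fa" <<'EOF'
>chr1 toy chromosome
ACGTACGTACGT
>chr2 toy chromosome
TTGGCCAATTGG
EOF
cat > "$work_dir/transcripts.fa" <<'EOF'
>tx1
ACGTACGT
EOF

# Decoy names: FASTA headers without the leading '>' and without any
# description after the first space.
grep '^>' "$work_dir/genome.fa" | cut -d ' ' -f 1 | sed 's/^>//' > "$work_dir/decoys.txt"

# Transcripts go first; the genome (decoy) sequences are appended at the end.
cat "$work_dir/transcripts.fa" "$work_dir/genome.fa" > "$work_dir/gentrome.fa"

cat "$work_dir/decoys.txt"
```

The two resulting files would then be passed as `-t gentrome.fa -d decoys.txt` to `salmon index`.
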
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
#!/bin/bash | ||
|
||
set -e | ||
|
||
## VIASH START | ||
## VIASH END | ||
|
||
[[ "$par_gencode" == "false" ]] && unset par_gencode | ||
[[ "$par_features" == "false" ]] && unset par_features | ||
[[ "$par_keep_duplicates" == "false" ]] && unset par_keep_duplicates | ||
[[ "$par_keep_fixed_fasta" == "false" ]] && unset par_keep_fixed_fasta | ||
[[ "$par_sparse" == "false" ]] && unset par_sparse | ||
[[ "$par_no_clip" == "false" ]] && unset par_no_clip | ||
|
||
tmp_dir=$(mktemp -d -p "$meta_temp_dir" "${meta_functionality_name}_XXXXXX") | ||
mkdir -p "$tmp_dir/temp" | ||
|
||
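# Build a decoy-aware "gentrome" when a genome is given without an explicit decoys file. | ||
if [[ -f "$par_genome" ]] && [[ ! "$par_decoys" ]]; then | ||
filename="$(basename -- "$par_genome")" | ||
decoys="decoys.txt" | ||
if [ "${filename##*.}" == "gz" ]; then | ||
grep '^>' <(gunzip -c "$par_genome") | cut -d ' ' -f 1 > "$decoys" | ||
gentrome="gentrome.fa.gz" | ||
else | ||
grep '^>' "$par_genome" | cut -d ' ' -f 1 > "$decoys" | ||
gentrome="gentrome.fa" | ||
fi | ||
# Strip the leading '>' so that decoys.txt holds bare sequence names. | ||
sed -i.bak -e 's/>//g' "$decoys" | ||
# Transcripts must come first; the genome (decoy) sequences are appended. | ||
cat "$par_transcripts" "$par_genome" > "$gentrome" | ||
else | ||
gentrome="$par_transcripts" | ||
decoys="$par_decoys" | ||
fi | ||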
|
||
salmon index \ | ||
-t "$gentrome" \ | ||
--tmpdir "$tmp_dir/temp" \ | ||
${meta_cpus:+--threads "${meta_cpus}"} \ | ||
-i "$par_index" \ | ||
${par_kmer_len:+-k "${par_kmer_len}"} \ | ||
${par_gencode:+--gencode} \ | ||
${par_features:+--features} \ | ||
${par_keep_duplicates:+--keepDuplicates} \ | ||
${par_keep_fixed_fasta:+--keepFixedFasta} \ | ||
${par_filter_size:+-f "${par_filter_size}"} \ | ||
${par_sparse:+--sparse} \ | ||
${decoys:+-d "${decoys}"} \ | ||
${par_no_clip:+--no-clip} \ | ||
${par_type:+--type "${par_type}"} |
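
The script relies on the `${var:+word}` expansion so that an option is only emitted when its variable is set and non-empty. The sketch below isolates that idiom with hypothetical parameter values and only prints the resulting command line; it does not call salmon. It assumes, as the script above does, that boolean flags arrive as the strings `true` or `false`.

```bash
# Hypothetical parameter values, mimicking what the wrapper would provide.
par_kmer_len=31
par_sparse="false"
par_gencode="true"
unset par_filter_size

# Unset the booleans that are "false" so that :+ drops them entirely.
[[ "$par_sparse" == "false" ]] && unset par_sparse
[[ "$par_gencode" == "false" ]] && unset par_gencode

# Each entry expands to nothing when its variable is unset or empty.
args=(
  ${par_kmer_len:+-k "$par_kmer_len"}
  ${par_gencode:+--gencode}
  ${par_sparse:+--sparse}
  ${par_filter_size:+-f "$par_filter_size"}
)

echo "salmon index ${args[*]}"
```

Collecting the options in an array first makes it easy to inspect the final command before running it.
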
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,35 @@ | ||
#!/bin/bash | ||
|
||
set -e | ||
|
||
echo "> Prepare test data" | ||
|
||
dir_in="test_data" | ||
mkdir -p "$dir_in" | ||
|
||
cat > "$dir_in/transcriptome.fasta" <<'EOF' | ||
>contig1 | ||
AGCTCCAGATTCGCTCAGGCCCTTGATCATCAGTCGTCGTCGTCTTCGATTTGCCAGAGG | ||
AGTTTAGATGAAGAATGTCAAGGATGTTCCTCCCTGCCCTCCCATCTAGCCAAGAACATT | ||
TCCAAGAAGATAAAACTGTCACTGAGACAGGTCTGGATGCGCCCTAGGGGCAAATAGAGA | ||
>contig2 | ||
AGGCCTTTACCACATTGCTGCTGGCTATAGGAAGTCCCAGGTACTAGCCTGAAACAGCTG | ||
ATATTTGGGGCTGTCACAGACAATATGGCCACCCCTTGGTCTTTATGCATGAAGATTATG | ||
TAAAGGTTTTTATTAAAAAATATATATATATATATAAATGATCTAGATTATTTTCCTCTT | ||
TCTGAAGTACTTTCTTAAAAAAATAAAATTAAATGTTTATAGTATTCCCGGT | ||
EOF | ||
|
||
echo ">>> Run salmon_index" | ||
"$meta_executable" \ | ||
--transcripts "$dir_in/transcriptome.fasta" \ | ||
--index index \ | ||
--kmer_len 31 | ||
|
||
echo ">>> Checking whether output exists" | ||
[ ! -d "index" ] && echo "'index' does not exist!" && exit 1 | ||
[ -z "$(ls -A 'index')" ] && echo "'index' is empty!" && exit 1 | ||
[ ! -f "index/info.json" ] && echo "Salmon index does not contain 'info.json'! Not all files were generated correctly!" && exit 1 | ||
grep -Eq '"k": *31,' index/info.json || { echo "The generated Salmon index seems to be incorrect!"; exit 1; } | ||
|
||
echo "All tests succeeded!" | ||
exit 0 |
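
The last check in the test reads the k-mer length back from `index/info.json`. A slightly more defensive variant can be sketched against a mock file: the JSON below is a hypothetical, heavily trimmed stand-in in which only the `"k"` key comes from the test above, and the parsing assumes one key per line as in pretty-printed output.

```bash
work_dir="$(mktemp -d)"

# Hypothetical, trimmed stand-in for index/info.json.
cat > "$work_dir/info.json" <<'EOF'
{
    "placeholder_field": 0,
    "k": 31,
    "another_placeholder": 1
}
EOF

# Print only the integer that follows the "k" key.
k_value="$(sed -n 's/^ *"k": *\([0-9][0-9]*\),*$/\1/p' "$work_dir/info.json")"

# Quoting the value makes a missing key fail loudly instead of being skipped.
if [[ "$k_value" != "31" ]]; then
  echo "Unexpected k-mer length: '${k_value}'" >&2
  exit 1
fi

echo "k-mer length is $k_value"
```
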