Add salmon #24

Merged
merged 32 commits into from
Mar 28, 2024
32 commits
fe880e2
add salmon index and quant
sainirmayi Feb 6, 2024
39f2258
add test resources
sainirmayi Feb 7, 2024
9dd1db3
add help text
sainirmayi Feb 7, 2024
4f2bf8f
script and config
sainirmayi Feb 7, 2024
e9fc570
add test
sainirmayi Feb 7, 2024
ba62b70
update script and test
sainirmayi Feb 10, 2024
6da7254
Merge branch 'main' into add_salmon
sainirmayi Feb 10, 2024
9e4541a
add salmon quant
sainirmayi Feb 19, 2024
0844dd5
Merge branch 'main' into add_salmon
sainirmayi Feb 19, 2024
fd966c0
update CHANGELOG.md
sainirmayi Feb 19, 2024
97f317c
Merge branch 'main' into add_salmon
sainirmayi Feb 26, 2024
ec0f3d3
update to viash 0.9 format
sainirmayi Feb 26, 2024
6efc37d
remove echo ststement
sainirmayi Mar 11, 2024
f02d38c
output the main salmon output file separately
sainirmayi Mar 21, 2024
491d48e
check if output file has the right columns
sainirmayi Mar 21, 2024
3620565
Merge remote-tracking branch 'origin/main' into add_salmon
sainirmayi Mar 21, 2024
3ee96de
check if correct output files were generated
sainirmayi Mar 21, 2024
c3029cb
fix doi
sainirmayi Mar 21, 2024
51c7d74
use the default multiple separator
sainirmayi Mar 25, 2024
eba9e45
rename components
sainirmayi Mar 25, 2024
2658305
remove print statements
sainirmayi Mar 25, 2024
6bc90ce
check info.json output
sainirmayi Mar 25, 2024
9abc7b9
reduce size of fastq files and generate index in test script
sainirmayi Mar 26, 2024
7395c8a
use smaller (manual) test data
rcannood Mar 27, 2024
d9908b3
set A as the default lib_type
rcannood Mar 27, 2024
511b220
Merge branch 'add_salmon' of https://github.com/viash-hub/biobase int…
sainirmayi Mar 27, 2024
2524316
add test to check content of output index
sainirmayi Mar 27, 2024
7ca0629
add more detailed description about libType
sainirmayi Mar 27, 2024
b501457
add test data
sainirmayi Mar 27, 2024
16e9275
delete test data
sainirmayi Mar 27, 2024
8f27180
Merge branch 'add_salmon' of https://github.com/viash-hub/biobase int…
sainirmayi Mar 27, 2024
b8bbbe0
move to salmon_index and salmon_quant
sainirmayi Mar 27, 2024
6 changes: 5 additions & 1 deletion CHANGELOG.md
@@ -29,7 +29,11 @@

* `star/star_align_reads`: Align reads to a reference genome (PR #22).

* `gffread`: Validate, filter, convert and perform other operations on GFF files (PR #29).

* `salmon`:
  - `salmon/salmon_index`: Create a salmon index for the transcriptome to use Salmon in the mapping-based mode (PR #24).
  - `salmon/salmon_quant`: Transcript quantification from RNA-seq data (PR #24).

## MAJOR CHANGES

113 changes: 113 additions & 0 deletions src/salmon/salmon_index/config.vsh.yaml
@@ -0,0 +1,113 @@
name: salmon_index
namespace: salmon
description: |
  Salmon is a tool for wicked-fast transcript quantification from RNA-seq data. It can either make use of pre-computed alignments (in the form of a SAM/BAM file) to the transcripts rather than the raw reads, or can be run in the mapping-based mode. This component creates a salmon index for the transcriptome to use Salmon in the mapping-based mode. It is generally recommended that you build a decoy-aware transcriptome file. This is done by using the entire genome of the organism as the decoy sequence: the genome is concatenated to the end of the transcriptome to be indexed, and the decoys.txt file is populated with the chromosome names.
keywords: ["Transcriptome", "Index"]
links:
  homepage: https://salmon.readthedocs.io/en/latest/salmon.html
  documentation: https://salmon.readthedocs.io/en/latest/salmon.html
  repository: https://github.com/COMBINE-lab/salmon
references:
  doi: 10.1038/nmeth.4197
license: GPL-3.0
requirements:
  commands: [ salmon ]

argument_groups:
  - name: Inputs
    arguments:
      - name: --genome
        type: file
        description: |
          Genome of the organism, used to prepare the set of decoy sequences. Required to build a decoy-aware transcriptome.
        required: false
        example: genome.fasta
      - name: --transcripts
        alternatives: ["-t"]
        type: file
        description: |
          Transcript fasta file.
        required: true
        example: transcriptome.fasta
      - name: --kmer_len
        alternatives: ["-k"]
        type: integer
        description: |
          The size of k-mers that should be used for the quasi index.
        required: false
        example: 31
      - name: --gencode
        type: boolean_true
        description: |
          This flag will expect the input transcript fasta to be in GENCODE format, and will split the transcript name at the first '|' character. These reduced names will be used in the output and when looking for these transcripts in a gene to transcript GTF.
      - name: --features
        type: boolean_true
        description: |
          This flag will expect the input reference to be in the tsv file format, and will split the feature name at the first 'tab' character. These reduced names will be used in the output and when looking for the sequence of the features.GTF.
      - name: --keep_duplicates
        type: boolean_true
        description: |
          This flag will disable the default indexing behavior of discarding sequence-identical duplicate transcripts. If this flag is passed, then duplicate transcripts that appear in the input will be retained and quantified separately.
      - name: --keep_fixed_fasta
        type: boolean_true
        description: |
          Retain the fixed fasta file (without short transcripts and duplicates, clipped, etc.) generated during indexing.
      - name: --filter_size
        alternatives: ["-f"]
        type: integer
        description: |
          The size of the Bloom filter that will be used by TwoPaCo during indexing. The filter will be of size 2^{filter_size}. The default value of -1 means that the filter size will be automatically set based on the number of distinct k-mers in the input, as estimated by nthll.
        required: false
        example: -1
      - name: --sparse
        type: boolean_true
        description: |
          Build the index using a sparse sampling of k-mer positions. This will require less memory (especially during quantification), but will take longer to construct and can slow down mapping / alignment.
      - name: --decoys
        alternatives: ["-d"]
        type: file
        description: |
          Treat these sequence IDs from the reference as decoys that may have sequence homologous to some known transcript. For example, in the case of the genome, provide a list of chromosome names (one per line).
        required: false
        example: decoys.txt
      - name: --no_clip
        alternatives: ["-n"]
        type: boolean_true
        description: |
          Don't clip poly-A tails from the ends of target sequences.
      - name: --type
        type: string
        description: |
          The type of index to build; the only option is "puff" in this version of salmon.
        required: false
        example: puff

  - name: Output
    arguments:
      - name: --index
        alternatives: ["-i"]
        type: file
        direction: output
        description: |
          Salmon index
        required: true
        example: Salmon_index

resources:
  - type: bash_script
    path: script.sh

test_resources:
  - type: bash_script
    path: test.sh

engines:
  - type: docker
    image: quay.io/biocontainers/salmon:1.10.2--hecfa306_0
    setup:
      - type: docker
        run: |
          salmon index -v 2>&1 | sed 's/salmon \([0-9.]*\)/salmon: \1/' > /var/software_versions.txt
runners:
  - type: executable
  - type: nextflow
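The `boolean_true` arguments declared in this config reach the wrapper script as `par_*` variables, which the script in this PR compares against the string `false`. A minimal sketch of how such a value can be turned into an optional salmon flag with the `${var:+...}` expansion; the helper name `build_flags` is hypothetical and the values are passed by hand rather than set by Viash:

```bash
#!/bin/bash
# Hypothetical helper: print the salmon flags implied by two component arguments.
# $1 = value of --keep_duplicates ("true" or "false")
# $2 = value of --kmer_len (may be empty)
build_flags() {
  local par_keep_duplicates="$1" par_kmer_len="$2"
  # A "false" boolean is unset so that ${var:+...} expands to nothing.
  [ "$par_keep_duplicates" = "false" ] && unset par_keep_duplicates
  # Collect the optional flags as positional parameters, then print them.
  set -- ${par_keep_duplicates:+--keepDuplicates} ${par_kmer_len:+-k "$par_kmer_len"}
  printf '%s\n' "$*"
}

build_flags true 31     # prints: --keepDuplicates -k 31
build_flags false ""    # prints an empty line
```

Leaving an argument unset therefore drops the flag entirely, so salmon falls back to its own default for that option.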
66 changes: 66 additions & 0 deletions src/salmon/salmon_index/help.txt
@@ -0,0 +1,66 @@
```bash
salmon index -h
```

Version Info: This is the most recent version of salmon.

Index
==========
Creates a salmon index.

Command Line Options:
  -v [ --version ]              print version string
  -h [ --help ]                 produce help message
  -t [ --transcripts ] arg      Transcript fasta file.
  -k [ --kmerLen ] arg (=31)    The size of k-mers that should be used for the
                                quasi index.
  -i [ --index ] arg            salmon index.
  --gencode                     This flag will expect the input transcript
                                fasta to be in GENCODE format, and will split
                                the transcript name at the first '|' character.
                                These reduced names will be used in the output
                                and when looking for these transcripts in a
                                gene to transcript GTF.
  --features                    This flag will expect the input reference to be
                                in the tsv file format, and will split the
                                feature name at the first 'tab' character.
                                These reduced names will be used in the output
                                and when looking for the sequence of the
                                features.GTF.
  --keepDuplicates              This flag will disable the default indexing
                                behavior of discarding sequence-identical
                                duplicate transcripts. If this flag is passed,
                                then duplicate transcripts that appear in the
                                input will be retained and quantified
                                separately.
  -p [ --threads ] arg (=2)     Number of threads to use during indexing.
  --keepFixedFasta              Retain the fixed fasta file (without short
                                transcripts and duplicates, clipped, etc.)
                                generated during indexing
  -f [ --filterSize ] arg (=-1) The size of the Bloom filter that will be used
                                by TwoPaCo during indexing. The filter will be
                                of size 2^{filterSize}. The default value of -1
                                means that the filter size will be
                                automatically set based on the number of
                                distinct k-mers in the input, as estimated by
                                nthll.
  --tmpdir arg                  The directory location that will be used for
                                TwoPaCo temporary files; it will be created if
                                need be and be removed prior to indexing
                                completion. The default value will cause a
                                (temporary) subdirectory of the salmon index
                                directory to be used for this purpose.
  --sparse                      Build the index using a sparse sampling of
                                k-mer positions This will require less memory
                                (especially during quantification), but will
                                take longer to construct and can slow down
                                mapping / alignment
  -d [ --decoys ] arg           Treat these sequences ids from the reference as
                                the decoys that may have sequence homologous to
                                some known transcript. for example in case of
                                the genome, provide a list of chromosome name
                                --- one per line
  -n [ --no-clip ]              Don't clip poly-A tails from the ends of target
                                sequences
  --type arg (=puff)            The type of index to build; the only option is
                                "puff" in this version of salmon.
49 changes: 49 additions & 0 deletions src/salmon/salmon_index/script.sh
@@ -0,0 +1,49 @@
#!/bin/bash

set -e

## VIASH START
## VIASH END

# boolean_true arguments arrive as "true"/"false"; unset the false ones so
# that the ${var:+...} expansions below omit the corresponding flags.
[[ "$par_gencode" == "false" ]] && unset par_gencode
[[ "$par_features" == "false" ]] && unset par_features
[[ "$par_keep_duplicates" == "false" ]] && unset par_keep_duplicates
[[ "$par_keep_fixed_fasta" == "false" ]] && unset par_keep_fixed_fasta
[[ "$par_sparse" == "false" ]] && unset par_sparse
[[ "$par_no_clip" == "false" ]] && unset par_no_clip

# Scratch directory passed to salmon via --tmpdir.
tmp_dir=$(mktemp -d -p "$meta_temp_dir" "${meta_functionality_name}_XXXXXX")
mkdir -p "$tmp_dir/temp"

# If a genome is given without an explicit decoy list, build a decoy-aware
# "gentrome": the decoy names are the genome's sequence IDs, and the genome
# is appended to the transcriptome.
if [[ -f "$par_genome" ]] && [[ ! "$par_decoys" ]]; then
  filename="$(basename -- "$par_genome")"
  decoys="decoys.txt"
  if [[ "${filename##*.}" == "gz" ]]; then
    # NOTE: the transcriptome is assumed to be gzip-compressed as well,
    # since both files are concatenated into gentrome.fa.gz below.
    grep '^>' <(gunzip -c "$par_genome") | cut -d ' ' -f 1 > "$decoys"
    gentrome="gentrome.fa.gz"
  else
    grep '^>' "$par_genome" | cut -d ' ' -f 1 > "$decoys"
    gentrome="gentrome.fa"
  fi
  # Strip the '>' so that only the sequence names remain.
  sed -i.bak -e 's/>//g' "$decoys"
  cat "$par_transcripts" "$par_genome" > "$gentrome"
else
  gentrome="$par_transcripts"
  decoys="$par_decoys"
fi

salmon index \
  -t "$gentrome" \
  --tmpdir "$tmp_dir/temp" \
  ${meta_cpus:+--threads "${meta_cpus}"} \
  -i "$par_index" \
  ${par_kmer_len:+-k "${par_kmer_len}"} \
  ${par_gencode:+--gencode} \
  ${par_features:+--features} \
  ${par_keep_duplicates:+--keepDuplicates} \
  ${par_keep_fixed_fasta:+--keepFixedFasta} \
  ${par_filter_size:+-f "${par_filter_size}"} \
  ${par_sparse:+--sparse} \
  ${decoys:+-d "${decoys}"} \
  ${par_no_clip:+--no-clip} \
  ${par_type:+--type "${par_type}"}
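The decoy handling in this script boils down to two steps: collect the genome's sequence names into a decoy list, then append the genome to the transcriptome. A self-contained sketch on toy FASTA data; the file names, sequences, and the helper name `make_gentrome` are made up for illustration, and both inputs are assumed to be uncompressed:

```bash
#!/bin/bash
set -e

# Hypothetical helper: write decoys.txt and gentrome.fa into directory $3,
# given a transcriptome ($1) and a genome ($2), both plain-text FASTA.
make_gentrome() {
  local transcripts="$1" genome="$2" out_dir="$3"
  # Header lines start with '>'; keep the first word and drop the '>'.
  grep '^>' "$genome" | cut -d ' ' -f 1 | sed 's/^>//' > "$out_dir/decoys.txt"
  # Transcripts first, decoy sequences (the genome) at the end.
  cat "$transcripts" "$genome" > "$out_dir/gentrome.fa"
}

# Toy inputs, for illustration only.
work_dir=$(mktemp -d)
printf '>tx1\nACGTACGT\n' > "$work_dir/transcriptome.fa"
printf '>chr1 toy chromosome\nGGCCGGCC\n>chr2\nTTAATTAA\n' > "$work_dir/genome.fa"

make_gentrome "$work_dir/transcriptome.fa" "$work_dir/genome.fa" "$work_dir"
cat "$work_dir/decoys.txt"   # lists chr1 and chr2, one per line
```

The two resulting files correspond to what the script passes to `salmon index` through `-t` and `-d`.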
35 changes: 35 additions & 0 deletions src/salmon/salmon_index/test.sh
@@ -0,0 +1,35 @@
#!/bin/bash

set -e

echo "> Prepare test data"

dir_in="test_data"
mkdir -p "$dir_in"

cat > "$dir_in/transcriptome.fasta" <<'EOF'
>contig1
AGCTCCAGATTCGCTCAGGCCCTTGATCATCAGTCGTCGTCGTCTTCGATTTGCCAGAGG
AGTTTAGATGAAGAATGTCAAGGATGTTCCTCCCTGCCCTCCCATCTAGCCAAGAACATT
TCCAAGAAGATAAAACTGTCACTGAGACAGGTCTGGATGCGCCCTAGGGGCAAATAGAGA
>contig2
AGGCCTTTACCACATTGCTGCTGGCTATAGGAAGTCCCAGGTACTAGCCTGAAACAGCTG
ATATTTGGGGCTGTCACAGACAATATGGCCACCCCTTGGTCTTTATGCATGAAGATTATG
TAAAGGTTTTTATTAAAAAATATATATATATATATAAATGATCTAGATTATTTTCCTCTT
TCTGAAGTACTTTCTTAAAAAAATAAAATTAAATGTTTATAGTATTCCCGGT
EOF

echo ">>> Run salmon_index"
"$meta_executable" \
  --transcripts "$dir_in/transcriptome.fasta" \
  --index index \
  --kmer_len 31

echo ">>> Checking whether output exists"
[ ! -d "index" ] && echo "'index' does not exist!" && exit 1
[ -z "$(ls -A 'index')" ] && echo "'index' is empty!" && exit 1
[ ! -f "index/info.json" ] && echo "Salmon index does not contain 'info.json'! Not all files were generated correctly!" && exit 1

echo ">>> Checking the k-mer length recorded in the index"
# grep -q fails when no line matches, so a missing "k" entry is caught as well.
if ! grep -q '"k": 31,' index/info.json; then
  echo "The generated Salmon index seems to be incorrect!"
  exit 1
fi

echo "All tests succeeded!"
exit 0
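The last check in this test reads the k-mer length back from `index/info.json`. A sketch of the same check as a reusable function, run against a hand-written stand-in file; the function name `index_kmer_len` and the JSON content are assumptions for illustration, and a real `info.json` contains other fields:

```bash
#!/bin/bash
# Hypothetical helper: print the integer stored under "k" in a JSON file.
# Assumes the key appears once, on a line of the form `"k": 31,`.
index_kmer_len() {
  sed -n 's/.*"k": *\([0-9][0-9]*\).*/\1/p' "$1"
}

# Stand-in for index/info.json; "placeholder" is not a real salmon field.
info_json=$(mktemp)
printf '{\n  "k": 31,\n  "placeholder": 0\n}\n' > "$info_json"

if [ "$(index_kmer_len "$info_json")" != "31" ]; then
  echo "Unexpected k-mer length in index!"
  exit 1
fi
echo "k-mer length OK"
```

Comparing the extracted number rather than a raw `cut` field keeps the check independent of the trailing comma and surrounding whitespace.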