Add Kallisto index #149

Merged on Sep 13, 2024 (17 commits)
Changes from 16 commits
6 changes: 5 additions & 1 deletion CHANGELOG.md
@@ -145,7 +145,11 @@
* `sortmerna`: Local sequence alignment tool for mapping, clustering, and filtering rRNA from metatranscriptomic
data. (PR #146)

* `fq_subsample`: Sample a subset of records from single or paired FASTQ files (PR #147).

* `kallisto`:
    - `kallisto_index`: Create a kallisto index (PR #149).


## MINOR CHANGES

89 changes: 89 additions & 0 deletions src/kallisto/kallisto_index/config.vsh.yaml
@@ -0,0 +1,89 @@
name: kallisto_index
namespace: kallisto
description: |
  Build a Kallisto index for the transcriptome to use Kallisto in the mapping-based mode.
keywords: [kallisto, index]
links:
  homepage: https://pachterlab.github.io/kallisto/about
  documentation: https://pachterlab.github.io/kallisto/manual
  repository: https://github.com/pachterlab/kallisto
  issue_tracker: https://github.com/pachterlab/kallisto/issues
references:
  doi: https://doi.org/10.1038/nbt.3519
license: BSD 2-Clause License

argument_groups:
  - name: "Input"
    arguments:
      - name: "--input"
        type: file
        description: |
          Path to a FASTA-file containing the transcriptome sequences, either in plain text or
          compressed (.gz) format.
        required: true
      - name: "--d_list"
        type: file
        description: |
          Path to a FASTA-file containing sequences to mask from quantification.

  - name: "Output"
    arguments:
      - name: "--kallisto_index"
        # Review comment (Contributor): Why --kallisto-index instead of --index?
        type: file
        direction: output
        must_exist: false
        # Review comment (Contributor): Is there a reason for this?
        # Reply (Collaborator, Author): No, I think I meant to remove that.
        example: Kallisto_index

  - name: "Options"
    arguments:
      - name: "--kmer_size"
        type: integer
        description: |
          K-mer length (odd, maximum 31) passed to the indexing step (default: 31).
        example: 31
      - name: "--make_unique"
        type: boolean_true
        description: |
          Replace repeated target names with unique names.
      - name: "--aa"
        type: boolean_true
        description: |
          Generate index from a FASTA-file containing amino acid sequences.
      - name: "--distinguish"
        type: boolean_true
        description: |
          Generate index where sequences are distinguished by the sequence names.
      - name: "--min_size"
        alternatives: ["-m"]
        type: integer
        description: |
          Length of minimizers (default: automatically chosen).
      - name: "--ec_max_size"
        alternatives: ["-e"]
        type: integer
        description: |
          Maximum number of targets in an equivalence class (default: no maximum).

resources:
  - type: bash_script
    path: script.sh

test_resources:
  - type: bash_script
    path: test.sh
  - path: test_data

engines:
  - type: docker
    image: ubuntu:22.04
    setup:
      - type: docker
        run: |
          apt-get update && \
          apt-get install -y --no-install-recommends wget && \
          wget --no-check-certificate https://github.com/pachterlab/kallisto/releases/download/v0.50.1/kallisto_linux-v0.50.1.tar.gz && \
          tar -xzf kallisto_linux-v0.50.1.tar.gz && \
          mv kallisto/kallisto /usr/local/bin/
runners:
  - type: executable
  - type: nextflow
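
Each argument declared in the config reaches the Bash script as a `par_`-prefixed variable, which is why script.sh reads `par_kmer_size`, `par_d_list`, and so on. The sketch below mimics that naming rule so a config argument can be checked against the variable the script expects. `arg_to_par` is a hypothetical helper written for illustration and is not part of Viash.

```shell
#!/bin/bash
# Hypothetical helper (not part of Viash): derive the variable name a
# Bash script would read for a given long option, assuming the
# "par_" + option-name convention seen in script.sh.
arg_to_par() {
  local name="${1#--}"               # strip the leading "--"
  printf 'par_%s\n' "${name//-/_}"   # normalise any hyphens to underscores
}

# Print the mapping for a few arguments from the config.
for arg in --input --d_list --kallisto_index --kmer_size; do
  printf '%s -> %s\n' "$arg" "$(arg_to_par "$arg")"
done
```

A misspelled argument name yields a variable that the script never reads, so comparing the derived names against the `par_*` references in script.sh is a cheap consistency check.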
21 changes: 21 additions & 0 deletions src/kallisto/kallisto_index/help.txt
@@ -0,0 +1,21 @@
```
kallisto index
```
kallisto 0.50.1
Builds a kallisto index

Usage: kallisto index [arguments] FASTA-files

Required argument:
-i, --index=STRING Filename for the kallisto index to be constructed

Optional argument:
-k, --kmer-size=INT k-mer (odd) length (default: 31, max value: 31)
-t, --threads=INT Number of threads to use (default: 1)
-d, --d-list=STRING Path to a FASTA-file containing sequences to mask from quantification
--make-unique Replace repeated target names with unique names
--aa Generate index from a FASTA-file containing amino acid sequences
--distinguish Generate index where sequences are distinguished by the sequence name
-T, --tmp=STRING Temporary directory (default: tmp)
-m, --min-size=INT Length of minimizers (default: automatically chosen)
-e, --ec-max-size=INT Maximum number of targets in an equivalence class (default: no maximum)
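
The help text constrains `--kmer-size` to an odd value of at most 31. A minimal sketch of that check as a standalone function follows. The name `is_valid_kmer_size` is hypothetical, and the lower bound of 1 is an assumption taken from script.sh, since the help text does not state one.

```shell
#!/bin/bash
# Hypothetical helper: return 0 when $1 is an odd integer between 1 and 31,
# non-zero otherwise. The upper bound and oddness come from the help text;
# the lower bound of 1 is an assumption mirrored from script.sh.
is_valid_kmer_size() {
  local k="$1"
  [[ "$k" =~ ^[0-9]+$ ]] || return 1   # reject anything that is not digits
  # 10# forces base 10 so values with leading zeros are not read as octal.
  (( 10#$k >= 1 && 10#$k <= 31 && 10#$k % 2 == 1 ))
}

for k in 21 31 32 20 abc; do
  if is_valid_kmer_size "$k"; then
    echo "$k: ok"
  else
    echo "$k: rejected"
  fi
done
```

Rejecting non-numeric input first keeps the arithmetic test from treating an arbitrary string as a variable name.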
25 changes: 25 additions & 0 deletions src/kallisto/kallisto_index/script.sh
@@ -0,0 +1,25 @@
#!/bin/bash

## VIASH START
## VIASH END

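set -eo pipefail

# Assumption: Viash passes an unset boolean_true flag as the string "false".
# Clear such values so the ${var:+...} expansions below omit the flag.
unset_if_false=( par_make_unique par_aa par_distinguish )
for var in "${unset_if_false[@]}"; do
  if [[ "${!var}" == "false" ]]; then
    unset "$var"
  fi
done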

if [ -n "$par_kmer_size" ]; then
  if [[ "$par_kmer_size" -lt 1 || "$par_kmer_size" -gt 31 || $(( par_kmer_size % 2 )) -eq 0 ]]; then
    echo "Error: Kmer size must be an odd number between 1 and 31."
    exit 1
  fi
fi

kallisto index \
  -i "${par_kallisto_index}" \
  ${par_kmer_size:+--kmer-size $par_kmer_size} \
  ${par_make_unique:+--make-unique} \
  ${par_aa:+--aa} \
  ${par_distinguish:+--distinguish} \
  ${par_min_size:+--min-size $par_min_size} \
  ${par_ec_max_size:+--ec-max-size $par_ec_max_size} \
  ${par_d_list:+--d-list "${par_d_list}"} \
  "${par_input}"

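The script relies on the `${var:+word}` expansion, which produces `word` only when `var` is set and non-empty, so unset options simply disappear from the command line. The sketch below shows the idea in isolation. `build_args` is a hypothetical function, and the handling of the literal string "false" rests on the assumption that boolean flags arrive as "true" or "false".

```shell
#!/bin/bash
# Hypothetical helper: build an option string from two optional values.
# "${var:+word}" expands to word only when var is set and non-empty.
build_args() {
  local kmer_size="$1" make_unique="$2"
  # Assumption: boolean flags arrive as the strings "true" or "false".
  # A literal "false" is non-empty, so clear it before expanding.
  if [[ "$make_unique" == "false" ]]; then
    make_unique=""
  fi
  local args=(
    ${kmer_size:+--kmer-size "$kmer_size"}
    ${make_unique:+--make-unique}
  )
  echo "${args[*]}"
}

build_args 21 true
build_args "" false
```

Collecting the options in an array first makes it easy to print or inspect the final command before running it.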
35 changes: 35 additions & 0 deletions src/kallisto/kallisto_index/test.sh
@@ -0,0 +1,35 @@
#!/bin/bash

echo ">>> Test1: Testing $meta_functionality_name with non-default k-mer size"

"$meta_executable" \
  --input "$meta_resources_dir/test_data/transcriptome.fasta" \
  --kallisto_index Kallisto \
  --kmer_size 21


echo ">>> Checking whether output exists and is correct"
[ ! -f "Kallisto" ] && echo "Kallisto index does not exist!" && exit 1
[ ! -s "Kallisto" ] && echo "Kallisto index is empty!" && exit 1

kallisto inspect Kallisto 2> test.txt
grep "number of k-mers: 2,978" test.txt || { echo "The content of the index seems to be incorrect." && exit 1; }

################################################################################

echo ">>> Test2: Testing $meta_functionality_name with d_list argument"

"$meta_executable" \
  --input "$meta_resources_dir/test_data/transcriptome.fasta" \
  --kallisto_index Kallisto \
  --d_list "$meta_resources_dir/test_data/d_list.fasta"

echo ">>> Checking whether output exists and is correct"
[ ! -f "Kallisto" ] && echo "Kallisto index does not exist!" && exit 1
[ ! -s "Kallisto" ] && echo "Kallisto index is empty!" && exit 1

kallisto inspect Kallisto 2> test.txt
grep "number of k-mers: 3,056" test.txt || { echo "The content of the index seems to be incorrect." && exit 1; }

echo "All tests succeeded!"
exit 0
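
Both tests repeat the same two checks: the index exists and it is not empty. One way to factor that out is sketched below. `assert_nonempty_file` is a hypothetical helper, not something the test script defines, and the demo runs against throwaway files rather than a real kallisto index.

```shell
#!/bin/bash
# Hypothetical helper: report an error unless $1 is a regular, non-empty file.
# It returns a status instead of exiting so the caller decides what to do.
assert_nonempty_file() {
  local path="$1"
  if [ ! -f "$path" ]; then
    echo "$path does not exist!" >&2
    return 1
  fi
  if [ ! -s "$path" ]; then
    echo "$path is empty!" >&2
    return 1
  fi
}

# Demo on throwaway files.
tmpdir="$(mktemp -d)"
printf 'data' > "$tmpdir/filled"
: > "$tmpdir/empty"
assert_nonempty_file "$tmpdir/filled" && echo "filled: ok"
assert_nonempty_file "$tmpdir/empty" || echo "empty: rejected"
assert_nonempty_file "$tmpdir/missing" || echo "missing: rejected"
rm -rf "$tmpdir"
```

In a test script the call would typically be followed by `|| exit 1`, which keeps the failure handling in one place.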
5 changes: 5 additions & 0 deletions src/kallisto/kallisto_index/test_data/d_list.fasta
@@ -0,0 +1,5 @@
>YAL067W-A CDS=1-228
ATGCCAATTATAGGGGTGCCGAGGTGCCTTATAAAACCCTTTTCTGTGCCTGTGACATTTCCTTTTTCGG
TCAAAAAGAATATCCGAATTTTAGATTTGGACCCTCGTACAGAAGCTTATTGTCTAAGCCTGAATTCAGT
CTGCTTTAAACGGCTTCCGCGGAGGAAATATTTCCATCTCTTGAATTCGTACAACATTAAACGTGTGTTG
GGAGTCGTATACTGTTAG
23 changes: 23 additions & 0 deletions src/kallisto/kallisto_index/test_data/transcriptome.fasta
@@ -0,0 +1,23 @@
>YAL069W CDS=1-315
ATGATCGTAAATAACACACACGTGCTTACCCTACCACTTTATACCACCACCACATGCCATACTCACCCTC
ACTTGTATACTGATTTTACGTACGCACACGGATGCTACAGTATATACCATCTCAAACTTACCCTACTCTC
AGATTCCACTTCACTCCATGGCCCATCTCTCACTGAATCAGTACCAAATGCACTCACATCATTATGCACG
GCACTTGCCTCAGCGGTCTATACCCTGTGCCATTTACCCATAACGCCCATCATTATCCACATTTTGATAT
CTATATCTCATTCGGCGGTCCCAAATATTGTATAA
>YAL068W-A CDS=1-255
ATGCACGGCACTTGCCTCAGCGGTCTATACCCTGTGCCATTTACCCATAACGCCCATCATTATCCACATT
TTGATATCTATATCTCATTCGGCGGTCCCAAATATTGTATAACTGCCCTTAATACATACGTTATACCACT
TTTGCACCATATACTTACCACTCCATTTATATACACTTATGTCAATATTACAGAAAAATCCCCACAAAAA
TCACCTAAACATAAAAATATTCTACTTTTCAACAATAATACATAA
>YAL068C CDS=1-363
ATGGTCAAATTAACTTCAATCGCCGCTGGTGTCGCTGCCATCGCTGCTACTGCTTCTGCAACCACCACTC
TAGCTCAATCTGACGAAAGAGTCAACTTGGTGGAATTGGGTGTCTACGTCTCTGATATCAGAGCTCACTT
AGCCCAATACTACATGTTCCAAGCCGCCCACCCAACTGAAACCTACCCAGTCGAAGTTGCTGAAGCCGTT
TTCAACTACGGTGACTTCACCACCATGTTGACCGGTATTGCTCCAGACCAAGTGACCAGAATGATCACCG
GTGTTCCATGGTACTCCAGCAGATTAAAGCCAGCCATCTCCAGTGCTCTATCCAAGGACGGTATCTACAC
TATCGCAAACTAG
>YAL067W-A CDS=1-228
ATGCCAATTATAGGGGTGCCGAGGTGCCTTATAAAACCCTTTTCTGTGCCTGTGACATTTCCTTTTTCGG
TCAAAAAGAATATCCGAATTTTAGATTTGGACCCTCGTACAGAAGCTTATTGTCTAAGCCTGAATTCAGT
CTGCTTTAAACGGCTTCCGCGGAGGAAATATTTCCATCTCTTGAATTCGTACAACATTAAACGTGTGTTG
GGAGTCGTATACTGTTAG