bd_rhapsody_make_reference: Create a reference for the BD Rhapsody pipeline (#75)

* `bd_rhapsody/bd_rhapsody_make_reference`: Create a reference for the BD Rhapsody pipeline

* add missing metadata

* remove unicode

* trigger

* process comments

* add authors

* Apply suggestions from code review

Co-authored-by: Dorien <[email protected]>

---------

Co-authored-by: Dorien <[email protected]>
rcannood and dorien-er authored Jul 17, 2024
1 parent f71ed87 commit 7d99065
Showing 11 changed files with 660 additions and 0 deletions.
6 changes: 6 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,11 @@
# biobox x.x.x

## NEW FEATURES

* `bd_rhapsody`:

- `bd_rhapsody/bd_rhapsody_make_reference`: Create a reference for the BD Rhapsody pipeline (PR #75).

## BUG FIXES

* `pear`: fix component not exiting with the correct exitcode when PEAR fails.
14 changes: 14 additions & 0 deletions src/_authors/robrecht_cannoodt.yaml
@@ -0,0 +1,14 @@
name: Robrecht Cannoodt
info:
  links:
    email: [email protected]
    github: rcannood
    orcid: "0000-0003-3641-729X"
    linkedin: robrechtcannoodt
  organizations:
    - name: Data Intuitive
      href: https://www.data-intuitive.com
      role: Data Science Engineer
    - name: Open Problems
      href: https://openproblems.bio
      role: Core Member
5 changes: 5 additions & 0 deletions src/_authors/weiwei_schultz.yaml
@@ -0,0 +1,5 @@
name: Weiwei Schultz
info:
  organizations:
    - name: Janssen R&D US
      role: Associate Director Data Sciences
143 changes: 143 additions & 0 deletions src/bd_rhapsody/bd_rhapsody_make_reference/config.vsh.yaml
@@ -0,0 +1,143 @@
name: bd_rhapsody_make_reference
namespace: bd_rhapsody
description: |
  The Reference Files Generator creates an archive containing Genome Index
  and Transcriptome annotation files needed for the BD Rhapsody Sequencing
  Analysis Pipeline. The app takes as input one or more FASTA and GTF files
  and produces a compressed archive in the form of a tar.gz file. The
  archive contains:
    - STAR index
    - Filtered GTF file
keywords: [genome, reference, index, align]
links:
  repository: https://bitbucket.org/CRSwDev/cwl/src/master/v2.2.1/Extra_Utilities/
  documentation: https://bd-rhapsody-bioinfo-docs.genomics.bd.com/resources/extra_utilities.html#make-rhapsody-reference
license: Unknown
authors:
  - __merge__: /src/_authors/robrecht_cannoodt.yaml
    roles: [ author, maintainer ]
  - __merge__: /src/_authors/weiwei_schultz.yaml
    roles: [ contributor ]

argument_groups:
  - name: Inputs
    arguments:
      - type: file
        name: --genome_fasta
        required: true
        description: Reference genome file in FASTA or FASTA.GZ format. The BD Rhapsody Sequencing Analysis Pipeline uses GRCh38 for Human and GRCm39 for Mouse.
        example: genome_sequence.fa.gz
        multiple: true
        info:
          config_key: Genome_fasta
      - type: file
        name: --gtf
        required: true
        description: |
          File path to the transcript annotation files in GTF or GTF.GZ format. The Sequence Analysis Pipeline requires the 'gene_name' or
          'gene_id' attribute to be set on each gene and exon feature. Gene and exon feature lines must have the same attribute, and exons
          must have a corresponding gene with the same value. For TCR/BCR assays, the TCR or BCR gene segments must have the 'gene_type' or
          'gene_biotype' attribute set, and the value should begin with 'TR' or 'IG', respectively.
        example: transcriptome_annotation.gtf.gz
        multiple: true
        info:
          config_key: Gtf
      - type: file
        name: --extra_sequences
        description: |
          File path to additional sequences in FASTA format to use when building the STAR index. (e.g. transgenes or CRISPR guide barcodes).
          GTF lines for these sequences will be automatically generated and combined with the main GTF.
        required: false
        multiple: true
        info:
          config_key: Extra_sequences
  - name: Outputs
    arguments:
      - type: file
        name: --reference_archive
        direction: output
        required: true
        description: |
          A Compressed archive containing the Reference Genome Index and annotation GTF files. This archive is meant to be used as an
          input in the BD Rhapsody Sequencing Analysis Pipeline.
        example: star_index.tar.gz
  - name: Arguments
    arguments:
      - type: string
        name: --mitochondrial_contigs
        description: |
          Names of the Mitochondrial contigs in the provided Reference Genome. Fragments originating from contigs other than these are
          identified as 'nuclear fragments' in the ATACseq analysis pipeline.
        required: false
        multiple: true
        default: [chrM, chrMT, M, MT]
        info:
          config_key: Mitochondrial_contigs
      - type: boolean_true
        name: --filtering_off
        description: |
          By default the input Transcript Annotation files are filtered based on the gene_type/gene_biotype attribute. Only features
          having the following attribute values are kept:
            - protein_coding
            - lncRNA (lincRNA and antisense for Gencode < v31/M22/Ensembl97)
            - IG_LV_gene
            - IG_V_gene
            - IG_V_pseudogene
            - IG_D_gene
            - IG_J_gene
            - IG_J_pseudogene
            - IG_C_gene
            - IG_C_pseudogene
            - TR_V_gene
            - TR_V_pseudogene
            - TR_D_gene
            - TR_J_gene
            - TR_J_pseudogene
            - TR_C_gene
          If you have already pre-filtered the input Annotation files and/or wish to turn-off the filtering, please set this option to True.
        info:
          config_key: Filtering_off
      - type: boolean_true
        name: --wta_only_index
        description: Build a WTA only index, otherwise builds a WTA + ATAC index.
        info:
          config_key: Wta_Only
      - type: string
        name: --extra_star_params
        description: Additional parameters to pass to STAR when building the genome index. Specify exactly like how you would on the command line.
        example: --limitGenomeGenerateRAM 48000 --genomeSAindexNbases 11
        required: false
        info:
          config_key: Extra_STAR_params

resources:
  - type: python_script
    path: script.py
  - path: make_rhap_reference_2.2.1_nodocker.cwl

test_resources:
  - type: bash_script
    path: test.sh
  - path: test_data

requirements:
  commands: [ "cwl-runner" ]

engines:
  - type: docker
    image: bdgenomics/rhapsody:2.2.1
    setup:
      - type: apt
        packages: [procps]
      - type: python
        packages: [cwlref-runner, cwl-runner]
      - type: docker
        run: |
          echo "bdgenomics/rhapsody: 2.2.1" > /var/software_versions.txt
runners:
  - type: executable
  - type: nextflow
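Each argument in the config above carries an `info.config_key`, which suggests the wrapper translates component arguments into a CWL job order (the CWL help text calls this the "Job input json file"). The component's `script.py` is not shown on this page, so the following is only a minimal sketch of how such a translation could look. The function name `build_job_order`, the `CONFIG_KEYS` subset, and the `par` dictionary layout are assumptions made for illustration, not the component's actual code.

```python
import json

# Hypothetical subset of the argument -> config_key mapping from config.vsh.yaml.
CONFIG_KEYS = {
    "genome_fasta": "Genome_fasta",
    "gtf": "Gtf",
    "extra_sequences": "Extra_sequences",
    "filtering_off": "Filtering_off",
    "extra_star_params": "Extra_STAR_params",
}
# Arguments declared with `type: file` and `multiple: true` in the config.
FILE_ARGS = {"genome_fasta", "gtf", "extra_sequences"}


def build_job_order(par):
    """Translate component arguments into a CWL job-order dict, skipping unset values."""
    job = {}
    for arg, key in CONFIG_KEYS.items():
        value = par.get(arg)
        # Unset options, false flags, and empty lists are left out of the job order.
        if value is None or value is False or value == []:
            continue
        if arg in FILE_ARGS:
            # CWL job orders describe files as objects, not bare path strings.
            value = [{"class": "File", "path": str(p)} for p in value]
        job[key] = value
    return job


if __name__ == "__main__":
    # Assumed example values, modelled on the `example:` fields in the config.
    par = {
        "genome_fasta": ["genome_sequence.fa.gz"],
        "gtf": ["transcriptome_annotation.gtf.gz"],
        "extra_sequences": None,
        "filtering_off": True,
        "extra_star_params": None,
    }
    print(json.dumps(build_job_order(par), indent=2))
```

Writing the result to a JSON file and passing it as the `job_order` positional argument of `cwl-runner` is one way such a job order could be consumed.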
66 changes: 66 additions & 0 deletions src/bd_rhapsody/bd_rhapsody_make_reference/help.txt
@@ -0,0 +1,66 @@
```bash
cwl-runner src/bd_rhapsody/bd_rhapsody_make_reference/make_rhap_reference_2.2.1_nodocker.cwl --help
```

usage: src/bd_rhapsody/bd_rhapsody_make_reference/make_rhap_reference_2.2.1_nodocker.cwl
[-h] [--Archive_prefix ARCHIVE_PREFIX]
[--Extra_STAR_params EXTRA_STAR_PARAMS]
[--Extra_sequences EXTRA_SEQUENCES] [--Filtering_off] --Genome_fasta
GENOME_FASTA --Gtf GTF [--Maximum_threads MAXIMUM_THREADS]
[--Mitochondrial_Contigs MITOCHONDRIAL_CONTIGS] [--WTA_Only]
[job_order]

The Reference Files Generator creates an archive containing Genome Index and
Transcriptome annotation files needed for the BD Rhapsody™ Sequencing
Analysis Pipeline. The app takes as input one or more FASTA and GTF files and
produces a compressed archive in the form of a tar.gz file. The archive
contains:\n - STAR index\n - Filtered GTF file

positional arguments:
job_order Job input json file

options:
-h, --help show this help message and exit
--Archive_prefix ARCHIVE_PREFIX
A prefix for naming the compressed archive file
containing the Reference genome index and annotation
files. The default value is constructed based on the
input Reference files.
--Extra_STAR_params EXTRA_STAR_PARAMS
Additional parameters to pass to STAR when building
the genome index. Specify exactly like how you would
on the command line. Example: --limitGenomeGenerateRAM
48000 --genomeSAindexNbases 11
--Extra_sequences EXTRA_SEQUENCES
Additional sequences in FASTA format to use when
building the STAR index. (E.g. phiX genome)
--Filtering_off By default the input Transcript Annotation files are
filtered based on the gene_type/gene_biotype
attribute. Only features having the following
attribute values are kept: - protein_coding -
lncRNA (lincRNA and antisense for Gencode <
v31/M22/Ensembl97) - IG_LV_gene - IG_V_gene -
IG_V_pseudogene - IG_D_gene - IG_J_gene -
IG_J_pseudogene - IG_C_gene - IG_C_pseudogene -
TR_V_gene - TR_V_pseudogene - TR_D_gene - TR_J_gene -
TR_J_pseudogene - TR_C_gene If you have already pre-
filtered the input Annotation files and/or wish to
turn-off the filtering, please set this option to
True.
--Genome_fasta GENOME_FASTA
Reference genome file in FASTA format. The BD
Rhapsody™ Sequencing Analysis Pipeline uses GRCh38
for Human and GRCm39 for Mouse.
--Gtf GTF Transcript annotation files in GTF format. The BD
Rhapsody™ Sequencing Analysis Pipeline uses Gencode
v42 for Human and M31 for Mouse.
--Maximum_threads MAXIMUM_THREADS
The maximum number of threads to use in the pipeline.
By default, all available cores are used.
--Mitochondrial_Contigs MITOCHONDRIAL_CONTIGS
Names of the Mitochondrial contigs in the provided
Reference Genome. Fragments originating from contigs
other than these are identified as 'nuclear fragments'
in the ATACseq analysis pipeline.
--WTA_Only Build a WTA only index, otherwise builds a WTA + ATAC
index.
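The `--Filtering_off` help text describes the default behaviour: annotation features are kept only when their `gene_type` or `gene_biotype` attribute has one of the listed values. The sketch below illustrates that rule on single GTF lines. It is not BD's implementation; the function name `keep_gtf_line` is hypothetical, and accepting `lincRNA` and `antisense` unconditionally (rather than only for older annotation releases) is a simplifying assumption.

```python
import re

# Values listed in the --Filtering_off description.
# Assumption: lincRNA and antisense are always accepted here, although the help
# text ties them to Gencode < v31/M22/Ensembl97.
KEPT_BIOTYPES = {
    "protein_coding", "lncRNA", "lincRNA", "antisense",
    "IG_LV_gene", "IG_V_gene", "IG_V_pseudogene", "IG_D_gene",
    "IG_J_gene", "IG_J_pseudogene", "IG_C_gene", "IG_C_pseudogene",
    "TR_V_gene", "TR_V_pseudogene", "TR_D_gene",
    "TR_J_gene", "TR_J_pseudogene", "TR_C_gene",
}

# Matches `gene_type "value"` or `gene_biotype "value"` in the attribute column.
BIOTYPE_RE = re.compile(r'\b(?:gene_type|gene_biotype) "([^"]+)"')


def keep_gtf_line(line, filtering_off=False):
    """Return True if a GTF line would survive the biotype filter."""
    if filtering_off or line.startswith("#"):
        return True  # filtering disabled, or a header/comment line
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return False  # not a well-formed nine-column GTF record
    match = BIOTYPE_RE.search(fields[8])
    return match is not None and match.group(1) in KEPT_BIOTYPES


if __name__ == "__main__":
    coding = 'chr1\tHAVANA\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding";'
    snrna = 'chr1\tENSEMBL\tgene\t200\t300\t.\t+\t.\tgene_id "G2"; gene_biotype "snRNA";'
    print(keep_gtf_line(coding), keep_gtf_line(snrna))
```

With `filtering_off=True` every line is passed through unchanged, mirroring the case where the annotation has already been pre-filtered.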
115 changes: 115 additions & 0 deletions src/bd_rhapsody/bd_rhapsody_make_reference/make_rhap_reference_2.2.1_nodocker.cwl
@@ -0,0 +1,115 @@
requirements:
  InlineJavascriptRequirement: {}
class: CommandLineTool
label: Reference Files Generator for BD Rhapsody™ Sequencing Analysis Pipeline
cwlVersion: v1.2
doc: >-
  The Reference Files Generator creates an archive containing Genome Index and Transcriptome annotation files needed for the BD Rhapsody™ Sequencing Analysis Pipeline. The app takes as input one or more FASTA and GTF files and produces a compressed archive in the form of a tar.gz file. The archive contains:\n - STAR index\n - Filtered GTF file


baseCommand: run_reference_generator.sh
inputs:
  Genome_fasta:
    type: File[]
    label: Reference Genome
    doc: |-
      Reference genome file in FASTA format. The BD Rhapsody™ Sequencing Analysis Pipeline uses GRCh38 for Human and GRCm39 for Mouse.
    inputBinding:
      prefix: --reference-genome
      shellQuote: false
  Gtf:
    type: File[]
    label: Transcript Annotations
    doc: |-
      Transcript annotation files in GTF format. The BD Rhapsody™ Sequencing Analysis Pipeline uses Gencode v42 for Human and M31 for Mouse.
    inputBinding:
      prefix: --gtf
      shellQuote: false
  Extra_sequences:
    type: File[]?
    label: Extra Sequences
    doc: |-
      Additional sequences in FASTA format to use when building the STAR index. (E.g. phiX genome)
    inputBinding:
      prefix: --extra-sequences
      shellQuote: false
  Mitochondrial_Contigs:
    type: string[]?
    default: ["chrM", "chrMT", "M", "MT"]
    label: Mitochondrial Contig Names
    doc: |-
      Names of the Mitochondrial contigs in the provided Reference Genome. Fragments originating from contigs other than these are identified as 'nuclear fragments' in the ATACseq analysis pipeline.
    inputBinding:
      prefix: --mitochondrial-contigs
      shellQuote: false
  Filtering_off:
    type: boolean?
    label: Turn off filtering
    doc: |-
      By default the input Transcript Annotation files are filtered based on the gene_type/gene_biotype attribute. Only features having the following attribute values are kept:
        - protein_coding
        - lncRNA (lincRNA and antisense for Gencode < v31/M22/Ensembl97)
        - IG_LV_gene
        - IG_V_gene
        - IG_V_pseudogene
        - IG_D_gene
        - IG_J_gene
        - IG_J_pseudogene
        - IG_C_gene
        - IG_C_pseudogene
        - TR_V_gene
        - TR_V_pseudogene
        - TR_D_gene
        - TR_J_gene
        - TR_J_pseudogene
        - TR_C_gene
      If you have already pre-filtered the input Annotation files and/or wish to turn-off the filtering, please set this option to True.
    inputBinding:
      prefix: --filtering-off
      shellQuote: false
  WTA_Only:
    type: boolean?
    label: WTA only index
    doc: Build a WTA only index, otherwise builds a WTA + ATAC index.
    inputBinding:
      prefix: --wta-only-index
      shellQuote: false
  Archive_prefix:
    type: string?
    label: Archive Prefix
    doc: |-
      A prefix for naming the compressed archive file containing the Reference genome index and annotation files. The default value is constructed based on the input Reference files.
    inputBinding:
      prefix: --archive-prefix
      shellQuote: false
  Extra_STAR_params:
    type: string?
    label: Extra STAR Params
    doc: |-
      Additional parameters to pass to STAR when building the genome index. Specify exactly like how you would on the command line.
      Example:
        --limitGenomeGenerateRAM 48000 --genomeSAindexNbases 11
    inputBinding:
      prefix: --extra-star-params
      shellQuote: true

  Maximum_threads:
    type: int?
    label: Maximum Number of Threads
    doc: |-
      The maximum number of threads to use in the pipeline. By default, all available cores are used.
    inputBinding:
      prefix: --maximum-threads
      shellQuote: false

outputs:

  Archive:
    type: File
    doc: |-
      A Compressed archive containing the Reference Genome Index and annotation GTF files. This archive is meant to be used as an input in the BD Rhapsody™ Sequencing Analysis Pipeline.
    id: Reference_Archive
    label: Reference Files Archive
    outputBinding:
      glob: '*.tar.gz'
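CWL input names are case-sensitive, so it can be worth comparing the `config_key` values in `config.vsh.yaml` against the input names declared in this CWL file. The sketch below does that with both lists copied by hand from this commit; the helper name `find_case_mismatches` is hypothetical. Whether a reported difference matters in practice depends on how `script.py` uses the keys, which is not visible on this page.

```python
# config_key values as they appear in config.vsh.yaml.
CONFIG_KEYS = [
    "Genome_fasta", "Gtf", "Extra_sequences", "Mitochondrial_contigs",
    "Filtering_off", "Wta_Only", "Extra_STAR_params",
]
# Input names as they appear in make_rhap_reference_2.2.1_nodocker.cwl.
CWL_INPUTS = [
    "Genome_fasta", "Gtf", "Extra_sequences", "Mitochondrial_Contigs",
    "Filtering_off", "WTA_Only", "Archive_prefix", "Extra_STAR_params",
    "Maximum_threads",
]


def find_case_mismatches(config_keys, cwl_inputs):
    """Return (config_key, cwl_input) pairs that differ only in letter case."""
    by_lower = {name.lower(): name for name in cwl_inputs}
    mismatches = []
    for key in config_keys:
        match = by_lower.get(key.lower())
        if match is not None and match != key:
            mismatches.append((key, match))
    return mismatches


if __name__ == "__main__":
    for config_key, cwl_input in find_case_mismatches(CONFIG_KEYS, CWL_INPUTS):
        print(f"{config_key} (config) vs {cwl_input} (CWL)")
```

For the lists above, the check reports `Mitochondrial_contigs` against `Mitochondrial_Contigs` and `Wta_Only` against `WTA_Only`.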
