bd_rhapsody_make_reference: Create a reference for the BD Rhapsody pipeline (viash-hub#75)

* `bd_rhapsody/bd_rhapsody_make_reference`: Create a reference for the BD Rhapsody pipeline
* add missing metadata
* remove unicode
* trigger
* process comments
* add authors
* Apply suggestions from code review

Co-authored-by: Dorien <[email protected]>

---------

Co-authored-by: Dorien <[email protected]>
emmarousseau · Jul 17, 2024 · 7d99065 · 7d99065
1 parent f71ed87
Showing 11 changed files with 660 additions and 0 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,11 @@
 # biobox x.x.x
 
+## NEW FEATURES
+
+* `bd_rhapsody`:
+
+  - `bd_rhapsody/bd_rhapsody_make_reference`: Create a reference for the BD Rhapsody pipeline (PR #75).
+
 ## BUG FIXES
 
 * `pear`: fix component not exiting with the correct exitcode when PEAR fails.

diff --git a/src/_authors/robrecht_cannoodt.yaml b/src/_authors/robrecht_cannoodt.yaml
@@ -0,0 +1,14 @@
+name: Robrecht Cannoodt
+info:
+  links:
+    email: [email protected]
+    github: rcannood
+    orcid: "0000-0003-3641-729X"
+    linkedin: robrechtcannoodt
+  organizations:
+    - name: Data Intuitive
+      href: https://www.data-intuitive.com
+      role: Data Science Engineer
+    - name: Open Problems
+      href: https://openproblems.bio
+      role: Core Member
diff --git a/src/_authors/weiwei_schultz.yaml b/src/_authors/weiwei_schultz.yaml
@@ -0,0 +1,5 @@
+name: Weiwei Schultz
+info:
+  organizations:
+    - name: Janssen R&D US
+      role: Associate Director Data Sciences
diff --git a/src/bd_rhapsody/bd_rhapsody_make_reference/config.vsh.yaml b/src/bd_rhapsody/bd_rhapsody_make_reference/config.vsh.yaml
@@ -0,0 +1,143 @@
+name: bd_rhapsody_make_reference
+namespace: bd_rhapsody
+description: |
+  The Reference Files Generator creates an archive containing Genome Index
+  and Transcriptome annotation files needed for the BD Rhapsody Sequencing
+  Analysis Pipeline. The app takes as input one or more FASTA and GTF files
+  and produces a compressed archive in the form of a tar.gz file. The 
+  archive contains:
+  
+  - STAR index
+  - Filtered GTF file
+keywords: [genome, reference, index, align]
+links:
+  repository: https://bitbucket.org/CRSwDev/cwl/src/master/v2.2.1/Extra_Utilities/
+  documentation: https://bd-rhapsody-bioinfo-docs.genomics.bd.com/resources/extra_utilities.html#make-rhapsody-reference
+license: Unknown
+authors:
+  - __merge__: /src/_authors/robrecht_cannoodt.yaml
+    roles: [ author, maintainer ]
+  - __merge__: /src/_authors/weiwei_schultz.yaml
+    roles: [ contributor ]
+
+argument_groups:
+  - name: Inputs
+    arguments:
+      - type: file
+        name: --genome_fasta
+        required: true
+        description: Reference genome file in FASTA or FASTA.GZ format. The BD Rhapsody Sequencing Analysis Pipeline uses GRCh38 for Human and GRCm39 for Mouse.
+        example: genome_sequence.fa.gz
+        multiple: true
+        info:
+          config_key: Genome_fasta
+      - type: file
+        name: --gtf
+        required: true
+        description: |
+          File path to the transcript annotation files in GTF or GTF.GZ format. The Sequence Analysis Pipeline requires the 'gene_name' or 
+          'gene_id' attribute to be set on each gene and exon feature. Gene and exon feature lines must have the same attribute, and exons
+          must have a corresponding gene with the same value. For TCR/BCR assays, the TCR or BCR gene segments must have the 'gene_type' or
+          'gene_biotype' attribute set, and the value should begin with 'TR' or 'IG', respectively.
+        example: transcriptome_annotation.gtf.gz
+        multiple: true
+        info:
+          config_key: Gtf
+      - type: file
+        name: --extra_sequences
+        description: |
+          File path to additional sequences in FASTA format to use when building the STAR index (e.g. transgenes or CRISPR guide barcodes).
+          GTF lines for these sequences will be automatically generated and combined with the main GTF.
+        required: false
+        multiple: true
+        info:
+          config_key: Extra_sequences
+  - name: Outputs
+    arguments:
+      - type: file
+        name: --reference_archive
+        direction: output
+        required: true
+        description: |
+          A compressed archive containing the Reference Genome Index and annotation GTF files. This archive is meant to be used as an
+          input in the BD Rhapsody Sequencing Analysis Pipeline.
+        example: star_index.tar.gz
+  - name: Arguments
+    arguments:
+      - type: string
+        name: --mitochondrial_contigs
+        description: |
+          Names of the Mitochondrial contigs in the provided Reference Genome. Fragments originating from contigs other than these are
+          identified as 'nuclear fragments' in the ATACseq analysis pipeline.
+        required: false
+        multiple: true
+        default: [chrM, chrMT, M, MT]
+        info:
+          config_key: Mitochondrial_Contigs
+      - type: boolean_true
+        name: --filtering_off
+        description: |
+          By default the input Transcript Annotation files are filtered based on the gene_type/gene_biotype attribute. Only features 
+          having the following attribute values are kept:
+
+            - protein_coding
+            - lncRNA (lincRNA and antisense for Gencode < v31/M22/Ensembl97)
+            - IG_LV_gene
+            - IG_V_gene
+            - IG_V_pseudogene
+            - IG_D_gene
+            - IG_J_gene
+            - IG_J_pseudogene
+            - IG_C_gene
+            - IG_C_pseudogene
+            - TR_V_gene
+            - TR_V_pseudogene
+            - TR_D_gene
+            - TR_J_gene
+            - TR_J_pseudogene
+            - TR_C_gene
+
+          If you have already pre-filtered the input annotation files and/or wish to turn off the filtering, set this flag.
+        info:
+          config_key: Filtering_off
+      - type: boolean_true
+        name: --wta_only_index
+        description: Build a WTA-only index. By default, a WTA + ATAC index is built.
+        info:
+          config_key: WTA_Only
+      - type: string
+        name: --extra_star_params
+        description: Additional parameters to pass to STAR when building the genome index. Specify them exactly as you would on the command line.
+        example: --limitGenomeGenerateRAM 48000 --genomeSAindexNbases 11
+        required: false
+        info:
+          config_key: Extra_STAR_params
+
+resources:
+  - type: python_script
+    path: script.py
+  - path: make_rhap_reference_2.2.1_nodocker.cwl
+
+test_resources:
+  - type: bash_script
+    path: test.sh
+  - path: test_data
+
+requirements:
+  commands: [ "cwl-runner" ]
+
+engines:
+  - type: docker
+    image: bdgenomics/rhapsody:2.2.1
+    setup:
+      - type: apt
+        packages: [procps]
+      - type: python
+        packages: [cwlref-runner, cwl-runner]
+      - type: docker
+        run: |
+          echo "bdgenomics/rhapsody: 2.2.1" > /var/software_versions.txt
+
+runners:
+  - type: executable
+  - type: nextflow
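
The `info.config_key` fields in the config above pair each component argument with an input of the CWL tool. As a rough illustration of how such a mapping could be turned into a job order for `cwl-runner`, here is a minimal sketch. It is not the `script.py` shipped in this commit: the `CONFIG_KEYS` table, the `FILE_ARGS` set, and the `build_job_order` helper are hypothetical names, and the keys follow the input names declared in `make_rhap_reference_2.2.1_nodocker.cwl`.

```python
import json
from typing import Any, Dict

# Hypothetical mapping from component argument names to CWL input names
# (assumption: taken from the input names in the CWL file of this commit).
CONFIG_KEYS = {
    "genome_fasta": "Genome_fasta",
    "gtf": "Gtf",
    "extra_sequences": "Extra_sequences",
    "mitochondrial_contigs": "Mitochondrial_Contigs",
    "filtering_off": "Filtering_off",
    "wta_only_index": "WTA_Only",
    "extra_star_params": "Extra_STAR_params",
}

# Arguments declared with `type: file` and `multiple: true` in the config.
FILE_ARGS = {"genome_fasta", "gtf", "extra_sequences"}


def build_job_order(par: Dict[str, Any]) -> Dict[str, Any]:
    """Translate component parameters into a CWL job-order dictionary."""
    job: Dict[str, Any] = {}
    for arg, key in CONFIG_KEYS.items():
        value = par.get(arg)
        # Skip unset options, empty lists, and flags left at False,
        # so that the defaults of the CWL tool apply.
        if value is None or value is False or value == []:
            continue
        if arg in FILE_ARGS:
            # CWL job orders describe files as objects with class "File".
            value = [{"class": "File", "path": str(path)} for path in value]
        job[key] = value
    return job


if __name__ == "__main__":
    example = build_job_order(
        {
            "genome_fasta": ["genome_sequence.fa.gz"],
            "gtf": ["transcriptome_annotation.gtf.gz"],
            "wta_only_index": True,
        }
    )
    # The help text describes the positional job_order as a JSON file,
    # so the dictionary can be serialised with json.dumps.
    print(json.dumps(example, indent=2))
```

Leaving unset options out of the job order, rather than writing explicit nulls, keeps the CWL defaults (for example the mitochondrial contig names) in effect.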
diff --git a/src/bd_rhapsody/bd_rhapsody_make_reference/help.txt b/src/bd_rhapsody/bd_rhapsody_make_reference/help.txt
@@ -0,0 +1,66 @@
+```bash
+cwl-runner src/bd_rhapsody/bd_rhapsody_make_reference/make_rhap_reference_2.2.1_nodocker.cwl --help
+```
+
+usage: src/bd_rhapsody/bd_rhapsody_make_reference/make_rhap_reference_2.2.1_nodocker.cwl
+       [-h] [--Archive_prefix ARCHIVE_PREFIX]
+       [--Extra_STAR_params EXTRA_STAR_PARAMS]
+       [--Extra_sequences EXTRA_SEQUENCES] [--Filtering_off] --Genome_fasta
+       GENOME_FASTA --Gtf GTF [--Maximum_threads MAXIMUM_THREADS]
+       [--Mitochondrial_Contigs MITOCHONDRIAL_CONTIGS] [--WTA_Only]
+       [job_order]
+
+The Reference Files Generator creates an archive containing Genome Index and
+Transcriptome annotation files needed for the BD Rhapsody™ Sequencing
+Analysis Pipeline. The app takes as input one or more FASTA and GTF files and
+produces a compressed archive in the form of a tar.gz file. The archive
+contains:\n - STAR index\n - Filtered GTF file
+
+positional arguments:
+  job_order             Job input json file
+
+options:
+  -h, --help            show this help message and exit
+  --Archive_prefix ARCHIVE_PREFIX
+                        A prefix for naming the compressed archive file
+                        containing the Reference genome index and annotation
+                        files. The default value is constructed based on the
+                        input Reference files.
+  --Extra_STAR_params EXTRA_STAR_PARAMS
+                        Additional parameters to pass to STAR when building
+                        the genome index. Specify exactly like how you would
+                        on the command line. Example: --limitGenomeGenerateRAM
+                        48000 --genomeSAindexNbases 11
+  --Extra_sequences EXTRA_SEQUENCES
+                        Additional sequences in FASTA format to use when
+                        building the STAR index. (E.g. phiX genome)
+  --Filtering_off       By default the input Transcript Annotation files are
+                        filtered based on the gene_type/gene_biotype
+                        attribute. Only features having the following
+                        attribute values are kept: - protein_coding -
+                        lncRNA (lincRNA and antisense for Gencode <
+                        v31/M22/Ensembl97) - IG_LV_gene - IG_V_gene -
+                        IG_V_pseudogene - IG_D_gene - IG_J_gene -
+                        IG_J_pseudogene - IG_C_gene - IG_C_pseudogene -
+                        TR_V_gene - TR_V_pseudogene - TR_D_gene - TR_J_gene -
+                        TR_J_pseudogene - TR_C_gene If you have already pre-
+                        filtered the input Annotation files and/or wish to
+                        turn-off the filtering, please set this option to
+                        True.
+  --Genome_fasta GENOME_FASTA
+                        Reference genome file in FASTA format. The BD
+                        Rhapsody™ Sequencing Analysis Pipeline uses GRCh38
+                        for Human and GRCm39 for Mouse.
+  --Gtf GTF             Transcript annotation files in GTF format. The BD
+                        Rhapsody™ Sequencing Analysis Pipeline uses Gencode
+                        v42 for Human and M31 for Mouse.
+  --Maximum_threads MAXIMUM_THREADS
+                        The maximum number of threads to use in the pipeline.
+                        By default, all available cores are used.
+  --Mitochondrial_Contigs MITOCHONDRIAL_CONTIGS
+                        Names of the Mitochondrial contigs in the provided
+                        Reference Genome. Fragments originating from contigs
+                        other than these are identified as 'nuclear fragments'
+                        in the ATACseq analysis pipeline.
+  --WTA_Only            Build a WTA only index, otherwise builds a WTA + ATAC
+                        index.
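
The usage line in the help text shows that the tool accepts either individual options or a single positional `job_order` file. As a hedged sketch of the second form, the snippet below assembles the argument list for such a call. The `build_cwl_command` helper and the example paths are hypothetical, and the `--outdir` option assumes that `cwl-runner` resolves to `cwltool`, as suggested by the `cwlref-runner` package installed in the Docker setup.

```python
import shlex
from typing import List


def build_cwl_command(cwl_file: str, job_order: str, outdir: str) -> List[str]:
    """Assemble a cwl-runner call that reads its inputs from a job-order file."""
    # Runner options go before the CWL document; the job order comes last.
    return ["cwl-runner", "--outdir", outdir, cwl_file, job_order]


if __name__ == "__main__":
    cmd = build_cwl_command(
        "make_rhap_reference_2.2.1_nodocker.cwl",  # CWL file from this commit
        "job_order.json",                          # hypothetical job-order path
        "output",                                  # hypothetical output directory
    )
    # Print a shell-quoted version for logging. To execute it, pass the list
    # to subprocess.run(cmd, check=True) so that no shell quoting is needed.
    print(shlex.join(cmd))
```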
diff --git a/src/bd_rhapsody/bd_rhapsody_make_reference/make_rhap_reference_2.2.1_nodocker.cwl b/src/bd_rhapsody/bd_rhapsody_make_reference/make_rhap_reference_2.2.1_nodocker.cwl
@@ -0,0 +1,115 @@
+requirements:
+  InlineJavascriptRequirement: {}
+class: CommandLineTool
+label: Reference Files Generator for BD Rhapsody™ Sequencing Analysis Pipeline
+cwlVersion: v1.2
+doc: >- 
+    The Reference Files Generator creates an archive containing Genome Index and Transcriptome annotation files needed for the BD Rhapsody™ Sequencing Analysis Pipeline. The app takes as input one or more FASTA and GTF files and produces a compressed archive in the form of a tar.gz file. The archive contains:\n  - STAR index\n  - Filtered GTF file
+
+
+baseCommand: run_reference_generator.sh 
+inputs: 
+    Genome_fasta:
+        type: File[]
+        label: Reference Genome
+        doc: |-
+            Reference genome file in FASTA format. The BD Rhapsody™ Sequencing Analysis Pipeline uses GRCh38 for Human and GRCm39 for Mouse.
+        inputBinding:
+            prefix: --reference-genome
+            shellQuote: false
+    Gtf:
+        type: File[]
+        label: Transcript Annotations
+        doc: |-
+            Transcript annotation files in GTF format. The BD Rhapsody™ Sequencing Analysis Pipeline uses Gencode v42 for Human and M31 for Mouse.
+        inputBinding:
+            prefix: --gtf
+            shellQuote: false
+    Extra_sequences:
+        type: File[]?
+        label: Extra Sequences
+        doc: |-
+            Additional sequences in FASTA format to use when building the STAR index. (E.g. phiX genome)
+        inputBinding:
+            prefix: --extra-sequences
+            shellQuote: false
+    Mitochondrial_Contigs:
+        type: string[]?
+        default: ["chrM", "chrMT", "M", "MT"]
+        label: Mitochondrial Contig Names
+        doc: |-
+            Names of the Mitochondrial contigs in the provided Reference Genome. Fragments originating from contigs other than these are identified as 'nuclear fragments' in the ATACseq analysis pipeline.
+        inputBinding:
+            prefix: --mitochondrial-contigs
+            shellQuote: false
+    Filtering_off:
+        type: boolean?
+        label: Turn off filtering
+        doc: |-
+            By default the input Transcript Annotation files are filtered based on the gene_type/gene_biotype attribute. Only features having the following attribute values are kept:
+            - protein_coding
+            - lncRNA (lincRNA and antisense for Gencode < v31/M22/Ensembl97)
+            - IG_LV_gene
+            - IG_V_gene
+            - IG_V_pseudogene
+            - IG_D_gene
+            - IG_J_gene
+            - IG_J_pseudogene
+            - IG_C_gene
+            - IG_C_pseudogene
+            - TR_V_gene
+            - TR_V_pseudogene
+            - TR_D_gene
+            - TR_J_gene
+            - TR_J_pseudogene
+            - TR_C_gene
+            If you have already pre-filtered the input Annotation files and/or wish to turn-off the filtering, please set this option to True.
+        inputBinding: 
+            prefix: --filtering-off
+            shellQuote: false
+    WTA_Only:
+        type: boolean?
+        label: WTA only index
+        doc: Build a WTA only index, otherwise builds a WTA + ATAC index.
+        inputBinding:
+            prefix: --wta-only-index
+            shellQuote: false
+    Archive_prefix:
+        type: string?
+        label: Archive Prefix
+        doc: |-
+            A prefix for naming the compressed archive file containing the Reference genome index and annotation files. The default value is constructed based on the input Reference files.
+        inputBinding:
+            prefix: --archive-prefix
+            shellQuote: false
+    Extra_STAR_params:
+        type: string?
+        label: Extra STAR Params
+        doc: |-
+            Additional parameters to pass to STAR when building the genome index. Specify exactly like how you would on the command line.
+            Example:
+              --limitGenomeGenerateRAM 48000 --genomeSAindexNbases 11
+        inputBinding:
+            prefix: --extra-star-params 
+            shellQuote: true
+
+    Maximum_threads:
+        type: int?
+        label: Maximum Number of Threads
+        doc: |-
+            The maximum number of threads to use in the pipeline. By default, all available cores are used.
+        inputBinding:
+            prefix: --maximum-threads
+            shellQuote: false
+
+outputs:
+
+    Archive:
+        type: File
+        doc: |- 
+            A Compressed archive containing the Reference Genome Index and annotation GTF files. This archive is meant to be used as an input in the BD Rhapsody™ Sequencing Analysis Pipeline.
+        id: Reference_Archive
+        label: Reference Files Archive
+        outputBinding:
+            glob: '*.tar.gz'
+
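
The `outputBinding` above collects the result with the glob `*.tar.gz`, and the archive name depends on `Archive_prefix` or on the input file names. A wrapper that has to deliver the archive at a fixed path such as `--reference_archive` therefore needs to locate it first. The following is a minimal sketch of that step under the assumption that the tool leaves exactly one archive in its output directory. The `collect_archive` helper is a hypothetical name, not part of this commit.

```python
import glob
import os
import shutil


def collect_archive(work_dir: str, destination: str) -> str:
    """Move the single *.tar.gz found in work_dir to destination."""
    matches = sorted(glob.glob(os.path.join(work_dir, "*.tar.gz")))
    # Refuse to guess when the directory holds no archive or several.
    if len(matches) != 1:
        raise RuntimeError(
            f"expected exactly one *.tar.gz in {work_dir}, found {len(matches)}"
        )
    shutil.move(matches[0], destination)
    return destination
```

Failing loudly on zero or multiple matches avoids silently passing a stale or wrong archive to the downstream pipeline.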