improve annotation in pipeline #125

slsevilla · 2022-07-26T20:45:01Z

Annotation calling is currently one of the largest bottlenecks of the pipeline. It is split into several rules and accompanying scripts.

Rules

peak_Transcripts
peak_ExonIntron
peak_RMSK
peak_junctions
peak_process
project_annotations

Scripts

The general workflow is to run each annotation type separately before merging the results into one RMD file. This takes a significant amount of time and generates an individual job per sample per rule, which uses more Biowulf resources than may be necessary.

Goals for the re-write

Speed up performance
Reduce the number of input/output files required for execution
Transfer all file creation from R files to snakemake
Reduce the number of rules required without sacrificing speed considerably

slsevilla · 2022-07-29T15:52:44Z

Rule ExonIntron

Project info

Three projects were created from previous runs to complete the benchmarking analysis:

project1: mESC_clip_4_v2.0
project2: 8-09-21-HaCaT_fCLIP_v2.0
project3: mES_fclip_1_YL_011622_v2.0

File info

All projects are set up with the following structure:

├── proj_number
│   ├── exp_output
│   └── input

Required inputs for one sample

├── input
│   └── 04_annotation
│       ├── 01_project
│       │   ├── 7SKRNA_Repeatmasker.bed
│       │   ├── annotations.txt
│       │   ├── DNA_Repeatmasker.bed
│       │   ├── lincRNA_Gencode.bed
│       │   ├── LINE\ SINE_Repeatmasker.bed
│       │   ├── lncRNA_Gencode.bed
│       │   ├── lncRNA_Gencode.txt
│       │   ├── Low_complexity_Repeatmasker.bed
│       │   ├── LTR_Repeatmasker.bed
│       │   ├── miRNA_Gencode.bed
│       │   ├── ncRNA_annotations.txt
│       │   ├── Other_Repeatmasker.bed
│       │   ├── ref_gencode.txt
│       │   ├── rRNA_Custom.bed
│       │   ├── rRNA_Gencode.bed
│       │   ├── rRNA_Repeatmasker.bed
│       │   ├── Satellite_Repeatmasker.bed
│       │   ├── scRNA_Repeatmasker.bed
│       │   ├── Simple_repeat_Repeatmasker.bed
│       │   ├── sncRNA_Custom.bed
│       │   ├── snoRNA_Gencode.bed
│       │   ├── snRNA_Gencode.bed
│       │   ├── srpRNA_Repeatmasker.bed
│       │   ├── tRNA_Custom.bed
│       │   ├── Unknown_Repeatmasker.bed
│       │   └── yRNA_Repeatmasker.bed
│       └── 02_peaks
│           ├── Control1hr_Clip_ALLreadPeaks_AllRegions.txt
│           └── Control7hr_Clip_ALLreadPeaks_AllRegions.txt
└── config
    └── annotation_config.txt

Expected outputs for one sample (Control1hr)

├── exp_output
│   └── 04_annotation
│       └── 02_peaks
│           ├── Control1hr_Clip_ALLreadPeaks_AllRegions_IntronExon_OppoStrand.txt
│           ├── Control1hr_Clip_ALLreadPeaks_AllRegions_IntronExon_SameStrand.txt
│           ├── Control7hr_Clip_ALLreadPeaks_AllRegions_IntronExon_OppoStrand.txt
│           └── Control7hr_Clip_ALLreadPeaks_AllRegions_IntronExon_SameStrand.txt
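
The output names above follow a regular pattern, so they can be generated per sample and strand, much like a Snakemake `expand()` call would. A minimal Python sketch follows. The helper name `exon_intron_outputs` and the file name template are assumptions inferred from the listing, not taken from the pipeline source.

```python
from itertools import product

# Strand labels taken from the file names in the listing.
STRANDS = ("SameStrand", "OppoStrand")


def exon_intron_outputs(sample_ids, peak_type="ALL", strands=STRANDS):
    """Return the expected IntronExon file names for every sample/strand pair."""
    # Template inferred from the listed outputs (an assumption, not pipeline code).
    template = "{sample}_{peak_type}readPeaks_AllRegions_IntronExon_{strand}.txt"
    return [
        template.format(sample=sample, peak_type=peak_type, strand=strand)
        for sample, strand in product(sample_ids, strands)
    ]


if __name__ == "__main__":
    for name in exon_intron_outputs(["Control1hr_Clip", "Control7hr_Clip"]):
        print(name)
```

Generating the names in one place would let a single rule declare all of its outputs, which fits the goal of moving file creation out of the R scripts.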

Project input/output files are located here:

/data/RBL_NCI/Wolin/Sam/annotation_testing
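
One way to use these directories for benchmarking is to check that a re-written rule reproduces the expected files. The sketch below compares `exp_output` against a `testing_output` directory byte for byte. The function name `compare_outputs` is hypothetical, and the approach assumes the R script writes deterministic output (same row order and formatting); if it does not, a table-aware comparison would be needed instead.

```python
import filecmp
from pathlib import Path


def compare_outputs(expected_dir, testing_dir, pattern="*_IntronExon_*.txt"):
    """Compare each expected file with the same-named file in testing_dir.

    Returns a dict mapping file name to "match", "differ", or "missing".
    """
    results = {}
    for expected in sorted(Path(expected_dir).glob(pattern)):
        candidate = Path(testing_dir) / expected.name
        if not candidate.is_file():
            results[expected.name] = "missing"
        elif filecmp.cmp(expected, candidate, shallow=False):
            # shallow=False forces a content comparison, not just a stat() check.
            results[expected.name] = "match"
        else:
            results[expected.name] = "differ"
    return results


if __name__ == "__main__":
    # proj_1 and testing_output are the names used in the example commands.
    root = Path("/data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1")
    sub = Path("04_annotation") / "02_peaks"
    report = compare_outputs(root / "exp_output" / sub, root / "testing_output" / sub)
    for name, status in report.items():
        print(f"{status}\t{name}")
```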

Script calling

Script location:

/data/RBL_NCI/Pipelines/iCLIP/v2.0/workflow/scripts/

Example R script (SameStrand, proj_1):

Rscript /data/RBL_NCI/Pipelines/iCLIP/v2.0/workflow/scripts/05_Anno_ExonIntron.R \
--rscript /data/RBL_NCI/Pipelines/iCLIP/v2.0/workflow/scripts/05_peak_annotation_functions.R \
--peak_type ALL \
--anno_anchor max_total \
--read_depth 3 \
--sample_id Control1hr_Clip \
--ref_species mm10 \
--anno_dir /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/input/04_annotation/01_project/ \
--reftable_path /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/config/annotation_config.txt \
--gencode_path /data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10/Gencode_VM23/fromGencode/gencode.vM23.annotation.gtf.txt \
--intron_path /data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10/Gencode_VM23/fromUCSC/KnownGene/KnownGene_GRCm38_introns.bed \
--rmsk_path /data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10/repeatmasker/rmsk_GRCm38.txt \
--tmp_dir /data/RBL_NCI/Wolin/Sam/annotation_testing/tmp \
--out_dir /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/testing_output/04_annotation/02_peaks/ \
--out_file /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/testing_output/04_annotation/02_peaks/Control1hr_Clip_ALLreadPeaks_AllRegions_IntronExon_SameStrand.txt \
--anno_strand "SameStrand"

Example R script (OppoStrand, proj_1):

Rscript /data/RBL_NCI/Pipelines/iCLIP/v2.0/workflow/scripts/05_Anno_ExonIntron.R \
--rscript /data/RBL_NCI/Pipelines/iCLIP/v2.0/workflow/scripts/05_peak_annotation_functions.R \
--peak_type ALL \
--anno_anchor max_total \
--read_depth 3 \
--sample_id Control1hr_Clip \
--ref_species mm10 \
--anno_dir /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/input/04_annotation/01_project/ \
--reftable_path /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/config/annotation_config.txt \
--gencode_path /data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10/Gencode_VM23/fromGencode/gencode.vM23.annotation.gtf.txt \
--intron_path /data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10/Gencode_VM23/fromUCSC/KnownGene/KnownGene_GRCm38_introns.bed \
--rmsk_path /data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10/repeatmasker/rmsk_GRCm38.txt \
--tmp_dir /data/RBL_NCI/Wolin/Sam/annotation_testing/tmp \
--out_dir /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/testing_output/04_annotation/02_peaks/ \
--out_file /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/testing_output/04_annotation/02_peaks/Control1hr_Clip_ALLreadPeaks_AllRegions_IntronExon_OppoStrand.txt \
--anno_strand "OppoStrand"
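
The two calls above differ only in `--out_file` and `--anno_strand`, so they can be built from one template. The following Python sketch does that. The flags and paths are copied from the commands above; the helper name `exon_intron_command` and its default arguments are illustrative assumptions.

```python
import shlex
from pathlib import Path

# Locations copied from the example commands.
SCRIPTS = Path("/data/RBL_NCI/Pipelines/iCLIP/v2.0/workflow/scripts")
REF = Path("/data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10")
TESTING = Path("/data/RBL_NCI/Wolin/Sam/annotation_testing")


def exon_intron_command(strand, sample_id="Control1hr_Clip", project="proj_1"):
    """Build the Rscript argument list for one strand (hypothetical helper)."""
    proj = TESTING / project
    out_dir = proj / "testing_output" / "04_annotation" / "02_peaks"
    out_file = out_dir / f"{sample_id}_ALLreadPeaks_AllRegions_IntronExon_{strand}.txt"
    anno_dir = proj / "input" / "04_annotation" / "01_project"
    gencode = REF / "Gencode_VM23" / "fromGencode" / "gencode.vM23.annotation.gtf.txt"
    introns = REF / "Gencode_VM23" / "fromUCSC" / "KnownGene" / "KnownGene_GRCm38_introns.bed"
    return [
        "Rscript", str(SCRIPTS / "05_Anno_ExonIntron.R"),
        "--rscript", str(SCRIPTS / "05_peak_annotation_functions.R"),
        "--peak_type", "ALL",
        "--anno_anchor", "max_total",
        "--read_depth", "3",
        "--sample_id", sample_id,
        "--ref_species", "mm10",
        "--anno_dir", f"{anno_dir}/",  # trailing slash kept as in the original call
        "--reftable_path", str(proj / "config" / "annotation_config.txt"),
        "--gencode_path", str(gencode),
        "--intron_path", str(introns),
        "--rmsk_path", str(REF / "repeatmasker" / "rmsk_GRCm38.txt"),
        "--tmp_dir", str(TESTING / "tmp"),
        "--out_dir", f"{out_dir}/",
        "--out_file", str(out_file),
        "--anno_strand", strand,
    ]


if __name__ == "__main__":
    # Print the commands instead of running them; nothing is executed here.
    for strand in ("SameStrand", "OppoStrand"):
        print(shlex.join(exon_intron_command(strand)))
```

To actually run a command, the returned list could be passed to `subprocess.run(cmd, check=True)` on a node where `Rscript` and the reference files are available.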

slsevilla · 2022-07-29T16:42:21Z

Create DAG of pipeline v2.0 for review

dag.pdf

wilfriedguiblet · 2022-08-11T15:53:20Z

Improved IE_calling speed in 05_peak_annotation_functions.R.

slsevilla added the enhancement label Jul 26, 2022

slsevilla assigned wilfriedguiblet Jul 26, 2022
