improve annotation in pipeline #125

Open
slsevilla opened this issue Jul 26, 2022 · 3 comments
Labels
enhancement New feature or request

@slsevilla

Annotation calling is currently one of the largest bottlenecks of the pipeline. It is split into several rules and accompanying scripts.

Rules

  • peak_Transcripts
  • peak_ExonIntron
  • peak_RMSK
  • peak_junctions
  • peak_process
  • project_annotations

Scripts

The general workflow is to run each annotation type separately before merging the results into one RMD file. This takes a significant amount of time and generates an individual job per sample per rule, which also uses more Biowulf resources than may be necessary.
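To make the per-sample, per-rule cost concrete, here is a minimal arithmetic sketch. It assumes every annotation rule submits exactly one job per sample, which may not hold for every rule, and the sample and rule counts are made up for illustration.

```shell
#!/bin/sh
# Rough job-count estimate.
# ASSUMPTION: each per-sample rule submits one job per sample; the real
# pipeline may group or localize some rules.
count_jobs() {
    # $1 = number of samples, $2 = number of per-sample rules
    echo $(( $1 * $2 ))
}

# Hypothetical project: 12 samples, 5 per-sample annotation rules.
jobs_split=$(count_jobs 12 5)
# Same samples with the annotation rules merged into a single rule.
jobs_merged=$(count_jobs 12 1)
echo "split rules: $jobs_split jobs; merged rule: $jobs_merged jobs"
```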

Goals for the re-write

  1. Speed up performance
  2. Reduce the number of input/output files required for execution
  3. Move all file creation from the R scripts into Snakemake
  4. Reduce the number of rules required without sacrificing speed considerably

slsevilla added the enhancement label Jul 26, 2022

@slsevilla

slsevilla commented Jul 29, 2022

Rule ExonIntron

Project info

Three projects were created from previous runs for the benchmarking analysis:

  • project1: mESC_clip_4_v2.0
  • project2: 8-09-21-HaCaT_fCLIP_v2.0
  • project3: mES_fclip_1_YL_011622_v2.0
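For the benchmarking itself, a small timing wrapper along these lines could record how long each script call takes per project. This is only a sketch: `time_cmd` is a hypothetical helper, not part of the pipeline, and it relies on `date +%s`, which GNU and BSD `date` both provide.

```shell
#!/bin/sh
# time_cmd COMMAND [ARGS...]
# Runs the command, then prints wall-clock seconds and its exit status.
# Hypothetical helper for benchmarking; not part of the pipeline.
time_cmd() {
    start=$(date +%s)
    "$@" && rc=0 || rc=$?
    end=$(date +%s)
    echo "elapsed_s=$(( end - start )) rc=$rc"
}

# Demonstration with a stand-in command; a real run would pass the
# Rscript call for one project instead of `sleep 1`.
timing=$(time_cmd sleep 1)
echo "$timing"
```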

File info

  • All projects are set up with the following structure:
├── proj_number
│   ├── exp_output
│   └── input
  • Required inputs for one sample
├── input
│   └── 04_annotation
│       ├── 01_project
│       │   ├── 7SKRNA_Repeatmasker.bed
│       │   ├── annotations.txt
│       │   ├── DNA_Repeatmasker.bed
│       │   ├── lincRNA_Gencode.bed
│       │   ├── LINE\ SINE_Repeatmasker.bed
│       │   ├── lncRNA_Gencode.bed
│       │   ├── lncRNA_Gencode.txt
│       │   ├── Low_complexity_Repeatmasker.bed
│       │   ├── LTR_Repeatmasker.bed
│       │   ├── miRNA_Gencode.bed
│       │   ├── ncRNA_annotations.txt
│       │   ├── Other_Repeatmasker.bed
│       │   ├── ref_gencode.txt
│       │   ├── rRNA_Custom.bed
│       │   ├── rRNA_Gencode.bed
│       │   ├── rRNA_Repeatmasker.bed
│       │   ├── Satellite_Repeatmasker.bed
│       │   ├── scRNA_Repeatmasker.bed
│       │   ├── Simple_repeat_Repeatmasker.bed
│       │   ├── sncRNA_Custom.bed
│       │   ├── snoRNA_Gencode.bed
│       │   ├── snRNA_Gencode.bed
│       │   ├── srpRNA_Repeatmasker.bed
│       │   ├── tRNA_Custom.bed
│       │   ├── Unknown_Repeatmasker.bed
│       │   └── yRNA_Repeatmasker.bed
│       └── 02_peaks
│           ├── Control1hr_Clip_ALLreadPeaks_AllRegions.txt
│           └── Control7hr_Clip_ALLreadPeaks_AllRegions.txt
└── config
    └── annotation_config.txt

  • Expected outputs for one sample (Control1hr)
├── exp_output
│   └── 04_annotation
│       └── 02_peaks
│           ├── Control1hr_Clip_ALLreadPeaks_AllRegions_IntronExon_OppoStrand.txt
│           ├── Control1hr_Clip_ALLreadPeaks_AllRegions_IntronExon_SameStrand.txt
│           ├── Control7hr_Clip_ALLreadPeaks_AllRegions_IntronExon_OppoStrand.txt
│           └── Control7hr_Clip_ALLreadPeaks_AllRegions_IntronExon_SameStrand.txt

Project input/output files are located here:

/data/RBL_NCI/Wolin/Sam/annotation_testing
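Checking a rewritten rule against the expected outputs could be done with a helper like the one below. It is a minimal sketch that assumes the rewritten scripts should reproduce the expected files byte for byte; if row order is allowed to change, the files would need sorting before comparison. `compare_outputs` is a hypothetical name, and the demonstration uses throwaway files rather than real pipeline output.

```shell
#!/bin/sh
# compare_outputs EXPECTED_DIR TEST_DIR
# Prints one line per expected .txt file: OK, DIFF, or MISSING.
# Returns non-zero if any file differs or is missing.
# ASSUMPTION: outputs are expected to be byte-identical.
compare_outputs() {
    expected_dir=$1
    test_dir=$2
    n_bad=0
    for expected in "$expected_dir"/*.txt; do
        [ -e "$expected" ] || continue   # glob matched nothing
        name=$(basename "$expected")
        if [ ! -f "$test_dir/$name" ]; then
            echo "MISSING $name"
            n_bad=$(( n_bad + 1 ))
        elif cmp -s "$expected" "$test_dir/$name"; then
            echo "OK $name"
        else
            echo "DIFF $name"
            n_bad=$(( n_bad + 1 ))
        fi
    done
    [ "$n_bad" -eq 0 ]
}

# Self-contained demonstration on throwaway files.
demo=$(mktemp -d)
mkdir -p "$demo/exp" "$demo/test"
printf 'chr1\t100\t200\n' > "$demo/exp/a.txt"
printf 'chr1\t100\t200\n' > "$demo/test/a.txt"
printf 'chr2\t5\t9\n' > "$demo/exp/b.txt"
demo_report=$(compare_outputs "$demo/exp" "$demo/test") || true
echo "$demo_report"
rm -rf "$demo"
```

In practice the two arguments would be a project's expected 02_peaks directory and the matching directory written by the test run.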

Script calling

Script location:

/data/RBL_NCI/Pipelines/iCLIP/v2.0/workflow/scripts/

Example R script (SameStrand, proj_1):

Rscript /data/RBL_NCI/Pipelines/iCLIP/v2.0/workflow/scripts/05_Anno_ExonIntron.R \
--rscript /data/RBL_NCI/Pipelines/iCLIP/v2.0/workflow/scripts/05_peak_annotation_functions.R \
--peak_type ALL \
--anno_anchor max_total \
--read_depth 3 \
--sample_id Control1hr_Clip \
--ref_species mm10 \
--anno_dir /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/input/04_annotation/01_project/ \
--reftable_path /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/config/annotation_config.txt \
--gencode_path /data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10/Gencode_VM23/fromGencode/gencode.vM23.annotation.gtf.txt \
--intron_path /data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10/Gencode_VM23/fromUCSC/KnownGene/KnownGene_GRCm38_introns.bed \
--rmsk_path /data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10/repeatmasker/rmsk_GRCm38.txt \
--tmp_dir /data/RBL_NCI/Wolin/Sam/annotation_testing/tmp \
--out_dir /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/testing_output/04_annotation/02_peaks/ \
--out_file /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/testing_output/04_annotation/02_peaks/Control1hr_Clip_ALLreadPeaks_AllRegions_IntronExon_SameStrand.txt \
--anno_strand "SameStrand" 

Example R script (OppoStrand, proj_1):

Rscript /data/RBL_NCI/Pipelines/iCLIP/v2.0/workflow/scripts/05_Anno_ExonIntron.R \
--rscript /data/RBL_NCI/Pipelines/iCLIP/v2.0/workflow/scripts/05_peak_annotation_functions.R \
--peak_type ALL \
--anno_anchor max_total \
--read_depth 3 \
--sample_id Control1hr_Clip \
--ref_species mm10 \
--anno_dir /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/input/04_annotation/01_project/ \
--reftable_path /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/config/annotation_config.txt \
--gencode_path /data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10/Gencode_VM23/fromGencode/gencode.vM23.annotation.gtf.txt \
--intron_path /data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10/Gencode_VM23/fromUCSC/KnownGene/KnownGene_GRCm38_introns.bed \
--rmsk_path /data/CCBR_Pipeliner/iCLIP/ref/annotations/mm10/repeatmasker/rmsk_GRCm38.txt \
--tmp_dir /data/RBL_NCI/Wolin/Sam/annotation_testing/tmp \
--out_dir /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/testing_output/04_annotation/02_peaks/ \
--out_file /data/RBL_NCI/Wolin/Sam/annotation_testing/proj_1/testing_output/04_annotation/02_peaks/Control1hr_Clip_ALLreadPeaks_AllRegions_IntronExon_OppoStrand.txt \
--anno_strand "OppoStrand"

@slsevilla

Create DAG of pipeline v2.0 for review

dag.pdf

@wilfriedguiblet

Improved IE_calling speed in 05_peak_annotation_functions.R.
