WIP: Case study report automation #111

Draft: wants to merge 27 commits into base: main

Collaborator

@awasyn awasyn commented Oct 26, 2024

Case study report automation with MolEvolvR

This commit introduces a series of scripts and supporting code for automating case study reports via the MolEvolvR analysis pipeline. The added scripts enable end-to-end processing of pathogen and/or drug data from CARD in FASTA format, automating key tasks such as sequence retrieval, alignment, and protein annotation. Currently, the pipeline supports only the "full" analysis option.
Specifically, the following capabilities have been implemented:

  • BLAST and InterProScan Integration:

    • BLAST: Utilizes NCBI’s online BLAST API to retrieve sequence alignment results.
    • InterProScan: Supports local execution with dependencies on the command-line version of InterProScan. Future work will include developing an InterProScan API wrapper (iprscanr) for more accessible remote processing, reducing local resource demands.
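
As a rough illustration of the local InterProScan step, the sketch below builds a command line for the InterProScan launcher and only runs it when the launcher is found on the PATH. The helper names (build_iprscan_args, run_iprscan) and the file names are hypothetical and are not taken from this PR; the -i, -f and -o options are the standard InterProScan input, format and output-file options.

```r
# Hypothetical helper: assemble arguments for a local InterProScan run.
# -i = input FASTA, -f = output format, -o = explicit output file.
build_iprscan_args <- function(fasta, out_tsv) {
  c("-i", shQuote(fasta), "-f", "tsv", "-o", shQuote(out_tsv))
}

# Hypothetical wrapper: run InterProScan only if the launcher is on the PATH.
run_iprscan <- function(fasta, out_tsv, bin = "interproscan.sh") {
  if (!nzchar(Sys.which(bin))) {
    stop("InterProScan launcher not found on PATH: ", bin)
  }
  system2(bin, build_iprscan_args(fasta, out_tsv))
}

# Show the command that would be executed, without running it.
iprscan_args <- build_iprscan_args("query.fasta", "query.iprscan.tsv")
cat("interproscan.sh", iprscan_args, "\n")
```

Keeping the argument construction separate from the call makes it possible to inspect or log the command before committing to a long local run.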

This approach is based on the MolEvolvR webapp pipeline, with portions of code directly adapted from MolEvolvR scripts. The original authorship has been retained to acknowledge the foundational work and provide proper attribution.

Current dependencies not included in this commit:

  • lineage_lookup.txt, cln_lookup_tbl.tsv data files
  • blast+ for local sequence alignments
  • InterProScan (local installation)

To-Do:

  • Clean scripts and post-analysis report generation
  • Add a wrapper for the InterProScan API to reduce local resource constraints
  • Extend support for additional analysis options ("da", "dblasts", "phylo")
  • Extend support for other formats ("msa", "accnum", "ipr")

Update:

Hi, below are the complete instructions for running the case study report generation on Windows and Ubuntu, including how to install and use the getCaseStudyReport function for automating the case study:

Setup Instructions

Windows

  1. Install BLAST:
  • Download from NCBI BLAST+ and follow the instructions to install and add it to your PATH.
  • Verify:
    blastn -version
  2. Install CD-HIT:
  • Download binaries from CD-HIT GitHub Releases and extract the files.
  • Add the folder containing cd-hit.exe to your PATH.
  • Verify:
    cd-hit -h
  3. Install MolEvolvR:
install.packages("devtools")
devtools::install_github("awasyn/MolEvolvR", ref = "case_study", dependencies = TRUE)

Ubuntu

  1. Install BLAST:
    sudo apt-get update
    sudo apt-get install ncbi-blast+ -y
  2. Install CD-HIT:
    sudo apt-get install cd-hit -y
  3. Install MolEvolvR:
install.packages("devtools")
devtools::install_github("awasyn/MolEvolvR", ref = "case_study", dependencies = TRUE)
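
On either platform, it may help to confirm from within R that the command-line tools are visible before running the pipeline. The following is a minimal sketch using base R only; check_cli_tools is a hypothetical helper name, not a function in MolEvolvR.

```r
# Hypothetical helper: report which external tools R can find on the PATH.
check_cli_tools <- function(tools = c("blastn", "cd-hit")) {
  paths <- Sys.which(tools)  # "" for tools that are not found
  data.frame(
    tool  = tools,
    path  = unname(paths),
    found = unname(nzchar(paths)),
    stringsAsFactors = FALSE
  )
}

print(check_cli_tools())
```

If a tool shows found = FALSE even though it works in a terminal, the R session may have been started with a different PATH; restarting R or RStudio after editing the PATH is usually worth trying.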

Next, add the supporting data files.

# create common_data folder inside the MolEvolvR library
common_data_path <- file.path(.libPaths()[1], "MolEvolvR", "common_data")

dir.create(common_data_path, showWarnings = FALSE)

# download data
download.file("https://drive.google.com/uc?id=1qN3NGUahVZmRniedy_2ijMKaY_TBrwOz", 
              destfile = file.path(common_data_path, "cln_lookup_tbl.tsv"))

download.file("https://drive.google.com/uc?id=1h_CjURK5laxT7Prhm_U2IqRYlCf6WwKJ", 
              destfile = file.path(common_data_path, "lineage_lookup.txt"))
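
A quick sanity check after the downloads can catch missing or empty files, or a web page saved in place of the data (which can happen with file-sharing links). This is a sketch under stated assumptions: check_support_files is a hypothetical helper, and the HTML test only inspects the first line of each file.

```r
# Hypothetical helper: verify that the two lookup files exist and look like data.
check_support_files <- function(dir,
                                files = c("cln_lookup_tbl.tsv", "lineage_lookup.txt")) {
  paths   <- file.path(dir, files)
  present <- file.exists(paths)
  bytes   <- file.size(paths)  # NA for missing files

  # Read the first line of each existing file to spot an HTML page saved by mistake.
  first_line <- vapply(paths, function(p) {
    if (!file.exists(p)) return(NA_character_)
    line <- readLines(p, n = 1, warn = FALSE)
    if (length(line) == 0) "" else line
  }, character(1), USE.NAMES = FALSE)

  html <- !is.na(first_line) &
    grepl("^[[:space:]]*<(!doctype|html)", first_line,
          ignore.case = TRUE, useBytes = TRUE)

  data.frame(
    file            = files,
    exists          = present,
    bytes           = bytes,
    looks_like_html = html,
    ok              = present & !is.na(bytes) & bytes > 0 & !html,
    stringsAsFactors = FALSE
  )
}

# Same location as used in the download step above.
common_data_path <- file.path(.libPaths()[1], "MolEvolvR", "common_data")
print(check_support_files(common_data_path))
```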

Try out the case study

Once everything is set up, you can test the automation by running the following in R/RStudio:

library(MolEvolvR)
getCaseStudyReport(pathogen = "Acinetobacter baumannii", drug = "Beta-lactams")

This will generate and open a detailed case study report. 🚀
The output files are written to the parent directory of the working directory from which the function is called.
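
To locate the outputs, the parent of the current working directory can be resolved and its most recently modified entries listed. This is a small base-R sketch; expected_output_dir is a hypothetical helper, and the actual output file names are not specified here.

```r
# Hypothetical helper: resolve the parent of the working directory.
expected_output_dir <- function(wd = getwd()) {
  dirname(normalizePath(wd, winslash = "/", mustWork = TRUE))
}

out_dir <- expected_output_dir()
cat("Looking for outputs in:", out_dir, "\n")

# List entries in that directory, newest first.
info <- file.info(list.files(out_dir, full.names = TRUE))
print(head(info[order(info$mtime, decreasing = TRUE), c("size", "mtime")]))
```

Calling setwd() to a dedicated subfolder before running the report keeps the generated files together in that subfolder's parent.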

@jananiravi @the-mayer @falquaddoomi

What kind of change(s) are included?

  • Feature (adds or updates new capabilities)
  • Enhancement (adds functionality).

Checklist

Please ensure that all boxes are checked before indicating that this pull request is ready for review.

  • I have read and followed the CONTRIBUTING.md guidelines.
  • I have searched for existing content to ensure this is not a duplicate.
  • I have performed a self-review of these additions (including spelling, grammar, and related).
  • I have added comments to my code to help provide understanding.
  • I have added a test which covers the code changes found within this PR.
  • I have deleted all non-relevant text in this pull request template.
  • Reviewer assignment: Tag a relevant team member to review and approve the changes.

@awasyn awasyn changed the title WIP: Case study report Automation WIP: Case study report automation Oct 26, 2024
@awasyn awasyn self-assigned this Oct 27, 2024
@falquaddoomi falquaddoomi self-requested a review October 29, 2024 15:43
Signed-off-by: Awa Synthia <[email protected]>
@jananiravi added labels Nov 18, 2024: enhancement (New feature or request), outreachy (for outreachy interns), package (R package dev), api (Python, Plumber, R), bioinfo (Bioinformatics related), coding (Coding experience (of any sort) would be helpful)
@jananiravi jananiravi added this to the v0 | short-term fixes milestone Nov 18, 2024
Member

@jananiravi jananiravi left a comment

@awasyn, thanks again for putting together this PR. I haven't gone through all the files & functions line-by-line, but I started noting some high-level comments first that might help with a more careful review by our team.

  1. It might help if this PR focuses on new function definitions and MolEvolvR/CARD calls. When possible, call existing functions directly from R/ or use the downloaded github package calls. @the-mayer may advise one way or another.
  2. Use current function names post batch renaming (recently merged PRs). The current list is here: https://jravilab.github.io/MolEvolvR/reference/index.html. This is another reason to avoid redefining functions.
  3. Avoid stylistic changes or minor formatting changes, incl. stripping tail-end white spaces as part of commits to keep content and style-based commits and PRs separate.

Please help us by pointing to new functions/functionalities so we can focus our code review there.
MolEvolvR: @the-mayer @falquaddoomi @epbrenner
CARD/amR: @AbhirupaGhosh @epbrenner

Thanks a ton once again, for your concerted effort on this mega issue!

inst/report/scripts/run_molevolvr_pipeline.R (outdated, resolved)
inst/report/scripts/run_molevolvr_pipeline.R (outdated, resolved)
summarise(totalcount = sum(count))

- total <- left_join(prot, col_count, by = as_string(column))
+ total <- left_join(prot, col_count, by = column)
Member

@the-mayer, make sure this switch causes no errors?


prot <- select(prot, {{ column }}, {{ lineage_col }}) %>%
filter(!is.na({{ column }}) & !is.na({{ lineage_col }})) %>%
filter({{ column }} != "")

prot <- summarizeByLineage(prot, column, by = lineage_col, query = "all")
col_count <- prot %>%
- group_by({{ column }}) %>%
+ group_by(!!sym(column)) %>%
Member

@the-mayer, do we prefer curly curly or is this switch going to be consistent with the rest of our code? @awasyn, is there any particular reason for this switch? Did it ({{}}) throw a warning or error?

Collaborator Author

Yes, it threw an error. For example, column <- "DomArch" is passed as a string, but earlier in the code there was a column <- sym(column), which changes "DomArch" to a symbol, i.e. "DomArch", and the code then tries to find a column named "DomArch" in the data frame, which doesn't exist.

Member

I see. {{column}} should work similar to sym(column). So not sure what's going on. @the-mayer?
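
For reference, the two idioms discussed in this thread differ in what the caller is expected to pass. The sketch below uses toy data and hypothetical function names (count_by_string, count_by_name); it is not the code from this PR. With !!sym(column) the caller passes a string, while {{ column }} expects a bare column name supplied as a function argument. Passing a string to the {{ }} variant does not look up that column; dplyr treats the string as a constant value to group by.

```r
library(dplyr)

# Toy data standing in for the real table; values are illustrative only.
prot <- tibble(
  DomArch = c("A", "A", "B"),
  count   = c(1, 2, 5)
)

# Variant 1: the column name arrives as a string, e.g. column = "DomArch".
count_by_string <- function(data, column) {
  data %>%
    group_by(!!rlang::sym(column)) %>%
    summarise(totalcount = sum(count), .groups = "drop")
}

# Variant 2: the column arrives as a bare name, e.g. column = DomArch.
count_by_name <- function(data, column) {
  data %>%
    group_by({{ column }}) %>%
    summarise(totalcount = sum(count), .groups = "drop")
}

by_string <- count_by_string(prot, "DomArch")
by_name   <- count_by_name(prot, DomArch)
print(by_string)
print(by_name)

# When the column name is held as a string, a join can use it directly.
column <- "DomArch"
total <- left_join(prot, by_string, by = column)
print(total)
```

Whichever convention is chosen, using it consistently across a function (and not converting with sym() partway through) avoids the mismatch described above.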

inst/report/scripts/MolEvolData_class.R (outdated, resolved)
inst/report/scripts/viz_utils.R (outdated, resolved)
inst/report/scripts/viz_utils.R (outdated, resolved)
}

# Function to convert accessions to names
acc_to_name <- function(app_data) {
inst/report/scripts/viz_utils.R (outdated, resolved)
}

# Function to retrieve representative accession numbers
get_representative_accession_numbers <- function(app_data, phylo_select,

Contributor

@falquaddoomi falquaddoomi left a comment

First off, I commend you for taking on a difficult task and completing it!

I'm still in the process of reviewing your PR, but I thought I'd submit what comments I had so far so you can act on them. I'll likely do a second review a bit later on, when I get a chance to run it and investigate your changes further.

R/ipr2viz.R (outdated, resolved)
inst/report/scripts/run_molevolvr_pipeline.R (outdated, resolved)
inst/report/scripts/run_molevolvr_pipeline.R (outdated, resolved)
inst/report/scripts/run_molevolvr_pipeline.R (outdated, resolved)
inst/report/scripts/run_molevolvr_pipeline.R (outdated, resolved)
Member

@jananiravi jananiravi left a comment

@awasyn, were these changes (& these) pulled from main? If not, stylistic changes such as trailing spaces & indentation can be addressed later (post-merge).

Collaborator Author

awasyn commented Nov 20, 2024

> @awasyn, were these changes (& these) pulled from main? If not, stylistic changes such as trailing spaces & indentation can be addressed later (post-merge).

Okay, thanks. I wanted to ease review by reverting to only the changed lines. My editor added spaces on save.

Signed-off-by: Awa Synthia <[email protected]>
3 participants