WIP: Add Microarray Pathway Analysis - GSVA example #352

cbethell · 2020-11-12T17:13:44Z

Analysis Purpose

This draft PR addresses issue #343.

Pull Request Stage

This is a Draft PR - needs review of big concepts and outline

Strategy

I began by following the plan posted on issue #343 which is as follows:

Import library(GSVA) (will add this to the Dockerfile in an upcoming PR)
Read in gene expression data (the Homo sapiens medulloblastoma dataset which we also used in differential-expression_microarray_01_several-groups.Rmd)
Import gene list from broad institute url -- the original plan here was to use the recommendation from GSVA vignette to read in the file (and isolate hallmark gene sets) but I implemented qusage::read.gmt() instead because the vignette requires that users load in a separate library and use the data() function to retrieve the gene sets which I believe may be a bit more complex that using qusage::read.gmt() (context on making sure users understand the implications of multiple testing corrections and how smaller gene sets can help with this will be included in an updated PR). I will also note that when I attempted to use the method in the GSVA vignette to retrieve the hallmark gene sets, I was met with an error saying that the package is not available for the R version being used.
Gene identifier conversion — mapping to human Entrez ids
Remove duplicate identifiers — using the highest absolute mean expression value
Use GSVA::gsva() to perform GSVA, using largely the same parameters used in training with exception of the parallel.sz = 4 argument due to the following error message I got: Error in serialize(data, node$con, xdr = FALSE) : error writing to connection. I therefore opted to use the default parameter which is parallel.sz = 1L (I removed this from the step and instruct users to use ?gsva to see more options but should I still include parallel.sz = 1L although it is the default?)
Make some sort of visualization of the GSVA scores - Creating a heatmap using the GSVA scores and the associated metadata for annotation and a violin plot the scores by gene set/pathway.
Write results to a TSV.

Concerns/Questions for reviewers:

The added code in this PR (mainly the code related to reading the hallmark gene sets, the removal of duplicate gene identifiers using the higher absolute mean expression value, the main GSVA steps, and the visualization steps (include the preparation of the annotation data frame for the heatmap and the preparation of the data for the violin plots).

I will note here that the metadata does not contain a whole lot of information for this particular dataset, so the annotation on the heatmap only includes refinebio_sex. Comments on whether or not this is suffice for this use case, and suggestions for alternatives if this is not suffice, would be particularly appreciated!

I added the prompt DRAFT mainly before places where the context is rough and will be refined in upcoming PRs and REVIEW before code which could benefit from review at this stage.

Please let me know if anything above is unclear!
I am planning to break up this PR (likely into 2 separate PRs) for the purpose of refining the content in this PR and getting more detailed review comments so feel free to leave any high-level comments on this PR for implementation in the upcoming PRs.
cc: @cansavvy, @jaclyn-taroni, @jashapiro

See the rendered html for your reference.

Analysis Pull Request Check List (roughly in order):

Content checks

All {{BLANKS}} have been replaced with the correct content.
Sources are cited
Seed is set (if applicable)

Formatting Checks

Removed any manual numbering of sections.
Removed any instances of chunk naming.
Comments and documentation are up to date.
All links have been checked and are properly formatted.

Add datasets to S3

Added data and metadata files to S3.

Docker/Snakemake rendering components

Added the .html link to the navigation bar. (Will add in upcoming PR)
Any not yet added packages needed for this analysis have been added to the Dockerfile and it successfully builds. (Will add in upcoming PR)

* Adding in some style with css * Use css magic * Try making the navbar blue * Add survey link * Make font smaller * Need a comma * Change to normalizePath * normalizepath separate step references.bib * Move references.bib to component folder * Made ccs modifications, added logo file Made changes to css/navbar.html Tried to add the logo but it but it cuts out and not sure how to make it decent. * Resolve render-notebooks.R conflict * Remove testing html from file diff * uncommented mobile nav Co-authored-by: dvenprasad <[email protected]>

@dvenprasad

* Adding in some style with css * Use css magic * Try making the navbar blue * Add survey link * Make font smaller * Need a comma * Change to normalizePath * normalizepath separate step references.bib * Move references.bib to component folder * Update github actions to reflect staging branch (#311) * Update github actions to reflect staging branch * Add libglpk40 to Dockerfile * Make it gh-pages-stages! * Remove dockerfile change that should have been on its own all along * Does this work? * Declare a uses * Switch how env is declared * Force it to run so we can test it * try no curly brackets * What's up with the branch * Move to bash if instead * Need quotes? * forgot a `then` * Try dollar signs * Doesn't like the `.`? * Use curly brackets * Try ${GITHUB_REF} * Try ${BRANCH_NAME} * try ${GITHUB_REF#refs/*/} * use jashapiro suggestion * Change to base ref * Change back to `github.ref` * Get rid of PR `on:` * Try another test * Docker dep fix: Add lib package 40 thing that clusterprofiler needs (#316) * Add lib package 40 thing that clusterprofiler needs * Try adding options(warn = 2) * Test if options(warn =2) means it breaks like it should * Revert "Test if options(warn =2) means it breaks like it should" This reverts commit d9f688f. * Revert "Try another test" This reverts commit 845cf1a. * Add google analytics to renderings (#314) * Try adding google analytics * Add to header using includes * temporary file snuck in there * Restore master version so they aren't in the review * Let's call an html file and html file * Docker dep fix: Add lib package 40 thing that clusterprofiler needs (#316) * Add lib package 40 thing that clusterprofiler needs * Try adding options(warn = 2) * Test if options(warn =2) means it breaks like it should * Revert "Test if options(warn =2) means it breaks like it should" This reverts commit d9f688f. * Only push if we are in master. For simplicity, we will now run this even if the dockerfile hasn't changed. * Add test target * test staging workflow with this branch * back to latest tag * Try separate push step * change tags to test push * Revert "change tags to test push" This reverts commit 6a38574. * Remove this branch from triggers * Push staging, retag and push master Okay, so the branch name is now inaccurate, but that is fine... * Made ccs modifications, added logo file Made changes to css/navbar.html Tried to add the logo but it but it cuts out and not sure how to make it decent. * Resolve render-notebooks.R conflict * Remove testing html from file diff * uncommented mobile nav * Update scripts/render-notebooks.R * Add some issue templates (#319) * Add some rough draft issue templates * Incorporate cbethell review * Get rid of `Other` labels that aren't useful * Update diagrams showing how microarray/RNA-seq work (#326) * Mechanics for CSS file and navbar add feedback URL (#303) * Adding in some style with css * Use css magic * Try making the navbar blue * Add survey link * Make font smaller * Need a comma * Change to normalizePath * normalizepath separate step references.bib * Move references.bib to component folder * Made ccs modifications, added logo file Made changes to css/navbar.html Tried to add the logo but it but it cuts out and not sure how to make it decent. * Resolve render-notebooks.R conflict * Remove testing html from file diff * uncommented mobile nav Co-authored-by: dvenprasad <[email protected]> * Update microarray and RNAseq overview figures - add context re figures - change .jpg to .png for consistency * Revert "Mechanics for CSS file and navbar add feedback URL (#303)" This reverts commit 8b81fdd. * update links to diagrams * @dvenprasad updated figure spacing * add the right updated figure * replace section of link to figures with updated commit id * incorporate @cansavvy's suggested changes Co-authored-by: Candace Savonen <[email protected]> Co-authored-by: dvenprasad <[email protected]> Co-authored-by: Joshua Shapiro <[email protected]> Co-authored-by: dvenprasad <[email protected]> Co-authored-by: Chante Bethell <[email protected]>

jaclyn-taroni

The general structure of this looks good! The specifics of different choices need a bit of justification/work (the system is working)!

jaclyn-taroni · 2020-11-16T14:06:13Z

02-microarray/pathway-analysis_microarray_04_gsva.Rmd

+
+```{r}
+filtered_mapped_df <- mapped_df %>%
+ # Sort so that the highest absolute mean expression values are at the top


Why absolute value?

Good question, this decision was made based on a similar decision made in the GSEA example where it was suggested that we keep the higher absolute value for the purpose of ranking.

Although GSVA does not require a pre-ranked gene list like GSEA, this decision seemed to still make sense here as the row/observation with the larger absolute value would most likely be more interesting that one with a smaller value -- lowly expressed seems less interesting (whether it be a positive or negative value).

However, I will adapt this section appropriately based on your below comment.

I bet you will respond to #352 (comment) shortly, but some thoughts:

By using the mean value across samples, you're making a decision based on all the samples but the GSVA scores are on a per-sample basis (part of the gist of the linked comment ☝️ )

In general, the values that come out of refine.bio for quantile normalized microarray data (what you have here) are going to range from small negative values to higher positive values, where the small negative numbers are genes that are not highly expressed in a particular sample.

Gotcha 👍

jaclyn-taroni · 2020-11-16T14:06:57Z

02-microarray/pathway-analysis_microarray_04_gsva.Rmd

+Let's now convert our data frame into a matrix and prepare our object for GSVA.
+
+```{r}
+matrix <- filtered_mapped_df %>%


This name has to change because it will conflict with matrix(). It's also not very descriptive!

jaclyn-taroni · 2020-11-16T14:11:33Z

02-microarray/pathway-analysis_microarray_04_gsva.Rmd

+ as.matrix()
+```
+
+Note that if we had duplicate gene identifiers here, we would not be able to set them as rownames.


I would have the first PR go to here.

jaclyn-taroni · 2020-11-16T14:12:49Z

02-microarray/pathway-analysis_microarray_04_gsva.Rmd

+head(gsva_results)
+```
+
+## Visualizing Results


I would probably pick one of these - this notebook is pretty long and it doesn't have all of the text we need yet! My inclination would be to leave the violin plot because I don't think we have many instances of this kind of wrangling in the examples and you could direct folks to the heatmap example(s) we already have.

jaclyn-taroni · 2020-11-16T14:14:59Z

02-microarray/pathway-analysis_microarray_04_gsva.Rmd

+gsva_results <- gsva(matrix,
+ hallmarks_list,
+ method = "gsva",
+ # Appropriate for our transformed data on log2-like scale


Because this is a SCAN processed microarray dataset, you probably can say:

Suggested change

# Appropriate for our transformed data on log2-like scale

# Appropriate for our log2-transformed microarray data

jaclyn-taroni · 2020-11-16T14:17:22Z

02-microarray/pathway-analysis_microarray_04_gsva.Rmd

+---
+title: "Gene set variation analysis - Microarray"
+author: "CCDL for ALSF"
+date: "October 2020"


Suggested change

date: "October 2020"

date: "November 2020"

jaclyn-taroni · 2020-11-16T14:17:56Z

02-microarray/pathway-analysis_microarray_04_gsva.Rmd

+
+We can see that the associated values vary for each row.
+
+As we mentioned earlier, we will want to remove duplicated gene identifiers in preparation for the GSVA steps later, so let's keep the Entrez IDs associated with the larger mean expression value per sample.


This seems like a somewhat standard workflow for RNA-seq data because you might expect lowly expressed genes not to show up in your data, so you'd drop the instance of the identifier with low expression as estimated by the mean value across samples. For GSEA (not the pre-ranked version), the default way to collapse identifiers is to take the max value for multiple probes that map to but you can apply the same idea to Ensembl IDs that map to the same Entrez ID (ref). Other ways people do it (also at that ref) are to take the mean value or median value (should be less sensitive to outliers than the mean). Either way, it makes more sense to me to work with the individual values (e.g., not based on a summary measurement across all samples) given that you're doing GSVA which is on a per sample basis. If you pick the max value, you're preferentially going to pick something with a higher rank and that caveat should get documented.

Gotcha, that all makes sense. I will use the default way and take the max value and be sure to document the caveat (as well as link to the reference).

jaclyn-taroni · 2020-11-16T14:50:56Z

02-microarray/pathway-analysis_microarray_04_gsva.Rmd

+**DRAFT**
+
+What do you notice about these distributions?
+How might you use this information to inform your interpretation of GSVA scores?


I know this text is marked as a draft, but I want to make a larger point about the purpose of this plot.

This looks like it's taken from the training material. In the training material, we generate a distribution of scores using random gene sets of different sizes to specifically illustrate the point about how the number of genes in a pathway/gene set influences the scores. I'm not sure what this violin plot is intended to illustrate if we're not going to include information about the gene set sizes. I don't know that we want to include that same experiment from training here (in fact I think we don't want to include it). Instead, a more common use case for a violin plot is to look at the differences between groups. If you want to do that, you probably need to pick a different dataset with more metadata based on your comment when you filed the PR. Another idea would be to include the gene set size in the labels (e.g., HALLMARK_ANGIOGENESIS (<number of genes>)).

I did think that the gene set size experiment from training would not be well suited here but was not sure of the best alternative here, that said, I plan to hopefully find another dataset with more metadata. However, if that ends up being a bit more time-consuming than we would want, I'll go ahead and included the gene set size in the labels.

jaclyn-taroni · 2020-11-16T14:56:45Z

02-microarray/pathway-analysis_microarray_04_gsva.Rmd

+ # Minimum gene set size
+ min.sz = 5,
+ # Maximum gene set size
+ max.sz = 50,


This seems pretty low to me. Why did you pick this value?

Ah, this was actually unintentionally left over from development.
I had to do a bit of debugging which I thought the smaller parameters would assist with (as it seemed that it had timed out or had a memory issue) but now this should would just fine with the larger parameters so I will update this to use the paramaters min.sz = 15 and max.sz = 500 (tested it and it only takes a second or two longer than it does now).

cbethell · 2020-11-30T22:37:32Z

This can be closed now that #343 has been addressed with closed PRs #359 and #362.

cansavvy and others added 8 commits October 22, 2020 11:04

add draft of GSVA microarray example notebook

c40740b

use latest release of hallmark gene sets

cf95200

add comment for argument in gsva step

6ee5fd0

fix merge conflicts

e5d4849

fix some spelling concerns

1bb6091

fix a few more spelling concerns

3c98e73

cbethell requested a review from jaclyn-taroni November 12, 2020 17:45

jaclyn-taroni reviewed Nov 16, 2020

View reviewed changes

This was referenced Nov 18, 2020

Pr 1 of 2: Add Microarray Pathway Analysis - GSVA example #359

Merged

Pr 2 of 2: Add Microarray Pathway Analysis - GSVA example #362

Merged

cbethell closed this Nov 30, 2020

cbethell deleted the cbethell/add-microarray-gsva-draft branch November 30, 2020 22:37

jaclyn-taroni mentioned this pull request Dec 2, 2020

RNA-seq: Pathway Analysis - GSVA #369

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: Add Microarray Pathway Analysis - GSVA example #352

WIP: Add Microarray Pathway Analysis - GSVA example #352

cbethell commented Nov 12, 2020 •

edited

Loading

jaclyn-taroni left a comment

jaclyn-taroni Nov 16, 2020

cbethell Nov 16, 2020 •

edited

Loading

jaclyn-taroni Nov 16, 2020

cbethell Nov 16, 2020

jaclyn-taroni Nov 16, 2020

jaclyn-taroni Nov 16, 2020

jaclyn-taroni Nov 16, 2020

jaclyn-taroni Nov 16, 2020

jaclyn-taroni Nov 16, 2020

jaclyn-taroni Nov 16, 2020

cbethell Nov 16, 2020

jaclyn-taroni Nov 16, 2020

cbethell Nov 16, 2020

jaclyn-taroni Nov 16, 2020

cbethell Nov 16, 2020

cbethell commented Nov 30, 2020

	# Appropriate for our transformed data on log2-like scale
	# Appropriate for our log2-transformed microarray data


		We can see that the associated values vary for each row.

		As we mentioned earlier, we will want to remove duplicated gene identifiers in preparation for the GSVA steps later, so let's keep the Entrez IDs associated with the larger mean expression value per sample.

WIP: Add Microarray Pathway Analysis - GSVA example #352

WIP: Add Microarray Pathway Analysis - GSVA example #352

Conversation

cbethell commented Nov 12, 2020 • edited Loading

Analysis Purpose

Pull Request Stage

Strategy

Concerns/Questions for reviewers:

Analysis Pull Request Check List (roughly in order):

Content checks

Formatting Checks

Add datasets to S3

Docker/Snakemake rendering components

jaclyn-taroni left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cbethell Nov 16, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cbethell commented Nov 30, 2020

cbethell commented Nov 12, 2020 •

edited

Loading

cbethell Nov 16, 2020 •

edited

Loading