Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Add Microarray Pathway Analysis - GSVA example #352

Closed
wants to merge 8 commits into from

Conversation

cbethell
Copy link
Contributor

@cbethell cbethell commented Nov 12, 2020

Analysis Purpose

This draft PR addresses issue #343.

Pull Request Stage

This is a Draft PR - needs review of big concepts and outline

Strategy

I began by following the plan posted on issue #343 which is as follows:

  1. Import library(GSVA) (will add this to the Dockerfile in an upcoming PR)
  2. Read in gene expression data (the Homo sapiens medulloblastoma dataset which we also used in differential-expression_microarray_01_several-groups.Rmd)
  3. Import gene list from broad institute url -- the original plan here was to use the recommendation from GSVA vignette to read in the file (and isolate hallmark gene sets) but I implemented qusage::read.gmt() instead because the vignette requires that users load in a separate library and use the data() function to retrieve the gene sets which I believe may be a bit more complex that using qusage::read.gmt() (context on making sure users understand the implications of multiple testing corrections and how smaller gene sets can help with this will be included in an updated PR). I will also note that when I attempted to use the method in the GSVA vignette to retrieve the hallmark gene sets, I was met with an error saying that the package is not available for the R version being used.
  4. Gene identifier conversion — mapping to human Entrez ids
  5. Remove duplicate identifiers — using the highest absolute mean expression value
  6. Use GSVA::gsva() to perform GSVA, using largely the same parameters used in training with exception of the parallel.sz = 4 argument due to the following error message I got: Error in serialize(data, node$con, xdr = FALSE) : error writing to connection. I therefore opted to use the default parameter which is parallel.sz = 1L (I removed this from the step and instruct users to use ?gsva to see more options but should I still include parallel.sz = 1L although it is the default?)
  7. Make some sort of visualization of the GSVA scores - Creating a heatmap using the GSVA scores and the associated metadata for annotation and a violin plot the scores by gene set/pathway.
  8. Write results to a TSV.

Concerns/Questions for reviewers:

The added code in this PR (mainly the code related to reading the hallmark gene sets, the removal of duplicate gene identifiers using the higher absolute mean expression value, the main GSVA steps, and the visualization steps (include the preparation of the annotation data frame for the heatmap and the preparation of the data for the violin plots).

I will note here that the metadata does not contain a whole lot of information for this particular dataset, so the annotation on the heatmap only includes refinebio_sex. Comments on whether or not this is suffice for this use case, and suggestions for alternatives if this is not suffice, would be particularly appreciated!

I added the prompt DRAFT mainly before places where the context is rough and will be refined in upcoming PRs and REVIEW before code which could benefit from review at this stage.

Please let me know if anything above is unclear!
I am planning to break up this PR (likely into 2 separate PRs) for the purpose of refining the content in this PR and getting more detailed review comments so feel free to leave any high-level comments on this PR for implementation in the upcoming PRs.
cc: @cansavvy, @jaclyn-taroni, @jashapiro

See the rendered html for your reference.

Analysis Pull Request Check List (roughly in order):

Content checks

  • All {{BLANKS}} have been replaced with the correct content.
  • Sources are cited
  • Seed is set (if applicable)

Formatting Checks

  • Removed any manual numbering of sections.
  • Removed any instances of chunk naming.
  • Comments and documentation are up to date.
  • All links have been checked and are properly formatted.

Add datasets to S3

Docker/Snakemake rendering components

  • Added the .html link to the navigation bar. (Will add in upcoming PR)
  • Any not yet added packages needed for this analysis have been added to the Dockerfile and it successfully builds. (Will add in upcoming PR)

cansavvy and others added 8 commits October 22, 2020 11:04
* Adding in some style with css

* Use css magic

* Try making the navbar blue

* Add survey link

* Make font smaller

* Need a comma

* Change to normalizePath

* normalizepath separate step references.bib

* Move references.bib to component folder

* Made ccs modifications, added logo file

Made changes to css/navbar.html
Tried to add the logo but it but it cuts out and not sure how to make it decent.

* Resolve render-notebooks.R conflict

* Remove testing html from file diff

* uncommented mobile nav

Co-authored-by: dvenprasad <[email protected]>
* Adding in some style with css

* Use css magic

* Try making the navbar blue

* Add survey link

* Make font smaller

* Need a comma

* Change to normalizePath

* normalizepath separate step references.bib

* Move references.bib to component folder

* Update github actions to reflect staging branch (#311)

* Update github actions to reflect staging branch

* Add libglpk40 to Dockerfile

* Make it gh-pages-stages!

* Remove dockerfile change that should have been on its own all along

* Does this work?

* Declare a uses

* Switch how env is declared

* Force it to run so we can test it

* try no curly brackets

* What's up with the branch

* Move to bash if instead

* Need quotes?

* forgot a `then`

* Try dollar signs

* Doesn't like the `.`?

* Use curly brackets

* Try ${GITHUB_REF}

* Try ${BRANCH_NAME}

* try ${GITHUB_REF#refs/*/}

* use jashapiro suggestion

* Change to base ref

* Change back to `github.ref`

* Get rid of PR `on:`

* Try another test

* Docker dep fix: Add lib package 40 thing that clusterprofiler needs (#316)

* Add lib package 40 thing that clusterprofiler needs

* Try adding options(warn = 2)

* Test if options(warn =2) means it breaks like it should

* Revert "Test if options(warn =2) means it breaks like it should"

This reverts commit d9f688f.

* Revert "Try another test"

This reverts commit 845cf1a.

* Add google analytics to renderings (#314)

* Try adding google analytics

* Add to header using includes

* temporary file snuck in there

* Restore master version so they aren't in the review

* Let's call an html file and html file

* Docker dep fix: Add lib package 40 thing that clusterprofiler needs (#316)

* Add lib package 40 thing that clusterprofiler needs

* Try adding options(warn = 2)

* Test if options(warn =2) means it breaks like it should

* Revert "Test if options(warn =2) means it breaks like it should"

This reverts commit d9f688f.

* Only push if we are in master.

For simplicity, we will now run this even if the dockerfile hasn't changed.

* Add test target

* test staging workflow with this branch

* back to latest tag

* Try separate push step

* change tags to test push

* Revert "change tags to test push"

This reverts commit 6a38574.

* Remove this branch from triggers

* Push staging, retag and push master

Okay, so the branch name is now inaccurate, but that is fine...

* Made ccs modifications, added logo file

Made changes to css/navbar.html
Tried to add the logo but it but it cuts out and not sure how to make it decent.

* Resolve render-notebooks.R conflict

* Remove testing html from file diff

* uncommented mobile nav

* Update scripts/render-notebooks.R

* Add some issue templates (#319)

* Add some rough draft issue templates

* Incorporate cbethell review

* Get rid of `Other` labels that aren't useful

* Update diagrams showing how microarray/RNA-seq work  (#326)

* Mechanics for CSS file and navbar add feedback URL (#303)

* Adding in some style with css

* Use css magic

* Try making the navbar blue

* Add survey link

* Make font smaller

* Need a comma

* Change to normalizePath

* normalizepath separate step references.bib

* Move references.bib to component folder

* Made ccs modifications, added logo file

Made changes to css/navbar.html
Tried to add the logo but it but it cuts out and not sure how to make it decent.

* Resolve render-notebooks.R conflict

* Remove testing html from file diff

* uncommented mobile nav

Co-authored-by: dvenprasad <[email protected]>

* Update microarray and RNAseq overview figures


- add context re figures
- change .jpg to .png for consistency

* Revert "Mechanics for CSS file and navbar add feedback URL (#303)"

This reverts commit 8b81fdd.

* update links to diagrams

* @dvenprasad updated figure spacing

* add the right updated figure

* replace section of link to figures with updated commit id

* incorporate @cansavvy's suggested changes

Co-authored-by: Candace Savonen <[email protected]>
Co-authored-by: dvenprasad <[email protected]>

Co-authored-by: Joshua Shapiro <[email protected]>
Co-authored-by: dvenprasad <[email protected]>
Co-authored-by: Chante Bethell <[email protected]>
Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The general structure of this looks good! The specifics of different choices need a bit of justification/work (the system is working)!


```{r}
filtered_mapped_df <- mapped_df %>%
# Sort so that the highest absolute mean expression values are at the top
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why absolute value?

Copy link
Contributor Author

@cbethell cbethell Nov 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, this decision was made based on a similar decision made in the GSEA example where it was suggested that we keep the higher absolute value for the purpose of ranking.

Although GSVA does not require a pre-ranked gene list like GSEA, this decision seemed to still make sense here as the row/observation with the larger absolute value would most likely be more interesting that one with a smaller value -- lowly expressed seems less interesting (whether it be a positive or negative value).

However, I will adapt this section appropriately based on your below comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I bet you will respond to #352 (comment) shortly, but some thoughts:

  • By using the mean value across samples, you're making a decision based on all the samples but the GSVA scores are on a per-sample basis (part of the gist of the linked comment ☝️ )
  • In general, the values that come out of refine.bio for quantile normalized microarray data (what you have here) are going to range from small negative values to higher positive values, where the small negative numbers are genes that are not highly expressed in a particular sample.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha 👍

Let's now convert our data frame into a matrix and prepare our object for GSVA.

```{r}
matrix <- filtered_mapped_df %>%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name has to change because it will conflict with matrix(). It's also not very descriptive!

as.matrix()
```

Note that if we had duplicate gene identifiers here, we would not be able to set them as rownames.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have the first PR go to here.

head(gsva_results)
```

## Visualizing Results
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably pick one of these - this notebook is pretty long and it doesn't have all of the text we need yet! My inclination would be to leave the violin plot because I don't think we have many instances of this kind of wrangling in the examples and you could direct folks to the heatmap example(s) we already have.

gsva_results <- gsva(matrix,
hallmarks_list,
method = "gsva",
# Appropriate for our transformed data on log2-like scale
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this is a SCAN processed microarray dataset, you probably can say:

Suggested change
# Appropriate for our transformed data on log2-like scale
# Appropriate for our log2-transformed microarray data

---
title: "Gene set variation analysis - Microarray"
author: "CCDL for ALSF"
date: "October 2020"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
date: "October 2020"
date: "November 2020"


We can see that the associated values vary for each row.

As we mentioned earlier, we will want to remove duplicated gene identifiers in preparation for the GSVA steps later, so let's keep the Entrez IDs associated with the larger mean expression value per sample.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a somewhat standard workflow for RNA-seq data because you might expect lowly expressed genes not to show up in your data, so you'd drop the instance of the identifier with low expression as estimated by the mean value across samples. For GSEA (not the pre-ranked version), the default way to collapse identifiers is to take the max value for multiple probes that map to but you can apply the same idea to Ensembl IDs that map to the same Entrez ID (ref). Other ways people do it (also at that ref) are to take the mean value or median value (should be less sensitive to outliers than the mean). Either way, it makes more sense to me to work with the individual values (e.g., not based on a summary measurement across all samples) given that you're doing GSVA which is on a per sample basis. If you pick the max value, you're preferentially going to pick something with a higher rank and that caveat should get documented.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, that all makes sense. I will use the default way and take the max value and be sure to document the caveat (as well as link to the reference).

**DRAFT**

What do you notice about these distributions?
How might you use this information to inform your interpretation of GSVA scores?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this text is marked as a draft, but I want to make a larger point about the purpose of this plot.

This looks like it's taken from the training material. In the training material, we generate a distribution of scores using random gene sets of different sizes to specifically illustrate the point about how the number of genes in a pathway/gene set influences the scores. I'm not sure what this violin plot is intended to illustrate if we're not going to include information about the gene set sizes. I don't know that we want to include that same experiment from training here (in fact I think we don't want to include it). Instead, a more common use case for a violin plot is to look at the differences between groups. If you want to do that, you probably need to pick a different dataset with more metadata based on your comment when you filed the PR. Another idea would be to include the gene set size in the labels (e.g., HALLMARK_ANGIOGENESIS (<number of genes>)).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did think that the gene set size experiment from training would not be well suited here but was not sure of the best alternative here, that said, I plan to hopefully find another dataset with more metadata. However, if that ends up being a bit more time-consuming than we would want, I'll go ahead and included the gene set size in the labels.

# Minimum gene set size
min.sz = 5,
# Maximum gene set size
max.sz = 50,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems pretty low to me. Why did you pick this value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this was actually unintentionally left over from development.
I had to do a bit of debugging which I thought the smaller parameters would assist with (as it seemed that it had timed out or had a memory issue) but now this should would just fine with the larger parameters so I will update this to use the paramaters min.sz = 15 and max.sz = 500 (tested it and it only takes a second or two longer than it does now).

@cbethell
Copy link
Contributor Author

This can be closed now that #343 has been addressed with closed PRs #359 and #362.

@cbethell cbethell closed this Nov 30, 2020
@cbethell cbethell deleted the cbethell/add-microarray-gsva-draft branch November 30, 2020 22:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants