Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Add RNA-seq Pathway Analysis - GSEA Example #397

Closed
wants to merge 8 commits into from

Conversation

cbethell
Copy link
Contributor

@cbethell cbethell commented Dec 2, 2020

Analysis Purpose

This draft PR addresses issue #368.

Pull Request Stage

This is a Draft PR - needs review of big concepts and outline

Strategy

I began by following the plan posted on issue #368 which is as follows:

  1. Load in the needed packages (clusterProfiler to run GSEA, msigdbr and species annotation package -- Homo sapiens in this case)
  2. Import the results from this repo's differential-expression_rnaseq_01.Rmd using a URL (noting that this is tentative to change based on how the visualization of the GSEA results for this dataset looks later on in the example)
  3. For consistency with the microarray analyses using clusterProfiler include the “getting familiar with clusterProfiler's options" section
  4. Isolate the hallmark gene sets using msigdbr()
  5. Perform gene identifier conversion (using the mapIds() function)
  6. Join the expression data and filter out any duplicate gene identifiers based on highest absolute log2FC value (as opposed to the t-statistic which was used in the microarray example as it does not appear that t-statistic values are available in the results dataset from differential-expression_rnaseq_01.Rmd
  7. Determine the pre-ranked gene list based on the gene level statistic from the previous step
  8. Perform gsea using GSEA() function from clusterProfiler providing the pre-ranked gene list to the geneList argument and largely keeping the parameters the same as in 02-microarray/pathway-analysis_microarray_02_gsea.Rmd.Rmd example
  9. Preview the most negative and the most positive enrichment scores and visualize them using enrichplot::gseaplot()
  10. Save plots using ggsave()
  11. Write GSEA results to file

Note that the decision to use log2FoldChange instead of the t-statistic to handle duplicate gene identifiers was made based on the stats available in the differential expression results file we are using in this example.

Note also that the decision to use all gene sets related to the species rather than isolate to use just the hallmark gene sets (like we do in the microarray example) was made based on a lack of GSEA results generated when using the hallmark gene sets with this particular dataset.

Concerns/Questions for reviewers:

The code in this PR that differs from the microarray GSEA example (which mainly includes the code related to reading in all the gene sets rather than hallmark gene sets, the removal of duplicate gene identifiers using the higher absolute log2 fold change value, and the plots).

I added the prompt DRAFT mainly before places where the context is rough and will be refined in upcoming PRs and REVIEW before code which could benefit from review at this stage and is not identical to the microarray example.

Analysis Pull Request Check List (roughly in order):

Content checks

  • All {{BLANKS}} have been replaced with the correct content.
  • Sources are cited
  • Seed is set (if applicable)

Formatting Checks

  • Removed any manual numbering of sections.
  • Removed any instances of chunk naming.
  • Comments and documentation are up to date.
  • All links have been checked and are properly formatted.

Add datasets to S3

Docker/Snakemake rendering components

  • Added the .html link to the navigation bar.
  • Any not yet added packages needed for this analysis have been added to the Dockerfile and it successfully builds.

cansavvy and others added 4 commits October 22, 2020 11:04
* Adding in some style with css

* Use css magic

* Try making the navbar blue

* Add survey link

* Make font smaller

* Need a comma

* Change to normalizePath

* normalizepath separate step references.bib

* Move references.bib to component folder

* Made ccs modifications, added logo file

Made changes to css/navbar.html
Tried to add the logo but it but it cuts out and not sure how to make it decent.

* Resolve render-notebooks.R conflict

* Remove testing html from file diff

* uncommented mobile nav

Co-authored-by: dvenprasad <[email protected]>
* Adding in some style with css

* Use css magic

* Try making the navbar blue

* Add survey link

* Make font smaller

* Need a comma

* Change to normalizePath

* normalizepath separate step references.bib

* Move references.bib to component folder

* Update github actions to reflect staging branch (#311)

* Update github actions to reflect staging branch

* Add libglpk40 to Dockerfile

* Make it gh-pages-stages!

* Remove dockerfile change that should have been on its own all along

* Does this work?

* Declare a uses

* Switch how env is declared

* Force it to run so we can test it

* try no curly brackets

* What's up with the branch

* Move to bash if instead

* Need quotes?

* forgot a `then`

* Try dollar signs

* Doesn't like the `.`?

* Use curly brackets

* Try ${GITHUB_REF}

* Try ${BRANCH_NAME}

* try ${GITHUB_REF#refs/*/}

* use jashapiro suggestion

* Change to base ref

* Change back to `github.ref`

* Get rid of PR `on:`

* Try another test

* Docker dep fix: Add lib package 40 thing that clusterprofiler needs (#316)

* Add lib package 40 thing that clusterprofiler needs

* Try adding options(warn = 2)

* Test if options(warn =2) means it breaks like it should

* Revert "Test if options(warn =2) means it breaks like it should"

This reverts commit d9f688f.

* Revert "Try another test"

This reverts commit 845cf1a.

* Add google analytics to renderings (#314)

* Try adding google analytics

* Add to header using includes

* temporary file snuck in there

* Restore master version so they aren't in the review

* Let's call an html file and html file

* Docker dep fix: Add lib package 40 thing that clusterprofiler needs (#316)

* Add lib package 40 thing that clusterprofiler needs

* Try adding options(warn = 2)

* Test if options(warn =2) means it breaks like it should

* Revert "Test if options(warn =2) means it breaks like it should"

This reverts commit d9f688f.

* Only push if we are in master.

For simplicity, we will now run this even if the dockerfile hasn't changed.

* Add test target

* test staging workflow with this branch

* back to latest tag

* Try separate push step

* change tags to test push

* Revert "change tags to test push"

This reverts commit 6a38574.

* Remove this branch from triggers

* Push staging, retag and push master

Okay, so the branch name is now inaccurate, but that is fine...

* Made ccs modifications, added logo file

Made changes to css/navbar.html
Tried to add the logo but it but it cuts out and not sure how to make it decent.

* Resolve render-notebooks.R conflict

* Remove testing html from file diff

* uncommented mobile nav

* Update scripts/render-notebooks.R

* Add some issue templates (#319)

* Add some rough draft issue templates

* Incorporate cbethell review

* Get rid of `Other` labels that aren't useful

* Update diagrams showing how microarray/RNA-seq work  (#326)

* Mechanics for CSS file and navbar add feedback URL (#303)

* Adding in some style with css

* Use css magic

* Try making the navbar blue

* Add survey link

* Make font smaller

* Need a comma

* Change to normalizePath

* normalizepath separate step references.bib

* Move references.bib to component folder

* Made ccs modifications, added logo file

Made changes to css/navbar.html
Tried to add the logo but it but it cuts out and not sure how to make it decent.

* Resolve render-notebooks.R conflict

* Remove testing html from file diff

* uncommented mobile nav

Co-authored-by: dvenprasad <[email protected]>

* Update microarray and RNAseq overview figures


- add context re figures
- change .jpg to .png for consistency

* Revert "Mechanics for CSS file and navbar add feedback URL (#303)"

This reverts commit 8b81fdd.

* update links to diagrams

* @dvenprasad updated figure spacing

* add the right updated figure

* replace section of link to figures with updated commit id

* incorporate @cansavvy's suggested changes

Co-authored-by: Candace Savonen <[email protected]>
Co-authored-by: dvenprasad <[email protected]>

Co-authored-by: Joshua Shapiro <[email protected]>
Co-authored-by: dvenprasad <[email protected]>
Co-authored-by: Chante Bethell <[email protected]>
@cbethell cbethell assigned cansavvy and unassigned cansavvy Dec 2, 2020
Copy link
Contributor

@cansavvy cansavvy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as "big picture" reviews go, I don't have many comments, just think we should discuss a bit about what stat from DESeq2 differential expression to use.


**IDENTICAL TO MICROARRAY EXAMPLE**

This particular example analysis shows how you can use gene set enrichment analysis (GSEA) to detect situations where all genes in a predefined gene set change in a coordinated way, detecting even small statistical but coordinated changes between two biological states.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will also still needing some revamping though - #354 ?

dplyr::filter(gene_symbol %in% dup_gene_symbols) %>%
dplyr::arrange(gene_symbol)
```
**IDENTICAL TO MICROARRAY EXAMPLE** (except using absolute log2 fold change value instead of t-statistic value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fold changes could sometimes be deceiving in some instances; you could have one big sample that drives things for some genes.

What about using -log10(padj) instead? Or lfcSE would be better than plain log2 fold change too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested whether or not using lfcSE rather than log2FoldChange would make a difference and there were indeed differences where the data frame filtered to handle duplicate gene identifiers is not identical, and the most positive and most negative pathways according to NES are also not identical.

That said, I visited the DESeq2 docs and found that the lfcShrink() function (that we implemented the RNA-seq differential expression example) is expected to handle instances where one big sample may drive things by shrinking the fold changes to account for genes with very low counts and highly variable counts. So this should not be an issue here.

Seeing as it appears that using log2FoldChange is more standard in exploring DESeq2 results downstream (at least visually as seen in this RNA-seq workflow vignette and the DESeq2 vignette and other smaller examples I found), I am inclined to stick with using log2FoldChange here.

I will also note, that when using lfcSE, we get the following warning message from the GSEA() function:

Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize,
 gseaParam, : All values in the stats vector are greater than zero and scoreType
 is "std", maybe you should switch to scoreType = "pos".

which of course can be handled by adding the suggested argument scoreType = "pos" but I am not sure is ideal.

For your reference @cansavvy, (and visualization of the varying plots), here is the rendered html using lfcSE and here is the rendered html using log2FoldChange.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lfcShrink() function (that we implemented the RNA-seq differential expression example) is expected to handle instances where one big sample may drive things by shrinking the fold changes to account for genes with very low counts and highly variable counts.

Good to know about lfcShrink()! This is good to know!

@jaclyn-taroni
Copy link
Member

Having looked at #408 (the plots, specifically), I get the sense that something might be wrong with the results we're getting. In an ideal world, we would have "caught" that here in this draft pull request. On #405, we're making the decision to use all the gene sets (which may or may not contribute to the strange results, we can't tell without trying something else). This is to say that I think this might need a step back or more holistic look: Is it the choice of gene sets? Is something subtle wrong with the identifier mapping? What about the stats and do we need to use another dataset?

The question is "How should we proceed?" What comes to mind is to keep iterating here on this draft until we have something that we think is returning reasonable results, then #405 and #408 can be updated (or closed and new PRs started as deemed appropriate). Thoughts?

@cbethell
Copy link
Contributor Author

cbethell commented Dec 4, 2020

Having looked at #408 (the plots, specifically), I get the sense that something might be wrong with the results we're getting. In an ideal world, we would have "caught" that here in this draft pull request. On #405, we're making the decision to use all the gene sets (which may or may not contribute to the strange results, we can't tell without trying something else). This is to say that I think this might need a step back or more holistic look: Is it the choice of gene sets? Is something subtle wrong with the identifier mapping? What about the stats and do we need to use another dataset?

The question is "How should we proceed?" What comes to mind is to keep iterating here on this draft until we have something that we think is returning reasonable results, then #405 and #408 can be updated (or closed and new PRs started as deemed appropriate). Thoughts?

I think the idea of iterating here on this draft sounds good. In response to the comment on #408, I did some googling and found a closed issue referencing this warning message here. According to said issue, the solution is to rank the log2FoldChange values and use the rankings of the values in the GSEA() function. I started to implement this over on that branch so I can move those steps over to this branch and try using the KEGG gene sets to see if that changes anything and go from there. Thoughts?

@jaclyn-taroni
Copy link
Member

That plans sounds good to me @cbethell!

@cbethell
Copy link
Contributor Author

cbethell commented Dec 4, 2020

In the last commit, I implemented the use of the KEGG gene sets and used the rank() function to rank by the log2FoldChange values for use in the GSEA() function. This got rid of the GSEA() function warning message referenced in #408 comment.

It also slightly improved the most positive NES plot. However I believe now that the rankings are all positive values (I also had to use the scoreType = "pos" argument to run GSEA() for this purpose), the most negative NES plot no longer depicts what we would want.

See the rendered html for reference. Any thoughts here @jaclyn-taroni @cansavvy?

@cansavvy
Copy link
Contributor

cansavvy commented Dec 4, 2020

It also slightly improved the most positive NES plot. However I believe now that the rankings are all positive values (I also had to use the scoreType = "pos" argument to run GSEA() for this purpose), the most negative NES plot no longer depicts what we would want.

Can you explain to me why we need scoreType = "pos" and where you found the docs for this argument? The docs I'm finding aren't super explanatory about when this is needed or if its just a preference. (If you've explained this elsewhere, sorry if I missed it, can you comment where that is?)

@cbethell
Copy link
Contributor Author

cbethell commented Dec 4, 2020

Can you explain to me why we need scoreType = "pos" and where you found the docs for this argument? The docs I'm finding aren't super explanatory about when this is needed or if its just a preference. (If you've explained this elsewhere, sorry if I missed it, can you comment where that is?)

Of course @cansavvy , in one of my comments above, I posted the following error:

Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize,
 gseaParam, : All values in the stats vector are greater than zero and scoreType
 is "std", maybe you should switch to scoreType = "pos".

which was prompted by the use of the lfcSE values which were all greater than zero.

Now that we are implemented rankings of the log2FoldChange values, rather than the actual values, we are now facing a situation again where all the values are greater than zero.

The warning message suggested that we use scoreType = "pos" to handle this.

@jaclyn-taroni
Copy link
Member

What does GSEA() do when there are ties? Pick one randomly? That's what I understand you are doing as well and for 3 values that we get warned about I'm not sure it's worth doing something that is less interpretable.

@cbethell
Copy link
Contributor Author

cbethell commented Dec 4, 2020

What does GSEA() do when there are ties? Pick one randomly? That's what I understand you are doing as well and for 3 values that we get warned about I'm not sure it's worth doing something that is less interpretable.

@jaclyn-taroni, according to the GSEA user guide:
"It is strongly recommended to make sure that the data do not include duplicate ranking values because GSEA does not resolve ties. In the case of a tie, the order of genes will be arbitrary, which may or may not produce erroneous results."

That said, perhaps, similarly to what was done in this vignette, we can remove the rank() step and go back to providing the ranked vector of the actual log2 fold change values and make a note that

"The warning produced indicates that there are few genes that have the same fold change and so are ranked equally. fgsea with arbitrarily order determine which comes first in the ranked list. As long as this number is small it shouldn’t significantly effect the results. If the number is large something is suspicious about the fold change results".

However, upon re-running this example notebook without the rank() step, but still using the KEGG gene sets, 0 GSEA results were returned which makes me wonder if we need to revisit the stat that we are using (although log2 fold change seems standard) or the dataset that we are using. It also makes me wonder if something is wrong with the step creating the ranked vector.

@jaclyn-taroni
Copy link
Member

Can you make some plots (enrichplot::gseaplot()) with some arbitrarily selected pathways (all of them should have scores that are not significant) and see what they look like? I think that might help us diagnose a possible issue with the vector or the names.

My gut is telling me that there may just not be significant GSEA results for this dataset (which does happen and may make some sense but I do not recall the details of the experiment we're using) and we may need to use a different dataset. We'd have to figure out how to accomplish that (e.g., replace the dataset in the DGE example or take another approach).

@cbethell
Copy link
Contributor Author

cbethell commented Dec 4, 2020

Can you make some plots (enrichplot::gseaplot()) with some arbitrarily selected pathways (all of them should have scores that are not significant) and see what they look like? I think that might help us diagnose a possible issue with the vector or the names.

My gut is telling me that there may just not be significant GSEA results for this dataset (which does happen and may make some sense but I do not recall the details of the experiment we're using) and we may need to use a different dataset. We'd have to figure out how to accomplish that (e.g., replace the dataset in the DGE example or take another approach).

See the Plot other pathways section of the rendered html for some additional plots with some arbitrarily selected pathways per your comment above @jaclyn-taroni. I am thinking that these plots may mean that we will indeed need to use a different dataset, what do you think?

@jaclyn-taroni
Copy link
Member

I was expecting to see the version without using rank() based on this comment #397 (comment).

@cbethell
Copy link
Contributor Author

cbethell commented Dec 4, 2020

I was expecting to see the version without using rank() based on this comment #397 (comment).

That version produces 0 results with the curated (KEGG) gene sets. Should we go back to using all gene sets in that case?

@cbethell
Copy link
Contributor Author

cbethell commented Dec 4, 2020

I was expecting to see the version without using rank() based on this comment #397 (comment).

Using all gene sets, I plotted all six pathways that were returned by GSEA() using enrichplot::gseaplot().
See the updated rendered html here.

@cbethell
Copy link
Contributor Author

cbethell commented Dec 5, 2020

Upon further investigation, I was unable to get to the bottom of the weird results and plots.
Here's what I tried:

  1. As suggested by @jaclyn-taroni above and in Slack, I plotted some arbitrarily selected pathways (some specifically from the KEGG gene sets) using enrichplot::gseaplot(). See the "Plot other pathways" section at the rendered html here for reference.

Upon plotting the pathways, Jackie noticed that the particular pathway "BENPORATH NOS TARGETS" appeared to have a lot of gene that do not show up in our list of genes, but with further investigation, it appeared that the "BENPORATH NOS TARGETS" pathway has only two genes that do not overlap with our list.

  1. I also tried swapping the gene identifier - using Entrez IDs instead of gene symbols, interestingly this resulted in 0 results in the @result slot of the gsea_results object (so I moved back to using gene symbols)

  2. I also tried chunking out each of the code chunks -- to ensure that each step is doing what one would expect and that there's no mismatching between identifiers and values taking place. More specifically, for example, I tried breaking up the GSEA step

gsea_results <- GSEA(
  geneList = ranked_vector, # ordered ranked gene list
  minGSSize = 25, # minimum gene set size
  maxGSSize = 500, # maximum gene set set
  pvalueCutoff = 0.05, # p value cutoff
  eps = 0, # boundary for calculating the p value
  seed = TRUE, # set seed to make results reproducible
  pAdjustMethod = "BH", # Benjamini-Hochberg correction
  TERM2GENE = dplyr::select(
    hs_gene_sets,
    gs_name,
    entrez_gene
  )
)

into

gene_terms <- hs_gene_sets %>%
  dplyr::select(gs_name,
                entrez_gene)
)

gsea_results <- GSEA(
  geneList = ranked_vector, # ordered ranked gene list
  minGSSize = 25, # minimum gene set size
  maxGSSize = 500, # maximum gene set set
  pvalueCutoff = 0.05, # p value cutoff
  eps = 0, # boundary for calculating the p value
  seed = TRUE, # set seed to make results reproducible
  pAdjustMethod = "BH", # Benjamini-Hochberg correction
  TERM2GENE = gene_terms
)

but that produced the same results and I was unable to locate mismatching when I chunked out the code.

  1. I also tried swapping the gene sets in and out (KEGG vs Hallmark vs other curated gene sets) -- I got 0 results for most all in the @result slot of the gsea_results object so I moved back to using all gene sets

  2. I also tried changing the parameters of the GSEA function like minGSSize and pvalueCutoff but the results did not improve.

I've committed my most recent attempts for reference.

@jaclyn-taroni
Copy link
Member

Okay, having poked around a bit I think I know what's up. I tried setting the pvalueCutoff = 1 in GSEA() and looking at pathways that had core_enrichment in @result that contained more than one or two gene symbols. Pathways with larger core set of genes had gseaplot() output that looked like:

Whereas those that had a low number of genes in the core_enrichment column looked like the KEGG pathways that are already in the notebook on this branch.

Having not looked at the underlying data myself, the bars in the Ranked List Metric panel originally had me concerned that perhaps there was something wrong with the gene symbol mapping and some genes were not being "recognized" as in the pathway. What is in fact happening is that the two genes that do show up in that panel have much higher LFC values relative to the rest of the genes in the pathway (don't mind the NA, I'm being lazy with the 3 genes that are in the pathway but not our dataset):

> summary(ranked_vector[gsea_results@geneSets$KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY])
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max.       NA's 
-0.0000110 -0.0000007  0.0000009  0.0710014  0.0000033  2.0321201          3 

> head(sort(ranked_vector[gsea_results@geneSets$KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY], decreasing = TRUE))
       ACSL1          TNF        SOCS3        CPT1A        STAT3         MTOR 
2.032120e+00 1.333237e+00 1.177425e+00 1.169419e-03 7.170636e-05 1.734270e-05 

So we can't see them in the Ranked List Metric panel because they values so close to zero that they can not be visualized in that particular plot.

But in the KEGG_OXIDATIVE_PHOSPHORYLATION case from above, there's a narrower range of LFC values (again, don't mind the NA!):

> summary(ranked_vector[gsea_results@geneSets$KEGG_OXIDATIVE_PHOSPHORYLATION])
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
-0.000122 -0.000002  0.000000  0.000002  0.000003  0.000216        21 

Which is why many (all?) of the genes have values that are visible in the Ranked List Metric panel and the y-axis range reflects that all the values are pretty close to zero anyway. And going back to the differential expression notebook that produces the table this notebook uses, the visualization we have there seems to generally align with these observations. In retrospect, the plots using the ranks() approach should have tipped me off that the identifier mapping was not the issue.

All that is to say yes, it is the dataset that is the problem when using the LFC values.

Potential next steps

Here are some ideas for how we can move forward:

  1. Swap in a new dataset in 03-rnaseq/differential-expression_rnaseq_01 – maybe SRP123625 for straightforward experimental design + some metadata reasons?
  2. Generate another DGE result table somewhere without altering the 03-rnaseq/differential-expression_rnaseq_01 example – could be via addressing Create RNA-seq example of differential expression with several groups #242, but then you have more than 2 groups that you need to write around/address/what have you in the GSEA notebook.
  3. Use the rank() method with the existing dataset – this is, in my opinion, getting at a slightly different question than when folks use GSEA with some kind of statistic and might encourage things we don't want to implicitly encourage.

These are in order of: I am most enthusiastic about this -> least enthusiastic about this. I haven't thought about it very deeply, so we should discuss.

@cbethell
Copy link
Contributor Author

cbethell commented Dec 7, 2020

Having not looked at the underlying data myself, the bars in the Ranked List Metric panel originally had me concerned that perhaps there was something wrong with the gene symbol mapping and some genes were not being "recognized" as in the pathway. What is in fact happening is that the two genes that do show up in that panel have much higher LFC values relative to the rest of the genes in the pathway (don't mind the NA, I'm being lazy with the 3 genes that are in the pathway but not our dataset):

> summary(ranked_vector[gsea_results@geneSets$KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY])
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max.       NA's 
-0.0000110 -0.0000007  0.0000009  0.0710014  0.0000033  2.0321201          3 

> head(sort(ranked_vector[gsea_results@geneSets$KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY], decreasing = TRUE))
       ACSL1          TNF        SOCS3        CPT1A        STAT3         MTOR 
2.032120e+00 1.333237e+00 1.177425e+00 1.169419e-03 7.170636e-05 1.734270e-05 

So we can't see them in the Ranked List Metric panel because they values so close to zero that they can not be visualized in that particular plot.

But in the KEGG_OXIDATIVE_PHOSPHORYLATION case from above, there's a narrower range of LFC values (again, don't mind the NA!):

> summary(ranked_vector[gsea_results@geneSets$KEGG_OXIDATIVE_PHOSPHORYLATION])
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
-0.000122 -0.000002  0.000000  0.000002  0.000003  0.000216        21 

Which is why many (all?) of the genes have values that are visible in the Ranked List Metric panel and the y-axis range reflects that all the values are pretty close to zero anyway. And going back to the differential expression notebook that produces the table this notebook uses, the visualization we have there seems to generally align with these observations. In retrospect, the plots using the ranks() approach should have tipped me off that the identifier mapping was not the issue.

All that is to say yes, it is the dataset that is the problem when using the LFC values.

Thank you @jaclyn-taroni for digging into this and breaking down the explanation of what's going on here as you did above! I double checked your above explanation locally and the results do make much more sense now!

Potential next steps

Here are some ideas for how we can move forward:

  1. Swap in a new dataset in 03-rnaseq/differential-expression_rnaseq_01 – maybe SRP123625 for straightforward experimental design + some metadata reasons?
  2. Generate another DGE result table somewhere without altering the 03-rnaseq/differential-expression_rnaseq_01 example – could be via addressing Create RNA-seq example of differential expression with several groups #242, but then you have more than 2 groups that you need to write around/address/what have you in the GSEA notebook.
  3. Use the rank() method with the existing dataset – this is, in my opinion, getting at a slightly different question than when folks use GSEA with some kind of statistic and might encourage things we don't want to implicitly encourage.

These are in order of: I am most enthusiastic about this -> least enthusiastic about this. I haven't thought about it very deeply, so we should discuss.

Looks like moving forward, the steps would be to swap in the results file from 03-rnaseq/differential-expression_rnaseq_01 (updated with the new dataset SRP123625 as discussed via Slack) and re-run what's in this PR.
Once we confirm that the changes look as we would expect, we can propagate them to PRs #405 and #408.
Would you agree?

@jaclyn-taroni
Copy link
Member

@cbethell my preference if it were me might be to close #405 and #408 and start new, but I think that's really up to you! Otherwise, yes I agree – wait for the upcoming PR for 03-rnaseq/differential-expression_rnaseq_01 that's using SRP123625 to go in and then pick up again here.

@cbethell
Copy link
Contributor Author

cbethell commented Dec 7, 2020

@cbethell my preference if it were me might be to close #405 and #408 and start new, but I think that's really up to you! Otherwise, yes I agree – wait for the upcoming PR for 03-rnaseq/differential-expression_rnaseq_01 that's using SRP123625 to go in and then pick up again here.

Good point! I agree with starting fresh for the PRs to be merged 👍

@cbethell
Copy link
Contributor Author

cbethell commented Dec 7, 2020

In the last commit, I swapped the differential results file with the new output file generated as result of the input dataset update in PR #416.

Looks like the plots look a lot more like what we would expect to see in this example workflow!
See the rendered html for reference.
I will note however, that the warning message mentioned in this comment on PR #408, still exists.

There are ties in the preranked stats (0.19% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.leading edge analysis...

This is a result of duplicates in the log2 fold change values, since we are not using rank() here.
Do we want to leave a note, similar to the one the authors left in this vignette:

"The warning produced indicates that there are few genes that have the same fold change and so are ranked equally. fgsea with arbitrarily order determine which comes first in the ranked list. As long as this number is small it shouldn’t significantly effect the results. If the number is large something is suspicious about the fold change results."

or do we now want to dig into how to handle this?

@jaclyn-taroni
Copy link
Member

Yes, this does look better! I think we can continue without rank() and add a note about the warning.

@cansavvy
Copy link
Contributor

cansavvy commented Dec 7, 2020

1) Is there a way that we can have the values carried out to more significant digits in such as way that the values would be different? Or are the values just literally exactly the same? Now that I've seen Jackie's response, forget this idea.

  1. If 1 doesn't lead to a resolution (and there's no other obvious fixes), I feel like a note similar to that seems fine to me. 0.19% is not so bad.

@cansavvy
Copy link
Contributor

Are we done with this? 🎉 ❓

@jaclyn-taroni
Copy link
Member

Yup!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants