Pr 1 of 2: Add Microarray Pathway Analysis - GSEA example #345
I wonder whether we want to spend this much time on the gene ID mapping, since we have separate examples for that. We could skip straight to `multiVals = "filter"`, since that is essentially what you are doing here manually. That being said, we may not want to drop data here, and I don't think we are particular about which gene ID is used. So maybe we should simplify: use `multiVals = "first"`, tell users to see the mapping examples for more information, and then move on? https://www.rdocumentation.org/packages/AnnotationDbi/versions/1.30.1/topics/AnnotationDb-objects
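To make the difference between the two options concrete, here is a minimal base R sketch that imitates the two behaviors on a toy one-to-many mapping. It does not call AnnotationDbi, and the IDs and object names are hypothetical; it also assumes the lookup goes from Ensembl keys to Entrez IDs.

```r
# Toy one-to-many mapping (hypothetical IDs): each Ensembl ID maps to
# one or more Entrez IDs, as a one-to-many lookup can return.
ensembl_to_entrez <- list(
  ENSG_A = c("101"),
  ENSG_B = c("202", "203"),
  ENSG_C = c("304")
)

# Like multiVals = "first": keep the first match for every key
first_match <- vapply(ensembl_to_entrez, function(ids) ids[1], character(1))

# Like multiVals = "filter": drop every key that has more than one match
single_only <- ensembl_to_entrez[lengths(ensembl_to_entrez) == 1]
filter_match <- unlist(single_only)

first_match
filter_match
```

With `"first"` every key keeps a value, whereas with `"filter"` the ambiguous key (`ENSG_B` here) is dropped along with its data.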
I believe I implemented `multiVals = "first"` as you suggest here in the last commit. I used an `inner_join()` to join the expression data so as not to have NAs in the `entrez_id` column (which may pose an issue later when running `GSEA()`). Please let me know if this is what you intended @cansavvy or if I should make any further changes here.
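As a rough illustration of why an inner join leaves no NAs in `entrez_id`, here is a small sketch. It uses base R `merge()` rather than the `inner_join()` from the notebook so that it runs without extra packages; the data frames, IDs, and values are made up.

```r
# Hypothetical differential expression results keyed by Ensembl ID
stats_df <- data.frame(
  ensembl_id = c("ENSG_A", "ENSG_B", "ENSG_C"),
  t = c(2.1, -0.4, 1.3)
)

# Hypothetical mapping table; ENSG_C has no Entrez ID, so it has no row
mapped_df <- data.frame(
  ensembl_id = c("ENSG_A", "ENSG_B"),
  entrez_id = c("101", "202")
)

# merge() with the default all = FALSE keeps only rows present in both
# tables, which is the inner join behavior
joined_df <- merge(mapped_df, stats_df, by = "ensembl_id")

# No NAs can appear in entrez_id because unmapped genes were dropped
anyNA(joined_df$entrez_id)
```

A left join on `stats_df` would instead keep `ENSG_C` with `NA` in `entrez_id`, which is the situation the inner join avoids.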
I will note, however, that this still leaves us with two duplicated Entrez IDs that map to multiple Ensembl IDs, resulting in the following warning message when running `GSEA()` later in the notebook: `There are duplicate gene names, fgsea may produce unexpected results.`
We should probably show users that these duplicates exist, then, and how to deal with them. The fact that there are only two instances of this is kind of annoying, but it's probably good that it comes up so we can show users how to handle it. In other datasets or species it's possible it will come up more (or maybe not at all).
I think we should incorporate two steps (that should maybe be their own chunk):
1. Show users how to test whether there are duplicated Entrez IDs. A TRUE/FALSE check like `any(duplicated())` would probably work.
2. Show one way to decide which Entrez ID's data to keep. There could be a lot of ways to do this, but we will just have to pick one that we think will be generally useful in most contexts.
I think an okay way to do this would be to keep the data for the Entrez gene ID with the higher t value (or lower p value), since it will be of greater interest.
It may be good to get @jashapiro's opinion on this.
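A minimal sketch of the first step, assuming a data frame with an `entrez_id` column (the data and object names here are hypothetical):

```r
# Hypothetical mapped results; Entrez ID "202" appears twice because two
# Ensembl IDs map to it
mapped_df <- data.frame(
  ensembl_id = c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"),
  entrez_id = c("101", "202", "202", "304"),
  t = c(2.1, -0.4, 3.0, 1.3)
)

# TRUE if any Entrez ID occurs more than once
has_duplicates <- any(duplicated(mapped_df$entrez_id))
has_duplicates

# Which Entrez IDs are affected
duplicated_ids <- unique(mapped_df$entrez_id[duplicated(mapped_df$entrez_id)])
duplicated_ids
```

Listing the affected IDs, not just the TRUE/FALSE result, lets users see how many genes the later filtering step will touch.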
Picking whatever entry has the larger absolute value for the stat we use for ranking makes sense to me. (Are we still using LFC?) A caveat we should point out is that the genes that have duplicate identifiers could be enriched in a particular pathway/gene set and you may get an overly optimistic view of how perturbed that pathway is using this approach.
Log fold change is available, and based on the draft PR, LFC will be used for GSEA.
We are using the `t` value now (per your comment on the draft PR @jaclyn-taroni). In the last commit, I added two steps: one to check for duplicate identifiers and the other to sort by `t` and remove the lower duplicate value. Let me know if you think this is the best approach given the suggestions above @cansavvy and @jaclyn-taroni.
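One way the sort-and-remove step could look in base R is sketched below. It assumes the entry with the larger absolute `t` is the one to keep, following the suggestion earlier in this thread; the notebook's actual code may differ, and the data are hypothetical.

```r
# Hypothetical mapped results; Entrez ID "202" appears twice
mapped_df <- data.frame(
  ensembl_id = c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"),
  entrez_id = c("101", "202", "202", "304"),
  t = c(2.1, -0.4, 3.0, 1.3)
)

# Sort so that, within each Entrez ID, the row with the largest |t|
# comes first
ordered_df <- mapped_df[order(abs(mapped_df$t), decreasing = TRUE), ]

# duplicated() flags every occurrence after the first, so negating it
# keeps only the top-ranked row per Entrez ID
dedup_df <- ordered_df[!duplicated(ordered_df$entrez_id), ]
dedup_df
```

As noted above, if the genes with duplicate identifiers are concentrated in one gene set, keeping the more extreme value may make that pathway look more perturbed than it is.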
Did you find any evidence to support my thought that t would be more "standard?" That was mostly based on recollection.
Linking a comment that answers this question from PR 2 of 2 #347 here:
"""
The decision to create the pre-ranked gene vector based on the t-statistic rather than log fold change was explored based on this comment from the draft PR and eventually made based on what is recommended in Discovering statistically significant pathways in expression profiling studies and the explanation from a biostars forum which says "Try to understand how they relate to the question you're interested in e.g. if you're most interested in effect size then the fold change is what you should use but if you're more interested in statistical significance then look for one of the statistics taking into consideration the assumptions they make e.g. t-test" as I believe we are encouraging users to look at statistical significance here.
"""
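For reference, a pre-ranked gene vector based on the t-statistic can be sketched as a named numeric vector sorted in decreasing order, which is the input shape pre-ranked GSEA functions generally expect. The data frame and column names below are hypothetical, and the input is assumed to be deduplicated already.

```r
# Hypothetical deduplicated results: one row per Entrez ID
dedup_df <- data.frame(
  entrez_id = c("101", "202", "304"),
  t = c(2.1, 3.0, -1.3)
)

# Named numeric vector: values are t-statistics, names are Entrez IDs
ranked_t <- dedup_df$t
names(ranked_t) <- dedup_df$entrez_id

# Sort from most positive to most negative; sort() keeps the names
ranked_t <- sort(ranked_t, decreasing = TRUE)
ranked_t
```

Swapping `t` for a log fold change column would rank by effect size instead, which is the trade-off described in the quoted comment.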