Pr 1 of 2: Add Microarray Pathway Analysis - GSEA example #345
I missed this in my first review. I would expect this to be based on absolute value, i.e., whichever value is most likely to be highly or lowly ranked (or, depending on how you'd like to talk about it, furthest from the center of the ranking). #345 (comment)
Gotcha, that does make the most sense here. I overlooked this in the first comment on #345, but I believe I implemented it in the most recent commit. Please let me know if I missed an important step in the implementation, @jaclyn-taroni.
Worth noting: when the vector is sorted based on absolute value, the `GSEA()` step included in PR #347 throws the following error: `Error in GSEA_internal(geneList = geneList, exponent = exponent, minGSSize = minGSSize, : geneList should be a decreasing sorted vector...`
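The error arises because `GSEA()` requires its input to be a named numeric vector sorted in decreasing order by the signed statistic, not by absolute value. A minimal sketch of building such a vector (the `dge_mapped_df`, `entrez_id`, and `t` names follow the discussion in this thread; the values are made up):

```r
library(dplyr)

# Made-up stand-in for the real differential expression results
dge_mapped_df <- tibble(
  entrez_id = c("1017", "336702", "999"),
  t         = c(1.5, -3.2, 0.8)
)

# GSEA() expects a named numeric vector sorted by the signed
# statistic in decreasing order -- not by absolute value
gene_list <- dge_mapped_df$t
names(gene_list) <- dge_mapped_df$entrez_id
gene_list <- sort(gene_list, decreasing = TRUE)
```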
I'm not saying we should sort the vector we pass to `GSEA()` by absolute value (provided I am following you correctly), only that we should select the duplicate instances with the greater absolute value.
Ah, gotcha! Makes even more sense now that I think about it 👍
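The agreed-upon approach can be sketched like this: sort by absolute value only to decide which duplicate row to keep, leaving the final GSEA ranking untouched (the `entrez_id` and `t` column names follow this thread; the toy values are invented):

```r
library(dplyr)

# Toy data: Entrez ID "1017" appears twice with different t values
dge_mapped_df <- tibble(
  entrez_id = c("1017", "1017", "999"),
  t         = c(1.5, -3.2, 0.8)
)

# For each duplicated ID, keep the row with the greater |t|;
# the sort here is only for choosing duplicates, not for GSEA()
dge_mapped_df %>%
  arrange(desc(abs(t))) %>%
  filter(!duplicated(entrez_id))
# The row kept for "1017" is the one with t = -3.2
```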
I think `distinct()` is a more direct version of the `dplyr::filter(!duplicated())` you have here.

Second question: can we prove this to ourselves a bit? Perhaps as simply as printing out one of the duplicated Entrez IDs and their `t` values before and after? (I don't want to add too much length to these steps, but I also think it's good to make data removal steps proven and clear.)
The reason I opted to use `dplyr::filter(!duplicated(entrez_id))` is that `dplyr::distinct(entrez_id)` returns only the `entrez_id` column, while `dplyr::distinct()` returns all the rows containing duplicate identifiers (since their `t` values etc. are different). Perhaps my implementation of `dplyr::distinct()` is incorrect in this case?

Also, I agree with your second point here. While developing, I used the following to find the duplicate IDs and their associated data as a sanity check: `dge_mapped_df %>% dplyr::filter(duplicated(entrez_id))`. However, this returns just one of the rows for each duplicated ID (I manually searched the before and after data frames for the associated data using the exact `entrez_id` value returned). Perhaps I can include a step to print out that output and use `dge_mapped_df %>% dplyr::filter(entrez_id == 336702)` as a sanity check? I implemented this plan in the last commit.

What do you think? Do you have any suggestions to truncate this?
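The `duplicated()` behavior described above can be checked with a toy example: `duplicated()` flags only the second and later occurrences of a value, which is why filtering on it shows one row per duplicated ID rather than every copy (the values here are invented; `336702` is the example ID from this thread):

```r
library(dplyr)

# Toy stand-in for dge_mapped_df
df <- tibble(
  entrez_id = c("336702", "336702", "999"),
  t         = c(2.1, -4.7, 0.3)
)

# duplicated() flags only the second and later occurrences,
# so this returns one row per duplicated ID, not every copy:
df %>% filter(duplicated(entrez_id))

# Filtering on the flagged ID retrieves every copy for comparison:
df %>% filter(entrez_id == "336702")
```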
Ah. There's a `.keep_all = TRUE` argument for `distinct()` that you need to use so it doesn't drop columns.

I was trying to find a straightforward way of doing this without having to make a separate duplicate Entrez IDs object, but I didn't come up with anything that's great. I was hoping `duplicated()` had an option to return values directly so you could use an `%in%`, but it doesn't. I also looked to see if the tidyverse had a reverse of `distinct()`, but it doesn't seem to. If we installed another package (which I don't want to do) we could use `janitor::get_dupes()`, but I don't find that worth having users install another package for.

So we are left with doing the manual preview you used here, which I think may be the simplest route for users to follow and still gets the point across. Or this kind of thing, where you then have to use `dup_entrez_ids` to retrieve things, but you'd still have to do an arrange.

I think we should just stick with your simple and effective use of an example Entrez ID like `336702`.
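The "this kind of thing" alternative mentioned above (a separate object of duplicated IDs, followed by an `%in%` filter and an arrange) might look roughly like this sketch; `dup_entrez_ids` is the only name the comment gives, and the rest is assumed:

```r
library(dplyr)

# Toy stand-in for dge_mapped_df
dge_mapped_df <- tibble(
  entrez_id = c("1017", "1017", "999"),
  t         = c(1.5, -3.2, 0.8)
)

# Pull the IDs that occur more than once ...
dup_entrez_ids <- dge_mapped_df %>%
  filter(duplicated(entrez_id)) %>%
  pull(entrez_id)

# ... then retrieve every copy of those rows, arranged for comparison
dge_mapped_df %>%
  filter(entrez_id %in% dup_entrez_ids) %>%
  arrange(entrez_id)
```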
In the most recent commit, I went ahead and replaced the `dplyr::filter(!duplicated(entrez_id))` step with `dplyr::distinct(entrez_id, .keep_all = TRUE)` as you suggested, @cansavvy, and left the subsequent steps as is, because I also believe it is not worth having users install another package for. I do wish `distinct()` had a reverse function, but I believe what we currently have is the next best simple yet effective solution.