Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pr 1 of 2: Add Microarray Pathway Analysis - GSEA example #345

Merged
merged 18 commits into from
Nov 6, 2020
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 19 additions & 0 deletions 02-microarray/pathway-analysis_microarray_03_gsea.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -270,6 +270,25 @@ Let's see a preview of `dge_mapped_df`.
head(dge_mapped_df)
```

Now let's check to see if we have any Entrez IDs that mapped to multiple Ensembl IDs.
cbethell marked this conversation as resolved.
Show resolved Hide resolved

```{r}
any(duplicated(dge_mapped_df$entrez_id))
```

Looks like we do have duplicated Entrez IDs.
We do not want duplicated gene identifiers for the GSEA steps later, so let's keep the Entrez IDs associated with the higher t-statistic value.
cbethell marked this conversation as resolved.
Show resolved Hide resolved

```{r}
filtered_dge_mapped_df <- dge_mapped_df %>%
# Sort so that highest t-statistic values are at the top
dplyr::arrange(dplyr::desc(t)) %>%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed this in my first review. I would expect this to be based on absolute value, i.e., whichever value is most likely to be highly- or lowly-ranked (or further away from the center of the ranking depending on how you'd like to talk about it) #345 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, that does make the most sense here. I overlooked this in the first comment on #345 but believe I implemented it in the most recent commit. Please let me know if I missed an important step in the implementation @jaclyn-taroni.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth noting, when running the GSEA() step included in PR #347 throws the following error when the vector is sorted based on absolute value:

Error in GSEA_internal(geneList = geneList, exponent = exponent, minGSSize = minGSSize, : geneList should be a decreasing sorted vector...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not saying we should sort the vector we pass GSEA() by absolute value (provided I am following you correctly), only that we should select the duplicate instances with a greater absolute value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah gotcha! Makes even more sense now thinking about it 👍

# Filter out the duplicated rows
dplyr::filter(!duplicated(entrez_id))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think distinct() is a more direct version of the dplyr::filter(!duplicated()) you have here.

Second question, can we prove this to ourselves a bit? Perhaps as simply as printing out one of the duplicated entrez IDs and their t values before and after? (I don't want to add too much length to these steps, but I also think its good to make data removal steps proved and clear).

Copy link
Contributor Author

@cbethell cbethell Nov 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I opted to use dplyr::filter(!duplicated(entrez_id)) is because dplyr::distinct(entrez_id) returns only the column entrez_id while dplyr::distinct() returns all the rows containing duplicate identifiers (since their t values etc. are different). Perhaps my implementation of dplyr::distinct() is incorrect in this case?

Also, I agree with your second point here. While developing, I used the following to find the duplicate ids and their associated data as a sanity check:

dge_mapped_df %>% dplyr::filter(duplicated(entrez_id))

However, this returns just one of the rows with each of the duplicate ids (I manually searched the before and after data frames for the associated data using the exact entrez_id value returned).

Perhaps I can include the step to print out the below output and use dge_mapped_df %>% dplyr::filter(entrez_id == 336702) as a sanity check?
I implemented this plan in the last commit.
What do you think? Do you have any suggestions to truncate this?

Screen Shot 2020-11-04 at 12 48 04 PM

Copy link
Contributor

@cansavvy cansavvy Nov 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dplyr::distinct(entrez_id) returns only the column entrez_id

Ah. There's an .keep_all = TRUE argument for distinct() that you need to use to not drop columns.

Also, I agree with your second point here. While developing, I used the following to find the duplicate ids and their associated data as a sanity check:
dge_mapped_df %>% dplyr::filter(duplicated(entrez_id))

I was trying to find a straight forward way of doing this without having to make a separate duplicate entrez ids object, but I didn't come up with anything that's great. I was hoping duplicated() had an option to return values directly so you could use an %in% but it doesn't. I also looked to see if tidyverse had a reverse of distinct() but it doesn't seem to. If we installed another package (which I don't want to do) we could use janitor::get_dupes() but I don't find that worth having users install another package for.

So we are left with doing the manual preview you used here -- which I think may be the simplest route for users to follow and still get the point across. OR, this kind of thing:

dup_entrez_ids <- dge_mapped_df %>% 
  dplyr::filter(duplicated(entrez_id)) %>%
  dplyr::pull(entrez_id)

where you then have to use dup_entrez_ids to retrieve things, but you'd still have to do an arrange.

I think we should just stick with your simple and effective use of an example entrez id like 336702.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the most recent commit, I went ahead and replaced the dplyr::filter(!duplicated(entrez_id)) step with dplyr::distinct(entrez_id, .keep_all = TRUE) as you suggested @cansavvy, and left the subsequent steps as is because I also believe it is not worth having users install another package for. I do wish distinct() had a reverse function but I believe what we currently have is the next best simple yet effective solution.

```

Note however, that a caveat in using this approach is that the genes that have duplicate. identifiers could be enriched in a particular pathway/gene set and we may get an overly optimistic view of how perturbed that pathway truly is.
cbethell marked this conversation as resolved.
Show resolved Hide resolved

## Perform gene set enrichment analysis (GSEA)

_Addressed in upcoming PR_
Expand Down
18 changes: 14 additions & 4 deletions 02-microarray/pathway-analysis_microarray_03_gsea.html
Original file line number Diff line number Diff line change
Expand Up @@ -3938,7 +3938,7 @@ <h2><span class="header-section-number">4.2</span> Import data</h2>
<span id="cb31-3"><a href="#cb31-3"></a><span class="co"># desired gene list results</span></span>
<span id="cb31-4"><a href="#cb31-4"></a>dge_df &lt;-<span class="st"> </span>readr<span class="op">::</span><span class="kw">read_tsv</span>(dge_url)</span></code></pre></div>
<pre><code>##
## ── Column specification ────────────────────────────────────────────────────────────────────
## ── Column specification ───────────────────────────────────────────────────────────────────────────
## cols(
## Gene = col_character(),
## logFC = col_double(),
Expand Down Expand Up @@ -4025,6 +4025,16 @@ <h2><span class="header-section-number">4.4</span> Gene identifier conversion</h
{"columns":[{"label":[""],"name":["_rn_"],"type":[""],"align":["left"]},{"label":["Ensembl"],"name":[1],"type":["chr"],"align":["left"]},{"label":["entrez_id"],"name":[2],"type":["chr"],"align":["left"]},{"label":["logFC"],"name":[3],"type":["dbl"],"align":["right"]},{"label":["AveExpr"],"name":[4],"type":["dbl"],"align":["right"]},{"label":["t"],"name":[5],"type":["dbl"],"align":["right"]},{"label":["P.Value"],"name":[6],"type":["dbl"],"align":["right"]},{"label":["adj.P.Val"],"name":[7],"type":["dbl"],"align":["right"]},{"label":["B"],"name":[8],"type":["dbl"],"align":["right"]}],"data":[{"1":"ENSDARG00000104315","2":"555053","3":"0.9797633","4":"0.5623370","5":"20.22172","6":"2.965470e-09","7":"2.735943e-05","8":"11.031530","_rn_":"1"},{"1":"ENSDARG00000101341","2":"334310","3":"-0.6505860","4":"1.0051319","5":"-15.88369","6":"2.895358e-08","7":"1.335629e-04","8":"9.305940","_rn_":"2"},{"1":"ENSDARG00000034503","2":"140633","3":"0.9219813","4":"0.6515454","5":"14.48634","6":"6.842701e-08","7":"1.875194e-04","8":"8.593576","_rn_":"3"},{"1":"ENSDARG00000014945","2":"407728","3":"0.5699025","4":"0.7756560","5":"13.88657","6":"1.013543e-07","7":"1.875194e-04","8":"8.258341","_rn_":"4"},{"1":"ENSDARG00000019113","2":"325366","3":"-0.6196192","4":"0.8643650","5":"-13.88257","6":"1.016255e-07","7":"1.875194e-04","8":"8.256041","_rn_":"5"},{"1":"ENSDARG00000005799","2":"402894","3":"-0.9383565","4":"1.1067382","5":"-12.51418","6":"2.648122e-07","7":"3.651949e-04","8":"7.415053","_rn_":"6"}],"options":{"columns":{"min":{},"max":[10]},"rows":{"min":[10],"max":[10]},"pages":{}}}
</script>
</div>
<p>Now let’s check to see if we have any Entrez IDs that mapped to multiple Ensembl IDs.</p>
<div class="sourceCode" id="cb42"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb42-1"><a href="#cb42-1"></a><span class="kw">any</span>(<span class="kw">duplicated</span>(dge_mapped_df<span class="op">$</span>entrez_id))</span></code></pre></div>
<pre><code>## [1] TRUE</code></pre>
<p>Looks like we do have duplicated Entrez IDs. We do not want duplicated gene identifiers for the GSEA steps later, so let’s keep the Entrez IDs associated with the higher t-statistic value.</p>
<div class="sourceCode" id="cb44"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb44-1"><a href="#cb44-1"></a>filtered_dge_mapped_df &lt;-<span class="st"> </span>dge_mapped_df <span class="op">%&gt;%</span></span>
<span id="cb44-2"><a href="#cb44-2"></a><span class="st"> </span><span class="co"># Sort so that highest t-statistic values are at the top</span></span>
<span id="cb44-3"><a href="#cb44-3"></a><span class="st"> </span>dplyr<span class="op">::</span><span class="kw">arrange</span>(dplyr<span class="op">::</span><span class="kw">desc</span>(t)) <span class="op">%&gt;%</span></span>
<span id="cb44-4"><a href="#cb44-4"></a><span class="st"> </span><span class="co"># Filter out the duplicated rows</span></span>
<span id="cb44-5"><a href="#cb44-5"></a><span class="st"> </span>dplyr<span class="op">::</span><span class="kw">filter</span>(<span class="op">!</span><span class="kw">duplicated</span>(entrez_id))</span></code></pre></div>
<p>Note however, that a caveat in using this approach is that the genes that have duplicate. identifiers could be enriched in a particular pathway/gene set and we may get an overly optimistic view of how perturbed that pathway truly is.</p>
</div>
<div id="perform-gene-set-enrichment-analysis-gsea" class="section level2">
<h2><span class="header-section-number">4.5</span> Perform gene set enrichment analysis (GSEA)</h2>
Expand All @@ -4047,8 +4057,8 @@ <h1><span class="header-section-number">5</span> Resources for further learning<
<div id="session-info" class="section level1">
<h1><span class="header-section-number">6</span> Session info</h1>
<p>At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.</p>
<div class="sourceCode" id="cb42"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb42-1"><a href="#cb42-1"></a><span class="co"># Print session info</span></span>
<span id="cb42-2"><a href="#cb42-2"></a>sessioninfo<span class="op">::</span><span class="kw">session_info</span>()</span></code></pre></div>
<div class="sourceCode" id="cb45"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb45-1"><a href="#cb45-1"></a><span class="co"># Print session info</span></span>
<span id="cb45-2"><a href="#cb45-2"></a>sessioninfo<span class="op">::</span><span class="kw">session_info</span>()</span></code></pre></div>
<pre><code>## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.0.2 (2020-06-22)
Expand All @@ -4059,7 +4069,7 @@ <h1><span class="header-section-number">6</span> Session info</h1>
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
## date 2020-10-30
## date 2020-11-04
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
Expand Down