AlexsLemonade · cbethell · Nov 6, 2020 · Oct 22, 2020 · Oct 26, 2020 · Oct 30, 2020
diff --git a/02-microarray/pathway-analysis_microarray_03_gsea.Rmd b/02-microarray/pathway-analysis_microarray_03_gsea.Rmd
@@ -270,6 +270,25 @@ Let's see a preview of `dge_mapped_df`.
 head(dge_mapped_df)
 ```
 
+Now let's check to see if we have any Entrez IDs that mapped to multiple Ensembl IDs.
+
+```{r}
+any(duplicated(dge_mapped_df$entrez_id))
+```
+
+Looks like we do have duplicated Entrez IDs.
+We do not want duplicated gene identifiers for the GSEA steps later, so for each duplicated Entrez ID, let's keep only the row with the highest t-statistic value.
+
+```{r}
+filtered_dge_mapped_df <- dge_mapped_df %>%
+ # Sort so that highest t-statistic values are at the top
+ dplyr::arrange(dplyr::desc(t)) %>%
+ # Keep only the first (highest t-statistic) row for each Entrez ID
+ dplyr::filter(!duplicated(entrez_id))
+```
+
+Note, however, that this approach has a caveat: genes with duplicated identifiers could be enriched in a particular pathway/gene set, so we may get an overly optimistic view of how perturbed that pathway truly is.
+
 ## Perform gene set enrichment analysis (GSEA)
 
 _Addressed in upcoming PR_

diff --git a/02-microarray/pathway-analysis_microarray_03_gsea.html b/02-microarray/pathway-analysis_microarray_03_gsea.html
@@ -3938,7 +3938,7 @@ <h2><span class="header-section-number">4.2</span> Import data</h2>
 <span id="cb31-3"><a href="#cb31-3"></a><span class="co"># desired gene list results</span></span>
 <span id="cb31-4"><a href="#cb31-4"></a>dge_df &lt;-<span class="st"> </span>readr<span class="op">::</span><span class="kw">read_tsv</span>(dge_url)</span></code></pre></div>
 <pre><code>## 
-## ── Column specification ────────────────────────────────────────────────────────────────────
+## ── Column specification ───────────────────────────────────────────────────────────────────────────
 ## cols(
 ## Gene = col_character(),
 ## logFC = col_double(),
@@ -4025,6 +4025,16 @@ <h2><span class="header-section-number">4.4</span> Gene identifier conversion</h
 {"columns":[{"label":[""],"name":["_rn_"],"type":[""],"align":["left"]},{"label":["Ensembl"],"name":[1],"type":["chr"],"align":["left"]},{"label":["entrez_id"],"name":[2],"type":["chr"],"align":["left"]},{"label":["logFC"],"name":[3],"type":["dbl"],"align":["right"]},{"label":["AveExpr"],"name":[4],"type":["dbl"],"align":["right"]},{"label":["t"],"name":[5],"type":["dbl"],"align":["right"]},{"label":["P.Value"],"name":[6],"type":["dbl"],"align":["right"]},{"label":["adj.P.Val"],"name":[7],"type":["dbl"],"align":["right"]},{"label":["B"],"name":[8],"type":["dbl"],"align":["right"]}],"data":[{"1":"ENSDARG00000104315","2":"555053","3":"0.9797633","4":"0.5623370","5":"20.22172","6":"2.965470e-09","7":"2.735943e-05","8":"11.031530","_rn_":"1"},{"1":"ENSDARG00000101341","2":"334310","3":"-0.6505860","4":"1.0051319","5":"-15.88369","6":"2.895358e-08","7":"1.335629e-04","8":"9.305940","_rn_":"2"},{"1":"ENSDARG00000034503","2":"140633","3":"0.9219813","4":"0.6515454","5":"14.48634","6":"6.842701e-08","7":"1.875194e-04","8":"8.593576","_rn_":"3"},{"1":"ENSDARG00000014945","2":"407728","3":"0.5699025","4":"0.7756560","5":"13.88657","6":"1.013543e-07","7":"1.875194e-04","8":"8.258341","_rn_":"4"},{"1":"ENSDARG00000019113","2":"325366","3":"-0.6196192","4":"0.8643650","5":"-13.88257","6":"1.016255e-07","7":"1.875194e-04","8":"8.256041","_rn_":"5"},{"1":"ENSDARG00000005799","2":"402894","3":"-0.9383565","4":"1.1067382","5":"-12.51418","6":"2.648122e-07","7":"3.651949e-04","8":"7.415053","_rn_":"6"}],"options":{"columns":{"min":{},"max":[10]},"rows":{"min":[10],"max":[10]},"pages":{}}}
  </script>
 </div>
+<p>Now let’s check to see if we have any Entrez IDs that mapped to multiple Ensembl IDs.</p>
+<div class="sourceCode" id="cb42"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb42-1"><a href="#cb42-1"></a><span class="kw">any</span>(<span class="kw">duplicated</span>(dge_mapped_df<span class="op">$</span>entrez_id))</span></code></pre></div>
+<pre><code>## [1] TRUE</code></pre>
+<p>Looks like we do have duplicated Entrez IDs. We do not want duplicated gene identifiers for the GSEA steps later, so for each duplicated Entrez ID, let’s keep only the row with the highest t-statistic value.</p>
+<div class="sourceCode" id="cb44"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb44-1"><a href="#cb44-1"></a>filtered_dge_mapped_df &lt;-<span class="st"> </span>dge_mapped_df <span class="op">%&gt;%</span></span>
+<span id="cb44-2"><a href="#cb44-2"></a><span class="st"> </span><span class="co"># Sort so that highest t-statistic values are at the top</span></span>
+<span id="cb44-3"><a href="#cb44-3"></a><span class="st"> </span>dplyr<span class="op">::</span><span class="kw">arrange</span>(dplyr<span class="op">::</span><span class="kw">desc</span>(t)) <span class="op">%&gt;%</span></span>
+<span id="cb44-4"><a href="#cb44-4"></a><span class="st"> </span><span class="co"># Keep only the first (highest t-statistic) row for each Entrez ID</span></span>
+<span id="cb44-5"><a href="#cb44-5"></a><span class="st"> </span>dplyr<span class="op">::</span><span class="kw">filter</span>(<span class="op">!</span><span class="kw">duplicated</span>(entrez_id))</span></code></pre></div>
+<p>Note, however, that this approach has a caveat: genes with duplicated identifiers could be enriched in a particular pathway/gene set, so we may get an overly optimistic view of how perturbed that pathway truly is.</p>
 </div>
 <div id="perform-gene-set-enrichment-analysis-gsea" class="section level2">
 <h2><span class="header-section-number">4.5</span> Perform gene set enrichment analysis (GSEA)</h2>
@@ -4047,8 +4057,8 @@ <h1><span class="header-section-number">5</span> Resources for further learning<
 <div id="session-info" class="section level1">
 <h1><span class="header-section-number">6</span> Session info</h1>
 <p>At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.</p>
-<div class="sourceCode" id="cb42"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb42-1"><a href="#cb42-1"></a><span class="co"># Print session info</span></span>
-<span id="cb42-2"><a href="#cb42-2"></a>sessioninfo<span class="op">::</span><span class="kw">session_info</span>()</span></code></pre></div>
+<div class="sourceCode" id="cb45"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb45-1"><a href="#cb45-1"></a><span class="co"># Print session info</span></span>
+<span id="cb45-2"><a href="#cb45-2"></a>sessioninfo<span class="op">::</span><span class="kw">session_info</span>()</span></code></pre></div>
 <pre><code>## ─ Session info ───────────────────────────────────────────────────────────────
 ## setting value 
 ## version R version 4.0.2 (2020-06-22)
@@ -4059,7 +4069,7 @@ <h1><span class="header-section-number">6</span> Session info</h1>
 ## collate en_US.UTF-8 
 ## ctype en_US.UTF-8 
 ## tz Etc/UTC 
-## date 2020-10-30 
+## date 2020-11-04 
 ## 
 ## ─ Packages ───────────────────────────────────────────────────────────────────
 ## package * version date lib source