-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathlecture9.Rmd
268 lines (213 loc) · 7.19 KB
/
lecture9.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
---
title: 'SFB 1036 Course on Data Analysis: Lesson 9'
author: "Simon Anders"
output:
html_document:
df_print: paged
---
# Gene Set Enrichment Analysis
We load the saved data from last time
```{r}
library( tidyverse )
library( DESeq2 )
load( "lecture8.rda" )
```
# Gene Set Enrichment
Here are all the upregulated genes
What do all the significantly upregulated genes have in common? Are certain groups of genes
overrepresented among them? The `clusterProfiler` package (Yu et al., [Omics 16:284](http://doi.org/10.1089/omi.2011.0118), 2012) allows us to compare with many data bases of gene
groups ("categories").
```{r}
library( clusterProfiler )
```
Let's start with the most popular of these group collections, the Gene Ontology ([Nucl Acid Res, 45:D331](http://doi.org/10.1093/nar/gkw1108), 2017). It is divided into three sub-ontologies, for
cellular components ("CC"), molecular functions ("MF"), and biological processes ("BP"). The assignment
of human genes to the GO catogies is one of the many gene annotation information in Bioconductor's
Homo sapients organism database package
```{r}
library( org.Hs.eg.db )
```
We can ask clusterProfiler to check for enrichment of the significantly upregulated genes in the
CC sub-ontology:
```{r}
res %>%
filter( signif, log2FoldChange > 0 ) %>%
pull( "EnsemblID" ) %>%
enrichGO(
OrgDb = org.Hs.eg.db,
keyType = 'ENSEMBL',
ont = "CC",
universe = ( res %>% filter( !is.na(padj) ) %>% pull( "EnsemblID" ) ) ) ->
egoCC
egoCC
```
About the `universe` option: This is the list of all genes for whcih we had enought data to actually test for differential expression. We use thse for which an adjusted p value has been reported. This
is non-obvious, and we will have to discuss p-value adjustment and independent hypothesis filtering before explaining.
clusterProfile offers several visualization methods. Here is one:
```{r}
dotplot(egoCC)
```
GO categories are organized in a [DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph), so, we should look at that, too:
```{r}
goplot( egoCC )
```
Be sure to check out the [clusterProfiler vignette](https://www.bioconductor.org/packages/release/bioc/vignettes/clusterProfiler/inst/doc/clusterProfiler.html) to learn more about the package's possibilities:
```{r eval=FALSE}
openVignette( "clusterProfiler" )
```
Let's look at one, `emapplot`, which shows us how much overlap there is between related categories
```{r}
emapplot( egoCC )
```
Also, don't forget to also check enrichment for the other two sub-ontologies, and for
the down-regulated gene.
There are other useful gene categories, for example the Moleculare Signature Database (MSigDb) (Liberson et al., [Bioinformatics, 27:1739](https://doi.org/10.1093/bioinformatics/btr260), 2011), and within it,
the Hallmark Gene Set Set Collection (Liberzon et al., [Cell Systems, 1:417](https://doi.org/10.1016/j.cels.2015.12.004), 2015)
Let's download the Hallmark collection from the [MsigDb web page](http://software.broadinstitute.org/gsea/msigdb/collections.jsp). It's the file `h.all.v6.2.symbols.gmt`. As always, have a look at the file first. You'll notice that each line starts with the name of a gene set, followed by a web link to a description and then a list of gene names.
The `read.gmt` function of clusterProfiler can read such files.
```{r}
gmtH <- read.gmt( "RNASeq/airway/h.all.v6.2.symbols.gmt" )
head(gmtH)
```
Instead of the `enrichGO` function, we now use clusterProfiler's more generic `enricher` function:
```{r}
enrH <- enricher(
gene = ( res %>% filter( signif & log2FoldChange>0 ) %>% pull( "Gene name" ) ),
TERM2GENE = gmtH,
universe = ( res %>% filter( !is.na(padj) ) %>% pull( "Gene name" ) )
)
dotplot( enrH )
```
NEXT: Do we believe that? Do it by hand.
```{r}
gmtH %>%
filter( ont == "HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION" ) %>%
pull( gene ) ->
emtGenes
```
```{r}
res %>%
ggplot( ) +
geom_point(
aes(
x = baseMean,
y = log2FoldChange,
col = interaction(
signif,
`Gene name` %in% emtGenes )
)
) +
scale_x_log10() +
scale_y_continuous( oob = scales::squish )
```
Explaining the test:
```{r}
res %>%
filter( !is.na( padj ) ) %>% # universe
mutate(
signifUp = signif & log2FoldChange > 0,
emtGene = `Gene name` %in% emtGenes ) %>%
group_by( signifUp, emtGene ) %>%
tally() %>%
spread( emtGene, n ) %>%
column_to_rownames( "signifUp" ) %>%
fisher.test
```
## LFC Shrinkage
Which genes react most strongly to dexamethasone? Not necessarily the ones at the top of the MA plot:
```{r}
plotMA( results(dds), ylim = c( -6, 6 ) )
```
This is because fold changes are exaggerated for weakly expressed or otherwise variable genes. We can
compensate by shrinking the reported log fold changes to "trade bias for variance". We will discuss this in more
detail next time.
```{r}
res2 <- lfcShrink( dds, coef = "dexamethasoneTRUE", type = "apeglm" )
res2
```
```{r}
plotMA( res2, ylim = c( -6, 6 ) )
```
Let's look at the new top genes
```{r}
res2 %>%
as_tibble( rownames = "EnsemblID" ) %>%
left_join(
read_tsv( "RNASeq/airway/mart_export.txt" ),
by = c( "EnsemblID" = "Gene stable ID" ) ) %>%
arrange( -log2FoldChange )
```
We can also redo our interactvie plot
```{r}
library( rlc )
openPage( useViewer=FALSE, layout="table2x2" )
lc_scatter(
dat(
x = res2$baseMean,
y = res2$log2FoldChange,
colorValue = !is.na(res2$padj) & res2$padj < .1,
label = res2$`Gene name`,
on_click = function(k) { gene <<- k; updateCharts("A2") }
),
place = "A1",
logScaleX = 10,
size = 2,
colourLegendTitle = "significant"
)
lc_scatter(
dat(
x = colData(dds)$cell_line,
y = log10( 1 + counts( dds, normalized=TRUE )[ gene, ] ),
colourValue = colData(dds)$treatment
),
place = "A2",
domainY = c( 0, 5 )
)
```
How about now using just the genes with LFC > 1.5 in an GO enrichment analysis?
```{r}
res2 %>%
as_tibble( rownames = "EnsemblID" ) %>%
filter( log2FoldChange > 1.5 ) %>%
pull( "EnsemblID" ) %>%
enrichGO(
OrgDb = org.Hs.eg.db,
keyType = 'ENSEMBL',
ont = "CC",
universe = ( res %>% filter( baseMean > 8 ) %>% pull( "EnsemblID" ) ) ) ->
egoCC2
egoCC2
```
How about the hallmark set?
```{r}
res2 %>%
as_tibble( rownames = "EnsemblID" ) %>%
filter( log2FoldChange > 1.5 ) %>%
left_join(
read_tsv( "RNASeq/airway/mart_export.txt" ),
by = c( "EnsemblID" = "Gene stable ID" ) ) %>%
pull( "Gene name" ) %>%
enricher(
TERM2GENE = gmtH,
universe = ( res %>% filter( baseMean > 8 ) %>% pull( "Gene name" ) ) ) ->
egoH2
egoH2
```
Instead of a Fisher test, we can also do a Kolmogorov-Smirnov-style test,
which is independent of a threshold, by ranking the genes by shrunken fold change.
```{r}
res2 %>%
as_tibble( rownames = "EnsemblID" ) %>%
filter( baseMean > 8 ) %>%
left_join(
read_tsv( "RNASeq/airway/mart_export.txt" ),
by = c( "EnsemblID" = "Gene stable ID" ) ) %>%
arrange( -log2FoldChange ) %>%
dplyr::select( `EnsemblID`, log2FoldChange ) %>%
deframe %>%
gseGO( OrgDb = org.Hs.eg.db,
keyType = 'ENSEMBL',
ont = "CC",
nPerm = 1000 ) ->
gseCC
```