---
title: "GSA: Gene Set Analysis"
author: "`r paste0('<b>Nima Rafati</b>')`"
subtitle: "Workshop on RNA-Seq"
institute: NBIS, SciLifeLab
keywords: bioinformatics, course, scilifelab, nbis
output:
  xaringan::moon_reader:
    encoding: 'UTF-8'
    self_contained: false
    chakra: 'assets/remark-latest.min.js'
    css: 'assets/slide.css'
    lib_dir: libs
    nature:
      ratio: '4:3'
      highlightLanguage: r
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
      slideNumberFormat: "%current%/%total%"
    include: NULL
---
```{r, include = FALSE}
# Load the packages
library(ggplot2)
# Functions
# Total within-cluster sum of squares for a k-means fit with k clusters
calculate_wcss <- function(data, k) {
  kmeans_model <- kmeans(data, centers = k, nstart = 20)
  return(kmeans_model$tot.withinss)
}
```
```{r ora-log10p, echo = F, eval = F}
# Example ORA results: one row per category with its -log10(p-value)
ora_results <- data.frame(
  Category = c("Pathway A", "Pathway B", "Pathway C", "Pathway D", "Pathway E", "Pathway F"),
  log10.p.value = c(1, 1, 2, 3, 4, 4) # Example data
)
# Create a bar plot with categories ordered by -log10(p-value)
ggplot(ora_results, aes(x = reorder(Category, log10.p.value), y = log10.p.value)) +
  geom_col(fill = "skyblue") +
  coord_flip() + # Flip the coordinates to make the plot horizontal
  labs(x = "Pathway", y = "-log10(p-value)", title = "Overrepresentation Analysis Results") +
  theme_bw() +
  theme(panel.grid = element_blank()) # Applied after theme_bw() so it is not overridden
```
---
name: intro
## Introduction
- What do the identified DEGs do?
- How can we link them to phenotypes/diseases/biological features we study?
- We can do that by exploring their functions and the pathways in which they are involved.
- Differential expression analysis identifies a list of genes, but is it feasible to explore the function of each gene manually?
- There are several approaches, and the analysis can be extended depending on the data available.
- At the transcriptome level:
  - **Gene set analysis (GSA)**
---
name: GSA
## Why GSA?
- Biological interpretation of the results: from gene list to biological insights!
- Reduce complexity: identify the key biological processes affected by the experiment or condition.
- Integrate external information.
- Cross-experiment comparisons: results can be compared across different studies and experimental platforms.
```{r pathview-example, echo = F, fig.align='center', out.width='50%'}
knitr::include_graphics('data/Pathview_example.png')
```
---
name: GS-resource
## Gene set resources/databases
- **Gene Ontology (GO).**
```{r GO, echo = F, fig.align='right', out.width='20%'}
knitr::include_graphics('data/GO.png')
```
- **Pathways (Kyoto Encyclopedia of Genes and Genomes (KEGG)).**
```{r KEGG, echo = F, fig.align='right', out.width='10%'}
knitr::include_graphics('data/KEGG.jpeg')
```
- Protein-protein interaction (PPI)
```{r PPI, echo = F, fig.align='right', out.width='20%'}
knitr::include_graphics('data/PPI.png')
```
- Cell type
- Chromosomal location
- Metabolic and signaling pathways
- Diseases
---
name: GO
# Gene Ontology
- It is a resource that unifies the representation of genes and gene products into hierarchical categories:
  - Biological Process (BP): _Cell cycle, Signal transduction_.
  - Molecular Function (MF): _Protein kinase activity, DNA binding_.
  - Cellular Component (CC): _Nucleus, Cytoplasm_.
- Genes can belong to multiple GO *terms*.
```{r GO-example, echo = F, out.width='100%', fig.align='center'}
knitr::include_graphics('data/GO_ORA_results.png')
```
---
name: pathway1
# Pathway
Can you unravel the mystery of this pathway?
```{r pathway-example1, echo = F, out.width='100%', fig.align='center'}
knitr::include_graphics('data/Tokyo_Metro.jpeg')
```
---
name: pathway
# Pathway
- Biology is complex but has an organized structure.
```{r kegg1, echo=FALSE, results='asis', out.width='40%'}
cat('
<div style="display: flex; justify-content: space-around;">
<img src="', knitr::image_uri("data/Pathway.jpeg"), '" style="width: 45%; margin-right: 10px;" />
</div>
')
```
---
name: kegg
# KEGG
- KEGG is a comprehensive database resource that integrates genomic, chemical, and systemic functional information. It provides data on biological pathways, genomes, diseases, drugs, and chemical substances. KEGG is widely used for bioinformatics research, including the study of gene functions and networks.
```{r kegg2, echo=FALSE, results='asis', out.width='50%', fig.align='center'}
knitr::include_graphics('data/KEGG_database_categories.png')
```
---
name: wikipathway
# WikiPathways
- WikiPathways is an open, collaborative platform dedicated to the curation of biological pathways.
```{r wikipathway, echo=FALSE, results='asis', out.width='40%'}
knitr::include_graphics('data/Wikipahtways.svg')
```
- It allows scientists from various fields to contribute to and edit pathway information, offering a wide range of pathways for research and education purposes.
- The database facilitates the visualization and analysis of pathway information to support understanding of complex biological processes.
---
name: Reactome
# Reactome
- Reactome is a curated database of pathways and reactions in human biology.
- It covers various aspects of human biology, including metabolism, signaling, molecular transport, and cellular processes.
- Reactome provides tools for visualization, interpretation, and analysis of pathway data, making it a valuable resource for researchers in genomics and systems biology.
```{r reactome, echo = F, fig.align='center', out.width='80%'}
knitr::include_graphics('data/Reactome_Browser.png')
```
---
name: Transcription Factor
# Transcription Factor databases
- Several databases compile information about transcription factors (TFs), their DNA-binding sites, and the regulatory networks they form:
  - TRANSFAC.
  - JASPAR.
  - ENCODE.
---
name: Hallmark
# Hallmark Gene Set
- The Hallmark gene set is part of the Molecular Signatures Database (MSigDB), which is a collection of annotated gene sets for use with GSEA (Gene Set Enrichment Analysis) software.
- The Hallmark collection distills complex gene signatures into a concise group of gene sets that represent specific and well-defined biological states or processes.
- These gene sets are designed to be universally applicable for annotating gene expression patterns in a wide variety of biological contexts.
```{r msigdb, echo = F, fig.align='center', out.width='50%'}
knitr::include_graphics('data/MSigDB.jpg')
```
---
name: gene-sets
# Where to get gene sets for the analyses?
```{r genesetdb, echo=FALSE, results='asis', out.width='40%'}
cat('
<div style="display: flex; justify-content: space-around;">
<img src="', knitr::image_uri("data/EnrichR.png"), '" style="width: 50%; margin-right: 10px;" />
<img src="', knitr::image_uri("data/MSigDB_database.png"), '" style="width: 50%; margin-right: 10px;" />
</div>
')
```
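---
name: gene-sets-sketch
# Gene sets in R: a minimal sketch
Gene-set collections are commonly distributed as GMT files: one tab-separated line per set, holding the set name, a description, and then the member genes. The sketch below parses such lines into a named list using base R only. The two records and all gene names are hypothetical, made up for illustration.

```r
# Two hypothetical GMT records (set name, description, member genes)
gmt_lines <- c(
  "SET_A\tdescription A\tGENE1\tGENE2\tGENE3",
  "SET_B\tdescription B\tGENE2\tGENE4"
)
# For a real file, read the lines instead, e.g. readLines() on its path

# Split each record on tabs, then drop the name and description fields
fields <- strsplit(gmt_lines, "\t", fixed = TRUE)
gene_sets <- lapply(fields, function(x) x[-(1:2)])
names(gene_sets) <- vapply(fields, `[`, character(1), 1)

# A gene can belong to more than one set (here GENE2)
set_membership <- table(unlist(gene_sets))
```

A named list of character vectors is a simple shape to keep gene sets in: each element is one set, and the names are the set identifiers.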
---
name: GSA-methods
# Gene set analysis methods
- Overrepresentation analysis (ORA):
  - A statistical method for identifying terms (e.g. GO terms or pathways) that are represented in a given gene/protein set more often than expected by chance.
- Gene Set Enrichment Analysis (GSEA):
  - A statistical method that tests whether the genes of a given category (e.g. a pathway) are concentrated at the top or bottom of a ranked gene list, i.e. whether they tend to be upregulated or downregulated together.
---
name: ORA
# ORA
- ORA is based on the hypergeometric test (equivalent to a one-sided Fisher's exact test).
- The selected genes are the differentially expressed genes (DEGs: up or down).
- The category can be a GO term, a pathway, etc.
```{r ora-example, echo=FALSE, fig.align='center', out.width='80%'}
knitr::include_graphics('data/ORA.png')
```
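---
name: ORA-sketch
# ORA: a minimal sketch
The test above can be sketched in base R for a single category. All counts are hypothetical and chosen only for illustration; `N`, `K`, `n` and `k` are names introduced here, not part of any package.

```r
# Hypothetical counts, for illustration only
N <- 20000  # genes in the background (universe)
K <- 200    # background genes annotated to the category
n <- 500    # selected genes (DEGs)
k <- 15     # DEGs annotated to the category

# P(X >= k) under the hypergeometric distribution
p_hyper <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

# The same question as a one-sided Fisher's exact test on a 2x2 table
tab <- matrix(
  c(k, K - k, n - k, N - K - n + k),
  nrow = 2,
  dimnames = list(c("DEG", "not DEG"), c("in category", "not in category"))
)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value
```

Both calls return the same p-value. In a real analysis the test is repeated for every category, so the p-values should be corrected for multiple testing, for example with `p.adjust()`.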
---
name: ORA1
# ORA
```{r ora-example1, echo=FALSE, fig.align='center', out.width='80%'}
knitr::include_graphics('data/ORA_Paulo_2019.png')
```
---
name: GSEA1
# GSEA
- In GSEA there is no prior selection of genes (such as DEGs).
- Genes are ranked by logFC, and their distribution is tested with a statistic adapted from the Kolmogorov-Smirnov test. The test calculates an enrichment score (ES) for each predefined gene set, reflecting the degree to which the genes in the set are overrepresented at the extremes (top or bottom) of the ranked list. In other words, the ES is the maximum deviation from zero of a running sum computed along the ranked list.
```{r gsea-example, echo=FALSE, fig.align='center', out.width='60%'}
knitr::include_graphics('data/GSEA.png')
```
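---
name: GSEA-sketch
# GSEA: a minimal sketch of the ES
The enrichment score can be sketched as a running sum over the ranked list: it steps up at genes in the set and down at genes outside it, and the ES is the largest deviation from zero. The ranked logFC values and gene names below are hypothetical, and `enrichment_score()` is a helper written for this slide, not a function from a GSEA package. It assumes at least one gene of the set is present in the ranked list.

```r
# Hypothetical ranked list: logFC values sorted in decreasing order
ranked <- c(g1 = 2.5, g2 = 1.8, g3 = 1.1, g4 = 0.4,
            g5 = -0.3, g6 = -0.9, g7 = -1.6, g8 = -2.2)
gene_set <- c("g1", "g2", "g4")

enrichment_score <- function(ranked, gene_set, p = 1) {
  in_set <- names(ranked) %in% gene_set
  # Step up at hits, weighted by |logFC|^p and scaled to sum to 1
  hit <- ifelse(in_set, abs(ranked)^p, 0)
  hit <- hit / sum(hit)
  # Step down by a constant amount at misses, also summing to 1
  miss <- ifelse(in_set, 0, 1 / sum(!in_set))
  running <- cumsum(hit - miss)
  # ES: the running-sum value that deviates most from zero
  unname(running[which.max(abs(running))])
}

es <- enrichment_score(ranked, gene_set)
```

A positive ES means the set is concentrated at the top of the list (upregulated genes), a negative ES at the bottom.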
---
name: GSEA2
# GSEA
- A few notes:
  - The ES is not directly comparable across the tested pathways/terms, so use the Normalized Enrichment Score (NES).
  - Some genes are involved in several pathways, which can bias interpretation.
  - As an alternative, topology-based methods have been introduced that take gene-set interactions into account ([Ma et al., 2019](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3146-1)).
```{r gsea-notes, echo=FALSE, fig.align='center', out.width='60%'}
knitr::include_graphics('data/Gene_set_interaction.png')
```
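---
name: NES-sketch
# NES: a minimal sketch
One way to picture the normalization: divide the observed ES by the mean ES of a null distribution with the same sign. The sketch below builds the null from random gene sets of the same size, which is a simplification chosen for this toy example; permuting sample labels is the other common choice. The data and the helper `es_fun()` are hypothetical.

```r
set.seed(1)
# Hypothetical ranked logFC values and gene set
ranked <- c(g1 = 2.5, g2 = 1.8, g3 = 1.1, g4 = 0.4,
            g5 = -0.3, g6 = -0.9, g7 = -1.6, g8 = -2.2)
gene_set <- c("g1", "g2", "g4")

# Running-sum enrichment score, weighted by |logFC|
es_fun <- function(ranked, gene_set) {
  in_set <- names(ranked) %in% gene_set
  hit <- ifelse(in_set, abs(ranked), 0) / sum(abs(ranked[in_set]))
  miss <- ifelse(in_set, 0, 1 / sum(!in_set))
  running <- cumsum(hit - miss)
  unname(running[which.max(abs(running))])
}

es_obs <- es_fun(ranked, gene_set)

# Null distribution: ES of random gene sets of the same size
es_null <- replicate(
  1000,
  es_fun(ranked, sample(names(ranked), length(gene_set)))
)

# Normalize by the mean null ES that shares the sign of the observed ES
same_sign <- es_null[sign(es_null) == sign(es_obs)]
nes <- es_obs / mean(abs(same_sign))
```

Because each set is scaled by its own null distribution, NES values can be compared across gene sets of different sizes.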
---
name: summary
# Considerations for GSA
- Be mindful of choosing appropriate thresholds when identifying differentially expressed genes (DEGs) for further analysis.
- Gene-set names can be misleading.
- GSEA is sensitive to the size of the gene set.
- ORA does not account for the fact that genes do not contribute equally to a biological process or pathway: genes are treated in a binary fashion (in a pathway or not in a pathway).
- The relationship between genes and gene sets is not linear: upregulation/downregulation of genes does not always increase/decrease the activity of a given pathway. The genes may instead suppress/activate that pathway.
- Analyzing outcomes within a pathway that contains both upregulated and downregulated genes can be challenging.
---
name: end_slide
class: end-slide, middle
count: false
# Thank you. Questions?