From 76105e94f972e878ee1b6c90d81bb30356ed12ef Mon Sep 17 00:00:00 2001
From: almeidasilvaf <fabricio_almeidasilva@hotmail.com>
Date: Sat, 4 Nov 2023 21:22:37 +0100
Subject: [PATCH] Added section on ORA with examples + added FAQ item on
 SummarizedExperiment objects

---
 vignettes/HybridExpress.Rmd | 147 ++++++++++++++++++++++++++++++++++++
 1 file changed, 147 insertions(+)

diff --git a/vignettes/HybridExpress.Rmd b/vignettes/HybridExpress.Rmd
index 7086daa..a11624a 100644
--- a/vignettes/HybridExpress.Rmd
+++ b/vignettes/HybridExpress.Rmd
@@ -374,11 +374,158 @@ plot_partition_frequencies(exp_partitions)
 plot_partition_frequencies(exp_partitions, group_by = "Class")
 ```
 
+# Overrepresentation analysis of functional terms
+
+Lastly, you'd want to explore whether gene sets of interest (e.g., up-regulated
+genes in each contrast) are enriched in any particular GO term, pathway,
+protein domain, etc. For that, you will use the function `ora()`,
+which performs overrepresentation analysis for a gene set given a data frame 
+of functional annotation for each gene.
+
+Here, we will use an example data set with GO annotation for *C. reinhardtii*
+genes. This data set illustrates how the annotation data frame must be
+shaped: gene ID in the first column, and functional annotations in other
+columns.
+
+```{r}
+# Load example functional annotation (GO terms)
+data(go_chlamy)
+
+head(go_chlamy)
+```
+
+To demonstrate the usage of `ora()`, let's check if we can find enrichment
+for any GO term among genes assigned to class "ADD".
+
+```{r}
+# Get a vector of genes in class "ADD"
+genes_add <- exp_partitions$Gene[exp_partitions$Class == "ADD"]
+head(genes_add)
+
+# Get background genes - genes in the count matrix
+bg <- rownames(se)
+
+# Perform overrepresentation analysis
+ora_add <- ora(genes_add, go_chlamy, background = bg)
+
+# Inspect results
+head(ora_add)
+```
+
+## Example 1: overrepresentation analyses for all expression-based classes
+
+In the example above, we performed an overrepresentation analysis in
+genes associated with class "ADD". What if we wanted to do the same for *all*
+classes?
+
+In that case, you could run `ora()` multiple times by looping over each class.
+In details, you would do the following:
+
+1. For each class, create a vector of genes associated with it;
+2. Run `ora()` to get a data frame with ORA results.
+
+Below you can find an example on how to do it using `lapply()`. Results
+for each class will be stored in elements of a list object.
+
+```{r}
+# Create a list of genes in each class
+genes_by_class <- split(exp_partitions$Gene, exp_partitions$Class)
+
+names(genes_by_class)
+head(genes_by_class$UP)
+
+# Iterate through each class and perform ORA
+ora_classes <- lapply(
+    genes_by_class, # list through which we will iterate
+    ora, # function we will apply to each element of the list
+    annotation = go_chlamy, background = bg # additional arguments to function
+)
+
+# Inspect output
+ora_classes
+```
+
+To do the same for each expression-based category (not class), you'd need
+to replace `Class` with `Category` in the `split()` function (see example 
+above).
+
+## Example 2: overrepresentation analyses for differentially expressed genes
+
+The same you can use `lapply()` to loop through each expression class,
+you can also loop through each contrast in the list returned 
+by `get_deg_list()`, and perform ORA for up- and down-regulated genes.
+Below you can find an example:
+
+```{r}
+# Get up-regulated genes for each contrast
+up_genes <- lapply(deg_list, function(x) rownames(x[x$log2FoldChange > 0, ]))
+names(up_genes)
+head(up_genes$F1_vs_P1)
+
+# Perform ORA
+ora_up <- lapply(
+    up_genes,
+    ora,
+    annotation = go_chlamy, background = bg
+)
+
+ora_up
+```
+
+Likewise, for down-regulated genes, you need to replace the `>` symbol
+with a `<` symbol in the anonymous function to subset rows.
 
 # FAQ {.unnumbered}
 
 1. How do I create a `SummarizedExperiment` object?
 
+A `SummarizedExperiment` is a data structure (an S4 class) that can
+be used to store, in a single object, the following elements:
+
+1. **assay:** A quantitative matrix with features in rows and samples in 
+columns. In the context of `r BiocStyle::Githubpkg("HybridExpress")`, this 
+would be a gene expression matrix (in counts) with genes in rows and samples
+in columns.
+2. **colData:** A data frame of sample metadata with samples in rows
+and variables that describe samples (e.g., tissue, treatment, and other 
+covariates) in columns.
+3. **rowData:** A data frame of gene metadata with genes in rows and
+variables that describe genes (e.g., chromosome name, alternative IDs,
+functional information, etc) in columns.
+
+For this package, you must have an *assay* containing the count matrix
+and a *colData* slot with sample metadata. *rowData* can be present,
+but it is not required. 
+
+To demonstrate how to create a `SummarizedExperiment` object, we will
+extract the *assay* and *colData* from the example object `se_chlamy`
+that comes with this package.
+
+```{r}
+# Get count matrix
+count_matrix <- assay(se_chlamy)
+head(count_matrix)
+
+# Get colData (data frame of sample metadata)
+coldata <- colData(se_chlamy)
+head(coldata)
+```
+
+Once you have these two objects, you can create a `SummarizedExperiment` object
+with:
+
+```{r}
+# Create a SummarizedExperiment object
+new_se <- SummarizedExperiment(
+    assays = list(counts = count_matrix),
+    colData = coldata
+)
+
+new_se
+```
+
+For more details on this data structure, read the vignette
+of the `r BiocStyle::Biocpkg("SummarizedExperiment")` package.
 
 # Session information {.unnumbered}