diff --git a/03-rnaseq/pathway-analysis_rnaseq_02_gsea.Rmd b/03-rnaseq/pathway-analysis_rnaseq_02_gsea.Rmd new file mode 100644 index 00000000..dffe1801 --- /dev/null +++ b/03-rnaseq/pathway-analysis_rnaseq_02_gsea.Rmd @@ -0,0 +1,547 @@ +--- +title: "Gene set enrichment analysis - RNA-seq" +author: "CCDL for ALSF" +date: "December 2020" +output: + html_notebook: + toc: true + toc_float: true + number_sections: true +--- + +# Purpose of this analysis + +This example is one of pathway analysis module set, we recommend looking at the [pathway analysis table below](#how-to-choose-a-pathway-analysis) to help you determine which pathway analysis method is best suited for your purposes. + +This particular example analysis shows how you can use gene set enrichment analysis (GSEA) to detect situations where all genes in a predefined gene set change in a coordinated way, detecting even small statistical but coordinated changes between two biological states. +Genes are ranked from most highly positive to most highly negative, weighted according to their gene-level statistic, which is essential to the calculation of the enrichment score (ES), a pathway-level statistic, for each gene set. +More specifically, an ES is calculated by starting with the most highly ranked genes (based on the gene-level statistic selected for ranking) and increasing the score when a gene is in the pathway and decreasing the score when a gene is not in the pathway. +Normalized enrichment scores (NES) are enrichment scores that are scaled to account for gene sets of different sizes [@Subramanian2005; @Korotkevich2019]. + +⬇️ [**Jump to the analysis code**](#analysis) ⬇️ + +### What is pathway analysis? + +Pathway analysis refers to any one of many techniques that uses predetermined sets of genes that are related or coordinated in their expression in some way (e.g., participate in the same molecular process, are regulated by the same transcription factor) to interpret a high-throughput experiment. +In the context of [refine.bio](https://www.refine.bio/), we use these techniques to analyze and interpret genome-wide gene expression experiments. +The rationale for performing pathway analysis is that looking at the pathway-level may be more biologically meaningful than considering individual genes, especially if a large number of genes are differentially expressed between conditions of interest. +In addition, many relatively small changes in the expression values of genes in the same pathway could lead to a phenotypic outcome and these small changes may go undetected in differential gene expression analysis. + +We highly recommend taking a look at [Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002375) from @Khatri2012 for a more comprehensive overview. We have provided primary publications and documentation of the methods we will introduce below as well as some recommended reading in the [`Resources for further learning` section](#resources-for-further-learning). + +### How to choose a pathway analysis? + +This table summarizes the pathway analyses examples in this module. + +|Analysis|What is required for input|What output looks like |✅ Pros| ⚠️ Cons| +|--------|--------------------------|-----------------------|-------|-------| +|[**ORA (Over-representation Analysis)**](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_01_ora.html)|A list of gene IDs (no stats needed)|A per-pathway hypergeometric test result|- Simple

- Inexpensive computationally to calculate p-values| - Requires arbitrary thresholds and ignores any statistics associated with a gene

- Assumes independence of genes and pathways| +|[**GSEA (Gene Set Enrichment Analysis)**](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html)|A list of genes IDs with gene-level summary statistics|A per-pathway enrichment score|- Includes all genes (no arbitrary threshold!)

- Attempts to measure coordination of genes|- Permutations can be expensive

- Does not account for pathway overlap

- Two-group comparisons not always appropriate/feasible| +|[**GSVA (Gene Set Variation Analysis)**](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_03_gsva.html)|A gene expression matrix (like what you get from refine.bio directly)|Pathway-level scores on a per-sample basis|- Does not require two groups to compare upfront

- Normally distributed scores|- Scores are not a good fit for gene sets that contain genes that go up AND down

- Method doesn’t assign statistical significance itself

- Recommended sample size n > 10| + +# How to run this example + +For general information about our tutorials and the basic software packages you will need, please see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-this-tutorial-is-structured). +We recommend taking a look at our [Resources for Learning R](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#resources-for-learning-r) if you have not written code in R before. + +## Obtain the `.Rmd` file + +To run this example yourself, [download the `.Rmd` for this analysis by clicking this link](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_02_gsea.Rmd). + +Clicking this link will most likely send this to your downloads folder on your computer. +Move this `.Rmd` file to where you would like this example and its files to be stored. + +You can open this `.Rmd` file in RStudio and follow the rest of these steps from there. (See our [section about getting started with R notebooks](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) if you are unfamiliar with `.Rmd` files.) + +## Set up your analysis folders + +Good file organization is helpful for keeping your data analysis project on track! +We have set up some code that will automatically set up a folder structure for you. +Run this next chunk to set up your folders! + +If you have trouble running this chunk, see our [introduction to using `.Rmd`s](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) for more resources and explanations. + +```{r} +# Create the data folder if it doesn't exist +if (!dir.exists("data")) { + dir.create("data") +} + +# Define the file path to the plots directory +plots_dir <- "plots" # Can replace with path to desired output plots directory + +# Create the plots folder if it doesn't exist +if (!dir.exists(plots_dir)) { + dir.create(plots_dir) +} + +# Define the file path to the results directory +results_dir <- "results" # Can replace with path to desired output results directory + +# Create the results folder if it doesn't exist +if (!dir.exists(results_dir)) { + dir.create(results_dir) +} +``` + +In the same place you put this `.Rmd` file, you should now have three new empty folders called `data`, `plots`, and `results`! + +## Obtain the gene set for this example + +In this example, we are using the differential expression results table we obtained from an [example analysis of an acute myeloid leukemia (AML) dataset](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/differential-expression_rnaseq_01.html) using the `DESeq2` package [@Love2014]. +The table contains summary statistics including Ensembl gene IDs, log2 fold change values, and adjusted p-values (FDR in this case). + +We have provided this file for you and the code in this notebook will read in the results that are stored online, but if you'd like to follow the steps for obtaining this results file yourself, we suggest going through [that differential expression analysis example](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/differential-expression_rnaseq_01.html). + +## About the dataset we are using for this example + +For this example analysis, we are using RNA-seq data from an [acute lymphoblastic leukemia (ALL) mouse lymphoid cell model](https://www.refine.bio/experiments/SRP123625) [@Kampen2019]. +All lymphoid mouse cells have human RPL10 but three of the mice have a knock-in R98S mutated RPL10 and three have the human wildtype RPL10. +Differential expression was performed using these knock-in and wildtype mice designations. + +## Check out our file structure! + +Your new analysis folder should contain: + +- The example analysis `.Rmd` you downloaded +- A folder called `data` (currently empty) +- A folder for `plots` (currently empty) +- A folder for `results` (currently empty) + +Your example analysis folder should contain your `.Rmd` and three empty folders (which won't be empty for long!). + +If the concept of a "file path" is unfamiliar to you; we recommend taking a look at our [section about file paths](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#an-important-note-about-file-paths-and-Rmds). + +# Using a different refine.bio dataset with this analysis? + +If you'd like to adapt an example analysis to use a different dataset from [refine.bio](https://www.refine.bio/), we recommend placing the files in the `data/` directory you created and changing the filenames and paths in the notebook to match these files (we've put comments to signify where you would need to change the code). +We suggest saving plots and results to `plots/` and `results/` directories, respectively, as these are automatically created by the notebook. +From here you can customize this analysis example to fit your own scientific questions and preferences. + +*** + +   + +# Gene set enrichment analysis - RNA-seq + +## Install libraries + +See our Getting Started page with [instructions for package installation](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#what-you-need-to-install) for a list of the other software you will need, as well as more tips and resources. + +In this analysis, we will be using [`clusterProfiler`](https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html) package to perform GSEA and the [`msigdbr`](https://cran.r-project.org/web/packages/msigdbr/index.html) package which contains gene sets from the [Molecular Signatures Database (MSigDB)](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp) already in the tidy format required by `clusterProfiler` [@Yu2012; @Igor2020; @Subramanian2005]. + +We will also need the [`org.Mm.eg.db`](http://bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html) package to perform gene identifier conversion [@Carlson2020-human]. + +```{r} +if (!("clusterProfiler" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("clusterProfiler", update = FALSE) +} + +if (!("msigdbr" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("msigdbr", update = FALSE) +} + +if (!("org.Mm.eg.db" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("org.Mm.eg.db", update = FALSE) +} +``` + +Attach the packages we need for this analysis. + +```{r message=FALSE} +# Attach the library +library(clusterProfiler) + +# Package that contains MSigDB gene sets in tidy format +library(msigdbr) + +# Human annotation package we'll use for gene identifier conversion +library(org.Mm.eg.db) + +# We will need this so we can use the pipe: %>% +library(magrittr) +``` + +The GSEA algorithm utilizes random sampling so we are going to set the seed to make our results reproducible. + +```{r} +# Set the seed so our results are reproducible: +set.seed(2020) +``` + +## Import data + +We will read in the differential expression results we will download from online. +These results are from an acute myeloid leukemia (AML) RNA-seq dataset we used for [differential expression analysis](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/differential-expression_rnaseq_01.html) using `DESeq2` [@Love2014]. +The table contains summary statistics including Ensembl gene IDs, log2 fold change values, and adjusted p-values (FDR in this case). +We can identify differentially regulated genes by filtering these results and use this list as input to GSEA. + +Instead of using the URL below, you can use a file path to a TSV file with your desired gene list results. +First we will assign the URL to its own variable called, `dge_url`. + +```{r} +# Define the url to your differential expression results file +dge_url <- "https://refinebio-examples.s3.us-east-2.amazonaws.com/03-rnaseq/results/SRP123625/SRP123625_differential_expression_results.tsv" +``` + +Read in the file that has differential expression results. +Here we are using the URL we set up above, but this can be a local file path instead _i.e._ you can replace `dge_url` in the code below with a path to file you have on your computer like: `file.path("results", "SRP123625_differential_expression_results.tsv")`. + +```{r} +# Read in the contents of your differential expression results file +# `dge_url` can be replaced with a file path to a TSV file with your +# desired gene list results +dge_df <- readr::read_tsv(dge_url) +``` + +`read_tsv()` can read TSV files online and doesn't necessarily require you download the file first. +Let's take a look at what the contrast results from the differential expression analysis looks like. + +```{r} +dge_df +``` + +## Getting familiar with `clusterProfiler`'s options + +Let's take a look at what organisms the package supports. + +```{r} +msigdbr_species() +``` + +MSigDB contains 8 different gene set collections [@Subramanian2005]. + + H: hallmark gene sets + C1: positional gene sets + C2: curated gene sets + C3: motif gene sets + C4: computational gene sets + C5: GO gene sets + C6: oncogenic signatures + C7: immunologic signatures + +The data we're interested in here comes from human samples, so we can obtain just the gene sets (from all of the gene set collections listed above) relevant to _Homo sapiens_ by specifying `species = "Homo sapiens"`. + +```{r} +mm_gene_sets <- msigdbr( + species = "Mus musculus", # Replace with species name relevant to your data + category = "H" +) +``` + +If you run the chunk above with specifying `category = "H"` to the `msigdbr()` function, it will return only the Hallmark gene sets for _Homo sapiens_. +See `?msigdbr` for more options. + +Let's preview what's in `hs_gene_sets`. + +```{r} +head(mm_gene_sets) +``` + +Looks like we have a data frame of gene sets with associated gene symbols and Entrez IDs. + +In our differential expression results data frame, `dge_df` we have Ensembl gene identifiers. +So we will need to convert our Ensembl IDs into either gene symbols or Entrez IDs for GSEA. + +## Gene identifier conversion + +We're going to convert our identifiers in `dge_df` to gene symbols, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers! + +The annotation package `org.Hs.eg.db` contains information for different identifiers [@Carlson2020-human]. +`org.Hs.eg.db` is specific to _Homo sapiens_ -- this is what the `Hs` in the package name is referencing. + +We can see what types of IDs are available to us in an annotation package with `keytypes()`. + +```{r} +keytypes(org.Mm.eg.db) +``` + +Even though we'll use this package to convert from Ensembl gene IDs (`ENSEMBL`) to gene symbols (`SYMBOL`), we could just as easily use it to convert to and from any of these `keytypes()` listed above. + +The function we will use to map from Ensembl gene IDs to gene symbols is called `mapIds()`. + +Let's create a data frame that shows the mapped gene symbols along with the differential expression stats for the respective Ensembl IDs. + +```{r} +# First let's create a mapped data frame we can join to the differential expression stats +dge_mapped_df <- data.frame( + "gene_symbol" = mapIds( + org.Mm.eg.db, # Replace with annotation package for the organism relevant to your data + keys = dge_df$Gene, + column = "SYMBOL", # Replace with the type of gene identifiers you would like to map to + keytype = "ENSEMBL", # Replace with the type of gene identifiers in your data + multiVals = "first" # This will keep only the first mapped value for each Ensembl ID + ) +) %>% + # If an Ensembl gene identifier doesn't map to a gene symbol, drop that + # from the data frame + dplyr::filter(!is.na(gene_symbol)) %>% + # Make an `Ensembl` column to store the rownames + tibble::rownames_to_column("Ensembl") %>% + # Now let's join the rest of the expression data + dplyr::inner_join(dge_df, by = c("Ensembl" = "Gene")) +``` + +This `1:many mapping between keys and columns` message means that some Ensembl gene identifiers map to multiple gene symbols. +In this case, it's also possible that a gene symbol will map to multiple Ensembl IDs. +For the purpose of performing GSEA later in this notebook, we keep only the first mapped IDs. +Take a look at our other gene identifier conversion examples for examples with different species and gene ID types: +[the microarray example](https://alexslemonade.github.io/refinebio-examples/02-microarray/gene-id-annotation_microarray_01_ensembl.html) and [the RNA-seq example](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.html). + +Let's see a preview of `dge_mapped_df`. + +```{r} +head(dge_mapped_df) +``` + +We will want to keep in mind that GSEA requires that genes are ranked from most highly positive to most highly negative and weighted according to their gene-level statistic, which is essential to the calculation of the enrichment score (ES), a pathway-level statistic. +GSEA also requires unique gene identifiers to produce the most accurate results. +This means if any duplicate gene identifiers are found in our dataset, we will want to retain the instance with the higher absolute value as this will be the instance most likely to be ranked most highly negative or most highly positive. +Otherwise the enrichment score results may be skewed and the `GSEA()` function will throw a warning. + +Let's check to see if we have any gene symbols that mapped to multiple Ensembl IDs. + +```{r} +any(duplicated(dge_mapped_df$gene_symbol)) +``` + +Looks like we do have duplicated gene symbols. +Let's find out which gene symbols have been duplicated. + +```{r} +dup_gene_symbols <- dge_mapped_df %>% + dplyr::filter(duplicated(gene_symbol)) %>% + dplyr::pull(gene_symbol) +``` + +Now let's take a look at the rows associated with the duplicated gene symbols. + +```{r} +dge_mapped_df %>% + dplyr::filter(gene_symbol %in% dup_gene_symbols) %>% + dplyr::arrange(gene_symbol) +``` + +We can see that the associated values vary for each row. + +As we mentioned earlier, we will want to remove duplicated gene identifiers in preparation for the GSEA steps later, so let's keep the gene symbols associated with the higher ranked absolute value of the log2 fold change. +GSEA relies on genes' rankings on the basis of this statistic. +Retaining the instance of the gene symbol with the higher absolute value means that we will retain the value that is likely to be more highly- or lowly-ranked or, put another way, the values less likely to be towards the middle of the ranked gene list. +We should keep this decision in mind when interpreting our results. +For example, if the duplicate identifiers tended to be enriched in a particular gene set, we may get an optimistic view of how perturbed that gene set is by preferentially selecting values that have a higher absolute value. + +In the next chunk, we are going to filter out the duplicated row using the `dplyr::distinct()` function. +This will keep the first row with the duplicated value thus keeping the row with the highest ranked absolute value of the log2 fold change. + +```{r} +filtered_dge_mapped_df <- dge_mapped_df %>% + # First let's create a column to hold the rankings of our absolute log2 fold + # change values -- this handles duplicate log2 fold change values + # dplyr::mutate(rank = rank(abs(log2FoldChange), ties.method = "random")) %>% + # Sort so that the highest ranked absolute values of the log2 fold change are at the top + dplyr::arrange(dplyr::desc(abs(log2FoldChange))) %>% + # Filter out the duplicated rows using `dplyr::distinct()` + dplyr::distinct(gene_symbol, .keep_all = TRUE) +``` + +Note that the log2 fold change values we use here for ranking have been transformed by `DESeq2` to account for genes with low counts or highly variable counts. +See the [`DESeq2` package vignette](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) for more information on how `DESeq2` handles the log2 fold change values with the `lfcShrink()` function. + +Let's check to see that we removed the duplicate gene symbols and kept the rows with the higher ranked absolute value of the log2 fold change. + +```{r} +any(duplicated(filtered_dge_mapped_df$gene_symbol)) +``` + +Looks like we were able to successfully get rid of the duplicate gene identifiers and keep the observations with the higher ranked absolute value of the log2 fold change! + +## Perform gene set enrichment analysis (GSEA) + +The goal of GSEA is to detect situations where all genes in a gene set change in a coordinated way, detecting even small statistical but coordinated changes. +This is done by ranking genes within a gene set from most highly positive to most highly negative, weighting them according to their gene-level statistic, to calculate the enrichment score (ES), which is a pathway-level statistic [@clusterProfiler-book]. + +### Determine our pre-ranked genes list + +The `GSEA()` function takes a pre-ranked and sorted named vector of statistics, where the names in the vector are gene identifiers. +We will therefore need to create a gene list with statistics that GSEA will rank by. + +In the next chunk, we will create a named vector ranked based on the rankings of the gene-level log2 fold change values. + +```{r} +# Let's create a named vector of the rankings based on the log2 fold change values +ranked_vector <- filtered_dge_mapped_df$log2FoldChange +names(ranked_vector) <- filtered_dge_mapped_df$gene_symbol + +# We need to sort the ranked vector in descending order here +ranked_vector <- sort(ranked_vector, decreasing = TRUE) +``` + +Let's preview our pre-ranked named vector. + +```{r} +# Look at first entries of the ranked vector +head(ranked_vector) +``` + +### Run GSEA using the `GSEA()` function + +Genes were ranked from most highly positive to most highly negative, weighted according to their gene-level statistic, in the previous section. +In this section, we will implement GSEA to calculate the enrichment score (ES) for each gene set using our pre-ranked gene list. + +We can use the `GSEA()` function to perform GSEA with any generic set of gene sets, but there are several functions for using specific, commonly used gene sets (e.g., `gseKEGG()`). + +```{r} +gene_terms <- mm_gene_sets %>% + dplyr::select( + gs_name, + gene_symbol + ) + +gsea_results <- GSEA( + geneList = ranked_vector, # ordered ranked gene list + minGSSize = 25, # minimum gene set size + maxGSSize = 500, # maximum gene set set + pvalueCutoff = 0.05, # p value cutoff + eps = 0, # boundary for calculating the p value + seed = TRUE, # set seed to make results reproducible + pAdjustMethod = "BH", # Benjamini-Hochberg correction + TERM2GENE = dplyr::select( + mm_gene_sets, + gs_name, + gene_symbol + ) +) +``` + +Let's take a look at the results. + +```{r} +# We can access the results from our `gsea_results` object using `@result` +head(gsea_results@result) +``` + +Significance is assessed by permutating the gene labels of the pre-ranked gene list and recomputing the ES of the gene set for the permutated data, which generates a null distribution for the ES. +The ES for each gene set is then normalized to account for the size of the set, resulting in a normalized enrichment score (NES), and an FDR (false discovery rate) value is calculated to account for multiple hypothesis testing. + +Looks like we have gene sets returned as significant at FDR of `0.05`, according to the `p.adjust` column above. + +The information we're most likely interested in is in the `results` slot. +Let's convert this into a data frame that we can use for further analysis and write to file. + +```{r} +gsea_result_df <- data.frame(gsea_results@result) +``` + +## Visualizing results + +We can visualize GSEA results for individual pathways or gene sets using `enrichplot::gseaplot()`. +Let's take a look at 2 different pathways -- one with a highly positive NES and one with a highly negative NES -- to get more insight into how ES are calculated. +Note that if we didn't have any significant pathways, our visualizations below would show up blank as nothing would have met our `pvalueCutoff` above. + +### Most Positive NES + +Let's look for the gene set with the most positive NES. + +```{r} +gsea_result_df %>% + # Use `slice_max()` to return the top n observation using `NES` as the ordering variable + dplyr::slice_max(NES, n = 3) +``` + +The gene set `HALLMARK_MYC_TARGETS_V2` has the most positive NES score. + +```{r} +most_positive_nes_plot <- enrichplot::gseaplot( + gsea_results, + geneSetID = "HALLMARK_MYC_TARGETS_V2", + title = "HALLMARK_MYC_TARGETS_V2", + color.line = "#0d76ff" +) + +most_positive_nes_plot +``` + +The red dashed line indicates the peak ES score (the ES is the maximum deviation from zero). +As mentioned earlier, an ES is calculated by starting with the most highly ranked genes (according to the gene-level log2 fold change values) and increasing the score when a gene is in the pathway and decreasing the score when a gene is not in the pathway. +In this case, the most highly positive enrichment score's data are being displayed. + +The plots returned by `enrichplot::gseaplot` are ggplots, so we can use `ggplot2::ggsave()` to save them to file. + +Let's save to PNG. + +```{r} +ggplot2::ggsave(file.path(plots_dir, "SRP123625_gsea_enrich_positive_plot.png"), + plot = most_positive_nes_plot +) +``` + +### Most Negative NES + +Let's look for the gene set with the most negative NES. + +```{r} +gsea_result_df %>% + # Use `slice_min()` to return the bottom n observation using `NES` as the ordering + # variable + dplyr::slice_min(NES, n = 3) +``` + +The gene set `HALLMARK_HYPOXIA` has the most negative NES. + +```{r} +most_negative_nes_plot <- enrichplot::gseaplot( + gsea_results, + geneSetID = "HALLMARK_HYPOXIA", + title = "HALLMARK_HYPOXIA", + color.line = "#0d76ff" +) + +most_negative_nes_plot +``` + +Again, the red dashed line here indicates the maximum deviation from zero, in other words, the most negative ES score. +As we know, the ES is calculated by starting with the most highly ranked genes (according to the gene-level log2 fold change values) and increasing the score when a gene is in the pathway and decreasing the score when a gene is not in the pathway. +A negative enrichment score will be returned when not many of the genes are found at the top of the list, as this would mean decreasing the score a great deal thus producing a negative value. +In this case, the most negative enrichment score's data are being displayed. + +Let's save this plot to PNG. + +```{r} +ggplot2::ggsave(file.path(plots_dir, "SRP123625_gsea_enrich_negative_plot.png"), + plot = most_negative_nes_plot +) +``` + +## Write results to file + +```{r} +readr::write_tsv( + gsea_result_df, + file.path( + results_dir, + "SRP123625_gsea_results.tsv" + ) +) +``` + +# Resources for further learning + +- [clusterProfiler paper](https://doi.org/10.1089/omi.2011.0118) [@Yu2012]. +- [clusterProfiler book](https://yulab-smu.github.io/clusterProfiler-book/index.html) [@clusterProfiler-book]. +- [This handy review](https://doi.org/10.1371/journal.pcbi.1002375) which summarizes the different types of pathway analysis and their limitations [@Khatri2012]. +- See this [Broad Institute page](https://www.gsea-msigdb.org/gsea/index.jsp) for more on GSEA and MSigDB [@GSEA-broad-institute]. + +# Session info + +At the end of every analysis, before saving your notebook, we recommend printing out your session info. +This helps make your code more reproducible by recording what versions of software and packages you used to run this. + +```{r} +# Print session info +sessioninfo::session_info() +``` + +# References diff --git a/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html b/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html new file mode 100644 index 00000000..af6a55e3 --- /dev/null +++ b/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html @@ -0,0 +1,4450 @@ + + + + + + + + + + + + + + +Gene set enrichment analysis - RNA-seq + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+
+
+
+
+ +
+ + + + + + + + + +
+

1 Purpose of this analysis

+

This example is one of pathway analysis module set, we recommend looking at the pathway analysis table below to help you determine which pathway analysis method is best suited for your purposes.

+

This particular example analysis shows how you can use gene set enrichment analysis (GSEA) to detect situations where all genes in a predefined gene set change in a coordinated way, detecting even small statistical but coordinated changes between two biological states. Genes are ranked from most highly positive to most highly negative, weighted according to their gene-level statistic, which is essential to the calculation of the enrichment score (ES), a pathway-level statistic, for each gene set. More specifically, an ES is calculated by starting with the most highly ranked genes (based on the gene-level statistic selected for ranking) and increasing the score when a gene is in the pathway and decreasing the score when a gene is not in the pathway. Normalized enrichment scores (NES) are enrichment scores that are scaled to account for gene sets of different sizes (Korotkevich et al. 2019; Subramanian et al. 2005).

+

⬇️ Jump to the analysis code ⬇️

+
+

1.0.1 What is pathway analysis?

+

Pathway analysis refers to any one of many techniques that uses predetermined sets of genes that are related or coordinated in their expression in some way (e.g., participate in the same molecular process, are regulated by the same transcription factor) to interpret a high-throughput experiment. In the context of refine.bio, we use these techniques to analyze and interpret genome-wide gene expression experiments. The rationale for performing pathway analysis is that looking at the pathway-level may be more biologically meaningful than considering individual genes, especially if a large number of genes are differentially expressed between conditions of interest. In addition, many relatively small changes in the expression values of genes in the same pathway could lead to a phenotypic outcome and these small changes may go undetected in differential gene expression analysis.

+

We highly recommend taking a look at Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges from Khatri et al. (2012) for a more comprehensive overview. We have provided primary publications and documentation of the methods we will introduce below as well as some recommended reading in the Resources for further learning section.

+
+
+

1.0.2 How to choose a pathway analysis?

+

This table summarizes the pathway analyses examples in this module.

+ +++++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
AnalysisWhat is required for inputWhat output looks like✅ Pros⚠️ Cons
ORA (Over-representation Analysis)A list of gene IDs (no stats needed)A per-pathway hypergeometric test result- Simple

- Inexpensive computationally to calculate p-values
- Requires arbitrary thresholds and ignores any statistics associated with a gene

- Assumes independence of genes and pathways
GSEA (Gene Set Enrichment Analysis)A list of genes IDs with gene-level summary statisticsA per-pathway enrichment score- Includes all genes (no arbitrary threshold!)

- Attempts to measure coordination of genes
- Permutations can be expensive

- Does not account for pathway overlap

- Two-group comparisons not always appropriate/feasible
GSVA (Gene Set Variation Analysis)A gene expression matrix (like what you get from refine.bio directly)Pathway-level scores on a per-sample basis- Does not require two groups to compare upfront

- Normally distributed scores
- Scores are not a good fit for gene sets that contain genes that go up AND down

- Method doesn’t assign statistical significance itself

- Recommended sample size n > 10
+
+
+
+

2 How to run this example

+

For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.

+
+

2.1 Obtain the .Rmd file

+

To run this example yourself, download the .Rmd for this analysis by clicking this link.

+

Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd file to where you would like this example and its files to be stored.

+

You can open this .Rmd file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd files.)

+
+
+

2.2 Set up your analysis folders

+

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

+

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

+
# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+  dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots" # Can replace with path to desired output plots directory
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+  dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results" # Can replace with path to desired output results directory
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+  dir.create(results_dir)
+}
+

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

+
+
+

2.3 Obtain the gene set for this example

+

In this example, we are using the differential expression results table we obtained from an example analysis of an acute myeloid leukemia (AML) dataset using the DESeq2 package (Love et al. 2014). The table contains summary statistics including Ensembl gene IDs, log2 fold change values, and adjusted p-values (FDR in this case).

+

We have provided this file for you and the code in this notebook will read in the results that are stored online, but if you’d like to follow the steps for obtaining this results file yourself, we suggest going through that differential expression analysis example.

+
+
+

2.4 About the dataset we are using for this example

+

For this example analysis, we are using RNA-seq data from an acute lymphoblastic leukemia (ALL) mouse lymphoid cell model (???). All lymphoid mouse cells have human RPL10 but three of the mice have a knock-in R98S mutated RPL10 and three have the human wildtype RPL10. Differential expression was performed using these knock-in and wildtype mice designations.

+
+
+

2.5 Check out our file structure!

+

Your new analysis folder should contain:

+
    +
  • The example analysis .Rmd you downloaded
    +
  • +
  • A folder called data (currently empty)
  • +
  • A folder for plots (currently empty)
    +
  • +
  • A folder for results (currently empty)
  • +
+

Your example analysis folder should contain your .Rmd and three empty folders (which won’t be empty for long!).

+

If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.

+
+
+
+

3 Using a different refine.bio dataset with this analysis?

+

If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/ directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.

+
+ +

 

+
+
+

4 Gene set enrichment analysis - RNA-seq

+
+

4.1 Install libraries

+

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

+

In this analysis, we will be using clusterProfiler package to perform GSEA and the msigdbr package which contains gene sets from the Molecular Signatures Database (MSigDB) already in the tidy format required by clusterProfiler (Yu et al. 2012; Dolgalev 2020; Subramanian et al. 2005).

+

We will also need the org.Mm.eg.db package to perform gene identifier conversion (Carlson 2020).

+
if (!("clusterProfiler" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("clusterProfiler", update = FALSE)
+}
+
+if (!("msigdbr" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("msigdbr", update = FALSE)
+}
+
+if (!("org.Mm.eg.db" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("org.Mm.eg.db", update = FALSE)
+}
+

Attach the packages we need for this analysis.

+
# Attach the library
+library(clusterProfiler)
+
+# Package that contains MSigDB gene sets in tidy format
+library(msigdbr)
+
+# Human annotation package we'll use for gene identifier conversion
+library(org.Mm.eg.db)
+
+# We will need this so we can use the pipe: %>%
+library(magrittr)
+

The GSEA algorithm utilizes random sampling so we are going to set the seed to make our results reproducible.

+
# Set the seed so our results are reproducible:
+set.seed(2020)
+
+
+

4.2 Import data

+

We will read in the differential expression results we will download from online. These results are from an acute myeloid leukemia (AML) RNA-seq dataset we used for differential expression analysis using DESeq2 (Love et al. 2014). The table contains summary statistics including Ensembl gene IDs, log2 fold change values, and adjusted p-values (FDR in this case). We can identify differentially regulated genes by filtering these results and use this list as input to GSEA.

+

Instead of using the URL below, you can use a file path to a TSV file with your desired gene list results. First we will assign the URL to its own variable called, dge_url.

+
# Define the url to your differential expression results file
+dge_url <- "https://refinebio-examples.s3.us-east-2.amazonaws.com/03-rnaseq/results/SRP123625/SRP123625_differential_expression_results.tsv"
+

Read in the file that has differential expression results. Here we are using the URL we set up above, but this can be a local file path instead i.e. you can replace dge_url in the code below with a path to file you have on your computer like: file.path("results", "SRP123625_differential_expression_results.tsv").

+
# Read in the contents of your differential expression results file
+# `dge_url` can be replaced with a file path to a TSV file with your
+# desired gene list results
+dge_df <- readr::read_tsv(dge_url)
+
## 
+## ── Column specification ────────────────────────────────────────────────────────────────────
+## cols(
+##   Gene = col_character(),
+##   baseMean = col_double(),
+##   log2FoldChange = col_double(),
+##   lfcSE = col_double(),
+##   pvalue = col_double(),
+##   padj = col_double(),
+##   threshold = col_logical()
+## )
+

read_tsv() can read TSV files online and doesn’t necessarily require you download the file first. Let’s take a look at what the contrast results from the differential expression analysis looks like.

+
dge_df
+
+ +
+
+
+

4.3 Getting familiar with clusterProfiler’s options

+

Let’s take a look at what organisms the package supports.

+
msigdbr_species()
+
+ +
+

MSigDB contains 8 different gene set collections (Subramanian et al. 2005).

+
H: hallmark gene sets
+C1: positional gene sets
+C2: curated gene sets
+C3: motif gene sets
+C4: computational gene sets
+C5: GO gene sets
+C6: oncogenic signatures
+C7: immunologic signatures
+

The data we’re interested in here comes from human samples, so we can obtain just the gene sets (from all of the gene set collections listed above) relevant to Homo sapiens by specifying species = "Homo sapiens".

+
mm_gene_sets <- msigdbr(
+  species = "Mus musculus", # Replace with species name relevant to your data
+  category = "H"
+)
+

If you run the chunk above with specifying category = "H" to the msigdbr() function, it will return only the Hallmark gene sets for Homo sapiens. See ?msigdbr for more options.

+

Let’s preview what’s in hs_gene_sets.

+
head(mm_gene_sets)
+
+ +
+

Looks like we have a data frame of gene sets with associated gene symbols and Entrez IDs.

+

In our differential expression results data frame, dge_df we have Ensembl gene identifiers. So we will need to convert our Ensembl IDs into either gene symbols or Entrez IDs for GSEA.

+
+
+

4.4 Gene identifier conversion

+

We’re going to convert our identifiers in dge_df to gene symbols, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers!

+

The annotation package org.Hs.eg.db contains information for different identifiers (Carlson 2020). org.Hs.eg.db is specific to Homo sapiens – this is what the Hs in the package name is referencing.

+

We can see what types of IDs are available to us in an annotation package with keytypes().

+
keytypes(org.Mm.eg.db)
+
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
+##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
+## [11] "GO"           "GOALL"        "IPI"          "MGI"          "ONTOLOGY"    
+## [16] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"         "PROSITE"     
+## [21] "REFSEQ"       "SYMBOL"       "UNIGENE"      "UNIPROT"
+

Even though we’ll use this package to convert from Ensembl gene IDs (ENSEMBL) to gene symbols (SYMBOL), we could just as easily use it to convert to and from any of these keytypes() listed above.

+

The function we will use to map from Ensembl gene IDs to gene symbols is called mapIds().

+

Let’s create a data frame that shows the mapped gene symbols along with the differential expression stats for the respective Ensembl IDs.

+
# First let's create a mapped data frame we can join to the differential expression stats
+dge_mapped_df <- data.frame(
+  "gene_symbol" = mapIds(
+    org.Mm.eg.db, # Replace with annotation package for the organism relevant to your data
+    keys = dge_df$Gene,
+    column = "SYMBOL", # Replace with the type of gene identifiers you would like to map to
+    keytype = "ENSEMBL", # Replace with the type of gene identifiers in your data
+    multiVals = "first" # This will keep only the first mapped value for each Ensembl ID
+  )
+) %>%
+  # If an Ensembl gene identifier doesn't map to a gene symbol, drop that
+  # from the data frame
+  dplyr::filter(!is.na(gene_symbol)) %>%
+  # Make an `Ensembl` column to store the rownames
+  tibble::rownames_to_column("Ensembl") %>%
+  # Now let's join the rest of the expression data
+  dplyr::inner_join(dge_df, by = c("Ensembl" = "Gene"))
+
## 'select()' returned 1:many mapping between keys and columns
+

This 1:many mapping between keys and columns message means that some Ensembl gene identifiers map to multiple gene symbols. In this case, it’s also possible that a gene symbol will map to multiple Ensembl IDs. For the purpose of performing GSEA later in this notebook, we keep only the first mapped IDs. Take a look at our other gene identifier conversion examples for examples with different species and gene ID types: the microarray example and the RNA-seq example.

+

Let’s see a preview of dge_mapped_df.

+
head(dge_mapped_df)
+
+ +
+

We will want to keep in mind that GSEA requires that genes are ranked from most highly positive to most highly negative and weighted according to their gene-level statistic, which is essential to the calculation of the enrichment score (ES), a pathway-level statistic. GSEA also requires unique gene identifiers to produce the most accurate results. This means if any duplicate gene identifiers are found in our dataset, we will want to retain the instance with the higher absolute value as this will be the instance most likely to be ranked most highly negative or most highly positive. Otherwise the enrichment score results may be skewed and the GSEA() function will throw a warning.

+

Let’s check to see if we have any gene symbols that mapped to multiple Ensembl IDs.

+
any(duplicated(dge_mapped_df$gene_symbol))
+
## [1] TRUE
+

Looks like we do have duplicated gene symbols. Let’s find out which gene symbols have been duplicated.

+
dup_gene_symbols <- dge_mapped_df %>%
+  dplyr::filter(duplicated(gene_symbol)) %>%
+  dplyr::pull(gene_symbol)
+

Now let’s take a look at the rows associated with the duplicated gene symbols.

+
dge_mapped_df %>%
+  dplyr::filter(gene_symbol %in% dup_gene_symbols) %>%
+  dplyr::arrange(gene_symbol)
+
+ +
+

We can see that the associated values vary for each row.

+

As we mentioned earlier, we will want to remove duplicated gene identifiers in preparation for the GSEA steps later, so let’s keep the gene symbols associated with the higher ranked absolute value of the log2 fold change. GSEA relies on genes’ rankings on the basis of this statistic. Retaining the instance of the gene symbol with the higher absolute value means that we will retain the value that is likely to be more highly- or lowly-ranked or, put another way, the values less likely to be towards the middle of the ranked gene list. We should keep this decision in mind when interpreting our results. For example, if the duplicate identifiers tended to be enriched in a particular gene set, we may get an optimistic view of how perturbed that gene set is by preferentially selecting values that have a higher absolute value.

+

In the next chunk, we are going to filter out the duplicated row using the dplyr::distinct() function. This will keep the first row with the duplicated value thus keeping the row with the highest ranked absolute value of the log2 fold change.

+
filtered_dge_mapped_df <- dge_mapped_df %>%
+  # First let's create a column to hold the rankings of our absolute log2 fold
+  # change values -- this handles duplicate log2 fold change values
+  # dplyr::mutate(rank = rank(abs(log2FoldChange), ties.method = "random")) %>%
+  # Sort so that the highest ranked absolute values of the log2 fold change are at the top
+  dplyr::arrange(dplyr::desc(abs(log2FoldChange))) %>%
+  # Filter out the duplicated rows using `dplyr::distinct()`
+  dplyr::distinct(gene_symbol, .keep_all = TRUE)
+

Note that the log2 fold change values we use here for ranking have been transformed by DESeq2 to account for genes with low counts or highly variable counts. See the DESeq2 package vignette for more information on how DESeq2 handles the log2 fold change values with the lfcShrink() function.

+

Let’s check to see that we removed the duplicate gene symbols and kept the rows with the higher ranked absolute value of the log2 fold change.

+
any(duplicated(filtered_dge_mapped_df$gene_symbol))
+
## [1] FALSE
+

Looks like we were able to successfully get rid of the duplicate gene identifiers and keep the observations with the higher ranked absolute value of the log2 fold change!

+
+
+

4.5 Perform gene set enrichment analysis (GSEA)

+

The goal of GSEA is to detect situations where all genes in a gene set change in a coordinated way, detecting even small statistical but coordinated changes. This is done by ranking genes within a gene set from most highly positive to most highly negative, weighting them according to their gene-level statistic, to calculate the enrichment score (ES), which is a pathway-level statistic (Yu).

+
+

4.5.1 Determine our pre-ranked genes list

+

The GSEA() function takes a pre-ranked and sorted named vector of statistics, where the names in the vector are gene identifiers. We will therefore need to create a gene list with statistics that GSEA will rank by.

+

In the next chunk, we will create a named vector ranked based on the rankings of the gene-level log2 fold change values.

+
# Let's create a named vector of the rankings based on the log2 fold change values
+ranked_vector <- filtered_dge_mapped_df$log2FoldChange
+names(ranked_vector) <- filtered_dge_mapped_df$gene_symbol
+
+# We need to sort the ranked vector in descending order here
+ranked_vector <- sort(ranked_vector, decreasing = TRUE)
+

Let’s preview our pre-ranked named vector.

+
# Look at first entries of the ranked vector
+head(ranked_vector)
+
##   Lpgat1   Lgals7    Gm973     Bbs7     Clnk   Zfp575 
+## 13.34941 12.64196 12.51824 12.19278 11.52481 10.20900
+
+
+

4.5.2 Run GSEA using the GSEA() function

+

Genes were ranked from most highly positive to most highly negative, weighted according to their gene-level statistic, in the previous section. In this section, we will implement GSEA to calculate the enrichment score (ES) for each gene set using our pre-ranked gene list.

+

We can use the GSEA() function to perform GSEA with any generic set of gene sets, but there are several functions for using specific, commonly used gene sets (e.g., gseKEGG()).

+
gene_terms <- mm_gene_sets %>%
+  dplyr::select(
+    gs_name,
+    gene_symbol
+  )
+
+gsea_results <- GSEA(
+  geneList = ranked_vector, # ordered ranked gene list
+  minGSSize = 25, # minimum gene set size
+  maxGSSize = 500, # maximum gene set set
+  pvalueCutoff = 0.05, # p value cutoff
+  eps = 0, # boundary for calculating the p value
+  seed = TRUE, # set seed to make results reproducible
+  pAdjustMethod = "BH", # Benjamini-Hochberg correction
+  TERM2GENE = dplyr::select(
+    mm_gene_sets,
+    gs_name,
+    gene_symbol
+  )
+)
+
## preparing geneSet collections...
+
## GSEA analysis...
+
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.19% of the list).
+## The order of those tied genes will be arbitrary, which may produce unexpected results.
+
## leading edge analysis...
+
## done...
+

Let’s take a look at the results.

+
# We can access the results from our `gsea_results` object using `@result`
+head(gsea_results@result)
+
+ +
+

Significance is assessed by permutating the gene labels of the pre-ranked gene list and recomputing the ES of the gene set for the permutated data, which generates a null distribution for the ES. The ES for each gene set is then normalized to account for the size of the set, resulting in a normalized enrichment score (NES), and an FDR (false discovery rate) value is calculated to account for multiple hypothesis testing.

+

Looks like we have gene sets returned as significant at FDR of 0.05, according to the p.adjust column above.

+

The information we’re most likely interested in is in the results slot. Let’s convert this into a data frame that we can use for further analysis and write to file.

+
gsea_result_df <- data.frame(gsea_results@result)
+
+
+
+

4.6 Visualizing results

+

We can visualize GSEA results for individual pathways or gene sets using enrichplot::gseaplot(). Let’s take a look at 2 different pathways – one with a highly positive NES and one with a highly negative NES – to get more insight into how ES are calculated. Note that if we didn’t have any significant pathways, our visualizations below would show up blank as nothing would have met our pvalueCutoff above.

+
+

4.6.1 Most Positive NES

+

Let’s look for the gene set with the most positive NES.

+
gsea_result_df %>%
+  # Use `slice_max()` to return the top n observation using `NES` as the ordering variable
+  dplyr::slice_max(NES, n = 3)
+
+ +
+

The gene set HALLMARK_MYC_TARGETS_V2 has the most positive NES score.

+
most_positive_nes_plot <- enrichplot::gseaplot(
+  gsea_results,
+  geneSetID = "HALLMARK_MYC_TARGETS_V2",
+  title = "HALLMARK_MYC_TARGETS_V2",
+  color.line = "#0d76ff"
+)
+
+most_positive_nes_plot
+

+

The red dashed line indicates the peak ES score (the ES is the maximum deviation from zero). As mentioned earlier, an ES is calculated by starting with the most highly ranked genes (according to the gene-level log2 fold change values) and increasing the score when a gene is in the pathway and decreasing the score when a gene is not in the pathway. In this case, the most highly positive enrichment score’s data are being displayed.

+

The plots returned by enrichplot::gseaplot are ggplots, so we can use ggplot2::ggsave() to save them to file.

+

Let’s save to PNG.

+
ggplot2::ggsave(file.path(plots_dir, "SRP123625_gsea_enrich_positive_plot.png"),
+  plot = most_positive_nes_plot
+)
+
## Saving 7 x 5 in image
+
+
+

4.6.2 Most Negative NES

+

Let’s look for the gene set with the most negative NES.

+
gsea_result_df %>%
+  # Use `slice_min()` to return the bottom n observation using `NES` as the ordering
+  # variable
+  dplyr::slice_min(NES, n = 3)
+
+ +
+

The gene set HALLMARK_HYPOXIA has the most negative NES.

+
most_negative_nes_plot <- enrichplot::gseaplot(
+  gsea_results,
+  geneSetID = "HALLMARK_HYPOXIA",
+  title = "HALLMARK_HYPOXIA",
+  color.line = "#0d76ff"
+)
+
+most_negative_nes_plot
+

+

Again, the red dashed line here indicates the maximum deviation from zero, in other words, the most negative ES score. As we know, the ES is calculated by starting with the most highly ranked genes (according to the gene-level log2 fold change values) and increasing the score when a gene is in the pathway and decreasing the score when a gene is not in the pathway. A negative enrichment score will be returned when not many of the genes are found at the top of the list, as this would mean decreasing the score a great deal thus producing a negative value. In this case, the most negative enrichment score’s data are being displayed.

+

Let’s save this plot to PNG.

+
ggplot2::ggsave(file.path(plots_dir, "SRP123625_gsea_enrich_negative_plot.png"),
+  plot = most_negative_nes_plot
+)
+
## Saving 7 x 5 in image
+
+
+
+

4.7 Write results to file

+
readr::write_tsv(
+  gsea_result_df,
+  file.path(
+    results_dir,
+    "SRP123625_gsea_results.tsv"
+  )
+)
+
+
+
+

5 Resources for further learning

+ +
+
+

6 Session info

+

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

+
# Print session info
+sessioninfo::session_info()
+
## ─ Session info ───────────────────────────────────────────────────────────────
+##  setting  value                       
+##  version  R version 4.0.2 (2020-06-22)
+##  os       Ubuntu 20.04 LTS            
+##  system   x86_64, linux-gnu           
+##  ui       X11                         
+##  language (EN)                        
+##  collate  en_US.UTF-8                 
+##  ctype    en_US.UTF-8                 
+##  tz       Etc/UTC                     
+##  date     2020-12-07                  
+## 
+## ─ Packages ───────────────────────────────────────────────────────────────────
+##  package         * version  date       lib source        
+##  AnnotationDbi   * 1.52.0   2020-10-27 [1] Bioconductor  
+##  assertthat        0.2.1    2019-03-21 [1] RSPM (R 4.0.0)
+##  backports         1.1.10   2020-09-15 [1] RSPM (R 4.0.2)
+##  Biobase         * 2.50.0   2020-10-27 [1] Bioconductor  
+##  BiocGenerics    * 0.36.0   2020-10-27 [1] Bioconductor  
+##  BiocManager       1.30.10  2019-11-16 [1] RSPM (R 4.0.0)
+##  BiocParallel      1.24.1   2020-11-06 [1] Bioconductor  
+##  bit               4.0.4    2020-08-04 [1] RSPM (R 4.0.2)
+##  bit64             4.0.5    2020-08-30 [1] RSPM (R 4.0.2)
+##  blob              1.2.1    2020-01-20 [1] RSPM (R 4.0.0)
+##  cli               2.1.0    2020-10-12 [1] RSPM (R 4.0.2)
+##  clusterProfiler * 3.18.0   2020-10-27 [1] Bioconductor  
+##  colorspace        1.4-1    2019-03-18 [1] RSPM (R 4.0.0)
+##  cowplot           1.1.0    2020-09-08 [1] RSPM (R 4.0.2)
+##  crayon            1.3.4    2017-09-16 [1] RSPM (R 4.0.0)
+##  curl              4.3      2019-12-02 [1] RSPM (R 4.0.0)
+##  data.table        1.13.0   2020-07-24 [1] RSPM (R 4.0.2)
+##  DBI               1.1.0    2019-12-15 [1] RSPM (R 4.0.0)
+##  digest            0.6.25   2020-02-23 [1] RSPM (R 4.0.0)
+##  DO.db             2.9      2020-12-02 [1] Bioconductor  
+##  DOSE              3.16.0   2020-10-27 [1] Bioconductor  
+##  downloader        0.4      2015-07-09 [1] RSPM (R 4.0.0)
+##  dplyr             1.0.2    2020-08-18 [1] RSPM (R 4.0.2)
+##  ellipsis          0.3.1    2020-05-15 [1] RSPM (R 4.0.0)
+##  enrichplot        1.10.1   2020-11-14 [1] Bioconductor  
+##  evaluate          0.14     2019-05-28 [1] RSPM (R 4.0.0)
+##  fansi             0.4.1    2020-01-08 [1] RSPM (R 4.0.0)
+##  farver            2.0.3    2020-01-16 [1] RSPM (R 4.0.0)
+##  fastmatch         1.1-0    2017-01-28 [1] RSPM (R 4.0.0)
+##  fgsea             1.16.0   2020-10-27 [1] Bioconductor  
+##  generics          0.0.2    2018-11-29 [1] RSPM (R 4.0.0)
+##  getopt            1.20.3   2019-03-22 [1] RSPM (R 4.0.0)
+##  ggforce           0.3.2    2020-06-23 [1] RSPM (R 4.0.2)
+##  ggplot2           3.3.2    2020-06-19 [1] RSPM (R 4.0.1)
+##  ggraph            2.0.3    2020-05-20 [1] RSPM (R 4.0.2)
+##  ggrepel           0.8.2    2020-03-08 [1] RSPM (R 4.0.2)
+##  glue              1.4.2    2020-08-27 [1] RSPM (R 4.0.2)
+##  GO.db             3.12.1   2020-12-02 [1] Bioconductor  
+##  GOSemSim          2.16.1   2020-10-29 [1] Bioconductor  
+##  graphlayouts      0.7.0    2020-04-25 [1] RSPM (R 4.0.2)
+##  gridExtra         2.3      2017-09-09 [1] RSPM (R 4.0.0)
+##  gtable            0.3.0    2019-03-25 [1] RSPM (R 4.0.0)
+##  hms               0.5.3    2020-01-08 [1] RSPM (R 4.0.0)
+##  htmltools         0.5.0    2020-06-16 [1] RSPM (R 4.0.1)
+##  igraph            1.2.6    2020-10-06 [1] RSPM (R 4.0.2)
+##  IRanges         * 2.24.0   2020-10-27 [1] Bioconductor  
+##  jsonlite          1.7.1    2020-09-07 [1] RSPM (R 4.0.2)
+##  knitr             1.30     2020-09-22 [1] RSPM (R 4.0.2)
+##  labeling          0.3      2014-08-23 [1] RSPM (R 4.0.0)
+##  lattice           0.20-41  2020-04-02 [2] CRAN (R 4.0.2)
+##  lifecycle         0.2.0    2020-03-06 [1] RSPM (R 4.0.0)
+##  magrittr        * 1.5      2014-11-22 [1] RSPM (R 4.0.0)
+##  MASS              7.3-51.6 2020-04-26 [2] CRAN (R 4.0.2)
+##  Matrix            1.2-18   2019-11-27 [2] CRAN (R 4.0.2)
+##  memoise           1.1.0    2017-04-21 [1] RSPM (R 4.0.0)
+##  msigdbr         * 7.2.1    2020-10-02 [1] RSPM (R 4.0.2)
+##  munsell           0.5.0    2018-06-12 [1] RSPM (R 4.0.0)
+##  optparse        * 1.6.6    2020-04-16 [1] RSPM (R 4.0.0)
+##  org.Mm.eg.db    * 3.12.0   2020-12-02 [1] Bioconductor  
+##  pillar            1.4.6    2020-07-10 [1] RSPM (R 4.0.2)
+##  pkgconfig         2.0.3    2019-09-22 [1] RSPM (R 4.0.0)
+##  plyr              1.8.6    2020-03-03 [1] RSPM (R 4.0.2)
+##  polyclip          1.10-0   2019-03-14 [1] RSPM (R 4.0.0)
+##  ps                1.4.0    2020-10-07 [1] RSPM (R 4.0.2)
+##  purrr             0.3.4    2020-04-17 [1] RSPM (R 4.0.0)
+##  qvalue            2.22.0   2020-10-27 [1] Bioconductor  
+##  R.cache           0.14.0   2019-12-06 [1] RSPM (R 4.0.0)
+##  R.methodsS3       1.8.1    2020-08-26 [1] RSPM (R 4.0.2)
+##  R.oo              1.24.0   2020-08-26 [1] RSPM (R 4.0.2)
+##  R.utils           2.10.1   2020-08-26 [1] RSPM (R 4.0.2)
+##  R6                2.4.1    2019-11-12 [1] RSPM (R 4.0.0)
+##  RColorBrewer      1.1-2    2014-12-07 [1] RSPM (R 4.0.0)
+##  Rcpp              1.0.5    2020-07-06 [1] RSPM (R 4.0.2)
+##  readr             1.4.0    2020-10-05 [1] RSPM (R 4.0.2)
+##  rematch2          2.1.2    2020-05-01 [1] RSPM (R 4.0.0)
+##  reshape2          1.4.4    2020-04-09 [1] RSPM (R 4.0.2)
+##  rlang             0.4.8    2020-10-08 [1] RSPM (R 4.0.2)
+##  rmarkdown         2.4      2020-09-30 [1] RSPM (R 4.0.2)
+##  RSQLite           2.2.1    2020-09-30 [1] RSPM (R 4.0.2)
+##  rstudioapi        0.11     2020-02-07 [1] RSPM (R 4.0.0)
+##  rvcheck           0.1.8    2020-03-01 [1] RSPM (R 4.0.0)
+##  S4Vectors       * 0.28.0   2020-10-27 [1] Bioconductor  
+##  scales            1.1.1    2020-05-11 [1] RSPM (R 4.0.0)
+##  scatterpie        0.1.5    2020-09-09 [1] RSPM (R 4.0.2)
+##  sessioninfo       1.1.1    2018-11-05 [1] RSPM (R 4.0.0)
+##  shadowtext        0.0.7    2019-11-06 [1] RSPM (R 4.0.0)
+##  stringi           1.5.3    2020-09-09 [1] RSPM (R 4.0.2)
+##  stringr           1.4.0    2019-02-10 [1] RSPM (R 4.0.0)
+##  styler            1.3.2    2020-02-23 [1] RSPM (R 4.0.0)
+##  tibble            3.0.4    2020-10-12 [1] RSPM (R 4.0.2)
+##  tidygraph         1.2.0    2020-05-12 [1] RSPM (R 4.0.2)
+##  tidyr             1.1.2    2020-08-27 [1] RSPM (R 4.0.2)
+##  tidyselect        1.1.0    2020-05-11 [1] RSPM (R 4.0.0)
+##  tweenr            1.0.1    2018-12-14 [1] RSPM (R 4.0.2)
+##  vctrs             0.3.4    2020-08-29 [1] RSPM (R 4.0.2)
+##  viridis           0.5.1    2018-03-29 [1] RSPM (R 4.0.0)
+##  viridisLite       0.3.0    2018-02-01 [1] RSPM (R 4.0.0)
+##  withr             2.3.0    2020-09-22 [1] RSPM (R 4.0.2)
+##  xfun              0.18     2020-09-29 [1] RSPM (R 4.0.2)
+##  yaml              2.2.1    2020-02-01 [1] RSPM (R 4.0.0)
+## 
+## [1] /usr/local/lib/R/site-library
+## [2] /usr/local/lib/R/library
+
+
+

References

+
+
+

Carlson M., 2020 org.Hs.eg.db: Genome wide annotation for human. http://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html

+
+
+

Diego U. S., and B. I. Team, GSEA: Gene set enrichment analysis. https://www.gsea-msigdb.org/gsea/index.jsp

+
+
+

Dolgalev I., 2020 Msigdbr: MSigDB gene sets for multiple organisms in a tidy data format. https://cran.r-project.org/web/packages/msigdbr/index.html

+
+
+

Khatri P., M. Sirota, and A. J. Butte, 2012 Ten years of pathway analysis: Current approaches and outstanding challenges. PLOS Computational Biology 8: e1002375. https://doi.org/10.1371/journal.pcbi.1002375

+
+
+

Korotkevich G., V. Sukhov, and A. Sergushichev, 2019 Fast gene set enrichment analysis. bioRxiv. https://doi.org/10.1101/060012

+
+
+

Love M. I., W. Huber, and S. Anders, 2014 Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biology 15. https://doi.org/10.1186/s13059-014-0550-8

+
+
+

Subramanian A., P. Tamayo, V. K. Mootha, S. Mukherjee, and B. L. Ebert et al., 2005 Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 102: 15545–15550. https://doi.org/10.1073/pnas.0506580102

+
+
+

Yu G., L.-G. Wang, Y. Han, and Q.-Y. He, 2012 clusterProfiler: An R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 16: 284–287. https://doi.org/10.1089/omi.2011.0118

+
+
+

Yu G., clusterProfiler: Universal enrichment tool for functional and comparative study. http://yulab-smu.top/clusterProfiler-book/index.html

+
+
+
+ + + + +
+
+ +
+ + + + + + + + + + + + + + + + diff --git a/Snakefile b/Snakefile index e9c26d36..47184ca4 100644 --- a/Snakefile +++ b/Snakefile @@ -18,6 +18,7 @@ rule target: "03-rnaseq/dimension-reduction_rnaseq_01_pca.html", "03-rnaseq/dimension-reduction_rnaseq_02_umap.html", "03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.html", + "03-rnaseq/pathway-analysis_rnaseq_02_gsea.html", "03-rnaseq/ortholog-mapping_rnaseq_01_ensembl.html", "04-advanced-topics/00-intro-to-advanced-topics.html", "04-advanced-topics/network-analysis_rnaseq_01_wgcna.html" diff --git a/components/dictionary.txt b/components/dictionary.txt index 8fe5cca5..61ccfcec 100644 --- a/components/dictionary.txt +++ b/components/dictionary.txt @@ -56,6 +56,7 @@ ENSDARG Ensembl ENSG Entrez +ENTREZID et FACS frac @@ -77,6 +78,7 @@ hexamer HGNC histological Hochberg +Hs hypomethylating IDH Illumina @@ -126,6 +128,7 @@ ribosomes RMA RPKMs RStudio +sina ssGSEA StatQuest Subramanian