diff --git a/02-microarray/pathway-analysis_microarray_04_gsva.Rmd b/02-microarray/pathway-analysis_microarray_04_gsva.Rmd new file mode 100644 index 00000000..4fef7354 --- /dev/null +++ b/02-microarray/pathway-analysis_microarray_04_gsva.Rmd @@ -0,0 +1,631 @@ +--- +title: "Gene set variation analysis - Microarray" +author: "CCDL for ALSF" +date: "October 2020" +output: + html_notebook: + toc: true + toc_float: true + number_sections: true +--- + +# Purpose of this analysis + +This example is one of pathway analysis module set, we recommend looking at the [pathway analysis introduction](https://alexslemonade.github.io/refinebio-examples/02-microarray/pathway-analysis_microarray_00_intro.html) to help you determine which pathway analysis method is best suited for your purposes. + +**DRAFT** + +In this example we will cover a method called Gene Set Variation Analysis (GSVA) ([Hänzelmann2013](https://doi.org/10.1186/1471-2105-14-7)) that allows us to calculate gene set or pathway scores on a per-sample basis. + +We like this quote from the GSVA paper ([Hänzelmann2013](https://doi.org/10.1186/1471-2105-14-7)) to set the stage: + +While [gene set enrichment] methods are generally regarded as end points of a bioinformatics analysis, GSVA constitutes a starting point to build pathway-centric models of biology. +Rather than contextualizing some results you _already have_ from another analysis like DGE, GSVA is designed to provide an estimate of pathway variation for each of the samples in an experiment. +Note that these scores will depend on the samples included in the dataset when you run GSVA; if you added more samples and reran GSVA, you would expect the scores to change. + +GSVA assesses the relative enrichment of gene sets across samples using +a non-parametric approach. Conceptually, GSVA transforms a p-gene by n-sample +gene expression matrix into a g-gene set by n-sample pathway enrichment matrix. +(https://github.com/rcastelo/GSVA/blob/master/man/gsva.Rd) + +⬇️ [**Jump to the analysis code**](#analysis) ⬇️ + +# How to run this example + +For general information about our tutorials and the basic software packages you will need, please see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-this-tutorial-is-structured). +We recommend taking a look at our [Resources for Learning R](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#resources-for-learning-r) if you have not written code in R before. + +## Obtain the `.Rmd` file + +To run this example yourself, [download the `.Rmd` for this analysis by clicking this link](https://alexslemonade.github.io/refinebio-examples/02-microarray/pathway-analysis_microarray_04_gsva.Rmd). + +Clicking this link will most likely send this to your downloads folder on your computer. +Move this `.Rmd` file to where you would like this example and its files to be stored. + +You can open this `.Rmd` file in RStudio and follow the rest of these steps from there. (See our [section about getting started with R notebooks](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) if you are unfamiliar with `.Rmd` files.) + +## Set up your analysis folders + +Good file organization is helpful for keeping your data analysis project on track! +We have set up some code that will automatically set up a folder structure for you. +Run this next chunk to set up your folders! + +If you have trouble running this chunk, see our [introduction to using `.Rmd`s](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) for more resources and explanations. + +```{r} +# Create the data folder if it doesn't exist +if (!dir.exists("data")) { + dir.create("data") +} +# Define the file path to the plots directory +plots_dir <- "plots" # Can replace with path to desired output plots directory +# Create the plots folder if it doesn't exist +if (!dir.exists(plots_dir)) { + dir.create(plots_dir) +} +# Define the file path to the results directory +results_dir <- "results" # Can replace with path to desired output results directory +# Create the results folder if it doesn't exist +if (!dir.exists(results_dir)) { + dir.create(results_dir) +} +``` + +In the same place you put this `.Rmd` file, you should now have three new empty folders called `data`, `plots`, and `results`! + +## Obtain the dataset from refine.bio + +For general information about downloading data for these examples, see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-the-data). + +Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/GSE37418/novel-mutations-target-distinct-subgroups-of-medulloblastoma). + +Click the "Download Now" button on the right side of this screen. + + + +Fill out the pop up window with your email and our Terms and Conditions: + + + +It may take a few minutes for the dataset to process. +You will get an email when it is ready. + +## About the dataset we are using for this example + +For this example analysis, we will use this [medulloblastoma samples](https://www.refine.bio/experiments/GSE37418/novel-mutations-target-distinct-subgroups-of-medulloblastoma). +@Robinson2012 measured microarray gene expression of 71 medulloblastoma tumor samples. + +## Place the dataset in your new `data/` folder + +refine.bio will send you a download button in the email when it is ready. +Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in `.zip`. +Double clicking should unzip this for you and create a folder of the same name. + + + +For more details on the contents of this folder see [these docs on refine.bio](http://docs.refine.bio/en/latest/main_text.html#downloadable-files). + +The `GSE37418` folder has the data and metadata TSV files you will need for this example analysis. +Experiment accession ids usually look something like `GSE1235` or `SRP12345`. + +Copy and paste the `GSE37418` folder into your newly created `data/` folder. + +## Check out our file structure! + +Your new analysis folder should contain: + +- The example analysis `.Rmd` you downloaded +- A folder called "data" which contains: + - The `GSE37418` folder which contains: + - The gene expression + - The metadata TSV +- A folder for `plots` (currently empty) +- A folder for `results` (currently empty) + +Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using): + + + +In order for our example here to run without a hitch, we need these files to be in these locations so we've constructed a test to check before we get started with the analysis. +These chunks will declare your file paths and double check that your files are in the right place. + +First we will declare our file paths to our data and metadata files, which should be in our data directory. +This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started. + +```{r} +# Define the file path to the data directory +data_dir <- file.path("data", "GSE37418") # Replace with accession number which will be the name of the folder the files will be in + +# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir` +data_file <- file.path(data_dir, "GSE37418.tsv") # Replace with file path to your dataset + +# Declare the file path to the metadata file using the data directory saved as `data_dir` +metadata_file <- file.path(data_dir, "metadata_GSE37418.tsv") # Replace with file path to your metadata +``` + +Now that our file paths are declared, we can use the `file.exists()` function to check that the files are where we specified above. + +```{r} +# Check if the gene expression matrix file is at the file path stored in `data_file` +file.exists(data_file) + +# Check if the metadata file is at the file path stored in `metadata_file` +file.exists(metadata_file) +``` + +If the chunk above printed out `FALSE` to either of those tests, you won't be able to run this analysis _as is_ until those files are in the appropriate place. + +If the concept of a "file path" is unfamiliar to you; we recommend taking a look at our [section about file paths](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#an-important-note-about-file-paths-and-Rmds). + +# Using a different refine.bio dataset with this analysis? + +If you'd like to adapt an example analysis to use a different dataset from [refine.bio](https://www.refine.bio/), we recommend placing the files in the `data/` directory you created and changing the filenames and paths in the notebook to match these files (we've put comments to signify where you would need to change the code). +We suggest saving plots and results to `plots/` and `results/` directories, respectively, as these are automatically created by the notebook. +From here you can customize this analysis example to fit your own scientific questions and preferences. + +*** + +   + +# Gene set variation analysis - Microarray + +## Install libraries + +See our Getting Started page with [instructions for package installation](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#what-you-need-to-install) for a list of the other software you will need, as well as more tips and resources. + +**DRAFT** + +In this analysis, we will be using [`GSVA`]() package to perform GSVA and the [`qusage`]() package which reads in GMT files. + +We will also need the [`org.Hs.eg.db`](https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html) package to perform gene identifier conversion [@Carlson2020-human]. + +```{r} +if (!("GSVA" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("GSVA", update = FALSE) +} + +if (!("qusage" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("qusage", update = FALSE) +} + +if (!("org.Hs.eg.db" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("org.Hs.eg.db", update = FALSE) +} + +if (!("pheatmap" %in% installed.packages())) { + # Install pheatmap + install.packages("pheatmap", update = FALSE) +} +``` + +Attach the packages we need for this analysis. + +```{r} +# Attach the `qusage` library +library(qusage) + +# Attach the `GSVA` library +library(GSVA) + +# Human annotation package we'll use for gene identifier conversion +library(org.Hs.eg.db) + +# Attach the `pheatmap` library +library(pheatmap) + +# We will need this so we can use the pipe: %>% +library(magrittr) +``` + +## Import and set up data + +Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. +This chunk of code will read both TSV files and add them as data frames to your environment. + +We stored our file paths as objects named `metadata_file` and `data_file` in [this previous step](#check-out-our-file-structure). + +```{r} +# Read in metadata TSV file +metadata <- readr::read_tsv(metadata_file) + +# Read in data TSV file +df <- readr::read_tsv(data_file) %>% + # Tuck away the gene ID column as rownames + tibble::column_to_rownames("Gene") +``` + +Let's ensure that the metadata and data are in the same sample order. + +```{r} +# Make the data in the order of the metadata +df <- df %>% + dplyr::select(metadata$geo_accession) + +# Check if this is in the same order +all.equal(colnames(df), metadata$geo_accession) + +# Bring back the "Gene" column in preparation for mapping +df <- df %>% + tibble::rownames_to_column("Gene") +``` + +### Import Gene Sets + +**DRAFT** + +The function that we will use to run GSVA wants the gene sets to be in a list, rather than a tidy data frame that we used with `clusterProfiler` (although it does accept other formats). + +We're going to take this opportunity to introduce a different format that gene sets are often distributed in called [GMT (Gene Matrix Transposed)](https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29). + +We're going to read in the Hallmark collection file directly from [MSigDB](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp), rather than using `msigdbr` like we did in earlier notebooks. + +Note that when reading in the Hallmark collection file in this manner, the data is only compatible with _Homo sapiens_ datasets. + +**REVIEW** + +```{r} +# R can often read in data from a URL +hallmarks_url <- "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.2/h.all.v7.2.entrez.gmt" + +# QuSAGE is another pathway analysis method, the `qusage` package has a function +# for reading GMT files and turning them into a list +hallmarks_list <- qusage::read.gmt(hallmarks_url) +``` + +What does this `hallmarks_list` look like? + +```{r} +head(hallmarks_list) +``` + +Looks like we have a list of gene sets with associated Entrez IDs. + +In our gene expression data frame, `df` we have Ensembl gene identifiers. +So we will need to convert our Ensembl IDs into Entrez IDs for GSVA. + +### Gene identifier conversion + +We're going to convert our identifiers in `dge_df` to Entrez IDs, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers! + +The annotation package `org.Hs.eg.db` contains information for different identifiers [@Carlson2020-human]. +`org.Hs.eg.db` is specific to _Homo sapiens_ -- this is what the `Hs` in the package name is referencing. + +We can see what types of IDs are available to us in an annotation package with `keytypes()`. + +```{r} +keytypes(org.Hs.eg.db) +``` + +Even though we'll use this package to convert from Ensembl gene IDs (`ENSEMBL`) to Entrez IDs (`ENTREZID`), we could just as easily use it to convert from an Ensembl transcript ID (`ENSEMBLTRANS`) to gene symbols (`SYMBOL`). + +Take a look at our other gene identifier conversion examples for examples with different species and gene ID types: +[the microarray example](https://alexslemonade.github.io/refinebio-examples/02-microarray/gene-id-annotation_microarray_01_ensembl.html) and [the RNA-seq example](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.html). + +The function we will use to map from Ensembl gene IDs to Entrez gene IDs is called `mapIds()`. + +Let's create a data frame that shows the mapped Entrez IDs along with the gene expression values for the respective Ensembl IDs. + +```{r} +# First let's create a mapped data frame we can join to the gene expression values +mapped_df <- data.frame( + "entrez_id" = mapIds( + org.Hs.eg.db, # Replace with annotation package for the organism relevant to your data + keys = df$Gene, + column = "ENTREZID", # Replace with the type of gene identifiers you would like to map to + keytype = "ENSEMBL", # Replace with the type of gene identifiers in your data + multiVals = "first" # This will keep only the first mapped value for each Ensembl ID + ) +) %>% + # If an Ensembl gene identifier doesn't map to a Entrez gene identifier, drop that + # from the data frame + dplyr::filter(!is.na(entrez_id)) %>% + # Make an `Ensembl` column to store the rownames + tibble::rownames_to_column("Ensembl") %>% + # Now let's join the rest of the expression data + dplyr::inner_join(df, by = c("Ensembl" = "Gene")) +``` + +This `1:many mapping between keys and columns` message means that some Ensembl gene identifiers map to multiple Entrez IDs. +In this case, it's also possible that a Entrez ID will map to multiple Ensembl IDs. +For the purpose of performing GSEA later in this notebook, we keep only the first mapped IDs. +For more about how to explore this, take a look at our [microarray gene ID conversion example](https://alexslemonade.github.io/refinebio-examples/02-microarray/gene-id-annotation_microarray_01_ensembl.html). + +Let's see a preview of `mapped_df`. + +```{r} +head(mapped_df) +``` + +We will want to keep in mind that GSVA requires that data is in a matrix with the gene identifiers as rownames. +In order to successfully turn our data frame into a matrix, we will need to ensure that we do not have any duplicate gene identifiers. + +Let's check to see if we have any Entrez IDs that mapped to multiple Ensembl IDs. + +```{r} +any(duplicated(mapped_df$entrez_id)) +``` + +Looks like we do have duplicated Entrez IDs. +Let's find out which Entrez IDs have been duplicated. + +```{r} +dup_entrez_ids <- mapped_df %>% + dplyr::filter(duplicated(entrez_id)) %>% + dplyr::pull(entrez_id) + +dup_entrez_ids +``` + +Now let's take a look at the rows associated with the duplicated Entrez IDs. + +```{r} +mapped_df %>% + dplyr::filter(entrez_id %in% dup_entrez_ids) +``` + +We can see that the associated values vary for each row. + +As we mentioned earlier, we will want to remove duplicated gene identifiers in preparation for the GSVA steps later, so let's keep the Entrez IDs associated with the larger mean expression value per sample. + +We are removing values for three genes here, so it is unlikely to have a considerable impact on our results. + +**REVIEW** + +```{r} +# First let's create a column with the mean expression value for each row -- we +# will want to calculate the mean using the numeric expression values only so +# here we use negative indexing to account for the two gene identifier columns +# from the previous mapping step that we do not want included in the calculation +# of the mean +mapped_df$mean <- apply(mapped_df[, -c(1:2)], 1, mean) + +# Let's look at the rows associated with the duplicated Entrez IDs now that we +# have a mean expression value +mapped_df %>% + dplyr::filter(entrez_id %in% dup_entrez_ids) %>% + dplyr::select(Ensembl, entrez_id, mean, dplyr::everything()) +``` + +Now we can filter out the duplicate gene identifiers using the mean expression value. + +```{r} +filtered_mapped_df <- mapped_df %>% + # Sort so that the highest absolute mean expression values are at the top + dplyr::arrange(dplyr::desc(abs(mean))) %>% + # Filter out the duplicated rows using `dplyr::distinct()`-- this will keep + # the first row with the duplicated value thus keeping the row with the + # highest absolute value of the mean expression + dplyr::distinct(entrez_id, .keep_all = TRUE) +``` + +Let's check to see that we removed the duplicate Entrez IDs and kept the rows with the higher absolute value of the mean expression. + +```{r} +filtered_mapped_df %>% + dplyr::filter(entrez_id %in% dup_entrez_ids) %>% + dplyr::select(Ensembl, entrez_id, mean, dplyr::everything()) +``` + +Looks like we were able to successfully get rid of the duplicate gene identifiers and keep the observations with the higher absolute mean expression value! + +As mentioned earlier, we need a matrix for GSVA. +Let's now convert our data frame into a matrix and prepare our object for GSVA. + +```{r} +matrix <- filtered_mapped_df %>% + # First let's remove the columns we no longer need + dplyr::select(-c("Ensembl", "mean")) %>% + # Now we need to store our gene identifiers as rownames + tibble::column_to_rownames("entrez_id") %>% + # Now we can convert our object into a matrix + as.matrix() +``` + +Note that if we had duplicate gene identifiers here, we would not be able to set them as rownames. + +## GSVA - Microarray + +**DRAFT** + +You may notice that GSVA has some commonalities with GSEA. +Rather than ranking genes based on some statistic _we_ selected ahead of time, GSVA fits a model and ranks genes based on their expression level relative to the sample distribution. +This is a way of asking if a gene _i_ is highly or lowly expressed in a sample _j_ in the context of this experiment and ranking accordingly ([Hänzelmann2013](https://doi.org/10.1186/1471-2105-14-7)). +The pathway-level score calculated is a way of asking how genes _within_ a gene set vary as compared to genes that are _outside_ of that gene set ([Malhotra. 2018](https://towardsdatascience.com/decoding-gene-set-variation-analysis-8193a0cfda3)). +(This is sometimes called a competitive test in gene set enrichment literature.) +The intuition here is that we will get pathway-level scores for each sample that indicate if genes in a pathway vary concordantly in one direction (over-expressed or under-expressed relative to the overall population) ([Hänzelmann2013](https://doi.org/10.1186/1471-2105-14-7)). + +The output is a gene set by sample matrix of GSVA scores. + +### Perform GSVA + +**REVIEW** + +```{r} +gsva_results <- gsva(matrix, + hallmarks_list, + method = "gsva", + # Appropriate for our transformed data on log2-like scale + kcdf = "Gaussian", + # Minimum gene set size + min.sz = 5, + # Maximum gene set size + max.sz = 50, + # Compute Gaussian-distributed scores + mx.diff = TRUE, + # Print out information about each calculation step + verbose = TRUE +) +``` + +**DRAFT** + +Note that the `gsva()` function documentation says we can use `kcdf = "Gaussian"` if we had log-CPMs, log-RPKMs or log-TPMs, but we would use `kcdf = "Poisson"` on integer counts. +See `?gsva` for more options. + +```{r} +# Let's explore what the output of gsva() looks like +head(gsva_results) +``` + +## Visualizing Results + +### Violin plot + +Now let's make a violin plot to look at the distribution of scores. +First we need to get this data in an appropriate format for `ggplot2`. + +**REVIEW** + +```{r} +# The results are a matrix +long_df <- gsva_results %>% + data.frame() %>% + # Gene set names are rownames + tibble::rownames_to_column("gene_set") %>% + # Get into long format + tidyr::pivot_longer( + cols = -gene_set, + names_to = "Kids_First_Biospecimen_ID", + values_to = "gsva_score" + ) %>% + dplyr::arrange(dplyr::desc(gsva_score)) +``` + +Let's make a violin plot so we can look at the distribution of scores for each pathway. + +**REVIEW** + +```{r} +# Violin plot comparing GSVA scores of different random gene set sizes +violin_plot <- long_df %>% + ggplot2::ggplot(ggplot2::aes( + x = gene_set, + y = gsva_score + )) + + # Make a violin plot that's a pretty blue! + ggplot2::geom_violin(fill = "#99CCFF", alpha = 0.5) + + # Add a point with the mean value + ggplot2::stat_summary( + geom = "point", + fun = "mean", + # Change the aesthetics of the points + size = 3, + color = "#0066CC", + shape = 18 + ) + + # Flip the axes + ggplot2::coord_flip() + + ggplot2::labs( + title = "Gene Set GSVA Scores", + x = "gene set", + y = "GSVA score" + ) + + ggplot2::theme_bw() + +# Display plot +violin_plot +``` + +**DRAFT** + +What do you notice about these distributions? +How might you use this information to inform your interpretation of GSVA scores? + +Let's save this plot to PNG. + +```{r} +ggplot2::ggsave(file.path(plots_dir, "GSE37418_gsva_violin_plot.png"), + plot = violin_plot +) +``` + +### Heatmap + +Now let's make a heatmap to look at the distribution of scores. +First, we will need to prepare the metadata for plotting. + +**REVIEW** + +```{r} +# Let's prepare an annotation data frame for plotting +annotation_df <- metadata %>% + # We want to select the variables that we want for annotating the heatmap + dplyr::select( + refinebio_accession_code, + refinebio_sex + ) %>% + # The `pheatmap()` function requires that the row names of our annotation object matches the column names of our dataset object + tibble::column_to_rownames("refinebio_accession_code") +``` + +Now let's make a heatmap using the `pheatmap()` function. + +**REVIEW** + +```{r} +heatmap_annotated <- + pheatmap( + gsva_results, + cluster_rows = TRUE, # We want to cluster the heatmap by rows (gene sets in this case) + cluster_cols = TRUE, # We also want to cluster the heatmap by columns (samples in this case) + show_rownames = TRUE, # We want to show the names of the gene sets + show_colnames = FALSE, # We don't want to show the rownames because there are too many samples for the labels to be clearly seen + annotation_col = annotation_df, + fontsize = 8, + width = 15, + main = "Annotated Heatmap", + colorRampPalette(c( + "deepskyblue", + "black", + "yellow" + ))(25), + scale = "row" # Scale values in the direction of genes (rows) + ) + +# Display heatmap +heatmap_annotated +``` + +Let's save this plot to PNG. + +```{r} +ggplot2::ggsave(file.path(plots_dir, "GSE37418_gsva_annotated_heatmap.png"), + plot = heatmap_annotated +) +``` + +## Write results to file + +Now let's write our GSVA results to file. + +```{r} +gsva_results %>% + as.data.frame() %>% + tibble::rownames_to_column("pathway") %>% + readr::write_tsv(file.path( + results_dir, + "GSE37418_gsva_results.tsv" + )) +``` + +# Resources for further learning + +**DRAFT** + +- * [Malhotra. _Decoding Gene Set Variation Analysis_. (2018)](https://towardsdatascience.com/decoding-gene-set-variation-analysis-8193a0cfda3) + + +# Session info + +At the end of every analysis, before saving your notebook, we recommend printing out your session info. +This helps make your code more reproducible by recording what versions of software and packages you used to run this. + +```{r} +# Print session info +sessioninfo::session_info() +``` + +# References diff --git a/02-microarray/pathway-analysis_microarray_04_gsva.html b/02-microarray/pathway-analysis_microarray_04_gsva.html new file mode 100644 index 00000000..5026daee --- /dev/null +++ b/02-microarray/pathway-analysis_microarray_04_gsva.html @@ -0,0 +1,4803 @@ + + + + + + + + + + + + + + +Gene set variation analysis - Microarray + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+
+
+
+
+ +
+ + + + + + + + + +
+

1 Purpose of this analysis

+

This example is one of pathway analysis module set, we recommend looking at the pathway analysis introduction to help you determine which pathway analysis method is best suited for your purposes.

+

DRAFT

+

In this example we will cover a method called Gene Set Variation Analysis (GSVA) (Hänzelmann2013) that allows us to calculate gene set or pathway scores on a per-sample basis.

+

We like this quote from the GSVA paper (Hänzelmann, Castelo, and Guinney. 2013) to set the stage:

+

While [gene set enrichment] methods are generally regarded as end points of a bioinformatics analysis, GSVA constitutes a starting point to build pathway-centric models of biology. Rather than contextualizing some results you already have from another analysis like DGE, GSVA is designed to provide an estimate of pathway variation for each of the samples in an experiment. Note that these scores will depend on the samples included in the dataset when you run GSVA; if you added more samples and reran GSVA, you would expect the scores to change.

+

GSVA assesses the relative enrichment of gene sets across samples using a non-parametric approach. Conceptually, GSVA transforms a p-gene by n-sample gene expression matrix into a g-gene set by n-sample pathway enrichment matrix. (https://github.com/rcastelo/GSVA/blob/master/man/gsva.Rd)

+

⬇️ Jump to the analysis code ⬇️

+
+
+

2 How to run this example

+

For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.

+
+

2.1 Obtain the .Rmd file

+

To run this example yourself, download the .Rmd for this analysis by clicking this link.

+

Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd file to where you would like this example and its files to be stored.

+

You can open this .Rmd file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd files.)

+
+
+

2.2 Set up your analysis folders

+

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

+

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

+
# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+  dir.create("data")
+}
+# Define the file path to the plots directory
+plots_dir <- "plots" # Can replace with path to desired output plots directory
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+  dir.create(plots_dir)
+}
+# Define the file path to the results directory
+results_dir <- "results" # Can replace with path to desired output results directory
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+  dir.create(results_dir)
+}
+

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

+
+
+

2.3 Obtain the dataset from refine.bio

+

For general information about downloading data for these examples, see our ‘Getting Started’ section.

+

Go to this dataset’s page on refine.bio.

+

Click the “Download Now” button on the right side of this screen.

+

+

Fill out the pop up window with your email and our Terms and Conditions:

+

+

It may take a few minutes for the dataset to process. You will get an email when it is ready.

+
+
+

2.4 About the dataset we are using for this example

+

For this example analysis, we will use this medulloblastoma samples. Robinson et al. (2012) measured microarray gene expression of 71 medulloblastoma tumor samples.

+
+
+

2.5 Place the dataset in your new data/ folder

+

refine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip. Double clicking should unzip this for you and create a folder of the same name.

+

+

For more details on the contents of this folder see these docs on refine.bio.

+

The GSE37418 folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235 or SRP12345.

+

Copy and paste the GSE37418 folder into your newly created data/ folder.

+
+
+

2.6 Check out our file structure!

+

Your new analysis folder should contain:

+
    +
  • The example analysis .Rmd you downloaded
    +
  • +
  • A folder called “data” which contains: +
      +
    • The GSE37418 folder which contains: +
        +
      • The gene expression
        +
      • +
      • The metadata TSV
        +
      • +
    • +
  • +
  • A folder for plots (currently empty)
    +
  • +
  • A folder for results (currently empty)
  • +
+

Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):

+

+

In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

+

First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.

+
# Define the file path to the data directory
+data_dir <- file.path("data", "GSE37418") # Replace with accession number which will be the name of the folder the files will be in
+
+# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
+data_file <- file.path(data_dir, "GSE37418.tsv") # Replace with file path to your dataset
+
+# Declare the file path to the metadata file using the data directory saved as `data_dir`
+metadata_file <- file.path(data_dir, "metadata_GSE37418.tsv") # Replace with file path to your metadata
+

Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

+
# Check if the gene expression matrix file is at the file path stored in `data_file`
+file.exists(data_file)
+
## [1] TRUE
+
# Check if the metadata file is at the file path stored in `metadata_file`
+file.exists(metadata_file)
+
## [1] TRUE
+

If the chunk above printed out FALSE to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.

+

If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.

+
+
+
+

3 Using a different refine.bio dataset with this analysis?

+

If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/ directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.

+
+ +

 

+
+
+

4 Gene set variation analysis - Microarray

+
+

4.1 Install libraries

+

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

+

DRAFT

+

In this analysis, we will be using GSVA package to perform GSVA and the qusage package which reads in GMT files.

+

We will also need the org.Hs.eg.db package to perform gene identifier conversion (Carlson 2020).

+
if (!("GSVA" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("GSVA", update = FALSE)
+}
+
+if (!("qusage" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("qusage", update = FALSE)
+}
+
+if (!("org.Hs.eg.db" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("org.Hs.eg.db", update = FALSE)
+}
+
+if (!("pheatmap" %in% installed.packages())) {
+  # Install pheatmap
+  install.packages("pheatmap", update = FALSE)
+}
+

Attach the packages we need for this analysis.

+
# Attach the `qusage` library
+library(qusage)
+
## Loading required package: limma
+
# Attach the `GSVA` library
+library(GSVA)
+
+# Human annotation package we'll use for gene identifier conversion
+library(org.Hs.eg.db)
+
## Loading required package: AnnotationDbi
+
## Loading required package: stats4
+
## Loading required package: BiocGenerics
+
## Loading required package: parallel
+
## 
+## Attaching package: 'BiocGenerics'
+
## The following objects are masked from 'package:parallel':
+## 
+##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
+##     clusterExport, clusterMap, parApply, parCapply, parLapply,
+##     parLapplyLB, parRapply, parSapply, parSapplyLB
+
## The following object is masked from 'package:limma':
+## 
+##     plotMA
+
## The following objects are masked from 'package:stats':
+## 
+##     IQR, mad, sd, var, xtabs
+
## The following objects are masked from 'package:base':
+## 
+##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
+##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
+##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
+##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
+##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
+##     union, unique, unsplit, which, which.max, which.min
+
## Loading required package: Biobase
+
## Welcome to Bioconductor
+## 
+##     Vignettes contain introductory material; view with
+##     'browseVignettes()'. To cite Bioconductor, see
+##     'citation("Biobase")', and for packages 'citation("pkgname")'.
+
## Loading required package: IRanges
+
## Loading required package: S4Vectors
+
## 
+## Attaching package: 'S4Vectors'
+
## The following object is masked from 'package:base':
+## 
+##     expand.grid
+
## 
+
# Attach the `pheatmap` library
+library(pheatmap)
+
+# We will need this so we can use the pipe: %>%
+library(magrittr)
+
+
+

4.2 Import and set up data

+

Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read both TSV files and add them as data frames to your environment.

+

We stored our file paths as objects named metadata_file and data_file in this previous step.

+
# Read in metadata TSV file
+metadata <- readr::read_tsv(metadata_file)
+
## 
+## ── Column specification ──────────────────────────────────────────────────────────────────────────
+## cols(
+##   .default = col_character(),
+##   refinebio_age = col_logical(),
+##   refinebio_cell_line = col_logical(),
+##   refinebio_compound = col_logical(),
+##   refinebio_disease = col_logical(),
+##   refinebio_disease_stage = col_logical(),
+##   refinebio_genetic_information = col_logical(),
+##   refinebio_processed = col_logical(),
+##   refinebio_race = col_logical(),
+##   refinebio_source_archive_url = col_logical(),
+##   refinebio_specimen_part = col_logical(),
+##   refinebio_subject = col_logical(),
+##   refinebio_time = col_logical(),
+##   refinebio_treatment = col_logical(),
+##   channel_count = col_double(),
+##   contact_phone = col_double(),
+##   `contact_zip/postal_code` = col_double(),
+##   data_row_count = col_double(),
+##   taxid_ch1 = col_double()
+## )
+## ℹ Use `spec()` for the full column specifications.
+
# Read in data TSV file
+df <- readr::read_tsv(data_file) %>%
+  # Tuck away the gene ID  column as rownames
+  tibble::column_to_rownames("Gene")
+
## 
+## ── Column specification ──────────────────────────────────────────────────────────────────────────
+## cols(
+##   .default = col_double(),
+##   Gene = col_character()
+## )
+## ℹ Use `spec()` for the full column specifications.
+

Let’s ensure that the metadata and data are in the same sample order.

+
# Make the data in the order of the metadata
+df <- df %>%
+  dplyr::select(metadata$geo_accession)
+
+# Check if this is in the same order
+all.equal(colnames(df), metadata$geo_accession)
+
## [1] TRUE
+
# Bring back the "Gene" column in preparation for mapping
+df <- df %>%
+  tibble::rownames_to_column("Gene")
+
+

4.2.1 Import Gene Sets

+

DRAFT

+

The function that we will use to run GSVA wants the gene sets to be in a list, rather than a tidy data frame that we used with clusterProfiler (although it does accept other formats).

+

We’re going to take this opportunity to introduce a different format that gene sets are often distributed in called GMT (Gene Matrix Transposed).

+

We’re going to read in the Hallmark collection file directly from MSigDB, rather than using msigdbr like we did in earlier notebooks.

+

Note that when reading in the Hallmark collection file in this manner, the data is only compatible with Homo sapiens datasets.

+

REVIEW

+
# R can often read in data from a URL
+hallmarks_url <- "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.2/h.all.v7.2.entrez.gmt"
+
+# QuSAGE is another pathway analysis method, the `qusage` package has a function
+# for reading GMT files and turning them into a list
+hallmarks_list <- qusage::read.gmt(hallmarks_url)
+

What does this hallmarks_list look like?

+
head(hallmarks_list)
+
## $HALLMARK_TNFA_SIGNALING_VIA_NFKB
+##   [1] "3726"   "2920"   "467"    "4792"   "7128"   "5743"   "2919"   "8870"  
+##   [9] "9308"   "6364"   "2921"   "23764"  "4791"   "7127"   "1839"   "1316"  
+##  [17] "330"    "5329"   "7538"   "3383"   "3725"   "1960"   "3553"   "597"   
+##  [25] "23645"  "80149"  "6648"   "4929"   "3552"   "5971"   "7185"   "7832"  
+##  [33] "1843"   "1326"   "2114"   "2152"   "6385"   "1958"   "3569"   "7124"  
+##  [41] "23135"  "4790"   "3976"   "5806"   "8061"   "3164"   "182"    "6351"  
+##  [49] "2643"   "6347"   "1827"   "1844"   "10938"  "9592"   "5966"   "8837"  
+##  [57] "8767"   "4794"   "8013"   "22822"  "51278"  "8744"   "2669"   "1647"  
+##  [65] "3627"   "10769"  "8553"   "1959"   "9021"   "11182"  "5734"   "1847"  
+##  [73] "5055"   "4783"   "5054"   "10221"  "25976"  "5970"   "329"    "6372"  
+##  [81] "9516"   "7130"   "960"    "3624"   "5328"   "4609"   "3604"   "6446"  
+##  [89] "10318"  "10135"  "2355"   "10957"  "3398"   "969"    "3575"   "1942"  
+##  [97] "7262"   "5209"   "6352"   "79693"  "3460"   "8878"   "10950"  "4616"  
+## [105] "8942"   "50486"  "694"    "4170"   "7422"   "5606"   "1026"   "3491"  
+## [113] "10010"  "3433"   "3606"   "7280"   "3659"   "2353"   "4973"   "388"   
+## [121] "374"    "4814"   "65986"  "8613"   "9314"   "6373"   "6303"   "1435"  
+## [129] "1880"   "56937"  "5791"   "7097"   "57007"  "7071"   "4082"   "3914"  
+## [137] "1051"   "9322"   "2150"   "687"    "3949"   "7050"   "127544" "55332" 
+## [145] "2683"   "11080"  "1437"   "5142"   "8303"   "5341"   "6776"   "23258" 
+## [153] "595"    "23586"  "8877"   "941"    "25816"  "57018"  "2526"   "9034"  
+## [161] "80176"  "8848"   "9334"   "150094" "23529"  "4780"   "2354"   "5187"  
+## [169] "10725"  "490"    "3593"   "3572"   "9120"   "19"     "3280"   "604"   
+## [177] "8660"   "6515"   "1052"   "51561"  "4088"   "6890"   "9242"   "64135" 
+## [185] "3601"   "79155"  "602"    "24145"  "24147"  "1906"   "10209"  "650"   
+## [193] "1846"   "10611"  "23308"  "9945"   "10365"  "3371"   "5271"   "4084"  
+## 
+## $HALLMARK_HYPOXIA
+##   [1] "5230"   "5163"   "2632"   "5211"   "226"    "2026"   "5236"   "10397" 
+##   [9] "3099"   "230"    "2821"   "4601"   "6513"   "5033"   "133"    "8974"  
+##  [17] "2023"   "5214"   "205"    "26355"  "5209"   "7422"   "665"    "7167"  
+##  [25] "30001"  "55818"  "901"    "3939"   "2997"   "2597"   "8553"   "51129" 
+##  [33] "3725"   "5054"   "4015"   "2645"   "8497"   "23764"  "54541"  "6515"  
+##  [41] "3486"   "4783"   "2353"   "3516"   "3098"   "10370"  "3669"   "2584"  
+##  [49] "26118"  "5837"   "6781"   "23036"  "694"    "123"    "1466"   "7436"  
+##  [57] "23210"  "2131"   "2152"   "5165"   "55139"  "7360"   "229"    "8614"  
+##  [65] "54206"  "2027"   "10957"  "3162"   "5228"   "26330"  "9435"   "55076" 
+##  [73] "63827"  "467"    "857"    "272"    "2719"   "3340"   "8660"   "8819"  
+##  [81] "2548"   "6385"   "8987"   "8870"   "5313"   "3484"   "5329"   "112464"
+##  [89] "8839"   "9215"   "25819"  "6275"   "58528"  "7538"   "1956"   "1907"  
+##  [97] "3423"   "1026"   "6095"   "1843"   "4282"   "5507"   "10570"  "11015" 
+## [105] "1837"   "136"    "9957"   "284119" "2908"   "1316"   "2239"   "3491"  
+## [113] "7128"   "771"    "3073"   "633"    "23645"  "55276"  "5292"   "25824" 
+## [121] "55577"  "1027"   "680"    "8277"   "4493"   "538"    "4502"   "9672"  
+## [129] "25976"  "5317"   "302"    "5224"   "1649"   "5578"   "2542"   "7852"  
+## [137] "1944"   "1356"   "8609"   "1490"   "9469"   "7163"   "56925"  "124872"
+## [145] "10891"  "596"    "2651"   "3036"   "54800"  "949"    "6576"   "6383"  
+## [153] "839"    "7428"   "2309"   "5155"   "126792" "6518"   "8406"   "1942"  
+## [161] "2745"   "57007"  "5066"   "7045"   "1634"   "6478"   "51316"  "2203"  
+## [169] "8459"   "5260"   "4627"   "1028"   "9380"   "5105"   "3623"   "3309"  
+## [177] "8509"   "23327"  "7162"   "7511"   "3569"   "6533"   "4214"   "3948"  
+## [185] "9590"   "26136"  "3798"   "3906"   "1289"   "2817"   "3069"   "10994" 
+## [193] "1463"   "7052"   "2113"   "3219"   "8991"   "2355"   "6820"   "7043"  
+## 
+## $HALLMARK_CHOLESTEROL_HOMEOSTASIS
+##  [1] "2224"   "1595"   "3422"   "2222"   "1717"   "6713"   "3157"   "50814" 
+##  [9] "4047"   "4597"   "3949"   "7108"   "230"    "10682"  "6319"   "10654" 
+## [17] "4598"   "4023"   "6309"   "9415"   "3156"   "51478"  "312"    "6721"  
+## [25] "5833"   "55902"  "467"    "127"    "23474"  "1891"   "875"    "2990"  
+## [33] "2194"   "3958"   "22809"  "308"    "94241"  "1119"   "2946"   "39"    
+## [41] "552"    "5359"   "1191"   "54206"  "57761"  "58191"  "51330"  "71"    
+## [49] "182"    "5641"   "26270"  "493869" "10957"  "118429" "114569" "928"   
+## [57] "5468"   "2731"   "6811"   "134429" "1499"   "27346"  "116496" "5165"  
+## [65] "5329"   "7869"   "2770"   "20"     "6311"   "4783"   "214"    "2171"  
+## [73] "6282"   "132864"
+## 
+## $HALLMARK_MITOTIC_SPINDLE
+##   [1] "9181"   "23332"  "3832"   "9493"   "57679"  "382"    "4650"   "4627"  
+##   [9] "10426"  "9793"   "29127"  "57580"  "50650"  "4926"   "6711"   "11004" 
+##  [17] "3799"   "7272"   "324"    "11190"  "5048"   "10435"  "9371"   "55704" 
+##  [25] "56992"  "332"    "116840" "4763"   "7248"   "996"    "11064"  "114791"
+##  [33] "24137"  "22919"  "55785"  "675"    "5347"   "5921"   "4751"   "8936"  
+##  [41] "7153"   "7204"   "9826"   "10300"  "9055"   "54443"  "55755"  "9126"  
+##  [49] "10844"  "9700"   "55201"  "201176" "9732"   "29901"  "3619"   "394"   
+##  [57] "2934"   "10276"  "10128"  "23637"  "2317"   "64411"  "121512" "29"    
+##  [65] "55835"  "4690"   "1063"   "9585"   "10163"  "4628"   "1062"   "9266"  
+##  [73] "4281"   "3831"   "57787"  "127829" "9702"   "8409"   "393"    "23580" 
+##  [81] "163786" "9113"   "4983"   "8976"   "4296"   "6654"   "25"     "7074"  
+##  [89] "23095"  "6453"   "134549" "8440"   "9787"   "613"    "10048"  "2037"  
+##  [97] "10801"  "11104"  "51174"  "22974"  "3797"   "357"    "85378"  "6709"  
+## [105] "23022"  "23647"  "9735"   "84376"  "25777"  "58526"  "1739"   "2316"  
+## [113] "79658"  "8476"   "23365"  "4082"   "51199"  "5108"   "10928"  "7430"  
+## [121] "85464"  "983"    "22930"  "10160"  "11346"  "54509"  "1894"   "2035"  
+## [129] "51735"  "3835"   "84333"  "6780"   "396"    "6790"   "26271"  "51203" 
+## [137] "5829"   "9564"   "23607"  "11214"  "10013"  "22994"  "3996"   "23192" 
+## [145] "5116"   "7840"   "11133"  "667"    "22920"  "151987" "9411"   "9462"  
+## [153] "9133"   "80119"  "5922"   "4739"   "8243"   "81"     "5311"   "7461"  
+## [161] "998"    "10403"  "9874"   "9344"   "6904"   "832"    "1794"   "2017"  
+## [169] "10051"  "10565"  "7277"   "4001"   "10006"  "6093"   "55125"  "699"   
+## [177] "50628"  "64857"  "253260" "10018"  "1778"   "6624"   "8874"   "140735"
+## [185] "4643"   "274"    "4853"   "5981"   "10611"  "89941"  "8470"   "11135" 
+## [193] "7414"   "6249"   "23012"  "7531"   "9771"   "55722"  "1453"  
+## 
+## $HALLMARK_WNT_BETA_CATENIN_SIGNALING
+##  [1] "4609"  "1499"  "3714"  "4851"  "28514" "8313"  "5664"  "8321"  "4855" 
+## [10] "51176" "8312"  "85407" "81029" "8454"  "182"   "9794"  "2648"  "2770" 
+## [19] "7475"  "5727"  "9612"  "27121" "3066"  "22943" "6932"  "7471"  "8650" 
+## [28] "6868"  "1856"  "5467"  "23385" "10014" "894"   "10023" "1454"  "3516" 
+## [37] "8325"  "7157"  "6502"  "23493" "23462" "79885"
+## 
+## $HALLMARK_TGF_BETA_SIGNALING
+##  [1] "7046"  "4092"  "7040"  "64750" "57154" "659"   "6498"  "6497"  "90"   
+## [10] "56937" "9612"  "5054"  "3726"  "4086"  "4091"  "23645" "7050"  "5045" 
+## [19] "4088"  "2280"  "6885"  "657"   "1499"  "28996" "7071"  "650"   "2022" 
+## [28] "324"   "5494"  "331"   "999"   "3397"  "7044"  "1028"  "51592" "11031"
+## [37] "7082"  "6574"  "1025"  "3399"  "9241"  "51742" "3460"  "3398"  "5499" 
+## [46] "6711"  "25937" "8412"  "7057"  "2339"  "3065"  "7323"  "4053"  "387"
+

Looks like we have a list of gene sets with associated Entrez IDs.

+

In our gene expression data frame, df we have Ensembl gene identifiers. So we will need to convert our Ensembl IDs into Entrez IDs for GSVA.

+
+
+

4.2.2 Gene identifier conversion

+

We’re going to convert our identifiers in dge_df to Entrez IDs, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers!

+

The annotation package org.Hs.eg.db contains information for different identifiers (Carlson 2020). org.Hs.eg.db is specific to Homo sapiens – this is what the Hs in the package name is referencing.

+

We can see what types of IDs are available to us in an annotation package with keytypes().

+
keytypes(org.Hs.eg.db)
+
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
+##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
+## [11] "GO"           "GOALL"        "IPI"          "MAP"          "OMIM"        
+## [16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
+## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"       "UNIGENE"     
+## [26] "UNIPROT"
+

Even though we’ll use this package to convert from Ensembl gene IDs (ENSEMBL) to Entrez IDs (ENTREZID), we could just as easily use it to convert from an Ensembl transcript ID (ENSEMBLTRANS) to gene symbols (SYMBOL).

+

Take a look at our other gene identifier conversion examples for examples with different species and gene ID types: the microarray example and the RNA-seq example.

+

The function we will use to map from Ensembl gene IDs to Entrez gene IDs is called mapIds().

+

Let’s create a data frame that shows the mapped Entrez IDs along with the gene expression values for the respective Ensembl IDs.

+
# First let's create a mapped data frame we can join to the gene expression values
+mapped_df <- data.frame(
+  "entrez_id" = mapIds(
+    org.Hs.eg.db, # Replace with annotation package for the organism relevant to your data
+    keys = df$Gene,
+    column = "ENTREZID", # Replace with the type of gene identifiers you would like to map to
+    keytype = "ENSEMBL", # Replace with the type of gene identifiers in your data
+    multiVals = "first" # This will keep only the first mapped value for each Ensembl ID
+  )
+) %>%
+  # If an Ensembl gene identifier doesn't map to a Entrez gene identifier, drop that
+  # from the data frame
+  dplyr::filter(!is.na(entrez_id)) %>%
+  # Make an `Ensembl` column to store the rownames
+  tibble::rownames_to_column("Ensembl") %>%
+  # Now let's join the rest of the expression data
+  dplyr::inner_join(df, by = c("Ensembl" = "Gene"))
+
## 'select()' returned 1:many mapping between keys and columns
+

This 1:many mapping between keys and columns message means that some Ensembl gene identifiers map to multiple Entrez IDs. In this case, it’s also possible that a Entrez ID will map to multiple Ensembl IDs. For the purpose of performing GSEA later in this notebook, we keep only the first mapped IDs. For more about how to explore this, take a look at our microarray gene ID conversion example.

+

Let’s see a preview of mapped_df.

+
head(mapped_df)
+
+ +
+

We will want to keep in mind that GSVA requires that data is in a matrix with the gene identifiers as rownames. In order to successfully turn our data frame into a matrix, we will need to ensure that we do not have any duplicate gene identifiers.

+

Let’s check to see if we have any Entrez IDs that mapped to multiple Ensembl IDs.

+
any(duplicated(mapped_df$entrez_id))
+
## [1] TRUE
+

Looks like we do have duplicated Entrez IDs. Let’s find out which Entrez IDs have been duplicated.

+
dup_entrez_ids <- mapped_df %>%
+  dplyr::filter(duplicated(entrez_id)) %>%
+  dplyr::pull(entrez_id)
+
+dup_entrez_ids
+
## [1] "6013"   "5414"   "148753"
+

Now let’s take a look at the rows associated with the duplicated Entrez IDs.

+
mapped_df %>%
+  dplyr::filter(entrez_id %in% dup_entrez_ids)
+
+ +
+

We can see that the associated values vary for each row.

+

As we mentioned earlier, we will want to remove duplicated gene identifiers in preparation for the GSVA steps later, so let’s keep the Entrez IDs associated with the larger mean expression value per sample.

+

We are removing values for three genes here, so it is unlikely to have a considerable impact on our results.

+

REVIEW

+
# First let's create a column with the mean expression value for each row -- we
+# will want to calculate the mean using the numeric expression values only so
+# here we use negative indexing to account for the two gene identifier columns
+# from the previous mapping step that we do not want included in the calculation
+# of the mean
+mapped_df$mean <- apply(mapped_df[, -c(1:2)], 1, mean)
+
+# Let's look at the rows associated with the duplicated Entrez IDs now that we
+# have a mean expression value
+mapped_df %>%
+  dplyr::filter(entrez_id %in% dup_entrez_ids) %>%
+  dplyr::select(Ensembl, entrez_id, mean, dplyr::everything())
+
+ +
+

Now we can filter out the duplicate gene identifiers using the mean expression value.

+
filtered_mapped_df <- mapped_df %>%
+  # Sort so that the highest absolute mean expression values are at the top
+  dplyr::arrange(dplyr::desc(abs(mean))) %>%
+  # Filter out the duplicated rows using `dplyr::distinct()`-- this will keep
+  # the first row with the duplicated value thus keeping the row with the
+  # highest absolute value of the mean expression
+  dplyr::distinct(entrez_id, .keep_all = TRUE)
+

Let’s check to see that we removed the duplicate Entrez IDs and kept the rows with the higher absolute value of the mean expression.

+
filtered_mapped_df %>%
+  dplyr::filter(entrez_id %in% dup_entrez_ids) %>%
+  dplyr::select(Ensembl, entrez_id, mean, dplyr::everything())
+
+ +
+

Looks like we were able to successfully get rid of the duplicate gene identifiers and keep the observations with the higher absolute mean expression value!

+

As mentioned earlier, we need a matrix for GSVA. Let’s now convert our data frame into a matrix and prepare our object for GSVA.

+
matrix <- filtered_mapped_df %>%
+  # First let's remove the columns we no longer need
+  dplyr::select(-c("Ensembl", "mean")) %>%
+  # Now we need to store our gene identifiers as rownames
+  tibble::column_to_rownames("entrez_id") %>%
+  # Now we can convert our object into a matrix
+  as.matrix()
+

Note that if we had duplicate gene identifiers here, we would not be able to set them as rownames.

+
+
+
+

4.3 GSVA - Microarray

+

DRAFT

+

You may notice that GSVA has some commonalities with GSEA. Rather than ranking genes based on some statistic we selected ahead of time, GSVA fits a model and ranks genes based on their expression level relative to the sample distribution. This is a way of asking if a gene i is highly or lowly expressed in a sample j in the context of this experiment and ranking accordingly (Hänzelmann2013). The pathway-level score calculated is a way of asking how genes within a gene set vary as compared to genes that are outside of that gene set (Malhotra. 2018). (This is sometimes called a competitive test in gene set enrichment literature.) The intuition here is that we will get pathway-level scores for each sample that indicate if genes in a pathway vary concordantly in one direction (over-expressed or under-expressed relative to the overall population) (Hänzelmann2013).

+

The output is a gene set by sample matrix of GSVA scores.

+
+

4.3.1 Perform GSVA

+

REVIEW

+
gsva_results <- gsva(matrix,
+  hallmarks_list,
+  method = "gsva",
+  # Appropriate for our transformed data on log2-like scale
+  kcdf = "Gaussian",
+  # Minimum gene set size
+  min.sz = 5,
+  # Maximum gene set size
+  max.sz = 50,
+  # Compute Gaussian-distributed scores
+  mx.diff = TRUE,
+  # Print out information about each calculation step
+  verbose = TRUE
+)
+
## Estimating GSVA scores for 7 gene sets.
+## Estimating ECDFs with Gaussian kernels
+## 
+  |                                                                            
+  |                                                                      |   0%
+  |                                                                            
+  |==========                                                            |  14%
+  |                                                                            
+  |====================                                                  |  29%
+  |                                                                            
+  |==============================                                        |  43%
+  |                                                                            
+  |========================================                              |  57%
+  |                                                                            
+  |==================================================                    |  71%
+  |                                                                            
+  |============================================================          |  86%
+  |                                                                            
+  |======================================================================| 100%
+

DRAFT

+

Note that the gsva() function documentation says we can use kcdf = "Gaussian" if we had log-CPMs, log-RPKMs or log-TPMs, but we would use kcdf = "Poisson" on integer counts. See ?gsva for more options.

+
# Let's explore what the output of gsva() looks like
+head(gsva_results)
+
##                                            GSM918590   GSM918634   GSM918581
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING       0.23995535 -0.25656410  0.06715069
+## HALLMARK_NOTCH_SIGNALING                  0.38831354 -0.33494940 -0.04352226
+## HALLMARK_APICAL_SURFACE                   0.06141113 -0.17543159  0.04909823
+## HALLMARK_HEDGEHOG_SIGNALING               0.10109991 -0.30002677 -0.02979826
+## HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY -0.20164740 -0.40982481 -0.49766994
+## HALLMARK_ANGIOGENESIS                     0.51648841 -0.07780347 -0.03801051
+##                                            GSM918642    GSM918605   GSM918631
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING      -0.18973988 -0.068364972  0.15418948
+## HALLMARK_NOTCH_SIGNALING                 -0.24737992 -0.032697333  0.05838874
+## HALLMARK_APICAL_SURFACE                  -0.12670464 -0.170681235  0.13611119
+## HALLMARK_HEDGEHOG_SIGNALING               0.06359428 -0.032894089 -0.34401503
+## HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY  0.05111314 -0.305264942 -0.31491562
+## HALLMARK_ANGIOGENESIS                    -0.03100635 -0.006704894 -0.22719844
+##                                            GSM918614   GSM918580 GSM918607
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING      -0.03781014  0.21530980 0.2501704
+## HALLMARK_NOTCH_SIGNALING                 -0.15996685  0.19049895 0.2648825
+## HALLMARK_APICAL_SURFACE                  -0.05686320  0.37283026 0.1850372
+## HALLMARK_HEDGEHOG_SIGNALING              -0.32873822  0.16688243 0.3495854
+## HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY  0.17313757 -0.05078672 0.4714168
+## HALLMARK_ANGIOGENESIS                    -0.24981120  0.11340218 0.4578473
+##                                           GSM918584   GSM918629  GSM918641
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING       0.3700833 -0.15266799 0.06338861
+## HALLMARK_NOTCH_SIGNALING                  0.2489189 -0.11273487 0.29876626
+## HALLMARK_APICAL_SURFACE                   0.3167960 -0.13384276 0.19271595
+## HALLMARK_HEDGEHOG_SIGNALING               0.1928629  0.03330116 0.22781657
+## HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY -0.3083731 -0.07933328 0.13870709
+## HALLMARK_ANGIOGENESIS                    -0.3244357 -0.19594252 0.49198709
+##                                            GSM918639  GSM918604   GSM918651
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING      -0.28580675 0.02905820  0.18240891
+## HALLMARK_NOTCH_SIGNALING                 -0.19314868 0.17951238  0.09517688
+## HALLMARK_APICAL_SURFACE                   0.04062882 0.01754503  0.16738236
+## HALLMARK_HEDGEHOG_SIGNALING              -0.54114593 0.27636956  0.32670291
+## HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY -0.12288562 0.54526936  0.08882182
+## HALLMARK_ANGIOGENESIS                    -0.22575019 0.31238423 -0.07768959
+##                                            GSM918616   GSM918637   GSM918589
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING      -0.24094833  0.11530550  0.21663657
+## HALLMARK_NOTCH_SIGNALING                 -0.03318423 -0.03231959  0.18916012
+## HALLMARK_APICAL_SURFACE                  -0.20238222  0.16766468 -0.11605456
+## HALLMARK_HEDGEHOG_SIGNALING              -0.28053833 -0.40954355  0.02649241
+## HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY -0.26001133 -0.15532194  0.17816518
+## HALLMARK_ANGIOGENESIS                    -0.50833369 -0.16408875  0.24831780
+##                                           GSM918585   GSM918653   GSM918625
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING       0.2455032 -0.06063081 -0.02901912
+## HALLMARK_NOTCH_SIGNALING                 -0.1347319 -0.02332031 -0.28065654
+## HALLMARK_APICAL_SURFACE                   0.1575350  0.05191626  0.10143337
+## HALLMARK_HEDGEHOG_SIGNALING               0.2325810 -0.32720734  0.21196433
+## HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY -0.2326842 -0.24139015  0.20543708
+## HALLMARK_ANGIOGENESIS                     0.1472216 -0.26554086  0.26243287
+##                                            GSM918583   GSM918600   GSM918606
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING       0.08884055  0.08419407  0.19363225
+## HALLMARK_NOTCH_SIGNALING                 -0.21213244 -0.16022879  0.33105363
+## HALLMARK_APICAL_SURFACE                   0.12894159 -0.11635252  0.09091376
+## HALLMARK_HEDGEHOG_SIGNALING               0.02678475  0.16428888 -0.06286876
+## HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY -0.44249084  0.18669438 -0.10173529
+## HALLMARK_ANGIOGENESIS                    -0.20601494  0.16765132 -0.09751089
+##                                              GSM918635   GSM918578   GSM918636
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING      -0.0009701468 -0.12345891  0.20947727
+## HALLMARK_NOTCH_SIGNALING                 -0.2699253479  0.04168101 -0.04326055
+## HALLMARK_APICAL_SURFACE                  -0.1287392448  0.05949457 -0.01444851
+## HALLMARK_HEDGEHOG_SIGNALING               0.0657207257  0.01973056 -0.33561342
+## HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY -0.0400655031 -0.19657049  0.12932956
+## HALLMARK_ANGIOGENESIS                    -0.2825959777  0.00234438 -0.15673331
+##                                             GSM918643    GSM918598   GSM918646
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING       0.064622642  0.227232947  0.19931892
+## HALLMARK_NOTCH_SIGNALING                 -0.134540399  0.137576856  0.32042287
+## HALLMARK_APICAL_SURFACE                   0.059989988  0.004333413  0.04145731
+## HALLMARK_HEDGEHOG_SIGNALING               0.148666199  0.273791523  0.03689475
+## HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY -0.117846601 -0.279183696  0.10928364
+## HALLMARK_ANGIOGENESIS                     0.005401437  0.285726853 -0.19618623
+##                                           GSM918644   GSM918624   GSM918612
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING      0.26709595 -0.01791242 -0.13718753
+## HALLMARK_NOTCH_SIGNALING                 0.32442532  0.11217030  0.03628688
+## HALLMARK_APICAL_SURFACE                  0.18553834 -0.11039171  0.05008435
+## HALLMARK_HEDGEHOG_SIGNALING              0.08295024 -0.29125054  0.01819805
+## HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY 0.20900008 -0.02159049  0.28326163
+## HALLMARK_ANGIOGENESIS                    0.17794015  0.18080336  0.17422076
+##                                           GSM918596   GSM918633   GSM918650
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING       0.1252027 -0.04837702 -0.02622123
+## HALLMARK_NOTCH_SIGNALING                  0.2695138 -0.44160380 -0.31403870
+## HALLMARK_APICAL_SURFACE                  -0.1271545 -0.33549131 -0.16655450
+## HALLMARK_HEDGEHOG_SIGNALING               0.1034041  0.20685711  0.16104246
+## HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY  0.1068096  0.08939164 -0.29405137
+## HALLMARK_ANGIOGENESIS                     0.4853770 -0.35032341 -0.31660333
+##                                            GSM918592  GSM918615   GSM918608
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING      -0.10138987 0.03524086  0.10958784
+## HALLMARK_NOTCH_SIGNALING                  0.15013551 0.18675150  0.15066160
+## HALLMARK_APICAL_SURFACE                   0.22816724 0.10250507 -0.14474001
+## HALLMARK_HEDGEHOG_SIGNALING               0.18693168 0.03511935  0.16683178
+## HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY -0.03376685 0.09896661  0.08493603
+## HALLMARK_ANGIOGENESIS                    -0.14247911 0.50761451  0.19774528
+##                                          GSM918587   GSM918621   GSM918611
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING      0.1884040  0.05142786  0.21788622
+## HALLMARK_NOTCH_SIGNALING                 0.4037280 -0.03614978  0.04558556
+## HALLMARK_APICAL_SURFACE                  0.2892472 -0.08988133  0.22397793
+## HALLMARK_HEDGEHOG_SIGNALING              0.1644370  0.23065657  0.40264183
+## HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY 0.4794348  0.07846292 -0.08585909
+## HALLMARK_ANGIOGENESIS                    0.3113657  0.30326245 -0.14084079
+##                                             GSM918618   GSM918652   GSM918638
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING      -0.009274458 -0.17365594 -0.05672167
+## HALLMARK_NOTCH_SIGNALING                  0.493226658 -0.11749314 -0.30925571
+## HALLMARK_APICAL_SURFACE                   0.239913391  0.08891204 -0.24923953
+## HALLMARK_HEDGEHOG_SIGNALING              -0.219880216 -0.33404284  0.21369241
+## HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY  0.457665298  0.02029094  0.22479921
+## HALLMARK_ANGIOGENESIS                     0.544891902 -0.22065713 -0.08093338
+##                                           GSM918647   GSM918649  GSM918595
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING      0.22858827  0.03943721 -0.1229492
+## HALLMARK_NOTCH_SIGNALING                 0.15629744  0.06303533  0.1055035
+## HALLMARK_APICAL_SURFACE                  0.11592910  0.21343825 -0.2219838
+## HALLMARK_HEDGEHOG_SIGNALING              0.04674316 -0.02426089 -0.1235387
+## HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY 0.02938937  0.20015244 -0.3350906
+## HALLMARK_ANGIOGENESIS                    0.01668848  0.07642845  0.3268799
+##                                             GSM918640   GSM918645   GSM918601
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING      -0.004239444 -0.11713111 -0.20561596
+## HALLMARK_NOTCH_SIGNALING                 -0.119234008 -0.10332593 -0.25238060
+## HALLMARK_APICAL_SURFACE                  -0.136529958  0.22820653 -0.24230165
+## HALLMARK_HEDGEHOG_SIGNALING              -0.401156116 -0.03610303 -0.49768858
+## HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY -0.036005573 -0.05561319 -0.03412943
+## HALLMARK_ANGIOGENESIS                    -0.068341694 -0.10555162 -0.29151740
+##                                            GSM918586   GSM918628  GSM918630
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING      -0.02031341 -0.05410251  0.1064781
+## HALLMARK_NOTCH_SIGNALING                 -0.18245187  0.37844039 -0.0432457
+## HALLMARK_APICAL_SURFACE                  -0.11837557  0.32308502 -0.1750377
+## HALLMARK_HEDGEHOG_SIGNALING              -0.01934523 -0.05519408 -0.2890555
+## HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY  0.21795467  0.30456454 -0.2799407
+## HALLMARK_ANGIOGENESIS                     0.07227662  0.27775441 -0.4852257
+##                                            GSM918591    GSM918613   GSM918622
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING      -0.16282445 -0.145722303  0.01617379
+## HALLMARK_NOTCH_SIGNALING                 -0.08894168  0.007187613  0.05101865
+## HALLMARK_APICAL_SURFACE                   0.01663291  0.055120201 -0.14814370
+## HALLMARK_HEDGEHOG_SIGNALING               0.15363884  0.031129847  0.15826829
+## HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY -0.13447644  0.032723969  0.13661635
+## HALLMARK_ANGIOGENESIS                    -0.13266748 -0.237639295 -0.10997168
+##                                            GSM918623   GSM918594   GSM918620
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING       0.02281898  0.13624559 -0.08201268
+## HALLMARK_NOTCH_SIGNALING                 -0.16531504  0.01307365  0.09843645
+## HALLMARK_APICAL_SURFACE                  -0.20505171 -0.14838082 -0.03424037
+## HALLMARK_HEDGEHOG_SIGNALING               0.03822634 -0.15417073  0.10116984
+## HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY -0.31685028  0.13309654  0.08516563
+## HALLMARK_ANGIOGENESIS                    -0.26908954  0.40852174  0.03289946
+##                                            GSM918602   GSM918582   GSM918627
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING       0.03996778  0.03433934 -0.05056402
+## HALLMARK_NOTCH_SIGNALING                  0.05409175  0.10565182  0.03646477
+## HALLMARK_APICAL_SURFACE                  -0.14011891  0.02739102  0.01324814
+## HALLMARK_HEDGEHOG_SIGNALING              -0.11494089  0.10061320  0.28787974
+## HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY  0.35723106 -0.14978735 -0.11268113
+## HALLMARK_ANGIOGENESIS                    -0.24677766  0.26210185  0.19164039
+##                                            GSM918632  GSM918597   GSM918619
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING      -0.14716548  0.3042655 -0.22712842
+## HALLMARK_NOTCH_SIGNALING                 -0.16395952  0.1095166 -0.04815091
+## HALLMARK_APICAL_SURFACE                  -0.00739304 -0.2597972  0.09286971
+## HALLMARK_HEDGEHOG_SIGNALING              -0.13611504  0.1448153  0.13624246
+## HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY -0.11188936 -0.4106785  0.18256071
+## HALLMARK_ANGIOGENESIS                    -0.21488153  0.4068976 -0.20253369
+##                                            GSM918648   GSM918579   GSM918588
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING      -0.02344439  0.18901318 -0.09678623
+## HALLMARK_NOTCH_SIGNALING                 -0.06091883  0.03011819 -0.20017315
+## HALLMARK_APICAL_SURFACE                   0.07493110  0.12183165 -0.29503003
+## HALLMARK_HEDGEHOG_SIGNALING              -0.10746683 -0.08717761 -0.26188620
+## HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY -0.20173916 -0.29745470 -0.01635886
+## HALLMARK_ANGIOGENESIS                    -0.29811366 -0.33884829 -0.34726026
+##                                           GSM918593   GSM918609
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING      0.10470757 -0.03358649
+## HALLMARK_NOTCH_SIGNALING                 0.07884171 -0.17041320
+## HALLMARK_APICAL_SURFACE                  0.13077882 -0.14423027
+## HALLMARK_HEDGEHOG_SIGNALING              0.41290475  0.44047361
+## HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY 0.19095962  0.31628352
+## HALLMARK_ANGIOGENESIS                    0.07781984  0.07979649
+
+
+
+

4.4 Visualizing Results

+
+

4.4.1 Violin plot

+

Now let’s make a violin plot to look at the distribution of scores. First we need to get this data in an appropriate format for ggplot2.

+

REVIEW

+
# The results are a matrix
+long_df <- gsva_results %>%
+  data.frame() %>%
+  # Gene set names are rownames
+  tibble::rownames_to_column("gene_set") %>%
+  # Get into long format
+  tidyr::pivot_longer(
+    cols = -gene_set,
+    names_to = "Kids_First_Biospecimen_ID",
+    values_to = "gsva_score"
+  ) %>%
+  dplyr::arrange(dplyr::desc(gsva_score))
+

Let’s make a violin plot so we can look at the distribution of scores for each pathway.

+

REVIEW

+
# Violin plot comparing GSVA scores of different random gene set sizes
+violin_plot <- long_df %>%
+  ggplot2::ggplot(ggplot2::aes(
+    x = gene_set,
+    y = gsva_score
+  )) +
+  # Make a violin plot that's a pretty blue!
+  ggplot2::geom_violin(fill = "#99CCFF", alpha = 0.5) +
+  # Add a point with the mean value
+  ggplot2::stat_summary(
+    geom = "point",
+    fun = "mean",
+    # Change the aesthetics of the points
+    size = 3,
+    color = "#0066CC",
+    shape = 18
+  ) +
+  # Flip the axes
+  ggplot2::coord_flip() +
+  ggplot2::labs(
+    title = "Gene Set GSVA Scores",
+    x = "gene set",
+    y = "GSVA score"
+  ) +
+  ggplot2::theme_bw()
+
+# Display plot
+violin_plot
+

+

DRAFT

+

What do you notice about these distributions? How might you use this information to inform your interpretation of GSVA scores?

+

Let’s save this plot to PNG.

+
ggplot2::ggsave(file.path(plots_dir, "GSE37418_gsva_violin_plot.png"),
+  plot = violin_plot
+)
+
## Saving 7 x 5 in image
+
+
+

4.4.2 Heatmap

+

Now let’s make a heatmap to look at the distribution of scores. First, we will need to prepare the metadata for plotting.

+

REVIEW

+
# Let's prepare an annotation data frame for plotting
+annotation_df <- metadata %>%
+  # We want to select the variables that we want for annotating the heatmap
+  dplyr::select(
+    refinebio_accession_code,
+    refinebio_sex
+  ) %>%
+  # The `pheatmap()` function requires that the row names of our annotation object matches the column names of our dataset object
+  tibble::column_to_rownames("refinebio_accession_code")
+

Now let’s make a heatmap using the pheatmap() function.

+

REVIEW

+
heatmap_annotated <-
+  pheatmap(
+    gsva_results,
+    cluster_rows = TRUE, # We want to cluster the heatmap by rows (gene sets in this case)
+    cluster_cols = TRUE, # We also want to cluster the heatmap by columns (samples in this case)
+    show_rownames = TRUE, # We want to show the names of the gene sets
+    show_colnames = FALSE, # We don't want to show the rownames because there are too many samples for the labels to be clearly seen
+    annotation_col = annotation_df,
+    fontsize = 8,
+    width = 15,
+    main = "Annotated Heatmap",
+    colorRampPalette(c(
+      "deepskyblue",
+      "black",
+      "yellow"
+    ))(25),
+    scale = "row" # Scale values in the direction of genes (rows)
+  )
+
+# Display heatmap
+heatmap_annotated
+

+

Let’s save this plot to PNG.

+
ggplot2::ggsave(file.path(plots_dir, "GSE37418_gsva_annotated_heatmap.png"),
+  plot = heatmap_annotated
+)
+
## Saving 7 x 5 in image
+
+
+
+

4.5 Write results to file

+

Now let’s write our GSVA results to file.

+
gsva_results %>%
+  as.data.frame() %>%
+  tibble::rownames_to_column("pathway") %>%
+  readr::write_tsv(file.path(
+    results_dir,
+    "GSE37418_gsva_results.tsv"
+  ))
+
+
+
+

5 Resources for further learning

+

DRAFT

+ +
+
+

6 Session info

+

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

+
# Print session info
+sessioninfo::session_info()
+
## ─ Session info ───────────────────────────────────────────────────────────────
+##  setting  value                       
+##  version  R version 4.0.2 (2020-06-22)
+##  os       Ubuntu 20.04 LTS            
+##  system   x86_64, linux-gnu           
+##  ui       X11                         
+##  language (EN)                        
+##  collate  en_US.UTF-8                 
+##  ctype    en_US.UTF-8                 
+##  tz       Etc/UTC                     
+##  date     2020-11-12                  
+## 
+## ─ Packages ───────────────────────────────────────────────────────────────────
+##  package              * version  date       lib source        
+##  annotate               1.66.0   2020-04-27 [1] Bioconductor  
+##  AnnotationDbi        * 1.50.3   2020-07-25 [1] Bioconductor  
+##  assertthat             0.2.1    2019-03-21 [1] RSPM (R 4.0.0)
+##  backports              1.1.10   2020-09-15 [1] RSPM (R 4.0.2)
+##  Biobase              * 2.48.0   2020-04-27 [1] Bioconductor  
+##  BiocGenerics         * 0.34.0   2020-04-27 [1] Bioconductor  
+##  BiocParallel           1.22.0   2020-04-27 [1] Bioconductor  
+##  bit                    4.0.4    2020-08-04 [1] RSPM (R 4.0.2)
+##  bit64                  4.0.5    2020-08-30 [1] RSPM (R 4.0.2)
+##  bitops                 1.0-6    2013-08-17 [1] RSPM (R 4.0.0)
+##  blob                   1.2.1    2020-01-20 [1] RSPM (R 4.0.0)
+##  cli                    2.1.0    2020-10-12 [1] RSPM (R 4.0.2)
+##  coda                   0.19-4   2020-09-30 [1] RSPM (R 4.0.2)
+##  colorspace             1.4-1    2019-03-18 [1] RSPM (R 4.0.0)
+##  crayon                 1.3.4    2017-09-16 [1] RSPM (R 4.0.0)
+##  DBI                    1.1.0    2019-12-15 [1] RSPM (R 4.0.0)
+##  DelayedArray           0.14.1   2020-07-14 [1] Bioconductor  
+##  digest                 0.6.25   2020-02-23 [1] RSPM (R 4.0.0)
+##  dplyr                  1.0.2    2020-08-18 [1] RSPM (R 4.0.2)
+##  ellipsis               0.3.1    2020-05-15 [1] RSPM (R 4.0.0)
+##  emmeans                1.5.1    2020-09-18 [1] RSPM (R 4.0.2)
+##  estimability           1.3      2018-02-11 [1] RSPM (R 4.0.0)
+##  evaluate               0.14     2019-05-28 [1] RSPM (R 4.0.0)
+##  fansi                  0.4.1    2020-01-08 [1] RSPM (R 4.0.0)
+##  farver                 2.0.3    2020-01-16 [1] RSPM (R 4.0.0)
+##  fastmap                1.0.1    2019-10-08 [1] RSPM (R 4.0.0)
+##  fastmatch              1.1-0    2017-01-28 [1] RSPM (R 4.0.0)
+##  fftw                   1.0-6    2020-02-24 [1] RSPM (R 4.0.2)
+##  generics               0.0.2    2018-11-29 [1] RSPM (R 4.0.0)
+##  GenomeInfoDb           1.24.2   2020-06-15 [1] Bioconductor  
+##  GenomeInfoDbData       1.2.3    2020-10-27 [1] Bioconductor  
+##  GenomicRanges          1.40.0   2020-04-27 [1] Bioconductor  
+##  getopt                 1.20.3   2019-03-22 [1] RSPM (R 4.0.0)
+##  ggplot2                3.3.2    2020-06-19 [1] RSPM (R 4.0.1)
+##  glue                   1.4.2    2020-08-27 [1] RSPM (R 4.0.2)
+##  graph                  1.66.0   2020-04-27 [1] Bioconductor  
+##  GSEABase               1.50.1   2020-05-29 [1] Bioconductor  
+##  GSVA                 * 1.36.3   2020-10-07 [1] Bioconductor  
+##  gtable                 0.3.0    2019-03-25 [1] RSPM (R 4.0.0)
+##  hms                    0.5.3    2020-01-08 [1] RSPM (R 4.0.0)
+##  htmltools              0.5.0    2020-06-16 [1] RSPM (R 4.0.1)
+##  httpuv                 1.5.4    2020-06-06 [1] RSPM (R 4.0.2)
+##  IRanges              * 2.22.2   2020-05-21 [1] Bioconductor  
+##  jsonlite               1.7.1    2020-09-07 [1] RSPM (R 4.0.2)
+##  knitr                  1.30     2020-09-22 [1] RSPM (R 4.0.2)
+##  labeling               0.3      2014-08-23 [1] RSPM (R 4.0.0)
+##  later                  1.1.0.1  2020-06-05 [1] RSPM (R 4.0.2)
+##  lattice                0.20-41  2020-04-02 [2] CRAN (R 4.0.2)
+##  lifecycle              0.2.0    2020-03-06 [1] RSPM (R 4.0.0)
+##  limma                * 3.44.3   2020-06-12 [1] Bioconductor  
+##  magrittr             * 1.5      2014-11-22 [1] RSPM (R 4.0.0)
+##  Matrix                 1.2-18   2019-11-27 [2] CRAN (R 4.0.2)
+##  matrixStats            0.57.0   2020-09-25 [1] RSPM (R 4.0.2)
+##  memoise                1.1.0    2017-04-21 [1] RSPM (R 4.0.0)
+##  mime                   0.9      2020-02-04 [1] RSPM (R 4.0.0)
+##  munsell                0.5.0    2018-06-12 [1] RSPM (R 4.0.0)
+##  mvtnorm                1.1-1    2020-06-09 [1] RSPM (R 4.0.0)
+##  nlme                   3.1-148  2020-05-24 [2] CRAN (R 4.0.2)
+##  optparse             * 1.6.6    2020-04-16 [1] RSPM (R 4.0.0)
+##  org.Hs.eg.db         * 3.11.4   2020-11-10 [1] Bioconductor  
+##  pheatmap             * 1.0.12   2019-01-04 [1] RSPM (R 4.0.0)
+##  pillar                 1.4.6    2020-07-10 [1] RSPM (R 4.0.2)
+##  pkgconfig              2.0.3    2019-09-22 [1] RSPM (R 4.0.0)
+##  promises               1.1.1    2020-06-09 [1] RSPM (R 4.0.2)
+##  ps                     1.4.0    2020-10-07 [1] RSPM (R 4.0.2)
+##  purrr                  0.3.4    2020-04-17 [1] RSPM (R 4.0.0)
+##  qusage               * 2.22.0   2020-04-27 [1] Bioconductor  
+##  R.cache                0.14.0   2019-12-06 [1] RSPM (R 4.0.0)
+##  R.methodsS3            1.8.1    2020-08-26 [1] RSPM (R 4.0.2)
+##  R.oo                   1.24.0   2020-08-26 [1] RSPM (R 4.0.2)
+##  R.utils                2.10.1   2020-08-26 [1] RSPM (R 4.0.2)
+##  R6                     2.4.1    2019-11-12 [1] RSPM (R 4.0.0)
+##  RColorBrewer           1.1-2    2014-12-07 [1] RSPM (R 4.0.0)
+##  Rcpp                   1.0.5    2020-07-06 [1] RSPM (R 4.0.2)
+##  RCurl                  1.98-1.2 2020-04-18 [1] RSPM (R 4.0.0)
+##  readr                  1.4.0    2020-10-05 [1] RSPM (R 4.0.2)
+##  rematch2               2.1.2    2020-05-01 [1] RSPM (R 4.0.0)
+##  rlang                  0.4.8    2020-10-08 [1] RSPM (R 4.0.2)
+##  rmarkdown              2.4      2020-09-30 [1] RSPM (R 4.0.2)
+##  RSQLite                2.2.1    2020-09-30 [1] RSPM (R 4.0.2)
+##  rstudioapi             0.11     2020-02-07 [1] RSPM (R 4.0.0)
+##  S4Vectors            * 0.26.1   2020-05-16 [1] Bioconductor  
+##  scales                 1.1.1    2020-05-11 [1] RSPM (R 4.0.0)
+##  sessioninfo            1.1.1    2018-11-05 [1] RSPM (R 4.0.0)
+##  shiny                  1.5.0    2020-06-23 [1] RSPM (R 4.0.2)
+##  shinythemes            1.1.2    2018-11-06 [1] RSPM (R 4.0.0)
+##  stringi                1.5.3    2020-09-09 [1] RSPM (R 4.0.2)
+##  stringr                1.4.0    2019-02-10 [1] RSPM (R 4.0.0)
+##  styler                 1.3.2    2020-02-23 [1] RSPM (R 4.0.0)
+##  SummarizedExperiment   1.18.2   2020-07-09 [1] Bioconductor  
+##  tibble                 3.0.4    2020-10-12 [1] RSPM (R 4.0.2)
+##  tidyr                  1.1.2    2020-08-27 [1] RSPM (R 4.0.2)
+##  tidyselect             1.1.0    2020-05-11 [1] RSPM (R 4.0.0)
+##  vctrs                  0.3.4    2020-08-29 [1] RSPM (R 4.0.2)
+##  withr                  2.3.0    2020-09-22 [1] RSPM (R 4.0.2)
+##  xfun                   0.18     2020-09-29 [1] RSPM (R 4.0.2)
+##  XML                    3.99-0.5 2020-07-23 [1] RSPM (R 4.0.2)
+##  xtable                 1.8-4    2019-04-21 [1] RSPM (R 4.0.0)
+##  XVector                0.28.0   2020-04-27 [1] Bioconductor  
+##  yaml                   2.2.1    2020-02-01 [1] RSPM (R 4.0.0)
+##  zlibbioc               1.34.0   2020-04-27 [1] Bioconductor  
+## 
+## [1] /usr/local/lib/R/site-library
+## [2] /usr/local/lib/R/library
+
+
+

References

+
+
+

Carlson M., 2020 org.Hs.eg.db: Genome wide annotation for human. http://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html

+
+
+

Robinson G., M. Parker, T. A. Kranenburg, C. Lu, and X. Chen et al., 2012 Novel mutations target distinct subgroups of medulloblastoma. Nature 488: 43–48. https://doi.org/10.1038/nature11213

+
+
+
+ + + + +
+
+ +
+ + + + + + + + + + + + + + + + diff --git a/Snakefile b/Snakefile index 7fe64d3f..1015bea5 100644 --- a/Snakefile +++ b/Snakefile @@ -10,6 +10,7 @@ rule target: "02-microarray/gene-id-annotation_microarray_01_ensembl.html", "02-microarray/pathway-analysis_microarray_02_ora.html", "02-microarray/pathway-analysis_microarray_03_gsea.html", + "02-microarray/pathway-analysis_microarray_04_gsva.html", "02-microarray/ortholog-mapping_microarray_01_ensembl.html", "03-rnaseq/00-intro-to-rnaseq.html", "03-rnaseq/clustering_rnaseq_01_heatmap.html", diff --git a/components/dictionary.txt b/components/dictionary.txt index 14385da0..6edccf11 100644 --- a/components/dictionary.txt +++ b/components/dictionary.txt @@ -16,6 +16,8 @@ Brainarray’s Brems CCDL CCDL's +cDNA +centric cheatsheet cheatsheets ChIP @@ -24,6 +26,7 @@ Cmd ColorBrewer Compara ComplexHeatmap +CPMs CREB Crouser cytosolic @@ -63,6 +66,8 @@ glioma GC GSE GSEA +GSVA +Hänzelmann HCOP hexamer HGNC @@ -79,6 +84,7 @@ JPEG KEGG limma logFC +Malhotra medulloblastoma microarray’s molecularly