diff --git a/.github/workflows/docker-build-push.yml b/.github/workflows/docker-build-push.yml
index b3832343..e334fa8f 100644
--- a/.github/workflows/docker-build-push.yml
+++ b/.github/workflows/docker-build-push.yml
@@ -91,3 +91,14 @@ jobs:
           git add -A
           git commit -m 'Render html and publish' || echo "No changes to commit"
           git push origin gh-pages || echo "No changes to push"
+
+      # If we have a failure, Slack us
+      - name: Report failure to Slack
+        if: always()
+        uses: ravsamhq/notify-slack-action@v1.1
+        with:
+          status: ${{ job.status }}
+          notify_when: 'failure'
+        env:
+          SLACK_WEBHOOK_URL: ${{ secrets.ACTION_MONITORING_SLACK }}
+          SLACK_MESSAGE: 'Build, Render, and Push failed'
diff --git a/.github/workflows/docker-build.yml b/.github/workflows/docker-build.yml
index aa065d43..b2a84002 100644
--- a/.github/workflows/docker-build.yml
+++ b/.github/workflows/docker-build.yml
@@ -42,3 +42,14 @@ jobs:
           tags: ccdl/refinebio-examples:latest
           cache-from: type=local,src=/tmp/.buildx-cache
           cache-to: type=local,dest=/tmp/.buildx-cache
+
+      # If we have a failure, Slack us
+      - name: Report failure to Slack
+        if: always()
+        uses: ravsamhq/notify-slack-action@v1.1
+        with:
+          status: ${{ job.status }}
+          notify_when: 'failure'
+        env:
+          SLACK_WEBHOOK_URL: ${{ secrets.ACTION_MONITORING_SLACK }}
+          SLACK_MESSAGE: 'Build Docker failed'
diff --git a/.gitignore b/.gitignore
index c9721416..66921de5 100644
--- a/.gitignore
+++ b/.gitignore
@@ -10,6 +10,7 @@ _site
 */plots/*
 */results/*
 */data/*
+*/gene_sets/*
 
 # markdown spellcheck
 .spelling
diff --git a/01-getting-started/getting-started.html b/01-getting-started/getting-started.html
index 8e68ebbf..e72ead1b 100644
--- a/01-getting-started/getting-started.html
+++ b/01-getting-started/getting-started.html
@@ -1263,25 +1263,22 @@
 };
-
-
+
+
+
+
-
-
+
@@ -2865,15 +3680,20 @@
@@ -3004,7 +3833,7 @@

0.4 How to get the data for these

0.5 How to use R Markdown Documents

We use R Markdown throughout this tutorial. R Markdown documents are helpful for scientific code by allowing you to keep detailed notes, code, and output in one place.

When you execute code within the notebook, the results appear beneath the code. Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.

-
print("The output from the code in this chunk will print below!")
+
print("The output from the code in this chunk will print below!")
## [1] "The output from the code in this chunk will print below!"

R Markdown documents have the added benefit of producing HTML output that is nicely rendered and easy to read. Saving one of our R Markdown files (the files that end in .Rmd) on your computer will create an HTML file alongside it that contains the code and output (this file will end in .nb.html).

See this guide to using R Notebooks for more information about inserting and executing code chunks.

@@ -3045,23 +3874,28 @@

0.8 Additional resources from the

References

-

Huber W., V. J. Carey, R. Gentleman, S. Anders, and M. Carlson et al., 2015 Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods 12: 115–121.

+

Huber W., V. J. Carey, R. Gentleman, S. Anders, and M. Carlson et al., 2015 Orchestrating high-throughput genomic analysis with Bioconductor. Nature Methods 12: 115–121. https://doi.org/10.1038/nmeth.3252

-

R Core Team, 2019 R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

+

R Core Team, 2019 R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org

-

RStudio Team, 2020 RStudio: Integrated development environment for r. RStudio, PBC., Boston, MA.

+

RStudio Team, 2020 RStudio: Integrated development environment for R. RStudio, PBC., Boston, MA. http://www.rstudio.com/

Wickham H., M. Averick, J. Bryan, W. Chang, and L. D. McGowan et al., 2019 Welcome to the tidyverse. Journal of Open Source Software 4: 1686. https://doi.org/10.21105/joss.01686

-

Wickham H., J. Hester, and W. Chang, 2020 Devtools: Tools to make developing r packages easier.

+

Wickham H., J. Hester, and W. Chang, 2020 devtools: Tools to make developing R packages easier. https://CRAN.R-project.org/package=devtools

+
diff --git a/02-microarray/00-intro-to-microarray.Rmd b/02-microarray/00-intro-to-microarray.Rmd
index b0b00b25..fa6e0cb9 100644
--- a/02-microarray/00-intro-to-microarray.Rmd
+++ b/02-microarray/00-intro-to-microarray.Rmd
@@ -9,7 +9,7 @@ output:
 
 Data analyses are generally not "one size fits all"; this is particularly true with approaches used to analyze RNA-seq and microarray data.
 The characteristics of the data produced by these two technologies can be quite different.
-This tutorial has example analyses [organized by technology](../01-getting-started/getting-started.html#about-how-this-tutorial-book-is-structured) so you can follow examples that are more closely tailored to the nature of the data at hand.
+This tutorial has example analyses [organized by technology](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#about-how-this-tutorial-book-is-structured) so you can follow examples that are more closely tailored to the nature of the data at hand.
 
@@ -28,7 +28,7 @@ This tutorial has example analyses [organized by technology](../01-getting-start
 
 ## Introduction to microarray technology
 
-Microarrays measure gene expression using chips filled with oligonucleotide probes designed to hybridize to labeled RNA samples.
+Microarrays measure gene expression using chips filled with oligonucleotide probes designed to hybridize to labeled RNA samples.
 After hybridization, the microarrays are scanned, and the fluorescence intensity for each probe is measured.
 The fluorescence intensity indicates the number of labeled fragments bound and therefore the relative quantity of the transcript the probe is designed for.
@@ -36,7 +36,7 @@ The fluorescence intensity indicates the number of labeled fragments bound and t
 [based on diagram from @microarray-video]
 
-There are many different kinds of microarray platforms, which can be broadly separated into single-color and [two-color arrays](https://www.ebi.ac.uk/training/online/course/functional-genomics-ii-common-technologies-and-data-analysis-methods/microarrays).
+There are many different kinds of microarray platforms, which can be broadly separated into single-color and [two-color arrays](https://www.ebi.ac.uk/training/online/course/functional-genomics-ii-common-technologies-and-data-analysis-methods/microarrays).
 At this time, refine.bio only supports single-color arrays, so our examples and advice are generally from the perspective of using single-color arrays.
 The diagram above shows an overview of the single-color array process which includes extracting the total RNA from a sample, labeling the RNA with fluorescent dye, hybridizing the labels, and scanning the fluorescent image to analyze the fluorescence intensity.
@@ -45,27 +45,31 @@ A longer list of specific arrays that are supported by refine.bio can be found [
 
 As with all experimental methods, microarrays have strengths and limitations that you should consider in regards to your scientific questions.
 
-### Microarray data **strengths**:
+### Microarray data **strengths**:
 
-- Microarray is generally less expensive than RNA-seq - you can afford more replicates and get higher statistical power [@Tarca2006].
-- Microarray has generally had a faster turn-around than RNA-seq [@LCSciences2014].
+- Microarrays historically were less expensive than RNA-seq allowing for more replicates and greater statistical power [@Tarca2006].
+- Microarrays generally had a faster turn-around than RNA-seq [@LCSciences2014].
 
-### Microarray data **limitations**:
+As a result of these historical advantages, vast quantities of data have been generated worldwide using microarrays.
+The microarray data compiled by refine.bio includes over 500,000 individual samples across over 25,000 experiments.
+For many scientific questions, the best available gene expression data may be microarray based!
 
-- If a transcript doesn't have a probe designed to it on a microarray, it won't be measured; standard microarrays can't be used for transcript discovery [@Mantione2014].
+### Microarray data **limitations**:
+
+- If a transcript doesn't have a probe designed to it on a microarray, it won't be measured; standard microarrays can't be used for transcript discovery [@Mantione2014].
 - A chip's probe designs are only as up to date as the genome annotation at the time it was designed [@Mantione2014].
 - As is true for all techniques that involve nucleotide hybridization (RNA-seq too), microarray probes come with some biases depending on their nucleotide sequence composition (like GC bias).
-Refine.bio drops outdated probes based on [Brainarray’s annotation packages](http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/) and uses [SCAN](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3508193/pdf/nihms401888.pdf) normalization methods prior to your downloads to help address these probe nucleotide composition biases [@Dai2005; @Piccolo2012].
+refine.bio drops outdated probes based on [Brainarray’s annotation packages](http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/) and uses [SCAN](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3508193/pdf/nihms401888.pdf) normalization methods prior to your downloads to help address these probe nucleotide composition biases [@Dai2005; @Piccolo2012].
 
 ## About quantile normalization
 
 Microarray chips are generally experimentally processed in groups of chips - this can lead to [experimental batch effects](https://en.wikipedia.org/wiki/Batch_effect#:~:text=In%20molecular%20biology%2C%20a%20batch,of%20interest%20in%20an%20experiment).
 To minimize this, all refine.bio microarray data downloads come [quantile-normalized](https://en.wikipedia.org/wiki/Quantile_normalization) which enables more confident comparisons of expression levels among experiments.
-Different microarray chips are also a type of batch effect, but quantile normalization allows us to compare data from different chips that to a limited degree if we proceed with caution.
+The use of different microarray chips is also a type of batch effect, but quantile normalization allows us to compare data from different chips to a limited degree, if we proceed with caution.
 See the refine.bio docs for more about the microarray processing steps, including the [quantile normalization](http://docs.refine.bio/en/latest/main_text.html#quantile-normalization).
 
-## More resources on microarray technology:
+## More resources on microarray technology:
 
 - [Getting started in gene expression microarray analysis](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000543) [@Slonim2009].
 - [Microarray and its applications](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3467903/) [@Govindarajan2012].
@@ -79,13 +83,13 @@ See the refine.bio docs for more about the microarray processing steps, includin
 
 - A common and simple reason you may not see your gene of interest is that the microarray chip used in the experiment you are analyzing did not originally have probes designed to target that gene.
 
-- Refine.bio uses [Brainarray packages](http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/) to annotate the microarray probe data for microarray platforms that have this available [@Dai2005].
+- refine.bio uses [Brainarray packages](http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/) to annotate the microarray probe data for microarray platforms that have this available [@Dai2005].
 This annotation identifies which probes map to which genes according to the updated transcriptome annotation (which likely changed since the microarray’s probes were first designed).
 Some probes may have since become obsolete (they do not bind reliably to one location according to updated genome annotations), which may result in the gene they targeted being removed.
 If your gene of interest was covered by the original probes of the microarray chip and the version of the Brainarray package used maintains that it is still accurate, your gene of interest will show up in the Gene column.
 You can find your dataset’s microarray chip and Brainarray version information on the refine.bio dataset page and [by following these instructions](TODO: Put link to refine.bio docs FAQ when https://github.com/AlexsLemonade/refinebio-docs/issues/137 is addressed).
 
 - One additional reason you may not see a gene of interest applies only if you are using refine.bio's [aggregate by species](https://docs.refine.bio/en/latest/main_text.html#aggregations) option.
-When data is aggregated across different platforms, only the genes common to both/all experiments aggregated will be kept.
+When data is aggregated across different platforms, only the genes common to all experiments aggregated will be kept.
 ## References
diff --git a/02-microarray/00-intro-to-microarray.html b/02-microarray/00-intro-to-microarray.html
index b9d85539..123fbfc1 100644
--- a/02-microarray/00-intro-to-microarray.html
+++ b/02-microarray/00-intro-to-microarray.html
@@ -2550,6 +2550,623 @@ PagedTableDoc.initAll();
 };
+
+
+
-
+
+ code.sourceCode > span { display: inline-block; line-height: 1.25; }
+ code.sourceCode > span { color: inherit; text-decoration: inherit; }
+ code.sourceCode > span:empty { height: 1.2em; }
+ .sourceCode { overflow: visible; }
+ code.sourceCode { white-space: pre; position: relative; }
+ div.sourceCode { margin: 1em 0; }
+ pre.sourceCode { margin: 0; }
+ @media screen {
+ div.sourceCode { overflow: auto; }
+ }
+ @media print {
+ code.sourceCode { white-space: pre-wrap; }
+ code.sourceCode > span { text-indent: -5em; padding-left: 5em; }
+ }
+ pre.numberSource code
+ { counter-reset: source-line 0; }
+ pre.numberSource code > span
+ { position: relative; left: -4em; counter-increment: source-line; }
+ pre.numberSource code > span > a:first-child::before
+ { content: counter(source-line);
+ position: relative; left: -1em; text-align: right; vertical-align: baseline;
+ border: none; display: inline-block;
+ -webkit-touch-callout: none; -webkit-user-select: none;
+ -khtml-user-select: none; -moz-user-select: none;
+ -ms-user-select: none; user-select: none;
+ padding: 0 4px; width: 4em;
+ color: #aaaaaa;
+ }
+ pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; }
+ div.sourceCode
+ { }
+ @media screen {
+ code.sourceCode > span > a:first-child::before { text-decoration: underline; }
+ }
+ code span.al { color: #ff0000; } /* Alert */
+ code span.an { color: #008000; } /* Annotation */
+ code span.at { } /* Attribute */
+ code span.bu { } /* BuiltIn */
+ code span.cf { color: #0000ff; } /* ControlFlow */
+ code span.ch { color: #008080; } /* Char */
+ code span.cn { } /* Constant */
+ code span.co { color: #008000; } /* Comment */
+ code span.cv { color: #008000; } /* CommentVar */
+ code span.do { color: #008000; } /* Documentation */
+ code span.er { color: #ff0000; font-weight: bold; } /* Error */
+ code span.ex { } /* Extension */
+ code span.im { } /* Import */
+ code span.in { color: #008000; } /* Information */
+ code span.kw { color: #0000ff; } /* Keyword */
+ code span.op { } /* Operator */
+ code span.ot { color: #ff4000; } /* Other */
+ code span.pp { color: #ff4000; } /* Preprocessor */
+ code span.sc { color: #008080; } /* SpecialChar */
+ code span.ss { color: #008080; } /* SpecialString */
+ code span.st { color: #008080; } /* String */
+ code span.va { } /* Variable */
+ code span.vs { color: #008080; } /* VerbatimString */
+ code span.wa { color: #008000; font-weight: bold; } /* Warning */
+
+
+
+
-
-
+
@@ -2874,15 +3686,20 @@
@@ -2971,26 +3797,26 @@

2.1 Obtain the .Rmd

2.2 Set up your analysis folders

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

-
# Create the data folder if it doesn't exist
-if (!dir.exists("data")) {
-  dir.create("data")
-}
-
-# Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
-
-# Create the plots folder if it doesn't exist
-if (!dir.exists(plots_dir)) {
-  dir.create(plots_dir)
-}
-
-# Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
-
-# Create the results folder if it doesn't exist
-if (!dir.exists(results_dir)) {
-  dir.create(results_dir)
-}
+
# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+  dir.create("data")
+}
+
+# Define the path to the directory where plots will be saved
+plots_dir <- "plots"
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+  dir.create(plots_dir)
+}
+
+# Define the path to the results directory
+results_dir <- "results"
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+  dir.create(results_dir)
+}

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!
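
The three folder-creation steps above follow the same pattern, so they could be collapsed into a loop. The following is a minimal sketch, not part of the original notebook: `make_dir_if_missing()` is a hypothetical helper, and the folders are created inside a temporary directory (an assumption made so the sketch does not touch your project).

```r
# Hypothetical helper (not in the original notebook): create a directory
# only if it does not already exist, and return its path invisibly
make_dir_if_missing <- function(path) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  invisible(path)
}

# Work inside a temporary directory for this illustration
base_dir <- file.path(tempdir(), "example_project")

# Build paths for the same three folder names used in this analysis
folders <- file.path(base_dir, c("data", "plots", "results"))

# Create each folder if it is missing
for (folder in folders) {
  make_dir_if_missing(folder)
}

# Confirm that every folder now exists
all(dir.exists(folders))
```

Running the loop a second time is harmless, because each folder is only created when it is missing.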

@@ -3040,20 +3866,25 @@

2.6 Check out our file structure!

In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy because if we want to switch the dataset we are using for this analysis (see the next section for more on this), we will only have to change the file paths here to get started.

-
# Define the file path to the data directory
-data_dir <- file.path("data", "GSE24862") # Replace with accession number which will be the name of the folder the files will be in
-
-# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
-data_file <- file.path(data_dir, "GSE24862.tsv") # Replace with file path to your dataset
-
-# Declare the file path to the metadata file using the data directory saved as `data_dir`
-metadata_file <- file.path(data_dir, "metadata_GSE24862.tsv") # Replace with file path to your metadata
+
# Define the file path to the data directory
+# Replace with the path of the folder the files will be in
+data_dir <- file.path("data", "GSE24862")
+
+# Declare the file path to the gene expression matrix file
+# inside the directory saved as `data_dir`
+# Replace with the path to your dataset file
+data_file <- file.path(data_dir, "GSE24862.tsv")
+
+# Declare the file path to the metadata file
+# inside the directory saved as `data_dir`
+# Replace with the path to your metadata file
+metadata_file <- file.path(data_dir, "metadata_GSE24862.tsv")

Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

-
# Check if the gene expression matrix file is at the file path stored in `data_file`
-file.exists(data_file)
+
# Check if the gene expression matrix file is at the path stored in `data_file`
+file.exists(data_file)
## [1] TRUE
-
# Check if the metadata file is at the file path stored in `metadata_file`
-file.exists(metadata_file)
+
# Check if the metadata file is at the file path stored in `metadata_file`
+file.exists(metadata_file)
## [1] TRUE

If the chunk above printed out FALSE for either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.

If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.
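
If you would rather have the notebook stop with a clear message than print FALSE, the file checks above could be wrapped in a small function. This is a sketch under stated assumptions: `check_files_exist()` is a hypothetical helper that is not part of the original notebook, and a temporary file stands in for a real data file.

```r
# Hypothetical helper (not in the original notebook): stop with a clear
# message if any of the supplied file paths is missing
check_files_exist <- function(paths) {
  missing_files <- paths[!file.exists(paths)]
  if (length(missing_files) > 0) {
    stop("Missing file(s): ", paste(missing_files, collapse = ", "))
  }
  invisible(TRUE)
}

# A temporary file stands in for a real data file in this illustration
example_file <- tempfile(fileext = ".tsv")
file.create(example_file)

# This check passes silently because the file exists
check_files_exist(example_file)

# A path that does not exist triggers an informative error,
# which we capture here so the sketch keeps running
result <- tryCatch(
  check_files_exist(file.path(tempdir(), "not_a_real_file.tsv")),
  error = function(e) conditionMessage(e)
)
result
```

In the analysis itself, the call would presumably be `check_files_exist(c(data_file, metadata_file))`.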

@@ -3071,25 +3902,26 @@

4 Clustering Heatmap - Microarray

4.1 Install libraries

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

-

In this analysis, we will be using the R package pheatmap for clustering and creating a heatmap (Slowikowski 2017).

-
if (!("pheatmap" %in% installed.packages())) {
-  # Install pheatmap
-  install.packages("pheatmap", update = FALSE)
-}
+

In this analysis, we will be using the R package pheatmap for clustering and creating a heatmap (Slowikowski 2017).

+
if (!("pheatmap" %in% installed.packages())) {
+  # Install pheatmap
+  install.packages("pheatmap", update = FALSE)
+}

Attach the pheatmap library:

-
# Attach the `pheatmap` library
-library(pheatmap)
-
-# We will need this so we can use the pipe: %>%
-library(magrittr)
+
# Attach the `pheatmap` library
+library(pheatmap)
+
+# We will need this so we can use the pipe: %>%
+library(magrittr)

4.2 Import and set up data

Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read in both TSV files and add them as data frames to your environment.

We stored our file paths as objects named metadata_file and data_file in this previous step.

-
# Read in metadata TSV file
-metadata <- readr::read_tsv(metadata_file)
-
## Parsed with column specification:
+
# Read in metadata TSV file
+metadata <- readr::read_tsv(metadata_file)
+
## 
+## ── Column specification ──────────────────────────────────────────────
 ## cols(
 ##   .default = col_character(),
 ##   refinebio_age = col_logical(),
@@ -3107,82 +3939,83 @@ 

4.2 Import and set up data

 ##   `contact_zip/postal_code` = col_double(),
 ##   data_row_count = col_double(),
 ##   taxid_ch1 = col_double()
-## )
-
## See spec(...) for full column specifications.
-
# Read in data TSV file
-df <- readr::read_tsv(data_file) %>%
-  # Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later
-  tibble::column_to_rownames("Gene")
-
## Parsed with column specification:
+## )
+## ℹ Use `spec()` for the full column specifications.
+
# Read in data TSV file
+df <- readr::read_tsv(data_file) %>%
+  # Here we are going to store the gene IDs as row names so that
+  # we have only numeric values to perform calculations on later
+  tibble::column_to_rownames("Gene")
+
## 
+## ── Column specification ──────────────────────────────────────────────
 ## cols(
 ##   .default = col_double(),
 ##   Gene = col_character()
 ## )
-## See spec(...) for full column specifications.
+## ℹ Use `spec()` for the full column specifications.

Let’s take a look at the metadata object that we read into the R environment.

-
head(metadata)
+
head(metadata)

Now let’s ensure that the metadata and data are in the same sample order.

-
# Make the data in the order of the metadata
-df <- df %>% dplyr::select(metadata$refinebio_accession_code)
-
-# Check if this is in the same order
-all.equal(colnames(df), metadata$refinebio_accession_code)
+
# Make the data in the order of the metadata
+df <- df %>% dplyr::select(metadata$refinebio_accession_code)
+
+# Check if this is in the same order
+all.equal(colnames(df), metadata$refinebio_accession_code)
## [1] TRUE
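
To see what the reordering step does, here is a small sketch with made-up data. The sample names and values are invented for illustration, and base R indexing is used in place of `dplyr::select()` so the sketch has no package dependencies.

```r
# Toy expression data with samples in a different order than the metadata
toy_df <- data.frame(
  sample_c = c(1.2, 3.4),
  sample_a = c(5.6, 7.8),
  sample_b = c(9.1, 2.3),
  row.names = c("gene_1", "gene_2")
)

# Toy metadata; these accession codes are made up
toy_metadata <- data.frame(
  refinebio_accession_code = c("sample_a", "sample_b", "sample_c"),
  stringsAsFactors = FALSE
)

# Reorder the columns to match the metadata
# (a base R stand-in for the `dplyr::select()` step)
toy_df <- toy_df[, toy_metadata$refinebio_accession_code]

# Confirm the order now matches
identical(colnames(toy_df), toy_metadata$refinebio_accession_code)
```

One thing to keep in mind: when the two vectors differ, `all.equal()` returns a character description of the differences rather than `FALSE`, so `identical()` or `isTRUE(all.equal(...))` is the safer choice inside an `if` statement.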

Now we are going to use a combination of functions from base R and the pheatmap package to look at how our samples and genes are clustering.

4.3 Choose genes of interest

-

Although you may want to create a heatmap including all of the genes in the set, alternatively, the heatmap could be created using only genes of interest. For this example, we will sort genes by variance, but there are many alternative criterion by which you may want to sort your genes e.g. fold change, t-statistic, membership to a particular gene ontology, so on.

-
# Calculate the variance for each gene
-variances <- apply(df, 1, var)
-
-# Determine the upper quartile variance cutoff value
-upper_var <- quantile(variances, 0.75)
-
-# Subset the data choosing only genes whose variances are in the upper quartile
-df_by_var <- data.frame(df) %>%
-  dplyr::filter(variances > upper_var)
+

Although you may want to create a heatmap including all of the genes in the dataset, this can produce a very large image that is hard to interpret. Alternatively, the heatmap could be created using only genes of interest. For this example, we will sort genes by variance and select genes in the upper quartile, but there are many alternative criteria by which you may want to sort your genes, e.g. fold change, t-statistic, membership in a particular gene ontology, and so on.

+
# Calculate the variance for each gene
+variances <- apply(df, 1, var)
+
+# Determine the upper quartile variance cutoff value
+upper_var <- quantile(variances, 0.75)
+
+# Filter the data choosing only genes whose variances are in the upper quartile
+df_by_var <- data.frame(df) %>%
+  dplyr::filter(variances > upper_var)
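
The variance filter can be checked on a toy matrix where we know which genes should pass. In this sketch all values are made up, and two genes are deliberately given much larger variance than the rest; base R subsetting is used instead of `dplyr::filter()`.

```r
# Toy matrix: 8 genes (rows) by 4 samples (columns); values are made up
set.seed(2020)
toy_mat <- matrix(
  rnorm(32),
  nrow = 8,
  dimnames = list(paste0("gene_", 1:8), paste0("sample_", 1:4))
)

# Make two genes clearly more variable than the rest
toy_mat["gene_3", ] <- c(-10, 10, -10, 10)
toy_mat["gene_7", ] <- c(-8, 8, 8, -8)

# Calculate the variance for each gene (row)
variances <- apply(toy_mat, 1, var)

# Determine the upper quartile variance cutoff value
upper_var <- quantile(variances, 0.75)

# Keep only genes whose variance is above the cutoff
high_var <- toy_mat[variances > upper_var, , drop = FALSE]

# With 8 genes, the two most variable genes remain
rownames(high_var)
```

The `drop = FALSE` argument keeps the result a matrix even if only one gene passes the cutoff.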

4.4 Create a heatmap

-

To further customize the heatmap, see a vignette for a guide at this link (Slowikowski 2017).

-
# Create and store the heatmap object
-heatmap <-
-  pheatmap(
-    df_by_var,
-    cluster_rows = TRUE, # We want to cluster the heatmap by rows (genes in this case)
-    cluster_cols = TRUE, # We also want to cluster the heatmap by columns (samples in this case),
-    show_rownames = FALSE, # We don't want to show the rownames because there are too many genes for the labels to be clearly seen
-    main = "Non-Annotated Heatmap",
-    colorRampPalette(c(
-      "deepskyblue",
-      "black",
-      "yellow"
-    ))(25),
-    scale = "row" # Scale values in the direction of genes (rows)
-  )
+

To further customize the heatmap, see a vignette for a guide at this link (Slowikowski 2017).

+
# Create and store the heatmap object
+heatmap <- pheatmap(
+  df_by_var,
+  cluster_rows = TRUE, # Cluster the rows of the heatmap (genes in this case)
+  cluster_cols = TRUE, # Cluster the columns of the heatmap (samples)
+  show_rownames = FALSE, # There are too many genes to clearly show the labels
+  main = "Non-Annotated Heatmap",
+  colorRampPalette(c(
+    "deepskyblue",
+    "black",
+    "yellow"
+  ))(25),
+  scale = "row" # Scale values in the direction of genes (rows)
+)
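
The `scale = "row"` argument asks for values to be centered and scaled within each gene. The sketch below illustrates that kind of row-wise z-score in base R on made-up values; it is meant to convey the idea, and it is an assumption that this matches the package's internal calculation exactly.

```r
# Toy matrix with two genes on very different scales; values are made up
toy_mat <- matrix(
  c(
    1, 2, 3,
    100, 200, 300
  ),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(c("gene_low", "gene_high"), paste0("sample_", 1:3))
)

# scale() works on columns, so transpose, scale, and transpose back
# to center each gene at 0 and give it a standard deviation of 1
row_scaled <- t(scale(t(toy_mat)))

row_scaled
```

After scaling, both genes show the same pattern across samples even though their raw values differ by two orders of magnitude, which is why row scaling makes expression patterns comparable across genes in a heatmap.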

We’ve created a heatmap, but although our genes and samples are clustered, there is not much information we can gather here because we did not provide the pheatmap() function with annotation labels for our samples.

First let’s save our clustered heatmap.

4.4.1 Save heatmap as a PNG

You can easily save to a JPEG or TIFF instead by changing the function (for example, to jpeg() or tiff()) and changing the suffix of the file name to match.

-
# Open a PNG file
-png(file.path(
-  plots_dir,
-  "GSE24862_heatmap_non_annotated.png" # Replace file name with a relevant output plot name
-))
-
-# Print your heatmap
-heatmap
-
-# Close the PNG file:
-dev.off()
+
# Open a PNG file
+png(file.path(
+  plots_dir,
+  "GSE24862_heatmap_non_annotated.png" # Replace with a relevant file name
+))
+
+# Print your heatmap
+heatmap
+
+# Close the PNG file:
+dev.off()
## png 
 ##   2

Now, let’s add some annotation bars to our heatmap.

@@ -3190,44 +4023,47 @@

4.4.1 Save heatmap as a PNG

4.5 Prepare metadata for annotation

-

From the accompanying paper, we know that three PLX4032-sensitive parental cell lines (M229, M238 and M249) and three derived PLX4032-resistant (r) sub-lines (M229_r5, M238_r1, and M249_r4) were treated or not treated with the RAF-selective inhibitor, PLX4032 (Nazarian et al. 2010). We are going to annotate our heatmap with the variables that hold the refinebio_cell_line and refinebio_treatment data. We are also going to create a new column variable from our existing metadata called cell_line_type, that will distinguish whether the refinebio_cell_line is parental or resistant – since this is also a key aspect of the experimental design. Note that this step is very specific to our metadata, you may find that you also need to tailor the metadata for your own needs.

-
# Let's prepare an annotation data frame for plotting
-annotation_df <- metadata %>%
-  # We want to select the variables that we want for annotating the heatmap
-  dplyr::select(
-    refinebio_accession_code,
-    refinebio_cell_line,
-    refinebio_treatment
-  ) %>%
-  # Let's create a variable that specifically distinguishes whether the cell line is parental or resistant -- since this is a key aspect of the experimental design
-  dplyr::mutate(
-    cell_line_type =
-      dplyr::case_when(
-        stringr::str_detect(refinebio_cell_line, "_r") ~ "resistant",
-        TRUE ~ "parental"
-      )
-  ) %>%
-  # The `pheatmap()` function requires that the row names of our annotation object matches the column names of our dataset object
-  tibble::column_to_rownames("refinebio_accession_code")
+

From the accompanying paper, we know that three PLX4032-sensitive parental cell lines (M229, M238 and M249) and three derived PLX4032-resistant (r) sub-lines (M229_r5, M238_r1, and M249_r4) were treated or not treated with the RAF-selective inhibitor, PLX4032 (Nazarian et al. 2010). We are going to annotate our heatmap with the variables that hold the refinebio_cell_line and refinebio_treatment data. We are also going to create a new column variable from our existing metadata called cell_line_type, that will distinguish whether the refinebio_cell_line is parental or resistant – since this is also a key aspect of the experimental design. Note that this step is very specific to our metadata, you may find that you also need to tailor the metadata for your own needs.

+
# Let's prepare an annotation data frame for plotting
+annotation_df <- metadata %>%
+  # We want to select the variables that we want for annotating the heatmap
+  dplyr::select(
+    refinebio_accession_code,
+    refinebio_cell_line,
+    refinebio_treatment
+  ) %>%
+  # Let's create a variable that specifically distinguishes whether
+  # the cell line is parental or resistant.
+  # This is a key aspect of the experimental design
+  dplyr::mutate(
+    cell_line_type =
+      dplyr::case_when(
+        stringr::str_detect(refinebio_cell_line, "_r") ~ "resistant",
+        TRUE ~ "parental"
+      )
+  ) %>%
+  # The `pheatmap()` function requires that the row names of our
+  # annotation object matches the column names of our dataset object
+  tibble::column_to_rownames("refinebio_accession_code")
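
The labeling rule in the `dplyr::case_when()` step can be tried on its own. This sketch uses the cell line names mentioned in the text and swaps in base R's `ifelse()` and `grepl()` for the `dplyr` and `stringr` functions, so it runs without extra packages.

```r
# Cell line names as described in the text above
cell_lines <- c("M229", "M229_r5", "M238", "M238_r1", "M249", "M249_r4")

# Label a cell line "resistant" if its name contains "_r",
# and "parental" otherwise
cell_line_type <- ifelse(
  grepl("_r", cell_lines, fixed = TRUE),
  "resistant",
  "parental"
)

# View the names next to their labels
data.frame(cell_lines, cell_line_type)
```

Because the rule depends on the "_r" naming convention of this particular dataset, it would need to be adapted for metadata that encodes resistance differently.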

4.5.1 Create annotated heatmap

You can create an annotated heatmap by providing our annotation object to the annotation_col argument of the pheatmap() function.

-
# Create and store the annotated heatmap object
-heatmap_annotated <-
-  pheatmap(
-    df_by_var,
-    cluster_rows = TRUE,
-    cluster_cols = TRUE,
-    show_rownames = FALSE,
-    annotation_col = annotation_df,
-    main = "Annotated Heatmap",
-    colorRampPalette(c(
-      "deepskyblue",
-      "black",
-      "yellow"
-    ))(25),
-    scale = "row" # Scale values in the direction of genes (rows)
-  )
+
# Create and store the annotated heatmap object
+heatmap_annotated <-
+  pheatmap(
+    df_by_var,
+    cluster_rows = TRUE,
+    cluster_cols = TRUE,
+    show_rownames = FALSE,
+    annotation_col = annotation_df,
+    main = "Annotated Heatmap",
+    colorRampPalette(c(
+      "deepskyblue",
+      "black",
+      "yellow"
+    ))(25),
+    scale = "row" # Scale values with respect to genes (rows)
+  )

Now that we have annotation bars on our heatmap, we have a better idea of the cell line and treatment groups that appear to cluster together. More specifically, we can see that the samples seem to cluster by their cell lines of origin, but not necessarily as much by whether or not they received the PLX4032 treatment.

Let’s save our annotated heatmap.

@@ -3235,17 +4071,17 @@

4.5.1 Create annotated heatmap

4.5.2 Save annotated heatmap as a PNG

You can switch this to save a JPEG or TIFF instead by changing the function (to jpeg() or tiff()) and the file suffix in the file name.

-
# Open a PNG file
-png(file.path(
-  plots_dir,
-  "GSE24862_heatmap_annotated.png" # Replace file name with a relevant output plot name
-))
-
-# Print your heatmap
-heatmap_annotated
-
-# Close the PNG file:
-dev.off()
+
# Open a PNG file
+png(file.path(
+  plots_dir,
+  "GSE24862_heatmap_annotated.png" # Replace with a relevant plot file name
+))
+
+# Print your heatmap
+heatmap_annotated
+
+# Close the PNG file:
+dev.off()
## png 
 ##   2
@@ -3254,16 +4090,16 @@

4.5.2 Save annotated heatmap as a

5 Further learning resources about this analysis

+

diff --git a/02-microarray/differential-expression_microarray_01_2-groups.Rmd b/02-microarray/differential-expression_microarray_01_2-groups.Rmd index 8390ccc2..7e171ac2 100644 --- a/02-microarray/differential-expression_microarray_01_2-groups.Rmd +++ b/02-microarray/differential-expression_microarray_01_2-groups.Rmd @@ -44,7 +44,7 @@ if (!dir.exists("data")) { } # Define the file path to the plots directory -plots_dir <- "plots" # Can replace with path to desired output plots directory +plots_dir <- "plots" # Create the plots folder if it doesn't exist if (!dir.exists(plots_dir)) { @@ -52,7 +52,7 @@ if (!dir.exists(plots_dir)) { } # Define the file path to the results directory -results_dir <- "results" # Can replace with path to desired output results directory +results_dir <- "results" # Create the results folder if it doesn't exist if (!dir.exists(results_dir)) { @@ -81,8 +81,8 @@ You will get an email when it is ready. ## About the dataset we are using for this example -For this example analysis, we will use this [CREB overexpression zebrafish](https://www.refine.bio/experiments/GSE71270/creb-overexpression-induces-leukemia-in-zebrafish-by-blocking-myeloid-differentiation-process). -@Tregnago2016 measured microarray gene expression of zebrafish samples overexpressing human CREB, as well as control samples. +For this example analysis, we will use this [zebrafish gene expression dataset](https://www.refine.bio/experiments/GSE71270/creb-overexpression-induces-leukemia-in-zebrafish-by-blocking-myeloid-differentiation-process). +@Tregnago2016 used microarrays to measure gene expression of ten zebrafish samples, five overexpressing human CREB, as well as five control samples. In this analysis, we will test differential expression between the control and CREB-overexpressing groups. 
## Place the dataset in your new `data/` folder @@ -124,19 +124,24 @@ This is handy to do because if we want to switch the dataset (see next section f ```{r} # Define the file path to the data directory -data_dir <- file.path("data", "GSE71270") # Replace with accession number which will be the name of the folder the files will be in - -# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir` -data_file <- file.path(data_dir, "GSE71270.tsv") # Replace with file path to your dataset - -# Declare the file path to the metadata file using the data directory saved as `data_dir` -metadata_file <- file.path(data_dir, "metadata_GSE71270.tsv") # Replace with file path to your metadata +# Replace with the path of the folder the files will be in +data_dir <- file.path("data", "GSE71270") + +# Declare the file path to the gene expression matrix file +# inside directory saved as `data_dir` +# Replace with the path to your dataset file +data_file <- file.path(data_dir, "GSE71270.tsv") + +# Declare the file path to the metadata file +# inside the directory saved as `data_dir` +# Replace with the path to your metadata file +metadata_file <- file.path(data_dir, "metadata_GSE71270.tsv") ``` Now that our file paths are declared, we can use the `file.exists()` function to check that the files are where we specified above. 
```{r} -# Check if the gene expression matrix file is at the file path stored in `data_file` +# Check if the gene expression matrix file is at the path stored in `data_file` file.exists(data_file) # Check if the metadata file is at the file path stored in `metadata_file` @@ -164,7 +169,7 @@ From here you can customize this analysis example to fit your own scientific que See our Getting Started page with [instructions for package installation](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#what-you-need-to-install) for a list of the other software you will need, as well as more tips and resources. In this analysis, we will be using `limma` for differential expression [@Ritchie2015]. -We will also use `EnhancedVolcano` for plotting and `apeglm` for some log fold change estimates in the results table [@Blighe2020; @Zhu2018]. +We will also use `EnhancedVolcano` for plotting [@Blighe2020]. ```{r} if (!("limma" %in% installed.packages())) { @@ -175,15 +180,11 @@ if (!("EnhancedVolcano" %in% installed.packages())) { # Install this package if it isn't installed yet BiocManager::install("EnhancedVolcano", update = FALSE) } -if (!("apeglm" %in% installed.packages())) { - # Install this package if it isn't installed yet - BiocManager::install("apeglm", update = FALSE) -} ``` Attach the packages we need for this analysis. -```{r} +```{r message=FALSE} # Attach the library library(limma) @@ -214,8 +215,8 @@ We stored our file paths as objects named `metadata_file` and `data_file` in [th metadata <- readr::read_tsv(metadata_file) # Read in data TSV file -df <- readr::read_tsv(data_file) %>% - # Tuck away the Gene ID column as rownames +expression_df <- readr::read_tsv(data_file) %>% + # Tuck away the Gene ID column as row names tibble::column_to_rownames("Gene") ``` @@ -223,7 +224,7 @@ Let's ensure that the metadata and data are in the same sample order. 
```{r} # Make the data in the order of the metadata -df <- df %>% +expression_df <- expression_df %>% dplyr::select(metadata$geo_accession) # Check if this is in the same order @@ -233,45 +234,74 @@ all.equal(colnames(df), metadata$geo_accession) ## Set up design matrix `limma` needs a numeric design matrix to signify which are CREB and control samples. -Here we are using the treatments supplied in the metadata to create a design matrix where the "none" samples are assigned `0` and the "amputated" samples are assigned `1`. -Note that the metadata variables that signify the treatment groups might be different across datasets and might not always be underneath the category. +Here we are using the treatments described in the metadata table in the `genotype/variation` column to create a design matrix where the "control" samples are assigned `0` and the "overexpressing the human CREB" samples are assigned `1`. +Note that the metadata columns that signify the treatment groups might be different across datasets, and will almost certainly have different contents. -The `genotype/variation` column contains group information we will be using for differential expression. -But the `/` it contains in its column name makes it more annoying to access. +While the `genotype/variation` column contains the group information we will be using for differential expression, the `/` it contains in its column name makes it more annoying to access. Accessing variable that have names with special characters like `/`, or spaces, require extra work-arounds to ignore R's normal interpretations of these characters. +Here we will rename it as just `genotype` to make our lives later much easier. + +We will also recode the contents of the column, as `overexpressing the human CREB"` is a bit of an unruly name. +To do this, we will use the `fct_recode()` function from the `forcats` package, simplifying `"overexpressing the human CREB"` to just `CREB`. 
+We will also use `fct_relevel()` to make sure our `control` samples appear first in the factor levels. + ```{r} +# These renaming steps will not be the same (or might not be needed at all) +# with a different dataset metadata <- metadata %>% - dplyr::rename("genotype" = `genotype/variation`) # This step will not be the same (or might not be needed at all) with a different dataset + # rename the column + dplyr::rename("genotype" = `genotype/variation`) %>% + # change the names and order of the genotypes (making the column a factor) + dplyr::mutate( + genotype = genotype %>% + # rename the "overexpressing..." genotype to "CREB" + forcats::fct_recode(CREB = "overexpressing the human CREB") %>% + # make "control" the first level of the factor + forcats::fct_relevel("control") + ) ``` Now we will create a model matrix based on our newly renamed `genotype` variable. ```{r} # Create the design matrix from the genotype information -des_mat <- model.matrix(~ metadata$genotype) +des_mat <- model.matrix(~genotype, data = metadata) + +# Look at the design matrix +head(des_mat) ``` +When we look at this design matrix, we see that there is now a `genotypeCREB` column that defines the group for each sample: 0 for control samples and 1 for the CREB samples. +(The model will also fit an intercept for all samples, so we can see that here as well.) + ## Perform differential expression -After applying our data to linear model, in this example we apply empirical Bayes smoothing and Benjamini-Hochberg multiple testing correction. -The `topTable()` function default is to use Benjamini-Hochberg but this can be changed to a different method using the `adjust.method` argument (see the `?topTable` help page for more about the options). +We will use the `lmFit()` function from the `limma` package to test each gene for differential expression between the two groups using a linear model. 
+After fitting our data to the linear model, in this example we apply empirical Bayes smoothing with the `eBayes()` function. + +Here's a [nifty article and example](http://varianceexplained.org/r/empirical_bayes_baseball/) about what the empirical Bayes smoothing is for [@bayes-estimates]. ```{r} # Apply linear model to data -fit <- lmFit(df, design = des_mat) +fit <- lmFit(expression_df, design = des_mat) # Apply empirical Bayes to smooth standard errors fit <- eBayes(fit) +``` +Because we are testing many different genes at once, we also want to perform some multiple test corrections, which we will do with the Benjamini-Hochberg method while making a table of results with `topTable()`. +The `topTable()` function default is to use Benjamini-Hochberg but this can be changed to a different method using the `adjust.method` argument (see the `?topTable` help page for more about the options). + +```{r} # Apply multiple testing correction and obtain stats -stats_df <- topTable(fit, number = nrow(df)) %>% +stats_df <- topTable(fit, number = nrow(expression_df)) %>% tibble::rownames_to_column("Gene") ``` -Let's take a peek at what our results table looks like. +Let's take a peek at our results table. -```{r} +```{r rownames.print = FALSE} head(stats_df) ``` @@ -285,14 +315,14 @@ To test if these results make sense, we can make a plot of one of top genes. Let's try extracting the data for `ENSDARG00000104315` and set up its own data frame for plotting purposes. ```{r} -top_gene_df <- df %>% - # Extract this gene from `df` +top_gene_df <- expression_df %>% + # Extract this gene from `expression_df` dplyr::filter(rownames(.) 
== "ENSDARG00000104315") %>% # Transpose so the gene is a column t() %>% - # Transpose made this a matrix, let's make it back into a data.frame like before + # Transpose made this a matrix, let's make it back into a data frame data.frame() %>% - # Store the sample ids as their own column instead of being row names + # Store the sample ids as their own column instead of as row names tibble::rownames_to_column("refinebio_accession_code") %>% # Join on the selected columns from metadata dplyr::inner_join(dplyr::select( diff --git a/02-microarray/differential-expression_microarray_01_2-groups.html b/02-microarray/differential-expression_microarray_01_2-groups.html index 3d0a488d..0c8a334b 100644 --- a/02-microarray/differential-expression_microarray_01_2-groups.html +++ b/02-microarray/differential-expression_microarray_01_2-groups.html @@ -1263,25 +1263,22 @@ }; - - + + code.sourceCode > span { display: inline-block; line-height: 1.25; } + code.sourceCode > span { color: inherit; text-decoration: inherit; } + code.sourceCode > span:empty { height: 1.2em; } + .sourceCode { overflow: visible; } + code.sourceCode { white-space: pre; position: relative; } + div.sourceCode { margin: 1em 0; } + pre.sourceCode { margin: 0; } + @media screen { + div.sourceCode { overflow: auto; } + } + @media print { + code.sourceCode { white-space: pre-wrap; } + code.sourceCode > span { text-indent: -5em; padding-left: 5em; } + } + pre.numberSource code + { counter-reset: source-line 0; } + pre.numberSource code > span + { position: relative; left: -4em; counter-increment: source-line; } + pre.numberSource code > span > a:first-child::before + { content: counter(source-line); + position: relative; left: -1em; text-align: right; vertical-align: baseline; + border: none; display: inline-block; + -webkit-touch-callout: none; -webkit-user-select: none; + -khtml-user-select: none; -moz-user-select: none; + -ms-user-select: none; user-select: none; + padding: 0 4px; width: 4em; + color: #aaaaaa; + 
} + pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; } + div.sourceCode + { } + @media screen { + code.sourceCode > span > a:first-child::before { text-decoration: underline; } + } + code span.al { color: #ff0000; } /* Alert */ + code span.an { color: #008000; } /* Annotation */ + code span.at { } /* Attribute */ + code span.bu { } /* BuiltIn */ + code span.cf { color: #0000ff; } /* ControlFlow */ + code span.ch { color: #008080; } /* Char */ + code span.cn { } /* Constant */ + code span.co { color: #008000; } /* Comment */ + code span.cv { color: #008000; } /* CommentVar */ + code span.do { color: #008000; } /* Documentation */ + code span.er { color: #ff0000; font-weight: bold; } /* Error */ + code span.ex { } /* Extension */ + code span.im { } /* Import */ + code span.in { color: #008000; } /* Information */ + code span.kw { color: #0000ff; } /* Keyword */ + code span.op { } /* Operator */ + code span.ot { color: #ff4000; } /* Other */ + code span.pp { color: #ff4000; } /* Preprocessor */ + code span.sc { color: #008080; } /* SpecialChar */ + code span.ss { color: #008080; } /* SpecialString */ + code span.st { color: #008080; } /* String */ + code span.va { } /* Variable */ + code span.vs { color: #008080; } /* VerbatimString */ + code span.wa { color: #008000; font-weight: bold; } /* Warning */ + + + + - - + @@ -2874,15 +3686,20 @@ @@ -2971,26 +3797,26 @@

2.1 Obtain the .Rmd

2.2 Set up your analysis folders

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

-
# Create the data folder if it doesn't exist
-if (!dir.exists("data")) {
-  dir.create("data")
-}
-
-# Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
-
-# Create the plots folder if it doesn't exist
-if (!dir.exists(plots_dir)) {
-  dir.create(plots_dir)
-}
-
-# Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
-
-# Create the results folder if it doesn't exist
-if (!dir.exists(results_dir)) {
-  dir.create(results_dir)
-}
+
# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+  dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots"
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+  dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results"
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+  dir.create(results_dir)
+}

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

@@ -3005,7 +3831,7 @@

2.3 Obtain the dataset from refin

2.4 About the dataset we are using for this example

-

For this example analysis, we will use this CREB overexpression zebrafish. Tregnago et al. (2016) measured microarray gene expression of zebrafish samples overexpressing human CREB, as well as control samples. In this analysis, we will test differential expression between the control and CREB-overexpressing groups.

+

For this example analysis, we will use this zebrafish gene expression dataset. Tregnago et al. (2016) used microarrays to measure gene expression of ten zebrafish samples, five overexpressing human CREB, as well as five control samples. In this analysis, we will test differential expression between the control and CREB-overexpressing groups.

2.5 Place the dataset in your new data/ folder

@@ -3039,20 +3865,25 @@

2.6 Check out our file structure!

In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.

-
# Define the file path to the data directory
-data_dir <- file.path("data", "GSE71270") # Replace with accession number which will be the name of the folder the files will be in
-
-# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
-data_file <- file.path(data_dir, "GSE71270.tsv") # Replace with file path to your dataset
-
-# Declare the file path to the metadata file using the data directory saved as `data_dir`
-metadata_file <- file.path(data_dir, "metadata_GSE71270.tsv") # Replace with file path to your metadata
+
# Define the file path to the data directory
+# Replace with the path of the folder the files will be in
+data_dir <- file.path("data", "GSE71270")
+
+# Declare the file path to the gene expression matrix file
+# inside the directory saved as `data_dir`
+# Replace with the path to your dataset file
+data_file <- file.path(data_dir, "GSE71270.tsv")
+
+# Declare the file path to the metadata file
+# inside the directory saved as `data_dir`
+# Replace with the path to your metadata file
+metadata_file <- file.path(data_dir, "metadata_GSE71270.tsv")

Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

-
# Check if the gene expression matrix file is at the file path stored in `data_file`
-file.exists(data_file)
+
# Check if the gene expression matrix file is at the path stored in `data_file`
+file.exists(data_file)
## [1] TRUE
-
# Check if the metadata file is at the file path stored in `metadata_file`
-file.exists(metadata_file)
+
# Check if the metadata file is at the file path stored in `metadata_file`
+file.exists(metadata_file)
## [1] TRUE

If the chunk above printed out FALSE to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
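
The same check can be sketched so that it reports which file is missing instead of only printing TRUE or FALSE. This is an illustrative sketch, not part of the original analysis: the `toy_` paths are hypothetical stand-ins created in a temporary folder, where the tutorial would use `data_file` and `metadata_file`.

```r
# Hypothetical stand-in paths inside a temporary folder
toy_dir <- tempfile("toy_data_")
dir.create(toy_dir)
toy_data_file <- file.path(toy_dir, "toy.tsv")
toy_metadata_file <- file.path(toy_dir, "metadata_toy.tsv")

# Create only the data file so we can see what a missing file looks like
writeLines("Gene\tS1", toy_data_file)

# Check each path; `vapply()` keeps the names so we know which is which
toy_paths <- c(data = toy_data_file, metadata = toy_metadata_file)
files_found <- vapply(toy_paths, file.exists, logical(1))

# Report any paths that do not point to an existing file
missing_files <- toy_paths[!files_found]
if (length(missing_files) > 0) {
  message("Missing files: ", paste(missing_files, collapse = ", "))
}
```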

If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.

@@ -3070,38 +3901,35 @@

4 Differential Expression - Micro

4.1 Install libraries

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

-

In this analysis, we will be using limma for differential expression (Ritchie et al. 2015). We will also use EnhancedVolcano for plotting and apeglm for some log fold change estimates in the results table (Zhu et al. 2018; Blighe et al. 2020).

-
if (!("limma" %in% installed.packages())) {
-  # Install this package if it isn't installed yet
-  BiocManager::install("limma", update = FALSE)
-}
-if (!("EnhancedVolcano" %in% installed.packages())) {
-  # Install this package if it isn't installed yet
-  BiocManager::install("EnhancedVolcano", update = FALSE)
-}
-if (!("apeglm" %in% installed.packages())) {
-  # Install this package if it isn't installed yet
-  BiocManager::install("apeglm", update = FALSE)
-}
+

In this analysis, we will be using limma for differential expression (Ritchie et al. 2015). We will also use EnhancedVolcano for plotting (Blighe et al. 2020).

+
if (!("limma" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("limma", update = FALSE)
+}
+if (!("EnhancedVolcano" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("EnhancedVolcano", update = FALSE)
+}

Attach the packages we need for this analysis.

-
# Attach the library
-library(limma)
-
-# We will need this so we can use the pipe: %>%
-library(magrittr)
-
-# We'll use this for plotting
-library(ggplot2)
+
# Attach the library
+library(limma)
+
+# We will need this so we can use the pipe: %>%
+library(magrittr)
+
+# We'll use this for plotting
+library(ggplot2)

The jitter plot we make later on with geom_jitter() involves some randomness. As is good practice when our analysis involves randomness, we will set the seed.

-
set.seed(12345)
+
set.seed(12345)
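
To illustrate what setting the seed buys us, here is a small sketch using `runif()` as a stand-in for any step that involves randomness (the jitter plot itself is not drawn here).

```r
# Draw random numbers after setting the seed
set.seed(12345)
first_draw <- runif(3)

# Reset to the same seed and draw again
set.seed(12345)
second_draw <- runif(3)

# The two draws are identical because the seed was the same
identical(first_draw, second_draw)
```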

4.2 Import and set up data

Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read both TSV files and add them as data frames to your environment.

We stored our file paths as objects named metadata_file and data_file in this previous step.

-
# Read in metadata TSV file
-metadata <- readr::read_tsv(metadata_file)
-
## Parsed with column specification:
+
# Read in metadata TSV file
+metadata <- readr::read_tsv(metadata_file)
+
## 
+## ── Column specification ──────────────────────────────────────────────
 ## cols(
 ##   .default = col_character(),
 ##   refinebio_age = col_logical(),
@@ -3121,13 +3949,14 @@ 

4.2 Import and set up data

## `contact_zip/postal_code` = col_double(), ## data_row_count = col_double(), ## taxid_ch1 = col_double() -## )
-
## See spec(...) for full column specifications.
-
# Read in data TSV file
-df <- readr::read_tsv(data_file) %>%
-  # Tuck away the Gene ID column as rownames
-  tibble::column_to_rownames("Gene")
-
## Parsed with column specification:
+## )
+## ℹ Use `spec()` for the full column specifications.
+
# Read in data TSV file
+expression_df <- readr::read_tsv(data_file) %>%
+  # Tuck away the Gene ID column as row names
+  tibble::column_to_rownames("Gene")
+
## 
+## ── Column specification ──────────────────────────────────────────────
 ## cols(
 ##   Gene = col_character(),
 ##   GSM1831675 = col_double(),
@@ -3142,43 +3971,67 @@ 

4.2 Import and set up data

## GSM1831684 = col_double() ## )

Let’s ensure that the metadata and data are in the same sample order.

-
# Make the data in the order of the metadata
-df <- df %>%
-  dplyr::select(metadata$geo_accession)
-
-# Check if this is in the same order
-all.equal(colnames(df), metadata$geo_accession)
-
## [1] TRUE
+
# Make the data in the order of the metadata
+expression_df <- expression_df %>%
+  dplyr::select(metadata$geo_accession)
+
+# Check if this is in the same order
+all.equal(colnames(expression_df), metadata$geo_accession)
+
## [1] TRUE
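
As a sketch of the reorder-then-check pattern on toy data, the block below uses base R column indexing as a stand-in for `dplyr::select()`. The sample IDs and values are hypothetical.

```r
# Hypothetical toy data; sample IDs are made up for illustration
toy_metadata <- data.frame(
  geo_accession = c("S1", "S2", "S3"),
  stringsAsFactors = FALSE
)
toy_expression_df <- data.frame(
  S3 = c(5.1, 6.2),
  S1 = c(4.8, 7.0),
  S2 = c(5.5, 6.6),
  row.names = c("gene1", "gene2")
)

# Reorder the columns to follow the order of the metadata
toy_expression_df <- toy_expression_df[, toy_metadata$geo_accession]

# TRUE only when the names agree in both content and order
all.equal(colnames(toy_expression_df), toy_metadata$geo_accession)
```

When the two do not match, `all.equal()` returns a character description of the difference rather than `FALSE`, so wrap it in `isTRUE()` if you use it inside an `if` statement.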

4.3 Set up design matrix

-

limma needs a numeric design matrix to signify which are CREB and control samples. Here we are using the treatments supplied in the metadata to create a design matrix where the “none” samples are assigned 0 and the “amputated” samples are assigned 1. Note that the metadata variables that signify the treatment groups might be different across datasets and might not always be underneath the category.

-

The genotype/variation column contains group information we will be using for differential expression. But the / it contains in its column name makes it more annoying to access.
-Accessing variable that have names with special characters like /, or spaces, require extra work-arounds to ignore R’s normal interpretations of these characters.

-
metadata <- metadata %>%
-  dplyr::rename("genotype" = `genotype/variation`) # This step will not be the same (or might not be needed at all) with a different dataset
+

limma needs a numeric design matrix to signify which samples are CREB and which are control. Here we are using the treatments described in the metadata table in the genotype/variation column to create a design matrix where the “control” samples are assigned 0 and the “overexpressing the human CREB” samples are assigned 1. Note that the metadata columns that signify the treatment groups might be different across datasets, and will almost certainly have different contents.

+

While the genotype/variation column contains the group information we will be using for differential expression, the / it contains in its column name makes it more annoying to access.
+Accessing variables that have names with special characters, like / or spaces, requires extra workarounds to get past R’s normal interpretation of these characters. Here we will rename the column to just genotype to make our lives much easier later.

+

We will also recode the contents of the column, as "overexpressing the human CREB" is a bit of an unruly name. To do this, we will use the fct_recode() function from the forcats package, simplifying "overexpressing the human CREB" to just CREB. We will also use fct_relevel() to make sure our control samples appear first in the factor levels.

+
# These renaming steps will not be the same (or might not be needed at all)
+# with a different dataset
+metadata <- metadata %>%
+  # rename the column
+  dplyr::rename("genotype" = `genotype/variation`) %>%
+  # change the names and order of the genotypes (making the column a factor)
+  dplyr::mutate(
+    genotype = genotype %>%
+      # rename the "overexpressing..." genotype to "CREB"
+      forcats::fct_recode(CREB = "overexpressing the human CREB") %>%
+      # make "control" the first level of the factor
+      forcats::fct_relevel("control")
+  )

Now we will create a model matrix based on our newly renamed genotype variable.

-
# Create the design matrix from the genotype information
-des_mat <- model.matrix(~ metadata$genotype)
+
# Create the design matrix from the genotype information
+des_mat <- model.matrix(~genotype, data = metadata)
+
+# Look at the design matrix
+head(des_mat)
+
##   (Intercept) genotypeCREB
+## 1           1            1
+## 2           1            0
+## 3           1            0
+## 4           1            1
+## 5           1            0
+## 6           1            0
+

When we look at this design matrix, we see that there is now a genotypeCREB column that defines the group for each sample: 0 for control samples and 1 for the CREB samples. (The model will also fit an intercept for all samples, so we can see that here as well.)
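
To see why the order of the factor levels matters, here is a minimal sketch on toy data. It uses base R `factor()` with explicit `levels` as a stand-in for the `forcats` functions, and the genotype labels for the six samples are hypothetical.

```r
# Hypothetical genotype labels for six samples
toy_metadata <- data.frame(
  genotype = c("CREB", "control", "control", "CREB", "control", "CREB")
)

# Make `genotype` a factor with "control" as the first (reference) level
toy_metadata$genotype <- factor(
  toy_metadata$genotype,
  levels = c("control", "CREB")
)

# Build the design matrix; the reference level is absorbed by the intercept
toy_des_mat <- model.matrix(~genotype, data = toy_metadata)
toy_des_mat
```

If the levels were in the opposite order, the second column would be named `genotypecontrol` instead, and the sign of the estimated log fold changes would flip.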

4.4 Perform differential expression

-

After applying our data to linear model, in this example we apply empirical Bayes smoothing and Benjamini-Hochberg multiple testing correction. The topTable() function default is to use Benjamini-Hochberg but this can be changed to a different method using the adjust.method argument (see the ?topTable help page for more about the options).

-
# Apply linear model to data
-fit <- lmFit(df, design = des_mat)
-
-# Apply empirical Bayes to smooth standard errors
-fit <- eBayes(fit)
-
-# Apply multiple testing correction and obtain stats
-stats_df <- topTable(fit, number = nrow(df)) %>%
-  tibble::rownames_to_column("Gene")
+

We will use the lmFit() function from the limma package to test each gene for differential expression between the two groups using a linear model. After fitting our data to the linear model, in this example we apply empirical Bayes smoothing with the eBayes() function.

+

Here’s a nifty article and example about what the empirical Bayes smoothing is for (Robinson).

+
# Apply linear model to data
+fit <- lmFit(expression_df, design = des_mat)
+
+# Apply empirical Bayes to smooth standard errors
+fit <- eBayes(fit)
+

Because we are testing many different genes at once, we also want to correct for multiple testing, which we will do with the Benjamini-Hochberg method while making a table of results with topTable(). The topTable() function default is to use Benjamini-Hochberg but this can be changed to a different method using the adjust.method argument (see the ?topTable help page for more about the options).

+
# Apply multiple testing correction and obtain stats
+stats_df <- topTable(fit, number = nrow(expression_df)) %>%
+  tibble::rownames_to_column("Gene")
## Removing intercept from test coefficients
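
For intuition about what the Benjamini-Hochberg adjustment does to a set of p values, here is a small sketch with base R's `p.adjust()`. The raw p values are hypothetical and this does not reproduce the `limma` results; it only shows the adjustment itself.

```r
# Hypothetical raw p values for five genes
raw_p <- c(0.001, 0.01, 0.03, 0.04, 0.5)

# Benjamini-Hochberg adjustment with base R
adj_p <- p.adjust(raw_p, method = "BH")

# Compare raw and adjusted values side by side
data.frame(raw_p, adj_p)
```

Each adjusted value is at least as large as its raw p value, which is why fewer genes pass a given cutoff after correction.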
-

Let’s take a peek at what our results table looks like.

-
head(stats_df)
+

Let’s take a peek at our results table.

+
head(stats_df)

By default, results are ordered by largest B (the log odds value) to the smallest, which means your most differentially expressed genes should be toward the top.

@@ -3187,71 +4040,78 @@

4.4 Perform differential expressi

4.5 Check results by plotting one gene

To test if these results make sense, we can make a plot of one of the top genes. Let’s try extracting the data for ENSDARG00000104315 and set up its own data frame for plotting purposes.

-
top_gene_df <- df %>%
-  # Extract this gene from `df`
-  dplyr::filter(rownames(.) == "ENSDARG00000104315") %>%
-  # Transpose so the gene is a column
-  t() %>%
-  # Transpose made this a matrix, let's make it back into a data.frame like before
-  data.frame() %>%
-  # Store the sample ids as their own column instead of being row names
-  tibble::rownames_to_column("refinebio_accession_code") %>%
-  # Join on the selected columns from metadata
-  dplyr::inner_join(dplyr::select(
-    metadata,
-    refinebio_accession_code,
-    genotype
-  ))
+
top_gene_df <- expression_df %>%
+  # Extract this gene from `expression_df`
+  dplyr::filter(rownames(.) == "ENSDARG00000104315") %>%
+  # Transpose so the gene is a column
+  t() %>%
+  # Transpose made this a matrix, let's make it back into a data frame
+  data.frame() %>%
+  # Store the sample ids as their own column instead of as row names
+  tibble::rownames_to_column("refinebio_accession_code") %>%
+  # Join on the selected columns from metadata
+  dplyr::inner_join(dplyr::select(
+    metadata,
+    refinebio_accession_code,
+    genotype
+  ))
## Joining, by = "refinebio_accession_code"

Let’s take a sneak peek at what our top_gene_df looks like.

-
top_gene_df
+
top_gene_df

Now let’s plot the data for ENSDARG00000104315 using our top_gene_df.

-
ggplot(top_gene_df, aes(x = genotype, y = ENSDARG00000104315, color = genotype)) +
-  geom_jitter(width = 0.2, height = 0) + # We'll make this a jitter plot
-  theme_classic() # This makes some aesthetic changes
-

+
ggplot(top_gene_df, aes(x = genotype, y = ENSDARG00000104315, color = genotype)) +
+  geom_jitter(width = 0.2, height = 0) + # We'll make this a jitter plot
+  theme_classic() # This makes some aesthetic changes
+

These results make sense. The overexpressing CREB group samples have much higher expression values for ENSDARG00000104315 than the control samples do.

4.6 Write results to file

The results in stats_df will be saved to our results/ directory.

-
readr::write_tsv(stats_df, file.path(
-  results_dir,
-  "GSE71270_limma_results.tsv" # Replace with a relevant output name
-))
+
readr::write_tsv(stats_df, file.path(
+  results_dir,
+  "GSE71270_limma_results.tsv" # Replace with a relevant output name
+))

4.7 Make a volcano plot

-

We’ll use the EnhancedVolcano package’s main function to plot our data (Zhu et al. 2018).

-
EnhancedVolcano::EnhancedVolcano(stats_df,
-  lab = stats_df$Gene, # This has to be a vector with our labels we want for our genes
-  x = "logFC", # This is the column name in `stats_df` that contains what we want on the x axis
-  y = "adj.P.Val" # This is the column name in `stats_df` that contains what we want on the y axis
-)
-

+

We’ll use the EnhancedVolcano package’s main function to plot our data (Zhu et al. 2018).

+
EnhancedVolcano::EnhancedVolcano(stats_df,
+  lab = stats_df$Gene, # This has to be a vector with our labels we want for our genes
+  x = "logFC", # This is the column name in `stats_df` that contains what we want on the x axis
+  y = "adj.P.Val" # This is the column name in `stats_df` that contains what we want on the y axis
+)
+
## Registered S3 methods overwritten by 'ggalt':
+##   method                  from   
+##   grid.draw.absoluteGrob  ggplot2
+##   grobHeight.absoluteGrob ggplot2
+##   grobWidth.absoluteGrob  ggplot2
+##   grobX.absoluteGrob      ggplot2
+##   grobY.absoluteGrob      ggplot2
+

In this plot, green points represent genes that meet the log2 fold change cutoff, which by default is an absolute value of 1.
But there are no genes that meet the p value cutoff, which by default is 1e-05. We used the adjusted p values for our plot above, so you may want to adjust this cutoff with the pCutoff argument (take a look at all the options for tailoring this plot using ?EnhancedVolcano).
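
To see how the two cutoffs interact, here is a minimal sketch that applies them by hand to a toy results table. The data frame `toy_stats_df` and its values are made up, and the strict `>` and `<` comparisons are an assumption about how the cutoffs are applied, so treat this as an illustration rather than a description of what `EnhancedVolcano()` does internally.

```r
# Toy stand-in for `stats_df`; these values are made up for illustration
toy_stats_df <- data.frame(
  Gene = paste0("gene_", 1:5),
  logFC = c(2.5, -1.8, 0.3, 1.2, -0.1),
  adj.P.Val = c(0.001, 0.2, 0.004, 0.000001, 0.6)
)

# Default cutoffs described in the text above
fc_cutoff <- 1
p_cutoff <- 1e-05

# Which genes pass each cutoff?
passes_fc <- abs(toy_stats_df$logFC) > fc_cutoff
passes_p <- toy_stats_df$adj.P.Val < p_cutoff

# A looser p value cutoff, as might suit adjusted p values
passes_loose <- toy_stats_df$adj.P.Val < 0.01

# Number of genes passing both cutoffs under each p value setting
sum(passes_fc & passes_p)
sum(passes_fc & passes_loose)
```

In this toy table, loosening the p value cutoff lets more genes pass both criteria, which is the effect the `pCutoff` argument has on the plot.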

Let’s make the same plot again, but adjust the pCutoff since we are using multiple-testing corrected p values, and this time we will assign the plot to our environment as volcano_plot.

-
volcano_plot <- EnhancedVolcano::EnhancedVolcano(stats_df,
-  lab = stats_df$Gene,
-  x = "logFC",
-  y = "adj.P.Val",
-  pCutoff = 0.01 # Because we are using adjusted p values, we can loosen this a bit
-)
-
-# Print out our plot
-volcano_plot
-

+
volcano_plot <- EnhancedVolcano::EnhancedVolcano(stats_df,
+  lab = stats_df$Gene,
+  x = "logFC",
+  y = "adj.P.Val",
+  pCutoff = 0.01 # Because we are using adjusted p values, we can loosen this a bit
+)
+
+# Print out our plot
+volcano_plot
+

Let’s save this plot to a PNG file.

-
ggsave(
-  plot = volcano_plot,
-  file.path(plots_dir, "GSE71270_volcano_plot.png")
-) # Replace with a plot name relevant to your data
+
ggsave(
+  plot = volcano_plot,
+  file.path(plots_dir, "GSE71270_volcano_plot.png")
+) # Replace with a plot name relevant to your data
## Saving 7 x 5 in image

@@ -3259,18 +4119,18 @@

4.7 Make a volcano plot

5 Resources for further learning

6 Session info

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording which versions of software and packages you used to run it.

-
# Print session info
-sessioninfo::session_info()
-
## ─ Session info ───────────────────────────────────────────────────────────────
+
# Print session info
+sessioninfo::session_info()
+
## ─ Session info ─────────────────────────────────────────────────────
 ##  setting  value                       
 ##  version  R version 4.0.2 (2020-06-22)
 ##  os       Ubuntu 20.04 LTS            
@@ -3280,63 +4140,79 @@ 

6 Session info

## collate en_US.UTF-8 ## ctype en_US.UTF-8 ## tz Etc/UTC -## date 2020-10-16 +## date 2020-12-17 ## -## ─ Packages ─────────────────────────────────────────────────────────────────── -## package * version date lib source -## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.0) -## backports 1.1.10 2020-09-15 [1] RSPM (R 4.0.2) -## cli 2.0.2 2020-02-28 [1] RSPM (R 4.0.0) -## colorspace 1.4-1 2019-03-18 [1] RSPM (R 4.0.0) -## crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.0) -## digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0) -## dplyr 1.0.2 2020-08-18 [1] RSPM (R 4.0.2) -## ellipsis 0.3.1 2020-05-15 [1] RSPM (R 4.0.0) -## EnhancedVolcano 1.6.0 2020-04-27 [1] Bioconductor -## evaluate 0.14 2019-05-28 [1] RSPM (R 4.0.0) -## fansi 0.4.1 2020-01-08 [1] RSPM (R 4.0.0) -## farver 2.0.3 2020-01-16 [1] RSPM (R 4.0.0) -## generics 0.0.2 2018-11-29 [1] RSPM (R 4.0.0) -## getopt 1.20.3 2019-03-22 [1] RSPM (R 4.0.0) -## ggplot2 * 3.3.2 2020-06-19 [1] RSPM (R 4.0.1) -## ggrepel 0.8.2 2020-03-08 [1] RSPM (R 4.0.2) -## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.2) -## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.0.0) -## hms 0.5.3 2020-01-08 [1] RSPM (R 4.0.0) -## htmltools 0.5.0 2020-06-16 [1] RSPM (R 4.0.1) -## jsonlite 1.7.1 2020-09-07 [1] RSPM (R 4.0.2) -## knitr 1.30 2020-09-22 [1] RSPM (R 4.0.2) -## labeling 0.3 2014-08-23 [1] RSPM (R 4.0.0) -## lifecycle 0.2.0 2020-03-06 [1] RSPM (R 4.0.0) -## limma * 3.44.3 2020-06-12 [1] Bioconductor -## magrittr * 1.5 2014-11-22 [1] RSPM (R 4.0.0) -## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.0.0) -## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0) -## pillar 1.4.6 2020-07-10 [1] RSPM (R 4.0.2) -## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.0) -## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.0) -## R.cache 0.14.0 2019-12-06 [1] RSPM (R 4.0.0) -## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.2) -## R.oo 1.24.0 2020-08-26 [1] RSPM (R 4.0.2) -## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.2) -## R6 2.4.1 2019-11-12 [1] RSPM (R 4.0.0) -## Rcpp 1.0.5 2020-07-06 [1] RSPM (R 4.0.2) 
-## readr 1.3.1 2018-12-21 [1] RSPM (R 4.0.2) -## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.0) -## rlang 0.4.7 2020-07-09 [1] RSPM (R 4.0.2) -## rmarkdown 2.4 2020-09-30 [1] RSPM (R 4.0.2) -## rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0) -## scales 1.1.1 2020-05-11 [1] RSPM (R 4.0.0) -## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.0) -## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.2) -## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.0) -## styler 1.3.2 2020-02-23 [1] RSPM (R 4.0.0) -## tibble 3.0.3 2020-07-10 [1] RSPM (R 4.0.2) -## tidyselect 1.1.0 2020-05-11 [1] RSPM (R 4.0.0) -## vctrs 0.3.4 2020-08-29 [1] RSPM (R 4.0.2) -## withr 2.3.0 2020-09-22 [1] RSPM (R 4.0.2) -## xfun 0.18 2020-09-29 [1] RSPM (R 4.0.2) -## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.0) +## ─ Packages ───────────────────────────────────────────────────────── +## package * version date lib source +## ash 1.0-15 2015-09-01 [1] RSPM (R 4.0.0) +## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.0) +## backports 1.1.10 2020-09-15 [1] RSPM (R 4.0.2) +## beeswarm 0.2.3 2016-04-25 [1] RSPM (R 4.0.0) +## cli 2.1.0 2020-10-12 [1] RSPM (R 4.0.2) +## colorspace 1.4-1 2019-03-18 [1] RSPM (R 4.0.0) +## crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.0) +## digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0) +## dplyr 1.0.2 2020-08-18 [1] RSPM (R 4.0.2) +## ellipsis 0.3.1 2020-05-15 [1] RSPM (R 4.0.0) +## EnhancedVolcano 1.8.0 2020-10-27 [1] Bioconductor +## evaluate 0.14 2019-05-28 [1] RSPM (R 4.0.0) +## extrafont 0.17 2014-12-08 [1] RSPM (R 4.0.0) +## extrafontdb 1.0 2012-06-11 [1] RSPM (R 4.0.0) +## fansi 0.4.1 2020-01-08 [1] RSPM (R 4.0.0) +## farver 2.0.3 2020-01-16 [1] RSPM (R 4.0.0) +## forcats 0.5.0 2020-03-01 [1] RSPM (R 4.0.0) +## generics 0.0.2 2018-11-29 [1] RSPM (R 4.0.0) +## getopt 1.20.3 2019-03-22 [1] RSPM (R 4.0.0) +## ggalt 0.4.0 2017-02-15 [1] RSPM (R 4.0.0) +## ggbeeswarm 0.6.0 2017-08-07 [1] RSPM (R 4.0.0) +## ggplot2 * 3.3.2 2020-06-19 [1] RSPM (R 4.0.1) +## ggrastr 0.2.1 2020-09-14 [1] RSPM (R 4.0.2) +## ggrepel 
0.8.2 2020-03-08 [1] RSPM (R 4.0.2) +## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.2) +## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.0.0) +## hms 0.5.3 2020-01-08 [1] RSPM (R 4.0.0) +## htmltools 0.5.0 2020-06-16 [1] RSPM (R 4.0.1) +## jsonlite 1.7.1 2020-09-07 [1] RSPM (R 4.0.2) +## KernSmooth 2.23-17 2020-04-26 [2] CRAN (R 4.0.2) +## knitr 1.30 2020-09-22 [1] RSPM (R 4.0.2) +## labeling 0.3 2014-08-23 [1] RSPM (R 4.0.0) +## lifecycle 0.2.0 2020-03-06 [1] RSPM (R 4.0.0) +## limma * 3.46.0 2020-10-27 [1] Bioconductor +## magrittr * 1.5 2014-11-22 [1] RSPM (R 4.0.0) +## maps 3.3.0 2018-04-03 [1] RSPM (R 4.0.0) +## MASS 7.3-51.6 2020-04-26 [2] CRAN (R 4.0.2) +## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.0.0) +## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0) +## pillar 1.4.6 2020-07-10 [1] RSPM (R 4.0.2) +## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.0) +## proj4 1.0-10 2020-03-02 [1] RSPM (R 4.0.0) +## ps 1.4.0 2020-10-07 [1] RSPM (R 4.0.2) +## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.0) +## R.cache 0.14.0 2019-12-06 [1] RSPM (R 4.0.0) +## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.2) +## R.oo 1.24.0 2020-08-26 [1] RSPM (R 4.0.2) +## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.2) +## R6 2.4.1 2019-11-12 [1] RSPM (R 4.0.0) +## RColorBrewer 1.1-2 2014-12-07 [1] RSPM (R 4.0.0) +## Rcpp 1.0.5 2020-07-06 [1] RSPM (R 4.0.2) +## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.2) +## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.0) +## rlang 0.4.8 2020-10-08 [1] RSPM (R 4.0.2) +## rmarkdown 2.4 2020-09-30 [1] RSPM (R 4.0.2) +## rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0) +## Rttf2pt1 1.3.8 2020-01-10 [1] RSPM (R 4.0.0) +## scales 1.1.1 2020-05-11 [1] RSPM (R 4.0.0) +## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.0) +## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.2) +## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.0) +## styler 1.3.2 2020-02-23 [1] RSPM (R 4.0.0) +## tibble 3.0.4 2020-10-12 [1] RSPM (R 4.0.2) +## tidyselect 1.1.0 2020-05-11 [1] RSPM (R 4.0.0) +## vctrs 0.3.4 2020-08-29 [1] RSPM (R 4.0.2) 
+## vipor 0.4.5 2017-03-22 [1] RSPM (R 4.0.0) +## withr 2.3.0 2020-09-22 [1] RSPM (R 4.0.2) +## xfun 0.18 2020-09-29 [1] RSPM (R 4.0.2) +## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.0) ## ## [1] /usr/local/lib/R/site-library ## [2] /usr/local/lib/R/library
@@ -3345,19 +4221,22 @@

6 Session info

References

-

Blighe K., S. Rana, and M. Lewis, 2020 EnhancedVolcano: Publication-ready volcano plots with enhanced colouring and labeling.

+

Blighe K., S. Rana, and M. Lewis, 2020 EnhancedVolcano: Publication-ready volcano plots with enhanced colouring and labeling. https://github.com/kevinblighe/EnhancedVolcano

-

Gonzalez I., 2014 Statistical analysis of rna-seq data.

+

Gonzalez I., 2014 Statistical analysis of RNA-Seq data. http://www.nathalievialaneix.eu/doc/pdf/tutorial-rnaseq.pdf

-

Klaus B., and S. Reisenauer, 2018 An end to end workflow for differential gene expression using affymetrix microarrays.

+

Klaus B., and S. Reisenauer, 2018 An end to end workflow for differential gene expression using Affymetrix microarrays. https://www.bioconductor.org/packages/devel/workflows/vignettes/maEndToEnd/inst/doc/MA-Workflow.html

Ritchie M. E., B. Phipson, D. Wu, Y. Hu, and C. W. Law et al., 2015 limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43: e47. https://doi.org/10.1093/nar/gkv007

+
+

Robinson D., Understanding empirical Bayes estimation (using baseball statistics). http://varianceexplained.org/r/empirical_bayes_baseball/

+
-

Tregnago C., E. Manara, M. Zampini, V. Bisio, and C. Borga et al., 2016 CREB engages C/EBPδ to initiate leukemogenesis. Leukemia 30: 1887–1896.

+

Tregnago C., E. Manara, M. Zampini, V. Bisio, and C. Borga et al., 2016 CREB engages C/EBPδ to initiate leukemogenesis. Leukemia 30: 1887–1896. https://doi.org/10.1038/leu.2016.98

Zhu A., J. G. Ibrahim, and M. I. Love, 2018 Heavy-tailed prior distributions for sequence count data: Removing the noise and preserving large differences. Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895

@@ -3365,6 +4244,11 @@

References

+
diff --git a/02-microarray/differential-expression_microarray_02_several-groups.Rmd b/02-microarray/differential-expression_microarray_02_several-groups.Rmd index 2b1fef32..cdefb521 100644 --- a/02-microarray/differential-expression_microarray_02_several-groups.Rmd +++ b/02-microarray/differential-expression_microarray_02_several-groups.Rmd @@ -44,7 +44,7 @@ if (!dir.exists("data")) { } # Define the file path to the plots directory -plots_dir <- "plots" # Can replace with path to desired output plots directory +plots_dir <- "plots" # Create the plots folder if it doesn't exist if (!dir.exists(plots_dir)) { @@ -52,7 +52,7 @@ if (!dir.exists(plots_dir)) { } # Define the file path to the results directory -results_dir <- "results" # Can replace with path to desired output results directory +results_dir <- "results" # Create the results folder if it doesn't exist if (!dir.exists(results_dir)) { @@ -124,19 +124,24 @@ This is handy to do because if we want to switch the dataset (see next section f ```{r} # Define the file path to the data directory -data_dir <- file.path("data", "GSE37418") # Replace with accession number which will be the name of the folder the files will be in +# Replace with the path of the folder the files will be in +data_dir <- file.path("data", "GSE37418") -# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir` -data_file <- file.path(data_dir, "GSE37418.tsv") # Replace with file path to your dataset +# Declare the file path to the gene expression matrix file +# inside directory saved as `data_dir` +# Replace with the path to your dataset file +data_file <- file.path(data_dir, "GSE37418.tsv") -# Declare the file path to the metadata file using the data directory saved as `data_dir` -metadata_file <- file.path(data_dir, "metadata_GSE37418.tsv") # Replace with file path to your metadata +# Declare the file path to the metadata file +# inside the directory saved as `data_dir` +# Replace with the path 
to your metadata file +metadata_file <- file.path(data_dir, "metadata_GSE37418.tsv") ``` Now that our file paths are declared, we can use the `file.exists()` function to check that the files are where we specified above. ```{r} -# Check if the gene expression matrix file is at the file path stored in `data_file` +# Check if the gene expression matrix file is at the path stored in `data_file` file.exists(data_file) # Check if the metadata file is at the file path stored in `metadata_file` @@ -174,7 +179,7 @@ if (!("limma" %in% installed.packages())) { Attach the packages we need for this analysis. -```{r} +```{r message=FALSE} # Attach the library library(limma) @@ -205,8 +210,8 @@ We stored our file paths as objects named `metadata_file` and `data_file` in [th metadata <- readr::read_tsv(metadata_file) # Read in data TSV file -df <- readr::read_tsv(data_file) %>% - # Tuck away the gene ID column as rownames +expression_df <- readr::read_tsv(data_file) %>% + # Tuck away the gene ID column as row names tibble::column_to_rownames("Gene") ``` @@ -237,17 +242,17 @@ Let's take a look at the subgroup summary again. metadata %>% dplyr::count(subgroup) ``` -Note that the `U` and the `SHH OUTLIER` samples are gone and only the four groups we are interested in are left. +Note that the `U` and the `SHH OUTLIER` subgroups are gone and only the four groups we are interested in are left. -But, we still need to filter these samples out from the expression data that's stored in `df`. +But we still need to filter these samples out from the expression data that's stored in `expression_df`. 
```{r} # Make the data in the order of the metadata -df <- df %>% +expression_df <- expression_df %>% dplyr::select(filtered_metadata$geo_accession) # Check if this is in the same order -all.equal(colnames(df), filtered_metadata$geo_accession) +all.equal(colnames(expression_df), filtered_metadata$geo_accession) ``` ## Create the design matrix @@ -255,46 +260,46 @@ all.equal(colnames(df), filtered_metadata$geo_accession) `limma` needs a numeric design matrix to signify which samples are of which subtype of medulloblastoma. Now we will create a model matrix based on our `subgroup` variable. We are using a `+ 0` in the model which sets the intercept to 0 so the subgroup effects capture expression for that group, rather than difference from the first group. -If you have a control group, you might want that to be the intercept. +If you have a control group, you might want to leave off the `+ 0` so the model includes an intercept representing the control group expression level, with the remaining coefficients the changes relative to that expression level. ```{r} # Create the design matrix -des_mat <- model.matrix(~ filtered_metadata$subgroup + 0) +des_mat <- model.matrix(~ subgroup + 0, data = filtered_metadata) ``` Let's take a look at the design matrix we created. ```{r} -# Print out the design matrix +# Print out part of the design matrix head(des_mat) ``` -The design matrix column names are a bit messy, so we will neaten them up by dropping the `filtered_metadata$subgroup` designation they all have. +The design matrix column names are a bit messy, so we will neaten them up by dropping the `subgroup` designation they all have. 
```{r} # Make the column names less messy -colnames(des_mat) <- stringr::str_remove(colnames(des_mat), "filtered_metadata\\$subgroup") +colnames(des_mat) <- stringr::str_remove(colnames(des_mat), "subgroup") ``` -Side note: If you are wondering why there are two `\` above in `"filtered_metadata\\$subgroup"`, that's called an [escape character](https://cran.r-project.org/web/packages/stringr/vignettes/regular-expressions.html#escaping). -There's a whole universe of things called [regular expressions (regex)](https://cran.r-project.org/web/packages/stringr/vignettes/regular-expressions.html) that can be super handy for string manipulations. - ## Perform differential expression Now we are ready to actually start fitting our differential expression model to the data. To accommodate our design that has more than 2 groups this time, we will need to do this in a couple steps. -First we need to fit our basic linear model to the data, then apply empirical Bayes smoothing. +We will use the `lmFit()` function from the `limma` package to test each gene for differential expression between the two groups using a linear model. +After fitting our data to the linear model, in this example we apply empirical Bayes smoothing using the `eBayes()` function. + +Here's a [nifty article and example](http://varianceexplained.org/r/empirical_bayes_baseball/) about what the empirical Bayes smoothing is for [@bayes-estimates]. ```{r} # Apply linear model to data -fit <- lmFit(df, design = des_mat) +fit <- lmFit(expression_df, design = des_mat) # Apply empirical Bayes to smooth standard errors fit <- eBayes(fit) ``` -Now that we have our basic model fitting, we will want to make the contrasts among all our groups. +Now that we have our basic model fitting, we will want to investigate the contrasts among all our groups. Depending on your scientific questions, you will need to customize the next steps. 
Consulting the [limma users guide](https://www.bioconductor.org/packages/devel/bioc/vignettes/limma/inst/doc/usersguide.pdf) for how to set up your model based on your hypothesis is a good idea. @@ -311,10 +316,10 @@ contrast_matrix <- makeContrasts( ) ``` -Side note: If you did have a control group you wanted to compare each group to, you could make each contrast to that control group, so the formulate would look like `G3 = G3 - Control` for each one. +Side note: If you did have a control group you wanted to compare each group to, you could make each contrast to that control group, so the formulae would look like `G3 = G3 - Control` for each one. We highly recommend consulting the [limma users guide](https://bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf) for figuring out what your `makeContrasts()` and `model.matrix()` setups should look like [@Ritchie2015]. -Now that we have the contrasts matrix set up, we can use it to re-fit the model and re-smooth it with `eBayes()`. +Now that we have the contrasts matrix set up, we can use it to re-fit the model with `contrasts.fit()` and re-smooth it with `eBayes()`. ```{r} # Fit the model according to the contrasts matrix @@ -324,22 +329,21 @@ contrasts_fit <- contrasts.fit(fit, contrast_matrix) contrasts_fit <- eBayes(contrasts_fit) ``` -Here's a [nifty article and example](http://varianceexplained.org/r/empirical_bayes_baseball/) about what the empirical Bayes smoothing is for [@bayes-estimates]. Now let's create the results table based on the contrasts fitted model. -This step will provide the Benjamini-Hochberg multiple testing correction. +This step will also apply the Benjamini-Hochberg multiple testing correction. The `topTable()` function default is to use Benjamini-Hochberg but this can be changed to a different method using the `adjust.method` argument (see the `?topTable` help page for more about the options). 
```{r} # Apply multiple testing correction and obtain stats -stats_df <- topTable(contrasts_fit, number = nrow(df)) %>% +stats_df <- topTable(contrasts_fit, number = nrow(expression_df)) %>% tibble::rownames_to_column("Gene") ``` Let's take a peek at our results table. -```{r} +```{r rownames.print = FALSE} head(stats_df) ``` @@ -358,8 +362,8 @@ Based on the results in `stats_df`, we should expect this gene to be much higher First we will need to set up the data for this gene and the subgroup labels into a data frame for plotting. ```{r} -top_gene_df <- df %>% - # Extract this gene from `df` +top_gene_df <- expression_df %>% + # Extract this gene from `expression_df` dplyr::filter(rownames(.) == "ENSG00000128683") %>% # Transpose so the gene is a column t() %>% @@ -377,7 +381,7 @@ top_gene_df <- df %>% Let's take a sneak peek at our `top_gene_df`. -```{r} +```{r rownames.print = FALSE} head(top_gene_df) ``` @@ -406,14 +410,14 @@ readr::write_tsv(stats_df, file.path( ## Make volcano plots -We'll use the `ggplot2` to make a set of volcano plots. +We'll use `ggplot2` to make a set of volcano plots. But first, we need to set up our data for plotting. We will need the p values from the individual contrasts as well as the log fold changes. We can obtain the contrast p values from the `contrasts_fit` object and make it a longer format that the `ggplot()` function will want for plotting. 
```{r} -# Let's extract the contrast p values for each and convert them to -log10() +# Let's extract the contrast p values for each and transform them with -log10() contrast_p_vals_df <- -log10(contrasts_fit$p.value) %>% # Make this into a data frame as.data.frame() %>% @@ -446,8 +450,9 @@ We can perform an `inner_join()` of both these datasets using both their `Gene` plot_df <- log_fc_df %>% dplyr::inner_join(contrast_p_vals_df, by = c("Gene", "contrast"), - # This argument will automatically tack this on the end of the column names - # from the respective data frames - this way we can keep track of which columns are from which + # This argument will add the given suffixes to the column names + # from the respective data frames, helping us keep track of which columns + # hold which types of values suffix = c("_log_fc", "_p_val") ) ``` @@ -463,8 +468,8 @@ Let's declare what we consider to be significant levels for fold change and for By saving this as its own variable, we only need to change these cutoffs in one place if we want to adjust later. ```{r} -# This is equivalent to p value < 0.05 -p_val_cutoff <- 1.301 +# Convert p value cutoff to negative log 10 scale +p_val_cutoff <- -log10(0.05) # Absolute value cutoff for fold changes abs_fc_cutoff <- 5 @@ -477,7 +482,8 @@ We will use some logic with `dplyr::case_when()` to do this. 
plot_df <- plot_df %>% dplyr::mutate( signif_label = dplyr::case_when( - abs(logFoldChange) > abs_fc_cutoff & neg_log10_p_val > p_val_cutoff ~ "p-val and FC", + abs(logFoldChange) > abs_fc_cutoff & neg_log10_p_val > p_val_cutoff + ~ "p-val and FC", abs(logFoldChange) > abs_fc_cutoff ~ "FC", neg_log10_p_val > p_val_cutoff ~ "p-val", TRUE ~ "NS" @@ -493,12 +499,12 @@ volcanoes_plot <- ggplot( aes( x = logFoldChange, # Fold change as x value y = neg_log10_p_val, # -log10(p value) for the contrasts - color = signif_label - ) # Color code by significance cutoffs variable we made + color = signif_label # Color code by significance cutoffs variable we made + ) ) + - # Making this a scatter plot with dots that are 30% opaque using the `alpha` argument + # Make a scatter plot with points that are 30% opaque using `alpha` geom_point(alpha = 0.3) + - # Using our `p_val_cutoff` for our line here + # Draw our `p_val_cutoff` for line here geom_hline(yintercept = p_val_cutoff, linetype = "dashed") + # Using our `abs_fc_cutoff` for our lines here geom_vline(xintercept = c(-abs_fc_cutoff, abs_fc_cutoff), linetype = "dashed") + diff --git a/02-microarray/differential-expression_microarray_02_several-groups.html b/02-microarray/differential-expression_microarray_02_several-groups.html index 20aa9471..f337b48c 100644 --- a/02-microarray/differential-expression_microarray_02_several-groups.html +++ b/02-microarray/differential-expression_microarray_02_several-groups.html @@ -1263,25 +1263,22 @@ }; - - + + code.sourceCode > span { display: inline-block; line-height: 1.25; } + code.sourceCode > span { color: inherit; text-decoration: inherit; } + code.sourceCode > span:empty { height: 1.2em; } + .sourceCode { overflow: visible; } + code.sourceCode { white-space: pre; position: relative; } + div.sourceCode { margin: 1em 0; } + pre.sourceCode { margin: 0; } + @media screen { + div.sourceCode { overflow: auto; } + } + @media print { + code.sourceCode { white-space: pre-wrap; } + 
code.sourceCode > span { text-indent: -5em; padding-left: 5em; } + } + pre.numberSource code + { counter-reset: source-line 0; } + pre.numberSource code > span + { position: relative; left: -4em; counter-increment: source-line; } + pre.numberSource code > span > a:first-child::before + { content: counter(source-line); + position: relative; left: -1em; text-align: right; vertical-align: baseline; + border: none; display: inline-block; + -webkit-touch-callout: none; -webkit-user-select: none; + -khtml-user-select: none; -moz-user-select: none; + -ms-user-select: none; user-select: none; + padding: 0 4px; width: 4em; + color: #aaaaaa; + } + pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; } + div.sourceCode + { } + @media screen { + code.sourceCode > span > a:first-child::before { text-decoration: underline; } + } + code span.al { color: #ff0000; } /* Alert */ + code span.an { color: #008000; } /* Annotation */ + code span.at { } /* Attribute */ + code span.bu { } /* BuiltIn */ + code span.cf { color: #0000ff; } /* ControlFlow */ + code span.ch { color: #008080; } /* Char */ + code span.cn { } /* Constant */ + code span.co { color: #008000; } /* Comment */ + code span.cv { color: #008000; } /* CommentVar */ + code span.do { color: #008000; } /* Documentation */ + code span.er { color: #ff0000; font-weight: bold; } /* Error */ + code span.ex { } /* Extension */ + code span.im { } /* Import */ + code span.in { color: #008000; } /* Information */ + code span.kw { color: #0000ff; } /* Keyword */ + code span.op { } /* Operator */ + code span.ot { color: #ff4000; } /* Other */ + code span.pp { color: #ff4000; } /* Preprocessor */ + code span.sc { color: #008080; } /* SpecialChar */ + code span.ss { color: #008080; } /* SpecialString */ + code span.st { color: #008080; } /* String */ + code span.va { } /* Variable */ + code span.vs { color: #008080; } /* VerbatimString */ + code span.wa { color: #008000; font-weight: bold; } /* Warning 
*/ + + + + - - + @@ -2874,15 +3686,20 @@ @@ -2971,26 +3797,26 @@

2.1 Obtain the .Rmd

2.2 Set up your analysis folders

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

-
# Create the data folder if it doesn't exist
-if (!dir.exists("data")) {
-  dir.create("data")
-}
-
-# Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
-
-# Create the plots folder if it doesn't exist
-if (!dir.exists(plots_dir)) {
-  dir.create(plots_dir)
-}
-
-# Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
-
-# Create the results folder if it doesn't exist
-if (!dir.exists(results_dir)) {
-  dir.create(results_dir)
-}
+
# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+  dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots"
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+  dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results"
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+  dir.create(results_dir)
+}

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

@@ -3005,7 +3831,7 @@

2.3 Obtain the dataset from refin

2.4 About the dataset we are using for this example

-

For this example analysis, we will use these medulloblastoma samples. Robinson et al. (2012) measured microarray gene expression of 71 medulloblastoma tumor samples. In this analysis, we will test differential expression across the medulloblastoma subtypes.

+

For this example analysis, we will use these medulloblastoma samples. Robinson et al. (2012) measured microarray gene expression of 71 medulloblastoma tumor samples. In this analysis, we will test differential expression across the medulloblastoma subtypes.

2.5 Place the dataset in your new data/ folder

@@ -3039,20 +3865,25 @@

2.6 Check out our file structure!

In order for our example here to run without a hitch, we need these files to be in these locations, so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset we are using for this analysis (see next section for more on this), we will only have to change the file paths here to get started.

-
# Define the file path to the data directory
-data_dir <- file.path("data", "GSE37418") # Replace with accession number which will be the name of the folder the files will be in
-
-# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
-data_file <- file.path(data_dir, "GSE37418.tsv") # Replace with file path to your dataset
-
-# Declare the file path to the metadata file using the data directory saved as `data_dir`
-metadata_file <- file.path(data_dir, "metadata_GSE37418.tsv") # Replace with file path to your metadata
+
# Define the file path to the data directory
+# Replace with the path of the folder the files will be in
+data_dir <- file.path("data", "GSE37418")
+
+# Declare the file path to the gene expression matrix file
+# inside directory saved as `data_dir`
+# Replace with the path to your dataset file
+data_file <- file.path(data_dir, "GSE37418.tsv")
+
+# Declare the file path to the metadata file
+# inside the directory saved as `data_dir`
+# Replace with the path to your metadata file
+metadata_file <- file.path(data_dir, "metadata_GSE37418.tsv")

Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

-
# Check if the gene expression matrix file is at the file path stored in `data_file`
-file.exists(data_file)
+
# Check if the gene expression matrix file is at the path stored in `data_file`
+file.exists(data_file)
## [1] TRUE
-
# Check if the metadata file is at the file path stored in `metadata_file`
-file.exists(metadata_file)
+
# Check if the metadata file is at the file path stored in `metadata_file`
+file.exists(metadata_file)
## [1] TRUE

If the chunk above printed out FALSE for either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
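The two `file.exists()` checks above can also be combined into a single guard that stops early with a readable message. The following is a minimal sketch only: the helper name `check_files()` is hypothetical and not part of the original notebook, and the demonstration uses temporary paths rather than the GSE37418 files.

```r
# Hypothetical helper (not from the original notebook): report which
# of the supplied paths are missing, and stop with a message if any are.
check_files <- function(paths) {
  missing_paths <- paths[!file.exists(paths)]
  if (length(missing_paths) > 0) {
    stop("Missing input file(s): ", paste(missing_paths, collapse = ", "))
  }
  invisible(TRUE)
}

# Demonstration with a temporary file that exists and a path that does not
existing_path <- tempfile(fileext = ".tsv")
file.create(existing_path)
missing_path <- file.path(tempdir(), "no_such_file.tsv")

# Passes silently because the file exists
check_files(existing_path)

# Capture the error message instead of halting the session
result <- tryCatch(
  check_files(c(existing_path, missing_path)),
  error = function(e) conditionMessage(e)
)
print(result)
```

In a real analysis you would call something like `check_files(c(data_file, metadata_file))` right after declaring the paths.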

If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.

@@ -3070,30 +3901,31 @@

4 Differential Expression - Micro

4.1 Install libraries

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

-

In this analysis, we will be using limma for differential expression (Ritchie et al. 2015).

-
if (!("limma" %in% installed.packages())) {
-  # Install this package if it isn't installed yet
-  BiocManager::install("limma", update = FALSE)
-}
+

In this analysis, we will be using limma for differential expression (Ritchie et al. 2015).

+
if (!("limma" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("limma", update = FALSE)
+}

Attach the packages we need for this analysis.

-
# Attach the library
-library(limma)
-
-# We will need this so we can use the pipe: %>%
-library(magrittr)
-
-# We'll use this for plotting
-library(ggplot2)
+
# Attach the library
+library(limma)
+
+# We will need this so we can use the pipe: %>%
+library(magrittr)
+
+# We'll use this for plotting
+library(ggplot2)

The jitter plot we make later on with geom_jitter() involves some randomness. As is good practice when our analysis involves randomness, we will set the seed.

-
set.seed(12345)
+
set.seed(12345)
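To see why setting the seed matters, here is a small sketch using base R random draws rather than the jitter plot itself. The object names are illustrative only.

```r
# Two draws made after setting the same seed are identical
set.seed(12345)
first_draw <- runif(3)

set.seed(12345)
second_draw <- runif(3)

identical(first_draw, second_draw)

# Without resetting the seed, the next draw differs
third_draw <- runif(3)
identical(first_draw, third_draw)
```

Setting the seed once near the top of the notebook, as done above, should make the randomly placed points land in the same positions each time the notebook is re-run from the start.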

4.2 Import and set up data

Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read both TSV files and add them as data frames to your environment.

We stored our file paths as objects named metadata_file and data_file in this previous step.

-
# Read in metadata TSV file
-metadata <- readr::read_tsv(metadata_file)
-
## Parsed with column specification:
+
# Read in metadata TSV file
+metadata <- readr::read_tsv(metadata_file)
+
## 
+## ── Column specification ──────────────────────────────────────────────────────────────────────────────────
 ## cols(
 ##   .default = col_character(),
 ##   refinebio_age = col_logical(),
@@ -3114,23 +3946,24 @@ 

4.2 Import and set up data

 ##   `contact_zip/postal_code` = col_double(),
 ##   data_row_count = col_double(),
 ##   taxid_ch1 = col_double()
-## )
-
## See spec(...) for full column specifications.
-
# Read in data TSV file
-df <- readr::read_tsv(data_file) %>%
-  # Tuck away the gene ID column as rownames
-  tibble::column_to_rownames("Gene")
-
## Parsed with column specification:
+## )
+## ℹ Use `spec()` for the full column specifications.
+
# Read in data TSV file
+expression_df <- readr::read_tsv(data_file) %>%
+  # Tuck away the gene ID column as row names
+  tibble::column_to_rownames("Gene")
+
## 
+## ── Column specification ──────────────────────────────────────────────────────────────────────────────────
 ## cols(
 ##   .default = col_double(),
 ##   Gene = col_character()
 ## )
-## See spec(...) for full column specifications.
+## ℹ Use `spec()` for the full column specifications.

4.3 Removing groups that are too small

We will be using the subgroup variable labels in our metadata to test for differential expression across subgroups. Let’s take a look at how many samples of each subgroup we have.

-
metadata %>% dplyr::count(subgroup)
+
metadata %>% dplyr::count(subgroup)
-

Note that the U and the SHH OUTLIER samples are gone and only the four groups we are interested in are left.

-

But, we still need to filter these samples out from the expression data that’s stored in df.

-
# Make the data in the order of the metadata
-df <- df %>%
-  dplyr::select(filtered_metadata$geo_accession)
-
-# Check if this is in the same order
-all.equal(colnames(df), filtered_metadata$geo_accession)
+

Note that the U and the SHH OUTLIER subgroups are gone and only the four groups we are interested in are left.

+

But we still need to filter these samples out from the expression data that’s stored in expression_df.

+
# Make the data in the order of the metadata
+expression_df <- expression_df %>%
+  dplyr::select(filtered_metadata$geo_accession)
+
+# Check if this is in the same order
+all.equal(colnames(expression_df), filtered_metadata$geo_accession)
## [1] TRUE

4.4 Create the design matrix

-

limma needs a numeric design matrix to signify which samples are of which subtype of medulloblastoma. Now we will create a model matrix based on our subgroup variable. We are using a + 0 in the model which sets the intercept to 0 so the subgroup effects capture expression for that group, rather than difference from the first group. If you have a control group, you might want that to be the intercept.

-
# Create the design matrix
-des_mat <- model.matrix(~ filtered_metadata$subgroup + 0)
+

limma needs a numeric design matrix to signify which samples are of which subtype of medulloblastoma. Now we will create a model matrix based on our subgroup variable. We are using a + 0 in the model, which sets the intercept to 0 so the subgroup effects capture expression for that group, rather than difference from the first group. If you have a control group, you might want to leave off the + 0 so the model includes an intercept representing the control group expression level, with the remaining coefficients representing the changes relative to that expression level.

+
# Create the design matrix
+des_mat <- model.matrix(~ subgroup + 0, data = filtered_metadata)

Let’s take a look at the design matrix we created.

-
# Print out the design matrix
-head(des_mat)
-
##   filtered_metadata$subgroupG3 filtered_metadata$subgroupG4
-## 1                            0                            1
-## 2                            0                            1
-## 3                            0                            0
-## 4                            1                            0
-## 5                            0                            1
-## 6                            0                            0
-##   filtered_metadata$subgroupSHH filtered_metadata$subgroupWNT
-## 1                             0                             0
-## 2                             0                             0
-## 3                             1                             0
-## 4                             0                             0
-## 5                             0                             0
-## 6                             1                             0
-

The design matrix column names are a bit messy, so we will neaten them up by dropping the filtered_metadata$subgroup designation they all have.

-
# Make the column names less messy
-colnames(des_mat) <- stringr::str_remove(colnames(des_mat), "filtered_metadata\\$subgroup")
-

Side note: If you are wondering why there are two \ above in "filtered_metadata\\$subgroup", that’s called an escape character. There’s a whole universe of things called regular expressions (regex) that can be super handy for string manipulations.

+
# Print out part of the design matrix
+head(des_mat)
+
##   subgroupG3 subgroupG4 subgroupSHH subgroupWNT
+## 1          0          1           0           0
+## 2          0          1           0           0
+## 3          0          0           1           0
+## 4          1          0           0           0
+## 5          0          1           0           0
+## 6          0          0           1           0
+

The design matrix column names are a bit messy, so we will neaten them up by dropping the subgroup designation they all have.

+
# Make the column names less messy
+colnames(des_mat) <- stringr::str_remove(colnames(des_mat), "subgroup")
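The effect of the `+ 0` term described above can be seen on a small made-up example. This is a sketch under stated assumptions: `toy_metadata` is hypothetical and only mimics the `subgroup` column of `filtered_metadata`.

```r
# Toy metadata (hypothetical; not the GSE37418 samples)
toy_metadata <- data.frame(
  subgroup = factor(c("G3", "G4", "SHH", "WNT", "G4", "SHH"))
)

# `+ 0` removes the intercept: one indicator column per subgroup
no_intercept <- model.matrix(~ subgroup + 0, data = toy_metadata)
colnames(no_intercept)

# Without `+ 0`, the first level (G3) is absorbed into the intercept and
# the remaining columns represent differences from that level
with_intercept <- model.matrix(~subgroup, data = toy_metadata)
colnames(with_intercept)

# Each sample belongs to exactly one subgroup in the no-intercept design
rowSums(no_intercept)
```

With a control group, the intercept version is often the more convenient one, provided the control is the first factor level.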

4.5 Perform differential expression

Now we are ready to actually start fitting our differential expression model to the data. To accommodate our design that has more than 2 groups this time, we will need to do this in a couple steps.

-

First we need to fit our basic linear model to the data, then apply empirical Bayes smoothing.

-
# Apply linear model to data
-fit <- lmFit(df, design = des_mat)
-
-# Apply empirical Bayes to smooth standard errors
-fit <- eBayes(fit)
-

Now that we have our basic model fitting, we will want to make the contrasts among all our groups. Depending on your scientific questions, you will need to customize the next steps. Consulting the limma users guide for how to set up your model based on your hypothesis is a good idea.

+

We will use the lmFit() function from the limma package to test each gene for differential expression among the subgroups using a linear model. After fitting our data to the linear model, in this example we apply empirical Bayes smoothing using the eBayes() function.

+

Here’s a nifty article and example about what the empirical Bayes smoothing is for (Robinson).

+
# Apply linear model to data
+fit <- lmFit(expression_df, design = des_mat)
+
+# Apply empirical Bayes to smooth standard errors
+fit <- eBayes(fit)
+

Now that we have our basic model fitting, we will want to investigate the contrasts among all our groups. Depending on your scientific questions, you will need to customize the next steps. Consulting the limma users guide for how to set up your model based on your hypothesis is a good idea.

In this contrasts matrix, we are comparing each subtype to all the other subtypes.
We’re dividing by three in this expression so that each group is compared to the average of the other three groups (makeContrasts() doesn’t allow you to use functions like mean(); it wants a formula).

-
contrast_matrix <- makeContrasts(
-  "G3vsOther" = G3 - (G4 + SHH + WNT) / 3,
-  "G4vsOther" = G4 - (G3 + SHH + WNT) / 3,
-  "SHHvsOther" = SHH - (G3 + G4 + WNT) / 3,
-  "WNTvsOther" = WNT - (G3 + G4 + SHH) / 3,
-  levels = des_mat
-)
-

Side note: If you did have a control group you wanted to compare each group to, you could make each contrast to that control group, so the formulate would look like G3 = G3 - Control for each one. We highly recommend consulting the limma users guide for figuring out what your makeContrasts() and model.matrix() setups should look like (Ritchie et al. 2015).

-

Now that we have the contrasts matrix set up, we can use it to re-fit the model and re-smooth it with eBayes().

-
# Fit the model according to the contrasts matrix
-contrasts_fit <- contrasts.fit(fit, contrast_matrix)
-
-# Re-smooth the Bayes
-contrasts_fit <- eBayes(contrasts_fit)
-

Here’s a nifty article and example about what the empirical Bayes smoothing is for (Robinson).

+
contrast_matrix <- makeContrasts(
+  "G3vsOther" = G3 - (G4 + SHH + WNT) / 3,
+  "G4vsOther" = G4 - (G3 + SHH + WNT) / 3,
+  "SHHvsOther" = SHH - (G3 + G4 + WNT) / 3,
+  "WNTvsOther" = WNT - (G3 + G4 + SHH) / 3,
+  levels = des_mat
+)
+

Side note: If you did have a control group you wanted to compare each group to, you could make each contrast to that control group, so the formulae would look like G3 = G3 - Control for each one. We highly recommend consulting the limma users guide for figuring out what your makeContrasts() and model.matrix() setups should look like (Ritchie et al. 2015).
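To make the "divide by three" contrasts above concrete, the weights they imply can be written out by hand in base R. This is a sketch, not limma's internals: the matrix below is built manually and should correspond to what `makeContrasts()` produces for these formulas, and `toy_means` holds made-up group means for a single gene.

```r
# Columns of the design matrix, in order
groups <- c("G3", "G4", "SHH", "WNT")

# Hand-built weights: +1 for the group of interest, -1/3 for each other group
contrast_weights <- matrix(
  -1 / 3,
  nrow = 4, ncol = 4,
  dimnames = list(groups, paste0(groups, "vsOther"))
)
diag(contrast_weights) <- 1

# Hypothetical group means for a single gene
toy_means <- c(G3 = 2, G4 = 4, SHH = 6, WNT = 12)

# Each contrast is the group mean minus the average of the other three
contrast_values <- drop(toy_means %*% contrast_weights)
contrast_values

# The weights within each contrast sum to (numerically) zero
colSums(contrast_weights)
```

For the toy WNT group, the contrast is 12 minus the mean of 2, 4, and 6, which is 8.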

+

Now that we have the contrasts matrix set up, we can use it to re-fit the model with contrasts.fit() and re-smooth it with eBayes().

+
# Fit the model according to the contrasts matrix
+contrasts_fit <- contrasts.fit(fit, contrast_matrix)
+
+# Re-smooth the Bayes
+contrasts_fit <- eBayes(contrasts_fit)

Now let’s create the results table based on the contrasts fitted model.

-

This step will provide the Benjamini-Hochberg multiple testing correction. The topTable() function default is to use Benjamini-Hochberg but this can be changed to a different method using the adjust.method argument (see the ?topTable help page for more about the options).

-
# Apply multiple testing correction and obtain stats
-stats_df <- topTable(contrasts_fit, number = nrow(df)) %>%
-  tibble::rownames_to_column("Gene")
+

This step will also apply the Benjamini-Hochberg multiple testing correction. The topTable() function default is to use Benjamini-Hochberg but this can be changed to a different method using the adjust.method argument (see the ?topTable help page for more about the options).

+
# Apply multiple testing correction and obtain stats
+stats_df <- topTable(contrasts_fit, number = nrow(expression_df)) %>%
+  tibble::rownames_to_column("Gene")
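For intuition about what the Benjamini-Hochberg correction does, here is a minimal sketch using base R's `p.adjust()` on made-up p values. It illustrates the correction itself and is not a substitute for the adjusted values that `topTable()` reports.

```r
# Hypothetical raw p values for five genes
raw_p <- c(0.001, 0.01, 0.02, 0.04, 0.5)

# Benjamini-Hochberg adjusted p values (controls the false discovery rate)
bh_adjusted <- p.adjust(raw_p, method = "BH")
bh_adjusted

# A more conservative correction, shown only for comparison
bonferroni_adjusted <- p.adjust(raw_p, method = "bonferroni")
bonferroni_adjusted
```

In this toy example the Benjamini-Hochberg values are never larger than the Bonferroni values, which is why it is a common default when testing thousands of genes.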

Let’s take a peek at our results table.

-
head(stats_df)
+
head(stats_df)

For each gene, the table reports each group’s fold change in expression compared to the average of the other groups.

@@ -3233,134 +4058,136 @@

4.5 Perform differential expressi

4.6 Check results by plotting one gene

To test if these results make sense, we can make a plot of one of the top genes. Let’s try extracting the data for ENSG00000128683 and set up its own data frame for plotting purposes. Based on the results in stats_df, we should expect this gene to be much higher in the WNT samples.

First we will need to set up the data for this gene and the subgroup labels into a data frame for plotting.

-
top_gene_df <- df %>%
-  # Extract this gene from `df`
-  dplyr::filter(rownames(.) == "ENSG00000128683") %>%
-  # Transpose so the gene is a column
-  t() %>%
-  # Transpose made this a matrix, let's make it back into a data.frame like before
-  data.frame() %>%
-  # Store the sample ids as their own column instead of being row names
-  tibble::rownames_to_column("refinebio_accession_code") %>%
-  # Join on the selected columns from metadata
-  dplyr::inner_join(dplyr::select(
-    metadata,
-    refinebio_accession_code,
-    subgroup
-  ))
+
top_gene_df <- expression_df %>%
+  # Extract this gene from `expression_df`
+  dplyr::filter(rownames(.) == "ENSG00000128683") %>%
+  # Transpose so the gene is a column
+  t() %>%
+  # Transpose made this a matrix, let's make it back into a data.frame like before
+  data.frame() %>%
+  # Store the sample ids as their own column instead of being row names
+  tibble::rownames_to_column("refinebio_accession_code") %>%
+  # Join on the selected columns from metadata
+  dplyr::inner_join(dplyr::select(
+    metadata,
+    refinebio_accession_code,
+    subgroup
+  ))
## Joining, by = "refinebio_accession_code"

Let’s take a sneak peek at our top_gene_df.

-
head(top_gene_df)
+
head(top_gene_df)

Now let’s plot the data for ENSG00000128683 using our top_gene_df. We should expect this gene to be expressed at much higher levels in the WNT group samples.

-
ggplot(top_gene_df, aes(x = subgroup, y = ENSG00000128683, color = subgroup)) +
-  geom_jitter(width = 0.2, height = 0) + # We'll make this a jitter plot
-  theme_classic() # This makes some aesthetic changes
-

+
ggplot(top_gene_df, aes(x = subgroup, y = ENSG00000128683, color = subgroup)) +
+  geom_jitter(width = 0.2, height = 0) + # We'll make this a jitter plot
+  theme_classic() # This makes some aesthetic changes
+

Yes! These results make sense. The WNT samples have much higher expression of ENSG00000128683 than the other samples.

4.7 Write results to file

The results in stats_df will be saved to our results/ directory.

-
readr::write_tsv(stats_df, file.path(
-  results_dir,
-  "GSE37418_limma_results.tsv" # Replace with a relevant output name
-))
+
readr::write_tsv(stats_df, file.path(
+  results_dir,
+  "GSE37418_limma_results.tsv" # Replace with a relevant output name
+))

4.8 Make volcano plots

-

We’ll use the ggplot2 to make a set of volcano plots. But first, we need to set up our data for plotting. We will need the p values from the individual contrasts as well as the log fold changes.

+

We’ll use ggplot2 to make a set of volcano plots. But first, we need to set up our data for plotting. We will need the p values from the individual contrasts as well as the log fold changes.

We can obtain the contrast p values from the contrasts_fit object and make it a longer format that the ggplot() function will want for plotting.

-
# Let's extract the contrast p values for each and convert them to -log10()
-contrast_p_vals_df <- -log10(contrasts_fit$p.value) %>%
-  # Make this into a data frame
-  as.data.frame() %>%
-  # Store genes as their own column
-  tibble::rownames_to_column("Gene") %>%
-  # Make this into long format
-  tidyr::pivot_longer(dplyr::contains("vsOther"),
-    names_to = "contrast",
-    values_to = "neg_log10_p_val"
-  )
+
# Let's extract the contrast p values for each and transform them with -log10()
+contrast_p_vals_df <- -log10(contrasts_fit$p.value) %>%
+  # Make this into a data frame
+  as.data.frame() %>%
+  # Store genes as their own column
+  tibble::rownames_to_column("Gene") %>%
+  # Make this into long format
+  tidyr::pivot_longer(dplyr::contains("vsOther"),
+    names_to = "contrast",
+    values_to = "neg_log10_p_val"
+  )

Now let’s extract the log fold changes from stats_df.

-
# Let's extract the fold changes from `stats_df`
-log_fc_df <- stats_df %>%
-  # We only want to keep the `Gene` column as well
-  dplyr::select("Gene", dplyr::contains("vsOther")) %>%
-  # Make this a longer format
-  tidyr::pivot_longer(dplyr::contains("vsOther"),
-    names_to = "contrast",
-    values_to = "logFoldChange"
-  )
+
# Let's extract the fold changes from `stats_df`
+log_fc_df <- stats_df %>%
+  # We only want to keep the `Gene` column as well
+  dplyr::select("Gene", dplyr::contains("vsOther")) %>%
+  # Make this a longer format
+  tidyr::pivot_longer(dplyr::contains("vsOther"),
+    names_to = "contrast",
+    values_to = "logFoldChange"
+  )

We can perform an inner_join() of both these datasets using both their Gene and contrast columns.

-
plot_df <- log_fc_df %>%
-  dplyr::inner_join(contrast_p_vals_df,
-    by = c("Gene", "contrast"),
-    # This argument will automatically tack this on the end of the column names
-    # from the respective data frames - this way we can keep track of which columns are from which
-    suffix = c("_log_fc", "_p_val")
-  )
+
plot_df <- log_fc_df %>%
+  dplyr::inner_join(contrast_p_vals_df,
+    by = c("Gene", "contrast"),
+    # This argument will add the given suffixes to the column names
+    # from the respective data frames, helping us keep track of which columns
+    # hold which types of values
+    suffix = c("_log_fc", "_p_val")
+  )

Let’s print out a preview of plot_df.

-
# Print out what this looks like
-head(plot_df)
+
# Print out what this looks like
+head(plot_df)

Let’s declare what we consider to be significant levels for fold change and for -log10 p-values. By saving these as their own variables, we only need to change the cutoffs in one place if we want to adjust them later.

-
# This is equivalent to p value < 0.05
-p_val_cutoff <- 1.301
-
-# Absolute value cutoff for fold changes
-abs_fc_cutoff <- 5
+
# Convert p value cutoff to negative log 10 scale
+p_val_cutoff <- -log10(0.05)
+
+# Absolute value cutoff for fold changes
+abs_fc_cutoff <- 5
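As a quick check of the p value cutoff defined above, the sketch below confirms that comparing on the -log10 scale agrees with comparing raw p values against 0.05. The `toy_p` values are made up for illustration.

```r
# The cutoff on the -log10 scale (about 1.301)
p_val_cutoff <- -log10(0.05)
round(p_val_cutoff, 3)

# Hypothetical p values
toy_p <- c(0.001, 0.049, 0.05, 0.2)

# Smaller p values give larger -log10 values, so these two tests agree
below_raw_cutoff <- toy_p < 0.05
above_log_cutoff <- -log10(toy_p) > p_val_cutoff

identical(below_raw_cutoff, above_log_cutoff)
```

Computing the cutoff with `-log10()` rather than typing a rounded number keeps the raw threshold visible in the code.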

Now we can use these cutoffs to make a new variable that declares which genes we consider significant. We will use some logic with dplyr::case_when() to do this.

-
plot_df <- plot_df %>%
-  dplyr::mutate(
-    signif_label = dplyr::case_when(
-      abs(logFoldChange) > abs_fc_cutoff & neg_log10_p_val > p_val_cutoff ~ "p-val and FC",
-      abs(logFoldChange) > abs_fc_cutoff ~ "FC",
-      neg_log10_p_val > p_val_cutoff ~ "p-val",
-      TRUE ~ "NS"
-    )
-  )
+
plot_df <- plot_df %>%
+  dplyr::mutate(
+    signif_label = dplyr::case_when(
+      abs(logFoldChange) > abs_fc_cutoff & neg_log10_p_val > p_val_cutoff
+      ~ "p-val and FC",
+      abs(logFoldChange) > abs_fc_cutoff ~ "FC",
+      neg_log10_p_val > p_val_cutoff ~ "p-val",
+      TRUE ~ "NS"
+    )
+  )
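The order of the conditions in `dplyr::case_when()` matters, because each row receives the label of the first condition it satisfies. A minimal sketch on made-up values, assuming `dplyr` is installed (`toy_df` is hypothetical):

```r
# Hypothetical values for four genes
toy_df <- data.frame(
  logFoldChange = c(6, 6, 1, 1),
  neg_log10_p_val = c(3, 0.5, 3, 0.5)
)

abs_fc_cutoff <- 5
p_val_cutoff <- -log10(0.05)

# Conditions are checked top to bottom; the first match wins
toy_df$signif_label <- dplyr::case_when(
  abs(toy_df$logFoldChange) > abs_fc_cutoff &
    toy_df$neg_log10_p_val > p_val_cutoff ~ "p-val and FC",
  abs(toy_df$logFoldChange) > abs_fc_cutoff ~ "FC",
  toy_df$neg_log10_p_val > p_val_cutoff ~ "p-val",
  TRUE ~ "NS"
)

toy_df$signif_label
```

If the combined condition were listed last, the first gene would be labeled "FC" instead, since it already satisfies the fold change condition on its own.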

Now we’re ready to plot the volcanoes!

-
volcanoes_plot <- ggplot(
-  plot_df,
-  aes(
-    x = logFoldChange, # Fold change as x value
-    y = neg_log10_p_val, # -log10(p value) for the contrasts
-    color = signif_label
-  ) # Color code by significance cutoffs variable we made
-) +
-  # Making this a scatter plot with dots that are 30% opaque using the `alpha` argument
-  geom_point(alpha = 0.3) +
-  # Using our `p_val_cutoff` for our line here
-  geom_hline(yintercept = p_val_cutoff, linetype = "dashed") +
-  # Using our `abs_fc_cutoff` for our lines here
-  geom_vline(xintercept = c(-abs_fc_cutoff, abs_fc_cutoff), linetype = "dashed") +
-  # The default colors aren't great, we'll specify our own here
-  scale_colour_manual(values = c("#67a9cf", "darkgray", "gray", "#a1d76a")) +
-  # Let's be more specific about what this p value is in our y axis label
-  ylab("Contrast -log10(p value)") +
-  # This makes separate plots for each contrast!
-  facet_wrap(~contrast) +
-  # Just for making it prettier!
-  theme_classic()
-
-# Print out the plot!
-volcanoes_plot
+
volcanoes_plot <- ggplot(
+  plot_df,
+  aes(
+    x = logFoldChange, # Fold change as x value
+    y = neg_log10_p_val, # -log10(p value) for the contrasts
+    color = signif_label # Color code by significance cutoffs variable we made
+  )
+) +
+  # Make a scatter plot with points that are 30% opaque using `alpha`
+  geom_point(alpha = 0.3) +
+  # Draw a horizontal line at our `p_val_cutoff`
+  geom_hline(yintercept = p_val_cutoff, linetype = "dashed") +
+  # Using our `abs_fc_cutoff` for our lines here
+  geom_vline(xintercept = c(-abs_fc_cutoff, abs_fc_cutoff), linetype = "dashed") +
+  # The default colors aren't great, we'll specify our own here
+  scale_colour_manual(values = c("#67a9cf", "darkgray", "gray", "#a1d76a")) +
+  # Let's be more specific about what this p value is in our y axis label
+  ylab("Contrast -log10(p value)") +
+  # This makes separate plots for each contrast!
+  facet_wrap(~contrast) +
+  # Just for making it prettier!
+  theme_classic()
+
+# Print out the plot!
+volcanoes_plot

Here the green points might be of interest. We recommend ColorBrewer for finding different color sets if you don’t like the ones we used.

Let’s save these volcanoes to a PNG file.

-
ggsave(
-  plot = volcanoes_plot,
-  file.path(plots_dir, "GSE37418_results_volcano_plots.png")
-)
+
ggsave(
+  plot = volcanoes_plot,
+  file.path(plots_dir, "GSE37418_results_volcano_plots.png")
+)
## Saving 7 x 5 in image

@@ -3368,9 +4195,9 @@

4.8 Make volcano plots

5 Resources for further learning

@@ -3378,9 +4205,9 @@

5 Resources for further learning

6 Session info

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

-
# Print session info
-sessioninfo::session_info()
-
## ─ Session info ───────────────────────────────────────────────────────────────
+
# Print session info
+sessioninfo::session_info()
+
## ─ Session info ─────────────────────────────────────────────────────
 ##  setting  value                       
 ##  version  R version 4.0.2 (2020-06-22)
 ##  os       Ubuntu 20.04 LTS            
@@ -3390,13 +4217,13 @@ 

6 Session info

 ##  collate  en_US.UTF-8
 ##  ctype    en_US.UTF-8
 ##  tz       Etc/UTC
-##  date     2020-10-16
+##  date     2020-12-16
 ##
-## ─ Packages ───────────────────────────────────────────────────────────────────
+## ─ Packages ─────────────────────────────────────────────────────────
 ##  package     * version date       lib source
 ##  assertthat    0.2.1   2019-03-21 [1] RSPM (R 4.0.0)
 ##  backports     1.1.10  2020-09-15 [1] RSPM (R 4.0.2)
-##  cli           2.0.2   2020-02-28 [1] RSPM (R 4.0.0)
+##  cli           2.1.0   2020-10-12 [1] RSPM (R 4.0.2)
 ##  colorspace    1.4-1   2019-03-18 [1] RSPM (R 4.0.0)
 ##  crayon        1.3.4   2017-09-16 [1] RSPM (R 4.0.0)
 ##  digest        0.6.25  2020-02-23 [1] RSPM (R 4.0.0)
@@ -3416,22 +4243,22 @@

6 Session info

 ##  knitr         1.30    2020-09-22 [1] RSPM (R 4.0.2)
 ##  labeling      0.3     2014-08-23 [1] RSPM (R 4.0.0)
 ##  lifecycle     0.2.0   2020-03-06 [1] RSPM (R 4.0.0)
-##  limma       * 3.44.3  2020-06-12 [1] Bioconductor
+##  limma       * 3.46.0  2020-10-27 [1] Bioconductor
 ##  magrittr    * 1.5     2014-11-22 [1] RSPM (R 4.0.0)
 ##  munsell       0.5.0   2018-06-12 [1] RSPM (R 4.0.0)
 ##  optparse    * 1.6.6   2020-04-16 [1] RSPM (R 4.0.0)
 ##  pillar        1.4.6   2020-07-10 [1] RSPM (R 4.0.2)
 ##  pkgconfig     2.0.3   2019-09-22 [1] RSPM (R 4.0.0)
+##  ps            1.4.0   2020-10-07 [1] RSPM (R 4.0.2)
 ##  purrr         0.3.4   2020-04-17 [1] RSPM (R 4.0.0)
 ##  R.cache       0.14.0  2019-12-06 [1] RSPM (R 4.0.0)
 ##  R.methodsS3   1.8.1   2020-08-26 [1] RSPM (R 4.0.2)
 ##  R.oo          1.24.0  2020-08-26 [1] RSPM (R 4.0.2)
 ##  R.utils       2.10.1  2020-08-26 [1] RSPM (R 4.0.2)
 ##  R6            2.4.1   2019-11-12 [1] RSPM (R 4.0.0)
-##  Rcpp          1.0.5   2020-07-06 [1] RSPM (R 4.0.2)
-##  readr         1.3.1   2018-12-21 [1] RSPM (R 4.0.2)
+##  readr         1.4.0   2020-10-05 [1] RSPM (R 4.0.2)
 ##  rematch2      2.1.2   2020-05-01 [1] RSPM (R 4.0.0)
-##  rlang         0.4.7   2020-07-09 [1] RSPM (R 4.0.2)
+##  rlang         0.4.8   2020-10-08 [1] RSPM (R 4.0.2)
 ##  rmarkdown     2.4     2020-09-30 [1] RSPM (R 4.0.2)
 ##  rstudioapi    0.11    2020-02-07 [1] RSPM (R 4.0.0)
 ##  scales        1.1.1   2020-05-11 [1] RSPM (R 4.0.0)
@@ -3439,7 +4266,7 @@

6 Session info

 ##  stringi       1.5.3   2020-09-09 [1] RSPM (R 4.0.2)
 ##  stringr       1.4.0   2019-02-10 [1] RSPM (R 4.0.0)
 ##  styler        1.3.2   2020-02-23 [1] RSPM (R 4.0.0)
-##  tibble        3.0.3   2020-07-10 [1] RSPM (R 4.0.2)
+##  tibble        3.0.4   2020-10-12 [1] RSPM (R 4.0.2)
 ##  tidyr         1.1.2   2020-08-27 [1] RSPM (R 4.0.2)
 ##  tidyselect    1.1.0   2020-05-11 [1] RSPM (R 4.0.0)
 ##  vctrs         0.3.4   2020-08-29 [1] RSPM (R 4.0.2)
@@ -3451,23 +4278,28 @@

6 Session info

## [2] /usr/local/lib/R/library
-

Gonzalez I., 2014 Statistical analysis of rna-seq data.

+

Gonzalez I., 2014 Statistical analysis of RNA-Seq data. http://www.nathalievialaneix.eu/doc/pdf/tutorial-rnaseq.pdf

-

Klaus B., and S. Reisenauer, 2018 An end to end workflow for differential gene expression using affymetrix microarrays.

+

Klaus B., and S. Reisenauer, 2018 An end to end workflow for differential gene expression using Affymetrix microarrays. https://www.bioconductor.org/packages/devel/workflows/vignettes/maEndToEnd/inst/doc/MA-Workflow.html

Ritchie M. E., B. Phipson, D. Wu, Y. Hu, and C. W. Law et al., 2015 limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43: e47. https://doi.org/10.1093/nar/gkv007

-

Robinson G., M. Parker, T. A. Kranenburg, C. Lu, and X. Chen et al., 2012 Novel mutations target distinct subgroups of medulloblastoma. Nature 488: 43–48.

+

Robinson G., M. Parker, T. A. Kranenburg, C. Lu, and X. Chen et al., 2012 Novel mutations target distinct subgroups of medulloblastoma. Nature 488: 43–48. https://doi.org/10.1038/nature11213

-

Robinson D., Understanding empirical bayes estimation (using baseball statistics)

+

Robinson D., Understanding empirical Bayes estimation (using baseball statistics). http://varianceexplained.org/r/empirical_bayes_baseball/

+
diff --git a/02-microarray/dimension-reduction_microarray_01_pca.Rmd b/02-microarray/dimension-reduction_microarray_01_pca.Rmd
index 2dbee8f6..3b9896b5 100644
--- a/02-microarray/dimension-reduction_microarray_01_pca.Rmd
+++ b/02-microarray/dimension-reduction_microarray_01_pca.Rmd
@@ -44,7 +44,7 @@ if (!dir.exists("data")) {
 }
 
 # Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
+plots_dir <- "plots"
 
 # Create the plots folder if it doesn't exist
 if (!dir.exists(plots_dir)) {
@@ -52,7 +52,7 @@ if (!dir.exists(plots_dir)) {
 }
 
 # Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
+results_dir <- "results"
 
 # Create the results folder if it doesn't exist
 if (!dir.exists(results_dir)) {
@@ -114,7 +114,7 @@ Your new analysis folder should contain:
 - The gene expression
 - The metadata TSV
 - A folder for `plots` (currently empty)
-- A folder for `results` (currently empty)
+- A folder for `results` (currently empty)
 
 Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):
 
@@ -128,19 +128,24 @@ This is handy to do because if we want to switch the dataset (see next section f
 
 ```{r}
 # Define the file path to the data directory
-data_dir <- file.path("data", "GSE37382") # Replace with accession number which will be the name of the folder the files will be in
-
-# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
-data_file <- file.path(data_dir, "GSE37382.tsv") # Replace with file path to your dataset
-
-# Declare the file path to the metadata file using the data directory saved as `data_dir`
-metadata_file <- file.path(data_dir, "metadata_GSE37382.tsv") # Replace with file path to your metadata
+# Replace with the path of the folder the files will be in
+data_dir <- file.path("data", "GSE37382")
+
+# Declare the file path to the gene expression matrix file
+# inside the directory saved as `data_dir`
+# Replace with the path to your dataset file
+data_file <- file.path(data_dir, "GSE37382.tsv")
+
+# Declare the file path to the metadata file
+# inside the directory saved as `data_dir`
+# Replace with the path to your metadata file
+metadata_file <- file.path(data_dir, "metadata_GSE37382.tsv")
 ```
 
 Now that our file paths are declared, we can use the `file.exists()` function to check that the files are where we specified above.
 
 ```{r}
-# Check if the gene expression matrix file is at the file path stored in `data_file`
+# Check if the gene expression matrix file is at the path stored in `data_file`
 file.exists(data_file)
 
 # Check if the metadata file is at the file path stored in `metadata_file`
@@ -170,7 +175,7 @@ See our Getting Started page with [instructions for package installation](https:
 
 Attach the packages we need for this analysis:
 
-```{r}
+```{r message=FALSE}
 # Attach the library
 library(ggplot2)
 
@@ -191,14 +196,14 @@ metadata <- readr::read_tsv(metadata_file)
 
 # Read in data TSV file
 df <- readr::read_tsv(data_file) %>%
-  # Tuck away the gene ID column as rownames
+  # Tuck away the gene ID column as row names, leaving only numeric values
   tibble::column_to_rownames("Gene")
 ```
 
 Let's ensure that the metadata and data are in the same sample order.
 
 ```{r}
-# Make the data in the order of the metadata
+# Make sure the columns (samples) are in the same order as the metadata
 df <- df %>%
   dplyr::select(metadata$geo_accession)
 
@@ -213,24 +218,28 @@ Now we are going to use a combination of functions from base R and the `ggplot2`
 
 In this code chunk, we are going to perform Principal Component Analysis (PCA) on our data and create a data frame using the PCA scores and the variables from our metadata that we are going to use to annotate our plot later.
 We are using the base R `prcomp()` function to perform Principal Component Analysis (PCA) here.
+The `prcomp()` function calculates principal component scores for each row of a matrix, but our data is arranged with each sample in a column, so we will need to transpose the data frame first.
+In most cases, we will want to use the `scale = TRUE` argument so that all of the expression measurements have the same variance.
+This prevents the PCA results from being dominated by a few highly variable genes.
 
 ```{r}
 # Perform Principal Component Analysis (PCA) using the `prcomp()` function
 pca <- prcomp(
-  t(df), # We have to transpose our data frame so we are obtaining PCA scores for samples instead of genes
-  scale = TRUE # This tells R that we want the variables scaled to have unit variance
+  t(df), # transpose our data frame to obtain PC scores for samples, not genes
+  scale = TRUE # we want the data scaled to have unit variance for each gene
 )
 ```
 
-Let's take a preview at the PCA results.
-We are using indexing to only print out the first 10 PCs: `[, 1:10]`.
+Let's take a preview of the PCA results.
+Each row will be a sample, as in the transposed data matrix we used as input, and each column is one of the new principal component (PC) values.
+We are using indexing to only print out the first 5 PC columns: `[, 1:5]`.
 
 ```{r}
 # We can access the results from our `pca` object using `$x`
-head(pca$x[, 1:10])
+head(pca$x[, 1:5])
 ```
 
-In total, we do have 285 principal component values, because we provided 285 sample's data.
+In total, we have 285 principal component values, because we provided 285 samples' data (we will always have as many PCs as the smaller dimension of the input data matrix).
 
 ## Explore Variance in PCA Results
 
@@ -246,33 +255,36 @@ The `summary()` function reports the proportion of variance explained by each pr
 pca_summary <- summary(pca)
 ```
 
-By accessing the `importance` element, which contains the proportion of variance explained by each principal component, with `pca_summary$importance`, we can use indexing to only look at the first `n` PCs.
+The `importance` element of the summary object contains the proportion of variance explained by each principal component along with other statistics. With `pca_summary$importance`, we can use indexing to look at only the first `n` PCs.
 
 ```{r}
-# Now access the importance information for the first 10 PCs -- we can access this information `pca_summary$importance`
-pca_summary$importance[, 1:10]
+# Now access the importance information for the first 5 PCs
+pca_summary$importance[, 1:5]
 ```
 
-Now that we've seen the proportion of variance for the first ten PCs, let's prepare and plot the PC scores for the first two principal components, the components responsible for the most explained proportion of variance in our dataset.
+Now that we've seen the proportion of variance for the first set of PCs, let's prepare and plot the PC scores for the first two principal components, the components that explain the largest proportion of the expression variance in our dataset.
+(Note, though, that in this case they explain less than 15% of the total variance!)
 
 ## Prepare a final data frame with PCA results for plotting
 
 In the next chunk, we are going to extract the first two principal components from our `pca` object to prepare a data frame for plotting.
 
 ```{r}
-# Make the first two principal components into a data frame for plotting with `ggplot2`
+# Make the first two PCs into a data frame for plotting with `ggplot2`
 pca_df <- data.frame(pca$x[, 1:2]) %>%
-  # Turn samples_ids stored as rownames into column
+  # Turn sample IDs stored as row names into a column
   tibble::rownames_to_column("refinebio_accession_code") %>%
-  # Bring only the variables that we want from the metadata into this data frame -- here we are going to join by `refinebio_accession_code` values
-  dplyr::inner_join(dplyr::select(metadata, refinebio_accession_code, histology, subgroup),
+  # Bring only the variables that we want from the metadata into this data frame
+  # here we are going to join by `refinebio_accession_code` values
+  dplyr::inner_join(
+    dplyr::select(metadata, refinebio_accession_code, histology, subgroup),
     by = "refinebio_accession_code"
   )
 ```
 
 ## Plot PCA Results
 
-Now let's plot the PC scores for the first two principal components since we know that they are responsible for the most explained proportion of variance in our dataset.
+Now let's plot the PC scores for the first two principal components.
 Let's also label the data points based on their genotype subgroup since medulloblastoma has been found to comprise subgroups that each have molecularly distinct profiles [@Northcott2012].
@@ -283,17 +295,18 @@ pca_plot <- ggplot( aes( x = PC1, y = PC2, - color = subgroup # This will label points with different colors for each `subgroup` + color = subgroup # label points with different colors for each `subgroup` ) ) + - geom_point() + # This tells R that we want a scatterplot - theme_classic() # This tells R to return a classic-looking plot with no gridlines + geom_point() + # Plot individual points to make a scatterplot + theme_classic() # Format as a classic-looking plot with no gridlines -# Print out plot here +# Print out the plot here pca_plot ``` -Looks like Group 4 and SHH groups somewhat cluster with each other but Group 3 seems to be less distinct as there are some samples clustering with Group 4 as well. +Looks like Group 4 and SHH groups cluster with each other somewhat, but Group 3 seems to be less distinct, as there are some samples clustering with Group 4 as well. +Most of the differences that we see between the groups are along the first axis of variation, PC1. We can add another label to our plot to get more information about our dataset. Let's also label the data points based on the histological subtype that each sample belongs to. @@ -305,19 +318,19 @@ pca_plot <- ggplot( aes( x = PC1, y = PC2, - color = subgroup, # This will label points with different colors for each `subgroup` - shape = histology # This will label points with different colors for each `histology` group + color = subgroup, # Draw points with different colors for each `subgroup` + shape = histology # Use a different shape for each `histology` group ) ) + geom_point() + theme_classic() -# Print out plot here +# Print out the plot here pca_plot ``` Adding the histological subtype label to our plot made our plot more informative, but the diffuse Group 3 data doesn't appear to be related to a histology subtype. -We could test out other variables as annotation labels to get a further understanding of the cluster behavior of each subgroup. 
+We could test out other variables as annotation labels to get a further understanding of the cluster behavior of each subgroup, or plot other PC values to see if they might also reveal some structure in the data. ## Save annotated PCA plot as a PNG @@ -327,17 +340,18 @@ You can easily switch this to save to a JPEG or TIFF by changing the file name w ```{r} # Save plot using `ggsave()` function -ggsave(file.path( - plots_dir, - "GSE37382_pca_scatterplot.png" # Replace with name relevant your plotted data -), -plot = pca_plot # Here we are giving the function the plot object that we want saved to file +ggsave( + file.path( + plots_dir, + "GSE37382_pca_scatterplot.png" # Replace with a good file name for your plot + ), + plot = pca_plot # The plot object that we want saved to file ) ``` # Resources for further learning -- [Overall PCA Explanation by Matt Brems](https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c) [@Brems2017] +- [Overall PCA Explanation by Matt Brems](https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c) [@Brems2017] - [A visual explanation of PCA](http://setosa.io/ev/principal-component-analysis/) [@pca-visually-explained] - [Guidelines on choosing dimension reduction methods](https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1006907&type=printable) [@Nguyen2019] - [More on `ggplot2`](http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html) [@Prabhakaran2016] @@ -345,7 +359,7 @@ plot = pca_plot # Here we are giving the function the plot object that we want s # Session info At the end of every analysis, before saving your notebook, we recommend printing out your session info. -This helps make your code more reproducible by recording what versions of software and packages you used to run this. +This helps make your code more reproducible by recording what versions of software and packages you used to run this. 
```{r} # Print session info diff --git a/02-microarray/dimension-reduction_microarray_01_pca.html b/02-microarray/dimension-reduction_microarray_01_pca.html index 843c4993..e550936b 100644 --- a/02-microarray/dimension-reduction_microarray_01_pca.html +++ b/02-microarray/dimension-reduction_microarray_01_pca.html @@ -1263,25 +1263,22 @@ }; - - + + code.sourceCode > span { display: inline-block; line-height: 1.25; } + code.sourceCode > span { color: inherit; text-decoration: inherit; } + code.sourceCode > span:empty { height: 1.2em; } + .sourceCode { overflow: visible; } + code.sourceCode { white-space: pre; position: relative; } + div.sourceCode { margin: 1em 0; } + pre.sourceCode { margin: 0; } + @media screen { + div.sourceCode { overflow: auto; } + } + @media print { + code.sourceCode { white-space: pre-wrap; } + code.sourceCode > span { text-indent: -5em; padding-left: 5em; } + } + pre.numberSource code + { counter-reset: source-line 0; } + pre.numberSource code > span + { position: relative; left: -4em; counter-increment: source-line; } + pre.numberSource code > span > a:first-child::before + { content: counter(source-line); + position: relative; left: -1em; text-align: right; vertical-align: baseline; + border: none; display: inline-block; + -webkit-touch-callout: none; -webkit-user-select: none; + -khtml-user-select: none; -moz-user-select: none; + -ms-user-select: none; user-select: none; + padding: 0 4px; width: 4em; + color: #aaaaaa; + } + pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; } + div.sourceCode + { } + @media screen { + code.sourceCode > span > a:first-child::before { text-decoration: underline; } + } + code span.al { color: #ff0000; } /* Alert */ + code span.an { color: #008000; } /* Annotation */ + code span.at { } /* Attribute */ + code span.bu { } /* BuiltIn */ + code span.cf { color: #0000ff; } /* ControlFlow */ + code span.ch { color: #008080; } /* Char */ + code span.cn { } /* Constant */ + code 
span.co { color: #008000; } /* Comment */ + code span.cv { color: #008000; } /* CommentVar */ + code span.do { color: #008000; } /* Documentation */ + code span.er { color: #ff0000; font-weight: bold; } /* Error */ + code span.ex { } /* Extension */ + code span.im { } /* Import */ + code span.in { color: #008000; } /* Information */ + code span.kw { color: #0000ff; } /* Keyword */ + code span.op { } /* Operator */ + code span.ot { color: #ff4000; } /* Other */ + code span.pp { color: #ff4000; } /* Preprocessor */ + code span.sc { color: #008080; } /* SpecialChar */ + code span.ss { color: #008080; } /* SpecialString */ + code span.st { color: #008080; } /* String */ + code span.va { } /* Variable */ + code span.vs { color: #008080; } /* VerbatimString */ + code span.wa { color: #008000; font-weight: bold; } /* Warning */ + + + + - - + @@ -2874,15 +3686,20 @@ @@ -2971,26 +3797,26 @@

2.1 Obtain the .Rmd

2.2 Set up your analysis folders

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

-
# Create the data folder if it doesn't exist
-if (!dir.exists("data")) {
-  dir.create("data")
-}
-
-# Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
-
-# Create the plots folder if it doesn't exist
-if (!dir.exists(plots_dir)) {
-  dir.create(plots_dir)
-}
-
-# Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
-
-# Create the results folder if it doesn't exist
-if (!dir.exists(results_dir)) {
-  dir.create(results_dir)
-}
+
# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+  dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots"
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+  dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results"
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+  dir.create(results_dir)
+}

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

@@ -3040,20 +3866,25 @@

2.6 Check out our file structure!

In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset we are using for this analysis (see next section for more on this), we will only have to change the file path here to get started.

-
# Define the file path to the data directory
-data_dir <- file.path("data", "GSE37382") # Replace with accession number which will be the name of the folder the files will be in
-
-# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
-data_file <- file.path(data_dir, "GSE37382.tsv") # Replace with file path to your dataset
-
-# Declare the file path to the metadata file using the data directory saved as `data_dir`
-metadata_file <- file.path(data_dir, "metadata_GSE37382.tsv") # Replace with file path to your metadata
+
# Define the file path to the data directory
+# Replace with the path of the folder the files will be in
+data_dir <- file.path("data", "GSE37382")
+
+# Declare the file path to the gene expression matrix file
+# inside directory saved as `data_dir`
+# Replace with the path to your dataset file
+data_file <- file.path(data_dir, "GSE37382.tsv")
+
+# Declare the file path to the metadata file
+# inside the directory saved as `data_dir`
+# Replace with the path to your metadata file
+metadata_file <- file.path(data_dir, "metadata_GSE37382.tsv")

Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

-
# Check if the gene expression matrix file is at the file path stored in `data_file`
-file.exists(data_file)
+
# Check if the gene expression matrix file is at the path stored in `data_file`
+file.exists(data_file)
## [1] TRUE
-
# Check if the metadata file is at the file path stored in `metadata_file`
-file.exists(metadata_file)
+
# Check if the metadata file is at the file path stored in `metadata_file`
+file.exists(metadata_file)
## [1] TRUE

If the chunk above printed out FALSE to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
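
If we would rather have the notebook stop early than print `FALSE` and carry on, the same `file.exists()` check can be wrapped in a small guard. This is only a sketch: `check_files_exist()` is a hypothetical helper that is not part of this notebook, and the temporary files below stand in for `data_file` and `metadata_file`.

```r
# Hypothetical helper (not part of the original notebook):
# stop with an informative message if any expected input file is missing
check_files_exist <- function(paths) {
  missing_files <- paths[!file.exists(paths)]
  if (length(missing_files) > 0) {
    stop(
      "Missing input file(s): ",
      paste(missing_files, collapse = ", ")
    )
  }
  invisible(TRUE)
}

# Demonstrate with a temporary file so this sketch runs anywhere
existing_file <- tempfile(fileext = ".tsv")
file.create(existing_file)
absent_file <- file.path(tempdir(), "this_file_should_not_exist.tsv")

# Passes silently when the file is present
check_files_exist(existing_file)

# An absent file raises an error, which we capture here to inspect the message
result <- tryCatch(
  check_files_exist(c(existing_file, absent_file)),
  error = function(e) conditionMessage(e)
)
result
```

In a real analysis we would call something like `check_files_exist(c(data_file, metadata_file))` right after declaring the paths.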

If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.

@@ -3072,19 +3903,20 @@

4 PCA Visualization - Microarray

4.1 Install libraries

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

Attach the packages we need for this analysis:

-
# Attach the library
-library(ggplot2)
-
-# We will need this so we can use the pipe: %>%
-library(magrittr)
+
# Attach the library
+library(ggplot2)
+
+# We will need this so we can use the pipe: %>%
+library(magrittr)

4.2 Import and set up data

Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read both TSV files and add them as data frames to your environment.

We stored our file paths as objects named metadata_file and data_file in this previous step.

-
# Read in metadata TSV file
-metadata <- readr::read_tsv(metadata_file)
-
## Parsed with column specification:
+
# Read in metadata TSV file
+metadata <- readr::read_tsv(metadata_file)
+
## 
+## ── Column specification ──────────────────────────────────────────────
 ## cols(
 ##   .default = col_character(),
 ##   refinebio_age = col_double(),
@@ -3104,154 +3936,147 @@ 

4.2 Import and set up data

## `contact_zip/postal_code` = col_double(), ## data_row_count = col_double(), ## taxid_ch1 = col_double() -## )
-
## See spec(...) for full column specifications.
-
# Read in data TSV file
-df <- readr::read_tsv(data_file) %>%
-  # Tuck away the gene ID  column as rownames
-  tibble::column_to_rownames("Gene")
-
## Parsed with column specification:
+## )
+## ℹ Use `spec()` for the full column specifications.
+
# Read in data TSV file
+df <- readr::read_tsv(data_file) %>%
+  # Tuck away the gene ID column as row names, leaving only numeric values
+  tibble::column_to_rownames("Gene")
+
## 
+## ── Column specification ──────────────────────────────────────────────
 ## cols(
 ##   .default = col_double(),
 ##   Gene = col_character()
 ## )
-## See spec(...) for full column specifications.
+## ℹ Use `spec()` for the full column specifications.

Let’s ensure that the metadata and data are in the same sample order.

-
# Make the data in the order of the metadata
-df <- df %>%
-  dplyr::select(metadata$geo_accession)
-
-# Check if this is in the same order
-all.equal(colnames(df), metadata$geo_accession)
+
# Make sure the columns (samples) are in the same order as the metadata
+df <- df %>%
+  dplyr::select(metadata$geo_accession)
+
+# Check if this is in the same order
+all.equal(colnames(df), metadata$geo_accession)
## [1] TRUE
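
To see what the reordering step is doing, it can help to try it on a tiny made-up example. The sketch below uses toy objects (`toy_df`, `toy_metadata`) with invented values, and it swaps in base R column indexing in place of `dplyr::select()` so that it runs without any packages attached.

```r
# Toy expression data: genes in rows, samples in columns (made-up values)
toy_df <- data.frame(
  sample_C = c(1.2, 3.4),
  sample_A = c(5.6, 7.8),
  sample_B = c(9.0, 2.1),
  row.names = c("gene_1", "gene_2")
)

# Toy metadata with samples in a different order than the columns above
toy_metadata <- data.frame(
  geo_accession = c("sample_A", "sample_B", "sample_C"),
  subgroup = c("Group 3", "Group 4", "SHH"),
  stringsAsFactors = FALSE
)

# Before reordering, this reports a mismatch instead of TRUE
all.equal(colnames(toy_df), toy_metadata$geo_accession)

# Reorder the columns by name so they follow the metadata order
toy_df <- toy_df[, toy_metadata$geo_accession]

# Now the sample order matches
all.equal(colnames(toy_df), toy_metadata$geo_accession)
```

Because the columns are selected by name, each sample keeps its own values; only the column order changes.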

Now we are going to use a combination of functions from base R and the ggplot2 package to perform and visualize the results of the Principal Component Analysis (PCA) dimension reduction technique on our medulloblastoma samples.

4.3 Perform Principal Components Analysis

-

In this code chunk, we are going to perform Principal Component Analysis (PCA) on our data and create a data frame using the PCA scores and the variables from our metadata that we are going to use to annotate our plot later. We are using the base R prcomp() function to perform Principal Component Analysis (PCA) here.

-
# Perform Principal Component Analysis (PCA) using the `prcomp()` function
-pca <- prcomp(
-  t(df), # We have to transpose our data frame so we are obtaining PCA scores for samples instead of genes
-  scale = TRUE # This tells R that we want the variables scaled to have unit variance
-)
-

Let’s take a preview at the PCA results. We are using indexing to only print out the first 10 PCs: [, 1:10].

-
# We can access the results from our `pca` object using `$x`
-head(pca$x[, 1:10])
-
##                   PC1        PC2        PC3       PC4       PC5        PC6
-## GSM917111 -18.1468442 -63.659225   1.136901 -5.242875  10.29114 -17.724403
-## GSM917250 -51.2350460  45.762776  -2.183100  7.948304 -21.26658 -24.668972
-## GSM917281 -42.4774592   2.216351  19.941918 -5.664602 -49.00085 -13.306383
-## GSM917062  -7.6116070 -25.887243 -65.257099 -6.226487  32.73786  -7.159006
-## GSM917288 -54.9540801  51.804918  42.332093 28.506307  54.17483  46.601027
-## GSM917230  -0.0325771 -11.070407  -6.555240 28.661922 -20.85879  17.266892
-##                  PC7        PC8        PC9       PC10
-## GSM917111   6.527599 -15.023332   7.640187 -16.155061
-## GSM917250  -8.472525  -1.592417  -6.242858  -5.730141
-## GSM917281  39.499111   3.513968 -18.804197  -3.578259
-## GSM917062 -10.549933 -47.861391   3.256229 -29.137628
-## GSM917288  24.893176   5.242399  67.374223  -8.003784
-## GSM917230  19.109068  33.763989 -17.114146   1.061423
-

In total, we do have 285 principal component values, because we provided 285 sample’s data.

+

In this code chunk, we are going to perform Principal Component Analysis (PCA) on our data and create a data frame using the PCA scores and the variables from our metadata that we are going to use to annotate our plot later. We are using the base R prcomp() function to perform Principal Component Analysis (PCA) here. The prcomp() function calculates principal component scores for each row of a matrix, but our data is arranged with each sample in a column, so we will need to transpose the data frame first. In most cases, we will want to use the scale = TRUE argument so that all of the expression measurements have the same variance. This prevents the PCA results from being dominated by a few highly variable genes.

+
# Perform Principal Component Analysis (PCA) using the `prcomp()` function
+pca <- prcomp(
+  t(df), # transpose our data frame to obtain PC scores for samples, not genes
+  scale = TRUE # we want the data scaled to have unit variance for each gene
+)
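
To illustrate why scaling matters, here is a small sketch on simulated data (not the GSE37382 dataset) in which one gene is far more variable than the rest. All object names are made up for this illustration.

```r
set.seed(2020)

# Simulated data: 5 genes in rows, 50 samples in columns
n_samples <- 50
toy_expression <- rbind(
  gene_1 = rnorm(n_samples, mean = 0, sd = 100), # one highly variable gene
  gene_2 = rnorm(n_samples),
  gene_3 = rnorm(n_samples),
  gene_4 = rnorm(n_samples),
  gene_5 = rnorm(n_samples)
)

# Transpose so samples are rows, as `prcomp()` expects
pca_unscaled <- prcomp(t(toy_expression), scale = FALSE)
pca_scaled <- prcomp(t(toy_expression), scale = TRUE)

# Proportion of variance explained by PC1 in each case
summary(pca_unscaled)$importance["Proportion of Variance", "PC1"]
summary(pca_scaled)$importance["Proportion of Variance", "PC1"]

# Contribution (loading) of each gene to PC1 without scaling
round(pca_unscaled$rotation[, "PC1"], 3)
```

Without scaling, PC1 is made up almost entirely of `gene_1`; with scaling, each gene contributes on an equal footing and no single gene dominates.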
+

Let’s take a preview of the PCA results. Each row will be a sample, as in the transposed data matrix we used as input, and each column is one of the new principal component (PC) values. We are using indexing to only print out the first 5 PC columns: [, 1:5].

+
# We can access the results from our `pca` object using `$x`
+head(pca$x[, 1:5])
+
##                   PC1        PC2        PC3       PC4       PC5
+## GSM917111 -18.1468442 -63.659225   1.136901 -5.242875  10.29114
+## GSM917250 -51.2350460  45.762776  -2.183100  7.948304 -21.26658
+## GSM917281 -42.4774592   2.216351  19.941918 -5.664602 -49.00085
+## GSM917062  -7.6116070 -25.887243 -65.257099 -6.226487  32.73786
+## GSM917288 -54.9540801  51.804918  42.332093 28.506307  54.17483
+## GSM917230  -0.0325771 -11.070407  -6.555240 28.661922 -20.85879
+

In total, we have 285 principal component values, because we provided 285 samples’ data (we will always have as many PCs as the smaller dimension of the input data matrix).
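
The point about the number of PCs can be checked on a small simulated matrix. This is a sketch with made-up data: 20 genes and only 6 samples, so we would expect 6 PCs.

```r
set.seed(2020)

# Simulated matrix (not real data): 20 genes in rows, 6 samples in columns
toy_matrix <- matrix(rnorm(20 * 6), nrow = 20, ncol = 6)

# Samples become rows after transposing
toy_pca <- prcomp(t(toy_matrix), scale = TRUE)

# Rows of `$x` are samples; columns are PCs
dim(toy_pca$x)

# The number of PCs equals the smaller dimension of the input
ncol(toy_pca$x) == min(dim(toy_matrix))
```

Even though there are 20 genes, we only get as many PCs as there are samples.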

4.4 Explore Variance in PCA Results

-

Before visualizing and interpreting the results, it can be useful to understand the proportion of variance explained by each principal component. The principal components are automatically ordered by the variance they explain, meaning PC1 would always be the principal component that explains the most variance in your dataset. If the largest variance component, PC1, explained 96% of the variance in your data and very clearly showed a difference between sample batches you would be very concerned about your dataset! On the other hand, if a separation of batches was apparent in a different principal component that explained a low proportion of variance and the first few PCs explained most of the variance and appeared to correspond to something like tissue type and treatment, you would be less concerned (CCDL 2020).

+

Before visualizing and interpreting the results, it can be useful to understand the proportion of variance explained by each principal component. The principal components are automatically ordered by the variance they explain, meaning PC1 would always be the principal component that explains the most variance in your dataset. If the largest variance component, PC1, explained 96% of the variance in your data and very clearly showed a difference between sample batches you would be very concerned about your dataset! On the other hand, if a separation of batches was apparent in a different principal component that explained a low proportion of variance and the first few PCs explained most of the variance and appeared to correspond to something like tissue type and treatment, you would be less concerned (Childhood Cancer Data Lab 2020).

The summary() function reports the proportion of variance explained by each principal component.

-
# Save the summary of the PCA results using the `summary()` function
-pca_summary <- summary(pca)
-

By accessing the importance element, which contains the proportion of variance explained by each principal component, with pca_summary$importance, we can use indexing to only look at the first n PCs.

-
# Now access the importance information for the first 10 PCs -- we can access this information `pca_summary$importance`
-pca_summary$importance[, 1:10]
-
##                             PC1      PC2      PC3      PC4      PC5      PC6
-## Standard deviation     45.50502 33.87890 30.50038 27.51175 23.99828 22.88496
-## Proportion of Variance  0.09560  0.05299  0.04295  0.03494  0.02659  0.02418
-## Cumulative Proportion   0.09560  0.14858  0.19153  0.22647  0.25306  0.27724
-##                             PC7      PC8      PC9     PC10
-## Standard deviation     21.17880 18.73075 18.13267 17.95956
-## Proportion of Variance  0.02071  0.01620  0.01518  0.01489
-## Cumulative Proportion   0.29795  0.31414  0.32932  0.34421
-

Now that we’ve seen the proportion of variance for the first ten PCs, let’s prepare and plot the PC scores for the first two principal components, the components responsible for the most explained proportion of variance in our dataset.

+
# Save the summary of the PCA results using the `summary()` function
+pca_summary <- summary(pca)
+

The importance element of the summary object contains the proportion of variance explained by each principal component, along with other statistics. After accessing it with pca_summary$importance, we can use indexing to look at only the first n PCs.

+
# Now access the importance information for the first 5 PCs
+pca_summary$importance[, 1:5]
+
##                             PC1      PC2      PC3      PC4      PC5
+## Standard deviation     45.50502 33.87890 30.50038 27.51175 23.99828
+## Proportion of Variance  0.09560  0.05299  0.04295  0.03494  0.02659
+## Cumulative Proportion   0.09560  0.14858  0.19153  0.22647  0.25306
+

Now that we’ve seen the proportion of variance for the first set of PCs, let’s prepare and plot the PC scores for the first two principal components, the components that explain the largest proportion of the expression variance in our dataset. (Note though, that in this case, they explain less than 15% of the total variance!)
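
The proportions reported by `summary()` can also be reproduced by hand from the standard deviations stored in the `prcomp()` result, which may make it clearer where the numbers come from. The sketch below uses a simulated matrix and made-up object names rather than our `pca` object.

```r
set.seed(2020)

# Simulated matrix (not real data): 6 samples in rows, 20 genes in columns
toy_samples <- matrix(rnorm(6 * 20), nrow = 6, ncol = 20)
toy_pca <- prcomp(toy_samples, scale = TRUE)

# The variance of each PC is the square of its standard deviation
pc_variance <- toy_pca$sdev^2

# Proportion of the total variance explained by each PC
prop_variance <- pc_variance / sum(pc_variance)

# Running total, comparable to the "Cumulative Proportion" row
cumulative_variance <- cumsum(prop_variance)

# Compare against what `summary()` reports (which is rounded)
summary(toy_pca)$importance["Proportion of Variance", ]
round(prop_variance, 5)
```

Applied to our own results, `cumsum(pca$sdev^2 / sum(pca$sdev^2))` would give the same information as the "Cumulative Proportion" row for every PC at once.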

4.5 Prepare a final data frame with PCA results for plotting

In the next chunk, we are going to extract the first two principal components from our pca object to prepare a data frame for plotting.

-
# Make the first two principal components into a data frame for plotting with `ggplot2`
-pca_df <- data.frame(pca$x[, 1:2]) %>%
-  # Turn samples_ids stored as rownames into column
-  tibble::rownames_to_column("refinebio_accession_code") %>%
-  # Bring only the variables that we want from the metadata into this data frame -- here we are going to join by `refinebio_accession_code` values
-  dplyr::inner_join(dplyr::select(metadata, refinebio_accession_code, histology, subgroup),
-    by = "refinebio_accession_code"
-  )
+
# Make the first two PCs into a data frame for plotting with `ggplot2`
+pca_df <- data.frame(pca$x[, 1:2]) %>%
+  # Turn samples IDs stored as row names into a column
+  tibble::rownames_to_column("refinebio_accession_code") %>%
+  # Bring only the variables that we want from the metadata into this data frame
+  # here we are going to join by `refinebio_accession_code` values
+  dplyr::inner_join(
+    dplyr::select(metadata, refinebio_accession_code, histology, subgroup),
+    by = "refinebio_accession_code"
+  )

4.6 Plot PCA Results

-

Now let’s plot the PC scores for the first two principal components since we know that they are responsible for the most explained proportion of variance in our dataset.

-

Let’s also label the data points based on their genotype subgroup since medulloblastoma has been found to comprise of subgroups that each have molecularly distinct profiles (Northcott et al. 2012).

-
# Make a scatterplot using `ggplot2` functionality
-pca_plot <- ggplot(
-  pca_df,
-  aes(
-    x = PC1,
-    y = PC2,
-    color = subgroup # This will label points with different colors for each `subgroup`
-  )
-) +
-  geom_point() + # This tells R that we want a scatterplot
-  theme_classic() # This tells R to return a classic-looking plot with no gridlines
-
-# Print out plot here
-pca_plot
+

Now let’s plot the PC scores for the first two principal components.

+

Let’s also label the data points based on their genotype subgroup since medulloblastoma has been found to comprise subgroups that each have molecularly distinct profiles (Northcott et al. 2012).

+
# Make a scatterplot using `ggplot2` functionality
+pca_plot <- ggplot(
+  pca_df,
+  aes(
+    x = PC1,
+    y = PC2,
+    color = subgroup # label points with different colors for each `subgroup`
+  )
+) +
+  geom_point() + # Plot individual points to make a scatterplot
+  theme_classic() # Format as a classic-looking plot with no gridlines
+
+# Print out the plot here
+pca_plot

-

Looks like Group 4 and SHH groups somewhat cluster with each other but Group 3 seems to be less distinct as there are some samples clustering with Group 4 as well.

+

Looks like Group 4 and SHH groups cluster with each other somewhat, but Group 3 seems to be less distinct, as there are some samples clustering with Group 4 as well. Most of the differences that we see between the groups are along the first axis of variation, PC1.

We can add another label to our plot to get more information about our dataset. Let’s also label the data points based on the histological subtype that each sample belongs to.

-
# Make a scatterplot with ggplot2
-pca_plot <- ggplot(
-  pca_df,
-  aes(
-    x = PC1,
-    y = PC2,
-    color = subgroup, # This will label points with different colors for each `subgroup`
-    shape = histology # This will label points with different colors for each `histology` group
-  )
-) +
-  geom_point() +
-  theme_classic()
-
-# Print out plot here
-pca_plot
+
# Make a scatterplot with ggplot2
+pca_plot <- ggplot(
+  pca_df,
+  aes(
+    x = PC1,
+    y = PC2,
+    color = subgroup, # Draw points with different colors for each `subgroup`
+    shape = histology # Use a different shape for each `histology` group
+  )
+) +
+  geom_point() +
+  theme_classic()
+
+# Print out the plot here
+pca_plot

-

Adding the histological subtype label to our plot made our plot more informative, but the diffuse Group 3 data doesn’t appear to be related to a histology subtype. We could test out other variables as annotation labels to get a further understanding of the cluster behavior of each subgroup.

+

Adding the histological subtype label to our plot made our plot more informative, but the diffuse Group 3 data doesn’t appear to be related to a histology subtype. We could test out other variables as annotation labels to get a further understanding of the cluster behavior of each subgroup, or plot other PC values to see if they might also reveal some structure in the data.
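
As a rough sketch of how we might plot other PC values, the example below picks out PC3 and PC4 by column name and plots them against each other. It uses a simulated matrix and made-up object names, so the plot itself is not meaningful; with our real results we would select the columns from `pca$x` and join the metadata as before.

```r
library(ggplot2)

set.seed(2020)

# Simulated matrix (not real data): 12 samples in rows, 30 genes in columns
toy_samples <- matrix(
  rnorm(12 * 30),
  nrow = 12,
  dimnames = list(paste0("sample_", 1:12), paste0("gene_", 1:30))
)
toy_pca <- prcomp(toy_samples, scale = TRUE)

# Select PCs by column name rather than by position
toy_pc_df <- data.frame(toy_pca$x[, c("PC3", "PC4")])

# Plot PC3 against PC4 instead of PC1 against PC2
toy_pc_plot <- ggplot(toy_pc_df, aes(x = PC3, y = PC4)) +
  geom_point() +
  theme_classic()

# Print out the plot here
toy_pc_plot
```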

4.7 Save annotated PCA plot as a PNG

Now that we have an annotated PCA plot, let’s save it!

You can easily switch this to save to a JPEG or TIFF by changing the file name within the ggsave() function to the respective file suffix.

-
# Save plot using `ggsave()` function
-ggsave(file.path(
-  plots_dir,
-  "GSE37382_pca_scatterplot.png" # Replace with name relevant your plotted data
-),
-plot = pca_plot # Here we are giving the function the plot object that we want saved to file
-)
+
# Save plot using `ggsave()` function
+ggsave(
+  file.path(
+    plots_dir,
+    "GSE37382_pca_scatterplot.png" # Replace with a good file name for your plot
+  ),
+  plot = pca_plot # The plot object that we want saved to file
+)
## Saving 7 x 5 in image
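
The "Saving 7 x 5 in image" message appears because no size was given, so `ggsave()` fell back on a default. Here is a sketch of saving to a different format with an explicit size; the toy data, the temporary folder, and the file name are all stand-ins chosen for illustration, and it assumes a JPEG graphics device is available on your system.

```r
library(ggplot2)

# Toy data standing in for `pca_df` (values are made up)
toy_pca_df <- data.frame(
  PC1 = c(-2.1, 0.4, 1.7),
  PC2 = c(0.3, -1.2, 0.9),
  subgroup = c("Group 3", "Group 4", "SHH")
)

toy_plot <- ggplot(toy_pca_df, aes(x = PC1, y = PC2, color = subgroup)) +
  geom_point() +
  theme_classic()

# Hypothetical output location; a temporary folder keeps this sketch self-contained
toy_plots_dir <- tempdir()
toy_file <- file.path(toy_plots_dir, "toy_pca_scatterplot.jpeg")

# `ggsave()` picks the graphics device from the file suffix;
# setting width and height explicitly makes the output size reproducible
ggsave(toy_file, plot = toy_plot, width = 7, height = 5, units = "in")
```

Changing the suffix in the file name (for example to `.tiff`) is all that is needed to switch formats.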

5 Resources for further learning

6 Session info

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

-
# Print session info
-sessioninfo::session_info()
-
## ─ Session info ───────────────────────────────────────────────────────────────
+
# Print session info
+sessioninfo::session_info()
+
## ─ Session info ─────────────────────────────────────────────────────
 ##  setting  value                       
 ##  version  R version 4.0.2 (2020-06-22)
 ##  os       Ubuntu 20.04 LTS            
@@ -3261,13 +4086,13 @@ 

6 Session info

## collate en_US.UTF-8 ## ctype en_US.UTF-8 ## tz Etc/UTC -## date 2020-10-16 +## date 2020-12-14 ## -## ─ Packages ─────────────────────────────────────────────────────────────────── +## ─ Packages ───────────────────────────────────────────────────────── ## package * version date lib source ## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.0) ## backports 1.1.10 2020-09-15 [1] RSPM (R 4.0.2) -## cli 2.0.2 2020-02-28 [1] RSPM (R 4.0.0) +## cli 2.1.0 2020-10-12 [1] RSPM (R 4.0.2) ## colorspace 1.4-1 2019-03-18 [1] RSPM (R 4.0.0) ## crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.0) ## digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0) @@ -3291,16 +4116,16 @@

6 Session info

## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0) ## pillar 1.4.6 2020-07-10 [1] RSPM (R 4.0.2) ## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.0) +## ps 1.4.0 2020-10-07 [1] RSPM (R 4.0.2) ## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.0) ## R.cache 0.14.0 2019-12-06 [1] RSPM (R 4.0.0) ## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.2) ## R.oo 1.24.0 2020-08-26 [1] RSPM (R 4.0.2) ## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.2) ## R6 2.4.1 2019-11-12 [1] RSPM (R 4.0.0) -## Rcpp 1.0.5 2020-07-06 [1] RSPM (R 4.0.2) -## readr 1.3.1 2018-12-21 [1] RSPM (R 4.0.2) +## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.2) ## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.0) -## rlang 0.4.7 2020-07-09 [1] RSPM (R 4.0.2) +## rlang 0.4.8 2020-10-08 [1] RSPM (R 4.0.2) ## rmarkdown 2.4 2020-09-30 [1] RSPM (R 4.0.2) ## rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0) ## scales 1.1.1 2020-05-11 [1] RSPM (R 4.0.0) @@ -3308,7 +4133,7 @@

6 Session info

 ##  stringi       1.5.3   2020-09-09 [1] RSPM (R 4.0.2)
 ##  stringr       1.4.0   2019-02-10 [1] RSPM (R 4.0.0)
 ##  styler        1.3.2   2020-02-23 [1] RSPM (R 4.0.0)
-##  tibble        3.0.3   2020-07-10 [1] RSPM (R 4.0.2)
+##  tibble        3.0.4   2020-10-12 [1] RSPM (R 4.0.2)
 ##  tidyselect    1.1.0   2020-05-11 [1] RSPM (R 4.0.0)
 ##  vctrs         0.3.4   2020-08-29 [1] RSPM (R 4.0.2)
 ##  withr         2.3.0   2020-09-22 [1] RSPM (R 4.0.2)
@@ -3322,10 +4147,10 @@

6 Session info

References

-

Brems M., 2017 A one-stop shop for principal component analysis

+

Brems M., 2017 A one-stop shop for principal component analysis. https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c

-

CCDL, 2020 OpenPBTA: Cluster validation.

+

Childhood Cancer Data Lab, 2020 OpenPBTA: Cluster validation. https://github.com/AlexsLemonade/training-modules/blob/3dbc6f3f53c680ec6aa2f513851c1cd4635cc31c/machine-learning/02-openpbta_consensus_clustering.Rmd#L310

Nguyen L. H., and S. Holmes, 2019 Ten quick tips for effective dimensionality reduction. PLOS Computational Biology 15. https://doi.org/10.1371/journal.pcbi.1006907

@@ -3334,14 +4159,19 @@

References

Northcott P., D. Shih, J. Peacock, L. Garzia, and S. Morrissy et al., 2012 Subgroup specific structural variation across 1,000 medulloblastoma genomes. Nature 488. https://doi.org/10.1038/nature11327

-

Powell V., and L. Lehe, Principal component analysis explained visually

+

Powell V., and L. Lehe, Principal component analysis explained visually. https://setosa.io/ev/principal-component-analysis/

-

Prabhakaran S., 2016 The complete ggplot2 tutorial.

+

Prabhakaran S., 2016 The complete ggplot2 tutorial. http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html

+ diff --git a/02-microarray/dimension-reduction_microarray_02_umap.Rmd b/02-microarray/dimension-reduction_microarray_02_umap.Rmd index 353ce5c4..5a35687e 100644 --- a/02-microarray/dimension-reduction_microarray_02_umap.Rmd +++ b/02-microarray/dimension-reduction_microarray_02_umap.Rmd @@ -44,7 +44,7 @@ if (!dir.exists("data")) { } # Define the file path to the plots directory -plots_dir <- "plots" # Can replace with path to desired output plots directory +plots_dir <- "plots" # Create the plots folder if it doesn't exist if (!dir.exists(plots_dir)) { @@ -52,7 +52,7 @@ if (!dir.exists(plots_dir)) { } # Define the file path to the results directory -results_dir <- "results" # Can replace with path to desired output results directory +results_dir <- "results" # Create the results folder if it doesn't exist if (!dir.exists(results_dir)) { @@ -70,7 +70,7 @@ Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/GSE Click the "Download Now" button on the right side of this screen. 
- + Fill out the pop up window with your email and our Terms and Conditions: @@ -127,19 +127,24 @@ This is handy to do because if we want to switch the dataset (see next section f ```{r} # Define the file path to the data directory -data_dir <- file.path("data", "GSE37382") # Replace with accession number which will be the name of the folder the files will be in - -# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir` -data_file <- file.path(data_dir, "GSE37382.tsv") # Replace with file path to your dataset - -# Declare the file path to the metadata file using the data directory saved as `data_dir` -metadata_file <- file.path(data_dir, "metadata_GSE37382.tsv") # Replace with file path to your metadata +# Replace with the path of the folder the files will be in +data_dir <- file.path("data", "GSE37382") + +# Declare the file path to the gene expression matrix file +# inside directory saved as `data_dir` +# Replace with the path to your dataset file +data_file <- file.path(data_dir, "GSE37382.tsv") + +# Declare the file path to the metadata file +# inside the directory saved as `data_dir` +# Replace with the path to your metadata file +metadata_file <- file.path(data_dir, "metadata_GSE37382.tsv") ``` Now that our file paths are declared, we can use the `file.exists()` function to check that the files are where we specified above. 
```{r} -# Check if the gene expression matrix file is at the file path stored in `data_file` +# Check if the gene expression matrix file is at the path stored in `data_file` file.exists(data_file) # Check if the metadata file is at the file path stored in `metadata_file` @@ -180,7 +185,7 @@ if (!("umap" %in% installed.packages())) { Attach the packages we need for this analysis: -```{r} +```{r message=FALSE} # Attach the `umap` library library(umap) @@ -211,14 +216,14 @@ metadata <- readr::read_tsv(metadata_file) # Read in data TSV file df <- readr::read_tsv(data_file) %>% - # Tuck away the gene ID column as rownames + # Tuck away the gene ID column as row names, leaving only numeric values tibble::column_to_rownames("Gene") ``` Let's ensure that the metadata and data are in the same sample order. ```{r} -# Make the data in the order of the metadata +# Make sure the columns (samples) are in the same order as the metadata df <- df %>% dplyr::select(metadata$geo_accession) @@ -226,35 +231,40 @@ df <- df %>% all.equal(colnames(df), metadata$geo_accession) ``` -## Perform Uniform Manifold Approximation (UMAP) - Now we are going to use a combination of functions from the `umap` and the `ggplot2` packages to perform and visualize the results of the Uniform Manifold Approximation (UMAP) dimension reduction technique on our medulloblastoma samples. +## Perform Uniform Manifold Approximation (UMAP) + In this code chunk, we are going to perform Uniform Manifold Approximation (UMAP) on our data and create a data frame using the UMAP scores and the variables from our metadata that we are going to use to annotate our plot later. +The `umap()` function calculates scores for each row of a matrix, but our data is arranged with each sample in a column, so we will need to transpose the data frame first. 
 ```{r}
-# Perform Uniform Manifold Approximation (UMAP) using the `umap::umap()` function
+# Perform UMAP using the `umap::umap()` function
 umap_data <- umap::umap(
-  t(df) # We have to transpose our data frame so we are obtaining UMAP scores for samples instead of genes
+  t(df) # transpose our data frame to obtain UMAP scores for samples, not genes
 )
 
 # Make into data frame for plotting with `ggplot2`
-umap_df <- data.frame(umap_data$layout) %>% # The umap values we need for plotting are stored in this `layout` element
-  # Turn samples_ids stored as rownames into column
+# The UMAP values we need for plotting are stored in the `layout` element
+umap_df <- data.frame(umap_data$layout) %>%
+  # Turn sample IDs stored as row names into a column
   tibble::rownames_to_column("refinebio_accession_code") %>%
-  # Bring only the variables that we want from the metadata into this data frame; match by sample ids
-  dplyr::inner_join(dplyr::select(metadata, refinebio_accession_code, histology, subgroup),
+  # Add on the variables that we want from the metadata into this data frame;
+  # match by sample IDs
+  dplyr::inner_join(
+    dplyr::select(metadata, refinebio_accession_code, histology, subgroup),
     by = "refinebio_accession_code"
   )
 ```
 
-Let's take a preview at the data frame we created in the chunk above.
+Let's take a look at the data frame we created in the chunk above.
 
 ```{r}
-head(umap_df)
+umap_df
 ```
 
-Now, let's plot the UMAP scores.
+Here we can see that UMAP took the data from thousands of genes, and reduced it to just two variables, `X1` and `X2`.
+Now, let's plot those UMAP scores.
 
```{r} # Make a scatterplot using `ggplot2` functionality @@ -265,10 +275,10 @@ umap_plot <- ggplot( y = X2 ) ) + - geom_point() + # This tells R that we want a scatterplot - theme_classic() # This tells R to return a classic-looking plot with no gridlines + geom_point() + # Plot individual points to make a scatterplot + theme_classic() # Format as a classic-looking plot with no gridlines -# Print out plot here +# Print out the plot here umap_plot ``` @@ -283,17 +293,17 @@ umap_plot <- ggplot( aes( x = X1, y = X2, - color = subgroup # This will label points with different colors for each `subgroup` + color = subgroup # label points with different colors for each `subgroup` ) ) + geom_point() + theme_classic() -# Print out plot here +# Print out the plot here umap_plot ``` -It looks like Group 4 and SHH groups somewhat cluster with each other but Group 3 seems to also be clustering with Group 4. +It looks like SHH clusters pretty distinctly, with Group 3 and Group 4 being more similar and grouping together (with some division). We can add another label to our plot to potentially gain insight on the clustering behavior of our data. Let's also label the data points based on the histological subtype that each sample belongs to. @@ -305,8 +315,8 @@ umap_plot <- ggplot( aes( x = X1, y = X2, - color = subgroup, # This will label points with different colors for each `subgroup` - shape = histology # This will label points with different colors for each `histology` group + color = subgroup, # Draw points with different colors for each `subgroup` + shape = histology # Use a different shape for each `histology` group ) ) + geom_point() + @@ -319,17 +329,31 @@ umap_plot Our histological subtype groups don't appear to be clustering in a discernible pattern. We could test out other variables as annotation labels to get a further understanding of the cluster behavior of each subgroup. +### Interpretation of UMAP plot and results + +1. 
Note that the coordinates of UMAP output for any given sample can change dramatically depending on the parameters, and can even differ between runs with the same parameters (which is also why setting the seed is important).
+
+This means that you should not rely too heavily on the exact values of UMAP's output.
+
+  - One particular limitation of UMAP is that while observed clusters have some meaning, the distance *between* clusters usually does not (nor does cluster density). The fact that two clusters are near each other should NOT be interpreted to mean that they are more related to each other than to more distant clusters. (There is some disagreement about whether UMAP distances have more meaning, but it is probably safer to assume they don't.)
+
+2. Adjusting the parameters to fine-tune them is a good way to learn more about a particular analysis as well as the data itself.
+Feel free to try playing with the parameters on your own in the code chunks above!
+
+In summary, a good rule of thumb to remember is: if the results of an analysis can be completely changed by changing its parameters, you should be very cautious about the conclusions you draw from it and have a good rationale for the parameters you choose (_adapted from @CCDL2020 training materials_).
+
 ## Save annotated UMAP plot as a PNG
 
 You can easily switch this to save to a JPEG or TIFF by changing the file extension in the file name passed to the `ggsave()` function.
 
```{r} # Save plot using `ggsave()` function -ggsave(file.path( - plots_dir, - "GSE37382_umap_scatterplot.png" # Replace with name relevant your plotted data -), -plot = umap_plot # Here we are giving the function the plot object that we want saved to file +ggsave( + file.path( + plots_dir, + "GSE37382_umap_plot.png" # Replace with a good file name for your plot + ), + plot = umap_plot # The plot object that we want saved to file ) ``` diff --git a/02-microarray/dimension-reduction_microarray_02_umap.html b/02-microarray/dimension-reduction_microarray_02_umap.html index b929afb2..12b59c5f 100644 --- a/02-microarray/dimension-reduction_microarray_02_umap.html +++ b/02-microarray/dimension-reduction_microarray_02_umap.html @@ -1263,25 +1263,22 @@ }; - - + + code.sourceCode > span { display: inline-block; line-height: 1.25; } + code.sourceCode > span { color: inherit; text-decoration: inherit; } + code.sourceCode > span:empty { height: 1.2em; } + .sourceCode { overflow: visible; } + code.sourceCode { white-space: pre; position: relative; } + div.sourceCode { margin: 1em 0; } + pre.sourceCode { margin: 0; } + @media screen { + div.sourceCode { overflow: auto; } + } + @media print { + code.sourceCode { white-space: pre-wrap; } + code.sourceCode > span { text-indent: -5em; padding-left: 5em; } + } + pre.numberSource code + { counter-reset: source-line 0; } + pre.numberSource code > span + { position: relative; left: -4em; counter-increment: source-line; } + pre.numberSource code > span > a:first-child::before + { content: counter(source-line); + position: relative; left: -1em; text-align: right; vertical-align: baseline; + border: none; display: inline-block; + -webkit-touch-callout: none; -webkit-user-select: none; + -khtml-user-select: none; -moz-user-select: none; + -ms-user-select: none; user-select: none; + padding: 0 4px; width: 4em; + color: #aaaaaa; + } + pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; } + div.sourceCode + 
{ } + @media screen { + code.sourceCode > span > a:first-child::before { text-decoration: underline; } + } + code span.al { color: #ff0000; } /* Alert */ + code span.an { color: #008000; } /* Annotation */ + code span.at { } /* Attribute */ + code span.bu { } /* BuiltIn */ + code span.cf { color: #0000ff; } /* ControlFlow */ + code span.ch { color: #008080; } /* Char */ + code span.cn { } /* Constant */ + code span.co { color: #008000; } /* Comment */ + code span.cv { color: #008000; } /* CommentVar */ + code span.do { color: #008000; } /* Documentation */ + code span.er { color: #ff0000; font-weight: bold; } /* Error */ + code span.ex { } /* Extension */ + code span.im { } /* Import */ + code span.in { color: #008000; } /* Information */ + code span.kw { color: #0000ff; } /* Keyword */ + code span.op { } /* Operator */ + code span.ot { color: #ff4000; } /* Other */ + code span.pp { color: #ff4000; } /* Preprocessor */ + code span.sc { color: #008080; } /* SpecialChar */ + code span.ss { color: #008080; } /* SpecialString */ + code span.st { color: #008080; } /* String */ + code span.va { } /* Variable */ + code span.vs { color: #008080; } /* VerbatimString */ + code span.wa { color: #008000; font-weight: bold; } /* Warning */ + + + + - - + @@ -2874,15 +3686,20 @@ @@ -2971,26 +3797,26 @@

2.1 Obtain the .Rmd

2.2 Set up your analysis folders

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

-
# Create the data folder if it doesn't exist
-if (!dir.exists("data")) {
-  dir.create("data")
-}
-
-# Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
-
-# Create the plots folder if it doesn't exist
-if (!dir.exists(plots_dir)) {
-  dir.create(plots_dir)
-}
-
-# Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
-
-# Create the results folder if it doesn't exist
-if (!dir.exists(results_dir)) {
-  dir.create(results_dir)
-}
+
# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+  dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots"
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+  dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results"
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+  dir.create(results_dir)
+}

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

@@ -3040,20 +3866,25 @@

2.6 Check out our file structure!

In order for our example here to run without a hitch, we need these files to be in these locations, so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy because if we want to switch the dataset we are using for this analysis (see next section for more on this), we will only have to change the file path here to get started.

-
# Define the file path to the data directory
-data_dir <- file.path("data", "GSE37382") # Replace with accession number which will be the name of the folder the files will be in
-
-# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
-data_file <- file.path(data_dir, "GSE37382.tsv") # Replace with file path to your dataset
-
-# Declare the file path to the metadata file using the data directory saved as `data_dir`
-metadata_file <- file.path(data_dir, "metadata_GSE37382.tsv") # Replace with file path to your metadata
+
# Define the file path to the data directory
+# Replace with the path of the folder the files will be in
+data_dir <- file.path("data", "GSE37382")
+
+# Declare the file path to the gene expression matrix file
+# inside the directory saved as `data_dir`
+# Replace with the path to your dataset file
+data_file <- file.path(data_dir, "GSE37382.tsv")
+
+# Declare the file path to the metadata file
+# inside the directory saved as `data_dir`
+# Replace with the path to your metadata file
+metadata_file <- file.path(data_dir, "metadata_GSE37382.tsv")

Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

-
# Check if the gene expression matrix file is at the file path stored in `data_file`
-file.exists(data_file)
+
# Check if the gene expression matrix file is at the path stored in `data_file`
+file.exists(data_file)
## [1] TRUE
-
# Check if the metadata file is at the file path stored in `metadata_file`
-file.exists(metadata_file)
+
# Check if the metadata file is at the file path stored in `metadata_file`
+file.exists(metadata_file)
## [1] TRUE

If the chunk above printed out FALSE to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.

If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.

@@ -3072,31 +3903,32 @@

4 UMAP Visualization - Microarray

4.1 Install libraries

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

-

In this analysis, we will be using the R package umap (Konopka 2020) for the production of UMAP dimension reduction values and the R package ggplot2 (Prabhakaran 2016) for plotting the UMAP values.

-
if (!("umap" %in% installed.packages())) {
-  # Install umap package
-  BiocManager::install("umap", update = FALSE)
-}
+

In this analysis, we will be using the R package umap (Konopka 2020) for the production of UMAP dimension reduction values and the R package ggplot2 (Prabhakaran 2016) for plotting the UMAP values.

+
if (!("umap" %in% installed.packages())) {
+  # Install umap package
+  BiocManager::install("umap", update = FALSE)
+}

Attach the packages we need for this analysis:

-
# Attach the `umap` library
-library(umap)
-
-# Attach the `ggplot2` library
-library(ggplot2)
-
-# We will need this so we can use the pipe: %>%
-library(magrittr)
+
# Attach the `umap` library
+library(umap)
+
+# Attach the `ggplot2` library
+library(ggplot2)
+
+# We will need this so we can use the pipe: %>%
+library(magrittr)

The UMAP algorithm utilizes random sampling so we are going to set the seed to make our results reproducible.

-
# Set the seed so our results are reproducible:
-set.seed(12345)
+
# Set the seed so our results are reproducible:
+set.seed(12345)
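
To see why setting the seed matters, here is a minimal sketch using only base R random numbers rather than UMAP itself. The object names (`first_draw`, `second_draw`, `third_draw`) are made up for this illustration and are not part of the analysis.

```r
# Toy demonstration (not part of the original analysis): resetting the seed
# before a random draw reproduces exactly the same numbers
set.seed(12345)
first_draw <- runif(3)

set.seed(12345)
second_draw <- runif(3)

# Drawing again without resetting the seed continues the random stream,
# so these numbers differ from the first draw
third_draw <- runif(3)

# Same seed, same numbers
identical(first_draw, second_draw)

# No reset, different numbers
identical(first_draw, third_draw)
```

Any step that relies on random sampling, such as UMAP, should behave the same way: the seed needs to be set before that step runs for the results to be repeatable.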

4.2 Import and set up data

Data downloaded from refine.bio include a metadata tab-separated values (TSV) file and a data TSV file. This chunk of code will read both TSV files and add them as data frames to your environment.

We stored our file paths as objects named metadata_file and data_file in this previous step.

-
# Read in metadata TSV file
-metadata <- readr::read_tsv(metadata_file)
-
## Parsed with column specification:
+
# Read in metadata TSV file
+metadata <- readr::read_tsv(metadata_file)
+
## 
+## ── Column specification ──────────────────────────────────────────────
 ## cols(
 ##   .default = col_character(),
 ##   refinebio_age = col_double(),
@@ -3116,130 +3948,148 @@ 

4.2 Import and set up data

 ##   `contact_zip/postal_code` = col_double(),
 ##   data_row_count = col_double(),
 ##   taxid_ch1 = col_double()
-## )
-
## See spec(...) for full column specifications.
-
# Read in data TSV file
-df <- readr::read_tsv(data_file) %>%
-  # Tuck away the gene ID  column as rownames
-  tibble::column_to_rownames("Gene")
-
## Parsed with column specification:
+## )
+## ℹ Use `spec()` for the full column specifications.
+
# Read in data TSV file
+df <- readr::read_tsv(data_file) %>%
+  # Tuck away the gene ID column as row names, leaving only numeric values
+  tibble::column_to_rownames("Gene")
+
## 
+## ── Column specification ──────────────────────────────────────────────
 ## cols(
 ##   .default = col_double(),
 ##   Gene = col_character()
 ## )
-## See spec(...) for full column specifications.
+## ℹ Use `spec()` for the full column specifications.

Let’s ensure that the metadata and data are in the same sample order.

-
# Make the data in the order of the metadata
-df <- df %>%
-  dplyr::select(metadata$geo_accession)
-
-# Check if this is in the same order
-all.equal(colnames(df), metadata$geo_accession)
+
# Make sure the columns (samples) are in the same order as the metadata
+df <- df %>%
+  dplyr::select(metadata$geo_accession)
+
+# Check if this is in the same order
+all.equal(colnames(df), metadata$geo_accession)
## [1] TRUE
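
The reordering step above can be sketched on a small made-up example. Everything here (`toy_df`, `toy_metadata`, and the sample and gene names) is hypothetical, and base R column indexing stands in for `dplyr::select()` so the sketch needs no extra packages.

```r
# Hypothetical toy data: three samples stored as columns, in a shuffled order
toy_df <- data.frame(
  sample_b = c(1.2, 3.4),
  sample_c = c(5.6, 7.8),
  sample_a = c(9.0, 2.1),
  row.names = c("gene_1", "gene_2")
)

# Hypothetical toy metadata listing the samples in the order we want
toy_metadata <- data.frame(
  geo_accession = c("sample_a", "sample_b", "sample_c"),
  stringsAsFactors = FALSE
)

# Reorder the columns so they follow the order of the metadata rows
toy_df <- toy_df[, toy_metadata$geo_accession]

# Check that the column names now match the metadata order
all.equal(colnames(toy_df), toy_metadata$geo_accession)
```

Selecting columns by a vector of names both subsets and reorders them, which is why one step is enough to line the data up with the metadata.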
+

Now we are going to use a combination of functions from the umap and the ggplot2 packages to perform and visualize the results of the Uniform Manifold Approximation (UMAP) dimension reduction technique on our medulloblastoma samples.

4.3 Perform Uniform Manifold Approximation (UMAP)

-

Now we are going to use a combination of functions from the umap and the ggplot2 packages to perform and visualize the results of the Uniform Manifold Approximation (UMAP) dimension reduction technique on our medulloblastoma samples.

-

In this code chunk, we are going to perform Uniform Manifold Approximation (UMAP) on our data and create a data frame using the UMAP scores and the variables from our metadata that we are going to use to annotate our plot later.

-
# Perform Uniform Manifold Approximation (UMAP) using the `umap::umap()` function
-umap_data <- umap::umap(
-  t(df) # We have to transpose our data frame so we are obtaining UMAP scores for samples instead of genes
-)
-
-# Make into data frame for plotting with `ggplot2`
-umap_df <- data.frame(umap_data$layout) %>% # The umap values we need for plotting are stored in this `layout` element
-  # Turn samples_ids stored as rownames into column
-  tibble::rownames_to_column("refinebio_accession_code") %>%
-  # Bring only the variables that we want from the metadata into this data frame; match by sample ids
-  dplyr::inner_join(dplyr::select(metadata, refinebio_accession_code, histology, subgroup),
-    by = "refinebio_accession_code"
-  )
-

Let’s take a preview at the data frame we created in the chunk above.

-
head(umap_df)
+

In this code chunk, we are going to perform Uniform Manifold Approximation (UMAP) on our data and create a data frame using the UMAP scores and the variables from our metadata that we are going to use to annotate our plot later. The umap() function calculates scores for each row of a matrix, but our data is arranged with each sample in a column, so we will need to transpose the data frame first.

+
# Perform UMAP using the `umap::umap()` function
+umap_data <- umap::umap(
+  t(df) # transpose our data frame to obtain UMAP scores for samples, not genes
+)
+
+# Make into data frame for plotting with `ggplot2`
+# The UMAP values we need for plotting are stored in the `layout` element
+umap_df <- data.frame(umap_data$layout) %>%
+  # Turn sample IDs stored as row names into a column
+  tibble::rownames_to_column("refinebio_accession_code") %>%
+  # Add on the variables that we want from the metadata into this data frame;
+  # match by sample IDs
+  dplyr::inner_join(
+    dplyr::select(metadata, refinebio_accession_code, histology, subgroup),
+    by = "refinebio_accession_code"
+  )
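
To make the transposition step more concrete, here is a small sketch on made-up data. The object names (`toy_expression`, `toy_matrix`) and the gene and sample labels are hypothetical; only base R is used.

```r
# Hypothetical toy expression data: genes as rows, samples as columns
toy_expression <- data.frame(
  sample_1 = c(1, 2, 3),
  sample_2 = c(4, 5, 6),
  row.names = c("gene_a", "gene_b", "gene_c")
)

# Before transposing: 3 rows (genes) by 2 columns (samples)
dim(toy_expression)

# `t()` flips rows and columns and returns a matrix
toy_matrix <- t(toy_expression)

# After transposing: 2 rows (samples) by 3 columns (genes)
dim(toy_matrix)

# The sample names are now the row names, so a function that works
# row by row will produce one result per sample
rownames(toy_matrix)
```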
+

Let’s take a look at the data frame we created in the chunk above.

+
umap_df
-

Now, let’s plot the UMAP scores.

-
# Make a scatterplot using `ggplot2` functionality
-umap_plot <- ggplot(
-  umap_df,
-  aes(
-    x = X1,
-    y = X2
-  )
-) +
-  geom_point() + # This tells R that we want a scatterplot
-  theme_classic() # This tells R to return a classic-looking plot with no gridlines
-
-# Print out plot here
-umap_plot
-

+

Here we can see that UMAP took the data from thousands of genes, and reduced it to just two variables, X1 and X2. Now, let’s plot those UMAP scores.

+
# Make a scatterplot using `ggplot2` functionality
+umap_plot <- ggplot(
+  umap_df,
+  aes(
+    x = X1,
+    y = X2
+  )
+) +
+  geom_point() + # Plot individual points to make a scatterplot
+  theme_classic() # Format as a classic-looking plot with no gridlines
+
+# Print out the plot here
+umap_plot
+

It’s hard to interpret our UMAP results without some metadata labels on our plot.

-

Let’s label the data points based on their genotype subgroup since this is central to the subgroup specific based hypothesis in the original paper (Northcott et al. 2012).

-
# Make a scatterplot with ggplot2
-umap_plot <- ggplot(
-  umap_df,
-  aes(
-    x = X1,
-    y = X2,
-    color = subgroup # This will label points with different colors for each `subgroup`
-  )
-) +
-  geom_point() +
-  theme_classic()
-
-# Print out plot here
-umap_plot
-

-

It looks like Group 4 and SHH groups somewhat cluster with each other but Group 3 seems to also be clustering with Group 4.

+

Let’s label the data points based on their genotype subgroup since this is central to the subgroup-specific hypothesis in the original paper (Northcott et al. 2012).

+
# Make a scatterplot with ggplot2
+umap_plot <- ggplot(
+  umap_df,
+  aes(
+    x = X1,
+    y = X2,
+    color = subgroup # label points with different colors for each `subgroup`
+  )
+) +
+  geom_point() +
+  theme_classic()
+
+# Print out the plot here
+umap_plot
+

+

It looks like SHH clusters pretty distinctly, with Group 3 and Group 4 being more similar and grouping together (with some division).

We can add another label to our plot to potentially gain insight on the clustering behavior of our data. Let’s also label the data points based on the histological subtype that each sample belongs to.

-
# Make a scatterplot with ggplot2
-umap_plot <- ggplot(
-  umap_df,
-  aes(
-    x = X1,
-    y = X2,
-    color = subgroup, # This will label points with different colors for each `subgroup`
-    shape = histology # This will label points with different colors for each `histology` group
-  )
-) +
-  geom_point() +
-  theme_classic()
-
-# Print out plot here
-umap_plot
-

+
# Make a scatterplot with ggplot2
+umap_plot <- ggplot(
+  umap_df,
+  aes(
+    x = X1,
+    y = X2,
+    color = subgroup, # Draw points with different colors for each `subgroup`
+    shape = histology # Use a different shape for each `histology` group
+  )
+) +
+  geom_point() +
+  theme_classic()
+
+# Print out plot here
+umap_plot
+

Our histological subtype groups don’t appear to be clustering in a discernible pattern. We could test out other variables as annotation labels to get a further understanding of the cluster behavior of each subgroup.

+
+

4.3.1 Interpretation of UMAP plot and results

+
  1. Note that the coordinates of UMAP output for any given sample can change dramatically depending on the parameters, and can even differ between runs with the same parameters (which is also why setting the seed is important). This means that you should not rely too heavily on the exact values of UMAP’s output.
+
  • One particular limitation of UMAP is that while observed clusters have some meaning, the distance between clusters usually does not (nor does cluster density). The fact that two clusters are near each other should NOT be interpreted to mean that they are more related to each other than to more distant clusters. (There is some disagreement about whether UMAP distances have more meaning, but it is probably safer to assume they don’t.)
+
  2. Adjusting the parameters to fine-tune them is a good way to learn more about a particular analysis as well as the data itself. Feel free to try playing with the parameters on your own in the code chunks above!
+

In summary, a good rule of thumb to remember is: if the results of an analysis can be completely changed by changing its parameters, you should be very cautious about the conclusions you draw from it and have a good rationale for the parameters you choose (adapted from Childhood Cancer Data Lab (2020) training materials).
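
One way to experiment with the parameters is sketched below. It assumes the `umap` package is installed and uses its `umap.defaults` configuration object. The toy matrix and the specific values chosen for `n_neighbors` and `min_dist` are made up for illustration and are not recommendations for this dataset.

```r
# Attach the `umap` library
library(umap)

# Set the seed so the toy data and the UMAP results are reproducible
set.seed(12345)

# Hypothetical toy matrix: 30 samples (rows) by 10 genes (columns)
toy_matrix <- matrix(rnorm(300), nrow = 30, ncol = 10)

# Start from the package defaults and change two parameters
custom_config <- umap::umap.defaults
custom_config$n_neighbors <- 5 # consider fewer neighbors for each point
custom_config$min_dist <- 0.5 # allow points to sit further apart

# Run UMAP with the default settings and with the custom settings
toy_default <- umap::umap(toy_matrix)
toy_custom <- umap::umap(toy_matrix, config = custom_config)

# Both runs return one row per sample and two UMAP coordinates
dim(toy_default$layout)
dim(toy_custom$layout)
```

Plotting `toy_default$layout` and `toy_custom$layout` side by side is a quick way to see how much the picture depends on the settings.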

+

4.4 Save annotated UMAP plot as a PNG

You can easily switch this to save to a JPEG or TIFF by changing the file extension in the file name passed to the ggsave() function.

-
# Save plot using `ggsave()` function
-ggsave(file.path(
-  plots_dir,
-  "GSE37382_umap_scatterplot.png" # Replace with name relevant your plotted data
-),
-plot = umap_plot # Here we are giving the function the plot object that we want saved to file
-)
+
# Save plot using `ggsave()` function
+ggsave(
+  file.path(
+    plots_dir,
+    "GSE37382_umap_plot.png" # Replace with a good file name for your plot
+  ),
+  plot = umap_plot # The plot object that we want saved to file
+)
## Saving 7 x 5 in image
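
As a rough sketch of switching formats, the chunk below saves a made-up plot as both a JPEG and a TIFF. The plot, the file names, and the use of a temporary folder are all hypothetical choices for this illustration, and it assumes your R installation can write these image formats.

```r
# Attach the `ggplot2` library
library(ggplot2)

# Hypothetical toy plot, standing in for `umap_plot`
toy_plot <- ggplot(
  data.frame(X1 = c(1, 2, 3), X2 = c(3, 1, 2)),
  aes(x = X1, y = X2)
) +
  geom_point() +
  theme_classic()

# Hypothetical output folder; a temporary directory keeps this sketch tidy
toy_plots_dir <- tempdir()

# `ggsave()` chooses the image format from the file extension
ggsave(
  file.path(toy_plots_dir, "toy_umap_plot.jpeg"),
  plot = toy_plot,
  width = 7,
  height = 5
)

ggsave(
  file.path(toy_plots_dir, "toy_umap_plot.tiff"),
  plot = toy_plot,
  width = 7,
  height = 5
)
```

Setting `width` and `height` explicitly keeps the saved image size the same no matter how large your plotting window happens to be.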

5 Resources for further learning

6 Session info

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

-
# Print session info
-sessioninfo::session_info()
-
## ─ Session info ───────────────────────────────────────────────────────────────
+
# Print session info
+sessioninfo::session_info()
+
## ─ Session info ─────────────────────────────────────────────────────
 ##  setting  value                       
 ##  version  R version 4.0.2 (2020-06-22)
 ##  os       Ubuntu 20.04 LTS            
@@ -3249,14 +4099,14 @@ 

6 Session info

 ##  collate  en_US.UTF-8
 ##  ctype    en_US.UTF-8
 ##  tz       Etc/UTC
-##  date     2020-10-16
+##  date     2020-12-14
 ##
-## ─ Packages ───────────────────────────────────────────────────────────────────
+## ─ Packages ─────────────────────────────────────────────────────────
 ##  package     * version date       lib source
 ##  askpass       1.1     2019-01-13 [1] RSPM (R 4.0.0)
 ##  assertthat    0.2.1   2019-03-21 [1] RSPM (R 4.0.0)
 ##  backports     1.1.10  2020-09-15 [1] RSPM (R 4.0.2)
-##  cli           2.0.2   2020-02-28 [1] RSPM (R 4.0.0)
+##  cli           2.1.0   2020-10-12 [1] RSPM (R 4.0.2)
 ##  colorspace    1.4-1   2019-03-18 [1] RSPM (R 4.0.0)
 ##  crayon        1.3.4   2017-09-16 [1] RSPM (R 4.0.0)
 ##  digest        0.6.25  2020-02-23 [1] RSPM (R 4.0.0)
@@ -3284,6 +4134,7 @@

6 Session info

 ##  optparse    * 1.6.6   2020-04-16 [1] RSPM (R 4.0.0)
 ##  pillar        1.4.6   2020-07-10 [1] RSPM (R 4.0.2)
 ##  pkgconfig     2.0.3   2019-09-22 [1] RSPM (R 4.0.0)
+##  ps            1.4.0   2020-10-07 [1] RSPM (R 4.0.2)
 ##  purrr         0.3.4   2020-04-17 [1] RSPM (R 4.0.0)
 ##  R.cache       0.14.0  2019-12-06 [1] RSPM (R 4.0.0)
 ##  R.methodsS3   1.8.1   2020-08-26 [1] RSPM (R 4.0.2)
@@ -3291,10 +4142,10 @@

6 Session info

 ##  R.utils       2.10.1  2020-08-26 [1] RSPM (R 4.0.2)
 ##  R6            2.4.1   2019-11-12 [1] RSPM (R 4.0.0)
 ##  Rcpp          1.0.5   2020-07-06 [1] RSPM (R 4.0.2)
-##  readr         1.3.1   2018-12-21 [1] RSPM (R 4.0.2)
+##  readr         1.4.0   2020-10-05 [1] RSPM (R 4.0.2)
 ##  rematch2      2.1.2   2020-05-01 [1] RSPM (R 4.0.0)
 ##  reticulate    1.16    2020-05-27 [1] RSPM (R 4.0.2)
-##  rlang         0.4.7   2020-07-09 [1] RSPM (R 4.0.2)
+##  rlang         0.4.8   2020-10-08 [1] RSPM (R 4.0.2)
 ##  rmarkdown     2.4     2020-09-30 [1] RSPM (R 4.0.2)
 ##  RSpectra      0.16-0  2019-12-01 [1] RSPM (R 4.0.2)
 ##  rstudioapi    0.11    2020-02-07 [1] RSPM (R 4.0.0)
@@ -3303,7 +4154,7 @@

6 Session info

 ## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.2)
 ## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.0)
 ## styler 1.3.2 2020-02-23 [1] RSPM (R 4.0.0)
-## tibble 3.0.3 2020-07-10 [1] RSPM (R 4.0.2)
+## tibble 3.0.4 2020-10-12 [1] RSPM (R 4.0.2)
 ## tidyselect 1.1.0 2020-05-11 [1] RSPM (R 4.0.0)
 ## umap * 0.2.6.0 2020-06-16 [1] RSPM (R 4.0.2)
 ## vctrs 0.3.4 2020-08-29 [1] RSPM (R 4.0.2)
@@ -3317,27 +4168,35 @@

6 Session info

References

+
-

Konopka T., 2020 Uniform manifold approximation and projection.

+

Konopka T., 2020 Uniform manifold approximation and projection. https://cran.r-project.org/web/packages/umap/umap.pdf

-

McInnes L., 2018 How umap works

+

McInnes L., 2018 How UMAP works. https://umap-learn.readthedocs.io/en/latest/how_umap_works.html#

-

McInnes L., J. Healy, and J. Melville, 2018 UMAP: Uniform manifold approximation and projection for dimension reduction

+

McInnes L., J. Healy, and J. Melville, 2018 UMAP: Uniform manifold approximation and projection for dimension reduction. https://arxiv.org/abs/1802.03426

Northcott P., D. Shih, J. Peacock, L. Garzia, and S. Morrissy et al., 2012 Subgroup specific structural variation across 1,000 medulloblastoma genomes. Nature 488. https://doi.org/10.1038/nature11327

-

Prabhakaran S., 2016 The complete ggplot2 tutorial.

+

Prabhakaran S., 2016 The complete ggplot2 tutorial. http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html

-

R CRAN Team, 2019 Uniform manifold approximation and projection in r.

+

R CRAN Team, 2019 Uniform manifold approximation and projection in R. https://cran.r-project.org/web/packages/umap/vignettes/umap.html

+
diff --git a/02-microarray/gene-id-annotation_microarray_01_ensembl.Rmd b/02-microarray/gene-id-annotation_microarray_01_ensembl.Rmd index 9feacc38..d2210306 100644 --- a/02-microarray/gene-id-annotation_microarray_01_ensembl.Rmd +++ b/02-microarray/gene-id-annotation_microarray_01_ensembl.Rmd @@ -11,7 +11,7 @@ output: # Purpose of this analysis -The purpose of this notebook is to provide an example of mapping gene IDs for microarray data obtained from refine.bio using `AnnotationDbi` packages [@Carlson2020-package]. +The purpose of this notebook is to provide an example of mapping gene IDs for microarray data obtained from refine.bio using `AnnotationDbi` packages [@Pages2020-package]. ⬇️ [**Jump to the analysis code**](#analysis) ⬇️ @@ -44,7 +44,7 @@ if (!dir.exists("data")) { } # Define the file path to the plots directory -plots_dir <- "plots" # Can replace with path to desired output plots directory +plots_dir <- "plots" # Create the plots folder if it doesn't exist if (!dir.exists(plots_dir)) { @@ -52,7 +52,7 @@ if (!dir.exists(plots_dir)) { } # Define the file path to the results directory -results_dir <- "results" # Can replace with path to desired output results directory +results_dir <- "results" # Create the results folder if it doesn't exist if (!dir.exists(results_dir)) { @@ -83,18 +83,19 @@ You will get an email when it is ready. For this example analysis, we will use this [mouse glioma stem cells dataset](https://www.refine.bio/experiments/GSE13490/cancer-stem-cells-are-enriched-in-the-side-population-cells-in-a-mouse-model-of-glioma). -This dataset has 15 microarray mouse glioma model samples. -The samples were obtained from parental biological replicates and from resistant sub-line biological replicates that were transplanted into recipient mice. +This dataset has 15 microarrays measuring gene expression in a transgenic mouse model of glioma. 
+The authors compared cells from side populations and non-side populations in both tumor samples and normal neural stem cells. + ## Place the dataset in your new `data/` folder -refine.bio will send you a download button in the email when it is ready. -Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in `.zip`. +refine.bio will send you a download button in the email when it is ready. +Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in `.zip`. Double clicking should unzip this for you and create a folder of the same name. - + -For more details on the contents of this folder see [these docs on refine.bio](http://docs.refine.bio/en/latest/main_text.html#downloadable-files). +For more details on the contents of this folder see [these docs on refine.bio](http://docs.refine.bio/en/latest/main_text.html#downloadable-files). The `` folder has the data and metadata TSV files you will need for this example analysis. Experiment accession IDs usually look something like `GSE1235` or `SRP12345`. 
@@ -125,19 +126,24 @@ This is handy to do because if we want to switch the dataset (see next section f ```{r} # Define the file path to the data directory -data_dir <- file.path("data", "GSE13490") # Replace with accession number which will be the name of the folder the files will be in - -# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir` -data_file <- file.path(data_dir, "GSE13490.tsv") # Replace with file path to your dataset - -# Declare the file path to the metadata file using the data directory saved as `data_dir` -metadata_file <- file.path(data_dir, "metadata_GSE13490.tsv") # Replace with file path to your metadata +# Replace with the path of the folder the files will be in +data_dir <- file.path("data", "GSE13490") + +# Declare the file path to the gene expression matrix file +# inside directory saved as `data_dir` +# Replace with the path to your dataset file +data_file <- file.path(data_dir, "GSE13490.tsv") + +# Declare the file path to the metadata file +# inside the directory saved as `data_dir` +# Replace with the path to your metadata file +metadata_file <- file.path(data_dir, "metadata_GSE13490.tsv") ``` Now that our file paths are declared, we can use the `file.exists()` function to check that the files are where we specified above. ```{r} -# Check if the gene expression matrix file is at the file path stored in `data_file` +# Check if the gene expression matrix file is at the path stored in `data_file` file.exists(data_file) # Check if the metadata file is at the file path stored in `metadata_file` @@ -146,7 +152,7 @@ file.exists(metadata_file) If the chunk above printed out `FALSE` to either of those tests, you won't be able to run this analysis _as is_ until those files are in the appropriate place. 
-If the concept of a "file path" is unfamiliar to you; we recommend taking a look at our [section about file paths](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#an-important-note-about-file-paths-and-Rmds). +If the concept of a "file path" is unfamiliar to you; we recommend taking a look at our [section about file paths](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#an-important-note-about-file-paths-and-Rmds). # Using a different refine.bio dataset with this analysis? @@ -154,12 +160,12 @@ If you'd like to adapt an example analysis to use a different dataset from [refi We suggest saving plots and results to `plots/` and `results/` directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences. -refine.bio data comes with gene level data with Ensembl IDs. +refine.bio data comes with gene level data identified by Ensembl IDs. Although this example notebook uses Ensembl IDs from Mouse, (Mus musculus), to obtain gene symbols this notebook can be easily converted for use with different species or annotation types e.g. protein IDs, gene ontology, accession numbers. -For different species, wherever the abbreviation `org.Mm.eg.db` or `Mm` is written, it must be replaced with the respective species abbreviation e.g. for Homo sapiens `org.Hs.eg.db` or `Hs` would be used. +For different species, wherever the abbreviation `org.Mm.eg.db` or `Mm` is written, it must be replaced with the respective species abbreviation e.g. for Homo sapiens `org.Hs.eg.db` or `Hs` would be used. In the case of our [RNA-seq gene identifier annotation example notebook](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.html), a Zebrafish (Danio rerio) dataset is used, meaning `org.Dr.eg.db` or `Dr` would also need to be used there. 
-A full list of the annotation R packages from Bioconductor is at this [link](https://bioconductor.org/packages/release/BiocViews.html#___AnnotationData) [@annotation-packages]. +A full list of the annotation R packages from Bioconductor is at this [link](https://bioconductor.org/packages/release/BiocViews.html#___AnnotationData). *** @@ -168,18 +174,21 @@ A full list of the annotation R packages from Bioconductor is at this [link](htt # Obtaining Annotation for Ensembl IDs - Microarray -Ensembl IDs can be used to obtain various different annotations at the gene/transcript level. +refine.bio uses Ensembl IDs as the primary gene identifier in its data sets. +While this is a consistent and useful identifier, a string of apparently random letters and numbers is not the most user-friendly or informative for interpretation. +Luckily, we can use the Ensembl IDs that we have to obtain various different annotations at the gene/transcript level. Let's get ready to use the Ensembl IDs from our mouse dataset to obtain the associated gene symbols. ## Install libraries See our Getting Started page with [instructions for package installation](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#what-you-need-to-install) for a list of the other software you will need, as well as more tips and resources. -In this analysis, we will be using the `org.Mm.eg.db` R package [@Carlson2019]. +In this analysis, we will be using the `org.Mm.eg.db` R package [@Carlson2019-mouse], which is part of the Bioconductor `AnnotationDbi` framework [@Pages2020-package]. +Bioconductor compiles annotations from various sources, and these packages provide convenient methods to access and translate among those annotations. [Other species can be used](#using-a-different-refinebio-dataset-with-this-analysis). 
```{r} -# Install the mouse package +# Install the mouse annotation package if (!("org.Mm.eg.db" %in% installed.packages())) { # Install this package if it isn't installed yet BiocManager::install("org.Mm.eg.db", update = FALSE) @@ -187,8 +196,9 @@ if (!("org.Mm.eg.db" %in% installed.packages())) { ``` Attach the packages we need for this analysis. +Note that attaching `org.Mm.eg.db` will automatically also attach `AnnotationDbi`. -```{r} +```{r message=FALSE} # Attach the library library(org.Mm.eg.db) @@ -208,8 +218,8 @@ We stored our file paths as objects named `metadata_file` and `data_file` in [th metadata <- readr::read_tsv(metadata_file) # Read in data TSV file -df <- readr::read_tsv(data_file) %>% - # Tuck away the Gene ID column as rownames +expression_df <- readr::read_tsv(data_file) %>% + # Tuck away the Gene ID column as row names tibble::column_to_rownames("Gene") ``` @@ -217,20 +227,21 @@ Let's ensure that the metadata and data are in the same sample order. ```{r} # Make the data in the order of the metadata -df <- df %>% +expression_df <- expression_df %>% dplyr::select(metadata$geo_accession) # Check if this is in the same order -all.equal(colnames(df), metadata$geo_accession) +all.equal(colnames(expression_df), metadata$geo_accession) # Bring back the "Gene" column in preparation for mapping -df <- df %>% +expression_df <- expression_df %>% tibble::rownames_to_column("Gene") ``` ## Map Ensembl IDs to gene symbols +The main work of translating among annotations will be done with the the `AnnotationDbi` function `mapIds()`. The `mapIds()` function has a `multiVals` argument which denotes what to do when there are multiple mapped values for a single gene identifier. The default behavior is to return just the first mapped value. It is good to keep in mind that various downstream analyses may benefit from varied strategies at this step. 
@@ -241,20 +252,20 @@ In the next chunk, we will run the `mapIds()` function and supply the `multiVals ```{r} # Map ensembl IDs to their associated gene symbols mapped_list <- mapIds( - org.Mm.eg.db, # Replace with annotation package for the organism relevant to your data - keys = df$Gene, - column = "SYMBOL", # Replace with the type of gene identifiers you would like to map to - keytype = "ENSEMBL", # Replace with the type of gene identifiers in your data + org.Mm.eg.db, # Replace with annotation package for your organism + keys = expression_df$Gene, + keytype = "ENSEMBL", # Replace with the gene identifiers used in your data + column = "SYMBOL", # The type of gene identifiers you would like to map to multiVals = "list" ) ``` ## Explore gene ID conversion -Now, let's take a look at our `mapped_list` to see how the mapping went. +Now, let's take a look at our mapped object to see how the mapping went. ```{r} -# Let's use the `head()` function to take a preview at our mapped list +# Let's use the `head()` function for a preview of our mapped list head(mapped_list) ``` @@ -263,11 +274,11 @@ However, the data is now in a `list` object, making it a little more difficult t We are going to turn our list object into a data frame object in the next chunk. ```{r} -# Let's make our object a bit more manageable for exploration by turning it into a data frame +# Let's make our list a bit more manageable by turning it into a data frame mapped_df <- mapped_list %>% tibble::enframe(name = "Ensembl", value = "Symbol") %>% - # enframe makes a `list` column, so we will convert that to simpler format with `unnest() - # This will result in one row of our data frame per list item + # enframe() makes a `list` column; we will simplify it with unnest() + # This will result in a data frame with one row per list item tidyr::unnest(cols = Symbol) ``` @@ -278,27 +289,27 @@ head(mapped_df) ``` We can see that our data frame has a new column `Symbol`. 
-Let's get a summary of the gene symbols returned in the `Symbol` column of our mapped data frame. +Let's get a summary of the gene symbols in the `Symbol` column of our mapped data frame. ```{r} -# We can use the `summary()` function to get a better idea of the distribution of symbols in the `Symbol` column -summary(mapped_df$Symbol) +# Use the `summary()` function to show the distribution of Symbol values +# We need to use `as.factor()` here to get the counts of unique values +# `maxsum = 10` limits the summary to 10 distinct values +summary(as.factor(mapped_df$Symbol), maxsum = 10) ``` -There are 998 NAs in our data frame, which means that 998 out of the 17918 Ensembl IDs did not map to gene symbols. -998 out of 17918 is not too bad a rate, in our opinion, but note that different gene identifier types will have different mapping rates and that is to be expected. +There are 942 `NA`s in the `Symbol` column, which means that 942 out of the 17918 Ensembl IDs did not map to gene symbols. +942 out of 17918 is not too bad a rate, in our opinion, but note that different gene identifier types will have different mapping rates and that is to be expected. Regardless, it is always good to be aware of how many genes you are potentially "losing" if you rely on this new gene identifier you've mapped to for downstream analyses. -However, if you have almost all NAs it is possible that the function was executed incorrectly or you may want to consider using a different gene identifier, if possible. +However, if you have almost all `NA`s it is possible that the function was executed incorrectly or you may want to consider using a different gene identifier, if possible. Now let's check to see if we have any genes that were mapped to multiple symbols. 
```{r} multi_mapped <- mapped_df %>% - # Let's group by the Ensembl IDs in the `Ensembl` column - dplyr::group_by(Ensembl) %>% - # Create a new variable containing the number of symbols mapped to each ID - dplyr::mutate(gene_symbol_count = dplyr::n()) %>% + # Let's count the number of times each Ensembl ID appears in `Ensembl` column + dplyr::count(Ensembl, name = "gene_symbol_count") %>% # Arrange by the genes with the highest number of symbols mapped dplyr::arrange(desc(gene_symbol_count)) %>% # Filter to include only the rows with multi mappings @@ -319,20 +330,20 @@ This will remove all instances of multiple mappings and return a list of only th Use `?mapIds` to see more options or strategies. ```{r} -# Map ensembl IDs to their associated gene symbols +# Map Ensembl IDs to their associated gene symbols filtered_mapped_df <- data.frame( "Symbol" = mapIds( - org.Mm.eg.db, # Replace with annotation package for the organism relevant to your data - keys = df$Gene, - column = "SYMBOL", # Replace with the type of gene identifiers you would like to map to - keytype = "ENSEMBL", # Replace with the type of gene identifiers in your data - multiVals = "filter" # This will drop any `Gene`s that have multiple matches + org.Mm.eg.db, # Replace with annotation package for your organism + keys = expression_df$Gene, + keytype = "ENSEMBL", # Replace with the gene identifiers used in your data + column = "SYMBOL", # The type of gene identifiers you would like to map to + multiVals = "filter" # This will drop any genes that have multiple matches ) ) %>% # Make an `Ensembl` column to store the rownames tibble::rownames_to_column("Ensembl") %>% - # Join the remaining data from `df` using the Ensembl IDs - dplyr::inner_join(df, by = c("Ensembl" = "Gene")) + # Join the remaining data from `expression_df` using the Ensembl IDs + dplyr::inner_join(expression_df, by = c("Ensembl" = "Gene")) ``` Now, let's write our filtered and mapped results to file! 
@@ -351,12 +362,12 @@ readr::write_tsv(filtered_mapped_df, file.path( # Resources for further learning - Marc Carlson has prepared a nice [Introduction to Bioconductor Annotation Packages](https://bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf) [@Carlson2020-vignette] -- See our [RNA-seq gene ID conversion notebook](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.html) as another applicable example, since the steps for this workflow do not change with technology [@gene-id-annotation-rna-seq]. +- See our [RNA-seq gene ID conversion notebook](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.html) as another applicable example, since the steps for this workflow do not change with technology. # Session info At the end of every analysis, before saving your notebook, we recommend printing out your session info. -This helps make your code more reproducible by recording what versions of software and packages you used to run this. +This helps make your code more reproducible by recording what versions of software and packages you used to run this. 
```{r} # Print session info diff --git a/02-microarray/gene-id-annotation_microarray_01_ensembl.html b/02-microarray/gene-id-annotation_microarray_01_ensembl.html index 72ae98b4..a272805f 100644 --- a/02-microarray/gene-id-annotation_microarray_01_ensembl.html +++ b/02-microarray/gene-id-annotation_microarray_01_ensembl.html @@ -1263,25 +1263,22 @@ }; - - + + code.sourceCode > span { display: inline-block; line-height: 1.25; } + code.sourceCode > span { color: inherit; text-decoration: inherit; } + code.sourceCode > span:empty { height: 1.2em; } + .sourceCode { overflow: visible; } + code.sourceCode { white-space: pre; position: relative; } + div.sourceCode { margin: 1em 0; } + pre.sourceCode { margin: 0; } + @media screen { + div.sourceCode { overflow: auto; } + } + @media print { + code.sourceCode { white-space: pre-wrap; } + code.sourceCode > span { text-indent: -5em; padding-left: 5em; } + } + pre.numberSource code + { counter-reset: source-line 0; } + pre.numberSource code > span + { position: relative; left: -4em; counter-increment: source-line; } + pre.numberSource code > span > a:first-child::before + { content: counter(source-line); + position: relative; left: -1em; text-align: right; vertical-align: baseline; + border: none; display: inline-block; + -webkit-touch-callout: none; -webkit-user-select: none; + -khtml-user-select: none; -moz-user-select: none; + -ms-user-select: none; user-select: none; + padding: 0 4px; width: 4em; + color: #aaaaaa; + } + pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; } + div.sourceCode + { } + @media screen { + code.sourceCode > span > a:first-child::before { text-decoration: underline; } + } + code span.al { color: #ff0000; } /* Alert */ + code span.an { color: #008000; } /* Annotation */ + code span.at { } /* Attribute */ + code span.bu { } /* BuiltIn */ + code span.cf { color: #0000ff; } /* ControlFlow */ + code span.ch { color: #008080; } /* Char */ + code span.cn { } /* Constant 
*/ + code span.co { color: #008000; } /* Comment */ + code span.cv { color: #008000; } /* CommentVar */ + code span.do { color: #008000; } /* Documentation */ + code span.er { color: #ff0000; font-weight: bold; } /* Error */ + code span.ex { } /* Extension */ + code span.im { } /* Import */ + code span.in { color: #008000; } /* Information */ + code span.kw { color: #0000ff; } /* Keyword */ + code span.op { } /* Operator */ + code span.ot { color: #ff4000; } /* Other */ + code span.pp { color: #ff4000; } /* Preprocessor */ + code span.sc { color: #008080; } /* SpecialChar */ + code span.ss { color: #008080; } /* SpecialString */ + code span.st { color: #008080; } /* String */ + code span.va { } /* Variable */ + code span.vs { color: #008080; } /* VerbatimString */ + code span.wa { color: #008000; font-weight: bold; } /* Warning */ + + + + - - + @@ -2874,15 +3686,20 @@ @@ -2955,7 +3781,7 @@

October 2020

1 Purpose of this analysis

-

The purpose of this notebook is to provide an example of mapping gene IDs for microarray data obtained from refine.bio using AnnotationDbi packages (Carlson 2020a).

+

The purpose of this notebook is to provide an example of mapping gene IDs for microarray data obtained from refine.bio using AnnotationDbi packages (Pagès et al. 2020).

⬇️ Jump to the analysis code ⬇️

@@ -2971,26 +3797,26 @@

2.1 Obtain the .Rmd

2.2 Set up your analysis folders

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

-
# Create the data folder if it doesn't exist
-if (!dir.exists("data")) {
-  dir.create("data")
-}
-
-# Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
-
-# Create the plots folder if it doesn't exist
-if (!dir.exists(plots_dir)) {
-  dir.create(plots_dir)
-}
-
-# Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
-
-# Create the results folder if it doesn't exist
-if (!dir.exists(results_dir)) {
-  dir.create(results_dir)
-}
+
# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+  dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots"
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+  dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results"
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+  dir.create(results_dir)
+}

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

@@ -3006,7 +3832,7 @@

2.3 Obtain the dataset from refin

2.4 About the dataset we are using for this example

For this example analysis, we will use this mouse glioma stem cells dataset.

-

This dataset has 15 microarray mouse glioma model samples. The samples were obtained from parental biological replicates and from resistant sub-line biological replicates that were transplanted into recipient mice.

+

This dataset has 15 microarrays measuring gene expression in a transgenic mouse model of glioma. The authors compared cells from side populations and non-side populations in both tumor samples and normal neural stem cells.

2.5 Place the dataset in your new data/ folder

@@ -3040,20 +3866,25 @@

2.6 Check out our file structure!

In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

First, we will declare the file paths to our data and metadata files, which should be in our data directory. This is handy because if we want to switch the dataset we are using for this analysis (see next section for more on this), we will only have to change the file paths here to get started.

-
# Define the file path to the data directory
-data_dir <- file.path("data", "GSE13490") # Replace with accession number which will be the name of the folder the files will be in
-
-# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
-data_file <- file.path(data_dir, "GSE13490.tsv") # Replace with file path to your dataset
-
-# Declare the file path to the metadata file using the data directory saved as `data_dir`
-metadata_file <- file.path(data_dir, "metadata_GSE13490.tsv") # Replace with file path to your metadata
+
# Define the file path to the data directory
+# Replace with the path of the folder the files will be in
+data_dir <- file.path("data", "GSE13490")
+
+# Declare the file path to the gene expression matrix file
+# inside directory saved as `data_dir`
+# Replace with the path to your dataset file
+data_file <- file.path(data_dir, "GSE13490.tsv")
+
+# Declare the file path to the metadata file
+# inside the directory saved as `data_dir`
+# Replace with the path to your metadata file
+metadata_file <- file.path(data_dir, "metadata_GSE13490.tsv")

Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

-
# Check if the gene expression matrix file is at the file path stored in `data_file`
-file.exists(data_file)
+
# Check if the gene expression matrix file is at the path stored in `data_file`
+file.exists(data_file)
## [1] TRUE
-
# Check if the metadata file is at the file path stored in `metadata_file`
-file.exists(metadata_file)
+
# Check if the metadata file is at the file path stored in `metadata_file`
+file.exists(metadata_file)
## [1] TRUE

If the chunk above printed out FALSE to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.

If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.
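
The two `file.exists()` checks can also be combined so that the notebook stops with an informative message instead of printing `FALSE`. This is only a minimal sketch: the helper name `check_input_files()` is hypothetical and not part of the notebook, and the demonstration uses temporary files so that it runs without the GSE13490 download.

```r
# Hypothetical helper (not part of the original notebook):
# stop with a clear message if any expected input file is missing
check_input_files <- function(paths) {
  # file.exists() is vectorized, so it returns one TRUE/FALSE per path
  missing_files <- paths[!file.exists(paths)]
  if (length(missing_files) > 0) {
    stop("Missing file(s): ", paste(missing_files, collapse = ", "))
  }
  invisible(TRUE)
}

# Demonstration with temporary files standing in for the real data
demo_dir <- tempfile("data_demo")
dir.create(demo_dir)
demo_data_file <- file.path(demo_dir, "GSE13490.tsv")
demo_metadata_file <- file.path(demo_dir, "metadata_GSE13490.tsv")

# Create only the data file, so the metadata file is reported as missing
file.create(demo_data_file)
check_result <- tryCatch(
  check_input_files(c(demo_data_file, demo_metadata_file)),
  error = function(e) conditionMessage(e)
)
print(check_result)
```

In the notebook itself, the equivalent call would be `check_input_files(c(data_file, metadata_file))`, placed right after the paths are declared.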

@@ -3062,73 +3893,39 @@

2.6 Check out our file structure!

3 Using a different refine.bio dataset with this analysis?

If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/ directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.

-

refine.bio data comes with gene level data with Ensembl IDs. Although this example notebook uses Ensembl IDs from Mouse, (Mus musculus), to obtain gene symbols this notebook can be easily converted for use with different species or annotation types e.g. protein IDs, gene ontology, accession numbers.

-

For different species, wherever the abbreviation org.Mm.eg.db or Mm is written, it must be replaced with the respective species abbreviation e.g. for Homo sapiens org.Hs.eg.db or Hs would be used. In the case of our RNA-seq gene identifier annotation example notebook, a Zebrafish (Danio rerio) dataset is used, meaning org.Dr.eg.db or Dr would also need to be used there. A full list of the annotation R packages from Bioconductor is at this link (R Bioconductor Team 2003).

+

refine.bio data comes with gene level data identified by Ensembl IDs. Although this example notebook uses Ensembl IDs from Mouse, (Mus musculus), to obtain gene symbols this notebook can be easily converted for use with different species or annotation types e.g. protein IDs, gene ontology, accession numbers.

+

For different species, wherever the abbreviation org.Mm.eg.db or Mm is written, it must be replaced with the respective species abbreviation e.g. for Homo sapiens org.Hs.eg.db or Hs would be used. In the case of our RNA-seq gene identifier annotation example notebook, a Zebrafish (Danio rerio) dataset is used, meaning org.Dr.eg.db or Dr would also need to be used there. A full list of the annotation R packages from Bioconductor is at this link.
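
The substitution described above can be sketched in a few lines of base R. The helper `annotation_package_name()` and its lookup table are hypothetical and cover only the three species named in this notebook; for any other organism, confirm the package name in the Bioconductor list rather than assuming the same pattern applies.

```r
# Hypothetical lookup table (not part of the original notebook);
# it holds only the species abbreviations mentioned in the text
species_abbreviations <- c(
  "Mus musculus" = "Mm",
  "Homo sapiens" = "Hs",
  "Danio rerio" = "Dr"
)

# Build the annotation package name for one of the species above
annotation_package_name <- function(species) {
  if (!(species %in% names(species_abbreviations))) {
    stop("No abbreviation stored for species: ", species)
  }
  paste0("org.", species_abbreviations[[species]], ".eg.db")
}

# Print the package name that would replace `org.Mm.eg.db` for human data
print(annotation_package_name("Homo sapiens"))
```

The returned string is the name to pass to `BiocManager::install()` and to use wherever `org.Mm.eg.db` appears in the code chunks.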


 

4 Obtaining Annotation for Ensembl IDs - Microarray

-

Ensembl IDs can be used to obtain various different annotations at the gene/transcript level. Let’s get ready to use the Ensembl IDs from our mouse dataset to obtain the associated gene symbols.

+

refine.bio uses Ensembl IDs as the primary gene identifier in its data sets. While this is a consistent and useful identifier, a string of apparently random letters and numbers is not the most user-friendly or informative for interpretation. Luckily, we can use the Ensembl IDs that we have to obtain various different annotations at the gene/transcript level. Let’s get ready to use the Ensembl IDs from our mouse dataset to obtain the associated gene symbols.

4.1 Install libraries

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

-

In this analysis, we will be using the org.Mm.eg.db R package (Carlson 2019). Other species can be used.

-
# Install the mouse package
-if (!("org.Mm.eg.db" %in% installed.packages())) {
-  # Install this package if it isn't installed yet
-  BiocManager::install("org.Mm.eg.db", update = FALSE)
-}
-

Attach the packages we need for this analysis.

-
# Attach the library
-library(org.Mm.eg.db)
-
## Loading required package: AnnotationDbi
-
## Loading required package: stats4
-
## Loading required package: BiocGenerics
-
## Loading required package: parallel
-
## 
-## Attaching package: 'BiocGenerics'
-
## The following objects are masked from 'package:parallel':
-## 
-##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
-##     clusterExport, clusterMap, parApply, parCapply, parLapply,
-##     parLapplyLB, parRapply, parSapply, parSapplyLB
-
## The following objects are masked from 'package:stats':
-## 
-##     IQR, mad, sd, var, xtabs
-
## The following objects are masked from 'package:base':
-## 
-##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
-##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
-##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
-##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
-##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
-##     union, unique, unsplit, which, which.max, which.min
-
## Loading required package: Biobase
-
## Welcome to Bioconductor
-## 
-##     Vignettes contain introductory material; view with
-##     'browseVignettes()'. To cite Bioconductor, see
-##     'citation("Biobase")', and for packages 'citation("pkgname")'.
-
## Loading required package: IRanges
-
## Loading required package: S4Vectors
-
## 
-## Attaching package: 'S4Vectors'
-
## The following object is masked from 'package:base':
-## 
-##     expand.grid
-
## 
-
# We will need this so we can use the pipe: %>%
-library(magrittr)
+

In this analysis, we will be using the org.Mm.eg.db R package (Carlson 2019), which is part of the Bioconductor AnnotationDbi framework (Pagès et al. 2020). Bioconductor compiles annotations from various sources, and these packages provide convenient methods to access and translate among those annotations. Other species can be used.

+
# Install the mouse annotation package
+if (!("org.Mm.eg.db" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("org.Mm.eg.db", update = FALSE)
+}
+

Attach the packages we need for this analysis. Note that attaching org.Mm.eg.db will automatically also attach AnnotationDbi.

+
# Attach the library
+library(org.Mm.eg.db)
+
+# We will need this so we can use the pipe: %>%
+library(magrittr)

4.2 Import and set up data

Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read both TSV files and add them as data frames to your environment.

We stored our file paths as objects named metadata_file and data_file in this previous step.

-
# Read in metadata TSV file
-metadata <- readr::read_tsv(metadata_file)
-
## Parsed with column specification:
+
# Read in metadata TSV file
+metadata <- readr::read_tsv(metadata_file)
+
## 
+## ── Column specification ──────────────────────────────────────────────
 ## cols(
 ##   .default = col_character(),
 ##   refinebio_age = col_logical(),
@@ -3148,13 +3945,14 @@ 

4.2 Import and set up data

## channel_count = col_double(), ## data_row_count = col_double(), ## taxid_ch1 = col_double() -## )
-
## See spec(...) for full column specifications.
-
# Read in data TSV file
-df <- readr::read_tsv(data_file) %>%
-  # Tuck away the Gene ID column as rownames
-  tibble::column_to_rownames("Gene")
-
## Parsed with column specification:
+## )
+## ℹ Use `spec()` for the full column specifications.
+
# Read in data TSV file
+expression_df <- readr::read_tsv(data_file) %>%
+  # Tuck away the Gene ID column as row names
+  tibble::column_to_rownames("Gene")
+
## 
+## ── Column specification ──────────────────────────────────────────────
 ## cols(
 ##   Gene = col_character(),
 ##   GSM340064 = col_double(),
@@ -3174,36 +3972,36 @@ 

4.2 Import and set up data

## GSM340078 = col_double() ## )

Let’s ensure that the metadata and data are in the same sample order.

-
# Make the data in the order of the metadata
-df <- df %>%
-  dplyr::select(metadata$geo_accession)
-
-# Check if this is in the same order
-all.equal(colnames(df), metadata$geo_accession)
+
# Make the data in the order of the metadata
+expression_df <- expression_df %>%
+  dplyr::select(metadata$geo_accession)
+
+# Check if this is in the same order
+all.equal(colnames(expression_df), metadata$geo_accession)
## [1] TRUE
-
# Bring back the "Gene" column in preparation for mapping
-df <- df %>%
-  tibble::rownames_to_column("Gene")
+
# Bring back the "Gene" column in preparation for mapping
+expression_df <- expression_df %>%
+  tibble::rownames_to_column("Gene")

4.3 Map Ensembl IDs to gene symbols

-

The mapIds() function has a multiVals argument which denotes what to do when there are multiple mapped values for a single gene identifier. The default behavior is to return just the first mapped value. It is good to keep in mind that various downstream analyses may benefit from varied strategies at this step. Use ?mapIds to see more options or strategies.

+

The main work of translating among annotations will be done with the AnnotationDbi function mapIds(). The mapIds() function has a multiVals argument that specifies what to do when there are multiple mapped values for a single gene identifier. The default behavior is to return just the first mapped value. Keep in mind that different downstream analyses may benefit from different strategies at this step. Use ?mapIds to see more options or strategies.

In the next chunk, we will run the mapIds() function and supply the multiVals argument with the "list" option in order to get a large list with all the mapped values found for each gene identifier.

-
# Map ensembl IDs to their associated gene symbols
-mapped_list <- mapIds(
-  org.Mm.eg.db, # Replace with annotation package for the organism relevant to your data
-  keys = df$Gene,
-  column = "SYMBOL", # Replace with the type of gene identifiers you would like to map to
-  keytype = "ENSEMBL", # Replace with the type of gene identifiers in your data
-  multiVals = "list"
-)
+
# Map ensembl IDs to their associated gene symbols
+mapped_list <- mapIds(
+  org.Mm.eg.db, # Replace with annotation package for your organism
+  keys = expression_df$Gene,
+  keytype = "ENSEMBL", # Replace with the gene identifiers used in your data
+  column = "SYMBOL", # The type of gene identifiers you would like to map to
+  multiVals = "list"
+)
## 'select()' returned 1:many mapping between keys and columns
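
The 1:many message above can be followed up by tabulating how many values each key received. The chunk below is only a sketch: `toy_mapped_list` and its IDs and symbols are made up to stand in for `mapped_list`, so it runs without the annotation package.

```r
# A toy stand-in for the list returned by mapIds(..., multiVals = "list")
# (these IDs and symbols are hypothetical, for illustration only)
toy_mapped_list <- list(
  ENSMUSG_A = "SymbolA",
  ENSMUSG_B = c("SymbolB1", "SymbolB2"),
  ENSMUSG_C = NA_character_
)

# Number of values stored for each key
n_values <- lengths(toy_mapped_list)

# Tally how many keys have 1 value, 2 values, and so on
table(n_values)
```

Note that in this toy list the unmapped key holds a single `NA`, so it is counted as having one value; counting unmapped keys needs a separate `is.na()` check.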

4.4 Explore gene ID conversion

-

Now, let’s take a look at our mapped_list to see how the mapping went.

-
# Let's use the `head()` function to take a preview at our mapped list
-head(mapped_list)
+

Now, let’s take a look at our mapped object to see how the mapping went.

+
# Let's use the `head()` function for a preview of our mapped list
+head(mapped_list)
## $ENSMUSG00000000001
 ## [1] "Gnai3"
 ## 
@@ -3222,42 +4020,44 @@ 

4.4 Explore gene ID conversion

It looks like we have gene symbols that were successfully mapped to the Ensembl IDs we provided. However, the data is now in a list object, making it a little more difficult to explore. We are going to turn our list object into a data frame object in the next chunk.

-
# Let's make our object a bit more manageable for exploration by turning it into a data frame
-mapped_df <- mapped_list %>%
-  tibble::enframe(name = "Ensembl", value = "Symbol") %>%
-  # enframe makes a `list` column, so we will convert that to simpler format with `unnest()
-  # This will result in one row of our data frame per list item
-  tidyr::unnest(cols = Symbol)
+
# Let's make our list a bit more manageable by turning it into a data frame
+mapped_df <- mapped_list %>%
+  tibble::enframe(name = "Ensembl", value = "Symbol") %>%
+  # enframe() makes a `list` column; we will simplify it with unnest()
+  # This will result in a data frame with one row per list item
+  tidyr::unnest(cols = Symbol)
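
To see what the `enframe()` and `unnest()` steps do to a multi-mapped ID, here is a minimal sketch on a made-up list. The object `toy_list` and its IDs and symbols are hypothetical and are not part of the dataset.

```r
# We need the pipe: %>%
library(magrittr)

# A made-up list in the same shape as `mapped_list`
toy_list <- list(
  ENSMUSG_A = "SymbolA",
  ENSMUSG_B = c("SymbolB1", "SymbolB2"),
  ENSMUSG_C = NA_character_
)

toy_df <- toy_list %>%
  # One row per list element, with the values held in a list column
  tibble::enframe(name = "Ensembl", value = "Symbol") %>%
  # Expand the list column so each value gets its own row
  tidyr::unnest(cols = Symbol)

# ENSMUSG_B now appears on two rows, one per symbol
toy_df
```

This is why the unnested data frame can have more rows than there are Ensembl IDs: each extra symbol adds a row for its ID.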

Now let’s take a peek at our data frame.

-
head(mapped_df)
+
head(mapped_df)
-

We can see that our data frame has a new column Symbol. Let’s get a summary of the gene symbols returned in the Symbol column of our mapped data frame.

-
# We can use the `summary()` function to get a better idea of the distribution of symbols in the `Symbol` column
-summary(mapped_df$Symbol)
-
##    Length     Class      Mode 
-##     17977 character character
-

There are 998 NAs in our data frame, which means that 998 out of the 17918 Ensembl IDs did not map to gene symbols. 998 out of 17918 is not too bad a rate, in our opinion, but note that different gene identifier types will have different mapping rates and that is to be expected. Regardless, it is always good to be aware of how many genes you are potentially “losing” if you rely on this new gene identifier you’ve mapped to for downstream analyses.

-

However, if you have almost all NAs it is possible that the function was executed incorrectly or you may want to consider using a different gene identifier, if possible.

+

We can see that our data frame has a new column Symbol. Let’s get a summary of the gene symbols in the Symbol column of our mapped data frame.

+
# Use the `summary()` function to show the distribution of Symbol values
+# We need to use `as.factor()` here to get the counts of unique values
+# `maxsum = 10` limits the summary to 10 distinct values
+summary(as.factor(mapped_df$Symbol), maxsum = 10)
+
##       Cyp2c39          Pms2 0610005C13Rik 0610009B22Rik 0610009L18Rik 
+##             2             2             1             1             1 
+## 0610010F05Rik 0610012G03Rik 0610030E20Rik       (Other)          NA's 
+##             1             1             1         17021           942
+

There are 942 NAs in the Symbol column, which means that 942 out of the 17918 Ensembl IDs did not map to gene symbols. 942 out of 17918 is not too bad a rate, in our opinion, but note that different gene identifier types will have different mapping rates and that is to be expected. Regardless, it is always good to be aware of how many genes you are potentially “losing” if you rely on this new gene identifier you’ve mapped to for downstream analyses.

+

However, if you have almost all NAs it is possible that the function was executed incorrectly or you may want to consider using a different gene identifier, if possible.
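
The unmapped rate can also be computed directly rather than read off the summary. The sketch below uses a small made-up data frame (`toy_mapped_df`, with hypothetical IDs and symbols) in the same shape as `mapped_df`. For the counts reported above, 942 out of 17918 works out to roughly 5%.

```r
# A made-up data frame in the same shape as `mapped_df`
toy_mapped_df <- data.frame(
  Ensembl = c("ID1", "ID2", "ID2", "ID3", "ID4"),
  Symbol = c("A", "B1", "B2", NA, NA),
  stringsAsFactors = FALSE
)

# Number of rows without a mapped symbol
n_unmapped <- sum(is.na(toy_mapped_df$Symbol))

# Number of distinct Ensembl IDs we tried to map
n_ids <- length(unique(toy_mapped_df$Ensembl))

# Percent of IDs that did not map to a symbol
percent_unmapped <- 100 * n_unmapped / n_ids
percent_unmapped
```

Dividing by the number of distinct IDs, rather than the number of rows, keeps multi-mapped IDs from inflating the denominator.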

Now let’s check to see if we have any genes that were mapped to multiple symbols.

-
multi_mapped <- mapped_df %>%
-  # Let's group by the Ensembl IDs in the `Ensembl` column
-  dplyr::group_by(Ensembl) %>%
-  # Create a new variable containing the number of symbols mapped to each ID
-  dplyr::mutate(gene_symbol_count = dplyr::n()) %>%
-  # Arrange by the genes with the highest number of symbols mapped
-  dplyr::arrange(desc(gene_symbol_count)) %>%
-  # Filter to include only the rows with multi mappings
-  dplyr::filter(gene_symbol_count > 1)
-
-# Let's look at the first 6 rows of our `multi_mapped` object
-head(multi_mapped)
+
multi_mapped <- mapped_df %>%
+  # Let's count the number of times each Ensembl ID appears in `Ensembl` column
+  dplyr::count(Ensembl, name = "gene_symbol_count") %>%
+  # Arrange by the genes with the highest number of symbols mapped
+  dplyr::arrange(desc(gene_symbol_count)) %>%
+  # Filter to include only the rows with multi mappings
+  dplyr::filter(gene_symbol_count > 1)
+
+# Let's look at the first 6 rows of our `multi_mapped` object
+head(multi_mapped)

Looks like we have some cases where 3 gene symbols mapped to a single Ensembl ID. We have a total of 130 out of 17984 Ensembl IDs with multiple mappings to gene symbols. If we are not too worried about the 130 IDs with multiple mappings, we can filter them out for the purpose of having 1:1 mappings for our downstream analysis.

@@ -3265,45 +4065,45 @@

4.4 Explore gene ID conversion

4.5 Map Ensembl IDs to gene symbols – filtering out multi mappings

In the next code chunk, we will rerun the mapIds() function, this time supplying the "filter" option to the multiVals argument. This will remove all instances of multiple mappings and return only the gene identifiers and symbols that had 1:1 mapping. Use ?mapIds to see more options or strategies.

-
# Map ensembl IDs to their associated gene symbols
-filtered_mapped_df <- data.frame(
-  "Symbol" = mapIds(
-    org.Mm.eg.db, # Replace with annotation package for the organism relevant to your data
-    keys = df$Gene,
-    column = "SYMBOL", # Replace with the type of gene identifiers you would like to map to
-    keytype = "ENSEMBL", # Replace with the type of gene identifiers in your data
-    multiVals = "filter" # This will drop any `Gene`s that have multiple matches
-  )
-) %>%
-  # Make an `Ensembl` column to store the rownames
-  tibble::rownames_to_column("Ensembl") %>%
-  # Join the remaining data from `df` using the Ensembl IDs
-  dplyr::inner_join(df, by = c("Ensembl" = "Gene"))
+
# Map Ensembl IDs to their associated gene symbols
+filtered_mapped_df <- data.frame(
+  "Symbol" = mapIds(
+    org.Mm.eg.db, # Replace with annotation package for your organism
+    keys = expression_df$Gene,
+    keytype = "ENSEMBL", # Replace with the gene identifiers used in your data
+    column = "SYMBOL", # The type of gene identifiers you would like to map to
+    multiVals = "filter" # This will drop any genes that have multiple matches
+  )
+) %>%
+  # Make an `Ensembl` column to store the rownames
+  tibble::rownames_to_column("Ensembl") %>%
+  # Join the remaining data from `expression_df` using the Ensembl IDs
+  dplyr::inner_join(expression_df, by = c("Ensembl" = "Gene"))
## 'select()' returned 1:many mapping between keys and columns
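
Conceptually, filtering out multi-mappings amounts to keeping only the keys that have exactly one value. The base R sketch below illustrates that idea on a made-up list; it is not AnnotationDbi's own code, and `toy_list` and its contents are hypothetical.

```r
# A made-up list of mappings, one element per Ensembl ID
toy_list <- list(
  ID1 = "A",
  ID2 = c("B1", "B2"),
  ID3 = NA_character_
)

# Keep only the IDs that have exactly one value
one_to_one <- toy_list[lengths(toy_list) == 1]

# Flatten to a named character vector: names are IDs, values are symbols
filtered_symbols <- unlist(one_to_one)
filtered_symbols
```

In this toy version the unmapped `ID3` survives as `NA` because it has a single value. It is worth checking the real output, for example with `sum(is.na(filtered_mapped_df$Symbol))`, to see how unmapped IDs were handled in your own data.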

Now, let’s write our filtered and mapped results to file!

4.6 Write mapped results to file

-
# Write mapped and annotated data frame to output file
-readr::write_tsv(filtered_mapped_df, file.path(
-  results_dir,
-  "GSE13490_Gene_Symbols.tsv" # Replace with a relevant output file name
-))
+
# Write mapped and annotated data frame to output file
+readr::write_tsv(filtered_mapped_df, file.path(
+  results_dir,
+  "GSE13490_Gene_Symbols.tsv" # Replace with a relevant output file name
+))
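
A quick way to confirm that a file was written as intended is to read it back in and compare. This is a sketch on a made-up data frame written to a temporary file; `toy_df` and `toy_file` are hypothetical names, and in practice you would use the path built from `results_dir` above.

```r
# A made-up data frame standing in for `filtered_mapped_df`
toy_df <- data.frame(
  Ensembl = c("ID1", "ID2"),
  Symbol = c("A", "B"),
  stringsAsFactors = FALSE
)

# Write to a temporary file so nothing in the project is overwritten
toy_file <- tempfile(fileext = ".tsv")
readr::write_tsv(toy_df, toy_file)

# Read the file back in and compare it to what we wrote
toy_check <- readr::read_tsv(toy_file)
identical(toy_check$Ensembl, toy_df$Ensembl)
identical(toy_check$Symbol, toy_df$Symbol)
```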

5 Resources for further learning

6 Session info

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

-
# Print session info
-sessioninfo::session_info()
-
## ─ Session info ───────────────────────────────────────────────────────────────
+
# Print session info
+sessioninfo::session_info()
+
## ─ Session info ─────────────────────────────────────────────────────
 ##  setting  value                       
 ##  version  R version 4.0.2 (2020-06-22)
 ##  os       Ubuntu 20.04 LTS            
@@ -3313,19 +4113,19 @@ 

6 Session info

## collate en_US.UTF-8 ## ctype en_US.UTF-8 ## tz Etc/UTC -## date 2020-10-16 +## date 2020-12-21 ## -## ─ Packages ─────────────────────────────────────────────────────────────────── +## ─ Packages ───────────────────────────────────────────────────────── ## package * version date lib source -## AnnotationDbi * 1.50.3 2020-07-25 [1] Bioconductor +## AnnotationDbi * 1.52.0 2020-10-27 [1] Bioconductor ## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.0) ## backports 1.1.10 2020-09-15 [1] RSPM (R 4.0.2) -## Biobase * 2.48.0 2020-04-27 [1] Bioconductor -## BiocGenerics * 0.34.0 2020-04-27 [1] Bioconductor +## Biobase * 2.50.0 2020-10-27 [1] Bioconductor +## BiocGenerics * 0.36.0 2020-10-27 [1] Bioconductor ## bit 4.0.4 2020-08-04 [1] RSPM (R 4.0.2) ## bit64 4.0.5 2020-08-30 [1] RSPM (R 4.0.2) ## blob 1.2.1 2020-01-20 [1] RSPM (R 4.0.0) -## cli 2.0.2 2020-02-28 [1] RSPM (R 4.0.0) +## cli 2.1.0 2020-10-12 [1] RSPM (R 4.0.2) ## crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.0) ## DBI 1.1.0 2019-12-15 [1] RSPM (R 4.0.0) ## digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0) @@ -3338,16 +4138,17 @@

6 Session info

## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.2) ## hms 0.5.3 2020-01-08 [1] RSPM (R 4.0.0) ## htmltools 0.5.0 2020-06-16 [1] RSPM (R 4.0.1) -## IRanges * 2.22.2 2020-05-21 [1] Bioconductor +## IRanges * 2.24.1 2020-12-12 [1] Bioconductor ## jsonlite 1.7.1 2020-09-07 [1] RSPM (R 4.0.2) ## knitr 1.30 2020-09-22 [1] RSPM (R 4.0.2) ## lifecycle 0.2.0 2020-03-06 [1] RSPM (R 4.0.0) ## magrittr * 1.5 2014-11-22 [1] RSPM (R 4.0.0) ## memoise 1.1.0 2017-04-21 [1] RSPM (R 4.0.0) ## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0) -## org.Mm.eg.db * 3.11.4 2020-10-06 [1] Bioconductor +## org.Mm.eg.db * 3.12.0 2020-12-16 [1] Bioconductor ## pillar 1.4.6 2020-07-10 [1] RSPM (R 4.0.2) ## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.0) +## ps 1.4.0 2020-10-07 [1] RSPM (R 4.0.2) ## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.0) ## R.cache 0.14.0 2019-12-06 [1] RSPM (R 4.0.0) ## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.2) @@ -3355,18 +4156,18 @@

6 Session info

## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.2) ## R6 2.4.1 2019-11-12 [1] RSPM (R 4.0.0) ## Rcpp 1.0.5 2020-07-06 [1] RSPM (R 4.0.2) -## readr 1.3.1 2018-12-21 [1] RSPM (R 4.0.2) +## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.2) ## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.0) -## rlang 0.4.7 2020-07-09 [1] RSPM (R 4.0.2) +## rlang 0.4.8 2020-10-08 [1] RSPM (R 4.0.2) ## rmarkdown 2.4 2020-09-30 [1] RSPM (R 4.0.2) -## RSQLite 2.2.0 2020-01-07 [1] RSPM (R 4.0.2) +## RSQLite 2.2.1 2020-09-30 [1] RSPM (R 4.0.2) ## rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0) -## S4Vectors * 0.26.1 2020-05-16 [1] Bioconductor +## S4Vectors * 0.28.1 2020-12-09 [1] Bioconductor ## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.0) ## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.2) ## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.0) ## styler 1.3.2 2020-02-23 [1] RSPM (R 4.0.0) -## tibble 3.0.3 2020-07-10 [1] RSPM (R 4.0.2) +## tibble 3.0.4 2020-10-12 [1] RSPM (R 4.0.2) ## tidyr 1.1.2 2020-08-27 [1] RSPM (R 4.0.2) ## tidyselect 1.1.0 2020-05-11 [1] RSPM (R 4.0.0) ## vctrs 0.3.4 2020-08-29 [1] RSPM (R 4.0.2) @@ -3380,24 +4181,23 @@

6 Session info

References

-
-

Carlson M., 2019 Genome wide annotation for mouse

-
-
-

Carlson M., 2020a AnnotationDbi

+
+

Carlson M., 2019 Genome wide annotation for mouse. https://bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html

-

Carlson M., 2020b AnnotationDbi: Introduction to bioconductor annotation packages

+

Carlson M., 2020 AnnotationDbi: Introduction to bioconductor annotation packages. https://bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf

-
-

CCDL, 2020 Obtaining annotation for ensembl ids - rna-seq.

-
-
-

R Bioconductor Team, 2003 Packages found under annotationdata

+
+

Pagès H., M. Carlson, S. Falcon, and N. Li, 2020 AnnotationDbi: Manipulation of SQLite-based annotations in Bioconductor. https://bioconductor.org/packages/release/bioc/html/AnnotationDbi.html

+
diff --git a/02-microarray/ortholog-mapping_microarray_01_ensembl.Rmd b/02-microarray/ortholog-mapping_microarray_01_ensembl.Rmd index 4a1f8266..0d4b5a6c 100644 --- a/02-microarray/ortholog-mapping_microarray_01_ensembl.Rmd +++ b/02-microarray/ortholog-mapping_microarray_01_ensembl.Rmd @@ -1,7 +1,7 @@ --- title: "Ortholog Mapping - Microarray" author: "CCDL for ALSF" -date: "October 2020" +date: "December 2020" output: html_notebook: toc: true @@ -43,7 +43,7 @@ if (!dir.exists("data")) { } # Define the file path to the plots directory -plots_dir <- "plots" # Can replace with path to desired output plots directory +plots_dir <- "plots" # Create the plots folder if it doesn't exist if (!dir.exists(plots_dir)) { @@ -51,7 +51,7 @@ if (!dir.exists(plots_dir)) { } # Define the file path to the results directory -results_dir <- "results" # Can replace with path to desired output results directory +results_dir <- "results" # Create the results folder if it doesn't exist if (!dir.exists(results_dir)) { @@ -81,7 +81,7 @@ You will get an email when it is ready. ## About the dataset we are using for this example For this example analysis, we will use this [CREB overexpression zebrafish dataset](https://www.refine.bio/experiments/GSE71270/creb-overexpression-induces-leukemia-in-zebrafish-by-blocking-myeloid-differentiation-process). -@Tregnago2016 measured microarray gene expression of ten zebrafish samples, five overexpressing human CREB, as well as five control samples. +@Tregnago2016 used microarrays to measure gene expression of ten zebrafish samples, five overexpressing human CREB, as well as five control samples. 
## Place the dataset in your new `data/` folder @@ -122,19 +122,24 @@ This is handy to do because if we want to switch the dataset (see next section f ```{r} # Define the file path to the data directory -data_dir <- file.path("data", "GSE13490") # Replace with accession number which will be the name of the folder the files will be in - -# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir` -data_file <- file.path(data_dir, "GSE13490.tsv") # Replace with file path to your dataset - -# Declare the file path to the metadata file using the data directory saved as `data_dir` -metadata_file <- file.path(data_dir, "metadata_GSE13490.tsv") # Replace with file path to your metadata +# Replace with the path of the folder the files will be in +data_dir <- file.path("data", "GSE71270") + +# Declare the file path to the gene expression matrix file +# inside directory saved as `data_dir` +# Replace with the path to your dataset file +data_file <- file.path(data_dir, "GSE71270.tsv") + +# Declare the file path to the metadata file +# inside the directory saved as `data_dir` +# Replace with the path to your metadata file +metadata_file <- file.path(data_dir, "metadata_GSE71270.tsv") ``` Now that our file paths are declared, we can use the `file.exists()` function to check that the files are where we specified above. ```{r} -# Check if the gene expression matrix file is at the file path stored in `data_file` +# Check if the gene expression matrix file is at the path stored in `data_file` file.exists(data_file) # Check if the metadata file is at the file path stored in `metadata_file` @@ -164,7 +169,7 @@ See our Getting Started page with [instructions for package installation](https: Attach a package we need for this analysis. 
-```{r} +```{r message=FALSE} # We will need this so we can use the pipe: %>% library(magrittr) ``` @@ -178,36 +183,50 @@ The [HGNC Comparison of Orthology Predictions (HCOP)](https://www.genenames.org/ In general, an orthology prediction where most of the databases concur would be considered the reliable, and we will use this to prioritize mapping in cases where there is more than one possible ortholog for a gene. HCOP was originally designed to show orthology predictions between human and mouse, but has been expanded to include data from 18 genomes, including zebrafish, which we will use in this notebook [@hcop-help]. -First, we need to download the file from the server holding the HGNC data. -Go to this [directory page of the HGNC Comparison of Orthology Predictions (HCOP) files](ftp://ftp.ebi.ac.uk/pub/databases/genenames/hcop/). +We can download the human to zebrafish translation file we need for this example using the `download.file()` command. +For this notebook, we want to download the file named `human_zebrafish_hcop_fifteen_column.txt.gz`. -This is where the files that reflect the data provided via the [HGNC database](https://www.genenames.org/) are maintained. -Ortholog species files with the '6 Column' output returns the raw assertions, Ensembl gene IDs and Entrez Gene IDs for human and one other species, while the '15 Column' output includes additional information such as the chromosomal location, accession numbers and the databases that support the assertions. +First we'll declare a sensible file path for this. -*Note:* If you are using Safari (or the above FTP server link does not open in a web browser), you may need to go to the [link for the HCOP search tool](https://www.genenames.org/tools/hcop/) and scroll down to "Bulk Downloads" to choose a file to download. -Here, you can find the same files you would find at the server linked above. 
+```{r} +# Declare what we want the downloaded file to be called and its location +zebrafish_hgnc_file <- file.path( + data_dir, + # The name the file will have locally + "human_zebrafish_hcop_fifteen_column.txt.gz" +) +``` -To download a file, click the file name. -For this notebook, you will want to download the file named `human_zebrafish_hcop_fifteen_column.txt.gz`. -If you are using a different dataset, you can replace `zebrafish` in `human_zebrafish_hcop_fifteen_column.txt.gz` with the name of the species you have data for, and click on that file to download. +Using the file path we just declared, we can use the `destfile` argument to download the file we need to this directory and use this file name. - +We are downloading this orthology predictions file from the [HGNC database](https://www.genenames.org/). +If you are looking for a different species, see the [directory page of the HGNC Comparison of Orthology Predictions (HCOP) files](http://ftp.ebi.ac.uk/pub/databases/genenames/hcop/) and find the file name of the species you are looking for. -Next, move the `human_zebrafish_hcop_fifteen_column.txt.gz` file into your `data/` folder. +```{r} +download.file( + paste0( + "http://ftp.ebi.ac.uk/pub/databases/genenames/hcop/", + # Replace with the file name for the species conversion you want + "human_zebrafish_hcop_fifteen_column.txt.gz" + ), + # The file will be saved to the name and location we defined earlier + destfile = zebrafish_hgnc_file +) +``` -*Note:* If you are using Safari, this file will automatically be decompressed, so the name of the file would instead be `human_zebrafish_hcop_fifteen_column.txt` (don't forget to change the file name in the chunk below if this is the case). +If you are using a different dataset, in the last chunk you can replace `zebrafish` in `human_zebrafish_hcop_fifteen_column.txt.gz` with the name of the species you have data for (if you see it listed in the directory). 
+Don't forget to change the destination file as well to reflect what you download! -Now let's double check that the file is in the right place. +Ortholog species files with the '6 Column' output returns the raw assertions, Ensembl gene IDs and Entrez Gene IDs for human and one other species, while the '15 Column' output includes additional information such as the chromosomal location, accession numbers and the databases that support the assertions. -```{r} -# Define the file path to organism orthology file downloaded from the HGNC database -zebrafish_hgnc_file <- file.path("data", "human_zebrafish_hcop_fifteen_column.txt.gz") +Now let's double check that the zebrafish ortholog file is in the right place. +```{r} # Check if the organism orthology file file is in the `data` directory file.exists(zebrafish_hgnc_file) ``` -In the next chunk, we will read in the orthology file that was just downloaded. +Now we can read in the orthology file that we downloaded. ```{r} # Read in the data from HGNC @@ -239,8 +258,8 @@ We stored our file path for the dataset in an object named `data_file` in [this ```{r} # Read in data TSV file zebrafish_genes <- readr::read_tsv(data_file) %>% - # We only want the gene IDs so let's pull the `Gene` column - dplyr::pull("Gene") + # We only want the gene IDs so let's keep only the `Gene` column + dplyr::select("Gene") ``` ## Mapping human gene symbols to zebrafish Ensembl gene IDs @@ -248,7 +267,7 @@ zebrafish_genes <- readr::read_tsv(data_file) %>% refine.bio data uses Ensembl gene identifiers, which will be in the first column. 
```{r} -# Let's take a look at the first 6 items of `zebrafish_genes` +# Let's take a look at the first 6 rows of `zebrafish_genes` head(zebrafish_genes) ``` @@ -263,7 +282,7 @@ This column may assist with addressing some of the multi-mappings that we will t ```{r} human_zebrafish_key <- zebrafish %>% - # We'll want to subset zebrafish to only the columns we're interested in + # Reduce the zebrafish table to only the columns we're interested in dplyr::select(zebrafish_ensembl, human_symbol, support) # Since we ignored the additional columns in `zebrafish`, let's check to see if @@ -272,57 +291,61 @@ any(duplicated(human_zebrafish_key)) ``` We do have duplicates! -We don't want to handle duplicate data, so let's remove those duplicates before moving forward. +Let's remove those duplicates before moving forward, as they provide no extra information at this point. ```{r} human_zebrafish_key <- human_zebrafish_key %>% - # We need to use the `distinct()` function to remove duplicates resulted from - # ignoring the additional columns in the `zebrafish` object + # Use the `distinct()` function to remove duplicates resulting from + # dropping the additional columns in the `zebrafish` data frame dplyr::distinct() ``` Now let's join the mapped data from `human_zebrafish_key` with the gene data in `zebrafish_genes`. +We are using a "left join" here so that we get at least one row per zebrafish gene, even if there is no matching human symbol in the mapping table. ```{r} -# First, we need to convert our vector of zebrafish genes into a data frame -human_zebrafish_mapped_df <- data.frame("Gene" = zebrafish_genes) %>% +human_zebrafish_mapped_df <- zebrafish_genes %>% # Now we can join the mapped data dplyr::left_join(human_zebrafish_key, by = c("Gene" = "zebrafish_ensembl")) ``` Here's what the new data frame looks like: -```{r} -head(human_zebrafish_mapped_df, n = 25) +```{r rownames.print = FALSE} +head(human_zebrafish_mapped_df, n = 10) ``` +Looks like we have mapped symbols! 
+ +So now we have all the zebrafish genes mapped to human, but there might be places where there are multiple zebrafish genes that are orthologous to the same human gene, or vice versa. + Let's get a summary of the human gene symbols returned in our mapped data frame, `human_zebrafish_mapped_df`. ```{r} -# We can use this `count()` function after `group_by()`to get a count of how many +# We can use the `count()` function to get a tally of how many # `zebrafish_ensembl` IDs there are per `human_symbol` human_zebrafish_mapped_df %>% - dplyr::group_by(human_symbol) %>% - dplyr::count() %>% - # Sort by highest `n` which would be the human gene symbol with the most + # Remove the support column + dplyr::select(Gene, human_symbol) %>% + # Remove any remaining duplicates + dplyr::distinct() %>% + # Count the number of rows per human gene + dplyr::count(human_symbol) %>% + # Sort by highest `n` which will be the human gene symbol with the most # mapped zebrafish Ensembl IDs dplyr::arrange(desc(n)) ``` -Looks like we have mapped symbols! +There are certainly a good number of places where we mapped multiple zebrafish Ensembl IDs to the same human symbol! +We'll look at this in a bit. -Now, let's get an idea of how many zebrafish Ensembl IDs we have that were not mapped to human gene symbols. - -```{r} -sum(is.na(human_zebrafish_mapped_df$human_symbol)) -``` - -We have 463 NAs, which means we have 463 zebrafish Ensembl IDs that were not mapped to human gene symbols. -This is okay because we do not expect everything to map across species. +We can also see that there 738 zebrafish Ensembl IDs that did not map to a human symbol. +These are the ones with a value of NA. +This is okay because we do not expect everything to map neatly across species. ## Take a look at some multi-mappings -If a zebrafish Ensembl gene ID maps to multiple human symbols, the associated values will get duplicated. 
+If a zebrafish Ensembl gene ID maps to multiple human symbols, the associated Ensembl ID values will get duplicated in our output data. Let's look at the `ENSDARG00000069142` example below. ```{r} @@ -330,7 +353,7 @@ human_zebrafish_mapped_df %>% dplyr::filter(Gene == "ENSDARG00000069142") ``` -On the other hand, if you were to look at the original data associated to the zebrafish Ensembl IDs, when a human gene symbol maps to multiple zebrafish Ensembl IDs, the values will not get duplicated, but you will have multiple rows associated with that human symbol. +On the other hand, if you were to look at the original data associated to the zebrafish Ensembl IDs, when a human gene symbol maps to multiple zebrafish Ensembl IDs, the Ensembl IDs will not get duplicated, but you will have multiple rows associated with that human symbol. Let's look at the `MATR3` example below. ```{r} @@ -338,6 +361,11 @@ human_zebrafish_mapped_df %>% dplyr::filter(human_symbol == "MATR3") ``` +We can see that we have multiple zebrafish Ensembl IDs that mapped to the same gene. +(Notice that we also still have some duplicate zebrafish Ensembl ID/human symbol pairs here because the `support` column was different in the original data set! +This is why we removed that column before counting above.) + + ## Collapse zebrafish genes mapping to multiple human genes Remember that if a zebrafish Ensembl gene ID maps to multiple human symbols, the values get duplicated. 
@@ -349,9 +377,16 @@ In the next chunk, we show how we can collapse all the human gene symbols into o collapsed_human_symbol_df <- human_zebrafish_mapped_df %>% # Group by zebrafish Ensembl IDs dplyr::group_by(Gene) %>% - # Collapse the mapped values in `human_zebrafish_mapped_df` into one column named - # `all_human_symbols` -- note that we will lose the `support` column in this summarizing step - dplyr::summarize(all_human_symbols = paste(human_symbol, collapse = ";")) + # Collapse the mapped values in `human_zebrafish_mapped_df` to a + # `all_human_symbols` column, removing any duplicated human symbols + # note that we will lose the `support` column in this summarizing step + dplyr::summarize( + # combine unique symbols with semicolons between them + all_human_symbols = paste( + sort(unique(human_symbol)), + collapse = ";" + ) + ) head(collapsed_human_symbol_df) ``` @@ -377,11 +412,11 @@ Since multiple zebrafish Ensembl gene IDs map to the same human symbol, we may w *This is not at all straightforward!* (see [this paper](https://doi.org/10.1093/bioinformatics/btaa468) for just one example) [@Stamboulian2020]. Gene duplications along the zebrafish lineage may result in complicated relationships among genes, especially with regard to divisions of function. -Simply combining values across zebrafish transcripts using an average may result in the loss of a lot of data and will likely not be representative of the zebrafish biology. +Simply combining expression values across zebrafish transcripts that correspond to the same human gene using an average or other summary statistic may result in the loss of a lot of data and will likely not be representative of the zebrafish biology. One thing we might do to make the problem somewhat simpler is to reduce the number of multi-mapped genes by requiring a certain level of support for each mapping from across the various databases included in `HCOP`. 
This will not fully solve the problem (and may not even be desirable in some cases), but we present it here as an example of an approach one might take. -Therefore, we will use the `support` column to decide which mappings to retain. +To do this, we will use the `support` column to decide which mappings to retain. Let's take a look at `support`. ```{r} @@ -389,31 +424,32 @@ head(human_zebrafish_mapped_df$support) ``` Looks like we have a variety of databases for multiple mappings, but we do have some instances of only one database reported in support of the mapping. -As we noted earlier, an orthology prediction where more than one of the databases concur would be considered reliable. -Therefore, where we have multi-mapped zebrafish Ensembl gene IDs, we will take the mappings with more than one database to support the assertion. +As we noted earlier, an orthology prediction where more than one of the databases concur would be considered more reliable. +Therefore, where we have multi-mapped zebrafish Ensembl gene IDs, we will retain the mappings with more than one database to support the assertion. Before we do, let's take a look how many multi-mapped genes there are in the data frame. ```{r} human_zebrafish_mapped_df %>% - # Group by human gene symbols - dplyr::group_by(human_symbol) %>% + # Remove the `support` column + dplyr::select(Gene, human_symbol) %>% + # Remove any remaining duplicates + dplyr::distinct() %>% # Count the number of rows in the dataframe for each symbol - dplyr::count() %>% + dplyr::count(human_symbol) %>% # Filter out the symbols without multimapped genes dplyr::filter(n > 1) ``` -Looks like we have 4,192 human gene symbols with multiple mappings. +Looks like we have 4,169 human gene symbols with multiple mappings. Now let's filter out the less reliable mappings. 
```{r} filtered_zebrafish_ensembl_df <- human_zebrafish_mapped_df %>% - # Count the number of databases in the support column for each prediction + # Count the number of databases in the support column + # by using the number of commas that separate the databases dplyr::mutate(n_databases = stringr::str_count(support, ",") + 1) %>% - # Group by human gene symbols - dplyr::group_by(human_symbol) %>% - # Now filter for the rows with more than one database in support for each human gene symbol + # Now filter to the rows where more than one database supports the mapping dplyr::filter(n_databases > 1) head(filtered_zebrafish_ensembl_df) @@ -423,14 +459,16 @@ Let's count how many multi-mapped genes we have now. ```{r} filtered_zebrafish_ensembl_df %>% - # Group by human gene symbols - dplyr::group_by(human_symbol) %>% + # Remove the support column + dplyr::select(Gene, human_symbol) %>% + # Remove any remaining duplicates + dplyr::distinct() %>% # Count the number of rows in the dataframe for each symbol - dplyr::count() %>% - # Filter out the symbols without multimapped genes + dplyr::count(human_symbol) %>% + # Filter to the symbols with multimapped genes dplyr::filter(n > 1) ``` -Now we only have 1,803 multi-mapped genes, compared to the 4,192 that we began with. +Now we only have 1,695 multi-mapped genes, compared to the 4,169 that we began with. Although we haven't filtered down to zero multi-mapped genes, we have hopefully removed some of the less _reliable_ mappings. 
### Write results to file diff --git a/02-microarray/ortholog-mapping_microarray_01_ensembl.html b/02-microarray/ortholog-mapping_microarray_01_ensembl.html index bc145f48..051773a1 100644 --- a/02-microarray/ortholog-mapping_microarray_01_ensembl.html +++ b/02-microarray/ortholog-mapping_microarray_01_ensembl.html @@ -1263,25 +1263,22 @@ }; - - + + code.sourceCode > span { display: inline-block; line-height: 1.25; } + code.sourceCode > span { color: inherit; text-decoration: inherit; } + code.sourceCode > span:empty { height: 1.2em; } + .sourceCode { overflow: visible; } + code.sourceCode { white-space: pre; position: relative; } + div.sourceCode { margin: 1em 0; } + pre.sourceCode { margin: 0; } + @media screen { + div.sourceCode { overflow: auto; } + } + @media print { + code.sourceCode { white-space: pre-wrap; } + code.sourceCode > span { text-indent: -5em; padding-left: 5em; } + } + pre.numberSource code + { counter-reset: source-line 0; } + pre.numberSource code > span + { position: relative; left: -4em; counter-increment: source-line; } + pre.numberSource code > span > a:first-child::before + { content: counter(source-line); + position: relative; left: -1em; text-align: right; vertical-align: baseline; + border: none; display: inline-block; + -webkit-touch-callout: none; -webkit-user-select: none; + -khtml-user-select: none; -moz-user-select: none; + -ms-user-select: none; user-select: none; + padding: 0 4px; width: 4em; + color: #aaaaaa; + } + pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; } + div.sourceCode + { } + @media screen { + code.sourceCode > span > a:first-child::before { text-decoration: underline; } + } + code span.al { color: #ff0000; } /* Alert */ + code span.an { color: #008000; } /* Annotation */ + code span.at { } /* Attribute */ + code span.bu { } /* BuiltIn */ + code span.cf { color: #0000ff; } /* ControlFlow */ + code span.ch { color: #008080; } /* Char */ + code span.cn { } /* Constant */ + code 
span.co { color: #008000; } /* Comment */ + code span.cv { color: #008000; } /* CommentVar */ + code span.do { color: #008000; } /* Documentation */ + code span.er { color: #ff0000; font-weight: bold; } /* Error */ + code span.ex { } /* Extension */ + code span.im { } /* Import */ + code span.in { color: #008000; } /* Information */ + code span.kw { color: #0000ff; } /* Keyword */ + code span.op { } /* Operator */ + code span.ot { color: #ff4000; } /* Other */ + code span.pp { color: #ff4000; } /* Preprocessor */ + code span.sc { color: #008080; } /* SpecialChar */ + code span.ss { color: #008080; } /* SpecialString */ + code span.st { color: #008080; } /* String */ + code span.va { } /* Variable */ + code span.vs { color: #008080; } /* VerbatimString */ + code span.wa { color: #008000; font-weight: bold; } /* Warning */ + + + + - - + @@ -2874,15 +3686,20 @@ @@ -2948,7 +3774,7 @@

Ortholog Mapping - Microarray

CCDL for ALSF

-

October 2020

+

December 2020

@@ -2969,26 +3795,26 @@

2.1 Obtain the .Rmd

2.2 Set up your analysis folders

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

-
# Create the data folder if it doesn't exist
-if (!dir.exists("data")) {
-  dir.create("data")
-}
-
-# Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
-
-# Create the plots folder if it doesn't exist
-if (!dir.exists(plots_dir)) {
-  dir.create(plots_dir)
-}
-
-# Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
-
-# Create the results folder if it doesn't exist
-if (!dir.exists(results_dir)) {
-  dir.create(results_dir)
-}
+
# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+  dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots"
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+  dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results"
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+  dir.create(results_dir)
+}

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

@@ -3003,7 +3829,7 @@

2.3 Obtain the dataset from refin

2.4 About the dataset we are using for this example

-

For this example analysis, we will use this CREB overexpression zebrafish dataset. Tregnago et al. (2016) measured microarray gene expression of ten zebrafish samples, five overexpressing human CREB, as well as five control samples.

+

For this example analysis, we will use this CREB overexpression zebrafish dataset. Tregnago et al. (2016) used microarrays to measure gene expression of ten zebrafish samples, five overexpressing human CREB, as well as five control samples.

2.5 Place the dataset in your new data/ folder

@@ -3037,20 +3863,25 @@

2.6 Check out our file structure!

In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset we are using for this analysis (see next section for more on this), we will only have to change the file path here to get started.

-
# Define the file path to the data directory
-data_dir <- file.path("data", "GSE13490") # Replace with accession number which will be the name of the folder the files will be in
-
-# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
-data_file <- file.path(data_dir, "GSE13490.tsv") # Replace with file path to your dataset
-
-# Declare the file path to the metadata file using the data directory saved as `data_dir`
-metadata_file <- file.path(data_dir, "metadata_GSE13490.tsv") # Replace with file path to your metadata
+
# Define the file path to the data directory
+# Replace with the path of the folder the files will be in
+data_dir <- file.path("data", "GSE71270")
+
+# Declare the file path to the gene expression matrix file
+# inside the directory saved as `data_dir`
+# Replace with the path to your dataset file
+data_file <- file.path(data_dir, "GSE71270.tsv")
+
+# Declare the file path to the metadata file
+# inside the directory saved as `data_dir`
+# Replace with the path to your metadata file
+metadata_file <- file.path(data_dir, "metadata_GSE71270.tsv")

Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

-
# Check if the gene expression matrix file is at the file path stored in `data_file`
-file.exists(data_file)
+
# Check if the gene expression matrix file is at the path stored in `data_file`
+file.exists(data_file)
## [1] TRUE
-
# Check if the metadata file is at the file path stored in `metadata_file`
-file.exists(metadata_file)
+
# Check if the metadata file is at the file path stored in `metadata_file`
+file.exists(metadata_file)
## [1] TRUE

If the chunk above printed out FALSE for either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.

If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.
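
As a minimal sketch, the two `file.exists()` checks above could be folded into one helper that stops with an informative message when a file is missing. The helper name `check_files_exist()` is hypothetical and not part of the original notebook, and the demonstration uses temporary files rather than the real dataset.

```r
# Hypothetical helper (not from the original notebook): stop early if any
# of the supplied file paths does not exist
check_files_exist <- function(paths) {
  missing_paths <- paths[!file.exists(paths)]
  if (length(missing_paths) > 0) {
    stop("Missing file(s): ", paste(missing_paths, collapse = ", "))
  }
  invisible(TRUE)
}

# Demonstrate with a temporary file that exists and a path that does not
existing_file <- tempfile(fileext = ".tsv")
file.create(existing_file)
absent_file <- file.path(tempdir(), "this_file_should_not_exist.tsv")

# Passes silently because the file exists
check_files_exist(existing_file)

# Capture the error message instead of halting the script
result <- tryCatch(
  check_files_exist(c(existing_file, absent_file)),
  error = function(e) conditionMessage(e)
)
result
```

In the notebook itself, one might call such a helper with `c(data_file, metadata_file)` so that a missing file is reported before any analysis code runs.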

@@ -3069,31 +3900,43 @@

4 Ortholog Mapping - Microarray

4.1 Install libraries

See our Getting Started page with instructions for package installation for a list of the software you will need, as well as more tips and resources.

Attach a package we need for this analysis.

-
# We will need this so we can use the pipe: %>%
-library(magrittr)
+
# We will need this so we can use the pipe: %>%
+library(magrittr)

4.2 Import data from HGNC

-

The HUGO Gene Nomenclature Committee (HGNC) assigns a unique and ideally meaningful name and symbol to every human gene. The HGNC database currently contains over 39,000 public records containing approved human gene nomenclature and associated gene information (Gray et al. 2015).

-

The HGNC Comparison of Orthology Predictions (HCOP) is a search tool that combines orthology predictions for a specified human gene, or set of human genes from a variety of sources, including Ensembl Compara, HGNC, and NCBI Gene Orthology (Wright et al. 2005). In general, an orthology prediction where most of the databases concur would be considered the reliable, and we will use this to prioritize mapping in cases where there is more than one possible ortholog for a gene. HCOP was originally designed to show orthology predictions between human and mouse, but has been expanded to include data from 18 genomes, including zebrafish, which we will use in this notebook (HGNC team 2020).

-

First, we need to download the file from the server holding the HGNC data. Go to this directory page of the HGNC Comparison of Orthology Predictions (HCOP) files.

-

This is where the files that reflect the data provided via the HGNC database are maintained. Ortholog species files with the ‘6 Column’ output returns the raw assertions, Ensembl gene IDs and Entrez Gene IDs for human and one other species, while the ‘15 Column’ output includes additional information such as the chromosomal location, accession numbers and the databases that support the assertions.

-

Note: If you are using Safari (or the above FTP server link does not open in a web browser), you may need to go to the link for the HCOP search tool and scroll down to “Bulk Downloads” to choose a file to download. Here, you can find the same files you would find at the server linked above.

-

To download a file, click the file name. For this notebook, you will want to download the file named human_zebrafish_hcop_fifteen_column.txt.gz. If you are using a different dataset, you can replace zebrafish in human_zebrafish_hcop_fifteen_column.txt.gz with the name of the species you have data for, and click on that file to download.

-

-

Next, move the human_zebrafish_hcop_fifteen_column.txt.gz file into your data/ folder.

-

Note: If you are using Safari, this file will automatically be decompressed, so the name of the file would instead be human_zebrafish_hcop_fifteen_column.txt (don’t forget to change the file name in the chunk below if this is the case).

-

Now let’s double check that the file is in the right place.

-
# Define the file path to organism orthology file downloaded from the HGNC database
-zebrafish_hgnc_file <- file.path("data", "human_zebrafish_hcop_fifteen_column.txt.gz")
-
-# Check if the organism orthology file file is in the `data` directory
-file.exists(zebrafish_hgnc_file)
+

The HUGO Gene Nomenclature Committee (HGNC) assigns a unique and ideally meaningful name and symbol to every human gene. The HGNC database currently contains over 39,000 public records containing approved human gene nomenclature and associated gene information (Gray et al. 2015).

+

The HGNC Comparison of Orthology Predictions (HCOP) is a search tool that combines orthology predictions for a specified human gene, or set of human genes, from a variety of sources, including Ensembl Compara, HGNC, and NCBI Gene Orthology (Wright et al. 2005). In general, an orthology prediction where most of the databases concur would be considered the most reliable, and we will use this to prioritize mapping in cases where there is more than one possible ortholog for a gene. HCOP was originally designed to show orthology predictions between human and mouse, but has been expanded to include data from 18 genomes, including zebrafish, which we will use in this notebook (HGNC Team 2020).

+

We can download the human to zebrafish translation file we need for this example using the download.file() command. For this notebook, we want to download the file named human_zebrafish_hcop_fifteen_column.txt.gz.

+

First we’ll declare a sensible file path for this.

+
# Declare what we want the downloaded file to be called and its location
+zebrafish_hgnc_file <- file.path(
+  data_dir,
+  # The name the file will have locally
+  "human_zebrafish_hcop_fifteen_column.txt.gz"
+)
+

We can pass the file path we just declared to the destfile argument of download.file(), so that the file is saved in this directory under this file name.

+

We are downloading this orthology predictions file from the HGNC database. If you are looking for a different species, see the directory page of the HGNC Comparison of Orthology Predictions (HCOP) files and find the file name of the species you are looking for.

+
download.file(
+  paste0(
+    "http://ftp.ebi.ac.uk/pub/databases/genenames/hcop/",
+    # Replace with the file name for the species conversion you want
+    "human_zebrafish_hcop_fifteen_column.txt.gz"
+  ),
+  # The file will be saved to the name and location we defined earlier
+  destfile = zebrafish_hgnc_file
+)
+

If you are using a different dataset, in the last chunk you can replace zebrafish in human_zebrafish_hcop_fifteen_column.txt.gz with the name of the species you have data for (if you see it listed in the directory). Don’t forget to change the destination file as well to reflect what you download!
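
As a rough sketch of the substitution described above, the download URL and the destination path can be built from the species name in one place, so the two always stay in sync. The helper `hcop_paths()` is hypothetical (it is not part of the notebook), and it assumes a file following the `human_<species>_hcop_fifteen_column.txt.gz` naming pattern is actually listed in the HCOP directory. Nothing is downloaded here.

```r
# Hypothetical helper: build the HCOP download URL and a local destination
# path for a given species name. Assumes a matching file is listed in the
# HCOP directory; this function does not check that.
hcop_paths <- function(species, dest_dir = "data") {
  file_name <- paste0("human_", species, "_hcop_fifteen_column.txt.gz")
  list(
    url = paste0(
      "http://ftp.ebi.ac.uk/pub/databases/genenames/hcop/",
      file_name
    ),
    destfile = file.path(dest_dir, file_name)
  )
}

# Paths for the zebrafish file used in this notebook
zebrafish_paths <- hcop_paths(
  "zebrafish",
  dest_dir = file.path("data", "GSE71270")
)
zebrafish_paths$url
zebrafish_paths$destfile

# To actually download, one could then call (not run here):
# download.file(zebrafish_paths$url, destfile = zebrafish_paths$destfile)
```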

+

Ortholog species files with the ‘6 Column’ output return the raw assertions, Ensembl gene IDs and Entrez Gene IDs for human and one other species, while the ‘15 Column’ output includes additional information such as the chromosomal location, accession numbers and the databases that support the assertions.

+

Now let’s double check that the zebrafish ortholog file is in the right place.

+
# Check if the organism orthology file is at the path stored in `zebrafish_hgnc_file`
+file.exists(zebrafish_hgnc_file)
## [1] TRUE
-

In the next chunk, we will read in the orthology file that was just downloaded.

-
# Read in the data from HGNC
-zebrafish <- readr::read_tsv(zebrafish_hgnc_file)
-
## Parsed with column specification:
+

Now we can read in the orthology file that we downloaded.

+
# Read in the data from HGNC
+zebrafish <- readr::read_tsv(zebrafish_hgnc_file)
+
## 
+## ── Column specification ──────────────────────────────────────────────
 ## cols(
 ##   human_entrez_gene = col_character(),
 ##   human_ensembl_gene = col_character(),
@@ -3112,230 +3955,247 @@ 

4.2 Import data from HGNC

 ##   support = col_character()
 ## )

Let’s take a look at what zebrafish looks like.

-
zebrafish
+
zebrafish

We are going to manipulate some of the column names to make things easier when calling them downstream.

-
zebrafish <- zebrafish %>%
-  set_names(names(.) %>%
-    # Removing extra word in some of the column names
-    gsub("_gene$", "", .))
+
zebrafish <- zebrafish %>%
+  set_names(names(.) %>%
+    # Removing extra word in some of the column names
+    gsub("_gene$", "", .))
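
To see what the `gsub()` pattern in the chunk above does on its own, here is a small sketch on a character vector of column names written in the style of the HCOP file. The vector `column_names` is an illustration only and is not read from the real file.

```r
# Column names in the style of the HCOP file (for illustration only)
column_names <- c(
  "human_entrez_gene",
  "human_symbol",
  "zebrafish_ensembl_gene",
  "support"
)

# The `$` anchors the match to the end of the string, so `_gene` is removed
# only when it is the final part of a column name
cleaned_names <- gsub("_gene$", "", column_names)
cleaned_names
```

Names that do not end in `_gene`, such as `human_symbol` and `support`, pass through unchanged.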

4.3 Import and set up data

Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read the data TSV file and add it as an object to your environment.

We stored our file path for the dataset in an object named data_file in this previous step.

-
# Read in data TSV file
-zebrafish_genes <- readr::read_tsv(data_file) %>%
-  # We only want the gene IDs so let's pull the `Gene` column
-  dplyr::pull("Gene")
-
## Parsed with column specification:
+
# Read in data TSV file
+zebrafish_genes <- readr::read_tsv(data_file) %>%
+  # We only want the gene IDs so let's keep only the `Gene` column
+  dplyr::select("Gene")
+
## 
+## ── Column specification ──────────────────────────────────────────────
 ## cols(
 ##   Gene = col_character(),
-##   GSM340064 = col_double(),
-##   GSM340065 = col_double(),
-##   GSM340066 = col_double(),
-##   GSM340067 = col_double(),
-##   GSM340068 = col_double(),
-##   GSM340069 = col_double(),
-##   GSM340070 = col_double(),
-##   GSM340071 = col_double(),
-##   GSM340072 = col_double(),
-##   GSM340073 = col_double(),
-##   GSM340074 = col_double(),
-##   GSM340075 = col_double(),
-##   GSM340076 = col_double(),
-##   GSM340077 = col_double(),
-##   GSM340078 = col_double()
+##   GSM1831675 = col_double(),
+##   GSM1831676 = col_double(),
+##   GSM1831677 = col_double(),
+##   GSM1831678 = col_double(),
+##   GSM1831679 = col_double(),
+##   GSM1831680 = col_double(),
+##   GSM1831681 = col_double(),
+##   GSM1831682 = col_double(),
+##   GSM1831683 = col_double(),
+##   GSM1831684 = col_double()
 ## )

4.4 Mapping human gene symbols to zebrafish Ensembl gene IDs

refine.bio data uses Ensembl gene identifiers, which will be in the first column.

-
# Let's take a look at the first 6 items of `zebrafish_genes`
-head(zebrafish_genes)
-
## [1] "ENSMUSG00000000001" "ENSMUSG00000000003" "ENSMUSG00000000028"
-## [4] "ENSMUSG00000000031" "ENSMUSG00000000037" "ENSMUSG00000000049"
+
# Let's take a look at the first 6 rows of `zebrafish_genes`
+head(zebrafish_genes)
+
+ +

Ensembl gene identifiers have different species-specific prefixes. In zebrafish, Ensembl gene identifiers begin with ENSDARG (in human, ENSG, etc.).
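
Because the prefix identifies the species, a quick sanity check on the identifiers can catch a dataset from the wrong organism before any mapping is attempted. The sketch below is an optional check that is not part of the original notebook. The first identifier appears in this notebook, and the other two are human-style and mouse-style identifiers included only for illustration.

```r
# Example identifiers: one zebrafish ID plus human-style and mouse-style
# IDs that are included only to show what a mismatch looks like
gene_ids <- c("ENSDARG00000069142", "ENSG00000000001", "ENSMUSG00000000001")

# TRUE where the identifier carries the zebrafish prefix
is_zebrafish <- startsWith(gene_ids, "ENSDARG")
is_zebrafish

# A single FALSE here would suggest the data are not all zebrafish genes
all(is_zebrafish)
```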

Now let’s do the mapping!

We’re interested in the human_symbol, zebrafish_ensembl, and support columns specifically. The support column contains a list of associated databases that support each assertion. This column may assist with addressing some of the multi-mappings that we will talk about later.

-
human_zebrafish_key <- zebrafish %>%
-  # We'll want to subset zebrafish to only the columns we're interested in
-  dplyr::select(zebrafish_ensembl, human_symbol, support)
-
-# Since we ignored the additional columns in `zebrafish`, let's check to see if
-# we have any duplicates in our `human_zebrafish_key`
-any(duplicated(human_zebrafish_key))
+
human_zebrafish_key <- zebrafish %>%
+  # Reduce the zebrafish table to only the columns we're interested in
+  dplyr::select(zebrafish_ensembl, human_symbol, support)
+
+# Since we ignored the additional columns in `zebrafish`, let's check to see if
+# we have any duplicates in our `human_zebrafish_key`
+any(duplicated(human_zebrafish_key))
## [1] TRUE
-

We do have duplicates! We don’t want to handle duplicate data, so let’s remove those duplicates before moving forward.

-
human_zebrafish_key <- human_zebrafish_key %>%
-  # We need to use the `distinct()` function to remove duplicates resulted from
-  # ignoring the additional columns in the `zebrafish` object
-  dplyr::distinct()
-

Now let’s join the mapped data from human_zebrafish_key with the gene data in zebrafish_genes.

-
# First, we need to convert our vector of zebrafish genes into a data frame
-human_zebrafish_mapped_df <- data.frame("Gene" = zebrafish_genes) %>%
-  # Now we can join the mapped data
-  dplyr::left_join(human_zebrafish_key, by = c("Gene" = "zebrafish_ensembl"))
+

We do have duplicates! Let’s remove those duplicates before moving forward, as they provide no extra information at this point.

+
human_zebrafish_key <- human_zebrafish_key %>%
+  # Use the `distinct()` function to remove duplicates resulting from
+  # dropping the additional columns in the `zebrafish` data frame
+  dplyr::distinct()
+

Now let’s join the mapped data from human_zebrafish_key with the gene data in zebrafish_genes. We are using a “left join” here so that we get at least one row per zebrafish gene, even if there is no matching human symbol in the mapping table.

+
human_zebrafish_mapped_df <- zebrafish_genes %>%
+  # Now we can join the mapped data
+  dplyr::left_join(human_zebrafish_key, by = c("Gene" = "zebrafish_ensembl"))

Here’s what the new data frame looks like:

-
head(human_zebrafish_mapped_df, n = 25)
+
head(human_zebrafish_mapped_df, n = 10)
+

Looks like we have mapped symbols!
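
To illustrate the left join behaviour on a small scale, here is a sketch with made-up gene IDs and symbols. It uses base R `merge()` with `all.x = TRUE` in place of `dplyr::left_join()` so that it runs without extra packages; the two are assumed to behave equivalently for this simple case, apart from row ordering.

```r
# Made-up zebrafish genes from an expression dataset
toy_genes <- data.frame(
  Gene = c("gene_a", "gene_b", "gene_c"),
  stringsAsFactors = FALSE
)

# Made-up mapping key: gene_a maps to two symbols, gene_b has no entry,
# and gene_d is not in the expression dataset at all
toy_key <- data.frame(
  zebrafish_ensembl = c("gene_a", "gene_a", "gene_c", "gene_d"),
  human_symbol = c("SYM1", "SYM2", "SYM3", "SYM4"),
  stringsAsFactors = FALSE
)

# `all.x = TRUE` keeps every row of `toy_genes`, filling in NA where the
# key has no match (the base R counterpart of a left join)
toy_mapped <- merge(
  toy_genes,
  toy_key,
  by.x = "Gene",
  by.y = "zebrafish_ensembl",
  all.x = TRUE
)
toy_mapped
```

Here `gene_a` appears twice (once per human symbol), `gene_b` is kept with an `NA` symbol, and `gene_d` is dropped because it is absent from the left-hand table.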

+

So now we have all the zebrafish genes mapped to human, but there might be places where there are multiple zebrafish genes that are orthologous to the same human gene, or vice versa.

Let’s get a summary of the human gene symbols returned in our mapped data frame, human_zebrafish_mapped_df.

-
# We can use this `count()` function after `group_by()`to get a count of how many
-# `zebrafish_ensembl` IDs there are per `human_symbol`
-human_zebrafish_mapped_df %>%
-  dplyr::group_by(human_symbol) %>%
-  dplyr::count() %>%
-  # Sort by highest `n` which would be the human gene symbol with the most
-  # mapped zebrafish Ensembl IDs
-  dplyr::arrange(desc(n))
+
# We can use the `count()` function to get a tally of how many
+# `zebrafish_ensembl` IDs there are per `human_symbol`
+human_zebrafish_mapped_df %>%
+  # Remove the support column
+  dplyr::select(Gene, human_symbol) %>%
+  # Remove any remaining duplicates
+  dplyr::distinct() %>%
+  # Count the number of rows per human gene
+  dplyr::count(human_symbol) %>%
+  # Sort by highest `n` which will be the human gene symbol with the most
+  # mapped zebrafish Ensembl IDs
+  dplyr::arrange(desc(n))
-

Looks like we have mapped symbols!

-

Now, let’s get an idea of how many zebrafish Ensembl IDs we have that were not mapped to human gene symbols.

-
sum(is.na(human_zebrafish_mapped_df$human_symbol))
-
## [1] 17918
-

We have 463 NAs, which means we have 463 zebrafish Ensembl IDs that were not mapped to human gene symbols. This is okay because we do not expect everything to map across species.

+

There are certainly a good number of places where we mapped multiple zebrafish Ensembl IDs to the same human symbol! We’ll look at this in a bit.

+

We can also see that there are 738 zebrafish Ensembl IDs that did not map to a human symbol. These are the ones with a value of NA. This is okay because we do not expect everything to map neatly across species.

4.5 Take a look at some multi-mappings

-

If a zebrafish Ensembl gene ID maps to multiple human symbols, the associated values will get duplicated. Let’s look at the ENSDARG00000069142 example below.

-
human_zebrafish_mapped_df %>%
-  dplyr::filter(Gene == "ENSDARG00000069142")
+

If a zebrafish Ensembl gene ID maps to multiple human symbols, the associated Ensembl ID values will get duplicated in our output data. Let’s look at the ENSDARG00000069142 example below.

+
human_zebrafish_mapped_df %>%
+  dplyr::filter(Gene == "ENSDARG00000069142")
-

On the other hand, if you were to look at the original data associated to the zebrafish Ensembl IDs, when a human gene symbol maps to multiple zebrafish Ensembl IDs, the values will not get duplicated, but you will have multiple rows associated with that human symbol. Let’s look at the MATR3 example below.

-
human_zebrafish_mapped_df %>%
-  dplyr::filter(human_symbol == "MATR3")
+

On the other hand, if you were to look at the original data associated with the zebrafish Ensembl IDs, when a human gene symbol maps to multiple zebrafish Ensembl IDs, the Ensembl IDs will not get duplicated, but you will have multiple rows associated with that human symbol. Let’s look at the MATR3 example below.

+
human_zebrafish_mapped_df %>%
+  dplyr::filter(human_symbol == "MATR3")
+

We can see that we have multiple zebrafish Ensembl IDs that mapped to the same gene. (Notice that we also still have some duplicate zebrafish Ensembl ID/human symbol pairs here because the support column was different in the original data set! This is why we removed that column before counting above.)
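
The effect of the `support` column on the counts can be seen in a small made-up example. This sketch uses base R rather than the `dplyr` verbs in the notebook, and all IDs, symbols, and `support` values are invented for illustration.

```r
# Made-up mapping table: the pair (gene_a, SYM1) appears twice because two
# different `support` values were reported for it
toy_mapped_df <- data.frame(
  Gene = c("gene_a", "gene_a", "gene_b", "gene_c"),
  human_symbol = c("SYM1", "SYM1", "SYM1", "SYM2"),
  support = c("Ensembl", "OrthoDB", "Ensembl,NCBI", "ZFIN"),
  stringsAsFactors = FALSE
)

# Counting rows directly overstates how many genes map to SYM1
naive_counts <- vapply(
  split(toy_mapped_df$Gene, toy_mapped_df$human_symbol),
  length,
  integer(1)
)

# Dropping `support` and de-duplicating first counts each gene only once
unique_pairs <- unique(toy_mapped_df[, c("Gene", "human_symbol")])
deduplicated_counts <- vapply(
  split(unique_pairs$Gene, unique_pairs$human_symbol),
  length,
  integer(1)
)

naive_counts
deduplicated_counts
```

The naive count reports three rows for `SYM1`, while only two distinct genes actually map to it.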

4.6 Collapse zebrafish genes mapping to multiple human genes

Remember that if a zebrafish Ensembl gene ID maps to multiple human symbols, the values get duplicated. We can collapse the multi-mapped values into a list for each Ensembl ID so as not to have duplicate values in our data frame.

In the next chunk, we show how we can collapse all the human gene symbols into one column where we store them all for each unique zebrafish Ensembl ID.

-
collapsed_human_symbol_df <- human_zebrafish_mapped_df %>%
-  # Group by zebrafish Ensembl IDs
-  dplyr::group_by(Gene) %>%
-  # Collapse the mapped values in `human_zebrafish_mapped_df` into one column named
-  # `all_human_symbols` -- note that we will lose the `support` column in this summarizing step
-  dplyr::summarize(all_human_symbols = paste(human_symbol, collapse = ";"))
+
collapsed_human_symbol_df <- human_zebrafish_mapped_df %>%
+  # Group by zebrafish Ensembl IDs
+  dplyr::group_by(Gene) %>%
+  # Collapse the mapped values in `human_zebrafish_mapped_df` to an
+  # `all_human_symbols` column, removing any duplicated human symbols.
+  # Note that we will lose the `support` column in this summarizing step
+  dplyr::summarize(
+    # combine unique symbols with semicolons between them
+    all_human_symbols = paste(
+      sort(unique(human_symbol)),
+      collapse = ";"
+    )
+  )
## `summarise()` ungrouping output (override with `.groups` argument)
-
head(collapsed_human_symbol_df)
+
head(collapsed_human_symbol_df)
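
The collapsing expression `paste(sort(unique(...)), collapse = ";")` can be tried on a small made-up table to see how duplicates and unmapped genes are handled. This sketch groups with base R `split()` and `vapply()` instead of `dplyr::group_by()` and `dplyr::summarize()`; the IDs and symbols are invented for illustration.

```r
# Made-up mapping: gene_a has a repeated symbol, gene_c is unmapped (NA)
toy_mapped_df <- data.frame(
  Gene = c("gene_a", "gene_a", "gene_a", "gene_b", "gene_c"),
  human_symbol = c("SYM2", "SYM1", "SYM2", "SYM3", NA),
  stringsAsFactors = FALSE
)

# For each gene, keep unique symbols, sort them, and join with semicolons
collapsed_symbols <- vapply(
  split(toy_mapped_df$human_symbol, toy_mapped_df$Gene),
  function(symbols) paste(sort(unique(symbols)), collapse = ";"),
  character(1)
)
collapsed_symbols
```

Because `sort()` drops `NA` values by default, a gene with no human symbol ends up with an empty string rather than `NA`, which is worth keeping in mind when reading the output file.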

4.6.1 Write results to file

Now let’s write the list of human gene symbols for each unique zebrafish Ensembl ID to a file.

-
readr::write_tsv(
-  collapsed_human_symbol_df,
-  file.path(
-    results_dir,
-    # Replace with a relevant output file name
-    "GSE71270_zebrafish_ensembl_to_collapsed_human_gene_symbol.tsv"
-  )
-)
+
readr::write_tsv(
+  collapsed_human_symbol_df,
+  file.path(
+    results_dir,
+    # Replace with a relevant output file name
+    "GSE71270_zebrafish_ensembl_to_collapsed_human_gene_symbol.tsv"
+  )
+)

4.7 Collapse human genes mapping to multiple zebrafish genes

-

Since multiple zebrafish Ensembl gene IDs map to the same human symbol, we may want to identify which one of these mappings represents the “true” ortholog, i.e. which zebrafish gene is most similar to the human gene we are interested in. This is not at all straightforward! (see this paper for just one example) (Stamboulian et al. 2020). Gene duplications along the zebrafish lineage may result in complicated relationships among genes, especially with regard to divisions of function.

-

Simply combining values across zebrafish transcripts using an average may result in the loss of a lot of data and will likely not be representative of the zebrafish biology. One thing we might do to make the problem somewhat simpler is to reduce the number of multi-mapped genes by requiring a certain level of support for each mapping from across the various databases included in HCOP. This will not fully solve the problem (and may not even be desirable in some cases), but we present it here as an example of an approach one might take.

-

Therefore, we will use the support column to decide which mappings to retain. Let’s take a look at support.

-
head(human_zebrafish_mapped_df$support)
-
## [1] NA NA NA NA NA NA
-

Looks like we have a variety of databases for multiple mappings, but we do have some instances of only one database reported in support of the mapping. As we noted earlier, an orthology prediction where more than one of the databases concur would be considered reliable. Therefore, where we have multi-mapped zebrafish Ensembl gene IDs, we will take the mappings with more than one database to support the assertion.

+

Since multiple zebrafish Ensembl gene IDs map to the same human symbol, we may want to identify which one of these mappings represents the “true” ortholog, i.e. which zebrafish gene is most similar to the human gene we are interested in. This is not at all straightforward! (see this paper for just one example) (Stamboulian et al. 2020). Gene duplications along the zebrafish lineage may result in complicated relationships among genes, especially with regard to divisions of function.

+

Simply combining expression values across zebrafish transcripts that correspond to the same human gene using an average or other summary statistic may result in the loss of a lot of data and will likely not be representative of the zebrafish biology. One thing we might do to make the problem somewhat simpler is to reduce the number of multi-mapped genes by requiring a certain level of support for each mapping from across the various databases included in HCOP. This will not fully solve the problem (and may not even be desirable in some cases), but we present it here as an example of an approach one might take.

+

To do this, we will use the support column to decide which mappings to retain. Let’s take a look at support.

+
head(human_zebrafish_mapped_df$support)
+
## [1] "Inparanoid,PhylomeDB,HomoloGene,Ensembl,OMA,NCBI,ZFIN,OrthoMCL,Panther,OrthoDB"     
+## [2] "Inparanoid,EggNOG,HomoloGene,Treefam,Ensembl,OMA,NCBI,ZFIN,OrthoMCL,Panther,OrthoDB"
+## [3] "HomoloGene,Treefam,Ensembl,ZFIN,Panther,OrthoDB"                                    
+## [4] "OrthoDB"                                                                            
+## [5] "Inparanoid,HomoloGene,EggNOG,Treefam,NCBI,ZFIN,Panther"                             
+## [6] "OrthoMCL"
+

It looks like many mappings are supported by several databases, but some are supported by only a single database. As we noted earlier, an orthology prediction on which more than one database concurs would be considered more reliable. Therefore, where we have multi-mapped zebrafish Ensembl gene IDs, we will retain the mappings that more than one database supports.

Before we do, let’s take a look at how many multi-mapped genes there are in the data frame.

-
human_zebrafish_mapped_df %>%
-  # Group by human gene symbols
-  dplyr::group_by(human_symbol) %>%
-  # Count the number of rows in the dataframe for each symbol
-  dplyr::count() %>%
-  # Filter out the symbols without multimapped genes
-  dplyr::filter(n > 1)
+
human_zebrafish_mapped_df %>%
+  # Remove the `support` column
+  dplyr::select(Gene, human_symbol) %>%
+  # Remove any remaining duplicates
+  dplyr::distinct() %>%
+  # Count the number of rows in the dataframe for each symbol
+  dplyr::count(human_symbol) %>%
+  # Filter out the symbols without multimapped genes
+  dplyr::filter(n > 1)
-

Looks like we have 4,192 human gene symbols with multiple mappings.

+

Looks like we have 4,169 human gene symbols with multiple mappings.
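
The `dplyr::select()` and `dplyr::distinct()` steps matter if the same zebrafish gene and human symbol pair appears in more than one row, because counting rows directly would then overstate the number of multi-mapped symbols. A minimal sketch with a made-up data frame (the gene IDs, symbols, and `toy_df` name are hypothetical, not taken from the dataset) illustrates the difference:

```r
library(magrittr)

# Hypothetical toy data: the ENSDARG_A/GENE1 pair appears in two rows
toy_df <- data.frame(
  Gene = c("ENSDARG_A", "ENSDARG_A", "ENSDARG_B", "ENSDARG_C", "ENSDARG_D"),
  human_symbol = c("GENE1", "GENE1", "GENE2", "GENE2", "GENE3"),
  support = c("Ensembl,ZFIN", "OrthoDB", "Ensembl,OMA", "Panther", "ZFIN"),
  stringsAsFactors = FALSE
)

# Counting rows directly treats the repeated pair as a multi-mapping
naive_counts <- toy_df %>%
  dplyr::count(human_symbol) %>%
  dplyr::filter(n > 1)

# Dropping `support` and de-duplicating first counts distinct genes per symbol
distinct_counts <- toy_df %>%
  dplyr::select(Gene, human_symbol) %>%
  dplyr::distinct() %>%
  dplyr::count(human_symbol) %>%
  dplyr::filter(n > 1)

naive_counts$human_symbol
distinct_counts$human_symbol
```

In the toy data, only `GENE2` maps to more than one distinct zebrafish gene, which is what the de-duplicated count reports.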

Now let’s filter out the less reliable mappings.

-
filtered_zebrafish_ensembl_df <- human_zebrafish_mapped_df %>%
-  # Count the number of databases in the support column for each prediction
-  dplyr::mutate(n_databases = stringr::str_count(support, ",") + 1) %>%
-  # Group by human gene symbols
-  dplyr::group_by(human_symbol) %>%
-  # Now filter for the rows with more than one database in support for each human gene symbol
-  dplyr::filter(n_databases > 1)
-
-head(filtered_zebrafish_ensembl_df)
+
filtered_zebrafish_ensembl_df <- human_zebrafish_mapped_df %>%
+  # Count the number of databases in the support column
+  # by using the number of commas that separate the databases
+  dplyr::mutate(n_databases = stringr::str_count(support, ",") + 1) %>%
+  # Now filter to the rows where more than one database supports the mapping
+  dplyr::filter(n_databases > 1)
+
+head(filtered_zebrafish_ensembl_df)
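
To see the comma-counting step on its own, here is a small sketch using a few `support` strings like those printed earlier, plus an `NA` (the vector name `example_support` is made up for illustration). Because the databases are separated by commas, the number of databases is the number of commas plus one:

```r
# Example `support` values; `example_support` is a hypothetical name
example_support <- c(
  "HomoloGene,Treefam,Ensembl,ZFIN,Panther,OrthoDB",
  "OrthoDB",
  "OrthoMCL",
  NA
)

# Number of databases = number of commas + 1
n_databases <- stringr::str_count(example_support, ",") + 1
n_databases

# A missing `support` value stays NA here, and `dplyr::filter()` drops rows
# where the condition evaluates to NA
n_databases > 1
```

Only the first string would pass the `n_databases > 1` filter; the single-database entries and the missing value would be removed.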

Let’s count how many multi-mapped genes we have now.

-
filtered_zebrafish_ensembl_df %>%
-  # Group by human gene symbols
-  dplyr::group_by(human_symbol) %>%
-  # Count the number of rows in the dataframe for each symbol
-  dplyr::count() %>%
-  # Filter out the symbols without multimapped genes
-  dplyr::filter(n > 1)
+
filtered_zebrafish_ensembl_df %>%
+  # Remove the support column
+  dplyr::select(Gene, human_symbol) %>%
+  # Remove any remaining duplicates
+  dplyr::distinct() %>%
+  # Count the number of rows in the dataframe for each symbol
+  dplyr::count(human_symbol) %>%
+  # Filter to the symbols with multimapped genes
+  dplyr::filter(n > 1)
-

Now we only have 1,803 multi-mapped genes, compared to the 4,192 that we began with. Although we haven’t filtered down to zero multi-mapped genes, we have hopefully removed some of the less reliable mappings.

+

Now we only have 1,695 multi-mapped genes, compared to the 4,169 that we began with. Although we haven’t filtered down to zero multi-mapped genes, we have hopefully removed some of the less reliable mappings.

4.7.1 Write results to file

Now let’s write our filtered_zebrafish_ensembl_df object, which holds the better-supported zebrafish Ensembl IDs for each human gene symbol, to a file.

-
readr::write_tsv(
-  filtered_zebrafish_ensembl_df,
-  file.path(
-    results_dir,
-    # Replace with a relevant output file name
-    "GSE71270_filtered_zebrafish_ensembl_to_human_gene_symbol.tsv"
-  )
-)
+
readr::write_tsv(
+  filtered_zebrafish_ensembl_df,
+  file.path(
+    results_dir,
+    # Replace with a relevant output file name
+    "GSE71270_filtered_zebrafish_ensembl_to_human_gene_symbol.tsv"
+  )
+)
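
As an optional sanity check, you could read a written TSV back in and confirm that it matches what was written. The sketch below uses a temporary file and a made-up data frame (`toy_df`, `toy_file`, and the IDs are hypothetical) rather than the real results file:

```r
# Hypothetical stand-in for the filtered data frame
toy_df <- data.frame(
  Gene = c("ENSDARG_A", "ENSDARG_B"),
  human_symbol = c("GENE1", "GENE2"),
  stringsAsFactors = FALSE
)

# Write to a temporary location so nothing in `results/` is overwritten
toy_file <- tempfile(fileext = ".tsv")
readr::write_tsv(toy_df, toy_file)

# Read the file back in and compare it to what we wrote
toy_reread_df <- readr::read_tsv(toy_file, col_types = readr::cols())
nrow(toy_reread_df) == nrow(toy_df)
```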

5 Resources for further learning

6 Session info

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording which versions of software and packages you used to run the analysis.

-
# Print session info
-sessioninfo::session_info()
-
## ─ Session info ───────────────────────────────────────────────────────────────
+
# Print session info
+sessioninfo::session_info()
+
## ─ Session info ─────────────────────────────────────────────────────
 ##  setting  value                       
 ##  version  R version 4.0.2 (2020-06-22)
 ##  os       Ubuntu 20.04 LTS            
@@ -3345,13 +4205,13 @@ 

6 Session info

## collate en_US.UTF-8 ## ctype en_US.UTF-8 ## tz Etc/UTC -## date 2020-10-16 +## date 2020-12-21 ## -## ─ Packages ─────────────────────────────────────────────────────────────────── +## ─ Packages ───────────────────────────────────────────────────────── ## package * version date lib source ## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.0) ## backports 1.1.10 2020-09-15 [1] RSPM (R 4.0.2) -## cli 2.0.2 2020-02-28 [1] RSPM (R 4.0.0) +## cli 2.1.0 2020-10-12 [1] RSPM (R 4.0.2) ## crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.0) ## digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0) ## dplyr 1.0.2 2020-08-18 [1] RSPM (R 4.0.2) @@ -3370,23 +4230,23 @@

6 Session info

## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0) ## pillar 1.4.6 2020-07-10 [1] RSPM (R 4.0.2) ## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.0) +## ps 1.4.0 2020-10-07 [1] RSPM (R 4.0.2) ## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.0) ## R.cache 0.14.0 2019-12-06 [1] RSPM (R 4.0.0) ## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.2) ## R.oo 1.24.0 2020-08-26 [1] RSPM (R 4.0.2) ## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.2) ## R6 2.4.1 2019-11-12 [1] RSPM (R 4.0.0) -## Rcpp 1.0.5 2020-07-06 [1] RSPM (R 4.0.2) -## readr 1.3.1 2018-12-21 [1] RSPM (R 4.0.2) +## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.2) ## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.0) -## rlang 0.4.7 2020-07-09 [1] RSPM (R 4.0.2) +## rlang 0.4.8 2020-10-08 [1] RSPM (R 4.0.2) ## rmarkdown 2.4 2020-09-30 [1] RSPM (R 4.0.2) ## rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0) ## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.0) ## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.2) ## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.0) ## styler 1.3.2 2020-02-23 [1] RSPM (R 4.0.0) -## tibble 3.0.3 2020-07-10 [1] RSPM (R 4.0.2) +## tibble 3.0.4 2020-10-12 [1] RSPM (R 4.0.2) ## tidyselect 1.1.0 2020-05-11 [1] RSPM (R 4.0.0) ## vctrs 0.3.4 2020-08-29 [1] RSPM (R 4.0.2) ## withr 2.3.0 2020-09-22 [1] RSPM (R 4.0.2) @@ -3400,23 +4260,28 @@

6 Session info

References

-

Gray K. A., B. Yates, R. L. Seal, M. W. Wright, and E. A. Bruford, 2015 Genenames.org: The hgnc resources in 2015. Nucleic Acids Res 43. https://doi.org/10.1038/nature11327

+

Gray K. A., B. Yates, R. L. Seal, M. W. Wright, and E. A. Bruford, 2015 Genenames.org: The HGNC resources in 2015. Nucleic Acids Research 43. https://doi.org/10.1093/nar/gku1071

-

HGNC team, 2020 HCOP help

+

HGNC Team, 2020 HCOP help. https://www.genenames.org/help/hcop/

Stamboulian M., R. F. Guerrero, M. W. Hahn, and P. Radivojac, 2020 The ortholog conjecture revisited: The value of orthologs and paralogs in function prediction. Bioinformatics 36: i219–i226. https://doi.org/10.1093/bioinformatics/btaa468

-

Tregnago C., E. Manara, M. Zampini, V. Bisio, and C. Borga et al., 2016 CREB engages C/EBPδ to initiate leukemogenesis. Leukemia 30: 1887–1896.

+

Tregnago C., E. Manara, M. Zampini, V. Bisio, and C. Borga et al., 2016 CREB engages C/EBPδ to initiate leukemogenesis. Leukemia 30: 1887–1896. https://doi.org/10.1038/leu.2016.98

-

Wright M. W., T. A. Eyre, M. J. Lush, S. Povey, and E. A. Bruford, 2005 HCOP: The hgnc comparison of orthology predictions search tool. Mammalian Genome 16: 827–8. https://doi.org/10.1007/s00335-005-0103-2

+

Wright M. W., T. A. Eyre, M. J. Lush, S. Povey, and E. A. Bruford, 2005 HCOP: The HGNC comparison of orthology predictions search tool. Mammalian Genome 16: 827–8. https://doi.org/10.1007/s00335-005-0103-2

+

diff --git a/02-microarray/pathway-analysis_microarray_00_intro.Rmd b/02-microarray/pathway-analysis_microarray_00_intro.Rmd deleted file mode 100644 index 696e6724..00000000 --- a/02-microarray/pathway-analysis_microarray_00_intro.Rmd +++ /dev/null @@ -1,31 +0,0 @@ ---- -title: "Pathway Analysis Introduction" -output: - html_notebook: - toc: true - toc_float: true - number_sections: true ---- - -## Background - -Over-representation analysis (ORA) is a method of pathway or gene set analysis -where one can ask if a set of genes (e.g., those differentially expressed -using some cutoff) shares more or less genes with gene sets/pathways than -we would expect at random. -The other methodologies introduced throughout this module such as QuSAGE and -GSEA can require more samples than a different expression analysis. -For instance, the sample label permutation step of GSEA is reported to -perform poorly with 7 samples or less in each group -([](https://doi.org/10.1093/nar/gkt660)). -It is not uncommon to have n ~ 3 for each group in a treatment-control -transcriptomic study, at which point identifying differentially expressed genes -is possible. -If you are interested in performing pathway analysis on a small study, ORA may -be your best bet. -There are some limitations to ORA methods to be aware such as ignoring -gene-gene correlation. -See [](https://doi.org/10.1371/journal.pcbi.1002375) -to learn more about the different types of pathway analysis and their -limitations. 
- diff --git a/02-microarray/pathway-analysis_microarray_02_ora.Rmd b/02-microarray/pathway-analysis_microarray_01_ora.Rmd similarity index 62% rename from 02-microarray/pathway-analysis_microarray_02_ora.Rmd rename to 02-microarray/pathway-analysis_microarray_01_ora.Rmd index 74e8b73a..f2d845e6 100644 --- a/02-microarray/pathway-analysis_microarray_02_ora.Rmd +++ b/02-microarray/pathway-analysis_microarray_01_ora.Rmd @@ -1,7 +1,7 @@ --- title: "Over-representation analysis - Microarray" author: "CCDL for ALSF" -date: "`r format(Sys.time(), '%B %Y')`" +date: "November 2020" output: html_notebook: toc: true @@ -11,34 +11,58 @@ output: # Purpose of this analysis -This example is one of pathway analysis module set, we recommend looking at the [pathway analysis introduction](https://alexslemonade.github.io/refinebio-examples/02-microarray/pathway-analysis_microarray_00_intro.html) to help you determine which pathway analysis method is best suited for your purposes. +This example is one of pathway analysis module set, we recommend looking at the [pathway analysis table below](#how-to-choose-a-pathway-analysis) to help you determine which pathway analysis method is best suited for your purposes. -This particular example analysis shows how you can use over-representation analysis (ORA) to determine if a set of genes (e.g., those differentially expressed using some cutoff) shares more or fewer genes with gene sets/pathways than we would expect at random. -This pathway analysis method does not require any particular sample size, since the only input from your dataset is a set of genes of interest [@Yaari2013]. +This particular example analysis shows how you can use over-representation analysis (ORA) to determine if a set of genes (e.g., those differentially expressed using some cutoff) shares more or fewer genes with gene sets/pathways than we would expect by chance. 
+ +ORA is a broadly applicable technique that may be good in scenarios where your dataset or scientific questions don't fit the requirements of other pathway analyses methods. +It also does not require any particular sample size, since the only input from your dataset is a set of genes of interest [@Yaari2013]. + +If you have differential expression results or something with a gene-level ranking and a two-group comparison, we recommend considering [GSEA](https://alexslemonade.github.io/refinebio-examples/02-microarray/pathway-analysis_microarray_02_gsea.html) for your pathway analysis questions. ⬇️ [**Jump to the analysis code**](#analysis) ⬇️ +### What is pathway analysis? + +Pathway analysis refers to any one of many techniques that uses predetermined sets of genes that are related or coordinated in their expression in some way (e.g., participate in the same molecular process, are regulated by the same transcription factor) to interpret a high-throughput experiment. +In the context of [refine.bio](https://www.refine.bio/), we use these techniques to analyze and interpret genome-wide gene expression experiments. +The rationale for performing pathway analysis is that looking at the pathway-level may be more biologically meaningful than considering individual genes, especially if a large number of genes are differentially expressed between conditions of interest. +In addition, many relatively small changes in the expression values of genes in the same pathway could lead to a phenotypic outcome and these small changes may go undetected in differential gene expression analysis. + +We highly recommend taking a look at [Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002375) from @Khatri2012 for a more comprehensive overview. 
+We have provided primary publications and documentation of the methods we will introduce below as well as some recommended reading in the [`Resources for further learning` section](#resources-for-further-learning). + +### How to choose a pathway analysis? + +This table summarizes the pathway analyses examples in this module. + +|Analysis|What is required for input|What output looks like |✅ Pros| ⚠️ Cons| +|--------|--------------------------|-----------------------|-------|-------| +|[**ORA (Over-representation Analysis)**](https://alexslemonade.github.io/refinebio-examples/02-microarray/pathway-analysis_microarray_01_ora.html)|A list of gene IDs (no stats needed)|A per-pathway hypergeometric test result|- Simple

- Inexpensive computationally to calculate p-values| - Requires arbitrary thresholds and ignores any statistics associated with a gene

- Assumes independence of genes and pathways| +|[**GSEA (Gene Set Enrichment Analysis)**](https://alexslemonade.github.io/refinebio-examples/02-microarray/pathway-analysis_microarray_02_gsea.html)|A list of genes IDs with gene-level summary statistics|A per-pathway enrichment score|- Includes all genes (no arbitrary threshold!)

- Attempts to measure coordination of genes|- Permutations can be expensive

- Does not account for pathway overlap

- Two-group comparisons not always appropriate/feasible| +|[**GSVA (Gene Set Variation Analysis)**](https://alexslemonade.github.io/refinebio-examples/02-microarray/pathway-analysis_microarray_03_gsva.html)|A gene expression matrix (like what you get from refine.bio directly)|Pathway-level scores on a per-sample basis|- Does not require two groups to compare upfront

- Normally distributed scores|- Scores are not a good fit for gene sets that contain genes that go up AND down

- Method doesn’t assign statistical significance itself

- Recommended sample size n > 10| + # How to run this example For general information about our tutorials and the basic software packages you will need, please see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-this-tutorial-is-structured). -We recommend taking a look at our [Resources for Learning R](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#resources-for-learning-r) if you have not written code in R before. +We recommend taking a look at our [Resources for Learning R](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#resources-for-learning-r) if you have not written code in R before. ## Obtain the `.Rmd` file -To run this example yourself, [download the `.Rmd` for this analysis by clicking this link](https://alexslemonade.github.io/refinebio-examples/02-microarray/pathway-analysis_microarray_02_ora_with_webgestaltr.Rmd). +To run this example yourself, [download the `.Rmd` for this analysis by clicking this link](https://alexslemonade.github.io/refinebio-examples/02-microarray/pathway-analysis_microarray_01_ora.Rmd). -Clicking this link will most likely send this to your downloads folder on your computer. +Clicking this link will most likely send this to your downloads folder on your computer. Move this `.Rmd` file to where you would like this example and its files to be stored. You can open this `.Rmd` file in RStudio and follow the rest of these steps from there. (See our [section about getting started with R notebooks](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) if you are unfamiliar with `.Rmd` files.) -## Set up your analysis folders +## Set up your analysis folders Good file organization is helpful for keeping your data analysis project on track! 
-We have set up some code that will automatically set up a folder structure for you. -Run this next chunk to set up your folders! +We have set up some code that will automatically set up a folder structure for you. +Run this next chunk to set up your folders! -If you have trouble running this chunk, see our [introduction to using `.Rmd`s](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) for more resources and explanations. +If you have trouble running this chunk, see our [introduction to using `.Rmd`s](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) for more resources and explanations. ```{r} # Create the data folder if it doesn't exist @@ -47,7 +71,7 @@ if (!dir.exists("data")) { } # Define the file path to the plots directory -plots_dir <- "plots" # Can replace with path to desired output plots directory +plots_dir <- "plots" # Create the plots folder if it doesn't exist if (!dir.exists(plots_dir)) { @@ -55,7 +79,7 @@ if (!dir.exists(plots_dir)) { } # Define the file path to the results directory -results_dir <- "results" # Can replace with path to desired output results directory +results_dir <- "results" # Create the results folder if it doesn't exist if (!dir.exists(results_dir)) { @@ -75,26 +99,29 @@ We have provided this file for you and the code in this notebook will read in th ## About the dataset we are using for this example For this example analysis, we will use this [CREB overexpression zebrafish experiment](https://www.refine.bio/experiments/GSE71270/creb-overexpression-induces-leukemia-in-zebrafish-by-blocking-myeloid-differentiation-process) [@Tregnago2016]. -@Tregnago2016 measured microarray gene expression of zebrafish samples overexpressing human CREB, as well as control samples. 
+@Tregnago2016 used microarrays to measure gene expression in ten zebrafish samples: five overexpressing human CREB and five control samples. ## Check out our file structure! -Your new analysis folder should contain: +Your new analysis folder should contain: - The example analysis `.Rmd` you downloaded -- A folder called `data` (currently empty) +- A folder called `data` (currently empty) - A folder for `plots` (currently empty) - A folder for `results` (currently empty) - + Your example analysis folder should contain your `.Rmd` and three empty folders (which won't be empty for long!). -If the concept of a "file path" is unfamiliar to you; we recommend taking a look at our [section about file paths](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#an-important-note-about-file-paths-and-Rmds). +If the concept of a "file path" is unfamiliar to you, we recommend taking a look at our [section about file paths](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#an-important-note-about-file-paths-and-Rmds). # Using a different refine.bio dataset with this analysis? -If you'd like to adapt an example analysis to use a different dataset from [refine.bio](https://www.refine.bio/), we recommend placing the files in the `data/` directory you created and changing the filenames and paths in the notebook to match these files (we've put comments to signify where you would need to change the code). +The file we use here has several columns of differential expression summary statistics. +If you'd like to adapt an example analysis to use a different dataset from [refine.bio](https://www.refine.bio/), we recommend replacing the `dge_results_file` with a different file path to read in a similar table of genes with the information that you are interested in. 
+If your gene table differs, many steps will need to be changed or deleted entirely depending on the contents of that file (particularly in the [`Determine our genes of interest list` section](#determine-our-genes-of-interest-list)). + We suggest saving plots and results to `plots/` and `results/` directories, respectively, as these are automatically created by the notebook. -From here you can customize this analysis example to fit your own scientific questions and preferences. +From here you can customize this analysis example to fit your own scientific questions and preferences. *** @@ -106,7 +133,7 @@ From here you can customize this analysis example to fit your own scientific que See our Getting Started page with [instructions for package installation](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#what-you-need-to-install) for a list of the other software you will need, as well as more tips and resources. -In this analysis, we will be using [`clusterProfiler`](https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html) package to perform ORA and the [`msigdbr`](https://cran.r-project.org/web/packages/msigdbr/index.html) package which contains gene sets from the [Molecular Signatures Database (MSigDB)](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp) already in the tidy format required by `clusterProfiler` [@Igor2020; @Subramanian2005]. +In this analysis, we will be using the [`clusterProfiler`](https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html) package to perform ORA and the [`msigdbr`](https://cran.r-project.org/web/packages/msigdbr/index.html) package which contains gene sets from the [Molecular Signatures Database (MSigDB)](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp) already in the tidy format required by `clusterProfiler` [@Yu2012; @Dolgalev2020; @Subramanian2005; @Liberzon2011]. 
We will also need the [`org.Dr.eg.db`](https://bioconductor.org/packages/release/data/annotation/html/org.Dr.eg.db.html) package to perform gene identifier conversion and [`ggupset`](https://cran.r-project.org/web/packages/ggupset/readme/README.html) to make an UpSet plot [@Carlson2019-zebrafish; @Ahlmann-Eltze2020]. @@ -135,7 +162,7 @@ if (!("org.Dr.eg.db" %in% installed.packages())) { Attach the packages we need for this analysis. -```{r} +```{r message=FALSE} # Attach the library library(clusterProfiler) @@ -149,74 +176,100 @@ library(org.Dr.eg.db) library(magrittr) ``` -## Import data +## Download data file + +For ORA, we only need a list of gene IDs as our input, so this example can work in any situation where you have a gene list and want to know more about which biological pathways it shares genes with. -We will read in the differential expression results we will download from online. +For this example, we will read in results from a differential expression analysis that we have already performed. +Rather than reading from a local file, we will download the results table directly from a URL. These results are from a zebrafish microarray experiment we used for [differential expression analysis for two groups](https://alexslemonade.github.io/refinebio-examples/02-microarray/differential-expression_microarray_02_2-groups.html) using [`limma`](https://bioconductor.org/packages/release/bioc/html/limma.html) [@Ritchie2015]. The table contains Ensembl gene IDs, log fold-changes for each group, and adjusted p-values (FDR in this case). We can identify differentially regulated genes by filtering these results and use this list as input to ORA. Instead of using the URL below, you can use a file path to a TSV file with your desired gene list results. -First we will assign the URL to its own variable called, `dge_url`. +First we will assign the URL to its own variable called `dge_url`. 
```{r} # Define the url to your differential expression results file dge_url <- "https://refinebio-examples.s3.us-east-2.amazonaws.com/02-microarray/results/GSE71270/GSE71270_limma_results.tsv" ``` -Read in the file that has differential expression results. -Here we are using the URL we set up above, but this can be a local file path instead _i.e._ you can replace `dge_url` in the code below with a path to file you have on your computer like: `file.path("results", "GSE71270_limma_results.tsv")`. +We will also declare a file path to where we want this file to be downloaded to and we can use the same file path later for reading the file into R. ```{r} -# Read in the contents of your differential expression results file -# `dge_url` can be replaced with a file path to a TSV file with your -# desired gene list results -dge_df <- readr::read_tsv(dge_url) +dge_results_file <- file.path( + results_dir, + "GSE71270_limma_results.tsv" +) ``` -`read_tsv()` can read TSV files online and doesn't necessarily require you download the file first. -Let's take a look at what these contrast results from the differential expression analysis look like. +Using the URL (`dge_url`) and file path (`dge_results_file`) we can download the file and use the `destfile` argument to specify where it should be saved. ```{r} -dge_df +download.file( + dge_url, + # The file will be saved to this location and with this name + destfile = dge_results_file +) +``` + +Now let's double check that the results file is in the right place. + +```{r} +# Check if the file exists +file.exists(dge_results_file) ``` -## Getting familiar with `clusterProfiler`'s options +## Import data -Let's take a look at what organisms the package supports. +Read in the file that has differential expression results. 
```{r} -msigdbr_species() +# Read in the contents of the differential expression results file +dge_df <- readr::read_tsv(dge_results_file) ``` -The data we're interested in here comes from zebrafish samples, so we can obtain just the gene sets relevant to _D. rerio_ with the `species` argument to `msigdbr()`. +Note that `read_tsv()` can also read TSV files directly from a URL and doesn't necessarily require you download the file first. +If you wanted to use that feature, you could replace the call above with `readr::read_tsv(dge_url)` and skip the download steps. + +Let's take a look at what these results from the differential expression analysis look like. ```{r} -dr_msigdb_df <- msigdbr(species = "Danio rerio") +dge_df ``` -MSigDB contains [8 different gene set collections](https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp) [@Subramanian2005]. +## Getting familiar with MSigDB gene sets available via `msigdbr` - H: hallmark gene sets - C1: positional gene sets - C2: curated gene sets - C3: motif gene sets - C4: computational gene sets - C5: GO gene sets - C6: oncogenic signatures - C7: immunologic signatures +The Molecular Signatures Database (MSigDB) is a resource that contains annotated gene sets that can be used for pathway or gene set analyses [@Subramanian2005; @Liberzon2011]. +We can use the `msigdbr` package to access these gene sets in a format compatible with the package we'll use for analysis, `clusterProfiler` [@Dolgalev2020; @Yu2012]. -In this example, we will use pathways that are gene sets considered to be "canonical representations of a biological process compiled by domain experts" and are a subset of `C2: curated gene sets` [@Subramanian2005]. +The gene sets available directly from MSigDB are applicable to human studies. +`msigdbr` also supports commonly studied model organisms. -Specifically, we will use the [KEGG (Kyoto Encyclopedia of Genes and Genomes)](https://www.genome.jp/kegg/) pathways [@Kanehisa2000]. 
+Let's take a look at what organisms the package supports with `msigdbr_species()`. -First, let's take a look at what information is included in this data frame. +```{r} +msigdbr_species() +``` + +The data we're interested in here comes from zebrafish samples, so we can obtain only the gene sets relevant to _D. rerio_ with the `species` argument to `msigdbr()`. ```{r} +dr_msigdb_df <- msigdbr(species = "Danio rerio") +``` + +MSigDB contains [8 different gene set collections](https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp) [@Subramanian2005; @Liberzon2011] that are distinguished by how they are derived (e.g., computationally mined, curated). +In this example, we will use pathways that are gene sets considered to be "canonical representations of a biological process compiled by domain experts" and are a subset of `C2: curated gene sets` [@Subramanian2005; @Liberzon2011]. + +Specifically, we will use the [KEGG (Kyoto Encyclopedia of Genes and Genomes)](https://www.genome.jp/kegg/) pathways [@Kanehisa2000]. + +First, let's take a look at what information is included in the data frame returned by `msigdbr()`. + +```{r rownames.print = FALSE} head(dr_msigdb_df) ``` -We will need to use `gs_cat` and `gs_subcat` columns to construct a filter step that will only keep curated gene sets and KEGG pathways. +We will need to use `gs_cat` and `gs_subcat` columns to construct a filter step that will only keep curated gene sets and KEGG pathways. ```{r} # Filter the zebrafish data frame to the KEGG pathways that are included in the @@ -228,13 +281,11 @@ dr_kegg_df <- dr_msigdb_df %>% ) ``` -Note: We could have specified that we wanted the KEGG gene sets using the `category` and `subcategory` arguments of `msigdbr()`, but we're going for general steps! -- use `?msigdbr` to see more information. 
- The `clusterProfiler()` function we will use requires a data frame with two columns, where one column contains the term identifier or name and one column contains gene identifiers that match our gene lists we want to check for enrichment. Our data frame with KEGG terms contains Entrez IDs and gene symbols. -In our differential expression results data frame, `dge_df` we have Ensembl gene identifiers. +In our differential expression results data frame, `dge_df` we have Ensembl gene identifiers. So we will need to convert our Ensembl IDs into either gene symbols or Entrez IDs. ## Gene identifier conversion @@ -255,18 +306,20 @@ keytypes(org.Dr.eg.db) Even though we'll use this package to convert from Ensembl gene IDs (`ENSEMBL`) to gene symbols (`SYMBOL`), we could just as easily use it to convert from an Ensembl transcript ID (`ENSEMBLTRANS`) to Entrez IDs (`ENTREZID`). -The function we will use to map from Ensembl gene IDs to gene symbols is called `mapIds()`. +The function we will use to map from Ensembl gene IDs to gene symbols is called `mapIds()` and comes from the `AnnotationDbi` package. ```{r} # This returns a named vector which we can convert to a data frame, where # the keys (Ensembl IDs) are the names -symbols_vector <- mapIds(org.Dr.eg.db, # Specify the annotation package +symbols_vector <- mapIds( + # Replace with annotation package for the organism relevant to your data + org.Dr.eg.db, # The vector of gene identifiers we want to map keys = dge_df$Gene, - # The type of gene identifier we want returned - column = "SYMBOL", - # What type of gene identifiers we're starting with + # Replace with the type of gene identifiers in your data keytype = "ENSEMBL", + # Replace with the type of gene identifiers you would like to map to + column = "SYMBOL", # In the case of 1:many mappings, return the # first one. This is default behavior! 
multiVals = "first" @@ -291,13 +344,13 @@ gene_key_df <- data.frame( dplyr::filter(!is.na(gene_symbol)) ``` -Let's see a preview of `gene_key_df`. +Let's see a preview of `gene_key_df`. -```{r} +```{r rownames.print = FALSE} head(gene_key_df) ``` -Now we are ready to add the `gene_key_df` to our data frame with the differential expression stats, `dge_df`. +Now we are ready to add the `gene_key_df` to our data frame with the differential expression stats, `dge_df`. Here we're using a `dplyr::left_join()` because we only want to retain the genes that have gene symbols and this will filter out anything in our `dge_df` that does not have gene symbols when we join using the Ensembl gene identifiers. ```{r} @@ -311,35 +364,36 @@ dge_annot_df <- gene_key_df %>% ) ``` -Let's take a look at what this data frame looks like. +Let's take a look at what this data frame looks like. -```{r} +```{r rownames.print = FALSE} # Print out a preview head(dge_annot_df) ``` ## Over-representation Analysis (ORA) -Over-representation testing using `clusterProfiler` is based on a hypergeometric test [@clusterProfiler-book]. - -\(p = 1 - \displaystyle\sum_{i = 0}^{k-1}\frac{ {M \choose i}{ {N-M} \choose {n-i} } } { {N \choose n} }\) +Over-representation testing using `clusterProfiler` is based on a hypergeometric test (often referred to as Fisher's exact test) [@clusterProfiler-book]. +For more background on hypergeometric tests, this [handy tutorial](https://dputhier.github.io/ASG/practicals/go_statistics_td/go_statistics_td_2015.html) explains more about how hypergeometric tests work [@Puthier2015]. -Where `N` is the number of genes in the background distribution, `M` is the number of genes in a pathway, `n` is the number of genes we are interested in (our differentially expressed genes), and `k` is the number of genes that overlap between the pathway and our genes of interest. 
+We will need to provide to `clusterProfiler` two genes lists: -So, we will need to provide to `clusterProfiler` two genes lists: - -1) Our genes of interest (`n`) -2) What genes were in our total background set (`N`). (All genes that originally had an opportunity to be measured). +1) Our genes of interest +2) What genes were in our total background set. (All genes that originally had an opportunity to be measured). ## Determine our genes of interest list -We will use our differential expression results to get a genes of interest list. -Let's use our adjusted p values as a cutoff. +We will use our differential expression results to get a genes of interest list. +Let's use our adjusted p values as a cutoff. + +This step is highly variable depending on what your gene list is, what information it contains and what your goals are. +You may want to delete this next chunk entirely if you supply an already determined list of genes OR you may need to adjust the cutoffs and column names. ```{r} # Select genes that are below a cutoff genes_of_interest <- dge_annot_df %>% - # Here we want the top differentially expressed genes and we will use downregulated genes + # Here we want the top differentially expressed genes and we will use + # downregulated genes dplyr::filter(adj.P.Val < 0.05, logFC < -1) %>% # We are extracting the gene symbols as a vector dplyr::pull(gene_symbol) @@ -349,9 +403,9 @@ There are a lot of ways we could make a genes of interest list, and using a p-va ORA generally requires you make some sort of arbitrary decision to obtain your genes of interest list and this is one of the approach's weaknesses -- to get to a gene list we've removed all other context. -Because one `gene_symbol` may map to multiple Ensembl IDs, we need to make sure we have no repeated gene symbols in this list. +Because one `gene_symbol` may map to multiple Ensembl IDs, we need to make sure we have no repeated gene symbols in this list. 
-```{r} +```{r rownames.print = FALSE} # Reduce to only unique gene symbols genes_of_interest <- unique(as.character(genes_of_interest)) @@ -361,7 +415,7 @@ head(genes_of_interest) ## Determine our background set gene list -Sometimes folks consider genes from the entire genome to comprise the background, but for our microarray data, we should consider all genes that were measured as our background set. +Sometimes folks consider genes from the entire genome to comprise the background, but for our microarray data, we should consider all genes that were measured as our background set. In other words, if we are unable to detect a gene, it should not be in our background set. We can obtain our detected genes list from our data frame, `dge_annot_df` (which we haven't done filtering on). @@ -378,7 +432,7 @@ Now that we have our background set, our genes of interest, and our pathway info kegg_ora_results <- enricher( gene = genes_of_interest, # A vector of your genes of interest pvalueCutoff = 0.1, # Can choose a FDR cutoff - pAdjustMethod = "BH", # What method for multiple testing correction should we use + pAdjustMethod = "BH", # Method to be used for multiple testing correction universe = background_set, # A vector containing your background set genes # The pathway information should be a data frame with a term name or # identifier and the gene identifiers @@ -392,9 +446,6 @@ kegg_ora_results <- enricher( *Note: using `enrichKEGG()` is a shortcut for doing ORA using KEGG, but the approach we covered here can be used with any gene sets you'd like!* -What is returned by `enricher()`? -You can run `View(kegg_ora_results)` or click on the object in your Environment panel. - The information we're most likely interested in is in the `results` slot. Let's convert this into a data frame that we can write to file. @@ -402,19 +453,19 @@ Let's convert this into a data frame that we can write to file. 
kegg_result_df <- data.frame(kegg_ora_results@result) ``` -Let's print out a sneak peek of it here and take a look at how many sets do we have that fit our cutoff of `0.1` FDR? +Let's print out a sneak peek of it here and take a look at how many sets do we have that fit our cutoff of `0.1` FDR? -```{r} +```{r rownames.print = FALSE} kegg_result_df %>% dplyr::filter(p.adjust < 0.1) ``` -Looks like there are four KEGG sets returned as significant at FDR of `0.1`. +Looks like there are four KEGG sets returned as significant at FDR of `0.1`. ## Visualizing results We can use a dot plot to visualize our significant enrichment results. -The `enrichplot::dotplot()` function will only plot gene sets that are significant according to the multiple testing corrected p values (in the `p.adjust` column) and the `pvalueCutoff` you provided in the [`enricher()` step](#run-ora-using-the-enricher-function). +The `enrichplot::dotplot()` function will only plot gene sets that are significant according to the multiple testing corrected p values (in the `p.adjust` column) and the `pvalueCutoff` you provided in the [`enricher()` step](#run-ora-using-the-enricher-function). ```{r} enrich_plot <- enrichplot::dotplot(kegg_ora_results) @@ -427,7 +478,7 @@ Use `?enrichplot::dotplot` to see the help page for more about how to use this f This plot is arguably more useful when we have a large number of significant pathways. -Let's save it to a PNG. +Let's save it to a PNG. ```{r} ggplot2::ggsave(file.path(plots_dir, "GSE71270_ora_enrich_plot.png"), @@ -438,17 +489,20 @@ ggplot2::ggsave(file.path(plots_dir, "GSE71270_ora_enrich_plot.png"), We can use an [UpSet plot](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4720993/) to visualize the **overlap** between the gene sets that were returned as significant. 
```{r} -enrichplot::upsetplot(kegg_ora_results) +upset_plot <- enrichplot::upsetplot(kegg_ora_results) + +# Print out the plot here +upset_plot ``` -See that `KEGG_ANTIGEN_PROCESSING_AND_PRESENTATION` and `KEGG_LYSOSOME` have all their genes in common. +See that `KEGG_ANTIGEN_PROCESSING_AND_PRESENTATION` and `KEGG_LYSOSOME` have all their genes in common. Gene sets or pathways aren't independent! -Let's also save this to a PNG. +Let's also save this to a PNG. ```{r} ggplot2::ggsave(file.path(plots_dir, "GSE71270_ora_upset_plot.png"), - plot = enrich_plot + plot = upset_plot ) ``` @@ -466,18 +520,20 @@ readr::write_tsv( # Resources for further learning +- [Hypergeometric test exercises](https://dputhier.github.io/ASG/practicals/go_statistics_td/go_statistics_td_2015.html)[@Puthier2015]. +- [clusterProfiler ORA tutorial](https://learn.gencore.bio.nyu.edu/rna-seq-analysis/over-representation-analysis/#:~:text=Over%2Drepresentation%20(or%20enrichment),a%20subset%20of%20your%20data.) - [clusterProfiler paper](https://doi.org/10.1089/omi.2011.0118) [@Yu2012]. - [clusterProfiler book](https://yulab-smu.github.io/clusterProfiler-book/index.html) [@clusterProfiler-book]. - [This handy review](https://doi.org/10.1371/journal.pcbi.1002375) which summarizes the different types of pathway analysis and their limitations [@Khatri2012]. # Session info -At the end of every analysis, before saving your notebook, we recommend printing out your session info. -This helps make your code more reproducible by recording what versions of software and packages you used to run this. +At the end of every analysis, before saving your notebook, we recommend printing out your session info. +This helps make your code more reproducible by recording what versions of software and packages you used to run this. 
```{r} # Print session info -sessionInfo() +sessioninfo::session_info() ``` # References diff --git a/02-microarray/pathway-analysis_microarray_02_ora.html b/02-microarray/pathway-analysis_microarray_01_ora.html similarity index 89% rename from 02-microarray/pathway-analysis_microarray_02_ora.html rename to 02-microarray/pathway-analysis_microarray_01_ora.html index 4fd4efeb..ddd2c5ec 100644 --- a/02-microarray/pathway-analysis_microarray_02_ora.html +++ b/02-microarray/pathway-analysis_microarray_01_ora.html @@ -1263,25 +1263,22 @@ }; - - + + code.sourceCode > span { display: inline-block; line-height: 1.25; } + code.sourceCode > span { color: inherit; text-decoration: inherit; } + code.sourceCode > span:empty { height: 1.2em; } + .sourceCode { overflow: visible; } + code.sourceCode { white-space: pre; position: relative; } + div.sourceCode { margin: 1em 0; } + pre.sourceCode { margin: 0; } + @media screen { + div.sourceCode { overflow: auto; } + } + @media print { + code.sourceCode { white-space: pre-wrap; } + code.sourceCode > span { text-indent: -5em; padding-left: 5em; } + } + pre.numberSource code + { counter-reset: source-line 0; } + pre.numberSource code > span + { position: relative; left: -4em; counter-increment: source-line; } + pre.numberSource code > span > a:first-child::before + { content: counter(source-line); + position: relative; left: -1em; text-align: right; vertical-align: baseline; + border: none; display: inline-block; + -webkit-touch-callout: none; -webkit-user-select: none; + -khtml-user-select: none; -moz-user-select: none; + -ms-user-select: none; user-select: none; + padding: 0 4px; width: 4em; + color: #aaaaaa; + } + pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; } + div.sourceCode + { } + @media screen { + code.sourceCode > span > a:first-child::before { text-decoration: underline; } + } + code span.al { color: #ff0000; } /* Alert */ + code span.an { color: #008000; } /* Annotation */ + code 
span.at { } /* Attribute */ + code span.bu { } /* BuiltIn */ + code span.cf { color: #0000ff; } /* ControlFlow */ + code span.ch { color: #008080; } /* Char */ + code span.cn { } /* Constant */ + code span.co { color: #008000; } /* Comment */ + code span.cv { color: #008000; } /* CommentVar */ + code span.do { color: #008000; } /* Documentation */ + code span.er { color: #ff0000; font-weight: bold; } /* Error */ + code span.ex { } /* Extension */ + code span.im { } /* Import */ + code span.in { color: #008000; } /* Information */ + code span.kw { color: #0000ff; } /* Keyword */ + code span.op { } /* Operator */ + code span.ot { color: #ff4000; } /* Other */ + code span.pp { color: #ff4000; } /* Preprocessor */ + code span.sc { color: #008080; } /* SpecialChar */ + code span.ss { color: #008080; } /* SpecialString */ + code span.st { color: #008080; } /* String */ + code span.va { } /* Variable */ + code span.vs { color: #008080; } /* VerbatimString */ + code span.wa { color: #008000; font-weight: bold; } /* Warning */ + + + + - - + @@ -2874,15 +3686,20 @@ @@ -2949,23 +3774,75 @@

Over-representation analysis - Microarray

CCDL for ALSF

-

October 2020

+

November 2020

1 Purpose of this analysis

-

This example is one of pathway analysis module set, we recommend looking at the pathway analysis introduction to help you determine which pathway analysis method is best suited for your purposes.

-

This particular example analysis shows how you can use over-representation analysis (ORA) to determine if a set of genes (e.g., those differentially expressed using some cutoff) shares more or fewer genes with gene sets/pathways than we would expect at random. This pathway analysis method does not require any particular sample size, since the only input from your dataset is a set of genes of interest (Yaari et al. 2013).

+

This example is part of the pathway analysis module set; we recommend looking at the pathway analysis table below to help you determine which pathway analysis method is best suited for your purposes.

+

This particular example analysis shows how you can use over-representation analysis (ORA) to determine if a set of genes (e.g., those differentially expressed using some cutoff) shares more or fewer genes with gene sets/pathways than we would expect by chance.

+

ORA is a broadly applicable technique that may be good in scenarios where your dataset or scientific questions don’t fit the requirements of other pathway analysis methods. It also does not require any particular sample size, since the only input from your dataset is a set of genes of interest (Yaari et al. 2013).
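
As a rough illustration of what "more genes than we would expect by chance" means, the overlap between a gene list and a single pathway can be tested for over-representation with a hypergeometric test in base R. This is a minimal sketch with made-up counts (all four numbers below are hypothetical and not taken from this dataset), and it is not necessarily how `clusterProfiler` computes its p-values internally.

```r
# Hypothetical counts, chosen only for illustration
n_background <- 10000 # genes that had a chance to be measured
n_pathway <- 100      # background genes that belong to the pathway
n_interest <- 200     # genes in our genes of interest list
n_overlap <- 10       # genes of interest that are also in the pathway

# Probability of seeing an overlap of at least `n_overlap` genes by chance
ora_p_value <- phyper(
  n_overlap - 1,            # P(X > k - 1) is the same as P(X >= k)
  n_pathway,                # number of pathway genes in the background
  n_background - n_pathway, # number of non-pathway genes in the background
  n_interest,               # number of genes drawn (our genes of interest)
  lower.tail = FALSE
)

ora_p_value
```

With these counts, the overlap expected by chance is 200 * 100 / 10000 = 2 genes, so an observed overlap of 10 genes gives a small p-value.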

+

If you have differential expression results or something with a gene-level ranking and a two-group comparison, we recommend considering GSEA for your pathway analysis questions.

⬇️ Jump to the analysis code ⬇️

+
+

1.0.1 What is pathway analysis?

+

Pathway analysis refers to any one of many techniques that uses predetermined sets of genes that are related or coordinated in their expression in some way (e.g., participate in the same molecular process, are regulated by the same transcription factor) to interpret a high-throughput experiment. In the context of refine.bio, we use these techniques to analyze and interpret genome-wide gene expression experiments. The rationale for performing pathway analysis is that looking at the pathway-level may be more biologically meaningful than considering individual genes, especially if a large number of genes are differentially expressed between conditions of interest. In addition, many relatively small changes in the expression values of genes in the same pathway could lead to a phenotypic outcome and these small changes may go undetected in differential gene expression analysis.

+

We highly recommend taking a look at Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges from Khatri et al. (2012) for a more comprehensive overview. We have provided primary publications and documentation of the methods we will introduce below as well as some recommended reading in the Resources for further learning section.

+
+
+

1.0.2 How to choose a pathway analysis?

+

This table summarizes the pathway analysis examples in this module.

+
| Analysis | What is required for input | What output looks like | ✅ Pros | ⚠️ Cons |
|---|---|---|---|---|
| ORA (Over-representation Analysis) | A list of gene IDs (no stats needed) | A per-pathway hypergeometric test result | - Simple<br>- Inexpensive computationally to calculate p-values | - Requires arbitrary thresholds and ignores any statistics associated with a gene<br>- Assumes independence of genes and pathways |
| GSEA (Gene Set Enrichment Analysis) | A list of gene IDs with gene-level summary statistics | A per-pathway enrichment score | - Includes all genes (no arbitrary threshold!)<br>- Attempts to measure coordination of genes | - Permutations can be expensive<br>- Does not account for pathway overlap<br>- Two-group comparisons not always appropriate/feasible |
| GSVA (Gene Set Variation Analysis) | A gene expression matrix (like what you get from refine.bio directly) | Pathway-level scores on a per-sample basis | - Does not require two groups to compare upfront<br>- Normally distributed scores | - Scores are not a good fit for gene sets that contain genes that go up AND down<br>- Method doesn’t assign statistical significance itself<br>- Recommended sample size n > 10 |
+

2 How to run this example

For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.

2.1 Obtain the .Rmd file

-

To run this example yourself, download the .Rmd for this analysis by clicking this link.

+

To run this example yourself, download the .Rmd for this analysis by clicking this link.

Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd file to where you would like this example and its files to be stored.

You can open this .Rmd file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd files.)

@@ -2973,36 +3850,36 @@

2.1 Obtain the .Rmd

2.2 Set up your analysis folders

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

-
# Create the data folder if it doesn't exist
-if (!dir.exists("data")) {
-  dir.create("data")
-}
-
-# Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
-
-# Create the plots folder if it doesn't exist
-if (!dir.exists(plots_dir)) {
-  dir.create(plots_dir)
-}
-
-# Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
-
-# Create the results folder if it doesn't exist
-if (!dir.exists(results_dir)) {
-  dir.create(results_dir)
-}
+
# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+  dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots"
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+  dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results"
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+  dir.create(results_dir)
+}

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

2.3 Obtain the gene set for this example

-

In this example, we are using differential expression results table we obtained from an example analysis of zebrafish samples overexpressing human CREB experiment using limma (Ritchie et al. 2015). The table contains Ensembl gene IDs, log fold-changes, and adjusted p-values (FDR in this case).

+

In this example, we are using a table of differential expression results that we obtained with limma (Ritchie et al. 2015) from an experiment with zebrafish samples overexpressing human CREB. The table contains Ensembl gene IDs, log fold-changes, and adjusted p-values (FDR in this case).

We have provided this file for you and the code in this notebook will read in the results that are stored online, but if you’d like to follow the steps for obtaining this results file yourself, we suggest going through that differential expression analysis example.

2.4 About the dataset we are using for this example

-

For this example analysis, we will use this CREB overexpression zebrafish experiment (Tregnago et al. 2016). Tregnago et al. (2016) measured microarray gene expression of zebrafish samples overexpressing human CREB, as well as control samples.

+

For this example analysis, we will use this CREB overexpression zebrafish experiment (Tregnago et al. 2016). Tregnago et al. (2016) used microarrays to measure gene expression in ten zebrafish samples: five overexpressing human CREB and five controls.

2.5 Check out our file structure!

@@ -3021,7 +3898,8 @@

2.5 Check out our file structure!

3 Using a different refine.bio dataset with this analysis?

-

If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/ directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.

+

The file we use here has several columns of differential expression summary statistics. If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend replacing the dge_results_file with a different file path to read in a similar table of genes with the information that you are interested in. If your gene table differs, many steps will need to be changed or deleted entirely depending on the contents of that file (particularly in the Determine our genes of interest list section).
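
Before adapting the later steps, it can help to confirm that your own table has the columns this notebook relies on. The sketch below assumes the column names from the example results file (`Gene`, `logFC`, and `adj.P.Val`); the small data frame `my_dge_df` and the gene IDs in it are made up and only stand in for a table you would read in yourself.

```r
# Hypothetical stand-in for a gene table you have read in yourself
# (the gene IDs and statistics here are made up for illustration)
my_dge_df <- data.frame(
  Gene = c("ENSDARG00000000001", "ENSDARG00000000002"),
  logFC = c(-1.5, 0.3),
  adj.P.Val = c(0.01, 0.60)
)

# Columns that the later filtering steps in this notebook refer to
required_columns <- c("Gene", "logFC", "adj.P.Val")

# Find any required columns that are absent from the table
missing_columns <- setdiff(required_columns, colnames(my_dge_df))

# Stop early with an informative message if something is missing
if (length(missing_columns) > 0) {
  stop("Missing columns: ", paste(missing_columns, collapse = ", "))
}
```

If your table uses different column names, you could either rename them to match or adjust the filtering code in the Determine our genes of interest list section.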

+

We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.


 

@@ -3031,110 +3909,71 @@

4 Over-Representation Analysis wi

4.1 Install libraries

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

-

In this analysis, we will be using clusterProfiler package to perform ORA and the msigdbr package which contains gene sets from the Molecular Signatures Database (MSigDB) already in the tidy format required by clusterProfiler (Dolgalev 2020; Subramanian et al. 2005).

-

We will also need the org.Dr.eg.db package to perform gene identifier conversion and ggupset to make an UpSet plot (Carlson 2019; Ahlmann-Eltze 2020).

-
if (!("clusterProfiler" %in% installed.packages())) {
-  # Install this package if it isn't installed yet
-  BiocManager::install("clusterProfiler", update = FALSE)
-}
-
-# This is required to make one of the plots that clusterProfiler will make
-if (!("ggupset" %in% installed.packages())) {
-  # Install this package if it isn't installed yet
-  BiocManager::install("ggupset", update = FALSE)
-}
-
-if (!("msigdbr" %in% installed.packages())) {
-  # Install this package if it isn't installed yet
-  BiocManager::install("msigdbr", update = FALSE)
-}
-
-if (!("org.Dr.eg.db" %in% installed.packages())) {
-  # Install this package if it isn't installed yet
-  BiocManager::install("org.Dr.eg.db", update = FALSE)
-}
+

In this analysis, we will be using the clusterProfiler package to perform ORA and the msigdbr package, which contains gene sets from the Molecular Signatures Database (MSigDB) already in the tidy format required by clusterProfiler (Yu et al. 2012; Dolgalev 2020; Subramanian et al. 2005; Liberzon et al. 2011).

+

We will also need the org.Dr.eg.db package to perform gene identifier conversion and ggupset to make an UpSet plot (Carlson 2019; Ahlmann-Eltze 2020).

+
if (!("clusterProfiler" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("clusterProfiler", update = FALSE)
+}
+
+# This is required to make one of the plots that clusterProfiler will make
+if (!("ggupset" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("ggupset", update = FALSE)
+}
+
+if (!("msigdbr" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("msigdbr", update = FALSE)
+}
+
+if (!("org.Dr.eg.db" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("org.Dr.eg.db", update = FALSE)
+}

Attach the packages we need for this analysis.

-
# Attach the library
-library(clusterProfiler)
-
## 
-
## clusterProfiler v3.16.1  For help: https://guangchuangyu.github.io/software/clusterProfiler
-## 
-## If you use clusterProfiler in published research, please cite:
-## Guangchuang Yu, Li-Gen Wang, Yanyan Han, Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.
-
## 
-## Attaching package: 'clusterProfiler'
-
## The following object is masked from 'package:stats':
-## 
-##     filter
-
# Package that contains MSigDB gene sets in tidy format
-library(msigdbr)
-
-# Danio rerio annotation package we'll use for gene identifier conversion
-library(org.Dr.eg.db)
-
## Loading required package: AnnotationDbi
-
## Loading required package: stats4
-
## Loading required package: BiocGenerics
-
## Loading required package: parallel
-
## 
-## Attaching package: 'BiocGenerics'
-
## The following objects are masked from 'package:parallel':
-## 
-##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
-##     clusterExport, clusterMap, parApply, parCapply, parLapply,
-##     parLapplyLB, parRapply, parSapply, parSapplyLB
-
## The following objects are masked from 'package:stats':
-## 
-##     IQR, mad, sd, var, xtabs
-
## The following objects are masked from 'package:base':
-## 
-##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
-##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
-##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
-##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
-##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
-##     union, unique, unsplit, which, which.max, which.min
-
## Loading required package: Biobase
-
## Welcome to Bioconductor
-## 
-##     Vignettes contain introductory material; view with
-##     'browseVignettes()'. To cite Bioconductor, see
-##     'citation("Biobase")', and for packages 'citation("pkgname")'.
-
## Loading required package: IRanges
-
## Loading required package: S4Vectors
-
## 
-## Attaching package: 'S4Vectors'
-
## The following object is masked from 'package:clusterProfiler':
-## 
-##     rename
-
## The following object is masked from 'package:base':
-## 
-##     expand.grid
-
## 
-## Attaching package: 'IRanges'
-
## The following object is masked from 'package:clusterProfiler':
-## 
-##     slice
-
## 
-## Attaching package: 'AnnotationDbi'
-
## The following object is masked from 'package:clusterProfiler':
-## 
-##     select
-
## 
-
# We will need this so we can use the pipe: %>%
-library(magrittr)
+
# Attach the library
+library(clusterProfiler)
+
+# Package that contains MSigDB gene sets in tidy format
+library(msigdbr)
+
+# Danio rerio annotation package we'll use for gene identifier conversion
+library(org.Dr.eg.db)
+
+# We will need this so we can use the pipe: %>%
+library(magrittr)
-
-

4.2 Import data

-

We will read in the differential expression results we will download from online. These results are from a zebrafish microarray experiment we used for differential expression analysis for two groups using limma (Ritchie et al. 2015). The table contains Ensembl gene IDs, log fold-changes for each group, and adjusted p-values (FDR in this case). We can identify differentially regulated genes by filtering these results and use this list as input to ORA.

+
+

4.2 Download data file

+

For ORA, we only need a list of gene IDs as our input, so this example can work for any situation where you have a gene list and want to know more about what biological pathways it shares genes with.

+

For this example, we will read in results from a differential expression analysis that we have already performed. Rather than reading from a local file, we will download the results table directly from a URL. These results are from a zebrafish microarray experiment we used for differential expression analysis for two groups using limma (Ritchie et al. 2015). The table contains Ensembl gene IDs, log fold-changes for each group, and adjusted p-values (FDR in this case). We can identify differentially regulated genes by filtering these results and use this list as input to ORA.

Instead of using the URL below, you can use a file path to a TSV file with your desired gene list results. First we will assign the URL to its own variable called dge_url.

-
# Define the url to your differential expression results file
-dge_url <- "https://refinebio-examples.s3.us-east-2.amazonaws.com/02-microarray/results/GSE71270/GSE71270_limma_results.tsv"
-

Read in the file that has differential expression results. Here we are using the URL we set up above, but this can be a local file path instead i.e. you can replace dge_url in the code below with a path to file you have on your computer like: file.path("results", "GSE71270_limma_results.tsv").

-
# Read in the contents of your differential expression results file
-# `dge_url` can be replaced with a file path to a TSV file with your
-# desired gene list results
-dge_df <- readr::read_tsv(dge_url)
-
## Parsed with column specification:
+
# Define the url to your differential expression results file
+dge_url <- "https://refinebio-examples.s3.us-east-2.amazonaws.com/02-microarray/results/GSE71270/GSE71270_limma_results.tsv"
+

We will also declare the file path where we want this file to be downloaded; we can use the same file path later for reading the file into R.

+
dge_results_file <- file.path(
+  results_dir,
+  "GSE71270_limma_results.tsv"
+)
+

Using the URL (dge_url) and file path (dge_results_file) we can download the file and use the destfile argument to specify where it should be saved.

+
download.file(
+  dge_url,
+  # The file will be saved to this location and with this name
+  destfile = dge_results_file
+)
+

Now let’s double check that the results file is in the right place.

+
# Check if the file exists
+file.exists(dge_results_file)
+
## [1] TRUE
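
If you rerun the notebook often, you may prefer not to download the file again each time. One possible pattern, sketched here, is to download only when the file is not already present. The helper name `download_if_missing()` is hypothetical and is not part of this notebook or of any package; the demonstration uses a temporary file and a placeholder URL so that nothing is actually downloaded.

```r
# Hypothetical helper: download a file only if it is not already on disk
download_if_missing <- function(url, destfile) {
  if (file.exists(destfile)) {
    message("File already exists, skipping download: ", destfile)
  } else {
    download.file(url, destfile = destfile)
  }
  # Return (invisibly) whether the file is present after this step
  invisible(file.exists(destfile))
}

# Demonstration with a temporary file that already exists,
# so the placeholder URL is never contacted
existing_file <- tempfile(fileext = ".tsv")
file.create(existing_file)
file_is_present <- download_if_missing(
  "https://example.com/placeholder.tsv",
  existing_file
)
```

In this notebook, the equivalent call would be `download_if_missing(dge_url, dge_results_file)`.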
+
+
+

4.3 Import data

+

Read in the file that has differential expression results.

+
# Read in the contents of the differential expression results file
+dge_df <- readr::read_tsv(dge_results_file)
+
## 
+## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────
 ## cols(
 ##   Gene = col_character(),
 ##   logFC = col_double(),
@@ -3144,336 +3983,404 @@ 

4.2 Import data

## adj.P.Val = col_double(), ## B = col_double() ## )
-

read_tsv() can read TSV files online and doesn’t necessarily require you download the file first. Let’s take a look at what these contrast results from the differential expression analysis look like.

-
dge_df
+

Note that read_tsv() can also read TSV files directly from a URL and doesn’t necessarily require you to download the file first. If you wanted to use that feature, you could replace the call above with readr::read_tsv(dge_url) and skip the download steps.

+

Let’s take a look at what these results from the differential expression analysis look like.

+
dge_df
-
-

4.3 Getting familiar with clusterProfiler’s options

-

Let’s take a look at what organisms the package supports.

-
msigdbr_species()
+
+

4.4 Getting familiar with MSigDB gene sets available via msigdbr

+

The Molecular Signatures Database (MSigDB) is a resource that contains annotated gene sets that can be used for pathway or gene set analyses (Subramanian et al. 2005; Liberzon et al. 2011). We can use the msigdbr package to access these gene sets in a format compatible with the package we’ll use for analysis, clusterProfiler (Yu et al. 2012; Dolgalev 2020).

+

The gene sets available directly from MSigDB are applicable to human studies. msigdbr also supports commonly studied model organisms.

+

Let’s take a look at what organisms the package supports with msigdbr_species().

+
msigdbr_species()
-

The data we’re interested in here comes from zebrafish samples, so we can obtain just the gene sets relevant to D. rerio with the species argument to msigdbr().

-
dr_msigdb_df <- msigdbr(species = "Danio rerio")
-

MSigDB contains 8 different gene set collections (Subramanian et al. 2005).

-
H: hallmark gene sets
-C1: positional gene sets
-C2: curated gene sets
-C3: motif gene sets
-C4: computational gene sets
-C5: GO gene sets
-C6: oncogenic signatures
-C7: immunologic signatures
-

In this example, we will use pathways that are gene sets considered to be “canonical representations of a biological process compiled by domain experts” and are a subset of C2: curated gene sets (Subramanian et al. 2005).

-

Specifically, we will use the KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways (Kanehisa and Goto 2000).

-

First, let’s take a look at what information is included in this data frame.

-
head(dr_msigdb_df)
+

The data we’re interested in here comes from zebrafish samples, so we can obtain only the gene sets relevant to D. rerio with the species argument to msigdbr().

+
dr_msigdb_df <- msigdbr(species = "Danio rerio")
+

MSigDB contains 8 different gene set collections (Subramanian et al. 2005; Liberzon et al. 2011) that are distinguished by how they are derived (e.g., computationally mined, curated). In this example, we will use pathways that are gene sets considered to be “canonical representations of a biological process compiled by domain experts” and are a subset of C2: curated gene sets (Subramanian et al. 2005; Liberzon et al. 2011).

+

Specifically, we will use the KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways (Kanehisa and Goto 2000).

+

First, let’s take a look at what information is included in the data frame returned by msigdbr().

+
head(dr_msigdb_df)

We will need to use gs_cat and gs_subcat columns to construct a filter step that will only keep curated gene sets and KEGG pathways.

-
# Filter the zebrafish data frame to the KEGG pathways that are included in the
-# curated gene sets
-dr_kegg_df <- dr_msigdb_df %>%
-  dplyr::filter(
-    gs_cat == "C2", # This is to filter only to the C2 curated gene sets
-    gs_subcat == "CP:KEGG" # This is because we only want KEGG pathways
-  )
-

Note: We could have specified that we wanted the KEGG gene sets using the category and subcategory arguments of msigdbr(), but we’re going for general steps! – use ?msigdbr to see more information.

+
# Filter the zebrafish data frame to the KEGG pathways that are included in the
+# curated gene sets
+dr_kegg_df <- dr_msigdb_df %>%
+  dplyr::filter(
+    gs_cat == "C2", # This is to filter only to the C2 curated gene sets
+    gs_subcat == "CP:KEGG" # This is because we only want KEGG pathways
+  )

The clusterProfiler function we will use, enricher(), requires a data frame with two columns, where one column contains the term identifier or name and one column contains gene identifiers that match the gene lists we want to check for enrichment.

Our data frame with KEGG terms contains Entrez IDs and gene symbols.

In our differential expression results data frame, dge_df, we have Ensembl gene identifiers, so we will need to convert our Ensembl IDs into either gene symbols or Entrez IDs.

-

4.4 Gene identifier conversion

+

4.5 Gene identifier conversion

We’re going to convert our identifiers in dge_df to gene symbols because they are a bit more human readable, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers!

-

The annotation package org.Dr.eg.db contains information for different identifiers (Carlson 2019). org.Dr.eg.db is specific to Danio rerio – this is what the Dr in the package name is referencing.

+

The annotation package org.Dr.eg.db contains information for different identifiers (Carlson 2019). org.Dr.eg.db is specific to Danio rerio – this is what the Dr in the package name is referencing.

Take a look at our other gene identifier conversion examples for examples with different species and gene ID types: the microarray example and the RNA-seq example.

We can see what types of IDs are available to us in an annotation package with keytypes().

-
keytypes(org.Dr.eg.db)
-
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
-##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
-## [11] "GO"           "GOALL"        "IPI"          "ONTOLOGY"     "ONTOLOGYALL" 
-## [16] "PATH"         "PFAM"         "PMID"         "PROSITE"      "REFSEQ"      
+
keytypes(org.Dr.eg.db)
+
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
+##  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
+##  [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
+## [13] "IPI"          "ONTOLOGY"     "ONTOLOGYALL"  "PATH"        
+## [17] "PFAM"         "PMID"         "PROSITE"      "REFSEQ"      
 ## [21] "SYMBOL"       "UNIGENE"      "UNIPROT"      "ZFIN"

Even though we’ll use this package to convert from Ensembl gene IDs (ENSEMBL) to gene symbols (SYMBOL), we could just as easily use it to convert from an Ensembl transcript ID (ENSEMBLTRANS) to Entrez IDs (ENTREZID).

-

The function we will use to map from Ensembl gene IDs to gene symbols is called mapIds().

-
# This returns a named vector which we can convert to a data frame, where
-# the keys (Ensembl IDs) are the names
-symbols_vector <- mapIds(org.Dr.eg.db, # Specify the annotation package
-  # The vector of gene identifiers we want to map
-  keys = dge_df$Gene,
-  # The type of gene identifier we want returned
-  column = "SYMBOL",
-  # What type of gene identifiers we're starting with
-  keytype = "ENSEMBL",
-  # In the case of 1:many mappings, return the
-  # first one. This is default behavior!
-  multiVals = "first"
-)
+

The function we will use to map from Ensembl gene IDs to gene symbols is called mapIds() and comes from the AnnotationDbi package.

+
# This returns a named vector which we can convert to a data frame, where
+# the keys (Ensembl IDs) are the names
+symbols_vector <- mapIds(
+  # Replace with annotation package for the organism relevant to your data
+  org.Dr.eg.db,
+  # The vector of gene identifiers we want to map
+  keys = dge_df$Gene,
+  # Replace with the type of gene identifiers in your data
+  keytype = "ENSEMBL",
+  # Replace with the type of gene identifiers you would like to map to
+  column = "SYMBOL",
+  # In the case of 1:many mappings, return the
+  # first one. This is default behavior!
+  multiVals = "first"
+)
## 'select()' returned 1:many mapping between keys and columns

This message is letting us know that sometimes Ensembl gene identifiers will map to multiple gene symbols. In this case, it’s also possible that a gene symbol will map to multiple Ensembl IDs. For more about how to explore this, take a look at our microarray gene ID conversion example.
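To make the 1:many idea concrete, here is a minimal base R sketch on a made-up mapping table. The identifiers, symbols, and object names (toy_map and friends) are hypothetical and only for illustration. Keeping the first row per Ensembl ID imitates the effect of multiVals = "first"; it is not how mapIds() is implemented.

```r
# Hypothetical mapping table: one row per (Ensembl ID, gene symbol) pair
toy_map <- data.frame(
  ensembl_id = c("ENSDARG_TOY_A", "ENSDARG_TOY_A", "ENSDARG_TOY_B", "ENSDARG_TOY_C"),
  gene_symbol = c("sym1", "sym2", "sym3", "sym3"),
  stringsAsFactors = FALSE
)

# Which Ensembl IDs map to more than one gene symbol (1:many)?
symbols_per_id <- table(toy_map$ensembl_id)
one_to_many <- names(symbols_per_id[symbols_per_id > 1])

# Keep only the first symbol listed for each Ensembl ID
first_only <- toy_map[!duplicated(toy_map$ensembl_id), ]

# Even after that, one symbol can still be shared by several Ensembl IDs
ids_per_symbol <- table(first_only$gene_symbol)
many_to_one <- names(ids_per_symbol[ids_per_symbol > 1])

one_to_many
first_only
many_to_one
```

The last check is the reason repeated gene symbols can show up downstream even when each Ensembl ID has been reduced to a single symbol.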

Let’s create a two column data frame that shows the gene symbols and their Ensembl IDs side-by-side.

-
# We would like a data frame we can join to the differential expression stats
-gene_key_df <- data.frame(
-  ensembl_id = names(symbols_vector),
-  gene_symbol = symbols_vector,
-  stringsAsFactors = FALSE
-) %>%
-  # If an Ensembl gene identifier doesn't map to a gene symbol, drop that
-  # from the data frame
-  dplyr::filter(!is.na(gene_symbol))
+
# We would like a data frame we can join to the differential expression stats
+gene_key_df <- data.frame(
+  ensembl_id = names(symbols_vector),
+  gene_symbol = symbols_vector,
+  stringsAsFactors = FALSE
+) %>%
+  # If an Ensembl gene identifier doesn't map to a gene symbol, drop that
+  # from the data frame
+  dplyr::filter(!is.na(gene_symbol))

Let’s see a preview of gene_key_df.

-
head(gene_key_df)
+
head(gene_key_df)

Now we are ready to add the gene_key_df to our data frame with the differential expression stats, dge_df. Here we’re using dplyr::left_join() with gene_key_df on the left because we only want to retain the genes that have gene symbols; anything in dge_df that does not have a gene symbol is dropped when we join using the Ensembl gene identifiers.

-
dge_annot_df <- gene_key_df %>%
-  # Using a left join removes the rows without gene symbols because those rows
-  # have already been removed in `gene_symbols_df`
-  dplyr::left_join(dge_df,
-    # The name of the column that contains the Ensembl gene IDs
-    # in the left data frame and right data frame
-    by = c("ensembl_id" = "Gene")
-  )
+
dge_annot_df <- gene_key_df %>%
+  # Using a left join removes the rows without gene symbols because those rows
+  # have already been removed in `gene_key_df`
+  dplyr::left_join(dge_df,
+    # The name of the column that contains the Ensembl gene IDs
+    # in the left data frame and right data frame
+    by = c("ensembl_id" = "Gene")
+  )
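If the filtering effect of this join is not obvious, the small sketch below may help. It uses base R merge() with all.x = TRUE as a stand-in for dplyr::left_join() so that it runs without extra packages, and every data frame and identifier in it is made up for illustration.

```r
# Hypothetical key table: ID_2 is absent because it had no gene symbol
toy_key_df <- data.frame(
  ensembl_id = c("ID_1", "ID_3"),
  gene_symbol = c("sym1", "sym3"),
  stringsAsFactors = FALSE
)

# Hypothetical differential expression stats for three genes
toy_dge_df <- data.frame(
  Gene = c("ID_1", "ID_2", "ID_3"),
  logFC = c(-1.5, 0.2, 2.1),
  stringsAsFactors = FALSE
)

# Keep every row of the key table (the "left" side) and attach matching stats
toy_annot_df <- merge(
  toy_key_df,
  toy_dge_df,
  by.x = "ensembl_id",
  by.y = "Gene",
  all.x = TRUE
)

# ID_2 does not appear because it was never in the left-hand table
toy_annot_df
```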

Let’s take a look at what this data frame looks like.

-
# Print out a preview
-head(dge_annot_df)
+
# Print out a preview
+head(dge_annot_df)
-

4.5 Over-representation Analysis (ORA)

-

Over-representation testing using clusterProfiler is based on a hypergeometric test (Guangchuang Yu).

-

\(p = 1 - \displaystyle\sum_{i = 0}^{k-1}\frac{ {M \choose i}{ {N-M} \choose {n-i} } } { {N \choose n} }\)

-

Where N is the number of genes in the background distribution, M is the number of genes in a pathway, n is the number of genes we are interested in (our differentially expressed genes), and k is the number of genes that overlap between the pathway and our genes of interest.

-

So, we will need to provide to clusterProfiler two genes lists:

+

4.6 Over-representation Analysis (ORA)

+

Over-representation testing using clusterProfiler is based on a hypergeometric test (often referred to as Fisher’s exact test) (Yu 2020). For more background, this handy tutorial explains how hypergeometric tests work (Puthier and van Helden 2015).
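As a rough illustration of the test itself, the sketch below computes an over-representation p value for a single pathway in base R. The counts are invented, and the symbols are our own labels: N is the number of background genes, M the number of background genes in the pathway, n the number of genes of interest, and k the overlap between the two. This is not the clusterProfiler source code, only the same kind of calculation done by hand.

```r
# Invented counts for one pathway
N <- 1000 # genes in the background set
M <- 50   # background genes that belong to the pathway
n <- 100  # genes of interest
k <- 12   # genes of interest that belong to the pathway

# Probability of seeing k or more pathway genes among n genes drawn at random
p_phyper <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# The same tail probability written out as 1 minus a sum of hypergeometric terms
i <- 0:(k - 1)
p_manual <- 1 - sum(choose(M, i) * choose(N - M, n - i) / choose(N, n))

# The equivalent one-sided Fisher's exact test on the 2 x 2 table
contingency <- matrix(
  c(k, M - k, n - k, N - M - n + k),
  nrow = 2,
  dimnames = list(
    gene_of_interest = c("yes", "no"),
    in_pathway = c("yes", "no")
  )
)
p_fisher <- fisher.test(contingency, alternative = "greater")$p.value

c(phyper = p_phyper, manual = p_manual, fisher = p_fisher)
```

All three values should agree up to floating point error, which is why the hypergeometric test and the one-sided Fisher’s exact test are often treated as the same thing in this setting.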

+

We will need to provide to clusterProfiler two gene lists:

-  1. Our genes of interest (n)
-  2. What genes were in our total background set (N). (All genes that originally had an opportunity to be measured).
+  1. Our genes of interest
+  2. What genes were in our total background set. (All genes that originally had an opportunity to be measured).
-

4.6 Determine our genes of interest list

+

4.7 Determine our genes of interest list

We will use our differential expression results to get a genes of interest list. Let’s use our adjusted p values as a cutoff.

-
# Select genes that are below a cutoff
-genes_of_interest <- dge_annot_df %>%
-  # Here we want the top differentially expressed genes and we will use downregulated genes
-  dplyr::filter(adj.P.Val < 0.05, logFC < -1) %>%
-  # We are extracting the gene symbols as a vector
-  dplyr::pull(gene_symbol)
+

This step is highly variable depending on what your gene list is, what information it contains and what your goals are. You may want to delete this next chunk entirely if you supply an already determined list of genes OR you may need to adjust the cutoffs and column names.

+
# Select genes that are below a cutoff
+genes_of_interest <- dge_annot_df %>%
+  # Here we want the top differentially expressed genes and we will use
+  # downregulated genes
+  dplyr::filter(adj.P.Val < 0.05, logFC < -1) %>%
+  # We are extracting the gene symbols as a vector
+  dplyr::pull(gene_symbol)

There are a lot of ways we could make a genes of interest list, and using a p-value cutoff for differential expression analysis is just one way you can do this.

ORA generally requires you to make some sort of arbitrary decision to obtain your genes of interest list and this is one of the approach’s weaknesses – to get to a gene list we’ve removed all other context.

Because one gene_symbol may map to multiple Ensembl IDs, we need to make sure we have no repeated gene symbols in this list.

-
# Reduce to only unique gene symbols
-genes_of_interest <- unique(as.character(genes_of_interest))
-
-# Let's print out some of these genes
-head(genes_of_interest)
+
# Reduce to only unique gene symbols
+genes_of_interest <- unique(as.character(genes_of_interest))
+
+# Let's print out some of these genes
+head(genes_of_interest)
## [1] "si:ch1073-67j19.1" "ypel3"             "pdia4"            
 ## [4] "cst14a.2"          "viml"              "spink2.1"
-

4.7 Determine our background set gene list

+

4.8 Determine our background set gene list

Sometimes folks consider genes from the entire genome to comprise the background, but for our microarray data, we should consider all genes that were measured as our background set. In other words, if we are unable to detect a gene, it should not be in our background set.

We can obtain our detected genes list from our data frame, dge_annot_df (which we haven’t done filtering on).

-
background_set <- unique(as.character(dge_annot_df$gene_symbol))
+
background_set <- unique(as.character(dge_annot_df$gene_symbol))
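Because both lists here come from the same annotated data frame, every gene of interest should also be in the background set. If you supply your own lists, a quick check like the sketch below can catch mismatches before running ORA. The toy vectors are hypothetical stand-ins for genes_of_interest and background_set.

```r
# Hypothetical stand-ins for the two gene lists
toy_background_set <- c("geneA", "geneB", "geneC", "geneD", "geneE")
toy_genes_of_interest <- c("geneB", "geneD")

# Genes of interest that were never in the background set
missing_genes <- setdiff(toy_genes_of_interest, toy_background_set)

# Stop early if any gene of interest is missing from the background
stopifnot(length(missing_genes) == 0)

# Fraction of the background that ended up in the genes of interest list
fraction_of_interest <- length(toy_genes_of_interest) / length(toy_background_set)
fraction_of_interest
```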
-

4.8 Run ORA using the enricher() function

+

4.9 Run ORA using the enricher() function

Now that we have our background set, our genes of interest, and our pathway information, we’re ready to run ORA using the enricher() function.

-
kegg_ora_results <- enricher(
-  gene = genes_of_interest, # A vector of your genes of interest
-  pvalueCutoff = 0.1, # Can choose a FDR cutoff
-  pAdjustMethod = "BH", # What method for multiple testing correction should we use
-  universe = background_set, # A vector containing your background set genes
-  # The pathway information should be a data frame with a term name or
-  # identifier and the gene identifiers
-  TERM2GENE = dplyr::select(
-    dr_kegg_df,
-    gs_name,
-    gene_symbol
-  )
-)
+
kegg_ora_results <- enricher(
+  gene = genes_of_interest, # A vector of your genes of interest
+  pvalueCutoff = 0.1, # Can choose a FDR cutoff
+  pAdjustMethod = "BH", # Method to be used for multiple testing correction
+  universe = background_set, # A vector containing your background set genes
+  # The pathway information should be a data frame with a term name or
+  # identifier and the gene identifiers
+  TERM2GENE = dplyr::select(
+    dr_kegg_df,
+    gs_name,
+    gene_symbol
+  )
+)

Note: using enrichKEGG() is a shortcut for doing ORA using KEGG, but the approach we covered here can be used with any gene sets you’d like!

-

What is returned by enricher()? You can run View(kegg_ora_results) or click on the object in your Environment panel.

The information we’re most likely interested in is in the result slot. Let’s convert this into a data frame that we can write to file.

-
kegg_result_df <- data.frame(kegg_ora_results@result)
+
kegg_result_df <- data.frame(kegg_ora_results@result)

Let’s print out a sneak peek of it here and take a look at how many sets fit our cutoff of 0.1 FDR.

-
kegg_result_df %>%
-  dplyr::filter(p.adjust < 0.1)
+
kegg_result_df %>%
+  dplyr::filter(p.adjust < 0.1)

Looks like there are four KEGG sets returned as significant at FDR of 0.1.
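For intuition about where the p.adjust column comes from, here is a small base R sketch of the Benjamini-Hochberg ("BH") correction applied to invented p values. The gene set names and values are hypothetical, and base R’s p.adjust() is used here only to illustrate the same kind of correction that pAdjustMethod = "BH" requests.

```r
# Invented raw p values for five hypothetical gene sets
toy_pvalues <- c(
  set_1 = 0.001,
  set_2 = 0.01,
  set_3 = 0.03,
  set_4 = 0.2,
  set_5 = 0.5
)

# Benjamini-Hochberg adjustment for multiple testing
toy_padj <- p.adjust(toy_pvalues, method = "BH")

# How many toy gene sets pass an FDR cutoff of 0.1?
n_significant <- sum(toy_padj < 0.1)

toy_padj
n_significant
```

Each adjusted value is at least as large as the raw p value, so fewer sets pass the cutoff after correction than a raw p value cutoff of 0.1 would suggest in general.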

-

4.9 Visualizing results

+

4.10 Visualizing results

We can use a dot plot to visualize our significant enrichment results. The enrichplot::dotplot() function will only plot gene sets that are significant according to the multiple testing corrected p values (in the p.adjust column) and the pvalueCutoff you provided in the enricher() step.

-
enrich_plot <- enrichplot::dotplot(kegg_ora_results)
-
-# Print out the plot here
-enrich_plot
-

+
enrich_plot <- enrichplot::dotplot(kegg_ora_results)
+
## wrong orderBy parameter; set to default `orderBy = "x"`
+
# Print out the plot here
+enrich_plot
+

Use ?enrichplot::dotplot to see the help page for more about how to use this function.

This plot is arguably more useful when we have a large number of significant pathways.

Let’s save it to a PNG.

-
ggplot2::ggsave(file.path(plots_dir, "GSE71270_ora_enrich_plot.png"),
-  plot = enrich_plot
-)
+
ggplot2::ggsave(file.path(plots_dir, "GSE71270_ora_enrich_plot.png"),
+  plot = enrich_plot
+)
## Saving 7 x 5 in image

We can use an UpSet plot to visualize the overlap between the gene sets that were returned as significant.

-
enrichplot::upsetplot(kegg_ora_results)
+
upset_plot <- enrichplot::upsetplot(kegg_ora_results)
+
+# Print out the plot here
+upset_plot

See that KEGG_ANTIGEN_PROCESSING_AND_PRESENTATION and KEGG_LYSOSOME have all their genes in common. Gene sets or pathways aren’t independent!

Let’s also save this to a PNG.

-
ggplot2::ggsave(file.path(plots_dir, "GSE71270_ora_upset_plot.png"),
-  plot = enrich_plot
-)
+
ggplot2::ggsave(file.path(plots_dir, "GSE71270_ora_upset_plot.png"),
+  plot = upset_plot
+)
## Saving 7 x 5 in image
-

4.10 Write results to file

-
readr::write_tsv(
-  kegg_result_df,
-  file.path(
-    results_dir,
-    "GSE71270_pathway_analysis_results.tsv"
-  )
-)
+

4.11 Write results to file

+
readr::write_tsv(
+  kegg_result_df,
+  file.path(
+    results_dir,
+    "GSE71270_pathway_analysis_results.tsv"
+  )
+)

5 Resources for further learning

6 Session info

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

-
# Print session info
-sessionInfo()
-
## R version 4.0.2 (2020-06-22)
-## Platform: x86_64-pc-linux-gnu (64-bit)
-## Running under: Ubuntu 20.04 LTS
-## 
-## Matrix products: default
-## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
-## 
-## locale:
-##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
-##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
-##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
-##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
-##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
-## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
-## 
-## attached base packages:
-## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
-## [8] methods   base     
+
# Print session info
+sessioninfo::session_info()
+
## ─ Session info ─────────────────────────────────────────────────────
+##  setting  value                       
+##  version  R version 4.0.2 (2020-06-22)
+##  os       Ubuntu 20.04 LTS            
+##  system   x86_64, linux-gnu           
+##  ui       X11                         
+##  language (EN)                        
+##  collate  en_US.UTF-8                 
+##  ctype    en_US.UTF-8                 
+##  tz       Etc/UTC                     
+##  date     2020-12-21                  
 ## 
-## other attached packages:
-##  [1] magrittr_1.5           org.Dr.eg.db_3.11.4    AnnotationDbi_1.50.3  
-##  [4] IRanges_2.22.2         S4Vectors_0.26.1       Biobase_2.48.0        
-##  [7] BiocGenerics_0.34.0    msigdbr_7.2.1          clusterProfiler_3.16.1
-## [10] optparse_1.6.6        
+## ─ Packages ─────────────────────────────────────────────────────────
+##  package         * version  date       lib source        
+##  AnnotationDbi   * 1.52.0   2020-10-27 [1] Bioconductor  
+##  assertthat        0.2.1    2019-03-21 [1] RSPM (R 4.0.0)
+##  backports         1.1.10   2020-09-15 [1] RSPM (R 4.0.2)
+##  Biobase         * 2.50.0   2020-10-27 [1] Bioconductor  
+##  BiocGenerics    * 0.36.0   2020-10-27 [1] Bioconductor  
+##  BiocManager       1.30.10  2019-11-16 [1] RSPM (R 4.0.0)
+##  BiocParallel      1.24.1   2020-11-06 [1] Bioconductor  
+##  bit               4.0.4    2020-08-04 [1] RSPM (R 4.0.2)
+##  bit64             4.0.5    2020-08-30 [1] RSPM (R 4.0.2)
+##  blob              1.2.1    2020-01-20 [1] RSPM (R 4.0.0)
+##  cli               2.1.0    2020-10-12 [1] RSPM (R 4.0.2)
+##  clusterProfiler * 3.18.0   2020-10-27 [1] Bioconductor  
+##  colorspace        1.4-1    2019-03-18 [1] RSPM (R 4.0.0)
+##  cowplot           1.1.0    2020-09-08 [1] RSPM (R 4.0.2)
+##  crayon            1.3.4    2017-09-16 [1] RSPM (R 4.0.0)
+##  data.table        1.13.0   2020-07-24 [1] RSPM (R 4.0.2)
+##  DBI               1.1.0    2019-12-15 [1] RSPM (R 4.0.0)
+##  digest            0.6.25   2020-02-23 [1] RSPM (R 4.0.0)
+##  DO.db             2.9      2020-12-16 [1] Bioconductor  
+##  DOSE              3.16.0   2020-10-27 [1] Bioconductor  
+##  downloader        0.4      2015-07-09 [1] RSPM (R 4.0.0)
+##  dplyr             1.0.2    2020-08-18 [1] RSPM (R 4.0.2)
+##  ellipsis          0.3.1    2020-05-15 [1] RSPM (R 4.0.0)
+##  enrichplot        1.10.1   2020-11-14 [1] Bioconductor  
+##  evaluate          0.14     2019-05-28 [1] RSPM (R 4.0.0)
+##  fansi             0.4.1    2020-01-08 [1] RSPM (R 4.0.0)
+##  farver            2.0.3    2020-01-16 [1] RSPM (R 4.0.0)
+##  fastmatch         1.1-0    2017-01-28 [1] RSPM (R 4.0.0)
+##  fgsea             1.16.0   2020-10-27 [1] Bioconductor  
+##  generics          0.0.2    2018-11-29 [1] RSPM (R 4.0.0)
+##  getopt            1.20.3   2019-03-22 [1] RSPM (R 4.0.0)
+##  ggforce           0.3.2    2020-06-23 [1] RSPM (R 4.0.2)
+##  ggplot2           3.3.2    2020-06-19 [1] RSPM (R 4.0.1)
+##  ggraph            2.0.3    2020-05-20 [1] RSPM (R 4.0.2)
+##  ggrepel           0.8.2    2020-03-08 [1] RSPM (R 4.0.2)
+##  ggupset           0.3.0    2020-05-05 [1] RSPM (R 4.0.0)
+##  glue              1.4.2    2020-08-27 [1] RSPM (R 4.0.2)
+##  GO.db             3.12.1   2020-12-16 [1] Bioconductor  
+##  GOSemSim          2.16.1   2020-10-29 [1] Bioconductor  
+##  graphlayouts      0.7.0    2020-04-25 [1] RSPM (R 4.0.2)
+##  gridExtra         2.3      2017-09-09 [1] RSPM (R 4.0.0)
+##  gtable            0.3.0    2019-03-25 [1] RSPM (R 4.0.0)
+##  hms               0.5.3    2020-01-08 [1] RSPM (R 4.0.0)
+##  htmltools         0.5.0    2020-06-16 [1] RSPM (R 4.0.1)
+##  igraph            1.2.6    2020-10-06 [1] RSPM (R 4.0.2)
+##  IRanges         * 2.24.1   2020-12-12 [1] Bioconductor  
+##  jsonlite          1.7.1    2020-09-07 [1] RSPM (R 4.0.2)
+##  knitr             1.30     2020-09-22 [1] RSPM (R 4.0.2)
+##  labeling          0.3      2014-08-23 [1] RSPM (R 4.0.0)
+##  lattice           0.20-41  2020-04-02 [2] CRAN (R 4.0.2)
+##  lifecycle         0.2.0    2020-03-06 [1] RSPM (R 4.0.0)
+##  magrittr        * 1.5      2014-11-22 [1] RSPM (R 4.0.0)
+##  MASS              7.3-51.6 2020-04-26 [2] CRAN (R 4.0.2)
+##  Matrix            1.2-18   2019-11-27 [2] CRAN (R 4.0.2)
+##  memoise           1.1.0    2017-04-21 [1] RSPM (R 4.0.0)
+##  msigdbr         * 7.2.1    2020-10-02 [1] RSPM (R 4.0.2)
+##  munsell           0.5.0    2018-06-12 [1] RSPM (R 4.0.0)
+##  optparse        * 1.6.6    2020-04-16 [1] RSPM (R 4.0.0)
+##  org.Dr.eg.db    * 3.12.0   2020-12-16 [1] Bioconductor  
+##  pillar            1.4.6    2020-07-10 [1] RSPM (R 4.0.2)
+##  pkgconfig         2.0.3    2019-09-22 [1] RSPM (R 4.0.0)
+##  plyr              1.8.6    2020-03-03 [1] RSPM (R 4.0.2)
+##  polyclip          1.10-0   2019-03-14 [1] RSPM (R 4.0.0)
+##  ps                1.4.0    2020-10-07 [1] RSPM (R 4.0.2)
+##  purrr             0.3.4    2020-04-17 [1] RSPM (R 4.0.0)
+##  qvalue            2.22.0   2020-10-27 [1] Bioconductor  
+##  R.cache           0.14.0   2019-12-06 [1] RSPM (R 4.0.0)
+##  R.methodsS3       1.8.1    2020-08-26 [1] RSPM (R 4.0.2)
+##  R.oo              1.24.0   2020-08-26 [1] RSPM (R 4.0.2)
+##  R.utils           2.10.1   2020-08-26 [1] RSPM (R 4.0.2)
+##  R6                2.4.1    2019-11-12 [1] RSPM (R 4.0.0)
+##  RColorBrewer      1.1-2    2014-12-07 [1] RSPM (R 4.0.0)
+##  Rcpp              1.0.5    2020-07-06 [1] RSPM (R 4.0.2)
+##  readr             1.4.0    2020-10-05 [1] RSPM (R 4.0.2)
+##  rematch2          2.1.2    2020-05-01 [1] RSPM (R 4.0.0)
+##  reshape2          1.4.4    2020-04-09 [1] RSPM (R 4.0.2)
+##  rlang             0.4.8    2020-10-08 [1] RSPM (R 4.0.2)
+##  rmarkdown         2.4      2020-09-30 [1] RSPM (R 4.0.2)
+##  RSQLite           2.2.1    2020-09-30 [1] RSPM (R 4.0.2)
+##  rstudioapi        0.11     2020-02-07 [1] RSPM (R 4.0.0)
+##  rvcheck           0.1.8    2020-03-01 [1] RSPM (R 4.0.0)
+##  S4Vectors       * 0.28.1   2020-12-09 [1] Bioconductor  
+##  scales            1.1.1    2020-05-11 [1] RSPM (R 4.0.0)
+##  scatterpie        0.1.5    2020-09-09 [1] RSPM (R 4.0.2)
+##  sessioninfo       1.1.1    2018-11-05 [1] RSPM (R 4.0.0)
+##  shadowtext        0.0.7    2019-11-06 [1] RSPM (R 4.0.0)
+##  stringi           1.5.3    2020-09-09 [1] RSPM (R 4.0.2)
+##  stringr           1.4.0    2019-02-10 [1] RSPM (R 4.0.0)
+##  styler            1.3.2    2020-02-23 [1] RSPM (R 4.0.0)
+##  tibble            3.0.4    2020-10-12 [1] RSPM (R 4.0.2)
+##  tidygraph         1.2.0    2020-05-12 [1] RSPM (R 4.0.2)
+##  tidyr             1.1.2    2020-08-27 [1] RSPM (R 4.0.2)
+##  tidyselect        1.1.0    2020-05-11 [1] RSPM (R 4.0.0)
+##  tweenr            1.0.1    2018-12-14 [1] RSPM (R 4.0.2)
+##  vctrs             0.3.4    2020-08-29 [1] RSPM (R 4.0.2)
+##  viridis           0.5.1    2018-03-29 [1] RSPM (R 4.0.0)
+##  viridisLite       0.3.0    2018-02-01 [1] RSPM (R 4.0.0)
+##  withr             2.3.0    2020-09-22 [1] RSPM (R 4.0.2)
+##  xfun              0.18     2020-09-29 [1] RSPM (R 4.0.2)
+##  yaml              2.2.1    2020-02-01 [1] RSPM (R 4.0.0)
 ## 
-## loaded via a namespace (and not attached):
-##   [1] enrichplot_1.8.1    bit64_4.0.5         progress_1.2.2     
-##   [4] httr_1.4.2          RColorBrewer_1.1-2  R.cache_0.14.0     
-##   [7] tools_4.0.2         backports_1.1.10    R6_2.4.1           
-##  [10] DBI_1.1.0           colorspace_1.4-1    prettyunits_1.1.1  
-##  [13] tidyselect_1.1.0    gridExtra_2.3       curl_4.3           
-##  [16] bit_4.0.4           compiler_4.0.2      cli_2.0.2          
-##  [19] scatterpie_0.1.5    xml2_1.3.2          labeling_0.3       
-##  [22] triebeard_0.3.0     scales_1.1.1        readr_1.3.1        
-##  [25] ggridges_0.5.2      stringr_1.4.0       digest_0.6.25      
-##  [28] ggupset_0.3.0       rmarkdown_2.4       DOSE_3.14.0        
-##  [31] R.utils_2.10.1      pkgconfig_2.0.3     htmltools_0.5.0    
-##  [34] styler_1.3.2        rlang_0.4.7         rstudioapi_0.11    
-##  [37] RSQLite_2.2.0       gridGraphics_0.5-0  generics_0.0.2     
-##  [40] farver_2.0.3        jsonlite_1.7.1      BiocParallel_1.22.0
-##  [43] GOSemSim_2.14.2     dplyr_1.0.2         R.oo_1.24.0        
-##  [46] ggplotify_0.0.5     GO.db_3.11.4        Matrix_1.2-18      
-##  [49] Rcpp_1.0.5          munsell_0.5.0       fansi_0.4.1        
-##  [52] viridis_0.5.1       lifecycle_0.2.0     R.methodsS3_1.8.1  
-##  [55] stringi_1.5.3       yaml_2.2.1          ggraph_2.0.3       
-##  [58] MASS_7.3-51.6       plyr_1.8.6          qvalue_2.20.0      
-##  [61] grid_4.0.2          blob_1.2.1          ggrepel_0.8.2      
-##  [64] DO.db_2.9           crayon_1.3.4        lattice_0.20-41    
-##  [67] cowplot_1.1.0       graphlayouts_0.7.0  splines_4.0.2      
-##  [70] hms_0.5.3           knitr_1.30          pillar_1.4.6       
-##  [73] fgsea_1.14.0        igraph_1.2.5        reshape2_1.4.4     
-##  [76] fastmatch_1.1-0     glue_1.4.2          evaluate_0.14      
-##  [79] downloader_0.4      BiocManager_1.30.10 data.table_1.13.0  
-##  [82] urltools_1.7.3      vctrs_0.3.4         tweenr_1.0.1       
-##  [85] gtable_0.3.0        getopt_1.20.3       purrr_0.3.4        
-##  [88] polyclip_1.10-0     tidyr_1.1.2         rematch2_2.1.2     
-##  [91] assertthat_0.2.1    ggplot2_3.3.2       xfun_0.18          
-##  [94] ggforce_0.3.2       europepmc_0.4       tidygraph_1.2.0    
-##  [97] viridisLite_0.3.0   tibble_3.0.3        rvcheck_0.1.8      
-## [100] memoise_1.1.0       ellipsis_0.3.1
+## [1] /usr/local/lib/R/site-library
+## [2] /usr/local/lib/R/library

References

-

Ahlmann-Eltze C., 2020 Ggupset: Combination matrix axis for ’ggplot2’ to create ’upset’ plots.

+

Ahlmann-Eltze C., 2020 ggupset: Combination matrix axis for ’ggplot2’ to create ’upset’ plots. https://github.com/const-ae/ggupset

-

Carlson M., 2019 Genome wide annotation for zebrafish

+

Carlson M., 2019 Genome wide annotation for zebrafish. https://bioconductor.org/packages/release/data/annotation/html/org.Dr.eg.db.html

-
-

Dolgalev I., 2020 Msigdbr: MSigDB gene sets for multiple organisms in a tidy data format.

-
-
-

Guangchuang Yu, ClusterProfiler: Universal enrichment tool for functional and comparative study

+
+

Dolgalev I., 2020 msigdbr: MSigDB gene sets for multiple organisms in a tidy data format. https://cran.r-project.org/web/packages/msigdbr/index.html

-

Kanehisa M., and S. Goto, 2000 KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28: 27–30.

+

Kanehisa M., and S. Goto, 2000 KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research 28: 27–30. https://doi.org/10.1093/nar/28.1.27

-

Khatri P., M. Sirota, and A. J. Butte, 2012 Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput. Biol. 8: e1002375.

+

Khatri P., M. Sirota, and A. J. Butte, 2012 Ten years of pathway analysis: Current approaches and outstanding challenges. PLOS Computational Biology 8: e1002375. https://doi.org/10.1371/journal.pcbi.1002375

+
+
+

Liberzon A., A. Subramanian, R. Pinchback, H. Thorvaldsdóttir, and P. Tamayo et al., 2011 Molecular signatures database (MSigDB) 3.0. Bioinformatics 27: 1739–1740. https://doi.org/10.1093/bioinformatics/btr260

+
+
+

Puthier D., and J. van Helden, 2015 Statistics for Bioinformatics - Practicals - Gene enrichment statistics. https://dputhier.github.io/ASG/practicals/go_statistics_td/go_statistics_td_2015.html

Ritchie M. E., B. Phipson, D. Wu, Y. Hu, and C. W. Law et al., 2015 limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43: e47. https://doi.org/10.1093/nar/gkv007

-

Subramanian A., P. Tamayo, V. K. Mootha, S. Mukherjee, and B. L. Ebert et al., 2005 Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102: 15545–15550.

+

Subramanian A., P. Tamayo, V. K. Mootha, S. Mukherjee, and B. L. Ebert et al., 2005 Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 102: 15545–15550. https://doi.org/10.1073/pnas.0506580102

-

Tregnago C., E. Manara, M. Zampini, V. Bisio, and C. Borga et al., 2016 CREB engages C/EBPδ to initiate leukemogenesis. Leukemia 30: 1887–1896.

+

Tregnago C., E. Manara, M. Zampini, V. Bisio, and C. Borga et al., 2016 CREB engages C/EBPδ to initiate leukemogenesis. Leukemia 30: 1887–1896. https://doi.org/10.1038/leu.2016.98

-

Yaari G., C. R. Bolen, J. Thakar, and S. H. Kleinstein, 2013 Quantitative set analysis for gene expression: a method to quantify gene set differential expression including gene-gene correlations. Nucleic Acids Res. 41: e170.

+

Yaari G., C. R. Bolen, J. Thakar, and S. H. Kleinstein, 2013 Quantitative set analysis for gene expression: A method to quantify gene set differential expression including gene-gene correlations. Nucleic Acids Research 41: e170. https://doi.org/10.1093/nar/gkt660

-

Yu G., L.-G. Wang, Y. Han, and Q.-Y. He, 2012 ClusterProfiler: An r package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 16: 284–287. https://doi.org/10.1089/omi.2011.0118

+

Yu G., L.-G. Wang, Y. Han, and Q.-Y. He, 2012 clusterProfiler: An R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 16: 284–287. https://doi.org/10.1089/omi.2011.0118

+
+
+

Yu G., 2020 clusterProfiler: Universal enrichment tool for functional and comparative study. http://yulab-smu.top/clusterProfiler-book/index.html

+
diff --git a/02-microarray/pathway-analysis_microarray_01_ortholog_mapping_kegg.Rmd b/02-microarray/pathway-analysis_microarray_01_ortholog_mapping_kegg.Rmd deleted file mode 100644 index aba23317..00000000 --- a/02-microarray/pathway-analysis_microarray_01_ortholog_mapping_kegg.Rmd +++ /dev/null @@ -1,272 +0,0 @@ ---- -title: "KEGG pathways: mapping to mouse orthologs with `hcop`" -output: - html_notebook: - toc: TRUE - toc_float: TRUE -author: J. Taroni for ALSF CCDL -date: 2019 ---- - -## Background - -In this module, we use QuSAGE ([](https://doi.org/10.1093/nar/gkt660)) -for pathway analysis (implemented in the [`qusage` bioconductor package](https://bioconductor.org/packages/release/bioc/html/qusage.html)). - -`qusage` allows you to read in gene sets that are in the [GMT format](http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29). - -[MSigDB](http://software.broadinstitute.org/gsea/msigdb) offers gene sets in this format. -[Curated gene sets](http://software.broadinstitute.org/gsea/msigdb/collections.jsp#C2) -such as [KEGG](https://www.genome.jp/kegg/) are a good starting point for any pathway analysis. - -However, MSigDB only distributes human pathways. -If we want to use KEGG Pathways with another species without going through -[KEGG Orthology](https://www.genome.jp/kegg/ko.html), we need to map to -orthologs ourselves. - -We'll use the [`hcop` package](https://github.com/stephenturner/hcop) to do -this. -If you're looking for a little bit more background information (like if you -run into trouble installing `hcop`), check out the notebook in our -[`ortholog-mapping`](https://github.com/AlexsLemonade/refinebio-examples/tree/master/ortholog-mapping) module. 
- -## Setup - -Package installation - -```{r} -# need read.gmt functionality from qusage -if (!("qusage" %in% installed.packages())) { - BiocManager::install("qusage", update = FALSE) -} - -# we need devtools in order to install the hcop package we will use to -# do the ortholog mapping -if (!("devtools" %in% installed.packages())) { - install.packages("devtools") -} - -# this installs a specific version of hcop -# we pass the commit hash to the ref argument -devtools::install_github("stephenturner/hcop", - ref = "0985fddc91a6ef2308f4800958dfd11c25fe6a98" -) -``` - -```{r} -`%>%` <- dplyr::`%>%` -``` - -```{r} -library(hcop) -``` - -## KEGG human pathways - -We need to download the the MSigDB v6.2 KEGG gene sets that use Entrez gene IDs -and place them at the following path if we have not done so already: - -``` -gene-sets/c2.cp.kegg.v6.2.entrez.gmt -``` - -```{r} -# the kegg gmt file should be located in the spot we mention above -kegg_file <- file.path("gene-sets", "c2.cp.kegg.v6.2.entrez.gmt") -# since we do not track this file in our repository, let's check to make sure -# it exists where we expect it and download it if we don't find it -if (!file.exists(kegg_file)) { - message(paste( - "KEGG GMT file is not found at", kegg_file, - ", downloading now..." - )) - # need gene-sets directory - if (!dir.exists("gene-sets")) { - dir.create("gene-sets") - } - download.file("https://data.broadinstitute.org/gsea-msigdb/msigdb/release/6.2/c2.cp.kegg.v6.2.entrez.gmt", - destfile = kegg_file - ) -} -``` - -Read in the pathway file - -```{r} -kegg_human_list <- qusage::read.gmt(kegg_file) -``` - -## Conversion from human Entrez ID to mouse symbol - -### Human to mouse mapping with `hcop` - -`hcop` is designed to work well with `dplyr`, so we'll get all the possible -human Entrez IDs into a `data.frame`. - -```{r} -entrez_in_pathways_df <- data.frame( - human_entrez = as.integer(unique(unlist(kegg_human_list))) -) -``` - -Join to mouse orthologs by the human Entrez IDs. 
- -```{r} -# Join to mouse orthologs -mouse_ortholog_df <- entrez_in_pathways_df %>% - dplyr::inner_join(mouse, by = "human_entrez") -``` - -### 1:many mapping - -For 1:many mappings, we'll pick the one with the most support. -Here, we'll only consider the _number_ of sources without any regard for _what_ -those resources are. -Essentially, we weigh all sources equally though they likely have more or less -permissive criteria. - -```{r} -# add a column that counts resources -mouse_ortholog_df <- mouse_ortholog_df %>% - dplyr::rowwise() %>% - dplyr::mutate( - num_resources_support = - length(stringr::str_split(support, - pattern = ",", - simplify = TRUE - )) - ) -``` - -To demonstrate how this works, we'll pick a gene that had 1:many mappings -and follow it along. - -```{r} -mouse_ortholog_df %>% - dplyr::filter(human_entrez == 5631) %>% - dplyr::select( - human_entrez, mouse_entrez, human_symbol, mouse_symbol, - num_resources_support - ) -``` - -We can see that the human gene _PRPS1_/`5631` maps to 4 mouse genes, one of -which has 10 resources supporting that mapping. - -```{r} -# for each unique human entrez id, pick the mapping with the highest number of -# resources supporting it -most_support_df <- mouse_ortholog_df %>% - dplyr::group_by(human_entrez) %>% - dplyr::top_n(1, num_resources_support) -``` - -What happened to `5631`? - -```{r} -most_support_df %>% - dplyr::filter(human_entrez == 5631) %>% - dplyr::select( - human_entrez, mouse_entrez, human_symbol, mouse_symbol, - num_resources_support - ) -``` - -We successfully selected the mapping with the highest number of resources -supporting it. - -### Conversion of KEGG pathways - -For each KEGG pathway we have (currently populated with human Entrez IDs), -we need a new gene set that is comprised of mouse gene symbols. 
- -```{r} -kegg_mouse_list <- - lapply( - kegg_human_list, - function(pathway) { - dplyr::filter(most_support_df, human_entrez %in% pathway) %>% - dplyr::pull(mouse_symbol) - } - ) -``` - -Do the results this seem reasonable? -Let's pick the polymerase pathway, where our success should be pretty obvious -from the mouse gene symbols. - -```{r} -kegg_mouse_list[[grep( - "POLYMERASE", # Replace with a word (case sensitive) or phrase that would filter in your desired pathway(s) - names(kegg_mouse_list) -)]] -``` - -### Write mouse pathway list in `GMT` format - -We need to write this new mouse pathway list to file in [GMT format](http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29), as this will facilitate use with downstream pathway analysis. - -Briefly, the GMT format has one pathway per line and it follows this pattern: - -``` -\t\t... -``` - -We've lost the description information because it's removed by -`qusage::read.gmt`. -The description in `r kegg_file` follows this pattern: - -``` -http://www.broadinstitute.org/gsea/msigdb/cards/ -``` - -We can pretty easily stick this back in when we write to file with `write()`. 
- -```{r} -# filename we're going to write to -mouse_file <- file.path( - "gene-sets", # Replace with path to desired output directory - "c2.cp.kegg.v6.2.entrez_mouse_symbol_hcop.gmt" # Replace with name for output file -) - -# if you run this again after making changes above, you'd just end up appending -# this to the end of your old file, so we should take steps to get rid of -# the existing file -if (file.exists(mouse_file)) { - message(paste("Removing old", mouse_file)) - file.remove(mouse_file) -} - -# for each pathway, write it to line following GMT format -for (pathway_iter in seq_along(kegg_mouse_list)) { - # extract the current pathway name - pathway_name <- names(kegg_mouse_list)[pathway_iter] - text_to_write <- - paste( - # the name of the pathway - pathway_name, - # the description -- this is stripped out by qusage::read.gmt - paste0( - "http://www.broadinstitute.org/gsea/msigdb/cards/", - pathway_name - ), - # the gene symbols - paste(kegg_mouse_list[[pathway_iter]], collapse = "\t"), - sep = "\t" - ) - write(text_to_write, mouse_file, append = TRUE) -} -``` - -We can double check how this went by reading it back in with `qusage::read.gmt`. 
- -```{r} -mouse_read_list <- qusage::read.gmt(mouse_file) -all.equal(mouse_read_list, kegg_mouse_list) -``` - -## Session Info - -```{r} -sessioninfo::session_info() -``` diff --git a/02-microarray/pathway-analysis_microarray_02_gsea.Rmd b/02-microarray/pathway-analysis_microarray_02_gsea.Rmd new file mode 100644 index 00000000..462e45cc --- /dev/null +++ b/02-microarray/pathway-analysis_microarray_02_gsea.Rmd @@ -0,0 +1,583 @@ +--- +title: "Gene set enrichment analysis - Microarray" +author: "CCDL for ALSF" +date: "October 2020" +output: + html_notebook: + toc: true + toc_float: true + number_sections: true +--- + +# Purpose of this analysis + +This example is one of a set of pathway analysis modules; we recommend looking at the [pathway analysis table below](#how-to-choose-a-pathway-analysis) to help you determine which pathway analysis method is best suited for your purposes. + +This particular example analysis shows how you can use Gene Set Enrichment Analysis (GSEA) to detect situations where genes in a predefined gene set or pathway change in a coordinated way between two conditions [@Subramanian2005]. +Changes at the pathway level may be statistically significant and contribute to phenotypic differences, even if the changes in the expression level of individual genes are small. + +⬇️ [**Jump to the analysis code**](#analysis) ⬇️ + +### What is pathway analysis? + +Pathway analysis refers to any one of many techniques that use predetermined sets of genes that are related or coordinated in their expression in some way (e.g., participate in the same molecular process, are regulated by the same transcription factor) to interpret a high-throughput experiment. +In the context of [refine.bio](https://www.refine.bio/), we use these techniques to analyze and interpret genome-wide gene expression experiments. 
+The rationale for performing pathway analysis is that looking at the pathway level may be more biologically meaningful than considering individual genes, especially if a large number of genes are differentially expressed between conditions of interest. +In addition, many relatively small changes in the expression values of genes in the same pathway could lead to a phenotypic outcome and these small changes may go undetected in differential gene expression analysis. + +We highly recommend taking a look at [Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002375) from @Khatri2012 for a more comprehensive overview. +We have provided primary publications and documentation of the methods we will introduce below as well as some recommended reading in the [`Resources for further learning` section](#resources-for-further-learning). + +### How to choose a pathway analysis? + +This table summarizes the pathway analysis examples in this module. + +|Analysis|What is required for input|What output looks like |✅ Pros| ⚠️ Cons| +|--------|--------------------------|-----------------------|-------|-------| +|[**ORA (Over-representation Analysis)**](https://alexslemonade.github.io/refinebio-examples/02-microarray/pathway-analysis_microarray_01_ora.html)|A list of gene IDs (no stats needed)|A per-pathway hypergeometric test result|- Simple

- Inexpensive computationally to calculate p-values| - Requires arbitrary thresholds and ignores any statistics associated with a gene

- Assumes independence of genes and pathways| +|[**GSEA (Gene Set Enrichment Analysis)**](https://alexslemonade.github.io/refinebio-examples/02-microarray/pathway-analysis_microarray_02_gsea.html)|A list of gene IDs with gene-level summary statistics|A per-pathway enrichment score|- Includes all genes (no arbitrary threshold!)

- Attempts to measure coordination of genes|- Permutations can be expensive

- Does not account for pathway overlap

- Two-group comparisons not always appropriate/feasible| +|[**GSVA (Gene Set Variation Analysis)**](https://alexslemonade.github.io/refinebio-examples/02-microarray/pathway-analysis_microarray_03_gsva.html)|A gene expression matrix (like what you get from refine.bio directly)|Pathway-level scores on a per-sample basis|- Does not require two groups to compare upfront

- Normally distributed scores|- Scores are not a good fit for gene sets that contain genes that go up AND down

- Method doesn’t assign statistical significance itself

- Recommended sample size n > 10| + +# How to run this example + +For general information about our tutorials and the basic software packages you will need, please see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-this-tutorial-is-structured). +We recommend taking a look at our [Resources for Learning R](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#resources-for-learning-r) if you have not written code in R before. + +## Obtain the `.Rmd` file + +To run this example yourself, [download the `.Rmd` for this analysis by clicking this link](https://alexslemonade.github.io/refinebio-examples/02-microarray/pathway-analysis_microarray_02_gsea.Rmd). + +Clicking this link will most likely send this to your downloads folder on your computer. +Move this `.Rmd` file to where you would like this example and its files to be stored. + +You can open this `.Rmd` file in RStudio and follow the rest of these steps from there. (See our [section about getting started with R notebooks](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) if you are unfamiliar with `.Rmd` files.) + +## Set up your analysis folders + +Good file organization is helpful for keeping your data analysis project on track! +We have set up some code that will automatically set up a folder structure for you. +Run this next chunk to set up your folders! + +If you have trouble running this chunk, see our [introduction to using `.Rmd`s](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) for more resources and explanations. 
+ +```{r} +# Create the data folder if it doesn't exist +if (!dir.exists("data")) { + dir.create("data") +} + +# Define the file path to the plots directory +plots_dir <- "plots" + +# Create the plots folder if it doesn't exist +if (!dir.exists(plots_dir)) { + dir.create(plots_dir) +} + +# Define the file path to the results directory +results_dir <- "results" + +# Create the results folder if it doesn't exist +if (!dir.exists(results_dir)) { + dir.create(results_dir) +} +``` + +In the same place you put this `.Rmd` file, you should now have three new empty folders called `data`, `plots`, and `results`! + +## Obtain the gene set for this example + +In this example, we are using a differential expression results table that we obtained from an [example analysis of a zebrafish experiment with samples overexpressing human CREB](https://alexslemonade.github.io/refinebio-examples/02-microarray/differential-expression_microarray_01_2-groups.html) using [`limma`](https://bioconductor.org/packages/release/bioc/html/limma.html) [@Ritchie2015]. +The table contains summary statistics including Ensembl gene IDs, t-statistic values, and adjusted p-values (FDR in this case). + +We have provided this file for you and the code in this notebook will read in the results that are stored online, but if you'd like to follow the steps for obtaining this results file yourself, we suggest going through [that differential expression analysis example](https://alexslemonade.github.io/refinebio-examples/02-microarray/differential-expression_microarray_01_2-groups.html). + +## About the dataset we are using for this example + +For this example analysis, we will use this [CREB overexpression zebrafish experiment](https://www.refine.bio/experiments/GSE71270/creb-overexpression-induces-leukemia-in-zebrafish-by-blocking-myeloid-differentiation-process) [@Tregnago2016]. 
+@Tregnago2016 used microarrays to measure gene expression in ten zebrafish samples: five overexpressing human CREB and five control samples. + +## Check out our file structure! + +Your new analysis folder should contain: + +- The example analysis `.Rmd` you downloaded +- A folder called `data` (currently empty) +- A folder for `plots` (currently empty) +- A folder for `results` (currently empty) + +Your example analysis folder should contain your `.Rmd` and three empty folders (which won't be empty for long!). + +If the concept of a "file path" is unfamiliar to you, we recommend taking a look at our [section about file paths](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#an-important-note-about-file-paths-and-Rmds). + +# Using a different refine.bio dataset with this analysis? + +If you'd like to adapt an example analysis to use a different dataset from [refine.bio](https://www.refine.bio/), we recommend placing the files in the `data/` directory you created and changing the filenames and paths in the notebook to match these files (we've put comments to signify where you would need to change the code). +We suggest saving plots and results to `plots/` and `results/` directories, respectively, as these are automatically created by the notebook. +From here you can customize this analysis example to fit your own scientific questions and preferences. + +*** + +   + +# Gene set enrichment analysis - Microarray + +## Install libraries + +See our Getting Started page with [instructions for package installation](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#what-you-need-to-install) for a list of the other software you will need, as well as more tips and resources. 
+ +In this analysis, we will be using the [`clusterProfiler`](https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html) package to perform GSEA and the [`msigdbr`](https://cran.r-project.org/web/packages/msigdbr/index.html) package, which contains gene sets from the [Molecular Signatures Database (MSigDB)](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp) already in the tidy format required by `clusterProfiler` [@Yu2012; @Dolgalev2020; @Subramanian2005; @Liberzon2011]. + +We will also need the [`org.Dr.eg.db`](https://bioconductor.org/packages/release/data/annotation/html/org.Dr.eg.db.html) package to perform gene identifier conversion [@Carlson2019-zebrafish]. + +```{r} +if (!("clusterProfiler" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("clusterProfiler", update = FALSE) +} + +if (!("msigdbr" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("msigdbr", update = FALSE) +} + +if (!("org.Dr.eg.db" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("org.Dr.eg.db", update = FALSE) +} +``` + +Attach the packages we need for this analysis. 
+ +```{r message=FALSE} +# Attach the library +library(clusterProfiler) + +# Package that contains MSigDB gene sets in tidy format +library(msigdbr) + +# Zebrafish annotation package we'll use for gene identifier conversion +library(org.Dr.eg.db) + +# We will need this so we can use the pipe: %>% +library(magrittr) +``` + +## Download data file + +We will download the differential expression results from an online source and then read them in. +These results are from a zebrafish microarray experiment we used for [differential expression analysis for two groups](https://alexslemonade.github.io/refinebio-examples/02-microarray/differential-expression_microarray_02_2-groups.html) using [`limma`](https://bioconductor.org/packages/release/bioc/html/limma.html) [@Ritchie2015]. +The table contains summary statistics including Ensembl gene IDs, t-statistic values, and adjusted p-values (FDR in this case). + +Instead of using the URL below, you can use a file path to a TSV file with your desired gene list results. +First we will assign the URL to its own variable called `dge_url`. + +```{r} +# Define the url to your differential expression results file +dge_url <- "https://refinebio-examples.s3.us-east-2.amazonaws.com/02-microarray/results/GSE71270/GSE71270_limma_results.tsv" +``` + +We will also define the file path where we want this file to be saved; we can use the same file path later for reading the file into R. + +```{r} +dge_results_file <- file.path( + results_dir, + "GSE71270_limma_results.tsv" +) +``` + +Using the URL (`dge_url`) and file path (`dge_results_file`), we can download the file and use the `destfile` argument to specify where it should be saved. + +```{r} +download.file( + dge_url, + # The file will be saved to this location and with this name + destfile = dge_results_file +) +``` + +Now let's double check that the results file is in the right place. 
+ +```{r} +# Check if the file exists +file.exists(dge_results_file) +``` + +## Import data + +Read in the file that has differential expression results. + +```{r} +# Read in the contents of the differential expression results file +dge_df <- readr::read_tsv(dge_results_file) +``` + +Note that `read_tsv()` can also read TSV files directly from a URL and doesn't necessarily require you to download the file first. +If you wanted to use that feature, you could replace the call above with `readr::read_tsv(dge_url)` and skip the download steps. + +Let's take a look at what these results from the differential expression analysis look like. + +```{r} +dge_df +``` + +## Getting familiar with MSigDB gene sets available via `msigdbr` + +The Molecular Signatures Database (MSigDB) is a resource that contains annotated gene sets that can be used for pathway or gene set analyses [@Subramanian2005; @Liberzon2011]. +We can use the `msigdbr` package to access these gene sets in a format compatible with the package we'll use for analysis, `clusterProfiler` [@Dolgalev2020; @Yu2012]. + +The gene sets available directly from MSigDB are applicable to human studies. +`msigdbr` also supports commonly studied model organisms. + +Let's take a look at what organisms the package supports with `msigdbr_species()`. + +```{r} +msigdbr_species() +``` + +MSigDB contains [8 different gene set collections](https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp) [@Subramanian2005; @Liberzon2011] that are distinguished by how they are derived (e.g., computationally mined, curated). + +In this example, we will use a collection called Hallmark gene sets for GSEA [@Liberzon2015]. 
+Here's an excerpt of [the collection description from MSigDB](https://www.gsea-msigdb.org/gsea/msigdb/collection_details.jsp#H): + +> Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. +> These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression. +> The hallmarks reduce noise and redundancy and provide a better delineated biological space for GSEA. + +Notably, there are only 50 gene sets included in this collection. +The fewer gene sets we test, the lower our multiple hypothesis testing burden. + +The data we're interested in here comes from zebrafish samples, so we can obtain only the Hallmarks gene sets relevant to _D. rerio_ by specifying `category = "H"` and `species = "Danio rerio"`, respectively, to the `msigdbr()` function. + +```{r} +dr_hallmark_df <- msigdbr( + species = "Danio rerio", # Replace with species name relevant to your data + category = "H" +) +``` + +If you run the chunk above without specifying a `category` to the `msigdbr()` function, it will return all of the MSigDB gene sets for zebrafish. + +Let's preview what's in `dr_hallmark_df`. + +```{r} +head(dr_hallmark_df) +``` + +Looks like we have a data frame of gene sets with associated gene symbols and Entrez IDs. + +In our differential expression results data frame, `dge_df` we have Ensembl gene identifiers. +So we will need to convert our Ensembl IDs into either gene symbols or Entrez IDs for GSEA. + +## Gene identifier conversion + +We're going to convert our identifiers in `dge_df` to Entrez IDs, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers! + +The annotation package `org.Dr.eg.db` contains information for different identifiers [@Carlson2019-zebrafish]. 
+`org.Dr.eg.db` is specific to _Danio rerio_ -- this is what the `Dr` in the package name is referencing. + +We can see what types of IDs are available to us in an annotation package with `keytypes()`. + +```{r} +keytypes(org.Dr.eg.db) +``` + +Even though we'll use this package to convert from Ensembl gene IDs (`ENSEMBL`) to Entrez IDs (`ENTREZID`), we could just as easily use it to convert from an Ensembl transcript ID (`ENSEMBLTRANS`) to gene symbols (`SYMBOL`). + +Take a look at our other gene identifier conversion examples for examples with different species and gene ID types: +[the microarray example](https://alexslemonade.github.io/refinebio-examples/02-microarray/gene-id-annotation_microarray_01_ensembl.html) and [the RNA-seq example](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.html). + +The function we will use to map from Ensembl gene IDs to Entrez gene IDs is called `mapIds()` and comes from the `AnnotationDbi` package. + +Let's create a data frame that shows the mapped Entrez IDs along with the differential expression stats for the respective Ensembl IDs. 
+ +```{r} +# First let's create a mapped data frame we can join to the differential +# expression stats +dge_mapped_df <- data.frame( + entrez_id = mapIds( + # Replace with annotation package for the organism relevant to your data + org.Dr.eg.db, + keys = dge_df$Gene, + # Replace with the type of gene identifiers in your data + keytype = "ENSEMBL", + # Replace with the type of gene identifiers you would like to map to + column = "ENTREZID", + # This will keep only the first mapped value for each Ensembl ID + multiVals = "first" + ) +) %>% + # If an Ensembl gene identifier doesn't map to an Entrez gene identifier, + # drop that from the data frame + dplyr::filter(!is.na(entrez_id)) %>% + # Make an `Ensembl` column to store the rownames + tibble::rownames_to_column("Ensembl") %>% + # Now let's join the rest of the expression data + dplyr::inner_join(dge_df, by = c("Ensembl" = "Gene")) +``` + +This `1:many mapping between keys and columns` message means that some Ensembl gene identifiers map to multiple Entrez IDs. +In this case, it's also possible that an Entrez ID will map to multiple Ensembl IDs. +For the purpose of performing GSEA later in this notebook, we keep only the first mapped IDs. +For more about how to explore this, take a look at our [microarray gene ID conversion example](https://alexslemonade.github.io/refinebio-examples/02-microarray/gene-id-annotation_microarray_01_ensembl.html). + +Let's see a preview of `dge_mapped_df`. + +```{r rownames.print=FALSE} +head(dge_mapped_df) +``` + +## Perform gene set enrichment analysis (GSEA) + +The goal of GSEA is to detect situations where many genes in a gene set change in a coordinated way, even when individual changes are small in magnitude [@Subramanian2005]. + +GSEA calculates a pathway-level metric, called an enrichment score (sometimes abbreviated as ES), by ranking genes by a gene-level statistic. 
+This score reflects whether or not a gene set or pathway is overrepresented at the top or bottom of the gene rankings [@Subramanian2005; @clusterProfiler-book]. +Specifically, genes are ranked from most positive to most negative based on their statistic. +A running sum is then calculated by starting with the most highly ranked genes, increasing the score when a gene is in the pathway and decreasing the score when a gene is not. +In this example, the enrichment score for a pathway is the running sum's maximum deviation from zero. +GSEA also assesses statistical significance of the scores for each pathway through permutation testing. +As a result, each input pathway will have a p-value associated with it that is then corrected for multiple hypothesis testing [@Subramanian2005; @clusterProfiler-book]. + +The implementation of GSEA we use in this example requires a gene list ordered by some statistic (here we'll use the t-statistic calculated as part of differential gene expression analysis) and input gene sets (Hallmark collection). +When you use previously computed gene-level statistics with GSEA, it is called GSEA pre-ranked. + +### Determine our pre-ranked genes list + +The `GSEA()` function takes a pre-ranked and sorted named vector of statistics, where the names in the vector are gene identifiers. +It requires _unique gene identifiers_ to produce the most accurate results, so we will need to resolve any duplicates found in our dataset. +(The `GSEA()` function will throw a warning if we do not do this ahead of time.) + +Let's check to see if we have any Entrez IDs that mapped to multiple Ensembl IDs in our data frame of differential expression results. + +```{r} +any(duplicated(dge_mapped_df$entrez_id)) +``` + +Looks like we do have duplicated Entrez IDs. +Let's find out which ones. 
+ +```{r} +dup_entrez_ids <- dge_mapped_df %>% + dplyr::filter(duplicated(entrez_id)) %>% + dplyr::pull(entrez_id) + +dup_entrez_ids +``` + +Now let's take a look at the rows associated with the duplicated Entrez IDs. + +```{r} +dge_mapped_df %>% + dplyr::filter(entrez_id %in% dup_entrez_ids) +``` + +We can see that the associated values vary for each row. + +As we mentioned earlier, we will want to remove duplicated gene identifiers in preparation for the `GSEA()` step. +Let's keep the Entrez IDs associated with the higher absolute value of the t-statistic. +GSEA relies on genes' rankings on the basis of a gene-level statistic and the enrichment score that is calculated reflects the degree to which genes in a gene set are overrepresented in the top or bottom of the rankings [@Subramanian2005; @clusterProfiler-book]. + +Retaining the instance of the Entrez ID with the higher absolute value of a gene-level statistic means that we will retain the value that is likely to be more highly- or lowly-ranked or, put another way, the values less likely to be towards the middle of the ranked gene list. +We should keep this decision in mind when interpreting our results. +For example, if all the duplicate identifiers happened to be in a particular gene set, we may get an overly optimistic view of how perturbed that gene set is because we preferentially selected instances of the identifier that have a higher absolute value of the statistic used for ranking. + +We are removing values for two genes here, so it is unlikely to have a considerable impact on our results. 
+ +```{r} +filtered_dge_mapped_df <- dge_mapped_df %>% + # Sort so that the highest absolute values of the t-statistic are at the top + dplyr::arrange(dplyr::desc(abs(t))) %>% + # Filter out the duplicated rows using `dplyr::distinct()`-- this will keep + # the first row with the duplicated value thus keeping the row with the + # highest absolute value of the t-statistic + dplyr::distinct(entrez_id, .keep_all = TRUE) +``` + +Let's check to see that we removed the duplicate Entrez IDs and kept the rows with the higher absolute value of the t-statistic. + +```{r} +filtered_dge_mapped_df %>% + dplyr::filter(entrez_id %in% dup_entrez_ids) +``` + +Looks like we were able to successfully get rid of the duplicate gene identifiers and keep the observations with the higher absolute value of the t-statistic! + +In the next chunk, we will create a named vector ranked based on the gene-level t-statistic values. + +```{r} +# Let's create a named vector ranked based on the t-statistic values +t_vector <- filtered_dge_mapped_df$t +names(t_vector) <- filtered_dge_mapped_df$entrez_id + +# We need to sort the t-statistic values in descending order here +t_vector <- sort(t_vector, decreasing = TRUE) +``` + +Let's preview our pre-ranked named vector. + +```{r} +# Look at first entries of the ranked t-statistic vector +head(t_vector) +``` + +### Run GSEA using the `GSEA()` function + +Genes were ranked from most positive to most negative, weighted according to their gene-level statistic, in the previous section. +In this section, we will implement GSEA to calculate the enrichment score for each gene set using our pre-ranked gene list. + +The GSEA algorithm utilizes random sampling so we are going to set the seed to make our results reproducible. 
+ +```{r} +# Set the seed so our results are reproducible: +set.seed(2020) +``` + +We can use the `GSEA()` function to perform GSEA with any generic set of gene sets, but there are several functions for using specific, commonly used gene sets (e.g., `gseKEGG()`). + +```{r} +gsea_results <- GSEA( + geneList = t_vector, # Ordered ranked gene list + minGSSize = 25, # Minimum gene set size + maxGSSize = 500, # Maximum gene set set + pvalueCutoff = 0.05, # p-value cutoff + eps = 0, # Boundary for calculating the p-value + seed = TRUE, # Set seed to make results reproducible + pAdjustMethod = "BH", # Benjamini-Hochberg correction + TERM2GENE = dplyr::select( + dr_hallmark_df, + gs_name, + entrez_gene + ) +) +``` + +Significance is assessed by permuting the gene labels of the pre-ranked gene list and recomputing the enrichment scores of the gene set for the permuted data, which generates a null distribution for the enrichment score. +The `pAdjustMethod` argument to `GSEA()` above specifies what method to use for adjusting the p-values to account for multiple hypothesis testing; the `pvalueCutoff` argument tells the function to only return pathways with adjusted p-values less than that threshold in the `results` slot. + +Let's take a look at the table in the `result` slot of `gsea_results`. + +```{r rownames.print=FALSE} +# We can access the results from our gseaResult object using `@result` +head(gsea_results@result) +``` + +Looks like we have gene sets returned as significant at FDR (false discovery rate) of `0.05`. +If we did not have results that met the `pvalueCutoff` condition, this table would be empty. + +The `NES` column contains the normalized enrichment score, which normalizes for the gene set size, for that pathway. + +Let's convert the contents of `result` into a data frame that we can use for further analysis and write to a file later. 
+ +```{r} +gsea_result_df <- data.frame(gsea_results@result) +``` + +## Visualizing results + +We can visualize GSEA results for individual pathways or gene sets using `enrichplot::gseaplot()`. +Let's take a look at 2 different pathways -- one with a highly positive NES and one with a highly negative NES -- to get more insight into how ES are calculated. + +### Most Positive NES + +Let's look at the 3 gene sets with the most positive NES. + +```{r rownames.print = FALSE} +gsea_result_df %>% + # This returns the 3 rows with the largest NES values + dplyr::slice_max(n = 3, order_by = NES) +``` + +The gene set `HALLMARK_TNFA_SIGNALING_VIA_NFKB` has the most positive NES score. + +```{r} +most_positive_nes_plot <- enrichplot::gseaplot( + gsea_results, + geneSetID = "HALLMARK_TNFA_SIGNALING_VIA_NFKB", + title = "HALLMARK_TNFA_SIGNALING_VIA_NFKB", + color.line = "#0d76ff" +) + +most_positive_nes_plot +``` + +Notice how the genes that are in the gene set, indicated by the black bars, tend to be on the left side of the graph indicating that they have positive gene-level scores. +The red dashed line indicates the enrichment score, which is the maximum deviation from zero. +As mentioned earlier, an enrichment is calculated by starting with the most highly ranked genes (according to the gene-level t-statistic values) and increasing the score when a gene is in the pathway and decreasing the score when a gene is not in the pathway. + +The plots returned by `enrichplot::gseaplot` are ggplots, so we can use `ggplot2::ggsave()` to save them to file. + +Let's save to PNG. + +```{r} +ggplot2::ggsave(file.path(plots_dir, "GSE71270_gsea_enrich_positive_plot.png"), + plot = most_positive_nes_plot +) +``` + +### Most Negative NES + +Let's look for the 3 gene sets with the most negative NES. 
+ +```{r rownames.print=FALSE} +gsea_result_df %>% + # Return the 3 rows with the smallest (most negative) NES values + dplyr::slice_min(n = 3, order_by = NES) +``` + +The gene set `HALLMARK_E2F_TARGETS` has the most negative NES. + +```{r} +most_negative_nes_plot <- enrichplot::gseaplot( + gsea_results, + geneSetID = "HALLMARK_E2F_TARGETS", + title = "HALLMARK_E2F_TARGETS", + color.line = "#0d76ff" +) + +most_negative_nes_plot +``` + +This gene set shows the opposite pattern -- genes in the pathway tend to be on the right side of the graph. +Again, the red dashed line here indicates the maximum deviation from zero, in other words, the enrichment score. +A _negative_ enrichment score will be returned when many genes are near the bottom of the ranked list. + +Let's save this plot to PNG as well. + +```{r} +ggplot2::ggsave(file.path(plots_dir, "GSE71270_gsea_enrich_negative_plot.png"), + plot = most_negative_nes_plot +) +``` + +## Write results to file + +```{r} +readr::write_tsv( + gsea_result_df, + file.path( + results_dir, + "GSE71270_gsea_results.tsv" + ) +) +``` + +# Resources for further learning + +- [clusterProfiler paper](https://doi.org/10.1089/omi.2011.0118) [@Yu2012]. +- [clusterProfiler book](https://yulab-smu.github.io/clusterProfiler-book/index.html) [@clusterProfiler-book]. +- [This handy review](https://doi.org/10.1371/journal.pcbi.1002375) which summarizes the different types of pathway analysis and their limitations [@Khatri2012]. +- See this [Broad Institute page](https://www.gsea-msigdb.org/gsea/index.jsp) for more on GSEA and MSigDB [@GSEA-broad-institute]. + +# Session info + +At the end of every analysis, before saving your notebook, we recommend printing out your session info. +This helps make your code more reproducible by recording what versions of software and packages you used to run this. 
+ +```{r} +# Print session info +sessioninfo::session_info() +``` + +# References diff --git a/02-microarray/pathway-analysis_microarray_02_gsea.html b/02-microarray/pathway-analysis_microarray_02_gsea.html new file mode 100644 index 00000000..bfb8a9e8 --- /dev/null +++ b/02-microarray/pathway-analysis_microarray_02_gsea.html @@ -0,0 +1,4486 @@ + + + + + + + + + + + + + + +Gene set enrichment analysis - Microarray + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+
+
+
+
+ +
+ + + + + + + + + +
+

1 Purpose of this analysis

+

This example is one of a set of pathway analysis modules. We recommend looking at the pathway analysis table below to help you determine which pathway analysis method is best suited for your purposes.

+

This particular example analysis shows how you can use Gene Set Enrichment Analysis (GSEA) to detect situations where genes in a predefined gene set or pathway change in a coordinated way between two conditions (Subramanian et al. 2005). Changes at the pathway-level may be statistically significant, and contribute to phenotypic differences, even if the changes in the expression level of individual genes are small.

+

⬇️ Jump to the analysis code ⬇️

+
+

1.0.1 What is pathway analysis?

+

Pathway analysis refers to any one of many techniques that uses predetermined sets of genes that are related or coordinated in their expression in some way (e.g., participate in the same molecular process, are regulated by the same transcription factor) to interpret a high-throughput experiment. In the context of refine.bio, we use these techniques to analyze and interpret genome-wide gene expression experiments. The rationale for performing pathway analysis is that looking at the pathway-level may be more biologically meaningful than considering individual genes, especially if a large number of genes are differentially expressed between conditions of interest. In addition, many relatively small changes in the expression values of genes in the same pathway could lead to a phenotypic outcome and these small changes may go undetected in differential gene expression analysis.

+

We highly recommend taking a look at Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges from Khatri et al. (2012) for a more comprehensive overview. We have provided primary publications and documentation of the methods we will introduce below as well as some recommended reading in the Resources for further learning section.

+
+
+

1.0.2 How to choose a pathway analysis?

+

This table summarizes the pathway analysis examples in this module.

+ +++++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Analysis | What is required for input | What output looks like | ✅ Pros | ⚠️ Cons |
|----------|----------------------------|------------------------|---------|---------|
| ORA (Over-representation Analysis) | A list of gene IDs (no stats needed) | A per-pathway hypergeometric test result | - Simple; - Inexpensive computationally to calculate p-values | - Requires arbitrary thresholds and ignores any statistics associated with a gene; - Assumes independence of genes and pathways |
| GSEA (Gene Set Enrichment Analysis) | A list of gene IDs with gene-level summary statistics | A per-pathway enrichment score | - Includes all genes (no arbitrary threshold!); - Attempts to measure coordination of genes | - Permutations can be expensive; - Does not account for pathway overlap; - Two-group comparisons not always appropriate/feasible |
| GSVA (Gene Set Variation Analysis) | A gene expression matrix (like what you get from refine.bio directly) | Pathway-level scores on a per-sample basis | - Does not require two groups to compare upfront; - Normally distributed scores | - Scores are not a good fit for gene sets that contain genes that go up AND down; - Method doesn’t assign statistical significance itself; - Recommended sample size n > 10 |
+
+
+
+

2 How to run this example

+

For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.

+
+

2.1 Obtain the .Rmd file

+

To run this example yourself, download the .Rmd for this analysis by clicking this link.

+

Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd file to where you would like this example and its files to be stored.

+

You can open this .Rmd file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd files.)

+
+
+

2.2 Set up your analysis folders

+

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

+

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

+
# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+  dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots"
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+  dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results"
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+  dir.create(results_dir)
+}
+

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!
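
The three folder-creation steps above follow the same pattern, so they could also be written as a loop. This is only a sketch: `base_dir` is a hypothetical name that points at a temporary folder here so that nothing in your project is touched, and `analysis_dirs` is likewise made up for illustration.

```r
# Hypothetical base directory; in the notebook this would be the folder
# that holds the .Rmd file
base_dir <- file.path(tempdir(), "gsea_example")

# Folder names used throughout this example
analysis_dirs <- file.path(base_dir, c("data", "plots", "results"))

for (dir_path in analysis_dirs) {
  # Only create the folder if it doesn't exist yet
  if (!dir.exists(dir_path)) {
    dir.create(dir_path, recursive = TRUE)
  }
}

# Check that all three folders now exist
all(dir.exists(analysis_dirs))
```

The notebook keeps the steps separate because `plots_dir` and `results_dir` are reused later as individual variables.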

+
+
+

2.3 Obtain the gene set for this example

+

In this example, we are using a differential expression results table that we obtained from an example analysis of a zebrafish experiment with samples overexpressing human CREB, performed using limma (Ritchie et al. 2015). The table contains summary statistics including Ensembl gene IDs, t-statistic values, and adjusted p-values (FDR in this case).

+

We have provided this file for you and the code in this notebook will read in the results that are stored online, but if you’d like to follow the steps for obtaining this results file yourself, we suggest going through that differential expression analysis example.

+
+
+

2.4 About the dataset we are using for this example

+

For this example analysis, we will use this CREB overexpression zebrafish experiment (Tregnago et al. 2016). Tregnago et al. (2016) used microarrays to measure gene expression of ten zebrafish samples, five overexpressing human CREB, as well as five control samples.

+
+
+

2.5 Check out our file structure!

+

Your new analysis folder should contain:

+
  • The example analysis .Rmd you downloaded
  • A folder called data (currently empty)
  • A folder for plots (currently empty)
  • A folder for results (currently empty)
+

Your example analysis folder should contain your .Rmd and three empty folders (which won’t be empty for long!).

+

If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.

+
+
+
+

3 Using a different refine.bio dataset with this analysis?

+

If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/ directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.

+
+ +

 

+
+
+

4 Gene set enrichment analysis - Microarray

+
+

4.1 Install libraries

+

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

+

In this analysis, we will be using the clusterProfiler package to perform GSEA and the msigdbr package, which contains gene sets from the Molecular Signatures Database (MSigDB) already in the tidy format required by clusterProfiler (Yu et al. 2012; Dolgalev 2020; Subramanian et al. 2005; Liberzon et al. 2011).

+

We will also need the org.Dr.eg.db package to perform gene identifier conversion (Carlson 2019).

+
if (!("clusterProfiler" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("clusterProfiler", update = FALSE)
+}
+
+if (!("msigdbr" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("msigdbr", update = FALSE)
+}
+
+if (!("org.Dr.eg.db" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("org.Dr.eg.db", update = FALSE)
+}
+

Attach the packages we need for this analysis.

+
# Attach the library
+library(clusterProfiler)
+
+# Package that contains MSigDB gene sets in tidy format
+library(msigdbr)
+
+# Zebrafish annotation package we'll use for gene identifier conversion
+library(org.Dr.eg.db)
+
+# We will need this so we can use the pipe: %>%
+library(magrittr)
+
+
+

4.2 Download data file

+

We will download the differential expression results from an online source and then read them in. These results are from a zebrafish microarray experiment in which we compared two groups using limma (Ritchie et al. 2015). The table contains summary statistics including Ensembl gene IDs, t-statistic values, and adjusted p-values (FDR in this case).

+

Instead of using the URL below, you can use a file path to a TSV file with your desired gene list results. First we will assign the URL to its own variable called dge_url.

+
# Define the url to your differential expression results file
+dge_url <- "https://refinebio-examples.s3.us-east-2.amazonaws.com/02-microarray/results/GSE71270/GSE71270_limma_results.tsv"
+

We will also declare the file path where we want this file to be saved. We can use the same file path later to read the file into R.

+
dge_results_file <- file.path(
+  results_dir,
+  "GSE71270_limma_results.tsv"
+)
+

Using the URL (dge_url) and file path (dge_results_file) we can download the file and use the destfile argument to specify where it should be saved.

+
download.file(
+  dge_url,
+  # The file will be saved to this location and with this name
+  destfile = dge_results_file
+)
+

Now let’s double check that the results file is in the right place.

+
# Check if the file exists
+file.exists(dge_results_file)
+
## [1] TRUE
+
+
+

4.3 Import data

+

Read in the file that has differential expression results.

+
# Read in the contents of the differential expression results file
+dge_df <- readr::read_tsv(dge_results_file)
+
## 
+## ── Column specification ──────────────────────────────────────────────────────────────────────────────────
+## cols(
+##   Gene = col_character(),
+##   logFC = col_double(),
+##   AveExpr = col_double(),
+##   t = col_double(),
+##   P.Value = col_double(),
+##   adj.P.Val = col_double(),
+##   B = col_double()
+## )
+

Note that read_tsv() can also read TSV files directly from a URL and doesn’t necessarily require you download the file first. If you wanted to use that feature, you could replace the call above with readr::read_tsv(dge_url) and skip the download steps.
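
If you swap in your own results file, it can help to confirm that it has the columns the rest of this notebook relies on before going further. The sketch below uses a small made-up data frame (`toy_dge_df`, with invented gene IDs and values) in place of `dge_df`; `required_cols` and `missing_cols` are also hypothetical names.

```r
# Made-up stand-in for `dge_df`; the IDs and values are illustrative only
toy_dge_df <- data.frame(
  Gene = c("fake_ensembl_1", "fake_ensembl_2"),
  logFC = c(1.2, -0.8),
  t = c(4.1, -2.7),
  adj.P.Val = c(0.01, 0.20),
  stringsAsFactors = FALSE
)

# Columns used later in this notebook: gene IDs and the ranking statistic
required_cols <- c("Gene", "t")

# Which required columns, if any, are missing?
missing_cols <- setdiff(required_cols, colnames(toy_dge_df))

if (length(missing_cols) > 0) {
  stop("Missing required columns: ", paste(missing_cols, collapse = ", "))
}
```

With the toy data frame nothing is missing, so the chunk finishes silently; dropping the `t` column would stop it with an error.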

+

Let’s take a look at what these results from the differential expression analysis look like.

+
dge_df
+
+ +
+
+
+

4.4 Getting familiar with MSigDB gene sets available via msigdbr

+

The Molecular Signatures Database (MSigDB) is a resource that contains annotated gene sets that can be used for pathway or gene set analyses (Subramanian et al. 2005; Liberzon et al. 2011). We can use the msigdbr package to access these gene sets in a format compatible with the package we’ll use for analysis, clusterProfiler (Yu et al. 2012; Dolgalev 2020).

+

The gene sets available directly from MSigDB are applicable to human studies. msigdbr also supports commonly studied model organisms.

+

Let’s take a look at what organisms the package supports with msigdbr_species().

+
msigdbr_species()
+
+ +
+

MSigDB contains 8 different gene set collections (Subramanian et al. 2005; Liberzon et al. 2011) that are distinguished by how they are derived (e.g., computationally mined, curated).

+

In this example, we will use a collection called Hallmark gene sets for GSEA (Liberzon et al. 2015). Here’s an excerpt of the collection description from MSigDB:

+
+

Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression. The hallmarks reduce noise and redundancy and provide a better delineated biological space for GSEA.

+
+

Notably, there are only 50 gene sets included in this collection. The fewer gene sets we test, the lower our multiple hypothesis testing burden.

+

The data we’re interested in here comes from zebrafish samples, so we can obtain only the Hallmark gene sets relevant to D. rerio by specifying category = "H" and species = "Danio rerio", respectively, to the msigdbr() function.

+
dr_hallmark_df <- msigdbr(
+  species = "Danio rerio", # Replace with species name relevant to your data
+  category = "H"
+)
+

If you run the chunk above without specifying a category to the msigdbr() function, it will return all of the MSigDB gene sets for zebrafish.

+

Let’s preview what’s in dr_hallmark_df.

+
head(dr_hallmark_df)
+
+ +
+

Looks like we have a data frame of gene sets with associated gene symbols and Entrez IDs.

+

In our differential expression results data frame, dge_df, we have Ensembl gene identifiers. So we will need to convert our Ensembl IDs into either gene symbols or Entrez IDs for GSEA.

+
+
+

4.5 Gene identifier conversion

+

We’re going to convert our identifiers in dge_df to Entrez IDs, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers!

+

The annotation package org.Dr.eg.db contains information for different identifiers (Carlson 2019). org.Dr.eg.db is specific to Danio rerio – this is what the Dr in the package name is referencing.

+

We can see what types of IDs are available to us in an annotation package with keytypes().

+
keytypes(org.Dr.eg.db)
+
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
+##  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
+##  [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
+## [13] "IPI"          "ONTOLOGY"     "ONTOLOGYALL"  "PATH"        
+## [17] "PFAM"         "PMID"         "PROSITE"      "REFSEQ"      
+## [21] "SYMBOL"       "UNIGENE"      "UNIPROT"      "ZFIN"
+

Even though we’ll use this package to convert from Ensembl gene IDs (ENSEMBL) to Entrez IDs (ENTREZID), we could just as easily use it to convert from an Ensembl transcript ID (ENSEMBLTRANS) to gene symbols (SYMBOL).

+

Take a look at our other gene identifier conversion examples to see different species and gene ID types: the microarray example and the RNA-seq example.

+

The function we will use to map from Ensembl gene IDs to Entrez gene IDs is called mapIds() and comes from the AnnotationDbi package.

+

Let’s create a data frame that shows the mapped Entrez IDs along with the differential expression stats for the respective Ensembl IDs.

+
# First let's create a mapped data frame we can join to the differential
+# expression stats
+dge_mapped_df <- data.frame(
+  entrez_id = mapIds(
+    # Replace with annotation package for the organism relevant to your data
+    org.Dr.eg.db,
+    keys = dge_df$Gene,
+    # Replace with the type of gene identifiers in your data
+    keytype = "ENSEMBL",
+    # Replace with the type of gene identifiers you would like to map to
+    column = "ENTREZID",
+    # This will keep only the first mapped value for each Ensembl ID
+    multiVals = "first"
+  )
+) %>%
+  # If an Ensembl gene identifier doesn't map to an Entrez gene identifier,
+  # drop that from the data frame
+  dplyr::filter(!is.na(entrez_id)) %>%
+  # Make an `Ensembl` column to store the rownames
+  tibble::rownames_to_column("Ensembl") %>%
+  # Now let's join the rest of the expression data
+  dplyr::inner_join(dge_df, by = c("Ensembl" = "Gene"))
+
## 'select()' returned 1:many mapping between keys and columns
+

This 1:many mapping between keys and columns message means that some Ensembl gene identifiers map to multiple Entrez IDs. In this case, it’s also possible that an Entrez ID will map to multiple Ensembl IDs. For the purpose of performing GSEA later in this notebook, we keep only the first mapped IDs. For more about how to explore this, take a look at our microarray gene ID conversion example.
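
To make the two kinds of ambiguity concrete, here is a small base R sketch on a made-up mapping table. It does not call `mapIds()`; it only mimics the effect of keeping the first mapped value, and all identifiers (`toy_map`, `ens_A`, and so on) are invented for illustration.

```r
# Made-up mapping table: one row per Ensembl ID / Entrez ID pair
toy_map <- data.frame(
  ensembl = c("ens_A", "ens_A", "ens_B", "ens_C"),
  entrez = c("101", "102", "103", "103"),
  stringsAsFactors = FALSE
)

# Keep only the first Entrez ID listed for each Ensembl ID
first_map <- toy_map[!duplicated(toy_map$ensembl), ]
first_map

# ens_A now maps to a single Entrez ID, but ens_B and ens_C still share
# the same Entrez ID, so duplicated Entrez IDs remain
any(duplicated(first_map$entrez))
```

Keeping the first value resolves the 1:many direction only. Entrez IDs shared by several Ensembl IDs survive this step, which is why the notebook checks for duplicated Entrez IDs before running GSEA.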

+

Let’s see a preview of dge_mapped_df.

+
head(dge_mapped_df)
+
+ +
+
+
+

4.6 Perform gene set enrichment analysis (GSEA)

+

The goal of GSEA is to detect situations where many genes in a gene set change in a coordinated way, even when individual changes are small in magnitude (Subramanian et al. 2005).

+

GSEA calculates a pathway-level metric, called an enrichment score (sometimes abbreviated as ES), by ranking genes by a gene-level statistic. This score reflects whether or not a gene set or pathway is overrepresented at the top or bottom of the gene rankings (Yu 2020; Subramanian et al. 2005). Specifically, genes are ranked from most positive to most negative based on their statistic and a running sum is calculated by starting with the most highly ranked genes and increasing the score when a gene is in the pathway and decreasing the score when a gene is not. In this example, the enrichment score for a pathway is the running sum’s maximum deviation from zero. GSEA also assesses statistical significance of the scores for each pathway through permutation testing. As a result, each input pathway will have a p-value associated with it that is then corrected for multiple hypothesis testing (Yu 2020; Subramanian et al. 2005).
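
The running sum described above can be sketched in a few lines of base R. This is a simplified illustration of the weighted running sum from Subramanian et al. (2005) on made-up statistics, not the code that clusterProfiler runs; `running_es()` and the toy gene names are hypothetical.

```r
# Simplified enrichment score: maximum deviation from zero of a running sum
running_es <- function(stats, gene_set) {
  # Rank genes from most positive to most negative
  stats <- sort(stats, decreasing = TRUE)
  in_set <- names(stats) %in% gene_set

  # Step up at genes in the set, weighted by the size of their statistic
  step_up <- abs(stats) * in_set / sum(abs(stats) * in_set)
  # Step down by a constant amount at genes not in the set
  step_down <- (!in_set) / sum(!in_set)

  running_sum <- cumsum(step_up - step_down)
  unname(running_sum[which.max(abs(running_sum))])
}

# Made-up gene-level statistics
toy_stats <- c(g1 = 3, g2 = 2, g3 = 1, g4 = -1, g5 = -2, g6 = -3)

# A set at the top of the ranking gets a positive score (1 in this toy case)
running_es(toy_stats, c("g1", "g2"))

# A set at the bottom of the ranking gets a negative score (-1 here)
running_es(toy_stats, c("g5", "g6"))
```

The toy sets sit entirely at one end of the ranking, so they reach the most extreme scores possible; real gene sets are usually spread through the list and score closer to zero.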

+

The implementation of GSEA we use in this example requires a gene list ordered by some statistic (here we’ll use the t-statistic calculated as part of differential gene expression analysis) and input gene sets (Hallmark collection). When you use previously computed gene-level statistics with GSEA, it is called GSEA pre-ranked.

+
+

4.6.1 Determine our pre-ranked genes list

+

The GSEA() function takes a pre-ranked and sorted named vector of statistics, where the names in the vector are gene identifiers. It requires unique gene identifiers to produce the most accurate results, so we will need to resolve any duplicates found in our dataset. (The GSEA() function will throw a warning if we do not do this ahead of time.)

+

Let’s check to see if we have any Entrez IDs that mapped to multiple Ensembl IDs in our data frame of differential expression results.

+
any(duplicated(dge_mapped_df$entrez_id))
+
## [1] TRUE
+

Looks like we do have duplicated Entrez IDs. Let’s find out which ones.

+
dup_entrez_ids <- dge_mapped_df %>%
+  dplyr::filter(duplicated(entrez_id)) %>%
+  dplyr::pull(entrez_id)
+
+dup_entrez_ids
+
## [1] "336702" "57924"
+

Now let’s take a look at the rows associated with the duplicated Entrez IDs.

+
dge_mapped_df %>%
+  dplyr::filter(entrez_id %in% dup_entrez_ids)
+
+ +
+

We can see that the associated values vary for each row.

+

As we mentioned earlier, we will want to remove duplicated gene identifiers in preparation for the GSEA() step. Let’s keep the Entrez IDs associated with the higher absolute value of the t-statistic. GSEA relies on genes’ rankings on the basis of a gene-level statistic, and the enrichment score that is calculated reflects the degree to which genes in a gene set are overrepresented in the top or bottom of the rankings (Yu 2020; Subramanian et al. 2005).

+

Retaining the instance of the Entrez ID with the higher absolute value of a gene-level statistic means that we will retain the value that is likely to be more highly- or lowly-ranked or, put another way, the values less likely to be towards the middle of the ranked gene list. We should keep this decision in mind when interpreting our results. For example, if all the duplicate identifiers happened to be in a particular gene set, we may get an overly optimistic view of how perturbed that gene set is because we preferentially selected instances of the identifier that have a higher absolute value of the statistic used for ranking.

+

We are removing values for two genes here, so it is unlikely to have a considerable impact on our results.

+
filtered_dge_mapped_df <- dge_mapped_df %>%
+  # Sort so that the highest absolute values of the t-statistic are at the top
+  dplyr::arrange(dplyr::desc(abs(t))) %>%
+  # Filter out the duplicated rows using `dplyr::distinct()`-- this will keep
+  # the first row with the duplicated value thus keeping the row with the
+  # highest absolute value of the t-statistic
+  dplyr::distinct(entrez_id, .keep_all = TRUE)
+

Let’s check to see that we removed the duplicate Entrez IDs and kept the rows with the higher absolute value of the t-statistic.

+
filtered_dge_mapped_df %>%
+  dplyr::filter(entrez_id %in% dup_entrez_ids)
+
+ +
+

Looks like we were able to successfully get rid of the duplicate gene identifiers and keep the observations with the higher absolute value of the t-statistic!
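
The sort-then-keep-first logic is easy to verify on a toy example. The sketch below uses base R equivalents of `dplyr::arrange()` and `dplyr::distinct()` on a made-up data frame (`toy_df` and its values are invented for illustration).

```r
# Made-up results where Entrez ID "2" appears twice
toy_df <- data.frame(
  entrez_id = c("1", "2", "2", "3"),
  t = c(0.5, -4.0, 1.5, 2.0),
  stringsAsFactors = FALSE
)

# Sort so the largest absolute t-statistics come first
sorted_df <- toy_df[order(abs(toy_df$t), decreasing = TRUE), ]

# Keep the first row for each ID, which is the row with the largest |t|
dedup_df <- sorted_df[!duplicated(sorted_df$entrez_id), ]
dedup_df
```

For ID "2", the row with t = -4.0 is kept and the row with t = 1.5 is dropped, so the sign of the retained statistic can be negative even though the selection is based on absolute value.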

+

In the next chunk, we will create a named vector ranked based on the gene-level t-statistic values.

+
# Let's create a named vector ranked based on the t-statistic values
+t_vector <- filtered_dge_mapped_df$t
+names(t_vector) <- filtered_dge_mapped_df$entrez_id
+
+# We need to sort the t-statistic values in descending order here
+t_vector <- sort(t_vector, decreasing = TRUE)
+

Let’s preview our pre-ranked named vector.

+
# Look at first entries of the ranked t-statistic vector
+head(t_vector)
+
##   555053   140633   407728   368924   335916   323329 
+## 20.22172 14.48634 13.88657 12.45258 11.24450 10.92140
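
Before passing a ranked vector to `GSEA()`, a few quick checks can catch common problems. This is a sketch on a made-up vector (`toy_t_vector`); the same checks could be applied to `t_vector`.

```r
# Made-up named vector of statistics, names are stand-ins for Entrez IDs
toy_t_vector <- c("103" = -1.2, "101" = 3.4, "102" = 0.7)
toy_t_vector <- sort(toy_t_vector, decreasing = TRUE)

# Gene identifiers should be unique
no_dup_names <- !any(duplicated(names(toy_t_vector)))

# Values should be in decreasing order
is_decreasing <- !is.unsorted(rev(toy_t_vector))

# There should be no missing statistics
no_missing <- !any(is.na(toy_t_vector))

c(no_dup_names, is_decreasing, no_missing)
```

All three checks return `TRUE` for the toy vector.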
+
+
+

4.6.2 Run GSEA using the GSEA() function

+

Genes were ranked from most positive to most negative, weighted according to their gene-level statistic, in the previous section. In this section, we will implement GSEA to calculate the enrichment score for each gene set using our pre-ranked gene list.

+

The GSEA algorithm utilizes random sampling so we are going to set the seed to make our results reproducible.

+
# Set the seed so our results are reproducible:
+set.seed(2020)
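
To see what setting the seed does, we can draw random numbers twice with the same seed. This small sketch uses `sample()` rather than GSEA itself; `first_draw` and `second_draw` are illustrative names.

```r
# Draw five random numbers after setting the seed
set.seed(2020)
first_draw <- sample(1:100, size = 5)

# Reset the seed to the same value and draw again
set.seed(2020)
second_draw <- sample(1:100, size = 5)

# The two draws are the same because the seed was the same
identical(first_draw, second_draw)
```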
+

We can use the GSEA() function to perform GSEA with any generic set of gene sets, but there are several functions for using specific, commonly used gene sets (e.g., gseKEGG()).

+
gsea_results <- GSEA(
+  geneList = t_vector, # Ordered ranked gene list
+  minGSSize = 25, # Minimum gene set size
+  maxGSSize = 500, # Maximum gene set size
+  pvalueCutoff = 0.05, # p-value cutoff
+  eps = 0, # Boundary for calculating the p-value
+  seed = TRUE, # Set seed to make results reproducible
+  pAdjustMethod = "BH", # Benjamini-Hochberg correction
+  TERM2GENE = dplyr::select(
+    dr_hallmark_df,
+    gs_name,
+    entrez_gene
+  )
+)
+
## preparing geneSet collections...
+
## GSEA analysis...
+
## leading edge analysis...
+
## done...
+

Significance is assessed by permuting the gene labels of the pre-ranked gene list and recomputing the enrichment scores of the gene set for the permuted data, which generates a null distribution for the enrichment score. The pAdjustMethod argument to GSEA() above specifies what method to use for adjusting the p-values to account for multiple hypothesis testing; the pvalueCutoff argument tells the function to only return pathways with adjusted p-values less than that threshold in the result slot.
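
The adjustment and cutoff steps can be illustrated with base R's `p.adjust()` on made-up p-values. This is only a sketch of the idea; the gene set names and p-values below are invented, and the filtering inside `GSEA()` may differ in detail.

```r
# Made-up unadjusted p-values for four gene sets
toy_pvalues <- c(set_a = 0.001, set_b = 0.01, set_c = 0.03, set_d = 0.2)

# Benjamini-Hochberg adjustment, the same method named by "BH" above
adj_pvalues <- p.adjust(toy_pvalues, method = "BH")
adj_pvalues

# Keep only the gene sets whose adjusted p-value is below the cutoff
sig_sets <- names(adj_pvalues)[adj_pvalues < 0.05]
sig_sets
```

Here the adjusted p-values are 0.004, 0.02, 0.04, and 0.2, so three of the four toy gene sets pass a cutoff of 0.05.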

+

Let’s take a look at the table in the result slot of gsea_results.

+
# We can access the results from our gseaResult object using `@result`
+head(gsea_results@result)
+
+ +
+

Looks like we have gene sets returned as significant at an FDR (false discovery rate) of 0.05. If we did not have results that met the pvalueCutoff condition, this table would be empty.

+

The NES column contains the normalized enrichment score, which normalizes for the gene set size, for that pathway.
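
The permutation test and the normalization can be sketched together on toy data. This follows the general approach described in Subramanian et al. (2005) and is an assumption-laden illustration: the packages used above estimate p-values with more efficient algorithms, and `enrichment_score()`, `toy_stats`, and `toy_set` are hypothetical names with made-up values.

```r
set.seed(2020)

# Made-up ranked statistics for 20 genes, sorted in decreasing order
toy_stats <- sort(rnorm(20), decreasing = TRUE)
names(toy_stats) <- paste0("gene_", 1:20)

# Made-up gene set drawn from the top of the ranking
toy_set <- c("gene_1", "gene_2", "gene_4", "gene_7")

# Enrichment score: maximum deviation from zero of the weighted running sum
enrichment_score <- function(stats, gene_set) {
  in_set <- names(stats) %in% gene_set
  step_up <- abs(stats) * in_set / sum(abs(stats) * in_set)
  step_down <- (!in_set) / sum(!in_set)
  running_sum <- cumsum(step_up - step_down)
  unname(running_sum[which.max(abs(running_sum))])
}

observed_es <- enrichment_score(toy_stats, toy_set)

# Null distribution: shuffle the gene labels and recompute the score
null_es <- replicate(1000, {
  shuffled <- toy_stats
  names(shuffled) <- sample(names(toy_stats))
  enrichment_score(shuffled, toy_set)
})

# Compare the observed score only to null scores with the same sign
same_sign_null <- null_es[sign(null_es) == sign(observed_es)]

# Empirical p-value (the +1 avoids reporting a p-value of exactly zero)
perm_pvalue <- (sum(abs(same_sign_null) >= abs(observed_es)) + 1) /
  (length(same_sign_null) + 1)

# Normalized enrichment score: observed score scaled by the mean null score
nes <- observed_es / mean(abs(same_sign_null))
```

Dividing by the mean of the null scores puts gene sets of different sizes on a comparable scale, because the null distribution is generated for each gene set separately.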

+

Let’s convert the contents of result into a data frame that we can use for further analysis and write to a file later.

+
gsea_result_df <- data.frame(gsea_results@result)
+
+
+
+

4.7 Visualizing results

+

We can visualize GSEA results for individual pathways or gene sets using enrichplot::gseaplot(). Let’s take a look at 2 different pathways – one with a highly positive NES and one with a highly negative NES – to get more insight into how ES are calculated.

+
+

4.7.1 Most Positive NES

+

Let’s look at the 3 gene sets with the most positive NES.

+
gsea_result_df %>%
+  # This returns the 3 rows with the largest NES values
+  dplyr::slice_max(n = 3, order_by = NES)
+
+ +
+

The gene set HALLMARK_TNFA_SIGNALING_VIA_NFKB has the most positive NES.

+
most_positive_nes_plot <- enrichplot::gseaplot(
+  gsea_results,
+  geneSetID = "HALLMARK_TNFA_SIGNALING_VIA_NFKB",
+  title = "HALLMARK_TNFA_SIGNALING_VIA_NFKB",
+  color.line = "#0d76ff"
+)
+
+most_positive_nes_plot
+

+

Notice how the genes that are in the gene set, indicated by the black bars, tend to be on the left side of the graph, indicating that they have positive gene-level scores. The red dashed line indicates the enrichment score, which is the maximum deviation from zero. As mentioned earlier, an enrichment score is calculated by starting with the most highly ranked genes (according to the gene-level t-statistic values) and increasing the score when a gene is in the pathway and decreasing the score when a gene is not in the pathway.

+

The plots returned by enrichplot::gseaplot are ggplots, so we can use ggplot2::ggsave() to save them to file.

+

Let’s save to PNG.

+
ggplot2::ggsave(file.path(plots_dir, "GSE71270_gsea_enrich_positive_plot.png"),
+  plot = most_positive_nes_plot
+)
+
## Saving 7 x 5 in image
+
+
+

4.7.2 Most Negative NES

+

Let’s look for the 3 gene sets with the most negative NES.

+
gsea_result_df %>%
+  # Return the 3 rows with the smallest (most negative) NES values
+  dplyr::slice_min(n = 3, order_by = NES)
+
+ +
+

The gene set HALLMARK_E2F_TARGETS has the most negative NES.

+
most_negative_nes_plot <- enrichplot::gseaplot(
+  gsea_results,
+  geneSetID = "HALLMARK_E2F_TARGETS",
+  title = "HALLMARK_E2F_TARGETS",
+  color.line = "#0d76ff"
+)
+
+most_negative_nes_plot
+

+

This gene set shows the opposite pattern – genes in the pathway tend to be on the right side of the graph. Again, the red dashed line here indicates the maximum deviation from zero, in other words, the enrichment score. A negative enrichment score will be returned when many genes in the gene set are near the bottom of the ranked list.

+

Let’s save this plot to PNG as well.

+
ggplot2::ggsave(file.path(plots_dir, "GSE71270_gsea_enrich_negative_plot.png"),
+  plot = most_negative_nes_plot
+)
+
## Saving 7 x 5 in image
+
+
+
+

4.8 Write results to file

+
readr::write_tsv(
+  gsea_result_df,
+  file.path(
+    results_dir,
+    "GSE71270_gsea_results.tsv"
+  )
+)
+
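One way to sanity-check a results file is to read it back in and compare it with the data frame that was written. The sketch below uses base R and a temporary file so it runs without extra packages; the toy data frame and its values are hypothetical stand-ins, not the GSEA results above.

```r
# Hypothetical results table (illustrative values only)
toy_results_df <- data.frame(
  ID = c("SET_A", "SET_B"),
  NES = c(1.8, -2.1),
  stringsAsFactors = FALSE
)

# Write to a temporary TSV file
toy_file <- tempfile(fileext = ".tsv")
write.table(
  toy_results_df,
  file = toy_file,
  sep = "\t",
  quote = FALSE,
  row.names = FALSE
)

# Read the file back in and compare with what we wrote
read_back_df <- read.delim(toy_file, stringsAsFactors = FALSE)
round_trip_ok <- isTRUE(all.equal(toy_results_df, read_back_df))
round_trip_ok
```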
+
+
+

5 Resources for further learning

+ +
+
+

6 Session info

+

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

+
# Print session info
+sessioninfo::session_info()
+
## ─ Session info ─────────────────────────────────────────────────────
+##  setting  value                       
+##  version  R version 4.0.2 (2020-06-22)
+##  os       Ubuntu 20.04 LTS            
+##  system   x86_64, linux-gnu           
+##  ui       X11                         
+##  language (EN)                        
+##  collate  en_US.UTF-8                 
+##  ctype    en_US.UTF-8                 
+##  tz       Etc/UTC                     
+##  date     2020-12-21                  
+## 
+## ─ Packages ─────────────────────────────────────────────────────────
+##  package         * version  date       lib source        
+##  AnnotationDbi   * 1.52.0   2020-10-27 [1] Bioconductor  
+##  assertthat        0.2.1    2019-03-21 [1] RSPM (R 4.0.0)
+##  backports         1.1.10   2020-09-15 [1] RSPM (R 4.0.2)
+##  Biobase         * 2.50.0   2020-10-27 [1] Bioconductor  
+##  BiocGenerics    * 0.36.0   2020-10-27 [1] Bioconductor  
+##  BiocManager       1.30.10  2019-11-16 [1] RSPM (R 4.0.0)
+##  BiocParallel      1.24.1   2020-11-06 [1] Bioconductor  
+##  bit               4.0.4    2020-08-04 [1] RSPM (R 4.0.2)
+##  bit64             4.0.5    2020-08-30 [1] RSPM (R 4.0.2)
+##  blob              1.2.1    2020-01-20 [1] RSPM (R 4.0.0)
+##  cli               2.1.0    2020-10-12 [1] RSPM (R 4.0.2)
+##  clusterProfiler * 3.18.0   2020-10-27 [1] Bioconductor  
+##  colorspace        1.4-1    2019-03-18 [1] RSPM (R 4.0.0)
+##  cowplot           1.1.0    2020-09-08 [1] RSPM (R 4.0.2)
+##  crayon            1.3.4    2017-09-16 [1] RSPM (R 4.0.0)
+##  data.table        1.13.0   2020-07-24 [1] RSPM (R 4.0.2)
+##  DBI               1.1.0    2019-12-15 [1] RSPM (R 4.0.0)
+##  digest            0.6.25   2020-02-23 [1] RSPM (R 4.0.0)
+##  DO.db             2.9      2020-12-01 [1] Bioconductor  
+##  DOSE              3.16.0   2020-10-27 [1] Bioconductor  
+##  downloader        0.4      2015-07-09 [1] RSPM (R 4.0.0)
+##  dplyr             1.0.2    2020-08-18 [1] RSPM (R 4.0.2)
+##  ellipsis          0.3.1    2020-05-15 [1] RSPM (R 4.0.0)
+##  enrichplot        1.10.1   2020-11-14 [1] Bioconductor  
+##  evaluate          0.14     2019-05-28 [1] RSPM (R 4.0.0)
+##  fansi             0.4.1    2020-01-08 [1] RSPM (R 4.0.0)
+##  farver            2.0.3    2020-01-16 [1] RSPM (R 4.0.0)
+##  fastmatch         1.1-0    2017-01-28 [1] RSPM (R 4.0.0)
+##  fgsea             1.16.0   2020-10-27 [1] Bioconductor  
+##  generics          0.0.2    2018-11-29 [1] RSPM (R 4.0.0)
+##  getopt            1.20.3   2019-03-22 [1] RSPM (R 4.0.0)
+##  ggforce           0.3.2    2020-06-23 [1] RSPM (R 4.0.2)
+##  ggplot2           3.3.2    2020-06-19 [1] RSPM (R 4.0.1)
+##  ggraph            2.0.3    2020-05-20 [1] RSPM (R 4.0.2)
+##  ggrepel           0.8.2    2020-03-08 [1] RSPM (R 4.0.2)
+##  glue              1.4.2    2020-08-27 [1] RSPM (R 4.0.2)
+##  GO.db             3.12.1   2020-12-01 [1] Bioconductor  
+##  GOSemSim          2.16.1   2020-10-29 [1] Bioconductor  
+##  graphlayouts      0.7.0    2020-04-25 [1] RSPM (R 4.0.2)
+##  gridExtra         2.3      2017-09-09 [1] RSPM (R 4.0.0)
+##  gtable            0.3.0    2019-03-25 [1] RSPM (R 4.0.0)
+##  hms               0.5.3    2020-01-08 [1] RSPM (R 4.0.0)
+##  htmltools         0.5.0    2020-06-16 [1] RSPM (R 4.0.1)
+##  igraph            1.2.6    2020-10-06 [1] RSPM (R 4.0.2)
+##  IRanges         * 2.24.0   2020-10-27 [1] Bioconductor  
+##  jsonlite          1.7.1    2020-09-07 [1] RSPM (R 4.0.2)
+##  knitr             1.30     2020-09-22 [1] RSPM (R 4.0.2)
+##  labeling          0.3      2014-08-23 [1] RSPM (R 4.0.0)
+##  lattice           0.20-41  2020-04-02 [2] CRAN (R 4.0.2)
+##  lifecycle         0.2.0    2020-03-06 [1] RSPM (R 4.0.0)
+##  magrittr        * 1.5      2014-11-22 [1] RSPM (R 4.0.0)
+##  MASS              7.3-51.6 2020-04-26 [2] CRAN (R 4.0.2)
+##  Matrix            1.2-18   2019-11-27 [2] CRAN (R 4.0.2)
+##  memoise           1.1.0    2017-04-21 [1] RSPM (R 4.0.0)
+##  msigdbr         * 7.2.1    2020-10-02 [1] RSPM (R 4.0.2)
+##  munsell           0.5.0    2018-06-12 [1] RSPM (R 4.0.0)
+##  optparse        * 1.6.6    2020-04-16 [1] RSPM (R 4.0.0)
+##  org.Dr.eg.db    * 3.12.0   2020-12-01 [1] Bioconductor  
+##  pillar            1.4.6    2020-07-10 [1] RSPM (R 4.0.2)
+##  pkgconfig         2.0.3    2019-09-22 [1] RSPM (R 4.0.0)
+##  plyr              1.8.6    2020-03-03 [1] RSPM (R 4.0.2)
+##  polyclip          1.10-0   2019-03-14 [1] RSPM (R 4.0.0)
+##  ps                1.4.0    2020-10-07 [1] RSPM (R 4.0.2)
+##  purrr             0.3.4    2020-04-17 [1] RSPM (R 4.0.0)
+##  qvalue            2.22.0   2020-10-27 [1] Bioconductor  
+##  R.cache           0.14.0   2019-12-06 [1] RSPM (R 4.0.0)
+##  R.methodsS3       1.8.1    2020-08-26 [1] RSPM (R 4.0.2)
+##  R.oo              1.24.0   2020-08-26 [1] RSPM (R 4.0.2)
+##  R.utils           2.10.1   2020-08-26 [1] RSPM (R 4.0.2)
+##  R6                2.4.1    2019-11-12 [1] RSPM (R 4.0.0)
+##  RColorBrewer      1.1-2    2014-12-07 [1] RSPM (R 4.0.0)
+##  Rcpp              1.0.5    2020-07-06 [1] RSPM (R 4.0.2)
+##  readr             1.4.0    2020-10-05 [1] RSPM (R 4.0.2)
+##  rematch2          2.1.2    2020-05-01 [1] RSPM (R 4.0.0)
+##  reshape2          1.4.4    2020-04-09 [1] RSPM (R 4.0.2)
+##  rlang             0.4.8    2020-10-08 [1] RSPM (R 4.0.2)
+##  rmarkdown         2.4      2020-09-30 [1] RSPM (R 4.0.2)
+##  RSQLite           2.2.1    2020-09-30 [1] RSPM (R 4.0.2)
+##  rstudioapi        0.11     2020-02-07 [1] RSPM (R 4.0.0)
+##  rvcheck           0.1.8    2020-03-01 [1] RSPM (R 4.0.0)
+##  S4Vectors       * 0.28.0   2020-10-27 [1] Bioconductor  
+##  scales            1.1.1    2020-05-11 [1] RSPM (R 4.0.0)
+##  scatterpie        0.1.5    2020-09-09 [1] RSPM (R 4.0.2)
+##  sessioninfo       1.1.1    2018-11-05 [1] RSPM (R 4.0.0)
+##  shadowtext        0.0.7    2019-11-06 [1] RSPM (R 4.0.0)
+##  stringi           1.5.3    2020-09-09 [1] RSPM (R 4.0.2)
+##  stringr           1.4.0    2019-02-10 [1] RSPM (R 4.0.0)
+##  styler            1.3.2    2020-02-23 [1] RSPM (R 4.0.0)
+##  tibble            3.0.4    2020-10-12 [1] RSPM (R 4.0.2)
+##  tidygraph         1.2.0    2020-05-12 [1] RSPM (R 4.0.2)
+##  tidyr             1.1.2    2020-08-27 [1] RSPM (R 4.0.2)
+##  tidyselect        1.1.0    2020-05-11 [1] RSPM (R 4.0.0)
+##  tweenr            1.0.1    2018-12-14 [1] RSPM (R 4.0.2)
+##  vctrs             0.3.4    2020-08-29 [1] RSPM (R 4.0.2)
+##  viridis           0.5.1    2018-03-29 [1] RSPM (R 4.0.0)
+##  viridisLite       0.3.0    2018-02-01 [1] RSPM (R 4.0.0)
+##  withr             2.3.0    2020-09-22 [1] RSPM (R 4.0.2)
+##  xfun              0.18     2020-09-29 [1] RSPM (R 4.0.2)
+##  yaml              2.2.1    2020-02-01 [1] RSPM (R 4.0.0)
+## 
+## [1] /usr/local/lib/R/site-library
+## [2] /usr/local/lib/R/library
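If you would like to keep a copy of the session info alongside your results, one option is to capture the printed output and write it to a text file. This sketch uses base R's `sessionInfo()` rather than the `sessioninfo` package so that it runs without extra dependencies; the file name and the use of a temporary directory are assumptions for illustration.

```r
# Hypothetical output location; replace with a path in your results folder
session_info_file <- file.path(tempdir(), "session_info.txt")

# Capture the printed session info as text and write it to the file
writeLines(capture.output(sessionInfo()), session_info_file)

# Confirm the file was written
file.exists(session_info_file)
```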
+
+
+

References

+
+
+

Carlson M., 2019 Genome wide annotation for zebrafish. https://bioconductor.org/packages/release/data/annotation/html/org.Dr.eg.db.html

+
+
+

Dolgalev I., 2020 msigdbr: MSigDB gene sets for multiple organisms in a tidy data format. https://cran.r-project.org/web/packages/msigdbr/index.html

+
+
+

Khatri P., M. Sirota, and A. J. Butte, 2012 Ten years of pathway analysis: Current approaches and outstanding challenges. PLOS Computational Biology 8: e1002375. https://doi.org/10.1371/journal.pcbi.1002375

+
+
+

Liberzon A., C. Birger, H. Thorvaldsdóttir, M. Ghandi, and J. P. Mesirov et al., 2015 The molecular signatures database hallmark gene set collection. Cell Systems 1. https://doi.org/10.1016/j.cels.2015.12.004

+
+
+

Liberzon A., A. Subramanian, R. Pinchback, H. Thorvaldsdóttir, and P. Tamayo et al., 2011 Molecular signatures database (MSigDB) 3.0. Bioinformatics 27: 1739–1740. https://doi.org/10.1093/bioinformatics/btr260

+
+
+

Ritchie M. E., B. Phipson, D. Wu, Y. Hu, and C. W. Law et al., 2015 limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43: e47. https://doi.org/10.1093/nar/gkv007

+
+
+

Subramanian A., P. Tamayo, V. K. Mootha, S. Mukherjee, and B. L. Ebert et al., 2005 Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 102: 15545–15550. https://doi.org/10.1073/pnas.0506580102

+
+
+

Tregnago C., E. Manara, M. Zampini, V. Bisio, and C. Borga et al., 2016 CREB engages C/EBPδ to initiate leukemogenesis. Leukemia 30: 1887–1896. https://doi.org/10.1038/leu.2016.98

+
+
+

UC San Diego and Broad Institute Team, GSEA: Gene set enrichment analysis. https://www.gsea-msigdb.org/gsea/index.jsp

+
+
+

Yu G., L.-G. Wang, Y. Han, and Q.-Y. He, 2012 clusterProfiler: An R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 16: 284–287. https://doi.org/10.1089/omi.2011.0118

+
+
+

Yu G., 2020 clusterProfiler: Universal enrichment tool for functional and comparative study. http://yulab-smu.top/clusterProfiler-book/index.html

+
+
+
+ + + + +
+
+ +
+ + + + + + + + + + + + + + + + diff --git a/02-microarray/pathway-analysis_microarray_03_gsva.Rmd b/02-microarray/pathway-analysis_microarray_03_gsva.Rmd new file mode 100644 index 00000000..023bbe7c --- /dev/null +++ b/02-microarray/pathway-analysis_microarray_03_gsva.Rmd @@ -0,0 +1,757 @@ +--- +title: "Gene set variation analysis - Microarray" +author: "CCDL for ALSF" +date: "November 2020" +output: + html_notebook: + toc: true + toc_float: true + number_sections: true +--- + +# Purpose of this analysis + +This example is part of the pathway analysis module set; we recommend looking at the [pathway analysis table below](#how-to-choose-a-pathway-analysis) to help you determine which pathway analysis method is best suited for your purposes. + +In this example we will cover a method called Gene Set Variation Analysis (GSVA) to calculate gene set or pathway scores on a per-sample basis [@Hanzelmann2013]. +You can use the GSVA scores for other downstream analyses. +In this analysis, we will test GSVA scores for differential expression. + +⬇️ [**Jump to the analysis code**](#analysis) ⬇️ + +### What is pathway analysis? + +Pathway analysis refers to any one of many techniques that use predetermined sets of genes that are related or coordinated in their expression in some way (e.g., participate in the same molecular process, are regulated by the same transcription factor) to interpret a high-throughput experiment. +In the context of [refine.bio](https://www.refine.bio/), we use these techniques to analyze and interpret genome-wide gene expression experiments. +The rationale for performing pathway analysis is that looking at the pathway level may be more biologically meaningful than considering individual genes, especially if a large number of genes are differentially expressed between conditions of interest. 
+In addition, many relatively small changes in the expression values of genes in the same pathway could lead to a phenotypic outcome and these small changes may go undetected in differential gene expression analysis. + +We highly recommend taking a look at [Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002375) from @Khatri2012 for a more comprehensive overview. +We have provided primary publications and documentation of the methods we will introduce below as well as some recommended reading in the [`Resources for further learning` section](#resources-for-further-learning). + +### How to choose a pathway analysis? + +This table summarizes the pathway analysis examples in this module.
+
+|Analysis|What is required for input|What output looks like|✅ Pros|⚠️ Cons|
+|--------|--------------------------|-----------------------|-------|-------|
+|[**ORA (Over-representation Analysis)**](https://alexslemonade.github.io/refinebio-examples/02-microarray/pathway-analysis_microarray_01_ora.html)|A list of gene IDs (no stats needed)|A per-pathway hypergeometric test result|- Simple <br> - Computationally inexpensive to calculate p-values|- Requires arbitrary thresholds and ignores any statistics associated with a gene <br> - Assumes independence of genes and pathways|
+|[**GSEA (Gene Set Enrichment Analysis)**](https://alexslemonade.github.io/refinebio-examples/02-microarray/pathway-analysis_microarray_02_gsea.html)|A list of gene IDs with gene-level summary statistics|A per-pathway enrichment score|- Includes all genes (no arbitrary threshold!) <br> - Attempts to measure coordination of genes|- Permutations can be expensive <br> - Does not account for pathway overlap <br> - Two-group comparisons not always appropriate/feasible|
+|[**GSVA (Gene Set Variation Analysis)**](https://alexslemonade.github.io/refinebio-examples/02-microarray/pathway-analysis_microarray_03_gsva.html)|A gene expression matrix (like what you get from refine.bio directly)|Pathway-level scores on a per-sample basis|- Does not require two groups to compare upfront <br> - Normally distributed scores|- Scores are not a good fit for gene sets that contain genes that go up AND down <br> - Method doesn’t assign statistical significance itself <br> - Recommended sample size n > 10|
+
+
+# How to run this example + +For general information about our tutorials and the basic software packages you will need, please see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-this-tutorial-is-structured). +We recommend taking a look at our [Resources for Learning R](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#resources-for-learning-r) if you have not written code in R before. + +## Obtain the `.Rmd` file + +To run this example yourself, [download the `.Rmd` for this analysis by clicking this link](https://alexslemonade.github.io/refinebio-examples/02-microarray/pathway-analysis_microarray_03_gsva.Rmd). + +Clicking this link will most likely send this to your downloads folder on your computer. +Move this `.Rmd` file to where you would like this example and its files to be stored. + +You can open this `.Rmd` file in RStudio and follow the rest of these steps from there. (See our [section about getting started with R notebooks](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) if you are unfamiliar with `.Rmd` files.) + +## Set up your analysis folders + +Good file organization is helpful for keeping your data analysis project on track! +We have set up some code that will automatically set up a folder structure for you. +Run this next chunk to set up your folders! + +If you have trouble running this chunk, see our [introduction to using `.Rmd`s](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) for more resources and explanations. 
+ +```{r} +# Create the data folder if it doesn't exist +if (!dir.exists("data")) { + dir.create("data") +} + +# Define the file path to the plots directory +plots_dir <- "plots" + +# Create the plots folder if it doesn't exist +if (!dir.exists(plots_dir)) { + dir.create(plots_dir) +} + +# Define the file path to the results directory +results_dir <- "results" + +# Create the results folder if it doesn't exist +if (!dir.exists(results_dir)) { + dir.create(results_dir) +} + +# Define the file path to the gene_sets directory +gene_sets_dir <- "gene_sets" + +# Create the gene_sets folder if it doesn't exist +if (!dir.exists(gene_sets_dir)) { + dir.create(gene_sets_dir) +} +``` + +In the same place you put this `.Rmd` file, you should now have four new empty folders called `data`, `plots`, `results`, and `gene_sets`! + +## Obtain the dataset from refine.bio + +For general information about downloading data for these examples, see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-the-data). + +Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/GSE37382). + +Click the "Download Now" button on the right side of this screen. + + + +Fill out the pop up window with your email and our Terms and Conditions: + + + +It may take a few minutes for the dataset to process. +You will get an email when it is ready. + +## About the dataset we are using for this example + +For this example analysis, we will use this [medulloblastoma dataset](https://www.refine.bio/experiments/GSE37382) [@Northcott2012]. +The data that we downloaded from refine.bio for this analysis has 285 microarray samples obtained from patients with medulloblastoma. +Medulloblastoma is the most common childhood brain cancer and is often categorized by subgroups. +We will use these `subgroup` labels from our metadata to perform differential expression with our GSVA scores. 
+ +## Place the dataset in your new `data/` folder + +refine.bio will send you a download button in the email when it is ready. +Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in `.zip`. +Double clicking should unzip this for you and create a folder of the same name. + + + +For more details on the contents of this folder see [these docs on refine.bio](http://docs.refine.bio/en/latest/main_text.html#downloadable-files). + +The `GSE37382` folder has the data and metadata TSV files you will need for this example analysis. +Experiment accession ids usually look something like `GSE1235` or `SRP12345`. + +Copy and paste the `GSE37382` folder into your newly created `data/` folder. + +## Check out our file structure! + +Your new analysis folder should contain: + +- The example analysis `.Rmd` you downloaded +- A folder called "data" which contains: + - The `GSE37382` folder which contains: + - The gene expression + - The metadata TSV +- A folder for `plots` (currently empty) +- A folder for `results` (currently empty) +- A folder for `gene_sets` (currently empty) + +Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using): + + + +In order for our example here to run without a hitch, we need these files to be in these locations so we've constructed a test to check before we get started with the analysis. +These chunks will declare your file paths and double check that your files are in the right place. + +First we will declare our file paths to our data and metadata files, which should be in our data directory. +This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started. 
+ +```{r} +# Define the file path to the data directory +# Replace with the path of the folder the files will be in +data_dir <- file.path("data", "GSE37382") + +# Declare the file path to the gene expression matrix file +# inside directory saved as `data_dir` +# Replace with the path to your dataset file +data_file <- file.path(data_dir, "GSE37382.tsv") + +# Declare the file path to the metadata file +# inside the directory saved as `data_dir` +# Replace with the path to your metadata file +metadata_file <- file.path(data_dir, "metadata_GSE37382.tsv") +``` + +Now that our file paths are declared, we can use the `file.exists()` function to check that the files are where we specified above. + +```{r} +# Check if the gene expression matrix file is at the path stored in `data_file` +file.exists(data_file) + +# Check if the metadata file is at the file path stored in `metadata_file` +file.exists(metadata_file) +``` + +If the chunk above printed out `FALSE` to either of those tests, you won't be able to run this analysis _as is_ until those files are in the appropriate place. + +If the concept of a "file path" is unfamiliar to you; we recommend taking a look at our [section about file paths](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#an-important-note-about-file-paths-and-Rmds). + +# Using a different refine.bio dataset with this analysis? + +If you'd like to adapt an example analysis to use a different dataset from [refine.bio](https://www.refine.bio/), we recommend placing the files in the `data/` directory you created and changing the filenames and paths in the notebook to match these files (we've put comments to signify where you would need to change the code). +We suggest saving plots and results to `plots/` and `results/` directories, respectively, as these are automatically created by the notebook. +From here you can customize this analysis example to fit your own scientific questions and preferences. 
+ +*** + +   + +# Gene set variation analysis - Microarray + +## Install libraries + +See our Getting Started page with [instructions for package installation](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#what-you-need-to-install) for a list of the other software you will need, as well as more tips and resources. + +In this analysis, we will be using the [`GSVA`](https://www.bioconductor.org/packages/release/bioc/html/GSVA.html) package to perform GSVA and the [`qusage`](https://www.bioconductor.org/packages/release/bioc/html/qusage.html) package to read in the GMT file containing the gene set data [@Hanzelmann2013; @Yaari2013]. + +We will also need the [`org.Hs.eg.db`](https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html) package to perform gene identifier conversion [@Carlson2020-human]. + +We'll also be performing differential expression test on our GSVA scores, and for that we will use `limma` [@Ritchie2015] and we'll make a sina plot of the scores of our most significant pathway using a ggplot2 companion package, `ggforce`. + +```{r} +if (!("GSVA" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("GSVA", update = FALSE) +} + +if (!("qusage" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("qusage", update = FALSE) +} + +if (!("org.Hs.eg.db" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("org.Hs.eg.db", update = FALSE) +} + +if (!("limma" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("limma", update = FALSE) +} + +if (!("ggforce" %in% installed.packages())) { + # Install this package if it isn't installed yet + install.packages("ggforce") +} +``` + +Attach the packages we need for this analysis. 
+ +```{r message=FALSE} +# Attach the `qusage` library +library(qusage) + +# Attach the `GSVA` library +library(GSVA) + +# Human annotation package we'll use for gene identifier conversion +library(org.Hs.eg.db) + +# Attach the ggplot2 package for plotting +library(ggplot2) + +# We will need this so we can use the pipe: %>% +library(magrittr) +``` + +## Import and set up data + +Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. +This chunk of code will read both TSV files and add them as data frames to your environment. + +We stored our file paths as objects named `metadata_file` and `data_file` in [this previous step](#check-out-our-file-structure). + +```{r} +# Read in metadata TSV file +metadata <- readr::read_tsv(metadata_file) + +# Read in data TSV file +expression_df <- readr::read_tsv(data_file) +``` + +Let’s ensure that the metadata and data are in the same sample order. + +```{r} +# Make the data in the order of the metadata +expression_df <- expression_df %>% + dplyr::select(c(Gene, metadata$geo_accession)) + +# Check if this is in the same order +all.equal(colnames(expression_df)[-1], metadata$geo_accession) +``` + +### Import Gene Sets + +The Molecular Signatures Database (MSigDB) is a resource that contains annotated gene sets that can be used for pathway or gene set analyses [@Subramanian2005; @Liberzon2011]. +MSigDB contains [8 different gene set collections](https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp) [@Subramanian2005; @Liberzon2011] that are distinguished by how they are derived (e.g., computationally mined, curated). + +In this example, we will use a collection called Hallmark gene sets for GSVA [@Liberzon2015]. +Here's an excerpt of [the collection description from MSigDB](https://www.gsea-msigdb.org/gsea/msigdb/collection_details.jsp#H): + +> Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. 
+> These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression. + +The function that we will use to run GSVA requires the gene sets to be in a list. +We are going to download a [GMT file](https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29) that contains the Hallmark gene sets [@Liberzon2015] from MSigDB [@Subramanian2005; @Liberzon2011] to the `gene_sets` directory. + +Note that when downloading GMT files from MSigDB, only _Homo sapiens_ gene sets are supported. +If you'd like to use MSigDB gene sets with GSVA for a commonly studied model organism, check out [our RNA-seq GSVA example](http://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_03_gsva.html). + +```{r} +hallmarks_gmt_url <- "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.2/h.all.v7.2.entrez.gmt" +``` + +We will also declare a file path for where we want this file to be downloaded, and we can use the same file path later for reading the file into R. + +```{r} +hallmarks_gmt_file <- file.path( + gene_sets_dir, + "h.all.v7.2.entrez.gmt" +) +``` + +Using the URL (`hallmarks_gmt_url`) and file path (`hallmarks_gmt_file`) we can download the file and use the `destfile` argument to specify where it should be saved. + +```{r} +download.file( + hallmarks_gmt_url, + # The file will be saved to this location and with this name + destfile = hallmarks_gmt_file +) +``` + +Now let's double check that the file that contains the gene sets is in the right place. + +```{r} +# Check if the file exists +file.exists(hallmarks_gmt_file) +``` + +Now we're ready to read the file into R with `qusage::read.gmt()`. 
+ +```{r} +# QuSAGE is another pathway analysis method, the `qusage` package has a function +# for reading GMT files and turning them into a list that we can use with GSVA +hallmarks_list <- qusage::read.gmt(hallmarks_gmt_file) +``` + +What does this `hallmarks_list` look like? + +```{r} +head(hallmarks_list, n = 2) +``` + +Looks like we have a list of gene sets with associated Entrez IDs. + +In our gene expression data frame, `expression_df` we have Ensembl gene identifiers. +So we will need to convert our Ensembl IDs into Entrez IDs for GSVA. + +### Gene identifier conversion + +We're going to convert our identifiers in `expression_df` to Entrez IDs, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers! + +The annotation package `org.Hs.eg.db` contains information for different identifiers [@Carlson2020-human]. +`org.Hs.eg.db` is specific to _Homo sapiens_ -- this is what the `Hs` in the package name is referencing. + +We can see what types of IDs are available to us in an annotation package with `keytypes()`. + +```{r} +keytypes(org.Hs.eg.db) +``` + +Even though we'll use this package to convert from Ensembl gene IDs (`ENSEMBL`) to Entrez IDs (`ENTREZID`), we could just as easily use it to convert from an Ensembl transcript ID (`ENSEMBLTRANS`) to gene symbols (`SYMBOL`). + +Take a look at our other gene identifier conversion examples for examples with different species and gene ID types: +[the microarray example](https://alexslemonade.github.io/refinebio-examples/02-microarray/gene-id-annotation_microarray_01_ensembl.html) and [the RNA-seq example](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.html). + +The function we will use to map from Ensembl gene IDs to Entrez gene IDs is called `mapIds()` and comes from the `AnnotationDbi` package. 
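To build intuition for what keeping only the first mapped value means (the `multiVals = "first"` option used with `mapIds()` in this notebook), here is a toy sketch in base R. The lookup table and identifiers are hypothetical and the code does not call `AnnotationDbi`; it only mimics the "first match wins" behavior with `match()`.

```r
# Hypothetical lookup table: ENSG_B maps to two Entrez-style IDs
toy_lookup <- data.frame(
  ensembl = c("ENSG_A", "ENSG_B", "ENSG_B", "ENSG_C"),
  entrez = c("101", "202", "203", "303"),
  stringsAsFactors = FALSE
)

# Hypothetical keys to convert; ENSG_D is absent from the lookup table
toy_keys <- c("ENSG_A", "ENSG_B", "ENSG_D")

# match() returns the position of the first match, or NA when there is none
first_mapped <- toy_lookup$entrez[match(toy_keys, toy_lookup$ensembl)]
names(first_mapped) <- toy_keys

first_mapped
```

Identifiers that come back as `NA` in a sketch like this correspond to the rows that get dropped with `dplyr::filter(!is.na(entrez_id))` in the notebook.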
+ +Let's create a data frame that shows the mapped Entrez IDs along with the gene expression values for the respective Ensembl IDs. + +```{r} +# First let's create a mapped data frame we can join to the gene expression +# values +mapped_df <- data.frame( + "entrez_id" = mapIds( + # Replace with annotation package for the organism relevant to your data + org.Hs.eg.db, + keys = expression_df$Gene, + # Replace with the type of gene identifiers in your data + keytype = "ENSEMBL", + # Replace with the type of gene identifiers you would like to map to + column = "ENTREZID", + # This will keep only the first mapped value for each Ensembl ID + multiVals = "first" + ) +) %>% + # If an Ensembl gene identifier doesn't map to an Entrez gene identifier, drop + # that from the data frame + dplyr::filter(!is.na(entrez_id)) %>% + # Make an `Ensembl` column to store the row names + tibble::rownames_to_column("Ensembl") %>% + # Now let's join the rest of the expression data + dplyr::inner_join(expression_df, by = c("Ensembl" = "Gene")) +``` + +This `1:many mapping between keys and columns` message means that some Ensembl gene identifiers map to multiple Entrez IDs. +In this case, it's also possible that an Entrez ID will map to multiple Ensembl IDs. +For the purpose of performing GSVA later in this notebook, we keep only the first mapped IDs. +For more about how to explore this, take a look at our [microarray gene ID conversion example](https://alexslemonade.github.io/refinebio-examples/02-microarray/gene-id-annotation_microarray_01_ensembl.html). + +Let's see a preview of `mapped_df`. + +```{r rownames.print=FALSE} +head(mapped_df) +``` + +We will want to keep in mind that GSVA requires that data is in a matrix with the gene identifiers as row names. +In order to successfully turn our data frame into a matrix, we will need to ensure that we do not have any duplicate gene identifiers. + +Let's check to see if we have any Entrez IDs that mapped to multiple Ensembl IDs. 
+ +```{r} +any(duplicated(mapped_df$entrez_id)) +``` + +Looks like we do have duplicated Entrez IDs. +Let's find out which ones. + +```{r} +dup_entrez_ids <- mapped_df %>% + dplyr::filter(duplicated(entrez_id)) %>% + dplyr::pull(entrez_id) + +dup_entrez_ids +``` + +#### Handling duplicate gene identifiers + +As we mentioned earlier, we will not want any duplicate gene identifiers in our data frame when we convert it into a matrix in preparation for the GSVA steps later. +GSVA is executed on a per sample basis so let's keep the maximum expression value per sample associated with the duplicate Entrez gene identifiers. +In other words, we will keep only the maximum expression value found across the duplicate Entrez gene identifier instances for each sample or column. + +Let's take a look at the rows associated with the duplicated Entrez IDs and see how this will play out. +```{r} +mapped_df %>% + dplyr::filter(entrez_id %in% dup_entrez_ids) +``` + +As an example using the strategy we described, for `GSM917111`'s data in the first column, `0.2294387` is larger than `0.1104345` so moving forward, Entrez gene `6013` will have `0.2294387` value and the `0.1104345` would be dropped from the dataset. +This is just one method of handling duplicate gene identifiers. +See the [Gene Set Enrichment Analysis (GSEA) User guide](https://www.gsea-msigdb.org/gsea/doc/GSEAUserGuideFrame.html) for more information on other commonly used strategies, such as taking the median expression value. + +Now, let's implement the maximum value method for all samples and Entrez IDs using tidyverse functions. 
+```{r} +max_dup_df <- mapped_df %>% + # We won't be using Ensembl IDs moving forward, so we will drop this column + dplyr::select(-Ensembl) %>% + # Filter to include only the rows associated with the duplicate Entrez gene + # identifiers + dplyr::filter(entrez_id %in% dup_entrez_ids) %>% + # Group by Entrez IDs + dplyr::group_by(entrez_id) %>% + # Get the maximum expression value using all values associated with each + # duplicate Entrez ID for each column or sample in this case + dplyr::summarize_all(max) + +max_dup_df +``` + +We can see `GSM917111` now has the `0.2294387` value for Entrez ID `6013`, as expected. +Looks like we were able to successfully get rid of the duplicate Entrez gene identifiers! + +Now let's combine our newly de-duplicated data with the rest of the mapped data! + +```{r} +filtered_mapped_df <- mapped_df %>% + # We won't be using Ensembl IDs moving forward, so we will drop this column + dplyr::select(-Ensembl) %>% + # First let's get the data associated with the Entrez gene identifiers that + # aren't duplicated + dplyr::filter(!entrez_id %in% dup_entrez_ids) %>% + # Now let's bind the rows of the maximum expression data we stored in + # `max_dup_df` + dplyr::bind_rows(max_dup_df) +``` + +As mentioned earlier, we need a matrix for GSVA. +Let's now convert our data frame into a matrix and prepare our object for GSVA. + +```{r} +filtered_mapped_matrix <- filtered_mapped_df %>% + # We need to store our gene identifiers as row names + tibble::column_to_rownames("entrez_id") %>% + # Now we can convert our object into a matrix + as.matrix() +``` + +Note that if we had duplicate gene identifiers here, we would not be able to set them as row names. + +## GSVA - Microarray + +GSVA fits a model and ranks genes based on their expression level relative to the sample distribution [@Hanzelmann2013]. 
+The pathway-level score calculated is a way of asking how genes _within_ a gene set vary as compared to genes that are _outside_ of that gene set [@Malhotra2018]. + +The idea here is that we will get pathway-level scores for each sample that indicate if genes in a pathway vary concordantly in one direction (over-expressed or under-expressed relative to the overall population) [@Hanzelmann2013]. +This means that GSVA scores will depend on the samples included in the dataset when you run GSVA; if you added more samples and ran GSVA again, you would expect the scores to change [@Hanzelmann2013]. + +The output is a gene set by sample matrix of GSVA scores. + +### Perform GSVA + +Let's perform GSVA using the `gsva()` function. +See `?gsva` for more options. + +```{r} +gsva_results <- gsva( + filtered_mapped_matrix, + hallmarks_list, + method = "gsva", + # Appropriate for our log2-transformed microarray data + kcdf = "Gaussian", + # Minimum gene set size + min.sz = 15, + # Maximum gene set size + max.sz = 500, + # Compute Gaussian-distributed scores + mx.diff = TRUE, + # Don't print out the progress bar + verbose = FALSE +) +``` + +Let's explore what the output of `gsva()` looks like. + +```{r} +# First 6 rows, first 10 columns +head(gsva_results[, 1:10]) +``` + +## Find differentially expressed pathways + +If we want to identify most differentially expressed pathways across subgroups, we can use functionality in the `limma` package to test the GSVA scores. + +This is one approach for working with GSVA scores; the `mx.diff = TRUE` argument that we supplied to the `gsva()` function in the previous section means the GSVA output scores should be normally distributed, which can be helpful if you want to perform downstream analyses with approaches that make assumptions of normality [@Hanzelmann-gsva-vignette]. + +### Create the design matrix + +`limma` needs a numeric design matrix to indicate which subtype of medulloblastoma a sample originates from. 
+Now we will create a model matrix based on our `subgroup` variable. +We are using a `+ 0` in the model which sets the intercept to 0 so the subgroup effects capture expression for that group, rather than difference from the first group. +If you have a control group, you might want to leave off the `+ 0` so the model includes an intercept representing the control group expression level, with the remaining coefficients the changes relative to that expression level. + +```{r} +# Create the design matrix +des_mat <- model.matrix(~ subgroup + 0, data = metadata) +``` + +Let's take a look at the design matrix we created. + +```{r} +# Print out part of the design matrix +head(des_mat) +``` + +The design matrix column names are a bit messy, so we will neaten them up by dropping the `subgroup` designation they all have and any spaces in names. + +```{r} +# Make the column names less messy +colnames(des_mat) <- stringr::str_remove(colnames(des_mat), "subgroup") + +# Do a similar thing but remove spaces in names +colnames(des_mat) <- stringr::str_remove(colnames(des_mat), " ") +``` + +## Perform differential expression on pathway scores + +Run the linear model on each pathway (each row of `gsva_results`) with `lmFit()` and apply empirical Bayes smoothing with `eBayes()`. + +```{r} +# Apply linear model to data +fit <- limma::lmFit(gsva_results, design = des_mat) + +# Apply empirical Bayes to smooth standard errors +fit <- limma::eBayes(fit) +``` + +Now that we have our basic model fitting, we will want to make the contrasts among all our groups. +Depending on your scientific questions, you will need to customize the next steps. +Consulting the [limma users guide](https://www.bioconductor.org/packages/devel/bioc/vignettes/limma/inst/doc/usersguide.pdf) for how to set up your model based on your hypothesis is a good idea. + +In this contrasts matrix, we are comparing each subgroup to the average of the other subgroups. 
+We're dividing by two in this expression so that each group is compared to the average of the other two groups (`makeContrasts()` doesn't allow you to use functions like `mean()`; it wants a formula). + +```{r} +contrast_matrix <- makeContrasts( + "G3vsOther" = Group3 - (Group4 + SHH) / 2, + "G4vsOther" = Group4 - (Group3 + SHH) / 2, + "SHHvsOther" = SHH - (Group3 + Group4) / 2, + levels = des_mat +) +``` + +Side note: If you did have a control group you wanted to compare each group to, you could make each contrast to that control group, so the formulae would look like `Group3 = Group3 - Control` for each one. +We highly recommend consulting the [limma users guide](https://bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf) for figuring out what your `makeContrasts()` and `model.matrix()` setups should look like [@Ritchie2015]. + +Now that we have the contrasts matrix set up, we can use it to re-fit the model and re-smooth it with `eBayes()`. + +```{r} +# Fit the model according to the contrasts matrix +contrasts_fit <- contrasts.fit(fit, contrast_matrix) + +# Re-smooth the Bayes +contrasts_fit <- eBayes(contrasts_fit) +``` + +Here's a [nifty article and example](http://varianceexplained.org/r/empirical_bayes_baseball/) about what the empirical Bayes smoothing is for [@bayes-estimates]. + +Now let's create the results table based on the contrasts fitted model. + +This step will provide the Benjamini-Hochberg multiple testing correction. +The `topTable()` function default is to use Benjamini-Hochberg but this can be changed to a different method using the `adjust.method` argument (see the `?topTable` help page for more about the options). + +```{r} +# Apply multiple testing correction and obtain stats +stats_df <- topTable(contrasts_fit, number = nrow(expression_df)) %>% + tibble::rownames_to_column("Gene") +``` + +Let's take a peek at our results table. 
+ +```{r rownames.print=FALSE} +head(stats_df) +``` + +For each pathway, each group's fold change in expression, compared to the average of the other groups is reported. + +By default, results are ordered from largest `F` value to the smallest, which means your most differentially expressed pathways across all groups should be toward the top. + +This means `HALLMARK_UNFOLDED_PROTEIN_RESPONSE` appears to be the pathway that contains the most significant distribution of scores across subgroups. + +## Visualizing Results + +Let's make a plot for our most significant pathway, `HALLMARK_UNFOLDED_PROTEIN_RESPONSE`. + +### Sina plot + +First we need to get our GSVA scores for this pathway into a long data frame, an appropriate format for `ggplot2`. + +Let's look at the current format of `gsva_results`. + +```{r} +head(gsva_results[, 1:10]) +``` + +We can see that they are in a wide format with the GSVA scores for each sample spread across a row associated with each pathway. + +Now let's convert these results into a data frame and into a long format, using the `tidyr::pivot_longer()` function. + +```{r rownames.print=FALSE} +annotated_results_df <- gsva_results %>% + # Make this into a data frame + data.frame() %>% + # Gene set names are row names + tibble::rownames_to_column("pathway") %>% + # Get into long format using the `tidyr::pivot_longer()` function + tidyr::pivot_longer( + cols = -pathway, + names_to = "sample", + values_to = "gsva_score" + ) + +# Preview the annotated results object +head(annotated_results_df) +``` + +Now let's filter to include only the scores associated with our most significant pathway, `HALLMARK_UNFOLDED_PROTEIN_RESPONSE`, and join the relevant group labels from the metadata for plotting. 
+ +```{r} +top_pathway_annotated_results_df <- annotated_results_df %>% + # Filter for only scores associated with our most significant pathway + dplyr::filter(pathway == "HALLMARK_UNFOLDED_PROTEIN_RESPONSE") %>% + # Join the column with the group labels that we would like to plot + dplyr::left_join(metadata %>% dplyr::select( + # Select the variables relevant to your data + refinebio_accession_code, + subgroup + ), + # Tell the join what columns are equivalent and should be used as a key + by = c("sample" = "refinebio_accession_code") + ) + +# Preview the filtered annotated results object +head(top_pathway_annotated_results_df) +``` + +Now let's make a sina plot so we can look at the differences between the `subgroup` groups using our GSVA scores. + +```{r} +# Sina plot comparing GSVA scores for `HALLMARK_UNFOLDED_PROTEIN_RESPONSE` +# the `subgroup` groups in our dataset +sina_plot <- + ggplot( + top_pathway_annotated_results_df, # Supply our annotated data frame + aes( + x = subgroup, # Replace with a grouping variable relevant to your data + y = gsva_score, # Column we previously created to store the GSVA scores + color = subgroup # Let's make the groups different colors too + ) + ) + + # Add a boxplot that will have summary stats + geom_boxplot(outlier.shape = NA) + + # Make a sina plot that shows individual values + ggforce::geom_sina() + + # Rename the y-axis label + labs(y = "HALLMARK_UNFOLDED_PROTEIN_RESPONSE_score") + + # Adjust the plot background for better visualization + theme_bw() + +# Display plot +sina_plot +``` + +Looks like the `Group 4` samples have lower GSVA scores for `HALLMARK_UNFOLDED_PROTEIN_RESPONSE` as compared to the `SHH` and `Group 3` scores. + +Let's save this plot to PNG. + +```{r} +ggsave( + file.path( + plots_dir, + "GSE37382_gsva_HALLMARK_UNFOLDED_PROTEIN_RESPONSE_sina_plot.png" + ), + plot = sina_plot +) +``` + +## Write results to file + +Now let's write all of our GSVA results to file. 
+ +```{r} +gsva_results %>% + as.data.frame() %>% + tibble::rownames_to_column("pathway") %>% + readr::write_tsv(file.path( + results_dir, + "GSE37382_gsva_results.tsv" + )) +``` + +# Resources for further learning + +- [GSVA Paper](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-7) [@Hanzelmann2013] +- See this article on [Decoding Gene Set Variation Analysis](https://towardsdatascience.com/decoding-gene-set-variation-analysis-8193a0cfda3) [@Malhotra2018] + +# Session info + +At the end of every analysis, before saving your notebook, we recommend printing out your session info. +This helps make your code more reproducible by recording what versions of software and packages you used to run this. + +```{r} +# Print session info +sessioninfo::session_info() +``` + +# References diff --git a/02-microarray/pathway-analysis_microarray_03_gsva.html b/02-microarray/pathway-analysis_microarray_03_gsva.html new file mode 100644 index 00000000..78ef3c30 --- /dev/null +++ b/02-microarray/pathway-analysis_microarray_03_gsva.html @@ -0,0 +1,4773 @@ + + + + + + + + + + + + + + +Gene set variation analysis - Microarray + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+
+
+
+
+ +
+ + + + + + + + + +
+

1 Purpose of this analysis

+

This example is one of a set of pathway analysis modules. We recommend looking at the pathway analysis table below to help you determine which pathway analysis method is best suited for your purposes.

+

In this example we will cover a method called Gene Set Variation Analysis (GSVA) to calculate gene set or pathway scores on a per-sample basis (Hänzelmann et al. 2013). You can use the GSVA scores for other downstream analyses. In this analysis, we will test GSVA scores for differential expression.

+

⬇️ Jump to the analysis code ⬇️

+
+

1.0.1 What is pathway analysis?

+

Pathway analysis refers to any one of many techniques that uses predetermined sets of genes that are related or coordinated in their expression in some way (e.g., participate in the same molecular process, are regulated by the same transcription factor) to interpret a high-throughput experiment. In the context of refine.bio, we use these techniques to analyze and interpret genome-wide gene expression experiments. The rationale for performing pathway analysis is that looking at the pathway-level may be more biologically meaningful than considering individual genes, especially if a large number of genes are differentially expressed between conditions of interest. In addition, many relatively small changes in the expression values of genes in the same pathway could lead to a phenotypic outcome and these small changes may go undetected in differential gene expression analysis.

+

We highly recommend taking a look at Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges from Khatri et al. (2012) for a more comprehensive overview. We have provided primary publications and documentation of the methods we will introduce below as well as some recommended reading in the Resources for further learning section.

+
+
+

1.0.2 How to choose a pathway analysis?

+

This table summarizes the pathway analysis examples in this module.

+ +++++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Analysis | What is required for input | What output looks like | ✅ Pros | ⚠️ Cons |
|----------|----------------------------|------------------------|---------|---------|
| ORA (Over-representation Analysis) | A list of gene IDs (no stats needed) | A per-pathway hypergeometric test result | - Simple<br>- Inexpensive computationally to calculate p-values | - Requires arbitrary thresholds and ignores any statistics associated with a gene<br>- Assumes independence of genes and pathways |
| GSEA (Gene Set Enrichment Analysis) | A list of gene IDs with gene-level summary statistics | A per-pathway enrichment score | - Includes all genes (no arbitrary threshold!)<br>- Attempts to measure coordination of genes | - Permutations can be expensive<br>- Does not account for pathway overlap<br>- Two-group comparisons not always appropriate/feasible |
| GSVA (Gene Set Variation Analysis) | A gene expression matrix (like what you get from refine.bio directly) | Pathway-level scores on a per-sample basis | - Does not require two groups to compare upfront<br>- Normally distributed scores | - Scores are not a good fit for gene sets that contain genes that go up AND down<br>- Method doesn’t assign statistical significance itself<br>- Recommended sample size n > 10 |
+
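To make the ORA row of the table a little more concrete, here is a minimal sketch of a single hypergeometric over-representation test in base R. All of the counts are made up for illustration and do not come from the dataset used in this example.

```r
# Hypothetical counts, chosen only for illustration
universe_size <- 20000 # genes measured in the experiment
pathway_size <- 200 # genes in the universe that belong to the pathway
list_size <- 500 # genes in our list of interest
overlap <- 15 # genes in our list that also belong to the pathway

# Probability of seeing `overlap` or more pathway genes in a random gene list
# of this size: P(X >= overlap) = P(X > overlap - 1)
ora_p_value <- phyper(
  overlap - 1,
  pathway_size,
  universe_size - pathway_size,
  list_size,
  lower.tail = FALSE
)

ora_p_value
```

In a real ORA, a test like this would be repeated for every pathway and followed by multiple testing correction.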
+
+
+

2 How to run this example

+

For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.

+
+

2.1 Obtain the .Rmd file

+

To run this example yourself, download the .Rmd for this analysis by clicking this link.

+

Clicking this link will most likely save the file to the downloads folder on your computer. Move this .Rmd file to where you would like this example and its files to be stored.

+

You can open this .Rmd file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd files.)

+
+
+

2.2 Set up your analysis folders

+

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

+

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

+
# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+  dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots"
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+  dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results"
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+  dir.create(results_dir)
+}
+
+# Define the file path to the gene_sets directory
+gene_sets_dir <- "gene_sets"
+
+# Create the gene_sets folder if it doesn't exist
+if (!dir.exists(gene_sets_dir)) {
+  dir.create(gene_sets_dir)
+}
+

In the same place you put this .Rmd file, you should now have four new empty folders called data, plots, results, and gene_sets!
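If you would like to add more folders later, the same setup can be written as a loop. This is only a sketch of an equivalent approach; the `analysis_dirs` name is our own and is not used elsewhere in this notebook.

```r
# Folder names used in this example
analysis_dirs <- c("data", "plots", "results", "gene_sets")

# Create each folder only if it doesn't exist yet
for (dir_name in analysis_dirs) {
  if (!dir.exists(dir_name)) {
    dir.create(dir_name)
  }
}

# Confirm that all of the folders are now present
dir.exists(analysis_dirs)
```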

+
+
+

2.3 Obtain the dataset from refine.bio

+

For general information about downloading data for these examples, see our ‘Getting Started’ section.

+

Go to this dataset’s page on refine.bio.

+

Click the “Download Now” button on the right side of this screen.

+

+

Fill out the pop-up window with your email address and agree to our Terms and Conditions:

+

+

It may take a few minutes for the dataset to process. You will get an email when it is ready.

+
+
+

2.4 About the dataset we are using for this example

+

For this example analysis, we will use this medulloblastoma dataset (Northcott et al. 2012). The data that we downloaded from refine.bio for this analysis has 285 microarray samples obtained from patients with medulloblastoma. Medulloblastoma is the most common childhood brain cancer and is often categorized by subgroups. We will use these subgroup labels from our metadata to perform differential expression with our GSVA scores.

+
+
+

2.5 Place the dataset in your new data/ folder

+

When your dataset is ready, refine.bio will send you an email with a download button. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip. Double-clicking the zip file should unzip it and create a folder of the same name.

+

+

For more details on the contents of this folder see these docs on refine.bio.

+

The GSE37382 folder has the data and metadata TSV files you will need for this example analysis. Experiment accession IDs usually look something like GSE1235 or SRP12345.

+

Copy and paste the GSE37382 folder into your newly created data/ folder.

+
+
+

2.6 Check out our file structure!

+

Your new analysis folder should contain:

+
    +
  • The example analysis .Rmd you downloaded
    +
  • +
  • A folder called “data” which contains: +
      +
    • The GSE37382 folder which contains: +
        +
The gene expression TSV
        +
      • +
      • The metadata TSV
        +
      • +
    • +
  • +
  • A folder for plots (currently empty)
    +
  • +
  • A folder for results (currently empty)
    +
  • +
  • A folder for gene_sets (currently empty)
  • +
+

Your example analysis folder should now look something like this (except with the respective experiment accession ID and analysis notebook name you are using):

+

+

In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

+

First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy because if we want to switch the dataset we are using for this analysis (see next section for more on this), we will only have to change the file paths here to get started.

+
# Define the file path to the data directory
+# Replace with the path of the folder the files will be in
+data_dir <- file.path("data", "GSE37382")
+
+# Declare the file path to the gene expression matrix file
+# inside directory saved as `data_dir`
+# Replace with the path to your dataset file
+data_file <- file.path(data_dir, "GSE37382.tsv")
+
+# Declare the file path to the metadata file
+# inside the directory saved as `data_dir`
+# Replace with the path to your metadata file
+metadata_file <- file.path(data_dir, "metadata_GSE37382.tsv")
+

Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

+
# Check if the gene expression matrix file is at the path stored in `data_file`
+file.exists(data_file)
+
## [1] TRUE
+
# Check if the metadata file is at the file path stored in `metadata_file`
+file.exists(metadata_file)
+
## [1] TRUE
+

If the chunk above printed out FALSE to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
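If you would rather have the notebook stop with a clear message when a file is missing, a small check like the one below is one option. This is a sketch only: `check_files_exist()` is a hypothetical helper we made up for illustration, and the temporary files stand in for your real data files.

```r
# Hypothetical helper: stop with a clear message if any file is missing
check_files_exist <- function(file_paths) {
  missing_files <- file_paths[!file.exists(file_paths)]
  if (length(missing_files) > 0) {
    stop("Cannot find: ", paste(missing_files, collapse = ", "))
  }
  invisible(TRUE)
}

# Demonstrate with a temporary file that exists and a path that does not
existing_file <- tempfile(fileext = ".tsv")
file.create(existing_file)
missing_file <- file.path(tempdir(), "not_a_real_file.tsv")

# This passes quietly because the file exists
check_files_exist(existing_file)

# `try()` lets the example keep running after the expected error
result <- try(check_files_exist(c(existing_file, missing_file)), silent = TRUE)
inherits(result, "try-error")
```

In this notebook, the equivalent call would be `check_files_exist(c(data_file, metadata_file))`.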

+

If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.

+
+
+
+

3 Using a different refine.bio dataset with this analysis?

+

If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/ directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.

+
+ +

 

+
+
+

4 Gene set variation analysis - Microarray

+
+

4.1 Install libraries

+

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

+

In this analysis, we will be using the GSVA package to perform GSVA and the qusage package to read in the GMT file containing the gene set data (Hänzelmann et al. 2013; Yaari et al. 2013).

+

We will also need the org.Hs.eg.db package to perform gene identifier conversion (Carlson 2020).

+

We’ll also be performing differential expression tests on our GSVA scores, and for that we will use limma (Ritchie et al. 2015). Finally, we’ll make a sina plot of the scores of our most significant pathway using a ggplot2 companion package, ggforce.

+
if (!("GSVA" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("GSVA", update = FALSE)
+}
+
+if (!("qusage" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("qusage", update = FALSE)
+}
+
+if (!("org.Hs.eg.db" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("org.Hs.eg.db", update = FALSE)
+}
+
+if (!("limma" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("limma", update = FALSE)
+}
+
+if (!("ggforce" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  install.packages("ggforce")
+}
+
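As a quick check before or after installing, you can list which of these packages are still missing. This sketch uses only base R and installs nothing; `needed_packages` and `missing_packages` are names we chose for illustration.

```r
# Packages installed by the chunk above
needed_packages <- c("GSVA", "qusage", "org.Hs.eg.db", "limma", "ggforce")

# The row names of `installed.packages()` are the installed package names
installed_names <- rownames(installed.packages())

# Names of the packages that are not installed yet
missing_packages <- setdiff(needed_packages, installed_names)

missing_packages
```

An empty result (`character(0)`) means everything in `needed_packages` is already installed.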

Attach the packages we need for this analysis.

+
# Attach the `qusage` library
+library(qusage)
+
+# Attach the `GSVA` library
+library(GSVA)
+
+# Human annotation package we'll use for gene identifier conversion
+library(org.Hs.eg.db)
+
+# Attach the ggplot2 package for plotting
+library(ggplot2)
+
+# We will need this so we can use the pipe: %>%
+library(magrittr)
+
+
+

4.2 Import and set up data

+

Data downloaded from refine.bio include a metadata tab-separated values (TSV) file and a data TSV file. This chunk of code will read both TSV files and add them as data frames to your environment.

+

We stored our file paths as objects named metadata_file and data_file in this previous step.

+
# Read in metadata TSV file
+metadata <- readr::read_tsv(metadata_file)
+
## 
+## ── Column specification ──────────────────────────────────────────────────────────────────────────────────
+## cols(
+##   .default = col_character(),
+##   refinebio_age = col_double(),
+##   refinebio_cell_line = col_logical(),
+##   refinebio_compound = col_logical(),
+##   refinebio_disease = col_logical(),
+##   refinebio_disease_stage = col_logical(),
+##   refinebio_genetic_information = col_logical(),
+##   refinebio_processed = col_logical(),
+##   refinebio_race = col_logical(),
+##   refinebio_source_archive_url = col_logical(),
+##   refinebio_specimen_part = col_logical(),
+##   refinebio_subject = col_logical(),
+##   refinebio_time = col_logical(),
+##   refinebio_treatment = col_logical(),
+##   channel_count = col_double(),
+##   `contact_zip/postal_code` = col_double(),
+##   data_row_count = col_double(),
+##   taxid_ch1 = col_double()
+## )
+## ℹ Use `spec()` for the full column specifications.
+
# Read in data TSV file
+expression_df <- readr::read_tsv(data_file)
+
## 
+## ── Column specification ──────────────────────────────────────────────────────────────────────────────────
+## cols(
+##   .default = col_double(),
+##   Gene = col_character()
+## )
+## ℹ Use `spec()` for the full column specifications.
+

Let’s ensure that the metadata and data are in the same sample order.

+
# Make the data in the order of the metadata
+expression_df <- expression_df %>%
+  dplyr::select(c(Gene, metadata$geo_accession))
+
+# Check if this is in the same order
+all.equal(colnames(expression_df)[-1], metadata$geo_accession)
+
## [1] TRUE
+
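To see what this reordering step does, here is a toy version in base R. The sample names and values are made up, and we assume, as in the chunk above, that every sample in the metadata has a matching column in the expression data.

```r
# Toy metadata and expression data with made-up sample names and values
toy_metadata <- data.frame(
  geo_accession = c("S1", "S2", "S3"),
  stringsAsFactors = FALSE
)
toy_expression_df <- data.frame(
  Gene = c("geneA", "geneB"),
  S3 = c(0.3, 1.3),
  S1 = c(0.1, 1.1),
  S2 = c(0.2, 1.2),
  stringsAsFactors = FALSE
)

# Confirm every metadata sample has a column before reordering
stopifnot(all(toy_metadata$geo_accession %in% colnames(toy_expression_df)))

# Put the sample columns in the same order as the metadata rows
toy_expression_df <- toy_expression_df[, c("Gene", toy_metadata$geo_accession)]

# The sample columns now follow the metadata order
colnames(toy_expression_df)
```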
+

4.2.1 Import Gene Sets

+

The Molecular Signatures Database (MSigDB) is a resource that contains annotated gene sets that can be used for pathway or gene set analyses (Subramanian et al. 2005; Liberzon et al. 2011). MSigDB contains 8 different gene set collections (Subramanian et al. 2005; Liberzon et al. 2011) that are distinguished by how they are derived (e.g., computationally mined, curated).

+

In this example, we will use a collection called Hallmark gene sets for GSVA (Liberzon et al. 2015). Here’s an excerpt of the collection description from MSigDB:

+
+

Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression.

+
+

The function that we will use to run GSVA requires the gene sets to be in a list. We are going to download a GMT file that contains the Hallmark gene sets (Liberzon et al. 2015) from MSigDB (Subramanian et al. 2005; Liberzon et al. 2011) to the gene_sets directory.

+

Note that when downloading GMT files from MSigDB, only Homo sapiens gene sets are supported. If you’d like to use MSigDB gene sets with GSVA for a commonly studied model organism, check out our RNA-seq GSVA example.

+
hallmarks_gmt_url <- "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.2/h.all.v7.2.entrez.gmt"
+

We will also declare a file path to where we want this file to be downloaded to and we can use the same file path later for reading the file into R.

+
hallmarks_gmt_file <- file.path(
+  gene_sets_dir,
+  "h.all.v7.2.entrez.gmt"
+)
+

Using the URL (hallmarks_gmt_url) and file path (hallmarks_gmt_file) we can download the file and use the destfile argument to specify where it should be saved.

+
download.file(
+  hallmarks_gmt_url,
+  # The file will be saved to this location and with this name
+  destfile = hallmarks_gmt_file
+)
+

Now let’s double check that the file that contains the gene sets is in the right place.

+
# Check if the file exists
+file.exists(hallmarks_gmt_file)
+
## [1] TRUE
+
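If you re-run this notebook often, you may not want to download the file every time. One option is to download only when the file is not already present. This is a sketch: `download_if_missing()` is a hypothetical helper, and the URL in the demonstration is a placeholder that is never contacted because the file already exists.

```r
# Hypothetical helper: download a file only if we don't already have it
download_if_missing <- function(url, destfile) {
  if (!file.exists(destfile)) {
    download.file(url, destfile = destfile)
  }
  # Report whether the file is in place
  file.exists(destfile)
}

# Demonstrate with a file that is already present, so nothing is downloaded
cached_gmt_file <- tempfile(fileext = ".gmt")
writeLines("already downloaded", cached_gmt_file)

download_if_missing("https://example.org/not-used.gmt", cached_gmt_file)
```

In this notebook, the equivalent call would be `download_if_missing(hallmarks_gmt_url, hallmarks_gmt_file)`.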

Now we’re ready to read the file into R with qusage::read.gmt().

+
# QuSAGE is another pathway analysis method; the `qusage` package has a function
+# for reading GMT files and turning them into a list that we can use with GSVA
+hallmarks_list <- qusage::read.gmt(hallmarks_gmt_file)
+
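To get a feel for what is inside a GMT file, here is a toy version parsed with base R. Each line of a GMT file holds one gene set: the set name, a description, and then the gene identifiers, all separated by tabs. The gene sets below are made up, and this sketch is not how `qusage::read.gmt()` is implemented; it only illustrates the file format and the kind of named list we end up with.

```r
# Write a tiny, made-up GMT file: one gene set per line, tab-separated as
# set name, description, then gene IDs
toy_gmt_file <- tempfile(fileext = ".gmt")
writeLines(
  c(
    "TOY_SET_A\thttp://example.org/a\t101\t102\t103",
    "TOY_SET_B\thttp://example.org/b\t201\t202"
  ),
  toy_gmt_file
)

# Split each line on tabs
gmt_fields <- strsplit(readLines(toy_gmt_file), "\t")

# Drop the first two fields (name and description) to keep the gene IDs
toy_gene_sets <- lapply(gmt_fields, function(fields) fields[-(1:2)])

# Name each list element with its gene set name
names(toy_gene_sets) <- vapply(
  gmt_fields,
  function(fields) fields[1],
  character(1)
)

toy_gene_sets
```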

What does this hallmarks_list look like?

+
head(hallmarks_list, n = 2)
+
## $HALLMARK_TNFA_SIGNALING_VIA_NFKB
+##   [1] "3726"   "2920"   "467"    "4792"   "7128"   "5743"   "2919"  
+##   [8] "8870"   "9308"   "6364"   "2921"   "23764"  "4791"   "7127"  
+##  [15] "1839"   "1316"   "330"    "5329"   "7538"   "3383"   "3725"  
+##  [22] "1960"   "3553"   "597"    "23645"  "80149"  "6648"   "4929"  
+##  [29] "3552"   "5971"   "7185"   "7832"   "1843"   "1326"   "2114"  
+##  [36] "2152"   "6385"   "1958"   "3569"   "7124"   "23135"  "4790"  
+##  [43] "3976"   "5806"   "8061"   "3164"   "182"    "6351"   "2643"  
+##  [50] "6347"   "1827"   "1844"   "10938"  "9592"   "5966"   "8837"  
+##  [57] "8767"   "4794"   "8013"   "22822"  "51278"  "8744"   "2669"  
+##  [64] "1647"   "3627"   "10769"  "8553"   "1959"   "9021"   "11182" 
+##  [71] "5734"   "1847"   "5055"   "4783"   "5054"   "10221"  "25976" 
+##  [78] "5970"   "329"    "6372"   "9516"   "7130"   "960"    "3624"  
+##  [85] "5328"   "4609"   "3604"   "6446"   "10318"  "10135"  "2355"  
+##  [92] "10957"  "3398"   "969"    "3575"   "1942"   "7262"   "5209"  
+##  [99] "6352"   "79693"  "3460"   "8878"   "10950"  "4616"   "8942"  
+## [106] "50486"  "694"    "4170"   "7422"   "5606"   "1026"   "3491"  
+## [113] "10010"  "3433"   "3606"   "7280"   "3659"   "2353"   "4973"  
+## [120] "388"    "374"    "4814"   "65986"  "8613"   "9314"   "6373"  
+## [127] "6303"   "1435"   "1880"   "56937"  "5791"   "7097"   "57007" 
+## [134] "7071"   "4082"   "3914"   "1051"   "9322"   "2150"   "687"   
+## [141] "3949"   "7050"   "127544" "55332"  "2683"   "11080"  "1437"  
+## [148] "5142"   "8303"   "5341"   "6776"   "23258"  "595"    "23586" 
+## [155] "8877"   "941"    "25816"  "57018"  "2526"   "9034"   "80176" 
+## [162] "8848"   "9334"   "150094" "23529"  "4780"   "2354"   "5187"  
+## [169] "10725"  "490"    "3593"   "3572"   "9120"   "19"     "3280"  
+## [176] "604"    "8660"   "6515"   "1052"   "51561"  "4088"   "6890"  
+## [183] "9242"   "64135"  "3601"   "79155"  "602"    "24145"  "24147" 
+## [190] "1906"   "10209"  "650"    "1846"   "10611"  "23308"  "9945"  
+## [197] "10365"  "3371"   "5271"   "4084"  
+## 
+## $HALLMARK_HYPOXIA
+##   [1] "5230"   "5163"   "2632"   "5211"   "226"    "2026"   "5236"  
+##   [8] "10397"  "3099"   "230"    "2821"   "4601"   "6513"   "5033"  
+##  [15] "133"    "8974"   "2023"   "5214"   "205"    "26355"  "5209"  
+##  [22] "7422"   "665"    "7167"   "30001"  "55818"  "901"    "3939"  
+##  [29] "2997"   "2597"   "8553"   "51129"  "3725"   "5054"   "4015"  
+##  [36] "2645"   "8497"   "23764"  "54541"  "6515"   "3486"   "4783"  
+##  [43] "2353"   "3516"   "3098"   "10370"  "3669"   "2584"   "26118" 
+##  [50] "5837"   "6781"   "23036"  "694"    "123"    "1466"   "7436"  
+##  [57] "23210"  "2131"   "2152"   "5165"   "55139"  "7360"   "229"   
+##  [64] "8614"   "54206"  "2027"   "10957"  "3162"   "5228"   "26330" 
+##  [71] "9435"   "55076"  "63827"  "467"    "857"    "272"    "2719"  
+##  [78] "3340"   "8660"   "8819"   "2548"   "6385"   "8987"   "8870"  
+##  [85] "5313"   "3484"   "5329"   "112464" "8839"   "9215"   "25819" 
+##  [92] "6275"   "58528"  "7538"   "1956"   "1907"   "3423"   "1026"  
+##  [99] "6095"   "1843"   "4282"   "5507"   "10570"  "11015"  "1837"  
+## [106] "136"    "9957"   "284119" "2908"   "1316"   "2239"   "3491"  
+## [113] "7128"   "771"    "3073"   "633"    "23645"  "55276"  "5292"  
+## [120] "25824"  "55577"  "1027"   "680"    "8277"   "4493"   "538"   
+## [127] "4502"   "9672"   "25976"  "5317"   "302"    "5224"   "1649"  
+## [134] "5578"   "2542"   "7852"   "1944"   "1356"   "8609"   "1490"  
+## [141] "9469"   "7163"   "56925"  "124872" "10891"  "596"    "2651"  
+## [148] "3036"   "54800"  "949"    "6576"   "6383"   "839"    "7428"  
+## [155] "2309"   "5155"   "126792" "6518"   "8406"   "1942"   "2745"  
+## [162] "57007"  "5066"   "7045"   "1634"   "6478"   "51316"  "2203"  
+## [169] "8459"   "5260"   "4627"   "1028"   "9380"   "5105"   "3623"  
+## [176] "3309"   "8509"   "23327"  "7162"   "7511"   "3569"   "6533"  
+## [183] "4214"   "3948"   "9590"   "26136"  "3798"   "3906"   "1289"  
+## [190] "2817"   "3069"   "10994"  "1463"   "7052"   "2113"   "3219"  
+## [197] "8991"   "2355"   "6820"   "7043"
+

Looks like we have a list of gene sets with associated Entrez IDs.
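
For intuition about where a list like this comes from, here is a minimal base R sketch of parsing the GMT layout by hand. It assumes the usual GMT convention of one gene set per line with tab-separated fields (set name, description, then member genes). The set names, URLs, and file are toy values made up for illustration, and this is not how `qusage::read.gmt()` is implemented.

```r
# Toy GMT content: gene set name, description, then member genes (tab-separated)
# The set names and descriptions are hypothetical
gmt_lines <- c(
  "TOY_SET_A\thttp://example.org/a\t3726\t2920\t467",
  "TOY_SET_B\thttp://example.org/b\t5230\t5163"
)
toy_gmt_file <- tempfile(fileext = ".gmt")
writeLines(gmt_lines, toy_gmt_file)

# Split each line on tabs
fields <- strsplit(readLines(toy_gmt_file), "\t", fixed = TRUE)

# Drop the first two fields (name and description) to keep only the gene IDs
toy_gene_sets <- lapply(fields, function(x) x[-(1:2)])

# Use the first field of each line as the gene set name
names(toy_gene_sets) <- vapply(fields, function(x) x[1], character(1))

str(toy_gene_sets)
```

The result is a named list of character vectors, which is the same shape as `hallmarks_list`.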

+

In our gene expression data frame, expression_df, we have Ensembl gene identifiers, so we will need to convert our Ensembl IDs into Entrez IDs for GSVA.

+
+
+

4.2.2 Gene identifier conversion

+

We’re going to convert our identifiers in expression_df to Entrez IDs, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers!

+

The annotation package org.Hs.eg.db contains information for different identifiers (Carlson 2020). org.Hs.eg.db is specific to Homo sapiens – this is what the Hs in the package name is referencing.

+

We can see what types of IDs are available to us in an annotation package with keytypes().

+
keytypes(org.Hs.eg.db)
+
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
+##  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
+##  [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
+## [13] "IPI"          "MAP"          "OMIM"         "ONTOLOGY"    
+## [17] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
+## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
+## [25] "UNIGENE"      "UNIPROT"
+

Even though we’ll use this package to convert from Ensembl gene IDs (ENSEMBL) to Entrez IDs (ENTREZID), we could just as easily use it to convert from an Ensembl transcript ID (ENSEMBLTRANS) to gene symbols (SYMBOL).

+

Take a look at our other gene identifier conversion examples to see this done with different species and gene ID types: the microarray example and the RNA-seq example.

+

The function we will use to map from Ensembl gene IDs to Entrez gene IDs is called mapIds() and comes from the AnnotationDbi package.

+

Let’s create a data frame that shows the mapped Entrez IDs along with the gene expression values for the respective Ensembl IDs.

+
# First let's create a mapped data frame we can join to the gene expression
+# values
+mapped_df <- data.frame(
+  "entrez_id" = mapIds(
+    # Replace with annotation package for the organism relevant to your data
+    org.Hs.eg.db,
+    keys = expression_df$Gene,
+    # Replace with the type of gene identifiers in your data
+    keytype = "ENSEMBL",
+    # Replace with the type of gene identifiers you would like to map to
+    column = "ENTREZID",
+    # This will keep only the first mapped value for each Ensembl ID
+    multiVals = "first"
+  )
+) %>%
+  # If an Ensembl gene identifier doesn't map to an Entrez gene identifier, drop
+  # that from the data frame
+  dplyr::filter(!is.na(entrez_id)) %>%
+  # Make an `Ensembl` column to store the row names
+  tibble::rownames_to_column("Ensembl") %>%
+  # Now let's join the rest of the expression data
+  dplyr::inner_join(expression_df, by = c("Ensembl" = "Gene"))
+
## 'select()' returned 1:many mapping between keys and columns
+

This 1:many mapping between keys and columns message means that some Ensembl gene identifiers map to multiple Entrez IDs. In this case, it’s also possible that an Entrez ID will map to multiple Ensembl IDs. For the purpose of performing GSVA later in this notebook, we keep only the first mapped IDs. For more about how to explore this, take a look at our microarray gene ID conversion example.
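
To make the "keep only the first mapped value" idea concrete, here is a small base R sketch on a made-up lookup table. It does not call `mapIds()`; the identifiers and the `toy_lookup` table are hypothetical and only mimic the behavior we expect from `multiVals = "first"`.

```r
# Hypothetical lookup table: one row per Ensembl/Entrez pair (IDs are made up)
toy_lookup <- data.frame(
  ensembl = c("ENSG_A", "ENSG_A", "ENSG_B", "ENSG_C"),
  entrez = c("101", "102", "103", "103"),
  stringsAsFactors = FALSE
)

# Keys we would like to convert; ENSG_D has no entry in the lookup table
toy_keys <- c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D")

# match() returns the position of the first hit, or NA when there is none
first_entrez <- toy_lookup$entrez[match(toy_keys, toy_lookup$ensembl)]
names(first_entrez) <- toy_keys

first_entrez
```

In this toy example, `ENSG_A` keeps only its first Entrez ID, `ENSG_D` gets `NA`, and `ENSG_B` and `ENSG_C` still share an Entrez ID. That last case is the kind of duplication we check for next.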

+

Let’s see a preview of mapped_df.

+
head(mapped_df)
+
+ +
+

We will want to keep in mind that GSVA requires that data is in a matrix with the gene identifiers as row names. In order to successfully turn our data frame into a matrix, we will need to ensure that we do not have any duplicate gene identifiers.

+

Let’s check to see if we have any Entrez IDs that mapped to multiple Ensembl IDs.

+
any(duplicated(mapped_df$entrez_id))
+
## [1] TRUE
+

Looks like we do have duplicated Entrez IDs. Let’s find out which ones.

+
dup_entrez_ids <- mapped_df %>%
+  dplyr::filter(duplicated(entrez_id)) %>%
+  dplyr::pull(entrez_id)
+
+dup_entrez_ids
+
## [1] "6013" "3117"
+
+
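
One detail worth knowing about `duplicated()`: it flags only the second and later occurrences of a value, so an ID that appears three times would show up twice in a vector built this way. The toy vector below is an assumption for illustration (our real data had just two duplicated IDs, each listed once); wrapping the result in `unique()` guards against repeats.

```r
# Toy identifiers: "6013" appears three times, "3117" twice
toy_ids <- c("6013", "3117", "6013", "999", "6013", "3117")

# TRUE only for repeat occurrences, never for the first one
duplicated(toy_ids)

# Without unique(), "6013" would be listed twice
toy_dup_ids <- unique(toy_ids[duplicated(toy_ids)])
toy_dup_ids
```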

4.2.2.1 Handling duplicate gene identifiers

+

As we mentioned earlier, we will not want any duplicate gene identifiers in our data frame when we convert it into a matrix in preparation for the GSVA steps later. GSVA is executed on a per sample basis so let’s keep the maximum expression value per sample associated with the duplicate Entrez gene identifiers. In other words, we will keep only the maximum expression value found across the duplicate Entrez gene identifier instances for each sample or column.

+

Let’s take a look at the rows associated with the duplicated Entrez IDs and see how this will play out.

+
mapped_df %>%
+  dplyr::filter(entrez_id %in% dup_entrez_ids)
+
+ +
+

As an example using the strategy we described, for GSM917111’s data in the first column, 0.2294387 is larger than 0.1104345, so moving forward Entrez gene 6013 will have the value 0.2294387 and the value 0.1104345 will be dropped from the dataset. This is just one method of handling duplicate gene identifiers. See the Gene Set Enrichment Analysis (GSEA) User guide for more information on other commonly used strategies, such as taking the median expression value.

+

Now, let’s implement the maximum value method for all samples and Entrez IDs using tidyverse functions.

+
max_dup_df <- mapped_df %>%
+  # We won't be using Ensembl IDs moving forward, so we will drop this column
+  dplyr::select(-Ensembl) %>%
+  # Filter to include only the rows associated with the duplicate Entrez gene
+  # identifiers
+  dplyr::filter(entrez_id %in% dup_entrez_ids) %>%
+  # Group by Entrez IDs
+  dplyr::group_by(entrez_id) %>%
+  # Get the maximum expression value using all values associated with each
+  # duplicate Entrez ID for each column or sample in this case
+  dplyr::summarize_all(max)
+
+max_dup_df
+
+ +
+

We can see GSM917111 now has the 0.2294387 value for Entrez ID 6013, as expected. Looks like we were able to successfully get rid of the duplicate Entrez gene identifiers!
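
The same per-sample maximum can be sketched in base R with `aggregate()`. The data frame below is a toy stand-in with made-up values and sample names, not our `mapped_df`.

```r
# Toy data: two Entrez IDs are duplicated, one is unique (values are made up)
toy_mapped_df <- data.frame(
  entrez_id = c("6013", "6013", "3117", "3117", "100"),
  sample_1 = c(0.23, 0.11, 1.5, 1.2, 0.7),
  sample_2 = c(0.05, 0.40, 2.0, 2.5, 0.9),
  stringsAsFactors = FALSE
)

# For each Entrez ID, take the maximum within every sample column
toy_max_df <- aggregate(. ~ entrez_id, data = toy_mapped_df, FUN = max)

toy_max_df
```

Note that for `6013` the `sample_1` maximum comes from the first row and the `sample_2` maximum from the second, so a de-duplicated row can mix values from different original rows. That is a consequence of taking the maximum separately for each sample.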

+

Now let’s combine our newly de-duplicated data with the rest of the mapped data!

+
filtered_mapped_df <- mapped_df %>%
+  # We won't be using Ensembl IDs moving forward, so we will drop this column
+  dplyr::select(-Ensembl) %>%
+  # First let's get the data associated with the Entrez gene identifiers that
+  # aren't duplicated
+  dplyr::filter(!entrez_id %in% dup_entrez_ids) %>%
+  # Now let's bind the rows of the maximum expression data we stored in
+  # `max_dup_df`
+  dplyr::bind_rows(max_dup_df)
+

As mentioned earlier, we need a matrix for GSVA. Let’s now convert our data frame into a matrix and prepare our object for GSVA.

+
filtered_mapped_matrix <- filtered_mapped_df %>%
+  # We need to store our gene identifiers as row names
+  tibble::column_to_rownames("entrez_id") %>%
+  # Now we can convert our object into a matrix
+  as.matrix()
+

Note that if we had duplicate gene identifiers here, we would not be able to set them as row names.
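
Here is a small base R sketch of that failure, using a toy data frame with a repeated identifier. It sets row names directly rather than through `tibble::column_to_rownames()`, so treat it as an illustration of the underlying restriction on data frame row names.

```r
# Toy data frame where "6013" appears twice (values are made up)
toy_df <- data.frame(
  entrez_id = c("6013", "6013", "3117"),
  sample_1 = c(0.23, 0.11, 1.5),
  stringsAsFactors = FALSE
)

# Data frames refuse duplicate row names, so this assignment raises an error
dup_rowname_error <- tryCatch(
  {
    rownames(toy_df) <- toy_df$entrez_id
    FALSE
  },
  error = function(e) TRUE
)

dup_rowname_error
```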

+
+
+
+
+

4.3 GSVA - Microarray

+

GSVA fits a model and ranks genes based on their expression level relative to the sample distribution (Hänzelmann et al. 2013). The pathway-level score calculated is a way of asking how genes within a gene set vary as compared to genes that are outside of that gene set (Malhotra 2018).

+

The idea here is that we will get pathway-level scores for each sample that indicate if genes in a pathway vary concordantly in one direction (over-expressed or under-expressed relative to the overall population) (Hänzelmann et al. 2013). This means that GSVA scores will depend on the samples included in the dataset when you run GSVA; if you added more samples and ran GSVA again, you would expect the scores to change (Hänzelmann et al. 2013).

+

The output is a gene set by sample matrix of GSVA scores.

+
+

4.3.1 Perform GSVA

+

Let’s perform GSVA using the gsva() function. See ?gsva for more options.

+
gsva_results <- gsva(
+  filtered_mapped_matrix,
+  hallmarks_list,
+  method = "gsva",
+  # Appropriate for our log2-transformed microarray data
+  kcdf = "Gaussian",
+  # Minimum gene set size
+  min.sz = 15,
+  # Maximum gene set size
+  max.sz = 500,
+  # Compute Gaussian-distributed scores
+  mx.diff = TRUE,
+  # Don't print out the progress bar
+  verbose = FALSE
+)
+
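
The `min.sz` and `max.sz` arguments filter gene sets by size. Our understanding from the GSVA documentation is that the size is counted after each gene set is restricted to genes present in the expression matrix. The base R sketch below mimics that filter on toy data; the gene sets, gene IDs, and thresholds are all made up, and this is not GSVA's own code.

```r
# Toy row names of an expression matrix
toy_matrix_genes <- c("1", "2", "3", "4", "5", "6")

# Toy gene sets; "99" and "7" are not in the matrix
toy_sets <- list(
  SET_SMALL = c("1", "99"),
  SET_OK = c("1", "2", "3", "4"),
  SET_LARGE = c("1", "2", "3", "4", "5", "6", "7")
)

# Number of genes per set that are present in the expression matrix
set_sizes <- vapply(
  toy_sets,
  function(genes) sum(genes %in% toy_matrix_genes),
  integer(1)
)

# Toy thresholds standing in for `min.sz` and `max.sz`
toy_min_sz <- 2
toy_max_sz <- 5

# Keep only the gene sets whose matched size falls within the thresholds
kept_sets <- names(set_sizes)[set_sizes >= toy_min_sz & set_sizes <= toy_max_sz]
kept_sets
```

A check like this on your own data can help explain why some gene sets are missing from the GSVA output.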

Let’s explore what the output of gsva() looks like.

+
# First 6 rows, first 10 columns
+head(gsva_results[, 1:10])
+
##                                      GSM917111   GSM917250
+## HALLMARK_TNFA_SIGNALING_VIA_NFKB    -0.2784726 -0.29221444
+## HALLMARK_HYPOXIA                    -0.1907117 -0.13033725
+## HALLMARK_CHOLESTEROL_HOMEOSTASIS    -0.2307863 -0.22997233
+## HALLMARK_MITOTIC_SPINDLE            -0.2134439  0.09773602
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.3061668  0.27041084
+## HALLMARK_TGF_BETA_SIGNALING         -0.2285640  0.08510027
+##                                       GSM917281  GSM917062
+## HALLMARK_TNFA_SIGNALING_VIA_NFKB    -0.30693127 -0.2953894
+## HALLMARK_HYPOXIA                    -0.24058274 -0.2658532
+## HALLMARK_CHOLESTEROL_HOMEOSTASIS    -0.25341066 -0.2214914
+## HALLMARK_MITOTIC_SPINDLE            -0.13886393 -0.2020978
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.06319446 -0.2363895
+## HALLMARK_TGF_BETA_SIGNALING         -0.14161796 -0.2284998
+##                                       GSM917288   GSM917230
+## HALLMARK_TNFA_SIGNALING_VIA_NFKB    -0.22966329 -0.20914620
+## HALLMARK_HYPOXIA                     0.06741065 -0.02691280
+## HALLMARK_CHOLESTEROL_HOMEOSTASIS    -0.08702648 -0.03084332
+## HALLMARK_MITOTIC_SPINDLE            -0.17902098  0.05763884
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING  0.21274606  0.08273239
+## HALLMARK_TGF_BETA_SIGNALING          0.01208862 -0.13097578
+##                                      GSM917152    GSM917242
+## HALLMARK_TNFA_SIGNALING_VIA_NFKB    0.33276903  0.001857506
+## HALLMARK_HYPOXIA                    0.18446386 -0.118269791
+## HALLMARK_CHOLESTEROL_HOMEOSTASIS    0.05273271  0.104042284
+## HALLMARK_MITOTIC_SPINDLE            0.14226250 -0.052122165
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING 0.37981263 -0.037661623
+## HALLMARK_TGF_BETA_SIGNALING         0.15915374  0.300603909
+##                                      GSM917226    GSM917290
+## HALLMARK_TNFA_SIGNALING_VIA_NFKB    -0.1329156 -0.385841741
+## HALLMARK_HYPOXIA                    -0.2641157 -0.145480093
+## HALLMARK_CHOLESTEROL_HOMEOSTASIS    -0.2136088 -0.267519873
+## HALLMARK_MITOTIC_SPINDLE            -0.3753805 -0.001471942
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.3570903 -0.006265662
+## HALLMARK_TGF_BETA_SIGNALING         -0.1973818 -0.130123427
+
+
+
+

4.4 Find differentially expressed pathways

+

If we want to identify the most differentially expressed pathways across subgroups, we can use functionality in the limma package to test the GSVA scores.

+

This is one approach for working with GSVA scores; the mx.diff = TRUE argument that we supplied to the gsva() function in the previous section means the GSVA output scores should be normally distributed, which can be helpful if you want to perform downstream analyses with approaches that make assumptions of normality (Hänzelmann et al. 2020).

+
+

4.4.1 Create the design matrix

+

limma needs a numeric design matrix to indicate which subtype of medulloblastoma a sample originates from. Now we will create a model matrix based on our subgroup variable. We are using a + 0 in the model which sets the intercept to 0 so the subgroup effects capture expression for that group, rather than difference from the first group. If you have a control group, you might want to leave off the + 0 so the model includes an intercept representing the control group expression level, with the remaining coefficients the changes relative to that expression level.

+
# Create the design matrix
+des_mat <- model.matrix(~ subgroup + 0, data = metadata)
+

Let’s take a look at the design matrix we created.

+
# Print out part of the design matrix
+head(des_mat)
+
##   subgroupGroup 3 subgroupGroup 4 subgroupSHH
+## 1               0               1           0
+## 2               0               1           0
+## 3               0               1           0
+## 4               1               0           0
+## 5               0               1           0
+## 6               0               1           0
+
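
To see what the `+ 0` changes, here is a toy comparison of the two parameterizations using base R's `model.matrix()`. The four-sample metadata is made up for illustration.

```r
# Toy metadata with the same subgroup labels as our dataset
toy_metadata <- data.frame(
  subgroup = factor(c("Group 4", "Group 3", "SHH", "Group 4"))
)

# No intercept: one indicator column per subgroup
no_intercept <- model.matrix(~ subgroup + 0, data = toy_metadata)

# With intercept: the first factor level (Group 3) becomes the reference
with_intercept <- model.matrix(~ subgroup, data = toy_metadata)

colnames(no_intercept)
colnames(with_intercept)
```

Without the intercept, every sample has exactly one `1` in its row. With the intercept, the `Group 3` column is replaced by `(Intercept)` and the remaining columns are read as differences from Group 3.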

The design matrix column names are a bit messy, so we will neaten them up by dropping the subgroup designation they all have and any spaces in names.

+
# Make the column names less messy
+colnames(des_mat) <- stringr::str_remove(colnames(des_mat), "subgroup")
+
+# Do a similar thing but remove spaces in names
+colnames(des_mat) <- stringr::str_remove(colnames(des_mat), " ")
+
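
`stringr::str_remove()` removes only the first match in each string, which is enough for names like `Group 3`. If your group labels can contain more than one space, you would want the "remove all" variant (`stringr::str_remove_all()`). The base R pair `sub()` and `gsub()` shows the same distinction; the names below are toy values.

```r
# Toy column names; the last one has two spaces
toy_names <- c("Group 3", "Group 4", "SHH", "Group 3 alpha")

# sub() replaces only the first match in each string
first_only <- sub(" ", "", toy_names, fixed = TRUE)
first_only
# → "Group3" "Group4" "SHH" "Group3 alpha"

# gsub() replaces every match
all_matches <- gsub(" ", "", toy_names, fixed = TRUE)
all_matches
# → "Group3" "Group4" "SHH" "Group3alpha"
```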
+
+
+

4.5 Perform differential expression on pathway scores

+

Run the linear model on each pathway (each row of gsva_results) with lmFit() and apply empirical Bayes smoothing with eBayes().

+
# Apply linear model to data
+fit <- limma::lmFit(gsva_results, design = des_mat)
+
+# Apply empirical Bayes to smooth standard errors
+fit <- limma::eBayes(fit)
+
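
Because our design matrix has no intercept, each coefficient from the linear model is simply a subgroup's mean score for that pathway. The sketch below checks this for a single toy pathway with base R's `lm()` instead of limma; the scores are made up, and the empirical Bayes step is not reproduced here.

```r
# Toy GSVA scores for one pathway across six samples (values are made up)
toy_scores <- c(0.2, 0.4, -0.1, -0.3, 0.5, 0.7)
toy_subgroup <- factor(c("Group3", "Group3", "Group4", "Group4", "SHH", "SHH"))

# Fit a no-intercept model, as in our design matrix
toy_fit <- lm(toy_scores ~ toy_subgroup + 0)

# Compare the coefficients to plain group means
coef(toy_fit)
group_means <- tapply(toy_scores, toy_subgroup, mean)
group_means
```

limma fits a model like this one for every row of `gsva_results` at once.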

Now that we have our basic model fitting, we will want to make the contrasts among all our groups. Depending on your scientific questions, you will need to customize the next steps. Consulting the limma users guide for how to set up your model based on your hypothesis is a good idea.

+

In this contrasts matrix, we are comparing each subgroup to the average of the other subgroups.
+We’re dividing by two in this expression so that each group is compared to the average of the other two groups (makeContrasts() doesn’t allow you to use functions like mean(); it wants a formula).

+
contrast_matrix <- makeContrasts(
+  "G3vsOther" = Group3 - (Group4 + SHH) / 2,
+  "G4vsOther" = Group4 - (Group3 + SHH) / 2,
+  "SHHvsOther" = SHH - (Group3 + Group4) / 2,
+  levels = des_mat
+)
+
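
It can help to see the numbers behind these contrasts. The sketch below builds the same weights by hand in base R and applies them to toy group means (made-up values). It is only meant to show what "each group versus the average of the other two" means arithmetically.

```r
# Each column is one contrast; each row is a coefficient from the design matrix
toy_contrasts <- cbind(
  G3vsOther = c(Group3 = 1, Group4 = -0.5, SHH = -0.5),
  G4vsOther = c(Group3 = -0.5, Group4 = 1, SHH = -0.5),
  SHHvsOther = c(Group3 = -0.5, Group4 = -0.5, SHH = 1)
)

# The weights in every contrast sum to zero
colSums(toy_contrasts)

# Toy group means for one pathway (values are made up)
toy_group_means <- c(Group3 = 0.3, Group4 = -0.2, SHH = 0.6)

# Each estimate is a group mean minus the average of the other two
toy_estimates <- drop(toy_group_means %*% toy_contrasts)
toy_estimates
```

For example, the `G3vsOther` estimate here is 0.3 - (-0.2 + 0.6) / 2 = 0.1.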

Side note: If you did have a control group you wanted to compare each group to, you could make each contrast to that control group, so the formulae would look like Group3 = Group3 - Control for each one. We highly recommend consulting the limma users guide for figuring out what your makeContrasts() and model.matrix() setups should look like (Ritchie et al. 2015).

+

Now that we have the contrasts matrix set up, we can use it to re-fit the model and re-smooth it with eBayes().

+
# Fit the model according to the contrasts matrix
+contrasts_fit <- contrasts.fit(fit, contrast_matrix)
+
+# Re-apply empirical Bayes smoothing
+contrasts_fit <- eBayes(contrasts_fit)
+

Here’s a nifty article and example about what the empirical Bayes smoothing is for (Robinson).

+

Now let’s create the results table based on the contrasts fitted model.

+

This step will provide the Benjamini-Hochberg multiple testing correction. The topTable() function default is to use Benjamini-Hochberg but this can be changed to a different method using the adjust.method argument (see the ?topTable help page for more about the options).

+
# Apply multiple testing correction and obtain stats
+# Each row of `gsva_results` is a pathway, so we ask for all of them
+stats_df <- topTable(contrasts_fit, number = nrow(gsva_results)) %>%
+  tibble::rownames_to_column("pathway")
+
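
For a feel of what the Benjamini-Hochberg adjustment does, here is a toy example with base R's `p.adjust()`. The four p-values are made up, and this stands apart from the `topTable()` call itself.

```r
# Toy raw p-values (made up)
toy_p_values <- c(0.01, 0.02, 0.03, 0.5)

# Benjamini-Hochberg adjusted p-values
toy_adj_p <- p.adjust(toy_p_values, method = "BH")
toy_adj_p
# → 0.04 0.04 0.04 0.50
```

Each p-value is scaled by the number of tests divided by its rank, and the results are then made monotone, which is why the three smallest values end up tied at 0.04.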

Let’s take a peek at our results table.

+
head(stats_df)
+
+ +
+

For each pathway, the table reports each group’s difference in GSVA score compared to the average of the other groups (the counterpart of a fold change when the input is gene expression).

+

By default, results are ordered from largest F value to the smallest, which means your most differentially expressed pathways across all groups should be toward the top.

+

This means HALLMARK_UNFOLDED_PROTEIN_RESPONSE appears to be the pathway whose GSVA scores differ most significantly across subgroups.

+
+
+

4.6 Visualizing Results

+

Let’s make a plot for our most significant pathway, HALLMARK_UNFOLDED_PROTEIN_RESPONSE.

+
+

4.6.1 Sina plot

+

First we need to get our GSVA scores for this pathway into a long data frame, an appropriate format for ggplot2.

+

Let’s look at the current format of gsva_results.

+
head(gsva_results[, 1:10])
+
##                                      GSM917111   GSM917250
+## HALLMARK_TNFA_SIGNALING_VIA_NFKB    -0.2784726 -0.29221444
+## HALLMARK_HYPOXIA                    -0.1907117 -0.13033725
+## HALLMARK_CHOLESTEROL_HOMEOSTASIS    -0.2307863 -0.22997233
+## HALLMARK_MITOTIC_SPINDLE            -0.2134439  0.09773602
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.3061668  0.27041084
+## HALLMARK_TGF_BETA_SIGNALING         -0.2285640  0.08510027
+##                                       GSM917281  GSM917062
+## HALLMARK_TNFA_SIGNALING_VIA_NFKB    -0.30693127 -0.2953894
+## HALLMARK_HYPOXIA                    -0.24058274 -0.2658532
+## HALLMARK_CHOLESTEROL_HOMEOSTASIS    -0.25341066 -0.2214914
+## HALLMARK_MITOTIC_SPINDLE            -0.13886393 -0.2020978
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.06319446 -0.2363895
+## HALLMARK_TGF_BETA_SIGNALING         -0.14161796 -0.2284998
+##                                       GSM917288   GSM917230
+## HALLMARK_TNFA_SIGNALING_VIA_NFKB    -0.22966329 -0.20914620
+## HALLMARK_HYPOXIA                     0.06741065 -0.02691280
+## HALLMARK_CHOLESTEROL_HOMEOSTASIS    -0.08702648 -0.03084332
+## HALLMARK_MITOTIC_SPINDLE            -0.17902098  0.05763884
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING  0.21274606  0.08273239
+## HALLMARK_TGF_BETA_SIGNALING          0.01208862 -0.13097578
+##                                      GSM917152    GSM917242
+## HALLMARK_TNFA_SIGNALING_VIA_NFKB    0.33276903  0.001857506
+## HALLMARK_HYPOXIA                    0.18446386 -0.118269791
+## HALLMARK_CHOLESTEROL_HOMEOSTASIS    0.05273271  0.104042284
+## HALLMARK_MITOTIC_SPINDLE            0.14226250 -0.052122165
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING 0.37981263 -0.037661623
+## HALLMARK_TGF_BETA_SIGNALING         0.15915374  0.300603909
+##                                      GSM917226    GSM917290
+## HALLMARK_TNFA_SIGNALING_VIA_NFKB    -0.1329156 -0.385841741
+## HALLMARK_HYPOXIA                    -0.2641157 -0.145480093
+## HALLMARK_CHOLESTEROL_HOMEOSTASIS    -0.2136088 -0.267519873
+## HALLMARK_MITOTIC_SPINDLE            -0.3753805 -0.001471942
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.3570903 -0.006265662
+## HALLMARK_TGF_BETA_SIGNALING         -0.1973818 -0.130123427
+

We can see that they are in a wide format with the GSVA scores for each sample spread across a row associated with each pathway.

+

Now let’s convert these results into a data frame and into a long format, using the tidyr::pivot_longer() function.

+
annotated_results_df <- gsva_results %>%
+  # Make this into a data frame
+  data.frame() %>%
+  # Gene set names are row names
+  tibble::rownames_to_column("pathway") %>%
+  # Get into long format using the `tidyr::pivot_longer()` function
+  tidyr::pivot_longer(
+    cols = -pathway,
+    names_to = "sample",
+    values_to = "gsva_score"
+  )
+
+# Preview the annotated results object
+head(annotated_results_df)
+
+ +
+
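
If you prefer to see the reshaping without tidyverse functions, the wide-to-long step can be sketched in base R. The small matrix below is a toy stand-in for `gsva_results` with rounded, made-up scores.

```r
# Toy gene set by sample matrix (values are made up)
toy_gsva <- matrix(
  c(-0.28, -0.19, -0.29, -0.13),
  nrow = 2,
  dimnames = list(
    c("HALLMARK_TNFA_SIGNALING_VIA_NFKB", "HALLMARK_HYPOXIA"),
    c("GSM917111", "GSM917250")
  )
)

# as.table() lets as.data.frame() produce one row per pathway/sample pair
toy_long_df <- as.data.frame(as.table(toy_gsva), stringsAsFactors = FALSE)
colnames(toy_long_df) <- c("pathway", "sample", "gsva_score")

toy_long_df
```

A 2 x 2 matrix becomes four rows, one per combination of pathway and sample.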

Now let’s filter to include only the scores associated with our most significant pathway, HALLMARK_UNFOLDED_PROTEIN_RESPONSE, and join the relevant group labels from the metadata for plotting.

+
top_pathway_annotated_results_df <- annotated_results_df %>%
+  # Filter for only scores associated with our most significant pathway
+  dplyr::filter(pathway == "HALLMARK_UNFOLDED_PROTEIN_RESPONSE") %>%
+  # Join the column with the group labels that we would like to plot
+  dplyr::left_join(metadata %>% dplyr::select(
+    # Select the variables relevant to your data
+    refinebio_accession_code,
+    subgroup
+  ),
+  # Tell the join what columns are equivalent and should be used as a key
+  by = c("sample" = "refinebio_accession_code")
+  )
+
+# Preview the filtered annotated results object
+head(top_pathway_annotated_results_df)
+
+ +
+
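
A left join keeps every row of the scores table, even when a sample has no metadata match. The base R sketch below shows this with `merge()` on toy tables; the sample names and labels are made up.

```r
# Toy scores for three samples
toy_scores_df <- data.frame(
  sample = c("GSM1", "GSM2", "GSM3"),
  gsva_score = c(0.1, -0.2, 0.3),
  stringsAsFactors = FALSE
)

# Toy metadata that is missing GSM3
toy_sample_metadata <- data.frame(
  refinebio_accession_code = c("GSM2", "GSM1"),
  subgroup = c("SHH", "Group 4"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps unmatched rows from the first table (a left join)
toy_joined_df <- merge(
  toy_scores_df,
  toy_sample_metadata,
  by.x = "sample",
  by.y = "refinebio_accession_code",
  all.x = TRUE
)

toy_joined_df
```

`GSM3` stays in the result with `NA` for `subgroup`, so it is worth checking for missing group labels before plotting.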

Now let’s make a sina plot so we can look at how the GSVA scores differ between subgroups.

+
# Sina plot comparing GSVA scores for `HALLMARK_UNFOLDED_PROTEIN_RESPONSE`
+# across the `subgroup` groups in our dataset
+sina_plot <-
+  ggplot(
+    top_pathway_annotated_results_df, # Supply our annotated data frame
+    aes(
+      x = subgroup, # Replace with a grouping variable relevant to your data
+      y = gsva_score, # Column we previously created to store the GSVA scores
+      color = subgroup # Let's make the groups different colors too
+    )
+  ) +
+  # Add a boxplot that will have summary stats
+  geom_boxplot(outlier.shape = NA) +
+  # Make a sina plot that shows individual values
+  ggforce::geom_sina() +
+  # Rename the y-axis label
+  labs(y = "HALLMARK_UNFOLDED_PROTEIN_RESPONSE_score") +
+  # Adjust the plot background for better visualization
+  theme_bw()
+
+# Display plot
+sina_plot
+

+

Looks like the Group 4 samples have lower GSVA scores for HALLMARK_UNFOLDED_PROTEIN_RESPONSE as compared to the SHH and Group 3 scores.

+

Let’s save this plot to PNG.

+
ggsave(
+  file.path(
+    plots_dir,
+    "GSE37382_gsva_HALLMARK_UNFOLDED_PROTEIN_RESPONSE_sina_plot.png"
+  ),
+  plot = sina_plot
+)
+
## Saving 7 x 5 in image
+
+
+
+

4.7 Write results to file

+

Now let’s write all of our GSVA results to file.

+
gsva_results %>%
+  as.data.frame() %>%
+  tibble::rownames_to_column("pathway") %>%
+  readr::write_tsv(file.path(
+    results_dir,
+    "GSE37382_gsva_results.tsv"
+  ))
+
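
If you later want the scores back as a matrix, the TSV can be read in and converted again. Here is a base R round-trip sketch on a toy matrix written to a temporary file; the values and file are made up, and base R functions stand in for `readr`.

```r
# Toy gene set by sample matrix (values are made up)
toy_results <- matrix(
  c(-0.28, -0.19, -0.29, -0.13),
  nrow = 2,
  dimnames = list(
    c("HALLMARK_TNFA_SIGNALING_VIA_NFKB", "HALLMARK_HYPOXIA"),
    c("GSM917111", "GSM917250")
  )
)

# Move the row names into a `pathway` column and write a TSV
toy_out_df <- data.frame(
  pathway = rownames(toy_results),
  toy_results,
  row.names = NULL,
  stringsAsFactors = FALSE
)
toy_results_file <- tempfile(fileext = ".tsv")
write.table(
  toy_out_df, toy_results_file,
  sep = "\t", quote = FALSE, row.names = FALSE
)

# Read the file back in and rebuild the matrix
toy_in_df <- read.delim(toy_results_file, stringsAsFactors = FALSE)
toy_roundtrip <- as.matrix(toy_in_df[, -1])
rownames(toy_roundtrip) <- toy_in_df$pathway
```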
+
+
+

5 Resources for further learning

+ +
+
+

6 Session info

+

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

+
# Print session info
+sessioninfo::session_info()
+
## ─ Session info ─────────────────────────────────────────────────────
+##  setting  value                       
+##  version  R version 4.0.2 (2020-06-22)
+##  os       Ubuntu 20.04 LTS            
+##  system   x86_64, linux-gnu           
+##  ui       X11                         
+##  language (EN)                        
+##  collate  en_US.UTF-8                 
+##  ctype    en_US.UTF-8                 
+##  tz       Etc/UTC                     
+##  date     2020-12-21                  
+## 
+## ─ Packages ─────────────────────────────────────────────────────────
+##  package              * version  date       lib source        
+##  annotate               1.68.0   2020-10-27 [1] Bioconductor  
+##  AnnotationDbi        * 1.52.0   2020-10-27 [1] Bioconductor  
+##  assertthat             0.2.1    2019-03-21 [1] RSPM (R 4.0.0)
+##  backports              1.1.10   2020-09-15 [1] RSPM (R 4.0.2)
+##  Biobase              * 2.50.0   2020-10-27 [1] Bioconductor  
+##  BiocGenerics         * 0.36.0   2020-10-27 [1] Bioconductor  
+##  BiocParallel           1.24.1   2020-11-06 [1] Bioconductor  
+##  bit                    4.0.4    2020-08-04 [1] RSPM (R 4.0.2)
+##  bit64                  4.0.5    2020-08-30 [1] RSPM (R 4.0.2)
+##  bitops                 1.0-6    2013-08-17 [1] RSPM (R 4.0.0)
+##  blob                   1.2.1    2020-01-20 [1] RSPM (R 4.0.0)
+##  cli                    2.1.0    2020-10-12 [1] RSPM (R 4.0.2)
+##  coda                   0.19-4   2020-09-30 [1] RSPM (R 4.0.2)
+##  colorspace             1.4-1    2019-03-18 [1] RSPM (R 4.0.0)
+##  crayon                 1.3.4    2017-09-16 [1] RSPM (R 4.0.0)
+##  DBI                    1.1.0    2019-12-15 [1] RSPM (R 4.0.0)
+##  DelayedArray           0.16.0   2020-10-27 [1] Bioconductor  
+##  digest                 0.6.25   2020-02-23 [1] RSPM (R 4.0.0)
+##  dplyr                  1.0.2    2020-08-18 [1] RSPM (R 4.0.2)
+##  ellipsis               0.3.1    2020-05-15 [1] RSPM (R 4.0.0)
+##  emmeans                1.5.1    2020-09-18 [1] RSPM (R 4.0.2)
+##  estimability           1.3      2018-02-11 [1] RSPM (R 4.0.0)
+##  evaluate               0.14     2019-05-28 [1] RSPM (R 4.0.0)
+##  fansi                  0.4.1    2020-01-08 [1] RSPM (R 4.0.0)
+##  farver                 2.0.3    2020-01-16 [1] RSPM (R 4.0.0)
+##  fastmatch              1.1-0    2017-01-28 [1] RSPM (R 4.0.0)
+##  fftw                   1.0-6    2020-02-24 [1] RSPM (R 4.0.2)
+##  generics               0.0.2    2018-11-29 [1] RSPM (R 4.0.0)
+##  GenomeInfoDb           1.26.1   2020-11-20 [1] Bioconductor  
+##  GenomeInfoDbData       1.2.4    2020-12-01 [1] Bioconductor  
+##  GenomicRanges          1.42.0   2020-10-27 [1] Bioconductor  
+##  getopt                 1.20.3   2019-03-22 [1] RSPM (R 4.0.0)
+##  ggforce                0.3.2    2020-06-23 [1] RSPM (R 4.0.2)
+##  ggplot2              * 3.3.2    2020-06-19 [1] RSPM (R 4.0.1)
+##  glue                   1.4.2    2020-08-27 [1] RSPM (R 4.0.2)
+##  graph                  1.68.0   2020-10-27 [1] Bioconductor  
+##  GSEABase               1.52.1   2020-12-11 [1] Bioconductor  
+##  GSVA                 * 1.38.0   2020-10-27 [1] Bioconductor  
+##  gtable                 0.3.0    2019-03-25 [1] RSPM (R 4.0.0)
+##  hms                    0.5.3    2020-01-08 [1] RSPM (R 4.0.0)
+##  htmltools              0.5.0    2020-06-16 [1] RSPM (R 4.0.1)
+##  httr                   1.4.2    2020-07-20 [1] RSPM (R 4.0.2)
+##  IRanges              * 2.24.0   2020-10-27 [1] Bioconductor  
+##  jsonlite               1.7.1    2020-09-07 [1] RSPM (R 4.0.2)
+##  knitr                  1.30     2020-09-22 [1] RSPM (R 4.0.2)
+##  labeling               0.3      2014-08-23 [1] RSPM (R 4.0.0)
+##  lattice                0.20-41  2020-04-02 [2] CRAN (R 4.0.2)
+##  lifecycle              0.2.0    2020-03-06 [1] RSPM (R 4.0.0)
+##  limma                * 3.46.0   2020-10-27 [1] Bioconductor  
+##  magrittr             * 1.5      2014-11-22 [1] RSPM (R 4.0.0)
+##  MASS                   7.3-51.6 2020-04-26 [2] CRAN (R 4.0.2)
+##  Matrix                 1.2-18   2019-11-27 [2] CRAN (R 4.0.2)
+##  MatrixGenerics         1.2.0    2020-10-27 [1] Bioconductor  
+##  matrixStats            0.57.0   2020-09-25 [1] RSPM (R 4.0.2)
+##  memoise                1.1.0    2017-04-21 [1] RSPM (R 4.0.0)
+##  munsell                0.5.0    2018-06-12 [1] RSPM (R 4.0.0)
+##  mvtnorm                1.1-1    2020-06-09 [1] RSPM (R 4.0.0)
+##  nlme                   3.1-148  2020-05-24 [2] CRAN (R 4.0.2)
+##  optparse             * 1.6.6    2020-04-16 [1] RSPM (R 4.0.0)
+##  org.Hs.eg.db         * 3.12.0   2020-12-01 [1] Bioconductor  
+##  pillar                 1.4.6    2020-07-10 [1] RSPM (R 4.0.2)
+##  pkgconfig              2.0.3    2019-09-22 [1] RSPM (R 4.0.0)
+##  polyclip               1.10-0   2019-03-14 [1] RSPM (R 4.0.0)
+##  ps                     1.4.0    2020-10-07 [1] RSPM (R 4.0.2)
+##  purrr                  0.3.4    2020-04-17 [1] RSPM (R 4.0.0)
+##  qusage               * 2.24.0   2020-10-27 [1] Bioconductor  
+##  R.cache                0.14.0   2019-12-06 [1] RSPM (R 4.0.0)
+##  R.methodsS3            1.8.1    2020-08-26 [1] RSPM (R 4.0.2)
+##  R.oo                   1.24.0   2020-08-26 [1] RSPM (R 4.0.2)
+##  R.utils                2.10.1   2020-08-26 [1] RSPM (R 4.0.2)
+##  R6                     2.4.1    2019-11-12 [1] RSPM (R 4.0.0)
+##  Rcpp                   1.0.5    2020-07-06 [1] RSPM (R 4.0.2)
+##  RCurl                  1.98-1.2 2020-04-18 [1] RSPM (R 4.0.0)
+##  readr                  1.4.0    2020-10-05 [1] RSPM (R 4.0.2)
+##  rematch2               2.1.2    2020-05-01 [1] RSPM (R 4.0.0)
+##  rlang                  0.4.8    2020-10-08 [1] RSPM (R 4.0.2)
+##  rmarkdown              2.4      2020-09-30 [1] RSPM (R 4.0.2)
+##  RSQLite                2.2.1    2020-09-30 [1] RSPM (R 4.0.2)
+##  rstudioapi             0.11     2020-02-07 [1] RSPM (R 4.0.0)
+##  S4Vectors            * 0.28.0   2020-10-27 [1] Bioconductor  
+##  scales                 1.1.1    2020-05-11 [1] RSPM (R 4.0.0)
+##  sessioninfo            1.1.1    2018-11-05 [1] RSPM (R 4.0.0)
+##  stringi                1.5.3    2020-09-09 [1] RSPM (R 4.0.2)
+##  stringr                1.4.0    2019-02-10 [1] RSPM (R 4.0.0)
+##  styler                 1.3.2    2020-02-23 [1] RSPM (R 4.0.0)
+##  SummarizedExperiment   1.20.0   2020-10-27 [1] Bioconductor  
+##  tibble                 3.0.4    2020-10-12 [1] RSPM (R 4.0.2)
+##  tidyr                  1.1.2    2020-08-27 [1] RSPM (R 4.0.2)
+##  tidyselect             1.1.0    2020-05-11 [1] RSPM (R 4.0.0)
+##  tweenr                 1.0.1    2018-12-14 [1] RSPM (R 4.0.2)
+##  vctrs                  0.3.4    2020-08-29 [1] RSPM (R 4.0.2)
+##  withr                  2.3.0    2020-09-22 [1] RSPM (R 4.0.2)
+##  xfun                   0.18     2020-09-29 [1] RSPM (R 4.0.2)
+##  XML                    3.99-0.5 2020-07-23 [1] RSPM (R 4.0.2)
+##  xtable                 1.8-4    2019-04-21 [1] RSPM (R 4.0.0)
+##  XVector                0.30.0   2020-10-27 [1] Bioconductor  
+##  yaml                   2.2.1    2020-02-01 [1] RSPM (R 4.0.0)
+##  zlibbioc               1.36.0   2020-10-27 [1] Bioconductor  
+## 
+## [1] /usr/local/lib/R/site-library
+## [2] /usr/local/lib/R/library
+
+
+

References

+
+
+

Carlson M., 2020 org.Hs.eg.db: Genome wide annotation for human. http://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html

+
+
+

Hänzelmann S., R. Castelo, and J. Guinney, 2013 GSVA: Gene set variation analysis for microarray and RNA-seq data. BMC Bioinformatics 14: 7. https://doi.org/10.1186/1471-2105-14-7

+
+
+

Hänzelmann S., R. Castelo, and J. Guinney, 2020 GSVA: The gene set variation analysis package for microarray and RNA-seq data. https://www.bioconductor.org/packages/release/bioc/vignettes/GSVA/inst/doc/GSVA.pdf

+
+
+

Khatri P., M. Sirota, and A. J. Butte, 2012 Ten years of pathway analysis: Current approaches and outstanding challenges. PLOS Computational Biology 8: e1002375. https://doi.org/10.1371/journal.pcbi.1002375

+
+
+

Liberzon A., C. Birger, H. Thorvaldsdóttir, M. Ghandi, and J. P. Mesirov et al., 2015 The molecular signatures database hallmark gene set collection. Cell Systems 1. https://doi.org/10.1016/j.cels.2015.12.004

+
+
+

Liberzon A., A. Subramanian, R. Pinchback, H. Thorvaldsdóttir, and P. Tamayo et al., 2011 Molecular signatures database (MSigDB) 3.0. Bioinformatics 27: 1739–1740. https://doi.org/10.1093/bioinformatics/btr260

+
+
+

Malhotra S., 2018 Decoding gene set variation analysis. https://towardsdatascience.com/decoding-gene-set-variation-analysis-8193a0cfda3

+
+
+

Northcott P., D. Shih, J. Peacock, L. Garzia, and S. Morrissy et al., 2012 Subgroup specific structural variation across 1,000 medulloblastoma genomes. Nature 488. https://doi.org/10.1038/nature11327

+
+
+

Ritchie M. E., B. Phipson, D. Wu, Y. Hu, and C. W. Law et al., 2015 limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43: e47. https://doi.org/10.1093/nar/gkv007

+
+
+

Robinson D., Understanding empirical Bayes estimation (using baseball statistics). http://varianceexplained.org/r/empirical_bayes_baseball/

+
+
+

Subramanian A., P. Tamayo, V. K. Mootha, S. Mukherjee, and B. L. Ebert et al., 2005 Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 102: 15545–15550. https://doi.org/10.1073/pnas.0506580102

+
+
+

Yaari G., C. R. Bolen, J. Thakar, and S. H. Kleinstein, 2013 Quantitative set analysis for gene expression: A method to quantify gene set differential expression including gene-gene correlations. Nucleic Acids Research 41: e170. https://doi.org/10.1093/nar/gkt660

+
+
+
+ + + + +
+
+ +
+ + + + + + + + + + + + + + + + diff --git a/02-microarray/pathway-analysis_microarray_03_qusage_meta_analysis.Rmd b/02-microarray/pathway-analysis_microarray_03_qusage_meta_analysis.Rmd deleted file mode 100644 index a3192ca9..00000000 --- a/02-microarray/pathway-analysis_microarray_03_qusage_meta_analysis.Rmd +++ /dev/null @@ -1,512 +0,0 @@ ---- -title: "Pathway analysis with QuSAGE: Meta-analysis of medulloblastoma" -output: - html_notebook: - toc: TRUE - toc_float: TRUE -author: J. Taroni for ALSF CCDL -date: 2019 ---- - -## Background - -The Quantitative Set Analysis of Gene Expression (QuSAGE) -([](https://doi.org/10.1093/nar/gkt660)) framework -has advantages that we outline in -[`qusage_single_dataset`](./qusage_single_dataset.nb.html), including the -fact that it returns more than just a p-value. -Specifically, QuSAGE quantifies gene set activity with a full probability -density function (PDF). -If we're interested in pathway analysis of multiple datasets, QuSAGE allows -us to perform a _meta-analysis_ by combining distributions from the QuSAGE -results from each dataset. -Meta-analysis with QuSAGE is described in -[ et al. _PLOS Comp Bio._ 2019.](https://doi.org/10.1371/journal.pcbi.1006899) -and implemented in the [`qusage` bioconductor package](https://bioconductor.org/packages/release/bioc/html/qusage.html). -The [`qusage` vignette](https://bioconductor.org/packages/release/bioc/vignettes/qusage/inst/doc/qusage.pdf) -contains a section on meta-analysis. - -## Datasets - -We will use two medulloblastoma datasets— -[Northcott et al.](https://doi.org/10.1038/nature11327) and -[Robinson et al.](https://doi.org/10.1038/nature11213)—to demonstrate how to -perform meta-analysis with `qusage`. -Specifically, we'll identify pathways that are differentially active between -the SHH subgroup vs. the Group 3 and 4 subgroups in both datasets. 
-These datasets are -[`GSE37382`](https://www.refine.bio/experiments/GSE37382/subgroup-specific-somatic-copy-number-aberrations-in-the-medulloblastoma-genome-mrna) and -[`GSE37418`](https://www.refine.bio/experiments/GSE37418/novel-mutations-target-distinct-subgroups-of-medulloblastoma), respectively. - -## Set up - -### Package installation and loading - -```{r} -if (!("qusage" %in% installed.packages())) { - BiocManager::install("qusage", update = FALSE) -} - -if (!("org.Hs.eg.db" %in% installed.packages())) { - BiocManager::install("org.Hs.eg.db", update = FALSE) -} -``` - -```{r} -`%>%` <- dplyr::`%>%` -``` - -```{r} -library(org.Hs.eg.db) -library(qusage) -``` - -### Directories - -Make directories to hold the plots and results if they do not yet exist. - -```{r} -# Define the file path to the plots directory -plots_dir <- "plots" # Replace with path to desired output plots directory - -# Make a plots directory if it isn't created yet -if (!dir.exists(plots_dir)) { - dir.create(plots_dir) -} - -# Define the file path to the results directory -results_dir <- "results" # Replace with path to desired output results directory - -# Make a results directory if it isn't created yet -if (!dir.exists(results_dir)) { - dir.create(results_dir) -} - -# Define the file path to the data directory -data_dir <- "data" -``` - -### Function - -Function to perform gene identifier conversion -- we'll do this once for -each dataset. -We 'functionalize' it to keep from repeating ourselves. -For an example where we do the conversion without using a custom function, see -[`qusage_single_dataset`](./qusage_single_dataset.nb.html) or -[`qusage_replicate_vignette`](./qusage_replicate_vignette.nb.html). - -```{r} -convert_ensembl_to_entrez_mat <- function(exprs_df) { - # Given a data.frame that contains human gene expression values and a column - # (named ENSEMBL) that contains Ensembl gene IDs, return an expression matrix - # where the rownames are Entrez IDs. 
In the case of duplicate Entrez - # identifiers, we summarize to the mean value for an Entrez ID. Mapping - # is performed with AnnotationDbi::mapIds. - # - # Args: - # exprs_df: A data.frame of gene expression data. The first column should be - # named ENSEMBL and contain Ensembl gene IDs. Rows are genes; - # columns are samples. - # Returns: - # An expression matrix where the rownames are Entrez gene IDs - - `%>%` <- dplyr::`%>%` - require(org.Hs.eg.db) - - # error-handling: we need an ENSEMBL column - if (!("ENSEMBL" %in% colnames(exprs_df))) { - stop("'ENSEMBL' column expected in exprs_df") - } - - # using the default behavior for 1:many mappings, where only the first one is - # selected - entrez_mappings <- mapIds(org.Hs.eg.db, - keys = exprs_df$ENSEMBL, - column = "ENTREZID", keytype = "ENSEMBL" - ) - - # if this is not returned in the same order as the keys for some reason, stop - # (all.equal returns a character vector on mismatch, so wrap it in isTRUE) - if (!isTRUE(all.equal(names(entrez_mappings), exprs_df$ENSEMBL))) { - stop("Something happened to the gene order!") - } - - # annotation with Entrez IDs - entrez_exprs_df <- exprs_df %>% - # add a new column that contains the Entrez IDs - # this gets added as the last column - dplyr::mutate(ENTREZID = entrez_mappings) %>% - # drop the Ensembl gene IDs - dplyr::select(-ENSEMBL) %>% - # reorder such that the Entrez IDs are in the first column - dplyr::select(ENTREZID, dplyr::everything()) %>% - # drop any genes without an Entrez ID - dplyr::filter(!is.na(ENTREZID)) - - # if there are any duplicate Entrez gene IDs, collapse to the mean value - if (any(duplicated(entrez_exprs_df$ENTREZID))) { - message("Collapsing to mean value...") - entrez_exprs_df <- entrez_exprs_df %>% - dplyr::group_by(ENTREZID) %>% - dplyr::summarise_all(mean) - } - - # expression matrix where the rownames are the gene identifiers - exprs_mat <- entrez_exprs_df %>% - tibble::column_to_rownames("ENTREZID") %>% - as.matrix() -} -``` - -### Gene sets - -`qusage` allows you to read in gene sets that are in the [GMT 
format](http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29). -[MSigDB](http://software.broadinstitute.org/gsea/msigdb) offers gene sets in this format. - -For more information or recommendations about gene sets, [see the **Gene Sets** -section of the module README](https://github.com/AlexsLemonade/refinebio-examples/tree/master/pathway-analysis#choosing-gene-sets). - -We need to download the MSigDB v6.2 KEGG gene sets that use Entrez gene IDs -and place them at the following relative path: - -``` -gene-sets/c2.cp.kegg.v6.2.entrez.gmt -``` - -We can download the file if it's not available locally yet. - -```{r} -# the kegg gmt file should be located in the spot we mention above -kegg_file <- file.path("gene-sets", "c2.cp.kegg.v6.2.entrez.gmt") -# since we do not track this file in our repository, let's check to make sure -# it exists where we expect it and download it if we don't find it -if (!file.exists(kegg_file)) { - message(paste( - "KEGG GMT file is not found at", kegg_file, - ", downloading now..." - )) - # need gene-sets directory - if (!dir.exists("gene-sets")) { - dir.create("gene-sets") - } - download.file("https://data.broadinstitute.org/gsea-msigdb/msigdb/release/6.2/c2.cp.kegg.v6.2.entrez.gmt", - destfile = kegg_file - ) -} -``` - -## Read in and prep refine.bio data - -### Northcott et al. - -This dataset is too large to be tracked with git without compression. -We need to unzip it here if we have not already. -Note that if your file is not zipped you can skip this chunk. 
- -```{r} -# if the unzipped folder does not exist -- skip this if you do not have a zipped file -if (!dir.exists(file.path( - data_dir, - "GSE37382" # Replace with the name of your file without the .zip extension -))) { - compressed_file <- file.path( - data_dir, - "GSE37382.zip" # Replace with the name of your zipped file - ) - unzip(compressed_file, exdir = data_dir) -} -``` - -#### Expression data - -```{r} -northcott_dir <- file.path( - data_dir, - "GSE37382" # Replace with name of the folder in which your northcott expression file is stored -) -northcott_expression_file <- file.path( - northcott_dir, - "GSE37382.tsv" # Replace with name of your northcott expression file -) -northcott_exprs_df <- readr::read_tsv(northcott_expression_file, - progress = FALSE -) -# first column is currently named 'Gene' and contains Ensembl gene IDs -colnames(northcott_exprs_df)[1] <- "ENSEMBL" -``` - -Convert to expression matrix that uses Entrez IDs as rownames using a -[custom function](#function). - -```{r} -northcott_mat <- convert_ensembl_to_entrez_mat(northcott_exprs_df) -``` - -We no longer need the `data.frame` that contains the expression values. - -```{r} -rm(northcott_exprs_df) -``` - -#### Metadata - -```{r} -northcott_metadata_file <- file.path( - northcott_dir, - "metadata_GSE37382.tsv" # Replace with name of your northcott metadata file -) -northcott_metadata_df <- readr::read_tsv(northcott_metadata_file) %>% - # drop columns that are all NA - dplyr::select(-which(apply(is.na(.), 2, all))) -``` - -We're going to compare the SHH subgroup to all others, so we need to encode -this information in a new column (`shh_v_other`). 
- -```{r} -northcott_metadata_df <- northcott_metadata_df %>% - # retain only pertinent columns - dplyr::select( - refinebio_accession_code, refinebio_title, refinebio_age, - refinebio_sex, subgroup - ) %>% - # new column that sets up pathway analysis - dplyr::mutate(shh_v_other = dplyr::case_when( - subgroup != "SHH" ~ "Other", - TRUE ~ "SHH" - )) -``` - -Reorder expression matrix to match the metadata. - -```{r} -northcott_mat <- northcott_mat[, northcott_metadata_df$refinebio_accession_code] -``` - -### Robinson et al. - -#### Expression data - -```{r} -robinson_dir <- file.path( - data_dir, - "GSE37418" # Replace with name of the folder in which your robinson expression file is stored -) -robinson_expression_file <- file.path( - robinson_dir, - "GSE37418.tsv" # Replace with name of your robinson expression file -) -robinson_exprs_df <- readr::read_tsv(robinson_expression_file, - progress = FALSE -) -colnames(robinson_exprs_df)[1] <- "ENSEMBL" -``` - -Convert to expression matrix that uses Entrez IDs as rownames using a -[custom function](#function). - -```{r} -robinson_mat <- convert_ensembl_to_entrez_mat(robinson_exprs_df) -``` - -```{r} -rm(robinson_exprs_df) -``` - -#### Metadata - -```{r} -robinson_metadata_file <- file.path( - robinson_dir, - "metadata_GSE37418.tsv" # Replace with name of your robinson metadata file -) -robinson_metadata_df <- readr::read_tsv(robinson_metadata_file) %>% - # drop columns that are all NA - dplyr::select(-which(apply(is.na(.), 2, all))) -``` - -To make this more comparable to the Northcott et al. dataset, we're going to -remove the WNT subgroup and outliers such that we are comparing the SHH group -to Group 3 and Group 4. 
- -```{r} -robinson_metadata_df <- robinson_metadata_df %>% - # retain only pertinent columns - dplyr::select( - refinebio_accession_code, refinebio_title, age, `m stage`, - subgroup - ) %>% - # make more comparable to Northcott dataset - dplyr::filter(!(subgroup %in% c("WNT", "SHH OUTLIER", "U"))) %>% - # we'll use SHH vs. Other for our pathway analysis - dplyr::mutate(shh_v_other = dplyr::case_when( - subgroup != "SHH" ~ "Other", - TRUE ~ "SHH" - )) -``` - -Reorder expression data to match metadata. - -```{r} -robinson_mat <- robinson_mat[, robinson_metadata_df$refinebio_accession_code] -``` - -## Pathway analysis - -### Read in KEGG pathways - -```{r} -kegg_pathways <- read.gmt(kegg_file) -``` - -### Northcott et al. - -```{r} -northcott_results <- qusage( - eset = northcott_mat, - labels = northcott_metadata_df$shh_v_other, - contrast = "SHH-Other", - geneSets = kegg_pathways -) -``` - -Save the Northcott et al. results to file. - -```{r} -northcott_results_file <- file.path( - results_dir, - "Northcott_SHH-Other_QSarray.RDS" # Replace with a relevant output name for the first results RDS file -) -readr::write_rds(northcott_results, northcott_results_file) -``` - -### Robinson et al. - -```{r} -robinson_results <- qusage( - eset = robinson_mat, - labels = robinson_metadata_df$shh_v_other, - contrast = "SHH-Other", - geneSets = kegg_pathways -) -``` - -Save the Robinson et al. results to file. - -```{r} -robinson_results_file <- file.path( - results_dir, - "Robinson_SHH-Other_QSarray.RDS" # Replace with a relevant output name for the second results RDS file -) -readr::write_rds(robinson_results, robinson_results_file) -``` - -## Meta-analysis - -### Combine probability density function - -`combinePDFs` is a `qusage` function we can use for meta-analysis. -This function accepts a list of `QSArray` results. 
- -```{r} -results_list <- list( - Northcott = northcott_results, - Robinson = robinson_results -) -``` - -Because there are more samples in the Northcott et al. dataset, it will be -weighted more highly when combining the distribution. - -```{r} -combined_results <- combinePDFs(results_list) -``` - -Just as in a single dataset, we can extract relevant information from -the combined `QSArray` with built-in `qusage` functions. - -Let's look at the top 20 pathways with `qsTable`. -The row numbers in the output of `qsTable` can serve as input to the -`path.index` argument of other `qusage` functions as we'll see below. - -```{r} -qsTable(combined_results) -``` - -### Plotting - -We can plot the distributions with `plotCombinedPDF`. -First let's plot KEGG Pathways in Cancer, which has elevated expression in -the SHH group as it has a _positive_ fold change. -In a two group comparison in QuSAGE, **pathway activity** is the mean difference -of the log expression of all genes in a pathway. - -```{r} -plotCombinedPDF(combined_results, path.index = 162) -legend("topleft", - legend = c("Northcott", "Robinson", "Meta-analysis"), - lty = 1, col = c("#E41A1C", "#377EB8", "black") -) -``` - -The positive pathway activity curves indicate that the genes in this -pathway have higher expression values in the SHH group for both datasets. -The pathway activity (mean difference) in the Northcott dataset which has -a larger sample size. - -Let's plot KEGG Dorsoventral Axis Formation, which has higher expression in -Group 3 and 4 samples. - -```{r} -plotCombinedPDF(combined_results, path.index = 107) -legend("topleft", - legend = c("Northcott", "Robinson", "Meta-analysis"), - lty = 1, col = c("#E41A1C", "#377EB8", "black") -) -``` - -#### Plot individual gene mean and 95% confidence intervals (CI) - -We know that the directionality of the KEGG Dorsoventral Axis Formation pathway -agrees between datasets. 
-We can look into what genes are driving the pathway activity with the -`plotCIsGenes` function. -Gene activity, which will be plotted on the y-axis, is the difference between the -two groups. -The _pathway_ CI will also be displayed on the plot as a gray band by default. - -```{r} -plotCIsGenes(northcott_results, - path.index = 107, addGrid = FALSE, - cex.xaxis = 1.25 -) -``` - -We can see that, in the Northcott dataset, there are about 5 genes really -driving the negative pathway activity. -Let's take a look at the same pathway in Robinson. - -```{r} -plotCIsGenes(robinson_results, - path.index = 107, addGrid = FALSE, - cex.xaxis = 1.25 -) -``` - -Looks like some of the top genes are the same between datasets: `56907`, `56776`. -Because Entrez IDs are not particularly human-readable, we can use the same -annotation package to convert these to gene symbols and gene names. - -```{r} -AnnotationDbi::select(org.Hs.eg.db, - keys = c("56907", "56776"), - keytype = "ENTREZID", columns = c("SYMBOL", "GENENAME") -) -``` - -It looks like these two genes promote the polymerization of actin filaments. - -## Session Info - -```{r} -sessioninfo::session_info() -``` diff --git a/02-microarray/pathway-analysis_microarray_04_qusage_replicate_vignette.Rmd b/02-microarray/pathway-analysis_microarray_04_qusage_replicate_vignette.Rmd deleted file mode 100644 index 397a6b3d..00000000 --- a/02-microarray/pathway-analysis_microarray_04_qusage_replicate_vignette.Rmd +++ /dev/null @@ -1,465 +0,0 @@ ---- -title: "Pathway analysis with QuSAGE: Replicate vignette" -output: - html_notebook: - toc: TRUE - toc_float: TRUE -author: J. Taroni for ALSF CCDL -date: 2019 ---- - -## Background - -Here, we will replicate the [`qusage` package vignette](https://bioconductor.org/packages/release/bioc/vignettes/qusage/inst/doc/qusage.pdf) ( C.). -Specifically, we'll use the same dataset and analysis as the vignette, but the -expression data and sample metadata we will use are processed with refine.bio. 
-This allows us to explore formatting refine.bio datasets for use with -`qusage` and to compare the results using refine.bio data to the -results in the package vignette. -We will briefly cover how to obtain gene sets for pathway analysis from the -[Molecular Signatures Database (MSigDB)](http://software.broadinstitute.org/gsea/msigdb) -as well. - -## Pathway analysis - -### Set up - -```{r} -# Set seed -set.seed(12345) - -`%>%` <- dplyr::`%>%` -``` - -We need to install `qusage` if we have not already done so. -We'll need `org.Hs.eg.db` as well. - -```{r} -if (!("qusage" %in% installed.packages())) { - BiocManager::install("qusage", update = FALSE) -} - -if (!("org.Hs.eg.db" %in% installed.packages())) { - BiocManager::install("org.Hs.eg.db", update = FALSE) -} -``` - -```{r} -library(org.Hs.eg.db) -library(qusage) -``` - -Make directories to hold the plots and results. - -```{r} -# Define the file path to the plots directory -plots_dir <- "plots" # Replace with path to desired output plots directory - -# Make a plots directory if it isn't created yet -if (!dir.exists(plots_dir)) { - dir.create(plots_dir) -} - -# Define the file path to the results directory -results_dir <- "results" # Replace with path to desired output results directory - -# Make a results directory if it isn't created yet -if (!dir.exists(results_dir)) { - dir.create(results_dir) -} - -# Define the file path to the data directory -data_dir <- "data" -``` - -### Gene sets - -`qusage` allows you to read in gene sets that are in the [GMT format](http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29). -[MSigDB](http://software.broadinstitute.org/gsea/msigdb) offers gene sets in this format. 
- -We need to download the MSigDB v6.2 KEGG gene sets that use Entrez gene IDs and place them at the following relative path: - -``` -gene-sets/c2.cp.kegg.v6.2.entrez.gmt -``` - -We can download the file if we don't find it where we expect! - -```{r} -# the kegg gmt file should be located in the spot we mention above -kegg_file <- file.path("gene-sets", "c2.cp.kegg.v6.2.entrez.gmt") -# since we do not track this file in our repository, let's check to make sure -# it exists where we expect it and download it if we don't find it -if (!file.exists(kegg_file)) { - message(paste( - "KEGG GMT file is not found at", kegg_file, - ", downloading now..." - )) - # need gene-sets directory - if (!dir.exists("gene-sets")) { - dir.create("gene-sets") - } - download.file("https://data.broadinstitute.org/gsea-msigdb/msigdb/release/6.2/c2.cp.kegg.v6.2.entrez.gmt", - destfile = kegg_file - ) -} -``` - -### Conversion to Entrez gene IDs - -All refine.bio data uses Ensembl gene IDs. -As noted above, the gene sets we'll be using for our pathway analysis use -Entrez gene identifiers. -As a result, we'll need to convert the expression data identifiers to Entrez -IDs. - -```{r} -# read in gene expression data -expression_file <- file.path( - data_dir, - "GSE30550", # Replace with name of the folder in which your expression file is stored - "GSE30550.tsv" # Replace with name of your expression file -) -exprs_df <- readr::read_tsv(expression_file, progress = FALSE) -# first column is currently named 'Gene' and contains Ensembl gene IDs -colnames(exprs_df)[1] <- "ENSEMBL" -``` - -We're using the default behavior for 1:many mappings, where only the first one -is selected ([docs](https://www.rdocumentation.org/packages/AnnotationDbi/versions/1.30.1/topics/AnnotationDb-objects)). 
- -```{r} -entrez_mappings <- mapIds(org.Hs.eg.db, - keys = exprs_df$ENSEMBL, - column = "ENTREZID", keytype = "ENSEMBL" -) -``` - -```{r} -all.equal(names(entrez_mappings), exprs_df$ENSEMBL) -``` - -We'll add this to the `data.frame` of expression data and save to file. - -```{r} -exprs_df <- exprs_df %>% - dplyr::mutate(ENTREZID = entrez_mappings) %>% - dplyr::select(ENTREZID, ENSEMBL, dplyr::everything()) -readr::write_tsv(exprs_df, - path = file.path( - data_dir, - "GSE30550", # Replace with name of the folder in which your expression data is stored - "GSE30550_entrez.tsv" # Replace with name of your expression file containing Entrez gene IDs - ) -) -``` - -### Read in metadata and filter - -```{r} -metadata_file <- file.path( - data_dir, - "GSE30550", # Replace with name of the folder in which your relevant metadata is stored - "metadata_GSE30550.tsv" # Replace with name of the metadata file relevant to your expression file -) -metadata_df <- readr::read_tsv(metadata_file) %>% - # drop columns that are all NA - dplyr::select(-which(apply(is.na(.), 2, all))) -``` - -In the `qusage` vignette, the authors compared samples 0 hours after exposure to -influenza and samples 77 hours after exposure to influenza. -We'll subset to just these samples. 
- -```{r} -filtered_metadata_df <- metadata_df %>% - dplyr::filter(time_hpi %in% c("Hour 00", "Hour 077")) %>% - # the subject & timepoint info are combined in the 'title' column - dplyr::mutate( - subject = stringr::word(title, 1, sep = ","), - # relabel the timepoints, the lack of spaces will help - # with the contrast call later - timepoint = dplyr::case_when( - time_hpi == "Hour 00" ~ "t0", - time_hpi == "Hour 077" ~ "t1" - ) - ) %>% - dplyr::arrange(time_hpi, subject) %>% - dplyr::select( - refinebio_accession_code, subject, refinebio_title, class_blu, - clinic_pheno, time_hpi, timepoint - ) -``` - -```{r} -time_filtered_exprs <- exprs_df %>% - dplyr::select(-ENSEMBL) %>% - # we're reordering samples here to match the metadata ordering - dplyr::select(ENTREZID, filtered_metadata_df$refinebio_accession_code) -``` - -Are there duplicated identifiers? - -```{r} -anyDuplicated(time_filtered_exprs$ENTREZID) -``` - -Collapse duplicated gene identifiers to their mean values. - -```{r} -collapsed_exprs <- time_filtered_exprs %>% - dplyr::group_by(ENTREZID) %>% - dplyr::summarise_all(mean) %>% - # drop unannotated genes - dplyr::filter(!is.na(ENTREZID)) -``` - -We need to pass `qusage` a matrix. -Note that we're departing a bit from the advice of the authors by -refraining from filtering the expression matrix. - -```{r} -exprs_mat <- collapsed_exprs %>% - tibble::column_to_rownames("ENTREZID") %>% - as.matrix() -``` - -Let's get rid of some of the objects in the workspace we don't need. - -```{r} -rm(collapsed_exprs, time_filtered_exprs, exprs_df, metadata_df) -``` - -### Read in gene sets - -```{r} -kegg_genesets <- read.gmt(kegg_file) -``` - -### Perform pathway analysis - -```{r} -timepoint_qusage_results <- qusage( - eset = exprs_mat, - labels = filtered_metadata_df$timepoint, - contrast = "t1-t0", - geneSets = kegg_genesets -) -``` - -We can look at the results with the `qsTable` function. -This will give us the top 20 pathways by default. 
-(This can be changed via the `number` argument to `qsTable`.) - -```{r} -qsTable(timepoint_qusage_results) -``` - -Let's save the results to file. -First, let's save a nicely formatted table. - -```{r} -timepoint_table <- qsTable(timepoint_qusage_results, - # numPathways lets us snag all tested gene sets - number = numPathways(timepoint_qusage_results) -) -readr::write_tsv( - timepoint_table, - file.path(results_dir, "qusage_t1-t0_kegg_results.tsv") -) -``` - -Note that gene sets with negative fold changes are higher in `t0`. - -We're also going to save the `QSarray` output of `qusage` because, as we'll see -shortly, there's lots you can do with qusage results! - -```{r} -readr::write_rds( - timepoint_qusage_results, - file.path(results_dir, "qusage_t1-t0_kegg_QSarray.RDS") -) -``` - -#### Plotting - -We can look at the overall results with the `plotCIs` function. - -```{r} -plotCIs(timepoint_qusage_results, cex.xaxis = 0.25) -``` - -It's a bit difficult to see in this notebook, but we'll save the plot as a PDF. - -```{r} -pdf(file.path(plots_dir, "qusage_t1-t0_plotCIs.pdf"), width = 14, height = 8) -# adjusting the figure margins to better fit pathway names -par(mar = c(8, 4, 1, 2)) -plotCIs(timepoint_qusage_results, cex.xaxis = 0.5) -dev.off() -``` - -Let's look more closely at the T Cell Receptor Signaling Pathway. -We can see above that this pathway has a negative fold change and therefore -should be higher at `t0`. - -```{r} -pathway_index <- - which(names(kegg_genesets) == "KEGG_T_CELL_RECEPTOR_SIGNALING_PATHWAY") # Replace with exact name of desired pathway -- pathway names can be listed with `names(kegg_genesets)` -``` - -Plot the overall distribution of the pathway (thick black line) and the -distribution of the individual genes (thin lines colored by their standard -deviations). 
- -```{r} -plotGeneSetDistributions(timepoint_qusage_results, path.index = pathway_index) -``` - -We can see the majority of genes in this pathway don't change much, but a -handful have lower values in `t1`. - -Now, plot the mean and confidence intervals for each gene in the pathway. -This gives us insight into _which_ genes changed the most. - -```{r} -plotCIsGenes(timepoint_qusage_results, - path.index = pathway_index, - cex.xaxis = 0.75 -) -``` - -We can easily check to see if the directionality is as we expect by making -a boxplot of the gene `387` which has a negative value and is on right side of -the graph. - -```{r} -graphics::boxplot(exprs_mat["387", ] ~ filtered_metadata_df$timepoint) -``` - -### Two-way comparison - -Patients in this influenza data set were either _symptomatic_ or _asymptomatic_. -We can check if there are differences in pathway _responses_ to influenza virus -(e.g., between time points) between the two groups. -The ability to do this more complex comparison is a great feature of `qusage`. -Again, we'll mirror the analysis performed in the vignette. - -The symptomatic or asymptomatic information is in the `clinic_pheno` column -of the metadata. - -```{r} -head(filtered_metadata_df$clinic_pheno) -``` - -Generate labels for the two-way comparison. - -```{r} -two_way_labels <- paste(filtered_metadata_df$clinic_pheno, - filtered_metadata_df$timepoint, - sep = "." 
-) -two_way_labels -``` - -Symptomatic comparison - -```{r} -sx_results <- qusage( - eset = exprs_mat, - labels = two_way_labels, - contrast = "Symptomatic.t1-Symptomatic.t0", - geneSets = kegg_genesets, - # timepoints are paired samples, so we can specify that - # information here - pairVector = filtered_metadata_df$subject -) -``` - -Asymptomatic comparison - -```{r} -asx_results <- qusage( - eset = exprs_mat, - labels = two_way_labels, - contrast = "Asymptomatic.t1-Asymptomatic.t0", - geneSets = kegg_genesets, - pairVector = filtered_metadata_df$subject -) -``` - -Calculate the difference between the two comparisons - -```{r} -sx_vs_asx <- qusage( - eset = exprs_mat, - labels = two_way_labels, - contrast = "(Symptomatic.t1-Symptomatic.t0) - (Asymptomatic.t1-Asymptomatic.t0)", - geneSets = kegg_genesets, - pairVector = filtered_metadata_df$subject -) -``` - -Let's look at the cytosolic DNA sensing pathway like the authors do in the -vignette and see if we get similar results. - -```{r} -names(kegg_genesets)[125] -``` - -```{r} -plotDensityCurves(asx_results, - path.index = 125, - col = "#D55E00", - main = "CYTOSOLIC DNA SENSING", - xlim = c(-1.0, 2.5) -) -plotDensityCurves(sx_results, path.index = 125, col = "#56B4E9", add = TRUE) -plotDensityCurves(sx_vs_asx, - path.index = 125, col = "#000000", add = TRUE, - lwd = 3 -) -legend("topleft", - legend = c("Asymptomatic", "Symptomatic", "Symptomatic-Asymptomatic"), - lty = 1, col = c("#D55E00", "#56B4E9", "#000000"), lwd = c(1, 1, 3), - cex = 0.5 -) -``` - -This tells us that this pathway is active in the Symptomatic patients, but -not much is happening in the Asymptomatic patients. -This is the result in the vignette, although the magnitude of the values (`x`) -is different. - -```{r} -plotGeneSetDistributions(sx_results, asx_results, path.index = 125) -``` - -A few genes appear to be driving the elevation of the cytosolic DNA sensing -pathway in symptomatic patients after exposure (`t1`). 
-We can take a closer look with `plotCIsGenes`. - -```{r} -plotCIsGenes(sx_results, path.index = 125) -``` - -Let's look at the Asymptomatic results, where we'd expect no difference between -timepoints for most of the genes based on the plots above. - -```{r} -plotCIsGenes(asx_results, path.index = 125) -``` - -We expect the gene `3665` ([_IRF7_](https://www.genecards.org/cgi-bin/carddisp.pl?gene=IRF7&keywords=3665)) -to be elevated in `t1` compared to `t0` but _only_ in Symptomatic patients. -Let's make a boxplot to be sure. - -```{r} -graphics::boxplot(exprs_mat["3665", ] ~ interaction(filtered_metadata_df$timepoint, filtered_metadata_df$clinic_pheno)) -``` - -## Session info - -```{r} -sessioninfo::session_info() -``` diff --git a/02-microarray/pathway-analysis_microarray_05_qusage_single_dataset.Rmd b/02-microarray/pathway-analysis_microarray_05_qusage_single_dataset.Rmd deleted file mode 100644 index 864f1dc3..00000000 --- a/02-microarray/pathway-analysis_microarray_05_qusage_single_dataset.Rmd +++ /dev/null @@ -1,481 +0,0 @@ ---- -title: "Pathway analysis with QuSAGE: Single dataset" -output: - html_notebook: - toc: TRUE - toc_float: TRUE -author: J. Taroni for ALSF CCDL -date: 2019 ---- - -## Background - -In this module, we'll demonstrate how to perform pathway analysis using -Quantitative Set Analysis of Gene Expression (QuSAGE) -([](https://doi.org/10.1093/nar/gkt660)). -QuSAGE, implemented in the [`qusage` bioconductor package](https://bioconductor.org/packages/release/bioc/html/qusage.html), -has some nice features: - -* It takes into account inter-gene correlation (a source of type I error). -* It returns more information than just a p-value. -That's useful for analyses you might want to perform downstream. -* Built-in visualization functionality. - -We recommend taking a look at the original publication (Yaari et al.) and -the R package documentation to learn more. 
- -## Gene sets - -`qusage` allows you to read in gene sets that are in the [GMT format](http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29). -[Curated gene sets from MSigDB](http://software.broadinstitute.org/gsea/msigdb/collections.jsp#C2) -like [KEGG](https://www.genome.jp/kegg/) are popular for pathway analysis, but -MSigDB only distributes human pathway data. -Here, we'll work with a mouse dataset. - -In the [`kegg_ortholog_mapping`](./kegg_ortholog_mapping.nb.html) notebook in -this module, we mapped human Entrez IDs to mouse symbols using the -[`hcop` package](https://github.com/stephenturner/hcop). -When there was a 1:many mapping between human Entrez IDs and mouse gene symbols, -we selected the mouse gene symbol with the highest number of resources -supporting the mapping. -This decision might not be suitable for every experiment and may result in -some loss of information. - -## Dataset - -We're using [`GSE75574`](https://www.refine.bio/experiments/GSE75574/gene-expression-in-mouse-tissues-in-response-to-short-term-calorie-restriction) in this notebook. -This dataset assays the gene expression response to short-term calorie -restriction in multiple tissues from multiple mouse strains. -We'll test for pathways that change in response to short-term calorie -restriction. - -## Set up - -Package installation and loading - -```{r} -if (!("qusage" %in% installed.packages())) { - BiocManager::install("qusage", update = FALSE) -} - -if (!("org.Mm.eg.db" %in% installed.packages())) { - BiocManager::install("org.Mm.eg.db", update = FALSE) -} - -if (!("pheatmap" %in% installed.packages())) { - install.packages("pheatmap") -} -``` - -```{r} -`%>%` <- dplyr::`%>%` -``` - -```{r} -library(qusage) -library(org.Mm.eg.db) -library(pheatmap) -``` - -Create directories to hold plots and results if they do not yet exist. 
- -```{r} -# Define the file path to the plots directory -plots_dir <- "plots" # Replace with path to desired output plots directory - -# Make a plots directory if it isn't created yet -if (!dir.exists(plots_dir)) { - dir.create(plots_dir) -} - -# Define the file path to the results directory -results_dir <- "results" # Replace with path to desired output results directory - -# Make a results directory if it isn't created yet -if (!dir.exists(results_dir)) { - dir.create(results_dir) -} - -# Define the file path to the data directory -data_dir <- "data" -``` - -## Read in refine.bio data - -The gene expression matrix of the dataset we'll be working with is too large -to be tracked with git without compression, so we need to unzip it if we have -not already. - -Note that if your file is not zipped you can skip this chunk. - -```{r} -# if the unzipped folder does not exist -- skip this if you do not have a zipped file -if (!dir.exists(file.path( - data_dir, - "GSE75574" # Replace with the name of your file without the .zip extension -))) { - compressed_file <- file.path( - data_dir, - "GSE75574.zip" # Replace with the name of your zipped file - ) - unzip(compressed_file, exdir = data_dir) -} -``` - -### Expression data - -```{r} -expression_file <- file.path( - data_dir, - "GSE75574", # Replace with name of the folder in which your expression file is stored - "GSE75574.tsv" # Replace with name of your expression file -) -exprs_df <- readr::read_tsv(expression_file, progress = FALSE) -# first column is currently named 'Gene' and contains Ensembl gene IDs -colnames(exprs_df)[1] <- "ENSEMBL" -``` - -Because our gene sets use gene symbols and expression data from refine.bio uses -Ensembl IDs, we need to do a conversion. -We're using the default behavior for 1:many mappings here, where only the first -one is selected -([docs](https://www.rdocumentation.org/packages/AnnotationDbi/versions/1.30.1/topics/AnnotationDb-objects)). 
- -```{r} -symbol_mappings <- mapIds(org.Mm.eg.db, - keys = exprs_df$ENSEMBL, - column = "SYMBOL", keytype = "ENSEMBL" -) -``` - -```{r} -head(symbol_mappings) -``` - -```{r} -# mapIds returns this in the same order as the keys we passed it to, let's -# demonstrate that -all.equal(names(symbol_mappings), exprs_df$ENSEMBL) -``` - -We'll annotate these expression data with the gene symbols. - -```{r} -symbol_exprs_df <- exprs_df %>% - # add a new column that contains the gene symbols - # this gets added as the last column - dplyr::mutate(SYMBOL = symbol_mappings) %>% - # drop the Ensembl gene IDs - dplyr::select(-ENSEMBL) %>% - # reorder such that the gene symbols are in the first column - dplyr::select(SYMBOL, dplyr::everything()) %>% - # drop any genes without a gene symbol - dplyr::filter(!is.na(SYMBOL)) -``` - -How many gene symbols have duplicate values? - -```{r} -sum(duplicated(symbol_exprs_df$SYMBOL)) -``` - -Collapse to the mean value for that gene symbol. - -```{r} -symbol_exprs_df <- symbol_exprs_df %>% - dplyr::group_by(SYMBOL) %>% - dplyr::summarise_all(mean) -``` - -Write to file. - -```{r} -readr::write_tsv(symbol_exprs_df, file.path( - data_dir, - "GSE75574", # Replace with name of the folder in which you would like to store your output file - "GSE75574_symbols.tsv" # Replace with relevant output name -)) -``` - -### Metadata - -```{r} -metadata_file <- file.path( - data_dir, - "GSE75574", # Replace with name of the folder in which your relevant metadata file is stored - "metadata_GSE75574.tsv" # Replace with name of relevant metadata file -) -metadata_df <- readr::read_tsv(metadata_file) %>% - # drop any metadata columns that are all NAs - dplyr::select(-which(apply(is.na(.), 2, all))) -``` - -Retain only the most pertinent columns. 
- -```{r} -metadata_df <- metadata_df %>% - dplyr::select( - refinebio_accession_code, refinebio_title, - refinebio_specimen_part, refinebio_sex, strain - ) -``` - -This particular accession ([`GSE75574`](https://www.refine.bio/experiments/GSE75574/gene-expression-in-mouse-tissues-in-response-to-short-term-calorie-restriction)) is a SuperSeries comprised of experiments -in multiple tissues from multiple mouse strains. -We'll use white adipose tissue for all strains in this example. - -```{r} -metadata_df <- metadata_df %>% - dplyr::filter(refinebio_specimen_part == "white adipose") -``` - -The two groups we are interested in for comparison are calorie restricted vs. -controls. -This information is in the title. -We can extract this and put it in its own column called `condition`. - -```{r} -metadata_df <- metadata_df %>% - dplyr::mutate(condition = dplyr::case_when( - grepl("control", refinebio_title) ~ "control", - grepl("calorierestricted", refinebio_title) ~ "calorierestricted" - )) -``` - -## Pathway Analysis - -### Read in KEGG pathways - -First, we need the sets of genes that represent pathways. -Again, these were prepared in the -[`kegg_ortholog_mapping`](./kegg_ortholog_mapping.nb.html) notebook (see -[Gene Sets](#gene-sets) above). - -```{r} -kegg_file <- file.path( - "gene-sets", - "c2.cp.kegg.v6.2.entrez_mouse_symbol_hcop.gmt" -) -kegg_pathways <- read.gmt(kegg_file) -``` - -### Prep expression matrix - -```{r} -exprs_mat <- symbol_exprs_df %>% - tibble::column_to_rownames("SYMBOL") %>% - # same order as the metdata - dplyr::select(metadata_df$refinebio_accession_code) %>% - # qusage takes a matrix - as.matrix() -``` - -We want to compare calorie restricted mice to controls. - -```{r} -qusage_results <- qusage( - eset = exprs_mat, - labels = metadata_df$condition, - contrast = "calorierestricted-control", - geneSets = kegg_pathways -) -``` - -Save the `QSArray` output. -We can use this object for visualization or other downstream analyses. 
- -```{r} -readr::write_rds( - qusage_results, - file.path( - results_dir, - "GSE75574_adipose_calorierest-ctrl_QSarray.RDS" # Replace with file name relevant to output results - ) -) -``` - -### Overall results - -We can get a look at the general trend of the results with the `plotCIs` -function. -This plots the means and 95% confidence intervals of each pathway we tested, -sorted such that the gene sets with the highest mean will be on the left of the -plot. -These are gene sets that are elevated in calorie restricted mice. -Error bars are colored by the directionality and corrected p-value (FDR by -default). -Unfortunately the p-value color scheme is red-green, which does not work well -for people with green or red color-vision deficiency. - -```{r} -plotCIs(qusage_results, cex.xaxis = 0.25, main = "Calorie Restricted - Control") -``` - -Save to plot. - -```{r} -pdf(file.path( - plots_dir, - "GSE75574_adipose_calorierestricted-control_plotCIs.pdf" -), -width = 14, height = 8.5 -) -plotCIs(qusage_results, cex.xaxis = 0.25, main = "Calorie Restricted - Control") -dev.off() -``` - -We can also look at the log fold-change and FDR values for pathways with the -`qsTable` function. -By default, this function shows you the top 20 pathways sorted by FDR. -We can change the `number` argument to `qsTable` to decrease or increase the -number of pathways returned. -We can use the `numPathways` function to get _all_ the pathways we tested. - -```{r} -qsTable(qusage_results, number = numPathways(qusage_results)) -``` - -Write to results. - -```{r} -qsTable(qusage_results, number = numPathways(qusage_results)) %>% - readr::write_tsv(file.path( - results_dir, - "GSE75574_adipose_calorierestricted-control.tsv" # Replace with file name relevant to results output - )) -``` - -`qusage` has functionality that lets us dig into our results a bit more to -get an idea of which genes are contributing to the differences between our two -groups. 
- -### KEGG ECM Receptor Interaction - -The KEGG ECM Receptor Interaction pathway expression is reduced in response -to calorie restriction. - -We can look at the distribution of genes in this pathway with the -`plotGeneSetDistributions` function. -We tell this function which pathway to plot with the `path.index` argument. - -```{r} -grep( - "ECM_RECEPTOR", # Replace with an exact word or phrase that would filter in your desired pathway(s) as the `grep` function looks for all instances of the pattern given -- names of pathways can be seen using the `names(kegg_pathways)` function - names(kegg_pathways) -) -``` - -```{r} -plotGeneSetDistributions(qusage_results, path.index = 114) -``` - -_Most_ genes are around zero (no activity), but we see some genes with -negative activity values. -We can dig into the _what_ the genes are with the `plotCIsGenes` function. - -```{r} -plotCIsGenes(qusage_results, path.index = 114) -``` - -Let's look at another example with the opposite directionality. - -### KEGG Steroid Biosynthesis - -The KEGG Steroid Biosynthesis pathway expression is increased in calorie -restricted adipose tissue. - -```{r} -plotGeneSetDistributions(qusage_results, path.index = 9) -``` - -This looks about half of the genes have positive activity values. - -```{r} -plotCIsGenes(qusage_results, path.index = 9) -``` - -We can make a heatmap with only genes from this pathway. -The heatmap will have similar information as `plotCIsGenes`, but we'll also -get a sense of how samples relate to one another. - -#### Heatmap - -Subset the expression matrix to only genes in the KEGG Steroid Biosynthesis -pathways. - -```{r} -steroid_genes <- kegg_pathways$KEGG_STEROID_BIOSYNTHESIS -# rows (genes) in the pathway -steroid_mat <- exprs_mat[which(rownames(exprs_mat) %in% steroid_genes), ] -``` - -We'll use annotation bars for the columns (samples). 
- -```{r} -# need a data.frame that contains the sample metadata that we're interested in -annotation_col_df <- metadata_df %>% - dplyr::select(refinebio_accession_code, condition, strain) %>% - tibble::column_to_rownames("refinebio_accession_code") - -# colors to be used in the annotation bars -- we need to assign one for -# each factor level in the two columns -annot_colors <- list( - condition = c( - control = "#FFFFFF", - calorierestricted = "#CD2626" - ), - # palette from - # https://github.com/clauswilke/colorblindr/blob/1ac3d4d62dad047b68bb66c06cee927a4517d678/R/palettes.R#L7 - strain = c( - `129S1/SvImJ` = "#E69F00", - `B6C3F1/J` = "#56B4E9", - `Balbc/J` = "#009E73", - `C3H/HeJ` = "#F0E442", - `C57BL6/J` = "#0072B2", - `CBA/J` = "#D55E00", - `DBA/2J` = "#CC79A7" - ) -) -``` - -Let's make the heatmap with annotation bars. - -```{r} -hm_plot <- pheatmap::pheatmap(steroid_mat, - # using blue, white, red color scheme - color = colorRampPalette(c( - "#0000FF", "#FFFFFF", - "#FF0000" - ))(25), - clustering_distance_cols = "correlation", - clustering_distance_rows = "correlation", - clustering_method = "average", - # scale the expression values of genes for - # visualization - scale = "row", - fontsize_col = 2, - angle_col = "45", - annotation_col = annotation_col_df, - annotation_colors = annot_colors, - main = "KEGG Steroid Biosynthesis" -) -``` - -Save the plot. 
- -```{r} -pdf(file.path( - plots_dir, - "GSE75574_steroid_biosynthesis_heatmap.pdf" # Replace with relevant output plot name -), -width = 7, height = 5 -) -print(hm_plot) -dev.off() -``` - -## Session info - -```{r} -sessioninfo::session_info() -``` diff --git a/02-microarray/pathway-analysis_microarray_06_ssgsea.Rmd b/02-microarray/pathway-analysis_microarray_06_ssgsea.Rmd deleted file mode 100644 index 18d4da57..00000000 --- a/02-microarray/pathway-analysis_microarray_06_ssgsea.Rmd +++ /dev/null @@ -1,445 +0,0 @@ ---- -title: "Pathway analysis: ssGSEA" -output: - html_notebook: - toc: TRUE - toc_float: TRUE -author: J. Taroni for ALSF CCDL -date: 2019 ---- - -## Background - -Pathway or gene set analysis methods like Quantitative Set Analysis of Gene -Expression (QuSAGE) -([](https://doi.org/10.1093/nar/gkt660)) or Gene Set -Enrichment Analysis (GSEA) -([ et al. _PNAS_. 2005.](https://doi.org/10.1073/pnas.0506580102)) -require us to specify group labels. -We may want a better idea of what pathways are up- or down-regulated in -_individual samples_ if we, for example, suspect that there are subgroups of -patients during exploratory data analysis. -We can use single-sample GSEA (ssGSEA) -([Barbie et al. _Nature_. 2009.](https://dx.doi.org/10.1038/nature08460)), -which is implemented in the -[`GSVA` bioconductor package](https://bioconductor.org/packages/release/bioc/html/GSVA.html). -Note that `GSVA` contains _multiple_ gene set enrichment methods and has an -excellent [vignette](https://bioconductor.org/packages/release/bioc/vignettes/GSVA/inst/doc/GSVA.pdf). - -## Gene sets - -We will use KEGG (Kyoto Encyclopedia of Genes and Genomes) Pathways in this -analysis. -We acquired the KEGG `v6.2` pathway set that used human Entrez IDs from the -[Molecular Signatures Database (MSigDB)](http://software.broadinstitute.org/gsea/msigdb). 
- -In the [`kegg_ortholog_mapping`](./kegg_ortholog_mapping.nb.html) notebook in -this module, we mapped human Entrez IDs to mouse symbols using the -[`hcop` package](https://github.com/stephenturner/hcop). -When there was a 1:many mapping between human Entrez IDs and mouse gene symbols, -we selected the mouse gene symbol with the highest number of resources -supporting the mapping. -This decision might not be suitable for every experiment and may result in -some loss of information. - -## Dataset - -We're using [`GSE75574`](https://www.refine.bio/experiments/GSE75574/gene-expression-in-mouse-tissues-in-response-to-short-term-calorie-restriction) -in this notebook. -This dataset assays the gene expression response to short-term calorie -restriction in multiple tissues from multiple mouse strains. -This is the same dataset we used in one of our QuSAGE examples -([`qusage_single_dataset`](./qusage_single_dataset.nb.html)), where we only -looked at white adipose tissue. - -## Set up - -Package installation and loading. - -```{r} -if (!("qusage" %in% installed.packages())) { - BiocManager::install("qusage", update = FALSE) -} - -if (!("org.Mm.eg.db" %in% installed.packages())) { - BiocManager::install("org.Mm.eg.db", update = FALSE) -} - -if (!("GSVA" %in% installed.packages())) { - BiocManager::install("GSVA", update = FALSE) -} - -if (!("matrixStats" %in% installed.packages())) { - install.packages("matrixStats") -} -``` - -```{r} -`%>%` <- dplyr::`%>%` -``` - -```{r} -library(org.Mm.eg.db) -library(GSVA) -``` - -Create directories to hold plots and results if they do not yet exist. 
- -```{r} -# Define the file path to the plots directory -plots_dir <- "plots" # Replace with path to desired output plots directory - -# Make a plots directory if it isn't created yet -if (!dir.exists(plots_dir)) { - dir.create(plots_dir) -} - -# Define the file path to the results directory -results_dir <- "results" # Replace with path to desired output results directory - -# Make a results directory if it isn't created yet -if (!dir.exists(results_dir)) { - dir.create(results_dir) -} - -# Define the file path to the data directory -data_dir <- "data" -``` - -## Read in refine.bio data - -The gene expression matrix of the dataset we'll be working with is too large -to be tracked with git without compression, so we need to unzip it if we have -not already. -This compressed folder contains both the expression matrix and the metadata. - -```{r} -# if the unzipped folder does not exist -if (!dir.exists(file.path( - data_dir, - "GSE75574" # Replace with the name of your file without the .zip extension -))) { - compressed_file <- file.path( - data_dir, - "GSE75574.zip" # Replace with the name of your zipped file - ) - unzip(compressed_file, exdir = data_dir) -} -``` - -### Expression data - -Because we used this in another notebook, it's possible that you already have -a local copy of the prepared expression matrix. -If a local copy exists, we don't want to go through the steps to reprocess it, -so we can check for this first. - -If not, we'll process the data in the following ways: - -* Convert the Ensembl gene IDs that are used in refine.bio data to gene symbols. -* If duplicate gene symbols exist, we'll collapse to the mean value for that -gene symbol. 
- -```{r} -# check for the existence of the file that would have been generated in the -# other notebook -symbol_file <- file.path( - data_dir, - "GSE75574", # Replace with name of your unzipped folder - "GSE75574_symbols.tsv" # Replace with relevant output name -) -if (file.exists(symbol_file)) { - symbol_exprs_df <- readr::read_tsv(symbol_file, progress = FALSE) -} else { - - # read in the refine.bio expression data - expression_file <- file.path( - data_dir, - "GSE75574", # Replace with name of your unzipped folder - "GSE75574.tsv" # Replace with name of your expression file - ) - exprs_df <- readr::read_tsv(expression_file, progress = FALSE) - # first column is currently named 'Gene' and contains Ensembl gene IDs - colnames(exprs_df)[1] <- "ENSEMBL" - - # using the default behavior for 1:many mappings here -- select the first one - symbol_mappings <- mapIds(org.Mm.eg.db, - keys = exprs_df$ENSEMBL, - column = "SYMBOL", keytype = "ENSEMBL" - ) - - # mapIds should give us back IDs in the same order as the keys, but we can - # check to be sure - if (!(all.equal(names(symbol_mappings), exprs_df$ENSEMBL))) { - stop("Gene order is not as expected!") - } - - symbol_exprs_df <- exprs_df %>% - # add a new column that contains the gene symbols - # this gets added as the last column - dplyr::mutate(SYMBOL = symbol_mappings) %>% - # drop the Ensembl gene IDs - dplyr::select(-ENSEMBL) %>% - # reorder such that the gene symbols are in the first column - dplyr::select(SYMBOL, dplyr::everything()) %>% - # drop any genes without a gene symbol - dplyr::filter(!is.na(SYMBOL)) - - # if there are any duplicate gene symbols, collapse to the mean value - if (any(duplicated(symbol_exprs_df$SYMBOL))) { - symbol_exprs_df <- symbol_exprs_df %>% - dplyr::group_by(SYMBOL) %>% - dplyr::summarise_all(mean) - } - - # write to file - readr::write_tsv(symbol_exprs_df, symbol_file) -} -``` - -### Metadata - -```{r} -metadata_file <- file.path( - data_dir, - "GSE75574", # Replace with name of your 
unzipped folder - "metadata_GSE75574.tsv" # Replace with name of relevant metadata file -) -metadata_df <- readr::read_tsv(metadata_file) %>% - # drop any metadata columns that are all NAs - dplyr::select(-which(apply(is.na(.), 2, all))) -``` - -Retain only the most pertinent columns. - -```{r} -metadata_df <- metadata_df %>% - dplyr::select( - refinebio_accession_code, refinebio_title, - refinebio_specimen_part, refinebio_sex, strain - ) -``` - -We're interested in the are calorie-restricted vs. controls information, which -is in the sample title. -We can extract this and put it in its own column called `condition` as is good -practice. - -```{r} -metadata_df <- metadata_df %>% - dplyr::mutate(condition = dplyr::case_when( - grepl("control", refinebio_title) ~ "control", - grepl("calorierestricted", refinebio_title) ~ "calorie-restricted" - )) -``` - -## Pathway Analysis - -### Read in KEGG pathways - -First, we need the sets of genes that represent pathways. -Again, these were prepared in the -[`kegg_ortholog_mapping`](./kegg_ortholog_mapping.nb.html) notebook (see -[Gene Sets](#gene-sets) above). - -```{r} -kegg_file <- file.path( - "gene-sets", - "c2.cp.kegg.v6.2.entrez_mouse_symbol_hcop.gmt" -) -kegg_pathways <- qusage::read.gmt(kegg_file) -``` - -### Prep expression matrix - -```{r} -exprs_mat <- symbol_exprs_df %>% - tibble::column_to_rownames("SYMBOL") %>% - # same order as the metdata -- this may be helpful in downstream analyses - dplyr::select(metadata_df$refinebio_accession_code) %>% - # qusage takes a matrix - as.matrix() -``` - -### ssGSEA - -We can specify that we'd like to perform ssGSEA with the `method` argument to -`gsva`. - -```{r} -ssgsea_results <- gsva( - expr = exprs_mat, - gset.idx.list = kegg_pathways, - method = "ssgsea", - verbose = FALSE -) -``` - -This returns a matrix of ssGSEA enrichment scores where the columns are samples -and the rows are the input gene sets. 
- -```{r} -str(ssgsea_results) -``` - -#### Identifying "interesting" features - -One way to figure out what pathways may be interesting is to find which pathways -have the highest _variance_ in their ssGSEA scores. -We can calculate the row variances with the `matrixStats` package. - -```{r} -score_row_var <- matrixStats::rowVars(ssgsea_results) -# for convenience, let's use the pathway names with this vector -names(score_row_var) <- rownames(ssgsea_results) -# look at the 'top' pathways by this metric -head(sort(score_row_var, decreasing = TRUE)) -``` - -Note that many of these pathways seem to have something to do with the immune -system and may very well share a large number of genes. -Let's look at an example pair of these pathways. - -```{r} -intersect( - kegg_pathways$KEGG_ALLOGRAFT_REJECTION, - kegg_pathways$KEGG_GRAFT_VERSUS_HOST_DISEASE -) -``` - -These ssGSEA scores should not be treated as if they are independent. - -#### Plotting features - -What do the patterns across strains and tissues look like? -Let's look at the KEGG Complement and Coagulation Cascades pathway as an -example. - -We'll be using `ggplot2` for plotting, so we'll want our ssGSEA values as a -`data.frame`. - -```{r} -comp_coag_df <- data.frame( - sample_accession = colnames(ssgsea_results), - ssgsea_score = ssgsea_results["KEGG_COMPLEMENT_AND_COAGULATION_CASCADES", ] # Replace with desired pathway -- note that this is distinct from `grep` in that you'd have to write the name exactly as it is in the matrix -) -``` - -We need to add in the strain, tissue, and condition metadata to make an -informative plot. 
- -```{r} -comp_coag_df <- metadata_df %>% - dplyr::inner_join(comp_coag_df, - by = c("refinebio_accession_code" = "sample_accession") - ) -``` - -```{r fig.height=11, fig.width=8.5} -comp_coag_df %>% - ggplot(aes( - x = condition, - y = ssgsea_score - )) + - geom_boxplot() + - facet_grid(strain ~ refinebio_specimen_part) + - labs(y = "KEGG Complement and Coagulation Cascades") + - theme_bw() + - # x-axis text at a 45 degree angle to increase readability - theme( - axis.text.x = element_text(angle = 45, hjust = 1), - text = element_text(size = 15) - ) -``` - -```{r} -# saving to file -ggsave(file.path( - plots_dir, - "GSE75574_ssgsea_comp_coag_facet.pdf" # Replace with relevant output plot name -), -plot = last_plot() + - theme(text = element_text(size = 10)) -) -``` - -Looks like there may be some differences between tissues in this pathway, with -some strains showing more pronounced differences in white adipose tissue in the -calorie-restricted vs. control comparison. -We should always follow up with more formal analysis (e.g., QuSAGE), but -this series of steps is a good way to explore one's data. - -#### Aside: Gene set size influences ssGSEA score values - -ssGSEA score values are not necessarily comparable between gene sets of -different sizes (number of genes). -If you are comparing ssGSEA scores between samples _within a gene set_, this is -not a concern. - -To demonstrate the effect of gene set size on scores, we'll perform a short -experiment with random gene sets. 
- -```{r} -# use all the gene symbols in the dataset as the pool of possible genes -all_genes <- symbol_exprs_df$SYMBOL - -# set a seed for reproducibility -set.seed(123) - -# GSVA::gsva takes gene sets as a list, so that's how we'll store the random -# gene sets -random_gene_sets <- list() - -# testing 5 different gene set sizes -for (pathway_size in c(25, 50, 100, 250, 500)) { - # generate 10 random sets of pathway_size - for (path_iter in 1:10) { - random_pathway_name <- paste("size", pathway_size, path_iter, sep = "_") - current_gene_set <- base::sample(x = all_genes, size = pathway_size) - random_gene_sets[[random_pathway_name]] <- current_gene_set - } -} - -# calculate ssGSEA scores for the random gene sets -random_ssgsea_results <- gsva( - expr = exprs_mat, - gset.idx.list = random_gene_sets, - method = "ssgsea", - verbose = FALSE -) - -# plot the results with ggplot2 -# first we need to get this in a form that is amenable for plotting, as gsva -# returns a matrix where columns are samples and rows are gene sets -random_long_df <- as.data.frame(random_ssgsea_results) %>% - # gene set names are rownames - tibble::rownames_to_column("gene_set") %>% - # long format - reshape2::melt(variable.name = "sample", value.name = "ssgsea_score") %>% - # extract the gene set size from the gene set name - dplyr::mutate(gene_set_size = stringr::word(gene_set, 2, sep = "_")) %>% - # we want to plot smallest no. genes -> largest no. genes - dplyr::mutate(gene_set_size = factor(gene_set_size, - levels = c(25, 50, 100, 250, 500) - )) - -# violin plot comparing no. genes / gene set size -random_long_df %>% - ggplot(aes(x = gene_set_size, y = ssgsea_score)) + - geom_violin() + - coord_flip() + - labs( - title = "Random gene set scores", - x = "gene set size", - y = "ssGSEA score" - ) + - theme_bw() -``` - -Note how a smaller gene set results in a larger range of scores. 
- -## Session info - -```{r} -sessioninfo::session_info() -``` diff --git a/03-rnaseq/00-intro-to-rnaseq.Rmd b/03-rnaseq/00-intro-to-rnaseq.Rmd index 909232c0..1842dbc8 100644 --- a/03-rnaseq/00-intro-to-rnaseq.Rmd +++ b/03-rnaseq/00-intro-to-rnaseq.Rmd @@ -9,7 +9,7 @@ output: Data analyses are generally not "one size fits all"; this is particularly true when with approaches used to analyze RNA-seq and microarray data. The characteristics of the data produced by these two technologies can be quite different. -This tutorial has example analyses [organized by technology](../01-getting-started/getting-started.html#about-how-this-tutorial-book-is-structured) so you can follow examples that are more closely tailored to the nature of the data at hand. +This tutorial has example analyses [organized by technology](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#about-how-this-tutorial-book-is-structured) so you can follow examples that are more closely tailored to the nature of the data at hand. diff --git a/03-rnaseq/00-intro-to-rnaseq.html b/03-rnaseq/00-intro-to-rnaseq.html index bda2b6cc..f58cfced 100644 --- a/03-rnaseq/00-intro-to-rnaseq.html +++ b/03-rnaseq/00-intro-to-rnaseq.html @@ -2550,6 +2550,617 @@ PagedTableDoc.initAll(); }; + + + + @@ -2856,15 +3534,20 @@ @@ -3114,6 +3803,11 @@


+ diff --git a/03-rnaseq/clustering_rnaseq_01_heatmap.Rmd b/03-rnaseq/clustering_rnaseq_01_heatmap.Rmd index cfcfc9e5..705ed182 100644 --- a/03-rnaseq/clustering_rnaseq_01_heatmap.Rmd +++ b/03-rnaseq/clustering_rnaseq_01_heatmap.Rmd @@ -44,7 +44,7 @@ if (!dir.exists("data")) { } # Define the file path to the plots directory -plots_dir <- "plots" # Can replace with path to desired output plots directory +plots_dir <- "plots" # Create the plots folder if it doesn't exist if (!dir.exists(plots_dir)) { @@ -52,7 +52,7 @@ if (!dir.exists(plots_dir)) { } # Define the file path to the results directory -results_dir <- "results" # Can replace with path to desired output results directory +results_dir <- "results" # Create the results folder if it doesn't exist if (!dir.exists(results_dir)) { @@ -93,7 +93,7 @@ The data that we downloaded from refine.bio for this analysis has 19 samples (ob ## Place the dataset in your new `data/` folder -Refine.bio will send you a download button in the email when it is ready. +refine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in `.zip`. Double clicking should unzip this for you and create a folder of the same name. 
@@ -130,19 +130,24 @@ This is handy to do because if we want to switch the dataset (see next section f ```{r} # Define the file path to the data directory -data_dir <- file.path("data", "SRP070849") # Replace with accession number which will be the name of the folder the files will be in - -# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir` -data_file <- file.path(data_dir, "SRP070849.tsv") # Replace with file path to your dataset - -# Declare the file path to the metadata file using the data directory saved as `data_dir` -metadata_file <- file.path(data_dir, "metadata_SRP070849.tsv") # Replace with file path to your metadata +# Replace with the path of the folder the files will be in +data_dir <- file.path("data", "SRP070849") + +# Declare the file path to the gene expression matrix file +# inside directory saved as `data_dir` +# Replace with the path to your dataset file +data_file <- file.path(data_dir, "SRP070849.tsv") + +# Declare the file path to the metadata file +# inside the directory saved as `data_dir` +# Replace with the path to your metadata file +metadata_file <- file.path(data_dir, "metadata_SRP070849.tsv") ``` Now that our file paths are declared, we can use the `file.exists()` function to check that the files are where we specified above. 
```{r} -# Check if the gene expression matrix file is at the file path stored in `data_file` +# Check if the gene expression matrix file is at the path stored in `data_file` file.exists(data_file) # Check if the metadata file is at the file path stored in `metadata_file` @@ -186,7 +191,7 @@ if (!("DESeq2" %in% installed.packages())) { Attach the `pheatmap` and `DESeq2` libraries: -```{r} +```{r message=FALSE} # Attach the `pheatmap` library library(pheatmap) @@ -212,8 +217,9 @@ We stored our file paths as objects named `metadata_file` and `data_file` in [th metadata <- readr::read_tsv(metadata_file) # Read in data TSV file -df <- readr::read_tsv(data_file) %>% - # Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later +expression_df <- readr::read_tsv(data_file) %>% + # Here we are going to store the gene IDs as row names so that + # we can have only numeric values to perform calculations on later tibble::column_to_rownames("Gene") ``` @@ -227,14 +233,35 @@ Now let's ensure that the metadata and data are in the same sample order. ```{r} # Make the data in the order of the metadata -df <- df %>% dplyr::select(metadata$refinebio_accession_code) +expression_df <- expression_df %>% + dplyr::select(metadata$refinebio_accession_code) # Check if this is in the same order -all.equal(colnames(df), metadata$refinebio_accession_code) +all.equal(colnames(expression_df), metadata$refinebio_accession_code) ``` Now we are going to use a combination of functions from the `DESeq2` and `pheatmap` packages to look at how are samples and genes are clustering. +## Define a minimum counts cutoff + +We want to filter out the genes that have not been expressed or that have low expression counts since these genes are likely to add noise rather than useful signal to our analysis. +We are going to do some pre-filtering to keep only genes with 10 or more reads total. 
+Note that rows represent gene data and the columns represent sample data in our dataset. + +```{r} +# Define a minimum counts cutoff and filter the data to include +# only rows (genes) that have total counts above the cutoff +filtered_expression_df <- expression_df %>% + dplyr::filter(rowSums(.) >= 10) +``` + +We also need our counts to be rounded before we can use them with the `DESeqDataSetFromMatrix()` function. + +```{r} +# The `DESeqDataSetFromMatrix()` function needs the values to be integers +filtered_expression_df <- round(filtered_expression_df) +``` + ## Create a DESeqDataset We will be using the `DESeq2` package for [normalizing and transforming our data](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#deseq2-transformation-methods), which requires us to format our data into a `DESeqDataSet` object. @@ -242,46 +269,31 @@ We turn the data frame (or matrix) into a [`DESeqDataSet` object](https://alexsl In this chunk of code, we will not provide a specific model to the `design` argument because we are not performing a differential expression analysis. 
```{r} -# The `DESeqDataSetFromMatrix()` function needs the values to be converted to integers -df <- df %>% - # Mutate numeric variables to be integers - dplyr::mutate_if(is.numeric, round) - # Create a `DESeqDataSet` object dds <- DESeqDataSetFromMatrix( - countData = df, # This is the data.frame with the counts values for all replicates in our dataset - colData = metadata, # This is the data.frame with the annotation data for the replicates in the counts data.frame - design = ~1 # Here we are not specifying a model -- Replace with an appropriate design variable for your analysis + countData = filtered_expression_df, # the counts values for all samples + colData = metadata, # annotation data for the samples + design = ~1 # Here we are not specifying a model + # Replace with an appropriate design variable for your analysis ) ``` -## Define a minimum counts cutoff - -We want to filter out the genes that have not been expressed or that have low expression counts because we want to remove any possible noise from our data before we normalize the data and create our heatmap. -We are going to do some pre-filtering to keep only genes with 10 or more reads total. -Note that rows represent gene data and the columns represent sample data in our dataset. - -```{r} -# Define a minimum counts cutoff and filter `DESeqDataSet` object to include -# only rows that have counts above the cutoff -genes_to_keep <- rowSums(counts(dds)) >= 10 -dds <- dds[genes_to_keep, ] -``` - ## Perform DESeq2 normalization and transformation We are going to use the `rlog()` function from the `DESeq2` package to normalize and transform the data. For more information about these transformation methods, [see here](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#deseq2-transformation-methods). 
```{r} -# Normalize the data in the `DESeqDataSet` object using the `rlog()` function from the `DESEq2` R package +# Normalize the data in the `DESeqDataSet` object +# using the `rlog()` function from the `DESEq2` R package dds_norm <- rlog(dds) ``` ## Choose genes of interest -Although you may want to create a heatmap including all of the genes in the set, alternatively, the heatmap could be created using only genes of interest. -For this example, we will sort genes by variance, but there are many alternative criterion by which you may want to sort your genes e.g. fold change, t-statistic, membership to a particular gene ontology, so on. +Although you may want to create a heatmap including all of the genes in the dataset, this can produce a very large image that is hard to interpret. +Alternatively, the heatmap could be created using only genes of interest. +For this example, we will sort genes by variance and select genes in the upper quartile, but there are many alternative criterion by which you may want to sort your genes, e.g. fold change, t-statistic, membership in a particular gene ontology, so on. 
```{r} # Calculate the variance for each gene @@ -290,7 +302,7 @@ variances <- apply(assay(dds_norm), 1, var) # Determine the upper quartile variance cutoff value upper_var <- quantile(variances, 0.75) -# Subset the data choosing only genes whose variances are in the upper quartile +# Filter the data choosing only genes whose variances are in the upper quartile df_by_var <- data.frame(assay(dds_norm)) %>% dplyr::filter(variances > upper_var) ``` @@ -301,21 +313,20 @@ To further customize the heatmap, see a vignette for a guide at this [link](http ```{r} # Create and store the heatmap object -pheatmap <- - pheatmap( - df_by_var, - cluster_rows = TRUE, # We want to cluster the heatmap by rows (genes in this case) - cluster_cols = TRUE, # We also want to cluster the heatmap by columns (samples in this case), - show_rownames = FALSE, # We don't want to show the rownames because there are too many genes for the labels to be clearly seen - main = "Non-Annotated Heatmap", - colorRampPalette(c( - "deepskyblue", - "black", - "yellow" - ))(25 - ), - scale = "row" # Scale values in the direction of genes (rows) - ) +heatmap <- pheatmap( + df_by_var, + cluster_rows = TRUE, # Cluster the rows of the heatmap (genes in this case) + cluster_cols = TRUE, # Cluster the columns of the heatmap (samples), + show_rownames = FALSE, # There are too many genes to clearly show the labels + main = "Non-Annotated Heatmap", + colorRampPalette(c( + "deepskyblue", + "black", + "yellow" + ))(25 + ), + scale = "row" # Scale values in the direction of genes (rows) +) ``` We've created a heatmap but although our genes and samples are clustered, there is not much information that we can gather here because we did not provide the `pheatmap()` function with annotation labels for our samples. 
@@ -329,11 +340,11 @@ You can easily switch this to save to a JPEG or tiff by changing the function an # Open a PNG file png(file.path( plots_dir, - "SRP070849_heatmap_non_annotated.png" # Replace file name with a relevant output plot name + "SRP070849_heatmap_non_annotated.png" # Replace with a relevant file name )) # Print your heatmap -pheatmap +heatmap # Close the PNG file: dev.off() @@ -347,7 +358,8 @@ From the accompanying [paper](https://pubmed.ncbi.nlm.nih.gov/28193779/), we kno We are going to manipulate the metadata and add variables with the information for each sample, from the experimental design briefly described above, that we would like to use to annotate the heatmap. ```{r} -# Let's prepare the annotation data.frame for the uncollapsed `DESeqData` set object which will be used to create the technical replicates heatmap +# Let's prepare the annotation for the uncollapsed `DESeqData` set object +# which will be used to annotate the heatmap annotation_df <- metadata %>% # Create a variable to store the cancer type information dplyr::mutate( @@ -355,16 +367,19 @@ annotation_df <- metadata %>% startsWith(refinebio_title, "TET2") ~ "TET2", startsWith(refinebio_title, "IDH2") ~ "IDH2", startsWith(refinebio_title, "WT") ~ "WT", - TRUE ~ "unknown" # If none of the above criteria are true, we mark the `mutation` variable as "unknown" + # If none of the above criteria are satisfied, + # we mark the `mutation` variable as "unknown" + TRUE ~ "unknown" ) ) %>% - # We want to select the variables that we want for annotating the technical replicates heatmap + # select only the columns we need for annotation dplyr::select( refinebio_accession_code, mutation, refinebio_treatment ) %>% - # The `pheatmap()` function requires that the row names of our annotation object matches the column names of our `DESeaDataSet` object + # The `pheatmap()` function requires that the row names of our annotation + # data frame match the column names of our `DESeaDataSet` object 
tibble::column_to_rownames("refinebio_accession_code") ``` @@ -374,7 +389,7 @@ You can create an annotated heatmap by providing our annotation object to the `a ```{r} # Create and store the annotated heatmap object -pheatmap_annotated <- +heatmap_annotated <- pheatmap( df_by_var, cluster_rows = TRUE, @@ -397,17 +412,17 @@ Now that we have annotation bars on our heatmap, we have a better idea of the sa Let's save our annotated heatmap. ### Save annotated heatmap as a PNG -You can easily switch this to save to a JPEG or TIFF by changing the function and file name within the function to the respective file suffix. +You can switch this to save to a JPEG or TIFF by changing the function and file name within the function to the respective file suffix. ```{r} # Open a PNG file png(file.path( plots_dir, - "SRP070849_heatmap_annotated.png" # Replace file name with a relevant output plot name + "SRP070849_heatmap_annotated.png" # Replace with a relevant file name )) # Print your heatmap -pheatmap_annotated +heatmap_annotated # Close the PNG file: dev.off() diff --git a/03-rnaseq/clustering_rnaseq_01_heatmap.html b/03-rnaseq/clustering_rnaseq_01_heatmap.html index 4b5eb4c5..966e71e1 100644 --- a/03-rnaseq/clustering_rnaseq_01_heatmap.html +++ b/03-rnaseq/clustering_rnaseq_01_heatmap.html @@ -1263,25 +1263,22 @@ }; - - + + code.sourceCode > span { display: inline-block; line-height: 1.25; } + code.sourceCode > span { color: inherit; text-decoration: inherit; } + code.sourceCode > span:empty { height: 1.2em; } + .sourceCode { overflow: visible; } + code.sourceCode { white-space: pre; position: relative; } + div.sourceCode { margin: 1em 0; } + pre.sourceCode { margin: 0; } + @media screen { + div.sourceCode { overflow: auto; } + } + @media print { + code.sourceCode { white-space: pre-wrap; } + code.sourceCode > span { text-indent: -5em; padding-left: 5em; } + } + pre.numberSource code + { counter-reset: source-line 0; } + pre.numberSource code > span + { position: relative; 
left: -4em; counter-increment: source-line; } + pre.numberSource code > span > a:first-child::before + { content: counter(source-line); + position: relative; left: -1em; text-align: right; vertical-align: baseline; + border: none; display: inline-block; + -webkit-touch-callout: none; -webkit-user-select: none; + -khtml-user-select: none; -moz-user-select: none; + -ms-user-select: none; user-select: none; + padding: 0 4px; width: 4em; + color: #aaaaaa; + } + pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; } + div.sourceCode + { } + @media screen { + code.sourceCode > span > a:first-child::before { text-decoration: underline; } + } + code span.al { color: #ff0000; } /* Alert */ + code span.an { color: #008000; } /* Annotation */ + code span.at { } /* Attribute */ + code span.bu { } /* BuiltIn */ + code span.cf { color: #0000ff; } /* ControlFlow */ + code span.ch { color: #008080; } /* Char */ + code span.cn { } /* Constant */ + code span.co { color: #008000; } /* Comment */ + code span.cv { color: #008000; } /* CommentVar */ + code span.do { color: #008000; } /* Documentation */ + code span.er { color: #ff0000; font-weight: bold; } /* Error */ + code span.ex { } /* Extension */ + code span.im { } /* Import */ + code span.in { color: #008000; } /* Information */ + code span.kw { color: #0000ff; } /* Keyword */ + code span.op { } /* Operator */ + code span.ot { color: #ff4000; } /* Other */ + code span.pp { color: #ff4000; } /* Preprocessor */ + code span.sc { color: #008080; } /* SpecialChar */ + code span.ss { color: #008080; } /* SpecialString */ + code span.st { color: #008080; } /* String */ + code span.va { } /* Variable */ + code span.vs { color: #008080; } /* VerbatimString */ + code span.wa { color: #008000; font-weight: bold; } /* Warning */ + + + + - - + @@ -2874,15 +3686,20 @@ @@ -2971,26 +3797,26 @@

2.1 Obtain the .Rmd

2.2 Set up your analysis folders

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

-
# Create the data folder if it doesn't exist
-if (!dir.exists("data")) {
-  dir.create("data")
-}
-
-# Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
-
-# Create the plots folder if it doesn't exist
-if (!dir.exists(plots_dir)) {
-  dir.create(plots_dir)
-}
-
-# Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
-
-# Create the results folder if it doesn't exist
-if (!dir.exists(results_dir)) {
-  dir.create(results_dir)
-}
+
# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+  dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots"
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+  dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results"
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+  dir.create(results_dir)
+}

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

@@ -3012,7 +3838,7 @@

2.4 About the dataset we are usin

2.5 Place the dataset in your new data/ folder

-

Refine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip. Double clicking should unzip this for you and create a folder of the same name.

+

refine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip. Double clicking should unzip this for you and create a folder of the same name.

For more details on the contents of this folder see these docs on refine.bio.

The <experiment_accession_id> folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235 or SRP12345.

@@ -3041,20 +3867,25 @@

2.6 Check out our file structure!

In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.

-
# Define the file path to the data directory
-data_dir <- file.path("data", "SRP070849") # Replace with accession number which will be the name of the folder the files will be in
-
-# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
-data_file <- file.path(data_dir, "SRP070849.tsv") # Replace with file path to your dataset
-
-# Declare the file path to the metadata file using the data directory saved as `data_dir`
-metadata_file <- file.path(data_dir, "metadata_SRP070849.tsv") # Replace with file path to your metadata
+
# Define the file path to the data directory
+# Replace with the path of the folder the files will be in
+data_dir <- file.path("data", "SRP070849")
+
+# Declare the file path to the gene expression matrix file
+# inside directory saved as `data_dir`
+# Replace with the path to your dataset file
+data_file <- file.path(data_dir, "SRP070849.tsv")
+
+# Declare the file path to the metadata file
+# inside the directory saved as `data_dir`
+# Replace with the path to your metadata file
+metadata_file <- file.path(data_dir, "metadata_SRP070849.tsv")

Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

-
# Check if the gene expression matrix file is at the file path stored in `data_file`
-file.exists(data_file)
+
# Check if the gene expression matrix file is at the path stored in `data_file`
+file.exists(data_file)
## [1] TRUE
-
# Check if the metadata file is at the file path stored in `metadata_file`
-file.exists(metadata_file)
+
# Check if the metadata file is at the file path stored in `metadata_file`
+file.exists(metadata_file)
## [1] TRUE

If the chunk above printed out FALSE to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
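
As a rough sketch of how the two checks above could be combined, the chunk below collects any missing paths and reports them in one message. It assumes the same paths declared earlier; `input_files` and `missing_files` are illustrative names that do not appear in the tutorial itself.

```r
# Paths as declared earlier in this example
data_dir <- file.path("data", "SRP070849")
data_file <- file.path(data_dir, "SRP070849.tsv")
metadata_file <- file.path(data_dir, "metadata_SRP070849.tsv")

# `input_files` and `missing_files` are hypothetical helper objects
input_files <- c(data_file, metadata_file)
missing_files <- input_files[!file.exists(input_files)]

# Report which files, if any, still need to be put in place
if (length(missing_files) > 0) {
  message("Missing input files:\n", paste(missing_files, collapse = "\n"))
} else {
  message("All input files found.")
}
```

Reporting the missing paths by name can make it quicker to spot a misspelled accession or a folder that was unzipped somewhere else.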

If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.

@@ -3072,87 +3903,37 @@

4 Clustering Heatmap - RNA-seq

4.1 Install libraries

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

-

In this analysis, we will be using the R package DESeq2 (Love et al. 2014) for normalization and the R package pheatmap (Slowikowski 2017) for clustering and creating a heatmap.

-
if (!("pheatmap" %in% installed.packages())) {
-  # Install pheatmap
-  install.packages("pheatmap", update = FALSE)
-}
-
-if (!("DESeq2" %in% installed.packages())) {
-  # Install DESeq2
-  BiocManager::install("DESeq2", update = FALSE)
-}
+

In this analysis, we will be using the R package DESeq2 (Love et al. 2014) for normalization and the R package pheatmap (Slowikowski 2017) for clustering and creating a heatmap.

+
if (!("pheatmap" %in% installed.packages())) {
+  # Install pheatmap
+  install.packages("pheatmap", update = FALSE)
+}
+
+if (!("DESeq2" %in% installed.packages())) {
+  # Install DESeq2
+  BiocManager::install("DESeq2", update = FALSE)
+}

Attach the pheatmap and DESeq2 libraries:

-
# Attach the `pheatmap` library
-library(pheatmap)
-
-# Attach the `DESeq2` library
-library(DESeq2)
-
## Loading required package: S4Vectors
-
## Loading required package: stats4
-
## Loading required package: BiocGenerics
-
## Loading required package: parallel
-
## 
-## Attaching package: 'BiocGenerics'
-
## The following objects are masked from 'package:parallel':
-## 
-##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
-##     clusterExport, clusterMap, parApply, parCapply, parLapply,
-##     parLapplyLB, parRapply, parSapply, parSapplyLB
-
## The following objects are masked from 'package:stats':
-## 
-##     IQR, mad, sd, var, xtabs
-
## The following objects are masked from 'package:base':
-## 
-##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
-##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
-##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
-##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
-##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
-##     union, unique, unsplit, which, which.max, which.min
-
## 
-## Attaching package: 'S4Vectors'
-
## The following object is masked from 'package:base':
-## 
-##     expand.grid
-
## Loading required package: IRanges
-
## Loading required package: GenomicRanges
-
## Loading required package: GenomeInfoDb
-
## Loading required package: SummarizedExperiment
-
## Loading required package: Biobase
-
## Welcome to Bioconductor
-## 
-##     Vignettes contain introductory material; view with
-##     'browseVignettes()'. To cite Bioconductor, see
-##     'citation("Biobase")', and for packages 'citation("pkgname")'.
-
## Loading required package: DelayedArray
-
## Loading required package: matrixStats
-
## 
-## Attaching package: 'matrixStats'
-
## The following objects are masked from 'package:Biobase':
-## 
-##     anyMissing, rowMedians
-
## 
-## Attaching package: 'DelayedArray'
-
## The following objects are masked from 'package:matrixStats':
-## 
-##     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
-
## The following objects are masked from 'package:base':
-## 
-##     aperm, apply, rowsum
-
# We will need this so we can use the pipe: %>%
-library(magrittr)
-
-# Set the seed so our results are reproducible:
-set.seed(12345)
+
# Attach the `pheatmap` library
+library(pheatmap)
+
+# Attach the `DESeq2` library
+library(DESeq2)
+
+# We will need this so we can use the pipe: %>%
+library(magrittr)
+
+# Set the seed so our results are reproducible:
+set.seed(12345)

4.2 Import and set up data

Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read in both TSV files and add them as data frames to your environment.

We stored our file paths as objects named metadata_file and data_file in this previous step.

-
# Read in metadata TSV file
-metadata <- readr::read_tsv(metadata_file)
-
## Parsed with column specification:
+
# Read in metadata TSV file
+metadata <- readr::read_tsv(metadata_file)
+
## 
+## ── Column specification ──────────────────────────────────────────────
 ## cols(
 ##   .default = col_logical(),
 ##   refinebio_accession_code = col_character(),
@@ -3164,113 +3945,115 @@ 

4.2 Import and set up data

## refinebio_subject = col_character(), ## refinebio_title = col_character(), ## refinebio_treatment = col_character() -## )
-
## See spec(...) for full column specifications.
-
# Read in data TSV file
-df <- readr::read_tsv(data_file) %>%
-  # Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later
-  tibble::column_to_rownames("Gene")
-
## Parsed with column specification:
+## )
+## ℹ Use `spec()` for the full column specifications.
+
# Read in data TSV file
+expression_df <- readr::read_tsv(data_file) %>%
+  # Here we are going to store the gene IDs as row names so that
+  # we can have only numeric values to perform calculations on later
+  tibble::column_to_rownames("Gene")
+
## 
+## ── Column specification ──────────────────────────────────────────────
 ## cols(
 ##   .default = col_double(),
 ##   Gene = col_character()
 ## )
-## See spec(...) for full column specifications.
+## ℹ Use `spec()` for the full column specifications.

Let’s take a look at the metadata object that we read into the R environment.

-
head(metadata)
+
head(metadata)

Now let’s ensure that the metadata and data are in the same sample order.

-
# Make the data in the order of the metadata
-df <- df %>% dplyr::select(metadata$refinebio_accession_code)
-
-# Check if this is in the same order
-all.equal(colnames(df), metadata$refinebio_accession_code)
+
# Make the data in the order of the metadata
+expression_df <- expression_df %>%
+  dplyr::select(metadata$refinebio_accession_code)
+
+# Check if this is in the same order
+all.equal(colnames(expression_df), metadata$refinebio_accession_code)
## [1] TRUE
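
To make the reordering step concrete, here is a small sketch on made-up data using base R indexing instead of `dplyr::select()`. The objects `toy_metadata` and `toy_expression` and the accession codes in them are invented for illustration.

```r
# Toy metadata: the sample order we want to enforce (hypothetical accessions)
toy_metadata <- data.frame(
  refinebio_accession_code = c("SRR1", "SRR2", "SRR3"),
  stringsAsFactors = FALSE
)

# Toy expression data whose columns start out in a different order
toy_expression <- data.frame(
  SRR3 = c(1, 2),
  SRR1 = c(3, 4),
  SRR2 = c(5, 6),
  row.names = c("gene_a", "gene_b")
)

# Reorder the columns so they follow the metadata order
toy_expression <- toy_expression[, toy_metadata$refinebio_accession_code]

# Check that the two orders now agree
all.equal(colnames(toy_expression), toy_metadata$refinebio_accession_code)
```

Selecting columns by name also fails loudly if a sample listed in the metadata is absent from the expression data, which is a useful side effect of this step.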

Now we are going to use a combination of functions from the DESeq2 and pheatmap packages to look at how our samples and genes are clustering.

+
+

4.3 Define a minimum counts cutoff

+

We want to filter out the genes that have not been expressed or that have low expression counts since these genes are likely to add noise rather than useful signal to our analysis. We are going to do some pre-filtering to keep only genes with 10 or more reads total. Note that rows represent gene data and the columns represent sample data in our dataset.

+
# Define a minimum counts cutoff and filter the data to include
+# only rows (genes) that have total counts above the cutoff
+filtered_expression_df <- expression_df %>%
+  dplyr::filter(rowSums(.) >= 10)
+

We also need our counts to be rounded before we can use them with the DESeqDataSetFromMatrix() function.

+
# The `DESeqDataSetFromMatrix()` function needs the values to be integers
+filtered_expression_df <- round(filtered_expression_df)
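
The effect of the cutoff and the rounding can be seen on a tiny made-up table. This is only a sketch in base R: `toy_counts`, `keep`, and `filtered_toy` are hypothetical names, and the values are invented.

```r
# Toy counts: rows are genes, columns are samples (values are made up)
toy_counts <- data.frame(
  sample_1 = c(0.0, 4.6, 120.3),
  sample_2 = c(1.2, 5.7, 98.8),
  row.names = c("gene_a", "gene_b", "gene_c")
)

# Keep only genes whose total count across samples is 10 or more
keep <- rowSums(toy_counts) >= 10
filtered_toy <- toy_counts[keep, , drop = FALSE]

# Round the remaining values to whole numbers
filtered_toy <- round(filtered_toy)

filtered_toy
```

Here `gene_a` has a total of 1.2 and is dropped, while `gene_b` (total 10.3) and `gene_c` are kept and rounded.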
+
-

4.3 Create a DESeqDataset

-

We will be using the DESeq2 package for normalizing and transforming our data, which requires us to format our data into a DESeqDataSet object. We turn the data frame (or matrix) into a DESeqDataSet object. ) and specify which variable labels our experimental groups using the design argument (Love et al. 2014). In this chunk of code, we will not provide a specific model to the design argument because we are not performing a differential expression analysis.

-
# The `DESeqDataSetFromMatrix()` function needs the values to be converted to integers
-df <- df %>%
-  # Mutate numeric variables to be integers
-  dplyr::mutate_if(is.numeric, round)
-
-# Create a `DESeqDataSet` object
-dds <- DESeqDataSetFromMatrix(
-  countData = df, # This is the data.frame with the counts values for all replicates in our dataset
-  colData = metadata, # This is the data.frame with the annotation data for the replicates in the counts data.frame
-  design = ~1 # Here we are not specifying a model -- Replace with an appropriate design variable for your analysis
-)
+

4.4 Create a DESeqDataset

+

We will be using the DESeq2 package for normalizing and transforming our data, which requires us to format our data into a DESeqDataSet object. We turn the data frame (or matrix) into a DESeqDataSet object and specify which variable labels our experimental groups using the design argument (Love et al. 2014). In this chunk of code, we will not provide a specific model to the design argument because we are not performing a differential expression analysis.

+
# Create a `DESeqDataSet` object
+dds <- DESeqDataSetFromMatrix(
+  countData = filtered_expression_df, # the counts values for all samples
+  colData = metadata, # annotation data for the samples
+  design = ~1 # Here we are not specifying a model
+  # Replace with an appropriate design variable for your analysis
+)
## converting counts to integer mode
-
-

4.4 Define a minimum counts cutoff

-

We want to filter out the genes that have not been expressed or that have low expression counts because we want to remove any possible noise from our data before we normalize the data and create our heatmap. We are going to do some pre-filtering to keep only genes with 10 or more reads total. Note that rows represent gene data and the columns represent sample data in our dataset.

-
# Define a minimum counts cutoff and filter `DESeqDataSet` object to include
-# only rows that have counts above the cutoff
-genes_to_keep <- rowSums(counts(dds)) >= 10
-dds <- dds[genes_to_keep, ]
-

4.5 Perform DESeq2 normalization and transformation

We are going to use the rlog() function from the DESeq2 package to normalize and transform the data. For more information about these transformation methods, see here.

-
# Normalize the data in the `DESeqDataSet` object using the `rlog()` function from the `DESEq2` R package
-dds_norm <- rlog(dds)
+
# Normalize the data in the `DESeqDataSet` object
+# using the `rlog()` function from the `DESeq2` R package
+dds_norm <- rlog(dds)

4.6 Choose genes of interest

-

Although you may want to create a heatmap including all of the genes in the set, alternatively, the heatmap could be created using only genes of interest. For this example, we will sort genes by variance, but there are many alternative criterion by which you may want to sort your genes e.g. fold change, t-statistic, membership to a particular gene ontology, so on.

-
# Calculate the variance for each gene
-variances <- apply(assay(dds_norm), 1, var)
-
-# Determine the upper quartile variance cutoff value
-upper_var <- quantile(variances, 0.75)
-
-# Subset the data choosing only genes whose variances are in the upper quartile
-df_by_var <- data.frame(assay(dds_norm)) %>%
-  dplyr::filter(variances > upper_var)
+

Although you may want to create a heatmap including all of the genes in the dataset, this can produce a very large image that is hard to interpret. Alternatively, the heatmap could be created using only genes of interest. For this example, we will sort genes by variance and select genes in the upper quartile, but there are many alternative criteria by which you may want to sort your genes, e.g. fold change, t-statistic, membership in a particular gene ontology, and so on.

+
# Calculate the variance for each gene
+variances <- apply(assay(dds_norm), 1, var)
+
+# Determine the upper quartile variance cutoff value
+upper_var <- quantile(variances, 0.75)
+
+# Filter the data choosing only genes whose variances are in the upper quartile
+df_by_var <- data.frame(assay(dds_norm)) %>%
+  dplyr::filter(variances > upper_var)
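
The upper-quartile selection can be sketched on a small random matrix, without `DESeq2` or `dplyr`. Everything here is illustrative: `toy_matrix` stands in for the normalized values returned by `assay(dds_norm)`, and the `toy_` names are not part of the tutorial.

```r
# Toy normalized matrix: 8 genes (rows) by 5 samples (columns)
set.seed(12345)
toy_matrix <- matrix(
  rnorm(40),
  nrow = 8,
  dimnames = list(paste0("gene_", 1:8), paste0("sample_", 1:5))
)

# Variance of each gene across samples
toy_variances <- apply(toy_matrix, 1, var)

# Upper quartile cutoff for the variances
toy_cutoff <- quantile(toy_variances, 0.75)

# Keep only genes whose variance is above the cutoff
high_var_genes <- toy_matrix[toy_variances > toy_cutoff, , drop = FALSE]

nrow(high_var_genes)
```

With 8 genes and a strict `>` comparison, the 2 most variable genes are retained, which is the top quarter of the rows.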

4.7 Create a heatmap

-

To further customize the heatmap, see a vignette for a guide at this link (Slowikowski 2017).

-
# Create and store the heatmap object
-pheatmap <-
-  pheatmap(
-    df_by_var,
-    cluster_rows = TRUE, # We want to cluster the heatmap by rows (genes in this case)
-    cluster_cols = TRUE, # We also want to cluster the heatmap by columns (samples in this case),
-    show_rownames = FALSE, # We don't want to show the rownames because there are too many genes for the labels to be clearly seen
-    main = "Non-Annotated Heatmap",
-    colorRampPalette(c(
-      "deepskyblue",
-      "black",
-      "yellow"
-    ))(25
-    ),
-    scale = "row" # Scale values in the direction of genes (rows)
-  )
-

+

To further customize the heatmap, see a vignette for a guide at this link (Slowikowski 2017).

+
# Create and store the heatmap object
+heatmap <- pheatmap(
+  df_by_var,
+  cluster_rows = TRUE, # Cluster the rows of the heatmap (genes in this case)
+  cluster_cols = TRUE, # Cluster the columns of the heatmap (samples)
+  show_rownames = FALSE, # There are too many genes to clearly show the labels
+  main = "Non-Annotated Heatmap",
+  colorRampPalette(c(
+    "deepskyblue",
+    "black",
+    "yellow"
+  ))(25
+  ),
+  scale = "row" # Scale values in the direction of genes (rows)
+)
+

We’ve created a heatmap but although our genes and samples are clustered, there is not much information that we can gather here because we did not provide the pheatmap() function with annotation labels for our samples.

First let’s save our clustered heatmap.

4.7.1 Save heatmap as a PNG

You can easily switch this to save to a JPEG or TIFF by changing the function and file name within the function to the respective file suffix.

-
# Open a PNG file
-png(file.path(
-  plots_dir,
-  "SRP070849_heatmap_non_annotated.png" # Replace file name with a relevant output plot name
-))
-
-# Print your heatmap
-pheatmap
-
-# Close the PNG file:
-dev.off()
+
# Open a PNG file
+png(file.path(
+  plots_dir,
+  "SRP070849_heatmap_non_annotated.png" # Replace with a relevant file name
+))
+
+# Print your heatmap
+heatmap
+
+# Close the PNG file:
+dev.off()
## png 
 ##   2

Now, let’s add some annotation bars to our heatmap.

@@ -3278,64 +4061,68 @@

4.7.1 Save heatmap as a PNG

4.8 Prepare metadata for annotation

-

From the accompanying paper, we know that the mice with IDH2 mutant AML were treated with vehicle or AG-221 (the first small molecule in-vivo inhibitor of IDH2 to enter clinical trials) and the mice with TET2 mutant AML were treated with vehicle or 5-Azacytidine (Decitabine, hypomethylating agent). (Shih et al. 2017) We are going to manipulate the metadata and add variables with the information for each sample, from the experimental design briefly described above, that we would like to use to annotate the heatmap.

-
# Let's prepare the annotation data.frame for the uncollapsed `DESeqData` set object which will be used to create the technical replicates heatmap
-annotation_df <- metadata %>%
-  # Create a variable to store the cancer type information
-  dplyr::mutate(
-    mutation = dplyr::case_when(
-      startsWith(refinebio_title, "TET2") ~ "TET2",
-      startsWith(refinebio_title, "IDH2") ~ "IDH2",
-      startsWith(refinebio_title, "WT") ~ "WT",
-      TRUE ~ "unknown" # If none of the above criteria are true, we mark the `mutation` variable as "unknown"
-    )
-  ) %>%
-  # We want to select the variables that we want for annotating the technical replicates heatmap
-  dplyr::select(
-    refinebio_accession_code,
-    mutation,
-    refinebio_treatment
-  ) %>%
-  # The `pheatmap()` function requires that the row names of our annotation object matches the column names of our `DESeaDataSet` object
-  tibble::column_to_rownames("refinebio_accession_code")
+

From the accompanying paper, we know that the mice with IDH2 mutant AML were treated with vehicle or AG-221 (the first small molecule in-vivo inhibitor of IDH2 to enter clinical trials) and the mice with TET2 mutant AML were treated with vehicle or 5-Azacytidine (Decitabine, hypomethylating agent). (Shih et al. 2017) We are going to manipulate the metadata and add variables with the information for each sample, from the experimental design briefly described above, that we would like to use to annotate the heatmap.

+
# Let's prepare the annotation for the uncollapsed `DESeqDataSet` object
+# which will be used to annotate the heatmap
+annotation_df <- metadata %>%
+  # Create a variable to store the cancer type information
+  dplyr::mutate(
+    mutation = dplyr::case_when(
+      startsWith(refinebio_title, "TET2") ~ "TET2",
+      startsWith(refinebio_title, "IDH2") ~ "IDH2",
+      startsWith(refinebio_title, "WT") ~ "WT",
+      # If none of the above criteria are satisfied,
+      # we mark the `mutation` variable as "unknown"
+      TRUE ~ "unknown"
+    )
+  ) %>%
+  # select only the columns we need for annotation
+  dplyr::select(
+    refinebio_accession_code,
+    mutation,
+    refinebio_treatment
+  ) %>%
+  # The `pheatmap()` function requires that the row names of our annotation
+  # data frame match the column names of our `DESeqDataSet` object
+  tibble::column_to_rownames("refinebio_accession_code")
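
Because the annotation is matched to the heatmap by name, it can be worth checking that every sample column has an annotation row before plotting. The sketch below uses toy data with made-up sample IDs (`SRR001` and so on are assumptions for illustration, not accession codes from this dataset).

```r
# Toy expression matrix: genes in rows, samples in columns
# (gene and sample names are hypothetical)
expression_mat <- matrix(
  1:6,
  nrow = 2,
  dimnames = list(c("gene_a", "gene_b"), c("SRR001", "SRR002", "SRR003"))
)

# Toy annotation data frame with sample IDs as row names,
# deliberately in a different order than the matrix columns
annotation_toy <- data.frame(
  mutation = c("TET2", "IDH2", "WT"),
  row.names = c("SRR003", "SRR001", "SRR002")
)

# Check that every sample column has a matching annotation row
samples_annotated <- all(colnames(expression_mat) %in% rownames(annotation_toy))
samples_annotated

# Optional: reorder the annotation rows to follow the matrix columns,
# which makes the two objects easier to compare by eye
annotation_toy <- annotation_toy[colnames(expression_mat), , drop = FALSE]
annotation_toy
```

The same check could be applied to `annotation_df` and the matrix passed to `pheatmap()`.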

4.8.1 Create annotated heatmap

You can create an annotated heatmap by providing our annotation object to the annotation_col argument of the pheatmap() function.

-
# Create and store the annotated heatmap object
-pheatmap_annotated <-
-  pheatmap(
-    df_by_var,
-    cluster_rows = TRUE,
-    cluster_cols = TRUE,
-    show_rownames = FALSE,
-    annotation_col = annotation_df,
-    main = "Annotated Heatmap",
-    colorRampPalette(c(
-      "deepskyblue",
-      "black",
-      "yellow"
-    ))(25
-    ),
-    scale = "row" # Scale values in the direction of genes (rows)
-  )
-

+
# Create and store the annotated heatmap object
+heatmap_annotated <-
+  pheatmap(
+    df_by_var,
+    cluster_rows = TRUE,
+    cluster_cols = TRUE,
+    show_rownames = FALSE,
+    annotation_col = annotation_df,
+    main = "Annotated Heatmap",
+    colorRampPalette(c(
+      "deepskyblue",
+      "black",
+      "yellow"
+    ))(25
+    ),
+    scale = "row" # Scale values in the direction of genes (rows)
+  )
+

Now that we have annotation bars on our heatmap, we have a better idea of the sample variable groups that appear to cluster together.
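
To illustrate what `scale = "row"` does in the call above, here is a minimal sketch with a toy matrix (the values and names are made up). Row scaling converts each gene's values to z-scores, so genes with very different expression levels become comparable in color.

```r
# Toy matrix: two genes with the same pattern but different magnitudes
toy_mat <- matrix(
  c(1, 2, 3, 10, 20, 30),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(c("gene_a", "gene_b"), c("s1", "s2", "s3"))
)

# `scale()` works on columns, so transpose, scale, and transpose back
# to center each gene (row) at 0 with a standard deviation of 1
row_scaled <- t(scale(t(toy_mat)))
row_scaled
```

After scaling, both toy genes have identical rows, which is why row-scaled heatmaps highlight patterns across samples rather than absolute expression.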

Let’s save our annotated heatmap.

4.8.2 Save annotated heatmap as a PNG

-

You can easily switch this to save to a JPEG or TIFF by changing the function and file name within the function to the respective file suffix.

-
# Open a PNG file
-png(file.path(
-  plots_dir,
-  "SRP070849_heatmap_annotated.png" # Replace file name with a relevant output plot name
-))
-
-# Print your heatmap
-pheatmap_annotated
-
-# Close the PNG file:
-dev.off()
+

You can switch this to save to a JPEG or TIFF by changing the function and file name within the function to the respective file suffix.

+
# Open a PNG file
+png(file.path(
+  plots_dir,
+  "SRP070849_heatmap_annotated.png" # Replace with a relevant file name
+))
+
+# Print your heatmap
+heatmap_annotated
+
+# Close the PNG file:
+dev.off()
## png 
 ##   2
@@ -3344,16 +4131,16 @@

4.8.2 Save annotated heatmap as a

5 Further learning resources about this analysis

6 Session info

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

-
# Print session info
-sessioninfo::session_info()
-
## ─ Session info ───────────────────────────────────────────────────────────────
+
# Print session info
+sessioninfo::session_info()
+
## ─ Session info ─────────────────────────────────────────────────────
 ##  setting  value                       
 ##  version  R version 4.0.2 (2020-06-22)
 ##  os       Ubuntu 20.04 LTS            
@@ -3363,46 +4150,47 @@ 

6 Session info

 ## collate en_US.UTF-8
 ## ctype en_US.UTF-8
 ## tz Etc/UTC
-## date 2020-10-16
+## date 2020-12-18
 ##
-## ─ Packages ───────────────────────────────────────────────────────────────────
+## ─ Packages ─────────────────────────────────────────────────────────
 ## package * version date lib source
-## annotate 1.66.0 2020-04-27 [1] Bioconductor
-## AnnotationDbi 1.50.3 2020-07-25 [1] Bioconductor
+## annotate 1.68.0 2020-10-27 [1] Bioconductor
+## AnnotationDbi 1.52.0 2020-10-27 [1] Bioconductor
 ## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.0)
 ## backports 1.1.10 2020-09-15 [1] RSPM (R 4.0.2)
-## Biobase * 2.48.0 2020-04-27 [1] Bioconductor
-## BiocGenerics * 0.34.0 2020-04-27 [1] Bioconductor
-## BiocParallel 1.22.0 2020-04-27 [1] Bioconductor
+## Biobase * 2.50.0 2020-10-27 [1] Bioconductor
+## BiocGenerics * 0.36.0 2020-10-27 [1] Bioconductor
+## BiocParallel 1.24.1 2020-11-06 [1] Bioconductor
 ## bit 4.0.4 2020-08-04 [1] RSPM (R 4.0.2)
 ## bit64 4.0.5 2020-08-30 [1] RSPM (R 4.0.2)
 ## bitops 1.0-6 2013-08-17 [1] RSPM (R 4.0.0)
 ## blob 1.2.1 2020-01-20 [1] RSPM (R 4.0.0)
-## cli 2.0.2 2020-02-28 [1] RSPM (R 4.0.0)
+## cli 2.1.0 2020-10-12 [1] RSPM (R 4.0.2)
 ## colorspace 1.4-1 2019-03-18 [1] RSPM (R 4.0.0)
 ## crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.0)
 ## DBI 1.1.0 2019-12-15 [1] RSPM (R 4.0.0)
-## DelayedArray * 0.14.1 2020-07-14 [1] Bioconductor
-## DESeq2 * 1.28.1 2020-05-12 [1] Bioconductor
+## DelayedArray 0.16.0 2020-10-27 [1] Bioconductor
+## DESeq2 * 1.30.0 2020-10-27 [1] Bioconductor
 ## digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0)
 ## dplyr 1.0.2 2020-08-18 [1] RSPM (R 4.0.2)
 ## ellipsis 0.3.1 2020-05-15 [1] RSPM (R 4.0.0)
 ## evaluate 0.14 2019-05-28 [1] RSPM (R 4.0.0)
 ## fansi 0.4.1 2020-01-08 [1] RSPM (R 4.0.0)
 ## farver 2.0.3 2020-01-16 [1] RSPM (R 4.0.0)
-## genefilter 1.70.0 2020-04-27 [1] Bioconductor
-## geneplotter 1.66.0 2020-04-27 [1] Bioconductor
+## genefilter 1.72.0 2020-10-27 [1] Bioconductor
+## geneplotter 1.68.0 2020-10-27 [1] Bioconductor
 ## generics 0.0.2 2018-11-29 [1] RSPM (R 4.0.0)
-## GenomeInfoDb * 1.24.2 2020-06-15 [1] Bioconductor
-## GenomeInfoDbData 1.2.3 2020-10-06 [1] Bioconductor
-## GenomicRanges * 1.40.0 2020-04-27 [1] Bioconductor
+## GenomeInfoDb * 1.26.2 2020-12-08 [1] Bioconductor
+## GenomeInfoDbData 1.2.4 2020-12-16 [1] Bioconductor
+## GenomicRanges * 1.42.0 2020-10-27 [1] Bioconductor
 ## getopt 1.20.3 2019-03-22 [1] RSPM (R 4.0.0)
 ## ggplot2 3.3.2 2020-06-19 [1] RSPM (R 4.0.1)
 ## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.2)
 ## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.0.0)
 ## hms 0.5.3 2020-01-08 [1] RSPM (R 4.0.0)
 ## htmltools 0.5.0 2020-06-16 [1] RSPM (R 4.0.1)
-## IRanges * 2.22.2 2020-05-21 [1] Bioconductor
+## httr 1.4.2 2020-07-20 [1] RSPM (R 4.0.2)
+## IRanges * 2.24.1 2020-12-12 [1] Bioconductor
 ## jsonlite 1.7.1 2020-09-07 [1] RSPM (R 4.0.2)
 ## knitr 1.30 2020-09-22 [1] RSPM (R 4.0.2)
 ## lattice 0.20-41 2020-04-02 [2] CRAN (R 4.0.2)
@@ -3410,6 +4198,7 @@

6 Session info

 ## locfit 1.5-9.4 2020-03-25 [1] RSPM (R 4.0.0)
 ## magrittr * 1.5 2014-11-22 [1] RSPM (R 4.0.0)
 ## Matrix 1.2-18 2019-11-27 [2] CRAN (R 4.0.2)
+## MatrixGenerics * 1.2.0 2020-10-27 [1] Bioconductor
 ## matrixStats * 0.57.0 2020-09-25 [1] RSPM (R 4.0.2)
 ## memoise 1.1.0 2017-04-21 [1] RSPM (R 4.0.0)
 ## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.0.0)
@@ -3417,6 +4206,7 @@

6 Session info

 ## pheatmap * 1.0.12 2019-01-04 [1] RSPM (R 4.0.0)
 ## pillar 1.4.6 2020-07-10 [1] RSPM (R 4.0.2)
 ## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.0)
+## ps 1.4.0 2020-10-07 [1] RSPM (R 4.0.2)
 ## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.0)
 ## R.cache 0.14.0 2019-12-06 [1] RSPM (R 4.0.0)
 ## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.2)
@@ -3426,30 +4216,30 @@

6 Session info

 ## RColorBrewer 1.1-2 2014-12-07 [1] RSPM (R 4.0.0)
 ## Rcpp 1.0.5 2020-07-06 [1] RSPM (R 4.0.2)
 ## RCurl 1.98-1.2 2020-04-18 [1] RSPM (R 4.0.0)
-## readr 1.3.1 2018-12-21 [1] RSPM (R 4.0.2)
+## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.2)
 ## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.0)
-## rlang 0.4.7 2020-07-09 [1] RSPM (R 4.0.2)
+## rlang 0.4.8 2020-10-08 [1] RSPM (R 4.0.2)
 ## rmarkdown 2.4 2020-09-30 [1] RSPM (R 4.0.2)
-## RSQLite 2.2.0 2020-01-07 [1] RSPM (R 4.0.2)
+## RSQLite 2.2.1 2020-09-30 [1] RSPM (R 4.0.2)
 ## rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0)
-## S4Vectors * 0.26.1 2020-05-16 [1] Bioconductor
+## S4Vectors * 0.28.1 2020-12-09 [1] Bioconductor
 ## scales 1.1.1 2020-05-11 [1] RSPM (R 4.0.0)
 ## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.0)
 ## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.2)
 ## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.0)
 ## styler 1.3.2 2020-02-23 [1] RSPM (R 4.0.0)
-## SummarizedExperiment * 1.18.2 2020-07-09 [1] Bioconductor
+## SummarizedExperiment * 1.20.0 2020-10-27 [1] Bioconductor
 ## survival 3.1-12 2020-04-10 [2] CRAN (R 4.0.2)
-## tibble 3.0.3 2020-07-10 [1] RSPM (R 4.0.2)
+## tibble 3.0.4 2020-10-12 [1] RSPM (R 4.0.2)
 ## tidyselect 1.1.0 2020-05-11 [1] RSPM (R 4.0.0)
 ## vctrs 0.3.4 2020-08-29 [1] RSPM (R 4.0.2)
 ## withr 2.3.0 2020-09-22 [1] RSPM (R 4.0.2)
 ## xfun 0.18 2020-09-29 [1] RSPM (R 4.0.2)
 ## XML 3.99-0.5 2020-07-23 [1] RSPM (R 4.0.2)
 ## xtable 1.8-4 2019-04-21 [1] RSPM (R 4.0.0)
-## XVector 0.28.0 2020-04-27 [1] Bioconductor
+## XVector 0.30.0 2020-10-27 [1] Bioconductor
 ## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.0)
-## zlibbioc 1.34.0 2020-04-27 [1] Bioconductor
+## zlibbioc 1.36.0 2020-10-27 [1] Bioconductor
 ##
 ## [1] /usr/local/lib/R/site-library
 ## [2] /usr/local/lib/R/library
@@ -3458,20 +4248,25 @@

6 Session info

References

-

Gu Z., R. Eils, and M. Schlesner, 2016 Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics.

+

Gu Z., R. Eils, and M. Schlesner, 2016 Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics. https://doi.org/10.1093/bioinformatics/btw313

-

Love M. I., W. Huber, and S. Anders, 2014 Moderated estimation of fold change and dispersion for rna-seq data with deseq2. Genome Biology 15. https://doi.org/10.1186/s13059-014-0550-8

+

Love M. I., W. Huber, and S. Anders, 2014 Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biology 15. https://doi.org/10.1186/s13059-014-0550-8

-

Shih A. H., C. Meydan, K. Shank, F. E. Garrett-Bakelman, and P. S. Ward et al., 2017 Combination targeted therapy to disrupt aberrant oncogenic signaling and reverse epigenetic dysfunction in idh2- and tet2-mutant acute myeloid leukemia. Cancer Discovery 7. https://doi.org/10.1158/2159-8290.CD-16-1049

+

Shih A. H., C. Meydan, K. Shank, F. E. Garrett-Bakelman, and P. S. Ward et al., 2017 Combination targeted therapy to disrupt aberrant oncogenic signaling and reverse epigenetic dysfunction in IDH2- and TET2-mutant acute myeloid leukemia. Cancer Discovery 7. https://doi.org/10.1158/2159-8290.CD-16-1049

-

Slowikowski K., 2017 Make heatmaps in r with pheatmap

+

Slowikowski K., 2017 Make heatmaps in R with pheatmap. https://slowkow.com/notes/pheatmap-tutorial/

+

diff --git a/03-rnaseq/differential-expression_rnaseq_01.Rmd b/03-rnaseq/differential-expression_rnaseq_01.Rmd index 277a8956..5a7e43b8 100644 --- a/03-rnaseq/differential-expression_rnaseq_01.Rmd +++ b/03-rnaseq/differential-expression_rnaseq_01.Rmd @@ -1,7 +1,7 @@ --- title: "Differential Expression - RNA-seq" author: "CCDL for ALSF" -date: "October 2020" +date: "December 2020" output: html_notebook: toc: true @@ -11,7 +11,13 @@ output: # Purpose of this analysis -This notebook takes RNA-seq data and metadata from refine.bio and identifies differentially expressed genes between experimental groups. +This notebook takes RNA-seq expression data and metadata from refine.bio and identifies differentially expressed genes between two experimental groups. + +Differential expression analysis identifies genes with significantly varying expression among experimental groups by comparing the variation among samples within a group to the variation between groups. +The simplest version of this analysis is comparing two groups where one of those groups is a control group. + +Our refine.bio RNA-seq examples use DESeq2 for these analyses because it handles RNA-seq data well and has great documentation. +Read more about DESeq2 and why we like it on our [Getting Started page](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#02_About_DESeq2). 
⬇️ [**Jump to the analysis code**](#analysis) ⬇️ @@ -44,7 +50,7 @@ if (!dir.exists("data")) { } # Define the file path to the plots directory -plots_dir <- "plots" # Can replace with path to desired output plots directory +plots_dir <- "plots" # Create the plots folder if it doesn't exist if (!dir.exists(plots_dir)) { @@ -52,7 +58,7 @@ if (!dir.exists(plots_dir)) { } # Define the file path to the results directory -results_dir <- "results" # Can replace with path to desired output results directory +results_dir <- "results" # Create the results folder if it doesn't exist if (!dir.exists(results_dir)) { @@ -66,7 +72,7 @@ In the same place you put this `.Rmd` file, you should now have three new empty For general information about downloading data for these examples, see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-the-data). -Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/SRP078441/rna-seq-of-primary-patient-aml-samples). +Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/SRP123625). Click the "Download Now" button on the right side of this screen. @@ -87,14 +93,14 @@ You will get an email when it is ready. ## About the dataset we are using for this example -For this example analysis, we will use this [acute myeloid leukemia (AML) dataset](https://www.refine.bio/experiments/SRP078441/rna-seq-of-primary-patient-aml-samples) [@Micol2017] - -@Micol2017 performed RNA-seq on primary peripheral blood and bone marrow samples from AML patients with and without _ASXL1/2_ mutations. +For this example analysis, we are using RNA-seq data from an [acute lymphoblastic leukemia (ALL) mouse lymphoid cell model](https://www.refine.bio/experiments/SRP123625) [@Kampen2019]. +All of the lymphoid mouse cell samples in this experiment have a human RPL10 gene; three with a reference (wild-type) RPL10 gene and three with the R98S mutation. 
+We will perform our differential expression using these knock-in and wild-type mice designations. ## Place the dataset in your new `data/` folder refine.bio will send you a download button in the email when it is ready. -Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in `.zip`. +Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in `.zip`. Double clicking should unzip this for you and create a folder of the same name. @@ -104,7 +110,7 @@ For more details on the contents of this folder see [these docs on refine.bio](h The `` folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like `GSE1235` or `SRP12345`. -Copy and paste the `SRP078441` folder into your newly created `data/` folder. +Copy and paste the `SRP123625` folder into your newly created `data/` folder. ## Check out our file structure! @@ -112,7 +118,7 @@ Your new analysis folder should contain: - The example analysis `.Rmd` you downloaded - A folder called "data" which contains: - - The `SRP078441` folder which contains: + - The `SRP123625` folder which contains: - The gene expression - The metadata TSV - A folder for `plots` (currently empty) @@ -130,19 +136,24 @@ This is handy to do because if we want to switch the dataset (see next section f ```{r} # Define the file path to the data directory -data_dir <- file.path("data", "SRP078441") # Replace with accession number which will be the name of the folder the files will be in - -# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir` -data_file <- file.path(data_dir, "SRP078441.tsv") # Replace with file path to your dataset - -# Declare the file path to the metadata file using the data directory saved as `data_dir` -metadata_file <- file.path(data_dir, "metadata_SRP078441.tsv") # Replace with file path to your metadata +# 
Replace with the path of the folder the files will be in +data_dir <- file.path("data", "SRP123625") + +# Declare the file path to the gene expression matrix file +# inside directory saved as `data_dir` +# Replace with the path to your dataset file +data_file <- file.path(data_dir, "SRP123625.tsv") + +# Declare the file path to the metadata file +# inside the directory saved as `data_dir` +# Replace with the path to your metadata file +metadata_file <- file.path(data_dir, "metadata_SRP123625.tsv") ``` Now that our file paths are declared, we can use the `file.exists()` function to check that the files are where we specified above. ```{r} -# Check if the gene expression matrix file is at the file path stored in `data_file` +# Check if the gene expression matrix file is at the path stored in `data_file` file.exists(data_file) # Check if the metadata file is at the file path stored in `metadata_file` @@ -171,7 +182,7 @@ From here you can customize this analysis example to fit your own scientific que See our Getting Started page with [instructions for package installation](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#what-you-need-to-install) for a list of the other software you will need, as well as more tips and resources. In this analysis, we will be using `DESeq2` [@Love2014] for the differential expression testing. 
-We will also use `EnhancedVolcano` for plotting and `apeglm` for some log fold change estimates in the results table [@Blighe2020; @Zhu2018] +We will also use `EnhancedVolcano` [@Blighe2020] for plotting and `apeglm` [@Zhu2018] for some log fold change estimates in the results table ```{r} if (!("DESeq2" %in% installed.packages())) { @@ -190,7 +201,7 @@ if (!("apeglm" %in% installed.packages())) { Attach the libraries we need for this analysis: -```{r} +```{r message=FALSE} # Attach the DESeq2 library library(DESeq2) @@ -221,7 +232,7 @@ We stored our file paths as objects named `metadata_file` and `data_file` in [th metadata <- readr::read_tsv(metadata_file) # Read in data TSV file -df <- readr::read_tsv(data_file) %>% +expression_df <- readr::read_tsv(data_file) %>% tibble::column_to_rownames("Gene") ``` @@ -229,11 +240,11 @@ Let's ensure that the metadata and data are in the same sample order. ```{r} # Make the data in the order of the metadata -df <- df %>% +expression_df <- expression_df %>% dplyr::select(metadata$refinebio_accession_code) # Check if this is in the same order -all.equal(colnames(df), metadata$refinebio_accession_code) +all.equal(colnames(expression_df), metadata$refinebio_accession_code) ``` The information we need to make the comparison is in the `refinebio_title` column of the metadata data.frame. @@ -244,31 +255,25 @@ head(metadata$refinebio_title) ## Set up metadata -This dataset includes data from patients with and without ASXL gene mutations. -The authors of this data have ASXL mutation status along with other information is stored all in one string (this is not very convenient for us). +This dataset includes data from mouse lymphoid cells with human RPL10, with and without a `R98S` mutation. +The mutation status is stored along with other information in a single string (this is not very convenient for us). We need to extract the mutation status information into its own column to make it easier to use. 
```{r} metadata <- metadata %>% - # The last bit of the title, separated by "-" contains the mutation - # information that we want to extract - dplyr::mutate(asxl_mutation_status = stringr::word(refinebio_title, - -1, - sep = "-" - )) %>% - # Now let's summarized the ASXL1 mutation status from this variable - dplyr::mutate(asxl_mutation_status = dplyr::case_when( - grepl("ASXL1|ASXL2", asxl_mutation_status) ~ "asxl_mutation", - grepl("ASXLwt", asxl_mutation_status) ~ "no_mutation" + # Let's get the RPL10 mutation status from this variable + dplyr::mutate(mutation_status = dplyr::case_when( + stringr::str_detect(refinebio_title, "R98S") ~ "R98S", + stringr::str_detect(refinebio_title, "WT") ~ "reference" )) ``` -Let's take a look at `metadata_df` to see if this worked. +Let's take a look at `metadata` to see if this worked by looking at the `refinebio_title` and `mutation_status` columns. ```{r} -# looking at the first 6 rows of the metadata_df and only at the columns that -# contain the title and the mutation status we extracted from the title -head(dplyr::select(metadata, refinebio_title, asxl_mutation_status)) +# Let's take a look at the original metadata column's info +# and our new `mutation_status` column +dplyr::select(metadata, refinebio_title, mutation_status) ``` Before we set up our model in the next step, we want to check if our modeling variable is set correctly. @@ -276,70 +281,80 @@ We want our "control" to to be set as the first level in the variable we provide Here we will use the `str()` function to print out a preview of the **str**ucture of our variable ```{r} -# Print out a preview of `asxl_mutation_status` -str(metadata$asxl_mutation_status) +# Print out a preview of `mutation_status` +str(metadata$mutation_status) ``` -Currently, `asxl_mutation_status` is a character. -To make sure it is set how we want for the `DESeq` object and subsequent testing, let's mutate it to a factor so we can explicitly set the levels. 
+Currently, `mutation_status` is stored as a character, which is not necessarily what we want. +To make sure it is set how we want for the `DESeq` object and subsequent testing, let's change it to a factor so we can explicitly set the levels. + +In the `levels` argument, we will list `reference` first since that is our control group. ```{r} -# Make asxl_mutation_status a factor and set the levels appropriately +# Make mutation_status a factor and set the levels appropriately metadata <- metadata %>% dplyr::mutate( - # Here we will set up the factor aspect of our new variable. - asxl_mutation_status = factor(asxl_mutation_status, levels = c("no_mutation", "asxl_mutation")) + # Here we define the values our factor variable can have and their order. + mutation_status = factor(mutation_status, levels = c("reference", "R98S")) ) ``` +Note if you don't specify `levels`, the `factor()` function will set levels in alphabetical order -- which sometimes means your control group will not be listed first! + Let's double check if the levels are what we want using the `levels()` function. ```{r} -levels(metadata$asxl_mutation_status) +levels(metadata$mutation_status) +``` + +Yes! `reference` is the first level as we want it to be. +We're all set and ready to move on to making our `DESeq2Dataset` object. + +## Define a minimum counts cutoff + +We want to filter out the genes that have not been expressed or that have low expression counts, since these do not have high enough counts to yield reliable differential expression results. +Removing these genes saves on memory usage during the tests. +We are going to do some pre-filtering to keep only genes with 10 or more reads in total across the samples. + +```{r} +# Define a minimum counts cutoff and filter the data to include +# only rows (genes) that have total counts above the cutoff +filtered_expression_df <- expression_df %>% + dplyr::filter(rowSums(.) >= 10) ``` -Yes! `no_mutation` is the first level as we want it to be. 
We're all set and ready to move on to making our `DESeq2Dataset` object. +If you have a bigger dataset, you will probably want to make this cutoff larger. ## Create a DESeq2Dataset We will be using the `DESeq2` package for differential expression testing, which requires us to format our data into a `DESeqDataSet` object. -First we need to prep our gene expression data frame so it's in the format that is compatible with the `DESeqDataSetFromMatrix()` function in the next step. +First we need to prep our gene expression data frame so that all of the count values are integers, making it compatible with the `DESeqDataSetFromMatrix()` function in the next step. ```{r} -# We are making our data frame into a matrix and rounding the numbers -gene_matrix <- round(as.matrix(df)) +# round all expression counts +gene_matrix <- round(filtered_expression_df) ``` -Now we need to create `DESeqDataSet` from our expression dataset. -We use the `asxl_mutation_status` variable we created in the design formula because that will allow us -to model the presence/absence of _ASXL1/2_ mutation. +Now we need to create a `DESeqDataSet` from our expression dataset. +We use the `mutation_status` variable we created in the design formula because that will allow us +to model the presence/absence of _R98S_ mutation. ```{r} ddset <- DESeqDataSetFromMatrix( + # Here we supply non-normalized count data countData = gene_matrix, + # Supply the `colData` with our metadata data frame colData = metadata, - design = ~asxl_mutation_status + # Supply our experimental variable to `design` + design = ~mutation_status ) ``` -## Define a minimum counts cutoff - -We want to filter out the genes that have not been expressed or that have low expression counts, since these do not have high enough counts to yield reliable differential expression results. -Removing these genes saves on memory usage during the tests. 
-We are going to do some pre-filtering to keep only genes with 10 or more reads in total across the samples. - -```{r} -# Define a minimum counts cutoff and filter `DESeqDataSet` object to include -# only rows that have counts above the cutoff -genes_to_keep <- rowSums(counts(ddset)) >= 10 -ddset <- ddset[genes_to_keep, ] -``` - ## Run differential expression analysis We'll use the wrapper function `DESeq()` to do our differential expression analysis. -In our `DESeq2` object we designated our `asxl_mutation_status` variable as the `model` argument. -Because of this, the `DESeq` function will use groups defined by `asxl_mutation_status` to test for differential expression. +In our `DESeq2` object we designated our `mutation_status` variable as the `model` argument. +Because of this, the `DESeq` function will use groups defined by `mutation_status` to test for differential expression. ```{r} deseq_object <- DESeq(ddset) @@ -351,43 +366,44 @@ Let's extract the results table from the `DESeq` object. deseq_results <- results(deseq_object) ``` -Here we will use `lfcShrink()` function to obtain shrunken log fold change estimates based on negative binomial distribution. +Here we will use `lfcShrink()` function to obtain shrunken log fold change estimates based on negative binomial distribution. This will add the estimates to your results table. -Using `lfcShrink()` can help decrease noise and preserve large differences between groups (it requires that `apeglm` package be installed). +Using `lfcShrink()` can help decrease noise and preserve large differences between groups (it requires that `apeglm` package be installed) [@Zhu2018]. ```{r} -deseq_results <- lfcShrink(deseq_object, # This is the original DESeq2 object with DESeq() already having been ran - coef = 2, # This is based on what log fold change coefficient was used in DESeq(), the default is 2. 
- res = deseq_results # This needs to be the DESeq2 results table +deseq_results <- lfcShrink( + deseq_object, # The original DESeq2 object after running DESeq() + coef = 2, # The log fold change coefficient used in DESeq(); the default is 2. + res = deseq_results # The original DESeq2 results table ) ``` -Now let's take a peek at what our results table looks like. +Now let's take a peek at what our new results table looks like. + ```{r} head(deseq_results) ``` -Note it is not filtered or sorted, so we will use tidyverse to do this before saving our results to a file. -Sort and filter the results. +Note it is not filtered or sorted, so we will use tidyverse to do this before saving our results to a file. ```{r} # this is of class DESeqResults -- we want a data frame deseq_df <- deseq_results %>% # make into data.frame as.data.frame() %>% - # the gene names are rownames -- let's make this it's own column for easy - # display + # the gene names are row names -- let's make them a column for easy display tibble::rownames_to_column("Gene") %>% + # add a column for significance threshold results dplyr::mutate(threshold = padj < 0.05) %>% - # let's sort by statistic -- the highest values should be what is up in the - # ASXL mutated samples + # sort by statistic -- the highest values will be genes with + # higher expression in RPL10 mutated samples dplyr::arrange(dplyr::desc(log2FoldChange)) ``` -Let's print out what the top results are. +Let's print out the top results. -```{r} +```{r rownames.print = FALSE} head(deseq_df) ``` @@ -396,10 +412,10 @@ head(deseq_df) To double check what a differentially expressed gene looks like, we can plot one with `DESeq2::plotCounts()` function. 
```{r} -plotCounts(ddset, gene = "ENSG00000196074", intgroup = "asxl_mutation_status") +plotCounts(ddset, gene = "ENSMUSG00000026623", intgroup = "mutation_status") ``` -The `mutation` group samples have higher expression of this gene than the control group, which helps assure us that the results are showing us what we are looking for. +The `R98S` mutated samples have higher expression of this gene than the control group, which helps assure us that the results are showing us what we are looking for. ## Save results to TSV @@ -410,34 +426,25 @@ readr::write_tsv( deseq_df, file.path( results_dir, - "SRP078441_differential_expression_results.tsv" # Replace with a relevant output file name + "SRP123625_diff_expr_results.tsv" # Replace with a relevant output file name ) ) ``` ## Create a volcano plot -We'll use the `EnhancedVolcano` package's main function to plot our data [@Zhu2018]. +We'll use the `EnhancedVolcano` package's main function to plot our data [@Blighe2020]. + Here we are plotting the `log2FoldChange` (which was estimated by `lfcShrink` step) on the x axis and `padj` on the y axis. The `padj` variable are the p values corrected with `Benjamini-Hochberg` (the default from the `results()` step). -```{r} -EnhancedVolcano::EnhancedVolcano( - deseq_df, - lab = deseq_df$Gene, # A vector that contains our gene names - x = "log2FoldChange", # The variable in `deseq_df` you want to be plotted on the x axis - y = "padj" # The variable in `deseq_df` you want to be plotted on the y axis -) -``` - -Here the red point is the gene that meets both the default p value and log2 fold change cutoff (which are 10e-6 and 1 respectively). - -We used the adjusted p values for our plot above, so you may want to loosen this cutoff with the `pCutoff` argument (Take a look at all the options for tailoring this plot using `?EnhancedVolcano`). +Because we are using adjusted p values we can feel safe in making our `pCutoff` argument `0.01` (default is `1e-05`). 
+Take a look at all the options for tailoring this plot using `?EnhancedVolcano`. -Let's make the same plot again, but adjust the `pCutoff` since we are using multiple-testing corrected p values and this time we will assign the plot to our environment as `volcano_plot`. +We will save the plot to our environment as `volcano_plot` to make it easier to save the figure separately later. ```{r} -# We'll assign this as `volcano_plot` this time +# We'll assign this as `volcano_plot` volcano_plot <- EnhancedVolcano::EnhancedVolcano( deseq_df, lab = deseq_df$Gene, @@ -450,13 +457,13 @@ volcano_plot <- EnhancedVolcano::EnhancedVolcano( volcano_plot ``` -This looks pretty good. +This looks pretty good! Let's save it to a PNG. ```{r} ggsave( plot = volcano_plot, - file.path(plots_dir, "SRP078441_volcano_plot.png") + file.path(plots_dir, "SRP123625_volcano_plot.png") ) # Replace with a plot name relevant to your data ``` diff --git a/03-rnaseq/differential-expression_rnaseq_01.html b/03-rnaseq/differential-expression_rnaseq_01.html index 9f050daf..9057f38b 100644 --- a/03-rnaseq/differential-expression_rnaseq_01.html +++ b/03-rnaseq/differential-expression_rnaseq_01.html @@ -1263,25 +1263,22 @@ }; - - + + code.sourceCode > span { display: inline-block; line-height: 1.25; } + code.sourceCode > span { color: inherit; text-decoration: inherit; } + code.sourceCode > span:empty { height: 1.2em; } + .sourceCode { overflow: visible; } + code.sourceCode { white-space: pre; position: relative; } + div.sourceCode { margin: 1em 0; } + pre.sourceCode { margin: 0; } + @media screen { + div.sourceCode { overflow: auto; } + } + @media print { + code.sourceCode { white-space: pre-wrap; } + code.sourceCode > span { text-indent: -5em; padding-left: 5em; } + } + pre.numberSource code + { counter-reset: source-line 0; } + pre.numberSource code > span + { position: relative; left: -4em; counter-increment: source-line; } + pre.numberSource code > span > a:first-child::before + { content: 
counter(source-line); + position: relative; left: -1em; text-align: right; vertical-align: baseline; + border: none; display: inline-block; + -webkit-touch-callout: none; -webkit-user-select: none; + -khtml-user-select: none; -moz-user-select: none; + -ms-user-select: none; user-select: none; + padding: 0 4px; width: 4em; + color: #aaaaaa; + } + pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; } + div.sourceCode + { } + @media screen { + code.sourceCode > span > a:first-child::before { text-decoration: underline; } + } + code span.al { color: #ff0000; } /* Alert */ + code span.an { color: #008000; } /* Annotation */ + code span.at { } /* Attribute */ + code span.bu { } /* BuiltIn */ + code span.cf { color: #0000ff; } /* ControlFlow */ + code span.ch { color: #008080; } /* Char */ + code span.cn { } /* Constant */ + code span.co { color: #008000; } /* Comment */ + code span.cv { color: #008000; } /* CommentVar */ + code span.do { color: #008000; } /* Documentation */ + code span.er { color: #ff0000; font-weight: bold; } /* Error */ + code span.ex { } /* Extension */ + code span.im { } /* Import */ + code span.in { color: #008000; } /* Information */ + code span.kw { color: #0000ff; } /* Keyword */ + code span.op { } /* Operator */ + code span.ot { color: #ff4000; } /* Other */ + code span.pp { color: #ff4000; } /* Preprocessor */ + code span.sc { color: #008080; } /* SpecialChar */ + code span.ss { color: #008080; } /* SpecialString */ + code span.st { color: #008080; } /* String */ + code span.va { } /* Variable */ + code span.vs { color: #008080; } /* VerbatimString */ + code span.wa { color: #008000; font-weight: bold; } /* Warning */ + + + + - - + @@ -2874,15 +3686,20 @@ @@ -2948,14 +3774,17 @@

Differential Expression - RNA-seq

CCDL for ALSF

-

October 2020

+

December 2020

1 Purpose of this analysis

-

This notebook takes RNA-seq data and metadata from refine.bio and identifies differentially expressed genes between experimental groups.

+

This notebook takes RNA-seq expression data and metadata from refine.bio and identifies differentially expressed genes between two experimental groups.

+

Differential expression analysis identifies genes with significantly varying expression among experimental groups by comparing the variation among samples within a group to the variation between groups. The simplest version of this analysis is comparing two groups where one of those groups is a control group.

+

Our refine.bio RNA-seq examples use DESeq2 for these analyses because it handles RNA-seq data well and has great documentation.
+Read more about DESeq2 and why we like it on our Getting Started page.

⬇️ Jump to the analysis code ⬇️

@@ -2971,32 +3800,32 @@

2.1 Obtain the .Rmd

2.2 Set up your analysis folders

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

-
# Create the data folder if it doesn't exist
-if (!dir.exists("data")) {
-  dir.create("data")
-}
-
-# Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
-
-# Create the plots folder if it doesn't exist
-if (!dir.exists(plots_dir)) {
-  dir.create(plots_dir)
-}
-
-# Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
-
-# Create the results folder if it doesn't exist
-if (!dir.exists(results_dir)) {
-  dir.create(results_dir)
-}
+
# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+  dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots"
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+  dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results"
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+  dir.create(results_dir)
+}

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

2.3 Obtain the dataset from refine.bio

For general information about downloading data for these examples, see our ‘Getting Started’ section.

-

Go to this dataset’s page on refine.bio.

+

Go to this dataset’s page on refine.bio.

Click the “Download Now” button on the right side of this screen.

Fill out the pop up window with your email and our Terms and Conditions:

@@ -3007,8 +3836,7 @@

2.3 Obtain the dataset from refin

2.4 About the dataset we are using for this example

-

For this example analysis, we will use this acute myeloid leukemia (AML) dataset (Micol et al. 2017)

-

Micol et al. (2017) performed RNA-seq on primary peripheral blood and bone marrow samples from AML patients with and without ASXL1/2 mutations.

+

For this example analysis, we are using RNA-seq data from an acute lymphoblastic leukemia (ALL) mouse lymphoid cell model (Kampen et al. 2019). All of the lymphoid mouse cell samples in this experiment have a human RPL10 gene; three with a reference (wild-type) RPL10 gene and three with the R98S mutation. We will perform our differential expression using these knock-in and wild-type mice designations.

2.5 Place the dataset in your new data/ folder

@@ -3016,7 +3844,7 @@

2.5 Place the dataset in your new

For more details on the contents of this folder see these docs on refine.bio.

The <experiment_accession_id> folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235 or SRP12345.

-

Copy and paste the SRP078441 folder into your newly created data/ folder.

+

Copy and paste the SRP123625 folder into your newly created data/ folder.

2.6 Check out our file structure!

@@ -3026,7 +3854,7 @@

2.6 Check out our file structure!
  • A folder called “data” which contains:
      -
    • The SRP078441 folder which contains: +
    • The SRP123625 folder which contains:
      • The gene expression
      • @@ -3041,20 +3869,25 @@

        2.6 Check out our file structure!

        In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

        First we will declare the file paths to our data and metadata files, which should be in our data directory. This is handy because if we want to switch the dataset we are using for this analysis (see next section for more on this), we will only have to change the file paths here to get started.

        -
        # Define the file path to the data directory
        -data_dir <- file.path("data", "SRP078441") # Replace with accession number which will be the name of the folder the files will be in
        -
        -# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
        -data_file <- file.path(data_dir, "SRP078441.tsv") # Replace with file path to your dataset
        -
        -# Declare the file path to the metadata file using the data directory saved as `data_dir`
        -metadata_file <- file.path(data_dir, "metadata_SRP078441.tsv") # Replace with file path to your metadata
        +
        # Define the file path to the data directory
        +# Replace with the path of the folder the files will be in
        +data_dir <- file.path("data", "SRP123625")
        +
        +# Declare the file path to the gene expression matrix file
        +# inside directory saved as `data_dir`
        +# Replace with the path to your dataset file
        +data_file <- file.path(data_dir, "SRP123625.tsv")
        +
        +# Declare the file path to the metadata file
        +# inside the directory saved as `data_dir`
        +# Replace with the path to your metadata file
        +metadata_file <- file.path(data_dir, "metadata_SRP123625.tsv")

        Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

        -
        # Check if the gene expression matrix file is at the file path stored in `data_file`
        -file.exists(data_file)
        +
        # Check if the gene expression matrix file is at the path stored in `data_file`
        +file.exists(data_file)
        ## [1] TRUE
        -
        # Check if the metadata file is at the file path stored in `metadata_file`
        -file.exists(metadata_file)
        +
        # Check if the metadata file is at the file path stored in `metadata_file`
        +file.exists(metadata_file)
        ## [1] TRUE

        If the chunk above printed out FALSE to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
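
If you would rather have the notebook tell you which file is missing, the check can be wrapped up as in the minimal sketch below. It assumes hypothetical stand-in paths (`example_data_file`, `example_metadata_file`) written to a temporary folder so it runs anywhere, and it deliberately creates only one of the two files.

```r
# Hypothetical stand-ins for `data_file` and `metadata_file`, placed in a
# temporary folder so this sketch does not touch your real data
example_dir <- file.path(tempdir(), "SRP123625")
dir.create(example_dir, recursive = TRUE, showWarnings = FALSE)
example_data_file <- file.path(example_dir, "SRP123625.tsv")
example_metadata_file <- file.path(example_dir, "metadata_SRP123625.tsv")

# Create only the data file so that the metadata file counts as "missing"
file.create(example_data_file)
unlink(example_metadata_file)

# Collect any paths that do not exist
files_to_check <- c(example_data_file, example_metadata_file)
missing_files <- files_to_check[!file.exists(files_to_check)]

# Report which files still need to be put in place
if (length(missing_files) > 0) {
  message("Missing file(s): ", paste(basename(missing_files), collapse = ", "))
}
```

In a real analysis you may prefer `stop()` over `message()` so that the notebook halts before any later chunk tries to read a file that is not there.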

        If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.

        @@ -3072,89 +3905,39 @@

        4 Differential Expression

        4.1 Install libraries

        See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

        -

        In this analysis, we will be using DESeq2 (Love et al. 2014) for the differential expression testing. We will also use EnhancedVolcano for plotting and apeglm for some log fold change estimates in the results table (Zhu et al. 2018; Blighe et al. 2020)

        -
        if (!("DESeq2" %in% installed.packages())) {
        -  # Install this package if it isn't installed yet
        -  BiocManager::install("DESeq2", update = FALSE)
        -}
        -if (!("EnhancedVolcano" %in% installed.packages())) {
        -  # Install this package if it isn't installed yet
        -  BiocManager::install("EnhancedVolcano", update = FALSE)
        -}
        -if (!("apeglm" %in% installed.packages())) {
        -  # Install this package if it isn't installed yet
        -  BiocManager::install("apeglm", update = FALSE)
        -}
        +

        In this analysis, we will be using DESeq2 (Love et al. 2014) for the differential expression testing. We will also use EnhancedVolcano (Blighe et al. 2020) for plotting and apeglm (Zhu et al. 2018) for some log fold change estimates in the results table.

        +
        if (!("DESeq2" %in% installed.packages())) {
        +  # Install this package if it isn't installed yet
        +  BiocManager::install("DESeq2", update = FALSE)
        +}
        +if (!("EnhancedVolcano" %in% installed.packages())) {
        +  # Install this package if it isn't installed yet
        +  BiocManager::install("EnhancedVolcano", update = FALSE)
        +}
        +if (!("apeglm" %in% installed.packages())) {
        +  # Install this package if it isn't installed yet
        +  BiocManager::install("apeglm", update = FALSE)
        +}

        Attach the libraries we need for this analysis:

        -
        # Attach the DESeq2 library
        -library(DESeq2)
        -
        ## Loading required package: S4Vectors
        -
        ## Loading required package: stats4
        -
        ## Loading required package: BiocGenerics
        -
        ## Loading required package: parallel
        -
        ## 
        -## Attaching package: 'BiocGenerics'
        -
        ## The following objects are masked from 'package:parallel':
        -## 
        -##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
        -##     clusterExport, clusterMap, parApply, parCapply, parLapply,
        -##     parLapplyLB, parRapply, parSapply, parSapplyLB
        -
        ## The following objects are masked from 'package:stats':
        -## 
        -##     IQR, mad, sd, var, xtabs
        -
        ## The following objects are masked from 'package:base':
        -## 
        -##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
        -##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
        -##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
        -##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
        -##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
        -##     union, unique, unsplit, which, which.max, which.min
        -
        ## 
        -## Attaching package: 'S4Vectors'
        -
        ## The following object is masked from 'package:base':
        -## 
        -##     expand.grid
        -
        ## Loading required package: IRanges
        -
        ## Loading required package: GenomicRanges
        -
        ## Loading required package: GenomeInfoDb
        -
        ## Loading required package: SummarizedExperiment
        -
        ## Loading required package: Biobase
        -
        ## Welcome to Bioconductor
        -## 
        -##     Vignettes contain introductory material; view with
        -##     'browseVignettes()'. To cite Bioconductor, see
        -##     'citation("Biobase")', and for packages 'citation("pkgname")'.
        -
        ## Loading required package: DelayedArray
        -
        ## Loading required package: matrixStats
        -
        ## 
        -## Attaching package: 'matrixStats'
        -
        ## The following objects are masked from 'package:Biobase':
        -## 
        -##     anyMissing, rowMedians
        -
        ## 
        -## Attaching package: 'DelayedArray'
        -
        ## The following objects are masked from 'package:matrixStats':
        -## 
        -##     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
        -
        ## The following objects are masked from 'package:base':
        -## 
        -##     aperm, apply, rowsum
        -
        # Attach the ggplot2 library for plotting
        -library(ggplot2)
        -
        -# We will need this so we can use the pipe: %>%
        -library(magrittr)
        +
        # Attach the DESeq2 library
        +library(DESeq2)
        +
        +# Attach the ggplot2 library for plotting
        +library(ggplot2)
        +
        +# We will need this so we can use the pipe: %>%
        +library(magrittr)

        The jitter plot we make later on with the DESeq2::plotCounts() function involves some randomness. As is good practice when our analysis involves randomness, we will set the seed.

        -
        set.seed(12345)
        +
        set.seed(12345)

        4.2 Import data and metadata

        Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read both TSV files and add them as data frames to your environment.

        We stored our file paths as objects named metadata_file and data_file in this previous step.

        -
        # Read in metadata TSV file
        -metadata <- readr::read_tsv(metadata_file)
        -
        ## Parsed with column specification:
        +
        # Read in metadata TSV file
        +metadata <- readr::read_tsv(metadata_file)
        +
        ## 
        +## ── Column specification ──────────────────────────────────────────────────────────────────────────────────
         ## cols(
         ##   .default = col_logical(),
         ##   refinebio_accession_code = col_character(),
        @@ -3165,212 +3948,207 @@ 

        4.2 Import data and metadata

        ## refinebio_specimen_part = col_character(), ## refinebio_subject = col_character(), ## refinebio_title = col_character() -## )
        -
        ## See spec(...) for full column specifications.
        -
        # Read in data TSV file
        -df <- readr::read_tsv(data_file) %>%
        -  tibble::column_to_rownames("Gene")
        -
        ## Parsed with column specification:
        +## )
        +## ℹ Use `spec()` for the full column specifications.
        +
        # Read in data TSV file
        +expression_df <- readr::read_tsv(data_file) %>%
        +  tibble::column_to_rownames("Gene")
        +
        ## 
        +## ── Column specification ──────────────────────────────────────────────────────────────────────────────────
         ## cols(
         ##   Gene = col_character(),
        -##   SRR3895734 = col_double(),
        -##   SRR3895735 = col_double(),
        -##   SRR3895736 = col_double(),
        -##   SRR3895737 = col_double(),
        -##   SRR3895738 = col_double(),
        -##   SRR3895739 = col_double(),
        -##   SRR3895740 = col_double(),
        -##   SRR3895741 = col_double(),
        -##   SRR3895742 = col_double(),
        -##   SRR3895743 = col_double(),
        -##   SRR3895744 = col_double(),
        -##   SRR3895745 = col_double(),
        -##   SRR3895746 = col_double(),
        -##   SRR3895747 = col_double(),
        -##   SRR3895748 = col_double(),
        -##   SRR3895749 = col_double()
        +##   SRR6255584 = col_double(),
        +##   SRR6255585 = col_double(),
        +##   SRR6255586 = col_double(),
        +##   SRR6255587 = col_double(),
        +##   SRR6255588 = col_double(),
        +##   SRR6255589 = col_double()
         ## )

        Let’s ensure that the metadata and data are in the same sample order.

        -
        # Make the data in the order of the metadata
        -df <- df %>%
        -  dplyr::select(metadata$refinebio_accession_code)
        -
        -# Check if this is in the same order
        -all.equal(colnames(df), metadata$refinebio_accession_code)
        +
        # Make the data in the order of the metadata
        +expression_df <- expression_df %>%
        +  dplyr::select(metadata$refinebio_accession_code)
        +
        +# Check if this is in the same order
        +all.equal(colnames(expression_df), metadata$refinebio_accession_code)
        ## [1] TRUE
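
The reordering step above can be tried on a small scale first. The sketch below uses made-up toy values (`toy_expression_df` and `toy_metadata` are hypothetical names) and base R indexing as a stand-in for `dplyr::select()`.

```r
# Toy expression data whose sample columns are NOT in the metadata's order
toy_expression_df <- data.frame(
  SRR6255586 = c(5, 0),
  SRR6255584 = c(12, 3),
  SRR6255585 = c(7, 1),
  row.names = c("gene_a", "gene_b")
)

# Toy metadata listing the samples in the order we want
toy_metadata <- data.frame(
  refinebio_accession_code = c("SRR6255584", "SRR6255585", "SRR6255586"),
  stringsAsFactors = FALSE
)

# Every sample in the metadata should have a column in the expression data
stopifnot(
  all(toy_metadata$refinebio_accession_code %in% colnames(toy_expression_df))
)

# Reorder the columns to match the metadata
toy_expression_df <- toy_expression_df[, toy_metadata$refinebio_accession_code]

# Check the order; identical() always returns a single TRUE or FALSE
identical(colnames(toy_expression_df), toy_metadata$refinebio_accession_code)
```

Note that `all.equal()` returns a description of the differences rather than `FALSE` when its inputs do not match, so `identical()` or `isTRUE(all.equal(...))` is the safer choice if you want to use the result inside an `if` statement.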

        The information we need to make the comparison is in the refinebio_title column of the metadata data.frame.

        -
        head(metadata$refinebio_title)
        -
        ## [1] "CBF54-BM-ASXLwt"      "49060-2010-BM-ASXLwt" "CBF234-BM-ASXLwt"    
        -## [4] "CBF124-BM-ASXLwt"     "41267-BM-ASXL2"       "45565-BM-ASXL1"
        +
        head(metadata$refinebio_title)
        +
        ## [1] "R98S11_mRNA_Suppl" "R98S13_mRNA_Suppl" "R98S35_mRNA_Suppl"
        +## [4] "WT28_mRNA_Suppl"   "WT29_mRNA_Suppl"   "WT36_mRNA_Suppl"

        4.3 Set up metadata

        -

        This dataset includes data from patients with and without ASXL gene mutations. The authors of this data have ASXL mutation status along with other information is stored all in one string (this is not very convenient for us). We need to extract the mutation status information into its own column to make it easier to use.

        -
        metadata <- metadata %>%
        -  # The last bit of the title, separated by "-" contains the mutation
        -  # information that we want to extract
        -  dplyr::mutate(asxl_mutation_status = stringr::word(refinebio_title,
        -    -1,
        -    sep = "-"
        -  )) %>%
        -  # Now let's summarized the ASXL1 mutation status from this variable
        -  dplyr::mutate(asxl_mutation_status = dplyr::case_when(
        -    grepl("ASXL1|ASXL2", asxl_mutation_status) ~ "asxl_mutation",
        -    grepl("ASXLwt", asxl_mutation_status) ~ "no_mutation"
        -  ))
        -

        Let’s take a look at metadata_df to see if this worked.

        -
        # looking at the first 6 rows of the metadata_df and only at the columns that
        -# contain the title and the mutation status we extracted from the title
        -head(dplyr::select(metadata, refinebio_title, asxl_mutation_status))
        +

        This dataset includes data from mouse lymphoid cells with human RPL10, with and without a R98S mutation. The mutation status is stored along with other information in a single string (this is not very convenient for us). We need to extract the mutation status information into its own column to make it easier to use.

        +
        metadata <- metadata %>%
        +  # Let's get the RPL10 mutation status from this variable
        +  dplyr::mutate(mutation_status = dplyr::case_when(
        +    stringr::str_detect(refinebio_title, "R98S") ~ "R98S",
        +    stringr::str_detect(refinebio_title, "WT") ~ "reference"
        +  ))
        +

        Let’s take a look at metadata to see if this worked by looking at the refinebio_title and mutation_status columns.

        +
        # Let's take a look at the original metadata column's info
        +# and our new `mutation_status` column
        +dplyr::select(metadata, refinebio_title, mutation_status)
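
One thing worth checking after a pattern-matching step like this is whether any sample matched neither pattern, because `dplyr::case_when()` returns `NA` when no condition is met. The sketch below illustrates the idea with base R (`grepl()` and `ifelse()` standing in for `stringr::str_detect()` and `dplyr::case_when()`); the third title, `unlabeled_sample`, is a made-up example and is not in this dataset.

```r
# Two titles in the style of this dataset plus one hypothetical title that
# matches neither pattern
toy_titles <- c("R98S11_mRNA_Suppl", "WT28_mRNA_Suppl", "unlabeled_sample")

# Base R stand-in for the case_when() step: unmatched titles become NA
toy_status <- ifelse(
  grepl("R98S", toy_titles), "R98S",
  ifelse(grepl("WT", toy_titles), "reference", NA_character_)
)

# Count samples per group, including any that were left unlabeled
table(toy_status, useNA = "ifany")

# TRUE here means at least one sample needs a closer look
anyNA(toy_status)
```

If `anyNA()` returns `TRUE` on your own metadata, the patterns probably need adjusting before you move on to the model.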

        Before we set up our model in the next step, we want to check if our modeling variable is set correctly. We want our “control” to be set as the first level in the variable we provide as our experimental variable. Here we will use the str() function to print out a preview of the structure of our variable.

        -
        # Print out a preview of `asxl_mutation_status`
        -str(metadata$asxl_mutation_status)
        -
        ##  chr [1:16] "no_mutation" "no_mutation" "no_mutation" "no_mutation" ...
        -

        Currently, asxl_mutation_status is a character. To make sure it is set how we want for the DESeq object and subsequent testing, let’s mutate it to a factor so we can explicitly set the levels.

        -
        # Make asxl_mutation_status a factor and set the levels appropriately
        -metadata <- metadata %>%
        -  dplyr::mutate(
        -    # Here we will set up the factor aspect of our new variable.
        -    asxl_mutation_status = factor(asxl_mutation_status, levels = c("no_mutation", "asxl_mutation"))
        -  )
        +
        # Print out a preview of `mutation_status`
        +str(metadata$mutation_status)
        +
        ##  chr [1:6] "R98S" "R98S" "R98S" "reference" "reference" ...
        +

        Currently, mutation_status is stored as a character, which is not necessarily what we want. To make sure it is set how we want for the DESeq object and subsequent testing, let’s change it to a factor so we can explicitly set the levels.

        +

        In the levels argument, we will list reference first since that is our control group.

        +
        # Make mutation_status a factor and set the levels appropriately
        +metadata <- metadata %>%
        +  dplyr::mutate(
        +    # Here we define the values our factor variable can have and their order.
        +    mutation_status = factor(mutation_status, levels = c("reference", "R98S"))
        +  )
        +

        Note if you don’t specify levels, the factor() function will set levels in alphabetical order – which sometimes means your control group will not be listed first!
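
Here is a small sketch of that behavior using a made-up vector (`toy_status` is a hypothetical name). It also shows `relevel()`, a base R alternative for moving the control group to the front of an existing factor.

```r
# A toy version of our mutation status variable
toy_status <- c("R98S", "R98S", "reference", "reference")

# Without `levels`, factor() sorts the levels, so `reference` is not first
default_factor <- factor(toy_status)
levels(default_factor)

# Setting `levels` explicitly puts the control group first
explicit_factor <- factor(toy_status, levels = c("reference", "R98S"))
levels(explicit_factor)

# relevel() moves one level to the front of a factor that already exists
releveled_factor <- relevel(default_factor, ref = "reference")
levels(releveled_factor)
```

Either approach works; setting `levels` up front has the advantage of documenting every group you expect to see.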

        Let’s double check if the levels are what we want using the levels() function.

        -
        levels(metadata$asxl_mutation_status)
        -
        ## [1] "no_mutation"   "asxl_mutation"
        -

        Yes! no_mutation is the first level as we want it to be. We’re all set and ready to move on to making our DESeq2Dataset object.

        -
        -
        -

        4.4 Create a DESeq2Dataset

        -

        We will be using the DESeq2 package for differential expression testing, which requires us to format our data into a DESeqDataSet object. First we need to prep our gene expression data frame so it’s in the format that is compatible with the DESeqDataSetFromMatrix() function in the next step.

        -
        # We are making our data frame into a matrix and rounding the numbers
        -gene_matrix <- round(as.matrix(df))
        -

        Now we need to create DESeqDataSet from our expression dataset. We use the asxl_mutation_status variable we created in the design formula because that will allow us to model the presence/absence of ASXL1/2 mutation.

        -
        ddset <- DESeqDataSetFromMatrix(
        -  countData = gene_matrix,
        -  colData = metadata,
        -  design = ~asxl_mutation_status
        -)
        -
        ## converting counts to integer mode
        +
        levels(metadata$mutation_status)
        +
        ## [1] "reference" "R98S"
        +

        Yes! reference is the first level as we want it to be. We’re all set and ready to move on to making our DESeqDataSet object.

        -

        4.5 Define a minimum counts cutoff

        +

        4.4 Define a minimum counts cutoff

        We want to filter out the genes that have not been expressed or that have low expression counts, since these do not have high enough counts to yield reliable differential expression results. Removing these genes saves on memory usage during the tests. We are going to do some pre-filtering to keep only genes with 10 or more reads in total across the samples.

        -
        # Define a minimum counts cutoff and filter `DESeqDataSet` object to include
        -# only rows that have counts above the cutoff
        -genes_to_keep <- rowSums(counts(ddset)) >= 10
        -ddset <- ddset[genes_to_keep, ]
        +
        # Define a minimum counts cutoff and filter the data to include
        +# only rows (genes) that have total counts above the cutoff
        +filtered_expression_df <- expression_df %>%
        +  dplyr::filter(rowSums(.) >= 10)
        +

        If you have a bigger dataset, you will probably want to make this cutoff larger.
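
To see what this filter keeps and drops, here is a minimal sketch on made-up counts. The names `toy_counts` and `min_total_counts` are hypothetical, and base R indexing stands in for `dplyr::filter()`.

```r
# Toy counts: rows are genes, columns are samples
toy_counts <- data.frame(
  sample_1 = c(0, 4, 250),
  sample_2 = c(1, 3, 310),
  sample_3 = c(0, 2, 275),
  row.names = c("gene_a", "gene_b", "gene_c")
)

# Storing the cutoff in a variable makes it easy to change for larger datasets
min_total_counts <- 10

# TRUE for genes whose total count across all samples meets the cutoff
genes_to_keep <- rowSums(toy_counts) >= min_total_counts

# Keep only those genes; `drop = FALSE` guarantees we get a data frame back
filtered_toy_counts <- toy_counts[genes_to_keep, , drop = FALSE]
filtered_toy_counts
```

Comparing `nrow()` before and after filtering is a quick way to see how many genes a given cutoff removes.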

        +
        +
        +

        4.5 Create a DESeqDataSet

        +

        We will be using the DESeq2 package for differential expression testing, which requires us to format our data into a DESeqDataSet object. First we need to prep our gene expression data frame so that all of the count values are integers, making it compatible with the DESeqDataSetFromMatrix() function in the next step.

        +
        # round all expression counts
        +gene_matrix <- round(filtered_expression_df)
        +

        Now we need to create a DESeqDataSet from our expression dataset. We use the mutation_status variable we created in the design formula because that will allow us to model the presence/absence of the R98S mutation.

        +
        ddset <- DESeqDataSetFromMatrix(
        +  # Here we supply non-normalized count data
        +  countData = gene_matrix,
        +  # Supply the `colData` with our metadata data frame
        +  colData = metadata,
        +  # Supply our experimental variable to `design`
        +  design = ~mutation_status
        +)
        +
        ## converting counts to integer mode

        4.6 Run differential expression analysis

        -

        We’ll use the wrapper function DESeq() to do our differential expression analysis. In our DESeq2 object we designated our asxl_mutation_status variable as the model argument. Because of this, the DESeq function will use groups defined by asxl_mutation_status to test for differential expression.

        -
        deseq_object <- DESeq(ddset)
        +

        We’ll use the wrapper function DESeq() to do our differential expression analysis. In our DESeq2 object we supplied our mutation_status variable to the design argument. Because of this, the DESeq() function will use groups defined by mutation_status to test for differential expression.

        +
        deseq_object <- DESeq(ddset)
        ## estimating size factors
        ## estimating dispersions
        ## gene-wise dispersion estimates
        ## mean-dispersion relationship
        ## final dispersion estimates
        ## fitting model and testing
        -
        ## -- replacing outliers and refitting for 745 genes
        -## -- DESeq argument 'minReplicatesForReplace' = 7 
        -## -- original counts are preserved in counts(dds)
        -
        ## estimating dispersions
        -
        ## fitting model and testing

        Let’s extract the results table from the DESeq object.

        -
        deseq_results <- results(deseq_object)
        -

        Here we will use lfcShrink() function to obtain shrunken log fold change estimates based on negative binomial distribution. This will add the estimates to your results table. Using lfcShrink() can help decrease noise and preserve large differences between groups (it requires that apeglm package be installed).

        -
        deseq_results <- lfcShrink(deseq_object, # This is the original DESeq2 object with DESeq() already having been ran
        -  coef = 2, # This is based on what log fold change coefficient was used in DESeq(), the default is 2.
        -  res = deseq_results # This needs to be the DESeq2 results table
        -)
        +
        deseq_results <- results(deseq_object)
        +

        Here we will use the lfcShrink() function to obtain shrunken log fold change estimates based on a negative binomial distribution. This will add the estimates to your results table. Using lfcShrink() can help decrease noise and preserve large differences between groups (it requires that the apeglm package be installed) (Zhu et al. 2018).

        +
        deseq_results <- lfcShrink(
        +  deseq_object, # The original DESeq2 object after running DESeq()
        +  coef = 2, # The coefficient to shrink: the second entry of resultsNames(deseq_object)
        +  res = deseq_results # The original DESeq2 results table
        +)
        ## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
         ##     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
         ##     sequence count data: removing the noise and preserving large differences.
         ##     Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
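
To see why the second coefficient is the one we are after, it can help to look at how a design formula like ours expands. The sketch below uses base R's `model.matrix()` on a made-up metadata frame (`toy_metadata` is a hypothetical name). DESeq2 formats its coefficient names differently, and `resultsNames(deseq_object)` will list them for a fitted object, so treat this as an illustration only.

```r
# A toy metadata frame with `reference` as the first (control) level
toy_metadata <- data.frame(
  mutation_status = factor(
    c("R98S", "R98S", "R98S", "reference", "reference", "reference"),
    levels = c("reference", "R98S")
  )
)

# Expand the design formula into its model coefficients
design_matrix <- model.matrix(~mutation_status, data = toy_metadata)

# The first column is the intercept (the reference group);
# the second column is the R98S effect relative to reference
colnames(design_matrix)
```

Because the intercept takes the first position, the comparison between the two groups lands in the second, which is why the order of the factor levels matters.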
        -

        Now let’s take a peek at what our results table looks like.

        -
        head(deseq_results)
        -
        ## log2 fold change (MAP): asxl mutation status asxl mutation vs no mutation 
        -## Wald test p-value: asxl mutation status asxl mutation vs no mutation 
        +

        Now let’s take a peek at what our new results table looks like.

        +
        head(deseq_results)
        +
        ## log2 fold change (MAP): mutation status R98S vs reference 
        +## Wald test p-value: mutation status R98S vs reference 
         ## DataFrame with 6 rows and 5 columns
        -##                    baseMean log2FoldChange      lfcSE    pvalue      padj
        -##                   <numeric>      <numeric>  <numeric> <numeric> <numeric>
        -## ENSG00000000003   52.852059    9.64776e-07 0.00144269 0.6604581  0.998525
        -## ENSG00000000005    0.260056    2.03308e-07 0.00144269 0.5791283        NA
        -## ENSG00000000419  406.161355   -1.38160e-06 0.00144268 0.7076357  0.998525
        -## ENSG00000000457  564.784021   -2.36497e-06 0.00144268 0.3719972  0.998525
        -## ENSG00000000460  401.130684    3.87280e-06 0.00144269 0.1627644  0.998525
        -## ENSG00000000938 1500.527448    3.07435e-06 0.00144269 0.0354596  0.720601
        -

        Note it is not filtered or sorted, so we will use tidyverse to do this before saving our results to a file. Sort and filter the results.

        -
        # this is of class DESeqResults -- we want a data frame
        -deseq_df <- deseq_results %>%
        -  # make into data.frame
        -  as.data.frame() %>%
        -  # the gene names are rownames -- let's make this it's own column for easy
        -  # display
        -  tibble::rownames_to_column("Gene") %>%
        -  dplyr::mutate(threshold = padj < 0.05) %>%
        -  # let's sort by statistic -- the highest values should be what is up in the
        -  # ASXL mutated samples
        -  dplyr::arrange(dplyr::desc(log2FoldChange))
        -

        Let’s print out what the top results are.

        -
        head(deseq_df)
        +##                     baseMean log2FoldChange     lfcSE      pvalue
        +##                    <numeric>      <numeric> <numeric>   <numeric>
        +## ENSMUSG00000000001 9579.0571     -0.4349384  0.160640 2.59595e-03
        +## ENSMUSG00000000028 1199.7333      0.0647514  0.134708 6.04429e-01
        +## ENSMUSG00000000056 1287.5086      0.3243824  0.272978 1.02032e-01
        +## ENSMUSG00000000058   20.1703      5.0170059  1.515508 6.85780e-05
        +## ENSMUSG00000000078 4939.6277     -0.9574237  0.234363 4.75060e-06
        +## ENSMUSG00000000085 1150.9626      0.0929495  0.126941 4.32755e-01
        +##                           padj
        +##                      <numeric>
        +## ENSMUSG00000000001 0.019791734
        +## ENSMUSG00000000028 0.808664075
        +## ENSMUSG00000000056 0.283225795
        +## ENSMUSG00000000058 0.001074535
        +## ENSMUSG00000000078 0.000113951
        +## ENSMUSG00000000085 0.682936007
        +

        Note that it is not filtered or sorted, so we will use tidyverse functions to do this before saving our results to a file.

        +
        # this is of class DESeqResults -- we want a data frame
        +deseq_df <- deseq_results %>%
        +  # make into data.frame
        +  as.data.frame() %>%
        +  # the gene names are row names -- let's make them a column for easy display
        +  tibble::rownames_to_column("Gene") %>%
        +  # add a column for significance threshold results
        +  dplyr::mutate(threshold = padj < 0.05) %>%
        +  # sort by statistic -- the highest values will be genes with
        +  # higher expression in RPL10 mutated samples
        +  dplyr::arrange(dplyr::desc(log2FoldChange))
        +

        Let’s print out the top results.

        +
        head(deseq_df)
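The reshaping steps above can be sketched on a toy table. This is a minimal illustration in base R only, so it runs without DESeq2 or the tidyverse; the gene names and values in `toy_results` are made up, and `toy_df` is a hypothetical stand-in for `deseq_df`.

```r
# Toy stand-in for a DESeq2 results table (values are invented for illustration)
toy_results <- data.frame(
  log2FoldChange = c(0.3, 5.0, -0.9, 0.1),
  padj = c(0.28, 0.001, 0.0001, NA),
  row.names = c("gene_a", "gene_b", "gene_c", "gene_d")
)

# Move the row names into their own `Gene` column
toy_df <- data.frame(
  Gene = rownames(toy_results),
  toy_results,
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Flag genes below the significance threshold; an NA `padj` gives an NA flag
toy_df$threshold <- toy_df$padj < 0.05

# Sort from the highest to the lowest log2 fold change
toy_df <- toy_df[order(toy_df$log2FoldChange, decreasing = TRUE), ]

toy_df
```

Note that `padj` can be `NA` (for example, for genes removed by DESeq2's independent filtering), so the `threshold` column may contain `NA` as well as `TRUE` and `FALSE`.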

        4.6.1 Check results by plotting one gene

        To double check what a differentially expressed gene looks like, we can plot one with the DESeq2::plotCounts() function.

        -
        plotCounts(ddset, gene = "ENSG00000196074", intgroup = "asxl_mutation_status")
        -

        -

        The mutation group samples have higher expression of this gene than the control group, which helps assure us that the results are showing us what we are looking for.

        +
        plotCounts(ddset, gene = "ENSMUSG00000026623", intgroup = "mutation_status")
        +

        +

        The R98S mutated samples have higher expression of this gene than the control group, which helps assure us that the results are showing us what we are looking for.
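The same sanity check can be done numerically. The sketch below uses base R and invented normalized counts for one hypothetical gene (`toy_counts` is not from the dataset); it only illustrates the idea that a positive log2 fold change should go with a higher mean in the R98S group.

```r
# Invented normalized counts for a single gene, three samples per group
toy_counts <- data.frame(
  mutation_status = factor(
    c("reference", "reference", "reference", "R98S", "R98S", "R98S"),
    levels = c("reference", "R98S")
  ),
  normalized_count = c(12, 15, 10, 48, 52, 61)
)

# Mean normalized count in each group
group_means <- tapply(
  toy_counts$normalized_count,
  toy_counts$mutation_status,
  mean
)

# A crude (unshrunken) log2 ratio of R98S over reference;
# it should share its sign with the gene's log2FoldChange
log2_ratio <- log2(group_means[["R98S"]] / group_means[["reference"]])
log2_ratio
```

With real data, `DESeq2::plotCounts()` accepts `returnData = TRUE`, which returns the plotted values as a data frame instead of drawing the plot, so a summary like this can be computed from it.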

        4.7 Save results to TSV

        Write the results table to file.

        -
        readr::write_tsv(
        -  deseq_df,
        -  file.path(
        -    results_dir,
        -    "SRP078441_differential_expression_results.tsv" # Replace with a relevant output file name
        -  )
        -)
        +
        readr::write_tsv(
        +  deseq_df,
        +  file.path(
        +    results_dir,
        +    "SRP123625_diff_expr_results.tsv" # Replace with a relevant output file name
        +  )
        +)
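To confirm that a results table survives being written to TSV, it can be read back and compared. This is a small sketch using base R's `write.table()` and `read.delim()` rather than `readr`, with a hypothetical `toy_df` and a temporary file so nothing in `results_dir` is touched.

```r
# Hypothetical results table (values are invented for illustration)
toy_df <- data.frame(
  Gene = c("gene_a", "gene_b"),
  log2FoldChange = c(1.5, -0.7),
  padj = c(0.01, NA),
  stringsAsFactors = FALSE
)

# Write to a temporary TSV file
out_file <- file.path(tempdir(), "toy_diff_expr_results.tsv")
write.table(toy_df, out_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back and check that nothing changed
read_back <- read.delim(out_file, stringsAsFactors = FALSE)
all.equal(toy_df, read_back)
```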

        4.8 Create a volcano plot

        -

        We’ll use the EnhancedVolcano package’s main function to plot our data (Zhu et al. 2018). Here we are plotting the log2FoldChange (which was estimated by lfcShrink step) on the x axis and padj on the y axis. The padj variable are the p values corrected with Benjamini-Hochberg (the default from the results() step).

        -
        EnhancedVolcano::EnhancedVolcano(
        -  deseq_df,
        -  lab = deseq_df$Gene, # A vector that contains our gene names
        -  x = "log2FoldChange", # The variable in `deseq_df` you want to be plotted on the x axis
        -  y = "padj" # The variable in `deseq_df` you want to be plotted on the y axis
        -)
        -

        -

        Here the red point is the gene that meets both the default p value and log2 fold change cutoff (which are 10e-6 and 1 respectively).

        -

        We used the adjusted p values for our plot above, so you may want to loosen this cutoff with the pCutoff argument (Take a look at all the options for tailoring this plot using ?EnhancedVolcano).

        -

        Let’s make the same plot again, but adjust the pCutoff since we are using multiple-testing corrected p values and this time we will assign the plot to our environment as volcano_plot.

        -
        # We'll assign this as `volcano_plot` this time
        -volcano_plot <- EnhancedVolcano::EnhancedVolcano(
        -  deseq_df,
        -  lab = deseq_df$Gene,
        -  x = "log2FoldChange",
        -  y = "padj",
        -  pCutoff = 0.01 # Loosen the cutoff since we supplied corrected p-values
        -)
        -
        -# Print out plot here
        -volcano_plot
        -

        -

        This looks pretty good. Let’s save it to a PNG.

        -
        ggsave(
        -  plot = volcano_plot,
        -  file.path(plots_dir, "SRP078441_volcano_plot.png")
        -) # Replace with a plot name relevant to your data
        +

        We’ll use the EnhancedVolcano package’s main function to plot our data (Blighe et al. 2020).

        +

        Here we are plotting the log2FoldChange (which was estimated by the lfcShrink step) on the x axis and padj on the y axis. The padj variable contains the p values corrected with the Benjamini-Hochberg method (the default from the results() step).
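As a rough illustration of what the Benjamini-Hochberg correction does, the sketch below computes it by hand on four made-up p values and compares the result with base R's `p.adjust()`. The variable names are hypothetical, and this is not how DESeq2 is called in this notebook; it is only meant to show the arithmetic.

```r
# Made-up raw p values
p_values <- c(0.001, 0.01, 0.02, 0.5)
n <- length(p_values)

# Order from the largest to the smallest p value
ord <- order(p_values, decreasing = TRUE)

# Scale each p value by n / rank (the largest p value has rank n)
scaled <- p_values[ord] * n / (n:1)

# Enforce monotonicity with a running minimum, and cap values at 1
adj_sorted <- pmin(1, cummin(scaled))

# Put the adjusted values back in the original order
padj_manual <- numeric(n)
padj_manual[ord] <- adj_sorted

# Compare with the built-in implementation
padj_builtin <- p.adjust(p_values, method = "BH")
all.equal(padj_manual, padj_builtin)
```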

        +

        Because we are using adjusted p values, we can feel safe setting our pCutoff argument to 0.01 (the default is 1e-05).
        +Take a look at all the options for tailoring this plot using ?EnhancedVolcano.
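To see which genes would be highlighted under a given pair of cutoffs, the thresholds can be applied directly to the table. This is a base R sketch with invented values; `p_cutoff` mirrors the `pCutoff` value used here, while `fc_cutoff = 1` is an assumption about the fold change cutoff (check `?EnhancedVolcano` for the default in your installed version).

```r
# Invented results for four hypothetical genes
toy_df <- data.frame(
  Gene = c("gene_a", "gene_b", "gene_c", "gene_d"),
  log2FoldChange = c(2.5, 0.4, -1.8, 3.0),
  padj = c(0.001, 0.0005, 0.2, 0.03),
  stringsAsFactors = FALSE
)

p_cutoff <- 0.01 # adjusted p value cutoff
fc_cutoff <- 1 # absolute log2 fold change cutoff (assumed)

passes_p <- toy_df$padj < p_cutoff
passes_fc <- abs(toy_df$log2FoldChange) > fc_cutoff

# Label each gene by which cutoffs it passes
toy_df$category <- ifelse(
  passes_p & passes_fc, "both",
  ifelse(passes_p, "p only",
    ifelse(passes_fc, "fold change only", "neither")
  )
)

table(toy_df$category)
```

In this toy table only `gene_a` passes both cutoffs; with the stricter `1e-05` cutoff it would not.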

        +

        We will save the plot to our environment as volcano_plot to make it easier to save the figure separately later.

        +
        # We'll assign this as `volcano_plot`
        +volcano_plot <- EnhancedVolcano::EnhancedVolcano(
        +  deseq_df,
        +  lab = deseq_df$Gene,
        +  x = "log2FoldChange",
        +  y = "padj",
        +  pCutoff = 0.01 # Loosen the cutoff since we supplied corrected p-values
        +)
        +
        ## Registered S3 methods overwritten by 'ggalt':
        +##   method                  from   
        +##   grid.draw.absoluteGrob  ggplot2
        +##   grobHeight.absoluteGrob ggplot2
        +##   grobWidth.absoluteGrob  ggplot2
        +##   grobX.absoluteGrob      ggplot2
        +##   grobY.absoluteGrob      ggplot2
        +
        # Print out plot here
        +volcano_plot
        +

        +

        This looks pretty good! Let’s save it to a PNG.

        +
        ggsave(
        +  plot = volcano_plot,
        +  file.path(plots_dir, "SRP123625_volcano_plot.png")
        +) # Replace with a plot name relevant to your data
        ## Saving 7 x 5 in image

        Heatmaps are also a pretty common way to show differential expression results. You can take your results from this example and make a heatmap following our heatmap module.

        @@ -3379,17 +4157,17 @@

        4.8 Create a volcano plot

        5 Further learning resources about this analysis

  • 6 Session info

    At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording which versions of the software and packages you used to run the analysis.
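If you also want a copy of the session info outside the notebook, one option is to capture the printed output and write it to a text file. This sketch uses base R's `sessionInfo()` rather than the `sessioninfo` package, and the file name is a hypothetical choice.

```r
# Hypothetical output location; replace with a path in your project
session_file <- file.path(tempdir(), "session_info.txt")

# Capture the printed session info and write it to the file
writeLines(capture.output(sessionInfo()), session_file)

# Peek at the first few lines that were saved
head(readLines(session_file))
```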

    -
    # Print session info
    -sessioninfo::session_info()
    -
    ## ─ Session info ───────────────────────────────────────────────────────────────
    +
    # Print session info
    +sessioninfo::session_info()
    +
    ## ─ Session info ─────────────────────────────────────────────────────
     ##  setting  value                       
     ##  version  R version 4.0.2 (2020-06-22)
     ##  os       Ubuntu 20.04 LTS            
    @@ -3399,62 +4177,73 @@ 

    6 Session info

     ##  collate  en_US.UTF-8
     ##  ctype    en_US.UTF-8
     ##  tz       Etc/UTC
    -##  date     2020-10-16
    +##  date     2020-12-16
     ##
    -## ─ Packages ───────────────────────────────────────────────────────────────────
    +## ─ Packages ─────────────────────────────────────────────────────────
     ##  package              * version  date       lib source
    -##  annotate               1.66.0   2020-04-27 [1] Bioconductor
    -##  AnnotationDbi          1.50.3   2020-07-25 [1] Bioconductor
    -##  apeglm                 1.10.0   2020-04-27 [1] Bioconductor
    +##  annotate               1.68.0   2020-10-27 [1] Bioconductor
    +##  AnnotationDbi          1.52.0   2020-10-27 [1] Bioconductor
    +##  apeglm                 1.12.0   2020-10-27 [1] Bioconductor
    +##  ash                    1.0-15   2015-09-01 [1] RSPM (R 4.0.0)
     ##  assertthat             0.2.1    2019-03-21 [1] RSPM (R 4.0.0)
     ##  backports              1.1.10   2020-09-15 [1] RSPM (R 4.0.2)
     ##  bbmle                  1.0.23.1 2020-02-03 [1] RSPM (R 4.0.0)
     ##  bdsmatrix              1.3-4    2020-01-13 [1] RSPM (R 4.0.0)
    -##  Biobase              * 2.48.0   2020-04-27 [1] Bioconductor
    -##  BiocGenerics         * 0.34.0   2020-04-27 [1] Bioconductor
    -##  BiocParallel           1.22.0   2020-04-27 [1] Bioconductor
    +##  beeswarm               0.2.3    2016-04-25 [1] RSPM (R 4.0.0)
    +##  Biobase              * 2.50.0   2020-10-27 [1] Bioconductor
    +##  BiocGenerics         * 0.36.0   2020-10-27 [1] Bioconductor
    +##  BiocParallel           1.24.1   2020-11-06 [1] Bioconductor
     ##  bit                    4.0.4    2020-08-04 [1] RSPM (R 4.0.2)
     ##  bit64                  4.0.5    2020-08-30 [1] RSPM (R 4.0.2)
     ##  bitops                 1.0-6    2013-08-17 [1] RSPM (R 4.0.0)
     ##  blob                   1.2.1    2020-01-20 [1] RSPM (R 4.0.0)
    -##  cli                    2.0.2    2020-02-28 [1] RSPM (R 4.0.0)
    +##  cli                    2.1.0    2020-10-12 [1] RSPM (R 4.0.2)
     ##  coda                   0.19-4   2020-09-30 [1] RSPM (R 4.0.2)
     ##  colorspace             1.4-1    2019-03-18 [1] RSPM (R 4.0.0)
     ##  crayon                 1.3.4    2017-09-16 [1] RSPM (R 4.0.0)
     ##  DBI                    1.1.0    2019-12-15 [1] RSPM (R 4.0.0)
    -##  DelayedArray         * 0.14.1   2020-07-14 [1] Bioconductor
    -##  DESeq2               * 1.28.1   2020-05-12 [1] Bioconductor
    +##  DelayedArray           0.16.0   2020-10-27 [1] Bioconductor
    +##  DESeq2               * 1.30.0   2020-10-27 [1] Bioconductor
     ##  digest                 0.6.25   2020-02-23 [1] RSPM (R 4.0.0)
     ##  dplyr                  1.0.2    2020-08-18 [1] RSPM (R 4.0.2)
     ##  ellipsis               0.3.1    2020-05-15 [1] RSPM (R 4.0.0)
     ##  emdbook                1.3.12   2020-02-19 [1] RSPM (R 4.0.0)
    -##  EnhancedVolcano        1.6.0    2020-04-27 [1] Bioconductor
    +##  EnhancedVolcano        1.8.0    2020-10-27 [1] Bioconductor
     ##  evaluate               0.14     2019-05-28 [1] RSPM (R 4.0.0)
    +##  extrafont              0.17     2014-12-08 [1] RSPM (R 4.0.0)
    +##  extrafontdb            1.0      2012-06-11 [1] RSPM (R 4.0.0)
     ##  fansi                  0.4.1    2020-01-08 [1] RSPM (R 4.0.0)
     ##  farver                 2.0.3    2020-01-16 [1] RSPM (R 4.0.0)
    -##  genefilter             1.70.0   2020-04-27 [1] Bioconductor
    -##  geneplotter            1.66.0   2020-04-27 [1] Bioconductor
    +##  genefilter             1.72.0   2020-10-27 [1] Bioconductor
    +##  geneplotter            1.68.0   2020-10-27 [1] Bioconductor
     ##  generics               0.0.2    2018-11-29 [1] RSPM (R 4.0.0)
    -##  GenomeInfoDb         * 1.24.2   2020-06-15 [1] Bioconductor
    -##  GenomeInfoDbData       1.2.3    2020-10-06 [1] Bioconductor
    -##  GenomicRanges        * 1.40.0   2020-04-27 [1] Bioconductor
    +##  GenomeInfoDb         * 1.26.2   2020-12-08 [1] Bioconductor
    +##  GenomeInfoDbData       1.2.4    2020-12-16 [1] Bioconductor
    +##  GenomicRanges        * 1.42.0   2020-10-27 [1] Bioconductor
     ##  getopt                 1.20.3   2019-03-22 [1] RSPM (R 4.0.0)
    +##  ggalt                  0.4.0    2017-02-15 [1] RSPM (R 4.0.0)
    +##  ggbeeswarm             0.6.0    2017-08-07 [1] RSPM (R 4.0.0)
     ##  ggplot2              * 3.3.2    2020-06-19 [1] RSPM (R 4.0.1)
    +##  ggrastr                0.2.1    2020-09-14 [1] RSPM (R 4.0.2)
     ##  ggrepel                0.8.2    2020-03-08 [1] RSPM (R 4.0.2)
     ##  glue                   1.4.2    2020-08-27 [1] RSPM (R 4.0.2)
     ##  gtable                 0.3.0    2019-03-25 [1] RSPM (R 4.0.0)
     ##  hms                    0.5.3    2020-01-08 [1] RSPM (R 4.0.0)
     ##  htmltools              0.5.0    2020-06-16 [1] RSPM (R 4.0.1)
    -##  IRanges              * 2.22.2   2020-05-21 [1] Bioconductor
    +##  httr                   1.4.2    2020-07-20 [1] RSPM (R 4.0.2)
    +##  IRanges              * 2.24.1   2020-12-12 [1] Bioconductor
     ##  jsonlite               1.7.1    2020-09-07 [1] RSPM (R 4.0.2)
    +##  KernSmooth             2.23-17  2020-04-26 [2] CRAN (R 4.0.2)
     ##  knitr                  1.30     2020-09-22 [1] RSPM (R 4.0.2)
     ##  labeling               0.3      2014-08-23 [1] RSPM (R 4.0.0)
     ##  lattice                0.20-41  2020-04-02 [2] CRAN (R 4.0.2)
     ##  lifecycle              0.2.0    2020-03-06 [1] RSPM (R 4.0.0)
     ##  locfit                 1.5-9.4  2020-03-25 [1] RSPM (R 4.0.0)
     ##  magrittr             * 1.5      2014-11-22 [1] RSPM (R 4.0.0)
    +##  maps                   3.3.0    2018-04-03 [1] RSPM (R 4.0.0)
     ##  MASS                   7.3-51.6 2020-04-26 [2] CRAN (R 4.0.2)
     ##  Matrix                 1.2-18   2019-11-27 [2] CRAN (R 4.0.2)
    +##  MatrixGenerics       * 1.2.0    2020-10-27 [1] Bioconductor
     ##  matrixStats          * 0.57.0   2020-09-25 [1] RSPM (R 4.0.2)
     ##  memoise                1.1.0    2017-04-21 [1] RSPM (R 4.0.0)
     ##  munsell                0.5.0    2018-06-12 [1] RSPM (R 4.0.0)
    @@ -3464,6 +4253,8 @@

    6 Session info

     ##  pillar                 1.4.6    2020-07-10 [1] RSPM (R 4.0.2)
     ##  pkgconfig              2.0.3    2019-09-22 [1] RSPM (R 4.0.0)
     ##  plyr                   1.8.6    2020-03-03 [1] RSPM (R 4.0.2)
    +##  proj4                  1.0-10   2020-03-02 [1] RSPM (R 4.0.0)
    +##  ps                     1.4.0    2020-10-07 [1] RSPM (R 4.0.2)
     ##  purrr                  0.3.4    2020-04-17 [1] RSPM (R 4.0.0)
     ##  R.cache                0.14.0   2019-12-06 [1] RSPM (R 4.0.0)
     ##  R.methodsS3            1.8.1    2020-08-26 [1] RSPM (R 4.0.2)
    @@ -3473,30 +4264,32 @@

    6 Session info

     ##  RColorBrewer           1.1-2    2014-12-07 [1] RSPM (R 4.0.0)
     ##  Rcpp                   1.0.5    2020-07-06 [1] RSPM (R 4.0.2)
     ##  RCurl                  1.98-1.2 2020-04-18 [1] RSPM (R 4.0.0)
    -##  readr                  1.3.1    2018-12-21 [1] RSPM (R 4.0.2)
    +##  readr                  1.4.0    2020-10-05 [1] RSPM (R 4.0.2)
     ##  rematch2               2.1.2    2020-05-01 [1] RSPM (R 4.0.0)
    -##  rlang                  0.4.7    2020-07-09 [1] RSPM (R 4.0.2)
    +##  rlang                  0.4.8    2020-10-08 [1] RSPM (R 4.0.2)
     ##  rmarkdown              2.4      2020-09-30 [1] RSPM (R 4.0.2)
    -##  RSQLite                2.2.0    2020-01-07 [1] RSPM (R 4.0.2)
    +##  RSQLite                2.2.1    2020-09-30 [1] RSPM (R 4.0.2)
     ##  rstudioapi             0.11     2020-02-07 [1] RSPM (R 4.0.0)
    -##  S4Vectors            * 0.26.1   2020-05-16 [1] Bioconductor
    +##  Rttf2pt1               1.3.8    2020-01-10 [1] RSPM (R 4.0.0)
    +##  S4Vectors            * 0.28.1   2020-12-09 [1] Bioconductor
     ##  scales                 1.1.1    2020-05-11 [1] RSPM (R 4.0.0)
     ##  sessioninfo            1.1.1    2018-11-05 [1] RSPM (R 4.0.0)
     ##  stringi                1.5.3    2020-09-09 [1] RSPM (R 4.0.2)
     ##  stringr                1.4.0    2019-02-10 [1] RSPM (R 4.0.0)
     ##  styler                 1.3.2    2020-02-23 [1] RSPM (R 4.0.0)
    -##  SummarizedExperiment * 1.18.2   2020-07-09 [1] Bioconductor
    +##  SummarizedExperiment * 1.20.0   2020-10-27 [1] Bioconductor
     ##  survival               3.1-12   2020-04-10 [2] CRAN (R 4.0.2)
    -##  tibble                 3.0.3    2020-07-10 [1] RSPM (R 4.0.2)
    +##  tibble                 3.0.4    2020-10-12 [1] RSPM (R 4.0.2)
     ##  tidyselect             1.1.0    2020-05-11 [1] RSPM (R 4.0.0)
     ##  vctrs                  0.3.4    2020-08-29 [1] RSPM (R 4.0.2)
    +##  vipor                  0.4.5    2017-03-22 [1] RSPM (R 4.0.0)
     ##  withr                  2.3.0    2020-09-22 [1] RSPM (R 4.0.2)
     ##  xfun                   0.18     2020-09-29 [1] RSPM (R 4.0.2)
     ##  XML                    3.99-0.5 2020-07-23 [1] RSPM (R 4.0.2)
     ##  xtable                 1.8-4    2019-04-21 [1] RSPM (R 4.0.0)
    -##  XVector                0.28.0   2020-04-27 [1] Bioconductor
    +##  XVector                0.30.0   2020-10-27 [1] Bioconductor
     ##  yaml                   2.2.1    2020-02-01 [1] RSPM (R 4.0.0)
    -##  zlibbioc               1.34.0   2020-04-27 [1] Bioconductor
    +##  zlibbioc               1.36.0   2020-10-27 [1] Bioconductor
     ##
     ## [1] /usr/local/lib/R/site-library
     ## [2] /usr/local/lib/R/library
    @@ -3505,13 +4298,13 @@

    6 Session info

    References

    -

    Blighe K., S. Rana, and M. Lewis, 2020 EnhancedVolcano: Publication-ready volcano plots with enhanced colouring and labeling.

    +

    Blighe K., S. Rana, and M. Lewis, 2020 EnhancedVolcano: Publication-ready volcano plots with enhanced colouring and labeling. https://github.com/kevinblighe/EnhancedVolcano

    -
    -

    Love M. I., W. Huber, and S. Anders, 2014 Moderated estimation of fold change and dispersion for rna-seq data with deseq2. Genome Biology 15. https://doi.org/10.1186/s13059-014-0550-8

    +
    +

    Kampen K. R., L. Fancello, T. Girardi, G. Rinaldi, and M. Planque et al., 2019 Translatome analysis reveals altered serine and glycine metabolism in T-cell acute lymphoblastic leukemia cells. Nature Communications 10. https://doi.org/10.1038/s41467-019-10508-2

    -
    -

    Micol J. B., A. Pastore, D. Inoue, N. Duployez, and E. Kim et al., 2017 ASXL2 is essential for haematopoiesis and acts as a haploinsufficient tumour suppressor in leukemia. Nat Commun 8: 15429.

    +
    +

    Love M. I., W. Huber, and S. Anders, 2014 Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biology 15. https://doi.org/10.1186/s13059-014-0550-8

    Zhu A., J. G. Ibrahim, and M. I. Love, 2018 Heavy-tailed prior distributions for sequence count data: Removing the noise and preserving large differences. Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895

    @@ -3519,6 +4312,11 @@

    References

    +
    diff --git a/03-rnaseq/dimension-reduction_rnaseq_01_pca.Rmd b/03-rnaseq/dimension-reduction_rnaseq_01_pca.Rmd index e9fe5c7a..3be5192e 100644 --- a/03-rnaseq/dimension-reduction_rnaseq_01_pca.Rmd +++ b/03-rnaseq/dimension-reduction_rnaseq_01_pca.Rmd @@ -38,16 +38,13 @@ Run this next chunk to set up your folders! If you have trouble running this chunk, see our [introduction to using `.Rmd`s](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) for more resources and explanations. ```{r} -# Define the file path to the data directory -data_dir <- file.path("data", "SRP133573") # Replace with path to desired data directory - # Create the data folder if it doesn't exist -if (!dir.exists(data_dir)) { - dir.create(data_dir) +if (!dir.exists("data")) { + dir.create("data") } # Define the file path to the plots directory -plots_dir <- "plots" # Can replace with path to desired output plots directory +plots_dir <- "plots" # Create the plots folder if it doesn't exist if (!dir.exists(plots_dir)) { @@ -55,7 +52,7 @@ if (!dir.exists(plots_dir)) { } # Define the file path to the results directory -results_dir <- "results" # Can replace with path to desired output results directory +results_dir <- "results" # Create the results folder if it doesn't exist if (!dir.exists(results_dir)) { @@ -73,17 +70,15 @@ Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/SRP Click the "Download Now" button on the right side of this screen. - - -Fill out the pop up window with your email and our Terms and Conditions: - - + +Fill out the pop up window with your email and our Terms and Conditions: + We are going to use non-quantile normalized data for this analysis. To get this data, you will need to check the box that says "Skip quantile normalization for RNA-seq samples". Note that this option will only be available for RNA-seq datasets. - + It may take a few minutes for the dataset to process. 
You will get an email when it is ready. @@ -99,11 +94,11 @@ Patients underwent androgen deprivation therapy (ADT) and RNA-seq samples includ ## Place the dataset in your new `data/` folder -Refine.bio will send you a download button in the email when it is ready. +refine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in `.zip`. Double clicking should unzip this for you and create a folder of the same name. - + For more details on the contents of this folder see [these docs on refine.bio](http://docs.refine.bio/en/latest/main_text.html#rna-seq-sample-compendium-download-folder). @@ -125,7 +120,7 @@ Your new analysis folder should contain: - A folder for `results` (currently empty) Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using): - + In order for our example here to run without a hitch, we need these files to be in these locations so we've constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place. 
@@ -135,19 +130,24 @@ This is handy to do because if we want to switch the dataset (see next section f ```{r} # Define the file path to the data directory -data_dir <- file.path("data", "SRP133573") # Replace with accession number which will be the name of the folder the files will be in - -# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir` -data_file <- file.path(data_dir, "SRP133573.tsv") # Replace with file path to your dataset - -# Declare the file path to the metadata file using the data directory saved as `data_dir` -metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv") # Replace with file path to your metadata +# Replace with the path of the folder the files will be in +data_dir <- file.path("data", "SRP133573") + +# Declare the file path to the gene expression matrix file +# inside directory saved as `data_dir` +# Replace with the path to your dataset file +data_file <- file.path(data_dir, "SRP133573.tsv") + +# Declare the file path to the metadata file +# inside the directory saved as `data_dir` +# Replace with the path to your metadata file +metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv") ``` Now that our file paths are declared, we can use the `file.exists()` function to check that the files are where we specified above. 
```{r} -# Check if the gene expression matrix file is at the file path stored in `data_file` +# Check if the gene expression matrix file is at the path stored in `data_file` file.exists(data_file) # Check if the metadata file is at the file path stored in `metadata_file` @@ -186,7 +186,7 @@ if (!("DESeq2" %in% installed.packages())) { Attach the `DESeq2` and `ggplot2` libraries: -```{r} +```{r message=FALSE} # Attach the `DESeq2` library library(DESeq2) @@ -212,105 +212,115 @@ We stored our file paths as objects named `metadata_file` and `data_file` in [th metadata <- readr::read_tsv(metadata_file) # Read in data TSV file -df <- readr::read_tsv(data_file) %>% - # Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later +expression_df <- readr::read_tsv(data_file) %>% + # Tuck away the gene ID column as row names, leaving only numeric values tibble::column_to_rownames("Gene") ``` Let's ensure that the metadata and data are in the same sample order. ```{r} -# Make the data in the order of the metadata -df <- df %>% dplyr::select(metadata$refinebio_accession_code) +# Make the sure the columns (samples) are in the same order as the metadata +expression_df <- expression_df %>% + dplyr::select(metadata$refinebio_accession_code) # Check if this is in the same order -all.equal(colnames(df), metadata$refinebio_accession_code) +all.equal(colnames(expression_df), metadata$refinebio_accession_code) ``` - Now we are going to use a combination of functions from the `DESeq2` and `ggplot2` packages to perform and visualize the results of the Principal Component Analysis (PCA) dimension reduction technique on our pre-ADT and post-ADT samples. -### Prepare data for `DESeq2` - -We need to make sure all of the values in our data are converted to integers as required by a `DESeq2` function we will use later. 
- -```{r} -# The `DESeqDataSetFromMatrix()` function needs the values to be converted to integers -df <- df %>% - # Mutate numeric variables to be integers - dplyr::mutate_if(is.numeric, round) -``` - ### Prepare metadata for `DESEq2` -We need to make sure all of the metadata column variables, that we would like to use to annotate our plot, are converted into factors. +We need to make sure all of the metadata column variables that we would like to use to annotate our plot, are converted into factors. ```{r} -# We need to also format the variables from the metadata, that we will be using for annotation of the PCA plot, into factors +# convert the columns we will be using for annotation into factors metadata <- metadata %>% dplyr::mutate( - refinebio_treatment = as.factor(refinebio_treatment), + refinebio_treatment = factor( + refinebio_treatment, + # specify the possible levels in the order we want them to appear + levels = c("pre-adt", "post-adt") + ), refinebio_disease = as.factor(refinebio_disease) ) ``` +## Define a minimum counts cutoff + +We want to filter out the genes that have not been expressed or that have low expression counts since these genes are likely to add noise rather than useful signal to our analysis. +We are going to do some pre-filtering to keep only genes with 10 or more reads total. +This threshold might vary depending on the number of samples and expression patterns across your data set. +Note that rows represent gene data and the columns represent sample data in our dataset. + +```{r} +# Define a minimum counts cutoff and filter the data to include +# only rows (genes) that have total counts above the cutoff +filtered_expression_df <- expression_df %>% + dplyr::filter(rowSums(.) >= 10) +``` + +We also need our counts to be rounded before we can use them with the `DESeqDataSetFromMatrix()` function. 
+ +```{r} +# The `DESeqDataSetFromMatrix()` function needs the values to be integers +filtered_expression_df <- round(filtered_expression_df) +``` + ## Create a DESeqDataset We will be using the `DESeq2` package for [normalizing and transforming our data](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#deseq2-transformation-methods), which requires us to format our data into a `DESeqDataSet` object. -We turn the data frame (or matrix) into a [`DESeqDataSet` object](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#02_About_DESeq2). ) and specify which variable labels our experimental groups using the [`design` argument](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#multi-factor-designs) [@Love2014]. +We turn the data frame (or matrix) into a [`DESeqDataSet` object](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#02_About_DESeq2) and specify which variable labels our experimental groups using the [`design` argument](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#multi-factor-designs) [@Love2014]. In this chunk of code, we will not provide a specific model to the `design` argument because we are not performing a differential expression analysis. 
```{r} # Create a `DESeqDataSet` object dds <- DESeqDataSetFromMatrix( - countData = df, # This is the data frame with the counts values for all replicates in our dataset - colData = metadata, # This is the data frame with the annotation data for the replicates in the counts data frame - design = ~1 # Here we are not specifying a model -- Replace with an appropriate design variable for your analysis + countData = filtered_expression_df, # the counts values for all samples in our dataset + colData = metadata, # annotation data for the samples in the counts data frame + design = ~1 # Here we are not specifying a model + # Replace with an appropriate design variable for your analysis ) ``` -## Define a minimum counts cutoff - -We want to filter out the genes that have not been expressed or that have low expression counts because we want to remove any possible noise from our data before we normalize the data and create our plot. -We are going to do some pre-filtering to keep only genes with 10 or more reads total. -Note that rows represent gene data and the columns represent sample data in our dataset. - -```{r} -# Define a minimum counts cutoff and filter `DESeqDataSet` object to include -# only rows that have counts above the cutoff -genes_to_keep <- rowSums(counts(dds)) >= 10 -dds <- dds[genes_to_keep, ] -``` - ## Perform DESeq2 normalization and transformation We are going to use the `vst()` function from the `DESeq2` package to normalize and transform the data. For more information about these transformation methods, [see here](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#deseq2-transformation-methods). 
```{r} -# Normalize and transform the data in the `DESeqDataSet` object using the `vst()` function from the `DESEq2` R package +# Normalize and transform the data in the `DESeqDataSet` object +# using the `vst()` function from the `DESeq2` R package dds_norm <- vst(dds) ``` ## Create PCA plot using DESeq2 -In this code chunk, the variable `refinebio_treatment` is given to the `plotPCA()` function as part of the goal of the experiment is to analyze the sample transcriptional responses to androgen deprivation therapy (ADT). +DESeq2 has built-in functions to calculate and plot PCA values, which we will use here. +The `plotPCA()` function allows us to specify our group of interest with the `intgroup` argument, which will be used to color the points in our plot. +In this code chunk, we are using `refinebio_treatment` as the grouping variable, + as part of the goal of the experiment was to analyze the sample transcriptional responses to androgen deprivation therapy (ADT). ```{r} -plotPCA(dds_norm, +plotPCA( + dds_norm, intgroup = "refinebio_treatment" ) ``` -In this chunk, we are going to add another variable to our plot for annotation. +In the next chunk, we are going to add another variable to our plot for annotation. Now we'll plot the PCA using both `refinebio_treatment` and `refinebio_disease` variables for labels since they are central to the androgen deprivation therapy (ADT) based hypothesis in the [original paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6210624/) [@Sharma2018]. 
```{r} -plotPCA(dds_norm, - intgroup = c("refinebio_treatment", "refinebio_disease") # Note that we are able to add another variable to the intgroup argument here by providing a vector of the variable names with `c()` function +plotPCA( + dds_norm, + intgroup = c("refinebio_treatment", "refinebio_disease") + # We are able to add another variable to the intgroup argument + # by providing a vector of the variable names with `c()` function ) ``` @@ -322,7 +332,7 @@ First let's use `plotPCA()` to receive and store the PCA values for plotting. ```{r} # We first have to save the results of the `plotPCA()` function for use with `ggplot2` -pcaData <- +pca_results <- plotPCA( dds_norm, intgroup = c("refinebio_treatment", "refinebio_disease"), @@ -330,19 +340,26 @@ pcaData <- ) ``` -Now let's plot our `pcaData` using `ggplot2` functionality. +Now let's plot our `pca_results` using `ggplot2` functionality. ```{r} -# Plot using `ggplot()` function +# Plot using `ggplot()` function and save to an object annotated_pca_plot <- ggplot( - pcaData, + pca_results, aes( x = PC1, y = PC2, - color = refinebio_treatment, # This will label points with different colors for each `refinebio_disease` group - shape = refinebio_disease # This will label points with different shapes for each `refinebio_disease` group + # plot points with different colors for each `refinebio_treatment` group + color = refinebio_treatment, + # plot points with different shapes for each `refinebio_disease` group + shape = refinebio_disease ) -) +) + + # Make a scatter plot + geom_point() + +# display annotated plot +annotated_pca_plot ``` ## Save annotated PCA plot as a PNG @@ -351,8 +368,10 @@ You can easily switch this to save to a JPEG or TIFF by changing the file name w ```{r} # Save plot using `ggsave()` function -ggsave(file.path(plots_dir, "SRP133573_pca_plot.png"), # Replace with name relevant your plotted data - plot = annotated_pca_plot # Here we are giving the function the plot object that we want saved 
to file
+ggsave(
+  file.path(plots_dir, "SRP133573_pca_plot.png"),
+  # Replace with a file name relevant to your plotted data
+  plot = annotated_pca_plot # the plot object that we want saved to file
 )
 ```
diff --git a/03-rnaseq/dimension-reduction_rnaseq_01_pca.html b/03-rnaseq/dimension-reduction_rnaseq_01_pca.html
index ccc687a0..8067961b 100644
--- a/03-rnaseq/dimension-reduction_rnaseq_01_pca.html
+++ b/03-rnaseq/dimension-reduction_rnaseq_01_pca.html
@@ -1263,25 +1263,22 @@ };
@@ -2874,15 +3686,20 @@
@@ -2971,29 +3797,26 @@

    2.1 Obtain the .Rmd

    2.2 Set up your analysis folders

    Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

    If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

    -
    # Define the file path to the data directory
    -data_dir <- file.path("data", "SRP133573") # Replace with path to desired data directory
    -
    -# Create the data folder if it doesn't exist
    -if (!dir.exists(data_dir)) {
    -  dir.create(data_dir)
    -}
    -
    -# Define the file path to the plots directory
    -plots_dir <- "plots" # Can replace with path to desired output plots directory
    -
    -# Create the plots folder if it doesn't exist
    -if (!dir.exists(plots_dir)) {
    -  dir.create(plots_dir)
    -}
    -
    -# Define the file path to the results directory
    -results_dir <- "results" # Can replace with path to desired output results directory
    -
    -# Create the results folder if it doesn't exist
    -if (!dir.exists(results_dir)) {
    -  dir.create(results_dir)
    -}
    +
    # Create the data folder if it doesn't exist
    +if (!dir.exists("data")) {
    +  dir.create("data")
    +}
    +
    +# Define the file path to the plots directory
    +plots_dir <- "plots"
    +
    +# Create the plots folder if it doesn't exist
    +if (!dir.exists(plots_dir)) {
    +  dir.create(plots_dir)
    +}
    +
    +# Define the file path to the results directory
    +results_dir <- "results"
    +
    +# Create the results folder if it doesn't exist
    +if (!dir.exists(results_dir)) {
    +  dir.create(results_dir)
    +}

    In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!
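
The folder setup above repeats the same check-then-create pattern for each folder. As a minimal sketch, that pattern could be wrapped in a small function. The helper name `ensure_dir()` and the temporary base folder are assumptions for illustration and are not part of the original notebook.

```r
# Hypothetical helper, not part of the original notebook:
# create a directory only if it does not already exist
ensure_dir <- function(path) {
  if (!dir.exists(path)) {
    # recursive = TRUE also creates any missing parent folders
    dir.create(path, recursive = TRUE)
  }
  # Return the path invisibly so the call can be used in an assignment
  invisible(path)
}

# A temporary location is used here so the sketch does not touch your project
base_dir <- file.path(tempdir(), "example-analysis")
data_dir <- ensure_dir(file.path(base_dir, "data"))
plots_dir <- ensure_dir(file.path(base_dir, "plots"))
results_dir <- ensure_dir(file.path(base_dir, "results"))
```

Calling `ensure_dir()` a second time on the same path does nothing, so the chunk stays safe to re-run.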

    @@ -3001,10 +3824,8 @@

    2.3 Obtain the dataset from refin

    For general information about downloading data for these examples, see our ‘Getting Started’ section.

    Go to this dataset’s page on refine.bio.

    Click the “Download Now” button on the right side of this screen.

    -

    -

    Fill out the pop up window with your email and our Terms and Conditions:

    -

    -

    We are going to use non-quantile normalized data for this analysis. To get this data, you will need to check the box that says “Skip quantile normalization for RNA-seq samples”. Note that this option will only be available for RNA-seq datasets.

    +

    Fill out the pop up window with your email and our Terms and Conditions:

    +

    We are going to use non-quantile normalized data for this analysis. To get this data, you will need to check the box that says “Skip quantile normalization for RNA-seq samples”. Note that this option will only be available for RNA-seq datasets.

    It may take a few minutes for the dataset to process. You will get an email when it is ready.

    @@ -3016,8 +3837,8 @@

    2.4 About the dataset we are usin

    2.5 Place the dataset in your new data/ folder

    -

    Refine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip. Double clicking should unzip this for you and create a folder of the same name.

    -

    +

    refine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip. Double clicking should unzip this for you and create a folder of the same name.

    +

    For more details on the contents of this folder see these docs on refine.bio.

    The <experiment_accession_id> folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235 or SRP12345.

    Copy and paste the SRP133573 folder into your newly created data/ folder.

    @@ -3043,23 +3864,28 @@

    2.6 Check out our file structure!
  • A folder for results (currently empty)
    Your example analysis folder should now look something like this (except with the experiment accession ID and analysis notebook name you are using):
  • -

    +

    In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

    First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy because if we want to switch the dataset we are using for this analysis (see next section for more on this), we will only have to change the file path here to get started.

    -
    # Define the file path to the data directory
    -data_dir <- file.path("data", "SRP133573") # Replace with accession number which will be the name of the folder the files will be in
    -
    -# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
    -data_file <- file.path(data_dir, "SRP133573.tsv") # Replace with file path to your dataset
    -
    -# Declare the file path to the metadata file using the data directory saved as `data_dir`
    -metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv") # Replace with file path to your metadata
    +
    # Define the file path to the data directory
    +# Replace with the path of the folder the files will be in
    +data_dir <- file.path("data", "SRP133573")
    +
    +# Declare the file path to the gene expression matrix file
    +# inside the directory saved as `data_dir`
    +# Replace with the path to your dataset file
    +data_file <- file.path(data_dir, "SRP133573.tsv")
    +
    +# Declare the file path to the metadata file
    +# inside the directory saved as `data_dir`
    +# Replace with the path to your metadata file
    +metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv")

    Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

    -
    # Check if the gene expression matrix file is at the file path stored in `data_file`
    -file.exists(data_file)
    +
    # Check if the gene expression matrix file is at the path stored in `data_file`
    +file.exists(data_file)
    ## [1] TRUE
    -
    # Check if the metadata file is at the file path stored in `metadata_file`
    -file.exists(metadata_file)
    +
    # Check if the metadata file is at the file path stored in `metadata_file`
    +file.exists(metadata_file)
    ## [1] TRUE

    If the chunk above printed out FALSE to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.

    If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.
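
The `file.exists()` checks above print `TRUE` or `FALSE` but do not stop the analysis when a file is missing. A minimal sketch of a stricter check follows. The helper name `check_files_exist()` is hypothetical, and a temporary file stands in for the real data file.

```r
# Hypothetical helper, not part of the original notebook:
# stop with a readable message if any expected input file is missing
check_files_exist <- function(paths) {
  missing_files <- paths[!file.exists(paths)]
  if (length(missing_files) > 0) {
    stop(
      "Could not find: ", paste(missing_files, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Demonstration with a temporary file standing in for a real data file
example_file <- tempfile(fileext = ".tsv")
file.create(example_file)
check_files_exist(example_file)
```

In the notebook itself, the same call could be given `c(data_file, metadata_file)` so that a misplaced download is reported before any analysis code runs.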

    @@ -3077,82 +3903,32 @@

    4 PCA Visualization - RNA-seq

    4.1 Install libraries

    See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

    -

    In this analysis, we will be using the R package DESeq2 (Love et al. 2014) for normalization and production of PCA values and the R package ggplot2 (Prabhakaran 2016) for plotting the PCA values.

    -
    if (!("DESeq2" %in% installed.packages())) {
    -  # Install DESeq2
    -  BiocManager::install("DESeq2", update = FALSE)
    -}
    +

    In this analysis, we will be using the R package DESeq2 (Love et al. 2014) for normalization and production of PCA values and the R package ggplot2 (Prabhakaran 2016) for plotting the PCA values.

    +
    if (!("DESeq2" %in% installed.packages())) {
    +  # Install DESeq2
    +  BiocManager::install("DESeq2", update = FALSE)
    +}

    Attach the DESeq2 and ggplot2 libraries:

    -
    # Attach the `DESeq2` library
    -library(DESeq2)
    -
    ## Loading required package: S4Vectors
    -
    ## Loading required package: stats4
    -
    ## Loading required package: BiocGenerics
    -
    ## Loading required package: parallel
    -
    ## 
    -## Attaching package: 'BiocGenerics'
    -
    ## The following objects are masked from 'package:parallel':
    -## 
    -##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    -##     clusterExport, clusterMap, parApply, parCapply, parLapply,
    -##     parLapplyLB, parRapply, parSapply, parSapplyLB
    -
    ## The following objects are masked from 'package:stats':
    -## 
    -##     IQR, mad, sd, var, xtabs
    -
    ## The following objects are masked from 'package:base':
    -## 
    -##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    -##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    -##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    -##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    -##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    -##     union, unique, unsplit, which, which.max, which.min
    -
    ## 
    -## Attaching package: 'S4Vectors'
    -
    ## The following object is masked from 'package:base':
    -## 
    -##     expand.grid
    -
    ## Loading required package: IRanges
    -
    ## Loading required package: GenomicRanges
    -
    ## Loading required package: GenomeInfoDb
    -
    ## Loading required package: SummarizedExperiment
    -
    ## Loading required package: Biobase
    -
    ## Welcome to Bioconductor
    -## 
    -##     Vignettes contain introductory material; view with
    -##     'browseVignettes()'. To cite Bioconductor, see
    -##     'citation("Biobase")', and for packages 'citation("pkgname")'.
    -
    ## Loading required package: DelayedArray
    -
    ## Loading required package: matrixStats
    -
    ## 
    -## Attaching package: 'matrixStats'
    -
    ## The following objects are masked from 'package:Biobase':
    -## 
    -##     anyMissing, rowMedians
    -
    ## 
    -## Attaching package: 'DelayedArray'
    -
    ## The following objects are masked from 'package:matrixStats':
    -## 
    -##     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
    -
    ## The following objects are masked from 'package:base':
    -## 
    -##     aperm, apply, rowsum
    -
    # Attach the `ggplot2` library for plotting
    -library(ggplot2)
    -
    -# We will need this so we can use the pipe: %>%
    -library(magrittr)
    -
    -# Set the seed so our results are reproducible:
    -set.seed(12345)
    +
    # Attach the `DESeq2` library
    +library(DESeq2)
    +
    +# Attach the `ggplot2` library for plotting
    +library(ggplot2)
    +
    +# We will need this so we can use the pipe: %>%
    +library(magrittr)
    +
    +# Set the seed so our results are reproducible:
    +set.seed(12345)

    4.2 Import and set up data

    Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read both TSV files and add them as data frames to your environment.

    We stored our file paths as objects named metadata_file and data_file in this previous step.

    -
    # Read in metadata TSV file
    -metadata <- readr::read_tsv(metadata_file)
    -
    ## Parsed with column specification:
    +
    # Read in metadata TSV file
    +metadata <- readr::read_tsv(metadata_file)
    +
    ## 
    +## ── Column specification ──────────────────────────────────────────────
     ## cols(
     ##   .default = col_character(),
     ##   refinebio_age = col_logical(),
    @@ -3165,113 +3941,129 @@ 

    4.2 Import and set up data

     ##   refinebio_source_archive_url = col_logical(),
     ##   refinebio_specimen_part = col_logical(),
     ##   refinebio_time = col_logical()
    -## )
    -
    ## See spec(...) for full column specifications.
    -
    # Read in data TSV file
    -df <- readr::read_tsv(data_file) %>%
    -  # Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later
    -  tibble::column_to_rownames("Gene")
    -
    ## Parsed with column specification:
    +## )
    +## ℹ Use `spec()` for the full column specifications.
    +
    # Read in data TSV file
    +expression_df <- readr::read_tsv(data_file) %>%
    +  # Tuck away the gene ID column as row names, leaving only numeric values
    +  tibble::column_to_rownames("Gene")
    +
    ## 
    +## ── Column specification ──────────────────────────────────────────────
     ## cols(
     ##   .default = col_double(),
     ##   Gene = col_character()
     ## )
    -## See spec(...) for full column specifications.
    +## ℹ Use `spec()` for the full column specifications.

    Let’s ensure that the metadata and data are in the same sample order.

    -
    # Make the data in the order of the metadata
    -df <- df %>% dplyr::select(metadata$refinebio_accession_code)
    -
    -# Check if this is in the same order
    -all.equal(colnames(df), metadata$refinebio_accession_code)
    +
    # Make sure the columns (samples) are in the same order as the metadata
    +expression_df <- expression_df %>%
    +  dplyr::select(metadata$refinebio_accession_code)
    +
    +# Check if this is in the same order
    +all.equal(colnames(expression_df), metadata$refinebio_accession_code)
    ## [1] TRUE
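
To see what the reordering step does, here is a small base R sketch on made-up data (the sample IDs and values are invented for illustration). It also uses `identical()` for the check: `all.equal()` returns a description of the differences rather than `FALSE` when the two vectors do not match, so `identical()` or `isTRUE(all.equal(...))` is safer if the result feeds an `if` statement.

```r
# Toy stand-ins for the real data; sample IDs and values are made up
toy_metadata <- data.frame(
  refinebio_accession_code = c("sample_A", "sample_B", "sample_C"),
  stringsAsFactors = FALSE
)
toy_expression <- data.frame(
  sample_C = c(5, 0),
  sample_A = c(1, 2),
  sample_B = c(3, 4),
  row.names = c("gene_1", "gene_2")
)

# Reorder the columns (samples) to follow the metadata order
toy_expression <- toy_expression[, toy_metadata$refinebio_accession_code]

# identical() returns a single TRUE or FALSE
same_order <- identical(
  colnames(toy_expression),
  toy_metadata$refinebio_accession_code
)
```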
    -

    Now we are going to use a combination of functions from the DESeq2 and ggplot2 packages to perform and visualize the results of the Principal Component Analysis (PCA) dimension reduction technique on our pre-ADT and post-ADT samples.

    -
    -

    4.2.1 Prepare data for DESeq2

    -

    We need to make sure all of the values in our data are converted to integers as required by a DESeq2 function we will use later.

    -
    # The `DESeqDataSetFromMatrix()` function needs the values to be converted to integers
    -df <- df %>%
    -  # Mutate numeric variables to be integers
    -  dplyr::mutate_if(is.numeric, round)
    -
    -

    4.2.2 Prepare metadata for DESEq2

    -

    We need to make sure all of the metadata column variables, that we would like to use to annotate our plot, are converted into factors.

    -
    # We need to also format the variables from the metadata, that we will be using for annotation of the PCA plot, into factors
    -metadata <- metadata %>%
    -  dplyr::mutate(
    -    refinebio_treatment = as.factor(refinebio_treatment),
    -    refinebio_disease = as.factor(refinebio_disease)
    -  )
    +

    4.2.1 Prepare metadata for DESeq2

    +

    We need to make sure all of the metadata column variables that we would like to use to annotate our plot are converted into factors.

    +
    # convert the columns we will be using for annotation into factors
    +metadata <- metadata %>%
    +  dplyr::mutate(
    +    refinebio_treatment = factor(
    +      refinebio_treatment,
    +      # specify the possible levels in the order we want them to appear
    +      levels = c("pre-adt", "post-adt")
    +    ),
    +    refinebio_disease = as.factor(refinebio_disease)
    +  )
    +
    +
    +
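
The switch from `as.factor()` to `factor()` with explicit `levels` controls the order in which groups appear. A minimal sketch with made-up treatment labels shows the difference, along with one thing to watch for: values that are not listed in `levels` become `NA`.

```r
# Made-up treatment labels standing in for `refinebio_treatment`
treatment <- c("post-adt", "pre-adt", "post-adt", "pre-adt")

# as.factor() orders the levels alphabetically
alphabetical_levels <- levels(as.factor(treatment))

# factor() with explicit levels keeps the order we ask for
treatment_factor <- factor(treatment, levels = c("pre-adt", "post-adt"))
ordered_levels <- levels(treatment_factor)

# Values not listed in `levels` become NA, so it is worth
# counting missing values after the conversion
mismatched <- factor(
  c("pre-adt", "Post-ADT"), # note the different capitalization
  levels = c("pre-adt", "post-adt")
)
n_missing <- sum(is.na(mismatched))
```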

    4.3 Define a minimum counts cutoff

    +

    We want to filter out the genes that have not been expressed or that have low expression counts since these genes are likely to add noise rather than useful signal to our analysis. We are going to do some pre-filtering to keep only genes with 10 or more reads total. This threshold might vary depending on the number of samples and expression patterns across your data set. Note that rows represent gene data and the columns represent sample data in our dataset.

    +
    # Define a minimum counts cutoff and filter the data to include
    +# only rows (genes) that have total counts above the cutoff
    +filtered_expression_df <- expression_df %>%
    +  dplyr::filter(rowSums(.) >= 10)
    +

    We also need our counts to be rounded before we can use them with the DESeqDataSetFromMatrix() function.

    +
    # The `DESeqDataSetFromMatrix()` function needs the values to be integers
    +filtered_expression_df <- round(filtered_expression_df)
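
As a minimal sketch of the filter-then-round steps, here is the same logic in base R on a made-up counts table. The gene names, sample names, values, and the `min_counts` variable are assumptions for illustration.

```r
# Toy counts table: rows are genes, columns are samples (values are made up)
toy_counts <- data.frame(
  sample_A = c(0.0, 12.4, 3.6),
  sample_B = c(1.2, 30.9, 2.1),
  sample_C = c(0.0, 8.6, 4.4),
  row.names = c("gene_1", "gene_2", "gene_3")
)

# Total counts per gene across all samples
gene_totals <- rowSums(toy_counts)

# Keep genes whose total is at least the cutoff
min_counts <- 10
filtered_counts <- toy_counts[gene_totals >= min_counts, ]

# Round to whole numbers, as `DESeqDataSetFromMatrix()` needs integer counts
filtered_counts <- round(filtered_counts)
```

Because the filter runs before the rounding, the cutoff is applied to the unrounded totals, which matches the order of the two chunks above.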
    -

    4.3 Create a DESeqDataset

    -

    We will be using the DESeq2 package for normalizing and transforming our data, which requires us to format our data into a DESeqDataSet object. We turn the data frame (or matrix) into a DESeqDataSet object. ) and specify which variable labels our experimental groups using the design argument (Love et al. 2014). In this chunk of code, we will not provide a specific model to the design argument because we are not performing a differential expression analysis.

    -
    # Create a `DESeqDataSet` object
    -dds <- DESeqDataSetFromMatrix(
    -  countData = df, # This is the data frame with the counts values for all replicates in our dataset
    -  colData = metadata, # This is the data frame with the annotation data for the replicates in the counts data frame
    -  design = ~1 # Here we are not specifying a model -- Replace with an appropriate design variable for your analysis
    -)
    +

    4.4 Create a DESeqDataset

    +

    We will be using the DESeq2 package for normalizing and transforming our data, which requires us to format our data into a DESeqDataSet object. We turn the data frame (or matrix) into a DESeqDataSet object and specify which variable labels our experimental groups using the design argument (Love et al. 2014). In this chunk of code, we will not provide a specific model to the design argument because we are not performing a differential expression analysis.

    +
    # Create a `DESeqDataSet` object
    +dds <- DESeqDataSetFromMatrix(
    +  countData = filtered_expression_df, # the counts values for all samples in our dataset
    +  colData = metadata, # annotation data for the samples in the counts data frame
    +  design = ~1 # Here we are not specifying a model
    +  # Replace with an appropriate design variable for your analysis
    +)
    ## converting counts to integer mode
    -
    -

    4.4 Define a minimum counts cutoff

    -

    We want to filter out the genes that have not been expressed or that have low expression counts because we want to remove any possible noise from our data before we normalize the data and create our plot. We are going to do some pre-filtering to keep only genes with 10 or more reads total. Note that rows represent gene data and the columns represent sample data in our dataset.

    -
    # Define a minimum counts cutoff and filter `DESeqDataSet` object to include
    -# only rows that have counts above the cutoff
    -genes_to_keep <- rowSums(counts(dds)) >= 10
    -dds <- dds[genes_to_keep, ]
    -

    4.5 Perform DESeq2 normalization and transformation

    We are going to use the vst() function from the DESeq2 package to normalize and transform the data. For more information about these transformation methods, see here.

    -
    # Normalize and transform the data in the `DESeqDataSet` object using the `vst()` function from the `DESEq2` R package
    -dds_norm <- vst(dds)
    +
    # Normalize and transform the data in the `DESeqDataSet` object
    +# using the `vst()` function from the `DESeq2` R package
    +dds_norm <- vst(dds)

    4.6 Create PCA plot using DESeq2

    -

    In this code chunk, the variable refinebio_treatment is given to the plotPCA() function as part of the goal of the experiment is to analyze the sample transcriptional responses to androgen deprivation therapy (ADT).

    -
    plotPCA(dds_norm,
    -  intgroup = "refinebio_treatment"
    -)
    -

    -

    In this chunk, we are going to add another variable to our plot for annotation.

    -

    Now we’ll plot the PCA using both refinebio_treatment and refinebio_disease variables for labels since they are central to the androgen deprivation therapy (ADT) based hypothesis in the original paper (Sharma et al. 2018).

    -
    plotPCA(dds_norm,
    -  intgroup = c("refinebio_treatment", "refinebio_disease") # Note that we are able to add another variable to the intgroup argument here by providing a vector of the variable names with  `c()` function
    -)
    +

    DESeq2 has built-in functions to calculate and plot PCA values, which we will use here. The plotPCA() function allows us to specify our group of interest with the intgroup argument, which will be used to color the points in our plot. In this code chunk, we are using refinebio_treatment as the grouping variable, as part of the goal of the experiment was to analyze the sample transcriptional responses to androgen deprivation therapy (ADT).

    +
    plotPCA(
    +  dds_norm,
    +  intgroup = "refinebio_treatment"
    +)
    +

    +

    In the next chunk, we are going to add another variable to our plot for annotation.

    +

    Now we’ll plot the PCA using both refinebio_treatment and refinebio_disease variables for labels since they are central to the androgen deprivation therapy (ADT) based hypothesis in the original paper (Sharma et al. 2018).

    +
    plotPCA(
    +  dds_norm,
    +  intgroup = c("refinebio_treatment", "refinebio_disease")
    +  # We are able to add another variable to the intgroup argument
    +  # by providing a vector of the variable names with the `c()` function
    +)

    In the plot above, it is hard to distinguish the refinebio_treatment values, which indicate whether or not samples have been treated with ADT, from the refinebio_disease values, which refer to the method by which the samples were obtained from patients (e.g. biopsy).

    Let’s use the ggplot2 package functionality to customize our plot further and make the annotation labels easier to distinguish.

    First let’s use plotPCA() to receive and store the PCA values for plotting.

    -
    # We first have to save the results of the `plotPCA()` function for use with `ggplot2`
    -pcaData <-
    -  plotPCA(
    -    dds_norm,
    -    intgroup = c("refinebio_treatment", "refinebio_disease"),
    -    returnData = TRUE # This argument tells R to return the PCA values
    -  )
    -

    Now let’s plot our pcaData using ggplot2 functionality.

    -
    # Plot using `ggplot()` function
    -annotated_pca_plot <- ggplot(
    -  pcaData,
    -  aes(
    -    x = PC1,
    -    y = PC2,
    -    color = refinebio_treatment, # This will label points with different colors for each `refinebio_disease` group
    -    shape = refinebio_disease # This will label points with different shapes for each `refinebio_disease` group
    -  )
    -)
    +
    # We first have to save the results of the `plotPCA()` function for use with `ggplot2`
    +pca_results <-
    +  plotPCA(
    +    dds_norm,
    +    intgroup = c("refinebio_treatment", "refinebio_disease"),
    +    returnData = TRUE # This argument tells R to return the PCA values
    +  )
    +

    Now let’s plot our pca_results using ggplot2 functionality.

    +
    # Plot using `ggplot()` function and save to an object
    +annotated_pca_plot <- ggplot(
    +  pca_results,
    +  aes(
    +    x = PC1,
    +    y = PC2,
    +    # plot points with different colors for each `refinebio_treatment` group
    +    color = refinebio_treatment,
    +    # plot points with different shapes for each `refinebio_disease` group
    +    shape = refinebio_disease
    +  )
    +) +
    +  # Make a scatter plot
    +  geom_point()
    +
    +# display annotated plot
    +annotated_pca_plot
    +
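
For intuition about the PC1 and PC2 values plotted above, here is a minimal base R sketch using `prcomp()` on a made-up matrix. It illustrates PCA in general and is not a reproduction of what `plotPCA()` does internally, which may differ in details such as which genes are used. All object names and values are assumptions for illustration.

```r
# Made-up normalized expression matrix: 6 genes (rows) x 4 samples (columns)
set.seed(12345)
toy_matrix <- matrix(
  rnorm(24),
  nrow = 6,
  dimnames = list(paste0("gene_", 1:6), paste0("sample_", 1:4))
)

# prcomp() expects samples in rows, so transpose first
pca <- prcomp(t(toy_matrix))

# Sample coordinates on the first two principal components
toy_pca_df <- data.frame(
  PC1 = pca$x[, "PC1"],
  PC2 = pca$x[, "PC2"],
  sample = rownames(pca$x)
)

# Percent of total variance explained by each component
percent_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)
```

A data frame shaped like `toy_pca_df` is the kind of input the `ggplot()` call above works with: one row per sample, with the component scores as columns.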

    4.7 Save annotated PCA plot as a PNG

    You can easily switch this to save to a JPEG or TIFF by changing the file suffix in the file name passed to the ggsave() function.

    -
    # Save plot using `ggsave()` function
    -ggsave(file.path(plots_dir, "SRP133573_pca_plot.png"), # Replace with name relevant your plotted data
    -  plot = annotated_pca_plot # Here we are giving the function the plot object that we want saved to file
    -)
    +
    # Save plot using `ggsave()` function
    +ggsave(
    +  file.path(plots_dir, "SRP133573_pca_plot.png"),
    +  # Replace with a file name relevant to your plotted data
    +  plot = annotated_pca_plot # the plot object that we want saved to file
    +)
    ## Saving 7 x 5 in image
    @@ -3279,17 +4071,17 @@

    4.7 Save annotated PCA plot as a

    5 Resources for further learning

diff --git a/03-rnaseq/dimension-reduction_rnaseq_02_umap.Rmd b/03-rnaseq/dimension-reduction_rnaseq_02_umap.Rmd
index a5d9d88b..d8fa4bef 100644
--- a/03-rnaseq/dimension-reduction_rnaseq_02_umap.Rmd
+++ b/03-rnaseq/dimension-reduction_rnaseq_02_umap.Rmd
@@ -44,7 +44,7 @@ if (!dir.exists("data")) {
 }

 # Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
+plots_dir <- "plots"

 # Create the plots folder if it doesn't exist
 if (!dir.exists(plots_dir)) {
@@ -52,7 +52,7 @@ if (!dir.exists(plots_dir)) {
 }

 # Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
+results_dir <- "results"

 # Create the results folder if it doesn't exist
 if (!dir.exists(results_dir)) {
@@ -70,11 +70,11 @@ Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/SRP

 Click the "Download Now" button on the right side of this screen.

-
+

 Fill out the pop up window with your email and our Terms and Conditions:

-
+
@@ -82,7 +82,7 @@ We are going to use non-quantile normalized data for this analysis.
 To get this data, you will need to check the box that says "Skip quantile normalization for RNA-seq samples".
 Note that this option will only be available for RNA-seq datasets.

-
+

 It may take a few minutes for the dataset to process.
 You will get an email when it is ready.
@@ -91,20 +91,19 @@ You will get an email when it is ready.

 For this example analysis, we will use this [prostate cancer dataset](https://www.refine.bio/experiments/SRP133573).

-
 The data that we downloaded from refine.bio for this analysis has 175 RNA-seq samples obtained from 20 patients with prostate cancer.
 Patients underwent androgen deprivation therapy (ADT) and RNA-seq samples include pre-ADT biopsies and post-ADT prostatectomy specimens.

 ## Place the dataset in your new `data/` folder

-Refine.bio will send you a download button in the email when it is ready.
+refine.bio will send you a download button in the email when it is ready.
 Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in `.zip`.
 Double clicking should unzip this for you and create a folder of the same name.

-
+

-For more details on the contents of this folder see [these docs on refine.bio](http://docs.refine.bio/en/latest/main_text.html#rna-seq-sample-compendium-download-folder).
+For more details on the contents of this folder see [these docs on refine.bio](http://docs.refine.bio/en/latest/main_text.html#rna-seq-sample-compendium-download-folder).

 The `<experiment_accession_id>` folder has the data and metadata TSV files you will need for this example analysis.
 Experiment accession ids usually look something like `GSE1235` or `SRP12345`.

@@ -135,19 +134,24 @@ This is handy to do because if we want to switch the dataset (see next section f

 ```{r}
 # Define the file path to the data directory
-data_dir <- file.path("data", "SRP133573") # Replace with accession number which will be the name of the folder the files will be in
-
-# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
-data_file <- file.path(data_dir, "SRP133573.tsv") # Replace with file path to your dataset
-
-# Declare the file path to the metadata file using the data directory saved as `data_dir`
-metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv") # Replace with file path to your metadata
+# Replace with the path of the folder the files will be in
+data_dir <- file.path("data", "SRP133573")
+
+# Declare the file path to the gene expression matrix file
+# inside directory saved as `data_dir`
+# Replace with the path to your dataset file
+data_file <- file.path(data_dir, "SRP133573.tsv")
+
+# Declare the file path to the metadata file
+# inside the directory saved as `data_dir`
+# Replace with the path to your metadata file
+metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv")
 ```

 Now that our file paths are declared, we can use the `file.exists()` function to check that the files are where we specified above.

 ```{r}
-# Check if the gene expression matrix file is at the file path stored in `data_file`
+# Check if the gene expression matrix file is at the path stored in `data_file`
 file.exists(data_file)

 # Check if the metadata file is at the file path stored in `metadata_file`
@@ -191,7 +195,7 @@ if (!("umap" %in% installed.packages())) {

 Attach the packages we need for this analysis:

-```{r}
+```{r message=FALSE}
 # Attach the `DESeq2` library
 library(DESeq2)

@@ -220,8 +224,8 @@ We stored our file paths as objects named `metadata_file` and `data_file` in [th
 metadata <- readr::read_tsv(metadata_file)

 # Read in data TSV file
-df <- readr::read_tsv(data_file) %>%
-  # Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later
+expression_df <- readr::read_tsv(data_file) %>%
+  # Tuck away the gene ID column as row names, leaving only numeric values
   tibble::column_to_rownames("Gene")
 ```

@@ -229,46 +233,57 @@ Let's ensure that the metadata and data are in the same sample order.

 ```{r}
 # Make the data in the order of the metadata
-df <- df %>%
+expression_df <- expression_df %>%
   dplyr::select(metadata$refinebio_accession_code)

 # Check if this is in the same order
-all.equal(colnames(df), metadata$refinebio_accession_code)
+all.equal(colnames(expression_df), metadata$refinebio_accession_code)
 ```

-
-
 Now we are going to use a combination of functions from the `DESeq2`, `umap`, and `ggplot2` packages to perform and visualize the results of the Uniform Manifold Approximation and Projection (UMAP) dimension reduction technique on our pre-ADT and post-ADT samples.

-## Prepare data for `DESeq2`
-
-We need to make sure all of the values in our data are converted to integers as required by a `DESeq2` function we will use later.
-
-```{r}
-# The `DESeqDataSetFromMatrix()` function needs the values to be converted to integers
-df <- df %>%
-  # Mutate numeric variables to be integers
-  dplyr::mutate_if(is.numeric, round)
-```
-
 ## Prepare metadata for `DESEq2`

 We need to make sure all of the metadata column variables, that we would like to use to annotate our plot, are converted into factors.

 ```{r}
-# We need to also format the variables from the metadata, that we will be using for annotation of the UMAP plot, into factors
+# convert the columns we will be using for annotation into factors
 metadata <- metadata %>%
-  dplyr::select( # The metadata has many variables, we want to select only those that we will need for plotting later
+  dplyr::select( # select only the columns that we will need for plotting
     refinebio_accession_code,
     refinebio_treatment,
     refinebio_disease
   ) %>%
   dplyr::mutate( # Now let's convert the annotation variables into factors
-    refinebio_treatment = as.factor(refinebio_treatment),
+    refinebio_treatment = factor(
+      refinebio_treatment,
+      # specify the possible levels in the order we want them to appear
+      levels = c("pre-adt", "post-adt")
+    ),
     refinebio_disease = as.factor(refinebio_disease)
   )
 ```

+## Define a minimum counts cutoff
+
+We want to filter out the genes that have not been expressed or that have low expression counts since these genes are likely to add noise rather than useful signal to our analysis.
+We are going to do some pre-filtering to keep only genes with 10 or more reads total.
+Note that rows represent gene data and the columns represent sample data in our dataset.
+
+```{r}
+# Define a minimum counts cutoff and filter the data to include
+# only rows (genes) that have total counts above the cutoff
+filtered_expression_df <- expression_df %>%
+  dplyr::filter(rowSums(.) >= 10)
+```
+
+We also need our counts to be rounded before we can use them with the `DESeqDataSetFromMatrix()` function.
+
+```{r}
+# The `DESeqDataSetFromMatrix()` function needs the values to be converted to integers
+filtered_expression_df <- round(filtered_expression_df)
+```
+
 ## Create a DESeqDataset

 We will be using the `DESeq2` package for [normalizing and transforming our data](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#deseq2-transformation-methods), which requires us to format our data into a `DESeqDataSet` object.
@@ -278,32 +293,21 @@ In this chunk of code, we will not provide a specific model to the `design` argu
 ```{r}
 # Create a `DESeqDataSet` object
 dds <- DESeqDataSetFromMatrix(
-  countData = df, # This is the data frame with the counts values for all replicates in our dataset
-  colData = metadata, # This is the data frame with the annotation data for the replicates in the counts data frame
-  design = ~1 # Here we are not specifying a model -- Replace with an appropriate design variable for your analysis
+  countData = filtered_expression_df, # the counts values for all samples in our dataset
+  colData = metadata, # annotation data for the samples in the counts data frame
+  design = ~1 # Here we are not specifying a model
+  # Replace with an appropriate design variable for your analysis
 )
 ```

-## Define a minimum counts cutoff
-
-We want to filter out the genes that have not been expressed or that have low expression counts because we want to remove any possible noise from our data before we normalize the data and create our plot.
-We are going to do some pre-filtering to keep only genes with 10 or more reads total.
-Note that rows represent gene data and the columns represent sample data in our dataset.
-
-```{r}
-# Define a minimum counts cutoff and filter `DESeqDataSet` object to include
-# only rows that have counts above the cutoff
-genes_to_keep <- rowSums(counts(dds)) >= 10
-dds <- dds[genes_to_keep, ]
-```
-
 ## Perform DESeq2 normalization and transformation

 We are going to use the `vst()` function from the `DESeq2` package to normalize and transform the data.
 For more information about these transformation methods, [see here](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#deseq2-transformation-methods).

 ```{r}
-# Normalize and transform the data in the `DESeqDataSet` object using the `vst()` function from the `DESEq2` R package
+# Normalize and transform the data in the `DESeqDataSet` object
+# using the `vst()` function from the `DESeq2` R package
 dds_norm <- vst(dds)
 ```

@@ -319,11 +323,12 @@ Here's some [guidance about choosing parameters](https://cran.r-project.org/web/
 You can also run the following in the RStudio console to get more information on the function and its default parameters: `?umap::umap` or `?umap::umap.defaults`.

 ```{r}
-# First we are going to retrieve the normalized data from the `DESeqDataSet` object using the `assay()` function
+# First we are going to retrieve the normalized data
+# from the `DESeqDataSet` object using the `assay()` function
 normalized_counts <- assay(dds_norm) %>%
-  t() # We need to transpose this data in preparation for the `umap()` function
+  t() # We need to transpose this data so each row is a sample

-# Now let's tell R to perform UMAP on the normalized data
+# Now perform UMAP on the normalized data
 umap_results <- umap::umap(normalized_counts)
 ```

@@ -332,12 +337,24 @@ umap_results <- umap::umap(normalized_counts)
 Now that we have the results from UMAP, we need to extract the counts data from the `umap_results` object and merge the variables from the metadata that we will use for annotating our plot.
```{r} -# We need to tell R to create a data frame with the umap values and annotation data in preparation for plotting with `ggplot2` +# Make into data frame for plotting with `ggplot2` +# The UMAP values we need for plotting are stored in the `layout` element umap_plot_df <- data.frame(umap_results$layout) %>% - tibble::rownames_to_column("refinebio_accession_code") %>% # Let's get the rownames into a column so we can merge the annotation data - dplyr::inner_join(metadata, by = "refinebio_accession_code") # Let's merge the annotation data using the `refinebio_accession_code` column + # Turn sample IDs stored as row names into a column + tibble::rownames_to_column("refinebio_accession_code") %>% + # Add the metadata into this data frame; match by sample IDs + dplyr::inner_join(metadata, by = "refinebio_accession_code") +``` + +Let's take a look at the data frame we created in the chunk above. + +```{r} +umap_plot_df ``` +Here we can see that UMAP took the data from thousands of genes, and reduced it to just two variables, `X1` and `X2`. + + ## Create UMAP plot Now we can use the `ggplot()` function to plot our normalized UMAP scores. @@ -351,7 +368,7 @@ ggplot( y = X2 ) ) + - geom_point() # This tells R that we want a scatterplot + geom_point() # Plot individual points to make a scatterplot ``` Let's try adding a variable to our plot for annotation. 
@@ -365,7 +382,7 @@ ggplot( aes( x = X1, y = X2, - color = refinebio_treatment # This will label points with different colors for each `refinebio_treatment` group + color = refinebio_treatment # label points with different colors for each `subgroup` ) ) + geom_point() # This tells R that we want a scatterplot @@ -376,17 +393,19 @@ In the next code chunk, we are going to add another variable to our plot for ann We'll plot using both `refinebio_treatment` and `refinebio_disease` variables for labels since they are central to the androgen deprivation therapy (ADT) based hypothesis in the [original paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6210624/) [@Sharma2018]. ```{r} -# Plot using `ggplot()` function +# Plot using `ggplot()` function and save to an object final_annotated_umap_plot <- ggplot( umap_plot_df, aes( x = X1, y = X2, - color = refinebio_treatment, # This will label points with different colors for each `refinebio_treatment` group - shape = refinebio_disease # This will label points with different shapes for each `refinebio_disease` group + # plot points with different colors for each `refinebio_treatment` group + color = refinebio_treatment, + # plot points with different shapes for each `refinebio_disease` group + shape = refinebio_disease ) ) + - geom_point() # This tells R that we want a scatterplot + geom_point() # make a scatterplot # Display the plot that we saved above final_annotated_umap_plot @@ -412,7 +431,11 @@ You can easily switch this to save to a JPEG or TIFF by changing the file name w ```{r} # Save plot using `ggsave()` function -ggsave(file.path(plots_dir, "SRP133573_umap_plot.png"), # Replace with name relevant your plotted data +ggsave( + file.path( + plots_dir, + "SRP133573_umap_plot.png" # Replace with a good file name for your plot + ), plot = final_annotated_umap_plot ) ``` diff --git a/03-rnaseq/dimension-reduction_rnaseq_02_umap.html b/03-rnaseq/dimension-reduction_rnaseq_02_umap.html index 7fa99d8a..98eda4b4 100644 --- 
a/03-rnaseq/dimension-reduction_rnaseq_02_umap.html +++ b/03-rnaseq/dimension-reduction_rnaseq_02_umap.html @@ -1263,25 +1263,22 @@ }; - - + + code.sourceCode > span { display: inline-block; line-height: 1.25; } + code.sourceCode > span { color: inherit; text-decoration: inherit; } + code.sourceCode > span:empty { height: 1.2em; } + .sourceCode { overflow: visible; } + code.sourceCode { white-space: pre; position: relative; } + div.sourceCode { margin: 1em 0; } + pre.sourceCode { margin: 0; } + @media screen { + div.sourceCode { overflow: auto; } + } + @media print { + code.sourceCode { white-space: pre-wrap; } + code.sourceCode > span { text-indent: -5em; padding-left: 5em; } + } + pre.numberSource code + { counter-reset: source-line 0; } + pre.numberSource code > span + { position: relative; left: -4em; counter-increment: source-line; } + pre.numberSource code > span > a:first-child::before + { content: counter(source-line); + position: relative; left: -1em; text-align: right; vertical-align: baseline; + border: none; display: inline-block; + -webkit-touch-callout: none; -webkit-user-select: none; + -khtml-user-select: none; -moz-user-select: none; + -ms-user-select: none; user-select: none; + padding: 0 4px; width: 4em; + color: #aaaaaa; + } + pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; } + div.sourceCode + { } + @media screen { + code.sourceCode > span > a:first-child::before { text-decoration: underline; } + } + code span.al { color: #ff0000; } /* Alert */ + code span.an { color: #008000; } /* Annotation */ + code span.at { } /* Attribute */ + code span.bu { } /* BuiltIn */ + code span.cf { color: #0000ff; } /* ControlFlow */ + code span.ch { color: #008080; } /* Char */ + code span.cn { } /* Constant */ + code span.co { color: #008000; } /* Comment */ + code span.cv { color: #008000; } /* CommentVar */ + code span.do { color: #008000; } /* Documentation */ + code span.er { color: #ff0000; font-weight: bold; } /* 
Error */ + code span.ex { } /* Extension */ + code span.im { } /* Import */ + code span.in { color: #008000; } /* Information */ + code span.kw { color: #0000ff; } /* Keyword */ + code span.op { } /* Operator */ + code span.ot { color: #ff4000; } /* Other */ + code span.pp { color: #ff4000; } /* Preprocessor */ + code span.sc { color: #008080; } /* SpecialChar */ + code span.ss { color: #008080; } /* SpecialString */ + code span.st { color: #008080; } /* String */ + code span.va { } /* Variable */ + code span.vs { color: #008080; } /* VerbatimString */ + code span.wa { color: #008000; font-weight: bold; } /* Warning */ + + + + - - + @@ -2874,15 +3686,20 @@ @@ -2971,26 +3797,26 @@

    2.1 Obtain the .Rmd

    2.2 Set up your analysis folders

    Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

    If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

    -
    # Create the data folder if it doesn't exist
    -if (!dir.exists("data")) {
    -  dir.create("data")
    -}
    -
    -# Define the file path to the plots directory
    -plots_dir <- "plots" # Can replace with path to desired output plots directory
    -
    -# Create the plots folder if it doesn't exist
    -if (!dir.exists(plots_dir)) {
    -  dir.create(plots_dir)
    -}
    -
    -# Define the file path to the results directory
    -results_dir <- "results" # Can replace with path to desired output results directory
    -
    -# Create the results folder if it doesn't exist
    -if (!dir.exists(results_dir)) {
    -  dir.create(results_dir)
    -}
    +
    # Create the data folder if it doesn't exist
    +if (!dir.exists("data")) {
    +  dir.create("data")
    +}
    +
    +# Define the file path to the plots directory
    +plots_dir <- "plots"
    +
    +# Create the plots folder if it doesn't exist
    +if (!dir.exists(plots_dir)) {
    +  dir.create(plots_dir)
    +}
    +
    +# Define the file path to the results directory
    +results_dir <- "results"
    +
    +# Create the results folder if it doesn't exist
    +if (!dir.exists(results_dir)) {
    +  dir.create(results_dir)
    +}

    In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

    @@ -3000,7 +3826,7 @@

    2.3 Obtain the dataset from refin

    Click the “Download Now” button on the right side of this screen.

    Fill out the pop up window with your email and our Terms and Conditions:

    -

    +

    We are going to use non-quantile normalized data for this analysis. To get this data, you will need to check the box that says “Skip quantile normalization for RNA-seq samples”. Note that this option will only be available for RNA-seq datasets.

    It may take a few minutes for the dataset to process. You will get an email when it is ready.

    @@ -3008,13 +3834,12 @@

    2.3 Obtain the dataset from refin

    2.4 About the dataset we are using for this example

    For this example analysis, we will use this prostate cancer dataset.

    -

    The data that we downloaded from refine.bio for this analysis has 175 RNA-seq samples obtained from 20 patients with prostate cancer. Patients underwent androgen deprivation therapy (ADT) and RNA-seq samples include pre-ADT biopsies and post-ADT prostatectomy specimens.

    2.5 Place the dataset in your new data/ folder

    -

    Refine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip. Double clicking should unzip this for you and create a folder of the same name.

    -

    +

    refine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip. Double clicking should unzip this for you and create a folder of the same name.

    +

    For more details on the contents of this folder see these docs on refine.bio.

    The <experiment_accession_id> folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235 or SRP12345.

    Copy and paste the SRP133573 folder into your newly created data/ folder.

    @@ -3042,20 +3867,25 @@

    2.6 Check out our file structure!

    In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

    First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy because if we want to switch the dataset we are using for this analysis (see next section for more on this), we will only have to change the file path here to get started.

    -
    # Define the file path to the data directory
    -data_dir <- file.path("data", "SRP133573") # Replace with accession number which will be the name of the folder the files will be in
    -
    -# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
    -data_file <- file.path(data_dir, "SRP133573.tsv") # Replace with file path to your dataset
    -
    -# Declare the file path to the metadata file using the data directory saved as `data_dir`
    -metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv") # Replace with file path to your metadata
    +
    # Define the file path to the data directory
    +# Replace with the path of the folder the files will be in
    +data_dir <- file.path("data", "SRP133573")
    +
    +# Declare the file path to the gene expression matrix file
    +# inside directory saved as `data_dir`
    +# Replace with the path to your dataset file
    +data_file <- file.path(data_dir, "SRP133573.tsv")
    +
    +# Declare the file path to the metadata file
    +# inside the directory saved as `data_dir`
    +# Replace with the path to your metadata file
    +metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv")

    Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

    -
    # Check if the gene expression matrix file is at the file path stored in `data_file`
    -file.exists(data_file)
    +
    # Check if the gene expression matrix file is at the path stored in `data_file`
    +file.exists(data_file)
    ## [1] TRUE
    -
    # Check if the metadata file is at the file path stored in `metadata_file`
    -file.exists(metadata_file)
    +
    # Check if the metadata file is at the file path stored in `metadata_file`
    +file.exists(metadata_file)
    ## [1] TRUE

    If the chunk above printed out FALSE to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.

    If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.
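As a minimal sketch of how this check could be made to fail loudly, the helper below stops with a readable message when any input file is missing. The function name `check_input_files()` and the temporary demo file are hypothetical and are not part of the original notebook.

```r
# Hypothetical helper (not in the original notebook): stop early with a
# readable message if any of the given file paths does not exist
check_input_files <- function(paths) {
  missing_files <- paths[!file.exists(paths)]
  if (length(missing_files) > 0) {
    stop("Missing input file(s): ", paste(missing_files, collapse = ", "))
  }
  invisible(TRUE)
}

# Demonstrate with a temporary file so the sketch runs anywhere
demo_file <- tempfile(fileext = ".tsv")
file.create(demo_file)
check_input_files(demo_file)
```

In the notebook itself, the equivalent call would presumably be `check_input_files(c(data_file, metadata_file))`.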

    @@ -3073,90 +3903,40 @@

    4 UMAP Visualization - RNA-seq

    4.1 Install libraries

    See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

    -

    In this analysis, we will be using the R package DESeq2 (Love et al. 2014) for normalization, the R package umap (Konopka 2020) for the production of UMAP dimension reduction values and the R package , and the R package ggplot2 (Prabhakaran 2016) for plotting the UMAP values.

    -
    if (!("DESeq2" %in% installed.packages())) {
    -  # Install DESeq2
    -  BiocManager::install("DESeq2", update = FALSE)
    -}
    -
    -if (!("umap" %in% installed.packages())) {
    -  # Install umap package
    -  BiocManager::install("umap", update = FALSE)
    -}
    +

    In this analysis, we will be using the R package DESeq2 (Love et al. 2014) for normalization, the R package umap (Konopka 2020) for the production of UMAP dimension reduction values, and the R package ggplot2 (Prabhakaran 2016) for plotting the UMAP values.

    +
    if (!("DESeq2" %in% installed.packages())) {
    +  # Install DESeq2
    +  BiocManager::install("DESeq2", update = FALSE)
    +}
    +
    +if (!("umap" %in% installed.packages())) {
    +  # Install umap package
    +  BiocManager::install("umap", update = FALSE)
    +}

    Attach the packages we need for this analysis:

    -
    # Attach the `DESeq2` library
    -library(DESeq2)
    -
    ## Loading required package: S4Vectors
    -
    ## Loading required package: stats4
    -
    ## Loading required package: BiocGenerics
    -
    ## Loading required package: parallel
    -
    ## 
    -## Attaching package: 'BiocGenerics'
    -
    ## The following objects are masked from 'package:parallel':
    -## 
    -##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    -##     clusterExport, clusterMap, parApply, parCapply, parLapply,
    -##     parLapplyLB, parRapply, parSapply, parSapplyLB
    -
    ## The following objects are masked from 'package:stats':
    -## 
    -##     IQR, mad, sd, var, xtabs
    -
    ## The following objects are masked from 'package:base':
    -## 
    -##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    -##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    -##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    -##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    -##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    -##     union, unique, unsplit, which, which.max, which.min
    -
    ## 
    -## Attaching package: 'S4Vectors'
    -
    ## The following object is masked from 'package:base':
    -## 
    -##     expand.grid
    -
    ## Loading required package: IRanges
    -
    ## Loading required package: GenomicRanges
    -
    ## Loading required package: GenomeInfoDb
    -
    ## Loading required package: SummarizedExperiment
    -
    ## Loading required package: Biobase
    -
    ## Welcome to Bioconductor
    -## 
    -##     Vignettes contain introductory material; view with
    -##     'browseVignettes()'. To cite Bioconductor, see
    -##     'citation("Biobase")', and for packages 'citation("pkgname")'.
    -
    ## Loading required package: DelayedArray
    -
    ## Loading required package: matrixStats
    -
    ## 
    -## Attaching package: 'matrixStats'
    -
    ## The following objects are masked from 'package:Biobase':
    -## 
    -##     anyMissing, rowMedians
    -
    ## 
    -## Attaching package: 'DelayedArray'
    -
    ## The following objects are masked from 'package:matrixStats':
    -## 
    -##     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
    -
    ## The following objects are masked from 'package:base':
    -## 
    -##     aperm, apply, rowsum
    -
    # Attach the `umap` library
    -library(umap)
    -
    -# Attach the `ggplot2` library for plotting
    -library(ggplot2)
    -
    -# We will need this so we can use the pipe: %>%
    -library(magrittr)
    -
    -# Set the seed so our results are reproducible:
    -set.seed(12345)
    +
    # Attach the `DESeq2` library
    +library(DESeq2)
    +
    +# Attach the `umap` library
    +library(umap)
    +
    +# Attach the `ggplot2` library for plotting
    +library(ggplot2)
    +
    +# We will need this so we can use the pipe: %>%
    +library(magrittr)
    +
    +# Set the seed so our results are reproducible:
    +set.seed(12345)

    4.2 Import and set up data

    Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read both TSV files and add them as data frames to your environment.

    We stored our file paths as objects named metadata_file and data_file in this previous step.

    -
    # Read in metadata TSV file
    -metadata <- readr::read_tsv(metadata_file)
    -
    ## Parsed with column specification:
    +
    # Read in metadata TSV file
    +metadata <- readr::read_tsv(metadata_file)
    +
    ## 
    +## ── Column specification ──────────────────────────────────────────────
     ## cols(
     ##   .default = col_character(),
     ##   refinebio_age = col_logical(),
    @@ -3169,142 +3949,157 @@ 

    4.2 Import and set up data

    ## refinebio_source_archive_url = col_logical(), ## refinebio_specimen_part = col_logical(), ## refinebio_time = col_logical() -## )
    -
    ## See spec(...) for full column specifications.
    -
    # Read in data TSV file
    -df <- readr::read_tsv(data_file) %>%
    -  # Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later
    -  tibble::column_to_rownames("Gene")
    -
    ## Parsed with column specification:
    +## )
    +## ℹ Use `spec()` for the full column specifications.
    +
    # Read in data TSV file
    +expression_df <- readr::read_tsv(data_file) %>%
    +  # Tuck away the gene ID  column as row names, leaving only numeric values
    +  tibble::column_to_rownames("Gene")
    +
    ## 
    +## ── Column specification ──────────────────────────────────────────────
     ## cols(
     ##   .default = col_double(),
     ##   Gene = col_character()
     ## )
    -## See spec(...) for full column specifications.
    +## ℹ Use `spec()` for the full column specifications.

    Let’s ensure that the metadata and data are in the same sample order.

    -
    # Make the data in the order of the metadata
    -df <- df %>%
    -  dplyr::select(metadata$refinebio_accession_code)
    -
    -# Check if this is in the same order
    -all.equal(colnames(df), metadata$refinebio_accession_code)
    +
    # Make the data in the order of the metadata
    +expression_df <- expression_df %>%
    +  dplyr::select(metadata$refinebio_accession_code)
    +
    +# Check if this is in the same order
    +all.equal(colnames(expression_df), metadata$refinebio_accession_code)
    ## [1] TRUE
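To make the reordering step concrete, here is a small base R sketch on made-up data. The toy data frames and sample names are assumptions for illustration only, and base column indexing is used in place of `dplyr::select()`.

```r
# Toy expression data whose columns are NOT in the metadata order
toy_expression <- data.frame(
  sample_b = c(5, 0),
  sample_a = c(1, 2),
  row.names = c("gene1", "gene2")
)

# Toy metadata listing the samples in the order we want
toy_metadata <- data.frame(
  refinebio_accession_code = c("sample_a", "sample_b"),
  stringsAsFactors = FALSE
)

# Reorder the columns to match the metadata order
toy_expression <- toy_expression[, toy_metadata$refinebio_accession_code]

# Check that the orders now agree
all.equal(colnames(toy_expression), toy_metadata$refinebio_accession_code)
```

If a sample listed in the metadata were missing from the expression data, the indexing step would raise an error, which is a useful early warning.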
    -

    Now we are going to use a combination of functions from the DESeq2, umap, and ggplot2 packages to perform and visualize the results of the Uniform Manifold Approximation and Projection (UMAP) dimension reduction technique on our pre-ADT and post-ADT samples.

    -
    -

    4.3 Prepare data for DESeq2

    -

    We need to make sure all of the values in our data are converted to integers as required by a DESeq2 function we will use later.

    -
    # The `DESeqDataSetFromMatrix()` function needs the values to be converted to integers
    -df <- df %>%
    -  # Mutate numeric variables to be integers
    -  dplyr::mutate_if(is.numeric, round)
    -
    -

    4.4 Prepare metadata for DESEq2

    +

    4.3 Prepare metadata for DESEq2

    We need to make sure all of the metadata columns that we would like to use to annotate our plot are converted into factors.

    -
    # We need to also format the variables from the metadata, that we will be using for annotation of the UMAP plot, into factors
    -metadata <- metadata %>%
    -  dplyr::select( # The metadata has many variables, we want to select only those that we will need for plotting later
    -    refinebio_accession_code,
    -    refinebio_treatment,
    -    refinebio_disease
    -  ) %>%
    -  dplyr::mutate( # Now let's convert the annotation variables into factors
    -    refinebio_treatment = as.factor(refinebio_treatment),
    -    refinebio_disease = as.factor(refinebio_disease)
    -  )
    +
    # convert the columns we will be using for annotation into factors
    +metadata <- metadata %>%
    +  dplyr::select( # select only the columns that we will need for plotting
    +    refinebio_accession_code,
    +    refinebio_treatment,
    +    refinebio_disease
    +  ) %>%
    +  dplyr::mutate( # Now let's convert the annotation variables into factors
    +    refinebio_treatment = factor(
    +      refinebio_treatment,
    +      # specify the possible levels in the order we want them to appear
    +      levels = c("pre-adt", "post-adt")
    +    ),
    +    refinebio_disease = as.factor(refinebio_disease)
    +  )
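The difference between `as.factor()` and `factor()` with explicit `levels` can be seen in a short sketch. The `treatment` vector below is made up for illustration.

```r
# Made-up treatment labels for illustration
treatment <- c("post-adt", "pre-adt", "post-adt")

# `as.factor()` sorts the levels alphabetically, so "post-adt" comes first
default_levels <- levels(as.factor(treatment))

# `factor()` with `levels` lets us choose the order ourselves
ordered_levels <- levels(factor(treatment, levels = c("pre-adt", "post-adt")))

default_levels
ordered_levels
```

One thing to watch for: any value that is not listed in `levels` becomes `NA`, so the level names must match the metadata values exactly.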
    +
    +
    +

    4.4 Define a minimum counts cutoff

    +

    We want to filter out the genes that have not been expressed or that have low expression counts since these genes are likely to add noise rather than useful signal to our analysis. We are going to do some pre-filtering to keep only genes with 10 or more reads total. Note that rows represent gene data and the columns represent sample data in our dataset.

    +
    # Define a minimum counts cutoff and filter the data to include
    +# only rows (genes) that have total counts above the cutoff
    +filtered_expression_df <- expression_df %>%
    +  dplyr::filter(rowSums(.) >= 10)
    +

    We also need our counts to be rounded before we can use them with the DESeqDataSetFromMatrix() function.

    +
    # The `DESeqDataSetFromMatrix()` function needs the values to be converted to integers
    +filtered_expression_df <- round(filtered_expression_df)
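The filtering and rounding steps can be sketched on a tiny made-up counts table. This version uses base R indexing rather than `dplyr::filter()`, and the gene and sample names are hypothetical.

```r
# Made-up counts: rows are genes, columns are samples
toy_counts <- data.frame(
  sample_a = c(0.4, 6.2, 120.7),
  sample_b = c(1.1, 5.3, 98.2),
  row.names = c("gene_low", "gene_mid", "gene_high")
)

# Keep only genes whose total counts across samples reach the cutoff
min_total_counts <- 10
genes_to_keep <- rowSums(toy_counts) >= min_total_counts
toy_filtered <- toy_counts[genes_to_keep, ]

# Round to whole numbers, as `DESeqDataSetFromMatrix()` expects integers
toy_filtered <- round(toy_filtered)
toy_filtered
```

Here `gene_low` has a total of 1.5 counts and is dropped, while the other two genes pass the cutoff and are kept.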

    4.5 Create a DESeqDataset

    -

    We will be using the DESeq2 package for normalizing and transforming our data, which requires us to format our data into a DESeqDataSet object. We turn the data frame (or matrix) into a DESeqDataSet object. ) and specify which variable labels our experimental groups using the design argument (Love et al. 2014). In this chunk of code, we will not provide a specific model to the design argument because we are not performing a differential expression analysis.

    -
    # Create a `DESeqDataSet` object
    -dds <- DESeqDataSetFromMatrix(
    -  countData = df, # This is the data frame with the counts values for all replicates in our dataset
    -  colData = metadata, # This is the data frame with the annotation data for the replicates in the counts data frame
    -  design = ~1 # Here we are not specifying a model -- Replace with an appropriate design variable for your analysis
    -)
    +

    We will be using the DESeq2 package for normalizing and transforming our data, which requires us to format our data into a DESeqDataSet object. We turn the data frame (or matrix) into a DESeqDataSet object and specify which variable labels our experimental groups using the design argument (Love et al. 2014). In this chunk of code, we will not provide a specific model to the design argument because we are not performing a differential expression analysis.

    +
    # Create a `DESeqDataSet` object
    +dds <- DESeqDataSetFromMatrix(
    +  countData = filtered_expression_df, # the counts values for all samples in our dataset
    +  colData = metadata, # annotation data for the samples in the counts data frame
    +  design = ~1 # Here we are not specifying a model
    +  # Replace with an appropriate design variable for your analysis
    +)
    ## converting counts to integer mode
    -
    -

    4.6 Define a minimum counts cutoff

    -

    We want to filter out the genes that have not been expressed or that have low expression counts because we want to remove any possible noise from our data before we normalize the data and create our plot. We are going to do some pre-filtering to keep only genes with 10 or more reads total. Note that rows represent gene data and the columns represent sample data in our dataset.

    -
    # Define a minimum counts cutoff and filter `DESeqDataSet` object to include
    -# only rows that have counts above the cutoff
    -genes_to_keep <- rowSums(counts(dds)) >= 10
    -dds <- dds[genes_to_keep, ]
    -
    -

    4.7 Perform DESeq2 normalization and transformation

    +

    4.6 Perform DESeq2 normalization and transformation

    We are going to use the vst() function from the DESeq2 package to normalize and transform the data. For more information about these transformation methods, see here.

    -
    # Normalize and transform the data in the `DESeqDataSet` object using the `vst()` function from the `DESEq2` R package
    -dds_norm <- vst(dds)
    +
    # Normalize and transform the data in the `DESeqDataSet` object
    +# using the `vst()` function from the `DESeq2` R package
    +dds_norm <- vst(dds)
    -

    4.8 Perform UMAP

    -

    Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique proposed by McInnes et al. (2018) (See associated paper). While PCA assumes that the variation we care about has a particular distribution (normal, broadly speaking), UMAP allows more complicated distributions that it learns from the data. The advantage of this feature is that UMAP can do a better job separating clusters, especially when some of those clusters may be more similar to each other than others (CCDL 2020).

    -

    In this code chunk, we are going to extract the normalized counts data from the DESeqDataSet object and perform UMAP on the normalized data using umap() from the umap package. We are using the default parameters when we run the umap::umap() function. Here’s some guidance about choosing parameters when executing umap::umap() (R CRAN Team 2019). You can also run the following in the RStudio console to get more information on the function and its default parameters: ?umap::umap or ?umap::umap.defaults.

    -
    # First we are going to retrieve the normalized data from the `DESeqDataSet` object using the `assay()` function
    -normalized_counts <- assay(dds_norm) %>%
    -  t() # We need to transpose this data in preparation for the `umap()` function
    -
    -# Now let's tell R to perform UMAP on the normalized data
    -umap_results <- umap::umap(normalized_counts)
    +

    4.7 Perform UMAP

    +

    Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique proposed by McInnes et al. (2018) (See associated paper). While PCA assumes that the variation we care about has a particular distribution (normal, broadly speaking), UMAP allows more complicated distributions that it learns from the data. The advantage of this feature is that UMAP can do a better job separating clusters, especially when some of those clusters may be more similar to each other than others (Childhood Cancer Data Lab 2020).

    +

    In this code chunk, we are going to extract the normalized counts data from the DESeqDataSet object and perform UMAP on the normalized data using umap() from the umap package. We are using the default parameters when we run the umap::umap() function. Here’s some guidance about choosing parameters when executing umap::umap() (R CRAN Team 2019). You can also run the following in the RStudio console to get more information on the function and its default parameters: ?umap::umap or ?umap::umap.defaults.

    +
    # First we are going to retrieve the normalized data
    +# from the `DESeqDataSet` object using the `assay()` function
    +normalized_counts <- assay(dds_norm) %>%
    +  t() # We need to transpose this data so each row is a sample
    +
    +# Now perform UMAP on the normalized data
    +umap_results <- umap::umap(normalized_counts)
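
The chunk above uses the default parameters. As a minimal sketch of how those defaults could be adjusted, the example below copies `umap::umap.defaults`, changes two settings, and passes the result through the `config` argument. The `toy_counts` matrix is simulated data standing in for `normalized_counts`, and the specific values chosen for `n_neighbors` and `random_state` are illustrative assumptions, not recommendations.

```r
# Simulated stand-in for `normalized_counts` (an assumption for illustration):
# 30 samples as rows, 50 genes as columns
set.seed(12345)
toy_counts <- matrix(
  rnorm(30 * 50),
  nrow = 30,
  dimnames = list(paste0("sample_", 1:30), paste0("gene_", 1:50))
)

# Start from the package defaults and change only what we need
custom_config <- umap::umap.defaults
custom_config$n_neighbors <- 10 # the package default is 15
custom_config$random_state <- 12345 # fix the seed that UMAP uses

# Run UMAP with the customized configuration
toy_umap <- umap::umap(toy_counts, config = custom_config)

# One row per sample, one column per UMAP dimension
dim(toy_umap$layout)
```

Keeping the configuration in its own object makes it easy to record exactly which settings produced a given plot.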
    -

    4.9 Prepare data frame for plotting

    +

    4.8 Prepare data frame for plotting

    Now that we have the results from UMAP, we need to extract the UMAP coordinates from the umap_results object and merge in the variables from the metadata that we will use for annotating our plot.

    -
    # We need to tell R to create a data frame with the umap values and annotation data in preparation for plotting with `ggplot2`
    -umap_plot_df <- data.frame(umap_results$layout) %>%
    -  tibble::rownames_to_column("refinebio_accession_code") %>% # Let's get the rownames into a column so we can merge the annotation data
    -  dplyr::inner_join(metadata, by = "refinebio_accession_code") # Let's merge the annotation data using the `refinebio_accession_code` column
    +
    # Make into data frame for plotting with `ggplot2`
    +# The UMAP values we need for plotting are stored in the `layout` element
    +umap_plot_df <- data.frame(umap_results$layout) %>%
    +  # Turn sample IDs stored as row names into a column
    +  tibble::rownames_to_column("refinebio_accession_code") %>%
    +  # Add the metadata into this data frame; match by sample IDs
    +  dplyr::inner_join(metadata, by = "refinebio_accession_code")
    +

    Let’s take a look at the data frame we created in the chunk above.

    +
    umap_plot_df
    +
    + +
    +

    Here we can see that UMAP took the data from thousands of genes, and reduced it to just two variables, X1 and X2.
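
To make the two preparation steps concrete, here is a small base R sketch of the same idea: move the sample IDs from row names into a column, then keep only the samples that appear in both tables. The sample IDs, coordinates, and treatment labels are made up for illustration, and `merge()` is used here as a dependency-free stand-in for `dplyr::inner_join()`.

```r
# Toy stand-in for `umap_results$layout`: a matrix with sample IDs as row names
# (IDs and values are hypothetical)
toy_layout <- matrix(
  c(0.5, -1.2, 2.0, 1.1, 0.3, -0.7),
  ncol = 2,
  dimnames = list(c("SRR001", "SRR002", "SRR003"), NULL)
)

# Toy metadata, deliberately in a different sample order
toy_metadata <- data.frame(
  refinebio_accession_code = c("SRR003", "SRR001", "SRR002"),
  refinebio_treatment = c("post-adt", "pre-adt", "pre-adt")
)

# Converting an unnamed matrix gives columns named X1 and X2
toy_plot_df <- data.frame(toy_layout)

# Move the row names into a regular column so we can match on them
toy_plot_df$refinebio_accession_code <- rownames(toy_plot_df)
rownames(toy_plot_df) <- NULL

# Keep only samples present in both tables, matched by sample ID
toy_plot_df <- merge(toy_plot_df, toy_metadata, by = "refinebio_accession_code")

toy_plot_df
```

Because the join matches on the ID column, the order of rows in the metadata does not need to agree with the order of rows in the UMAP results.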

    -

    4.10 Create UMAP plot

    +

    4.9 Create UMAP plot

    Now we can use the ggplot() function to plot the UMAP coordinates of our normalized data.

    -
    # Plot using `ggplot()` function
    -ggplot(
    -  umap_plot_df,
    -  aes(
    -    x = X1,
    -    y = X2
    -  )
    -) +
    -  geom_point() # This tells R that we want a scatterplot
    -

    +
    # Plot using `ggplot()` function
    +ggplot(
    +  umap_plot_df,
    +  aes(
    +    x = X1,
    +    y = X2
    +  )
    +) +
    +  geom_point() # Plot individual points to make a scatterplot
    +

    Let’s try adding a variable to our plot for annotation.

    In this code chunk, the variable refinebio_treatment is given to the ggplot() function so we can label by androgen deprivation therapy (ADT) status.

    -
    # Plot using `ggplot()` function
    -ggplot(
    -  umap_plot_df,
    -  aes(
    -    x = X1,
    -    y = X2,
    -    color = refinebio_treatment # This will label points with different colors for each `refinebio_treatment` group
    -  )
    -) +
    -  geom_point() # This tells R that we want a scatterplot
    -

    +
    # Plot using `ggplot()` function
    +ggplot(
    +  umap_plot_df,
    +  aes(
    +    x = X1,
    +    y = X2,
    +    color = refinebio_treatment # label points with different colors for each `refinebio_treatment` group
    +  )
    +) +
    +  geom_point() # This tells R that we want a scatterplot
    +

    In the next code chunk, we are going to add another variable to our plot for annotation.

    -

    We’ll plot using both refinebio_treatment and refinebio_disease variables for labels since they are central to the androgen deprivation therapy (ADT) based hypothesis in the original paper (Sharma et al. 2018).

    -
    # Plot using `ggplot()` function
    -final_annotated_umap_plot <- ggplot(
    -  umap_plot_df,
    -  aes(
    -    x = X1,
    -    y = X2,
    -    color = refinebio_treatment, # This will label points with different colors for each `refinebio_treatment` group
    -    shape = refinebio_disease # This will label points with different shapes for each `refinebio_disease` group
    -  )
    -) +
    -  geom_point() # This tells R that we want a scatterplot
    -
    -# Display the plot that we saved above
    -final_annotated_umap_plot
    -

    +

    We’ll plot using both refinebio_treatment and refinebio_disease variables for labels since they are central to the androgen deprivation therapy (ADT) based hypothesis in the original paper (Sharma et al. 2018).

    +
    # Plot using `ggplot()` function and save to an object
    +final_annotated_umap_plot <- ggplot(
    +  umap_plot_df,
    +  aes(
    +    x = X1,
    +    y = X2,
    +    # plot points with different colors for each `refinebio_treatment` group
    +    color = refinebio_treatment,
    +    # plot points with different shapes for each `refinebio_disease` group
    +    shape = refinebio_disease
    +  )
    +) +
    +  geom_point() # make a scatterplot
    +
    +# Display the plot that we saved above
    +final_annotated_umap_plot
    +

    Although the majority of the pre-ADT and post-ADT samples appear to cluster together, questions remain when we look at the outliers.

    -

    4.10.1 Interpretation of UMAP plot and results

    +

    4.9.1 Interpretation of UMAP plot and results

    1. Note that the coordinates of UMAP output for any given sample can change dramatically depending on the parameters, and even from run to run with the same parameters (which is also why setting the seed is important). This means that you should not rely too heavily on the exact values of UMAP’s output.
    @@ -3314,34 +4109,38 @@

    4.10.1 Interpretation of UMAP plo
    1. Playing with the parameters so you can fine-tune them is a good way to give you more information about a particular analysis as well as the data itself. Feel free to try playing with the parameters on your own in the code chunks above!
    -

    In summary, a good rule of thumb to remember is: if the results of an analysis can be completely changed by changing its parameters, you should be very cautious when it comes to the conclusions you draw from it as well as having good rationale for the parameters you choose (adapted from CCDL (2020) training materials).

    +

    In summary, a good rule of thumb to remember is: if the results of an analysis can be completely changed by changing its parameters, you should be very cautious about the conclusions you draw from it and have a good rationale for the parameters you choose (adapted from Childhood Cancer Data Lab (2020) training materials).
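
One way to follow this rule of thumb is to run UMAP under several parameter settings and compare the results side by side. The sketch below does this for `n_neighbors` on simulated data. The `toy_counts` matrix, the three values tried, and the fixed `random_state` are all assumptions made for illustration.

```r
# Simulated stand-in for a samples-by-genes matrix (40 samples, 50 genes)
set.seed(2020)
toy_counts <- matrix(
  rnorm(40 * 50),
  nrow = 40,
  dimnames = list(paste0("sample_", 1:40), NULL)
)

# Hypothetical set of neighborhood sizes to compare
neighbor_values <- c(5, 15, 30)

# Run UMAP once per setting, holding the seed fixed so that
# only `n_neighbors` differs between runs
layouts <- lapply(neighbor_values, function(n) {
  config <- umap::umap.defaults
  config$n_neighbors <- n
  config$random_state <- 2020
  umap::umap(toy_counts, config = config)$layout
})
names(layouts) <- paste0("n_neighbors_", neighbor_values)

# A quick numeric comparison: the range of coordinates under each setting
sapply(layouts, range)
```

If the groupings you care about appear under only one setting, that is a signal to interpret them cautiously.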

    -

    4.11 Save annotated UMAP plot as a PNG

    +

    4.10 Save annotated UMAP plot as a PNG

    You can easily switch this to save a JPEG or TIFF instead by changing the suffix of the file name given to the ggsave() function.

    -
    # Save plot using `ggsave()` function
    -ggsave(file.path(plots_dir, "SRP133573_umap_plot.png"), # Replace with name relevant your plotted data
    -  plot = final_annotated_umap_plot
    -)
    +
    # Save plot using `ggsave()` function
    +ggsave(
    +  file.path(
    +    plots_dir,
    +    "SRP133573_umap_plot.png" # Replace with a good file name for your plot
    +  ),
    +  plot = final_annotated_umap_plot
    +)
    ## Saving 7 x 5 in image
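
As a small illustration of switching formats by file suffix, the sketch below saves the same toy plot as both a PNG and a JPEG. The plot, the file names, and the use of a temporary directory are placeholders for illustration. In the notebook you would use `plots_dir` and `final_annotated_umap_plot` instead.

```r
library(ggplot2)

# Toy plot standing in for `final_annotated_umap_plot`
toy_plot <- ggplot(
  data.frame(X1 = c(1, 2, 3), X2 = c(3, 1, 2)),
  aes(x = X1, y = X2)
) +
  geom_point()

# Temporary directory used here in place of `plots_dir`
out_dir <- tempdir()

# `ggsave()` picks the graphics device from the file suffix
for (suffix in c("png", "jpeg")) {
  ggsave(
    file.path(out_dir, paste0("toy_umap_plot.", suffix)),
    plot = toy_plot,
    width = 7, # in inches, matching the size reported above
    height = 5
  )
}
```

Setting `width` and `height` explicitly means the saved size does not depend on the size of the current plotting window.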

    diff --git a/03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.Rmd b/03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.Rmd index f550a4b0..60211ecb 100644 --- a/03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.Rmd +++ b/03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.Rmd @@ -11,14 +11,14 @@ output: # Purpose of this analysis -The purpose of this notebook is to provide an example of mapping gene IDs for RNA-seq data obtained from refine.bio using `AnnotationDbi` packages [@Carlson2020-package]. +The purpose of this notebook is to provide an example of mapping gene IDs for RNA-seq data obtained from refine.bio using `AnnotationDbi` packages [@Pages2020-package]. ⬇️ [**Jump to the analysis code**](#analysis) ⬇️ # How to run this example For general information about our tutorials and the basic software packages you will need, please see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-this-tutorial-is-structured). -We recommend taking a look at our [Resources for Learning R](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#resources-for-learning-r) if you have not written code in R before. +We recommend taking a look at our [Resources for Learning R](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#resources-for-learning-r) if you have not written code in R before. 
## Obtain the `.Rmd` file @@ -44,7 +44,7 @@ if (!dir.exists("data")) { } # Define the file path to the plots directory -plots_dir <- "plots" # Can replace with path to desired output plots directory +plots_dir <- "plots" # Create the plots folder if it doesn't exist if (!dir.exists(plots_dir)) { @@ -52,7 +52,7 @@ if (!dir.exists(plots_dir)) { } # Define the file path to the results directory -results_dir <- "results" # Can replace with path to desired output results directory +results_dir <- "results" # Create the results folder if it doesn't exist if (!dir.exists(results_dir)) { @@ -70,7 +70,7 @@ Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/SRP Click the "Download Now" button on the right side of this screen. - + Fill out the pop up window with your email and our Terms and Conditions: @@ -128,19 +128,24 @@ This is handy to do because if we want to switch the dataset (see next section f ```{r} # Define the file path to the data directory -data_dir <- file.path("data", "SRP040561") # Replace with accession number which will be the name of the folder the files will be in - -# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir` -data_file <- file.path(data_dir, "SRP040561.tsv") # Replace with file path to your dataset - -# Declare the file path to the metadata file using the data directory saved as `data_dir` -metadata_file <- file.path(data_dir, "metadata_SRP040561.tsv") # Replace with file path to your metadata +# Replace with the path of the folder the files will be in +data_dir <- file.path("data", "SRP040561") + +# Declare the file path to the gene expression matrix file +# inside directory saved as `data_dir` +# Replace with the path to your dataset file +data_file <- file.path(data_dir, "SRP040561.tsv") + +# Declare the file path to the metadata file +# inside the directory saved as `data_dir` +# Replace with the path to your metadata file +metadata_file <- file.path(data_dir, 
"metadata_SRP040561.tsv") ``` Now that our file paths are declared, we can use the `file.exists()` function to check that the files are where we specified above. ```{r} -# Check if the gene expression matrix file is at the file path stored in `data_file` +# Check if the gene expression matrix file is at the path stored in `data_file` file.exists(data_file) # Check if the metadata file is at the file path stored in `metadata_file` @@ -149,7 +154,7 @@ file.exists(metadata_file) If the chunk above printed out `FALSE` to either of those tests, you won't be able to run this analysis _as is_ until those files are in the appropriate place. -If the concept of a "file path" is unfamiliar to you; we recommend taking a look at our [section about file paths](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#an-important-note-about-file-paths-and-Rmds). +If the concept of a "file path" is unfamiliar to you; we recommend taking a look at our [section about file paths](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#an-important-note-about-file-paths-and-Rmds). # Using a different refine.bio dataset with this analysis? @@ -157,12 +162,12 @@ If you'd like to adapt an example analysis to use a different dataset from [refi We suggest saving plots and results to `plots/` and `results/` directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences. -refine.bio data comes with gene level data with Ensembl IDs. -Although this example script uses Ensembl IDs from Zebrafish, (Danio rerio), to obtain Entrez IDs this script can be easily converted for use with different species or annotation types e.g. protein IDs, gene ontology, accession numbers. +refine.bio data comes with gene level data identified by Ensembl IDs. 
+Although this example notebook uses Ensembl IDs from Zebrafish, (Danio rerio), to obtain Entrez IDs this script can be easily converted for use with different species or annotation types e.g. protein IDs, gene ontology, accession numbers. For different species, wherever the abbreviation `org.Dr.eg.db` or `Dr` is written, it must be replaced with the respective species abbreviation e.g. for Homo sapiens `org.Hs.eg.db` or `Hs` would be used. In the case of our [microarray gene identifier annotation example notebook](https://alexslemonade.github.io/refinebio-examples/02-microarray/gene-id-annotation_microarray_01_ensembl.html), a Mouse (Mus musculus) dataset is used, meaning `org.Mm.eg.db` or `Mm` would also need to be used there. -A full list of the annotation R packages from Bioconductor is at this [link](https://bioconductor.org/packages/release/BiocViews.html#___AnnotationData) [@annotation-packages]. +A full list of the annotation R packages from Bioconductor is at this [link](https://bioconductor.org/packages/release/BiocViews.html#___AnnotationData). *** @@ -170,14 +175,18 @@ A full list of the annotation R packages from Bioconductor is at this [link](htt # Obtaining Annotation for Ensembl IDs - RNA-seq -Ensembl IDs can be used to obtain various different annotations at the gene/transcript level. +refine.bio uses Ensembl IDs as the primary gene identifier in its data sets. +While this is a consistent and useful identifier, a string of apparently random letters and numbers is not the most user-friendly or informative for interpretation. +Luckily, we can use the Ensembl IDs that we have to obtain various different annotations at the gene/transcript level. Let's get ready to use the Ensembl IDs from our zebrafish dataset to obtain the associated Entrez IDs. 
## Install libraries See our Getting Started page with [instructions for package installation](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#what-you-need-to-install) for a list of the other software you will need, as well as more tips and resources. -In this analysis, we will be using the `org.Dr.eg.db` R package [@Carlson2019-zebrafish]. +In this analysis, we will be using the `org.Dr.eg.db` R package [@Carlson2019-zebrafish], which is part of the Bioconductor `AnnotationDbi` framework [@Pages2020-package]. +Bioconductor compiles annotations from various sources, and these packages provide convenient methods to access and translate among those annotations. +[Other species can be used](#using-a-different-refinebio-dataset-with-this-analysis). ```{r} # Install the Zebrafish package @@ -188,8 +197,9 @@ if (!("org.Dr.eg.db" %in% installed.packages())) { ``` Attach the packages we need for this analysis. +Note that attaching `org.Dr.eg.db` will automatically also attach `AnnotationDbi`. -```{r} +```{r message=FALSE} # Attach the library library(org.Dr.eg.db) @@ -209,8 +219,8 @@ We stored our file paths as objects named `metadata_file` and `data_file` in [th metadata <- readr::read_tsv(metadata_file) # Read in data TSV file -df <- readr::read_tsv(data_file) %>% - # Tuck away the Gene ID column as rownames +expression_df <- readr::read_tsv(data_file) %>% + # Tuck away the Gene ID column as row names tibble::column_to_rownames("Gene") ``` @@ -218,14 +228,14 @@ Let's ensure that the metadata and data are in the same sample order. 
```{r} # Make the data in the order of the metadata -df <- df %>% +expression_df <- expression_df %>% dplyr::select(metadata$refinebio_accession_code) # Check if this is in the same order -all.equal(colnames(df), metadata$refinebio_accession_code) +all.equal(colnames(expression_df), metadata$refinebio_accession_code) # Bring back the "Gene" column in preparation for mapping -df <- df %>% +expression_df <- expression_df %>% tibble::rownames_to_column("Gene") ``` @@ -233,7 +243,7 @@ df <- df %>% The `mapIds()` function has a `multiVals` argument which denotes what to do when there are multiple mapped values for a single gene identifier. The default behavior is to return just the first mapped value. -It is good to keep in mind that various downstream analyses may benefit from varied strategies at this step. +It is good to keep in mind that various downstream analyses may benefit from varied strategies at this step. Use `?mapIds` to see more options or strategies. In the next chunk, we will run the `mapIds()` function and supply the `multiVals` argument with the `"list"` option in order to get a large list with all the mapped values found for each gene identifier. @@ -241,10 +251,10 @@ In the next chunk, we will run the `mapIds()` function and supply the `multiVals ```{r} # Map Ensembl IDs to their associated Entrez IDs mapped_list <- mapIds( - org.Dr.eg.db, # Replace with annotation package for the organism relevant to your data - keys = df$Gene, - column = "ENTREZID", # Replace with the type of gene identifiers you would like to map to + org.Dr.eg.db, # Replace with annotation package for your organism + keys = expression_df$Gene, keytype = "ENSEMBL", # Replace with the type of gene identifiers in your data + column = "ENTREZID", # The type of gene identifiers you would like to map to multiVals = "list" ) ``` @@ -254,7 +264,7 @@ mapped_list <- mapIds( Now, let's take a look at our mapped object to see how the mapping went. 
```{r} -# Let's use the `head()` function to take a preview at our mapped list +# Let's use the `head()` function for a preview of our mapped list head(mapped_list) ``` @@ -263,10 +273,10 @@ However, the data is now in a `list` object, making it a little more difficult t We are going to turn our list object into a data frame object in the next chunk. ```{r} -# Let's make our object a bit more manageable for exploration by turning it into a data frame +# Let's make our list a bit more manageable by turning it into a data frame mapped_df <- mapped_list %>% tibble::enframe(name = "Ensembl", value = "Entrez") %>% - # enframe makes a `list` column, so we will convert that to simpler format with `unnest() + # enframe() makes a `list` column; we will simplify it with unnest() # This will result in one row of our data frame per list item tidyr::unnest(cols = Entrez) ``` @@ -281,24 +291,26 @@ We can see that our data frame has a new column `Entrez`. Let's get a summary of the values returned in the `Entrez` column of our mapped data frame. ```{r} -# We can use the `summary()` function to get a better idea of the distribution of values in the `Entrez` column -summary(as.factor(mapped_df$Entrez)) # We need to use the `as.factor()` function here in order to get the count of each unique value returned +# Use the `summary()` function to show the distribution of Entrez values +# We need to use `as.factor()` here to get the count of unique values +# `maxsum = 10` limits the summary to 10 distinct values +summary(as.factor(mapped_df$Entrez), maxsum = 10) ``` -There are 9102 NAs in our data frame, which means that 9102 out of the 31882 Ensembl IDs did not map to Entrez IDs. -This means if you are depending on `Entrez` IDs for your downstream analyses, you may not be able to use the 9102 unmapped genes. +There are 9026 `NA`s in our data frame, which means that 9026 out of the 31882 Ensembl IDs did not map to Entrez IDs. 
+This means if you are depending on `Entrez` IDs for your downstream analyses, you may not be able to use the 9026 unmapped genes. Now let's check to see how many genes we have that were mapped to multiple IDs. ```{r} -multi_mapped_df <- mapped_df %>% +multi_mapped <- mapped_df %>% # Let's count the number of times each Ensembl ID appears in `Ensembl` column dplyr::count(Ensembl, name = "entrez_id_count") %>% # Arrange by the genes with the highest number of Entrez IDs mapped dplyr::arrange(desc(entrez_id_count)) # Let's look at the first 6 rows of our `multi_mapped` object -head(multi_mapped_df) +head(multi_mapped) ``` Looks like we have one case where one Ensembl ID mapped to 15 Entrez IDs! @@ -315,7 +327,7 @@ In the next chunk, we show how we can collapse all the Entrez IDs into one colum collapsed_mapped_df <- mapped_df %>% # Group by Ensembl IDs dplyr::group_by(Ensembl) %>% - # Collapse the mapped values in `mapped_df` into one column named `all_entrez_ids` + # Collapse the Entrez IDs `mapped_df` into one column named `all_entrez_ids` dplyr::summarize(all_entrez_ids = paste(Entrez, collapse = ";")) ``` @@ -323,7 +335,9 @@ Let's take a look at our new collapsed `all_entrez_ids` column in the `collapsed ```{r} collapsed_mapped_df %>% - # Filter `collapsed_mapped_df` to include only the rows where `all_entrez_ids` values include the ";" character -- we know that these are the rows with multiple mapped values + # Filter `collapsed_mapped_df` to include only the rows where + # `all_entrez_ids` values include the ";" character -- + # these are the rows with multiple mapped values dplyr::filter(stringr::str_detect(all_entrez_ids, ";")) %>% # We only need a preview here head() @@ -333,7 +347,7 @@ You may have a list of Entrez IDs you are interested in, in which case, the abov In a different study, you may want the oldest Entrez ID (which is probably first), in which case, you can create a column that stores just the first mapped Entrez ID that comes up for each 
Ensembl ID. We will do this in the next section by re-running the `mapIds()` function with `multiVals = "first"`. -### Map Ensembl IDs to gene symbols -- keeping only the first mapped value +### Map Ensembl IDs to Entrez -- keeping only the first mapped value If we don't have a particular preference of which Entrez ID is returned, we can have `mapIds()` use its default of returning the first Entrez ID listed. @@ -342,26 +356,28 @@ Let's re-run `mapIds()`, this time using `multiVals = "first"`. ```{r} final_mapped_df <- data.frame( "first_mapped_entrez_id" = mapIds( - org.Dr.eg.db, # Replace with annotation package for the organism relevant to your data - keys = df$Gene, - column = "ENTREZID", # Replace with the type of gene identifiers you would like to map to - keytype = "ENSEMBL", # Replace with the type of gene identifiers in your data - multiVals = "first" # This will keep only the first mapped value for each Ensembl ID + org.Dr.eg.db, # Replace with annotation package for your organism + keys = expression_df$Gene, + keytype = "ENSEMBL", # Replace with the gene identifiers used in your data + column = "ENTREZID", # The type of gene identifiers you would like to map to + multiVals = "first" # Keep only the first mapped value for each Ensembl ID ) ) %>% # Make an `Ensembl` column to store the rownames tibble::rownames_to_column("Ensembl") %>% - # Join the multiple mappings data from `collapsed_mapped_df` using the Ensembl IDs + # Add the multiple mappings data from `collapsed_mapped_df` using Ensembl IDs dplyr::inner_join(collapsed_mapped_df, by = "Ensembl") %>% - # Now let's join the rest of the expression data - dplyr::inner_join(df, by = c("Ensembl" = "Gene")) + # Now let's add on the rest of the expression data + dplyr::inner_join(expression_df, by = c("Ensembl" = "Gene")) ``` Our `final_mapped_df` object now has a column named `first_mapped_entrez_id` that contains only the first mapped Entrez ID, in addition to the `all_entrez_ids` column that contains all 
mapped Entrez IDs per Ensembl ID. -```{r} +Let's look at the multi-mapped data again + +```{r rownames.print = FALSE} final_mapped_df %>% - # Filter `final_mapped_df` to preview the rows where `all_entrez_ids` values include the ";" character -- we know that these are the rows with multiple mapped values + # Filter `final_mapped_df` to rows with multiple mapped values dplyr::filter(stringr::str_detect(all_entrez_ids, ";")) %>% head() ``` @@ -381,7 +397,7 @@ readr::write_tsv(final_mapped_df, file.path( # Resources for further learning - Marc Carlson has prepared a nice [Introduction to Bioconductor Annotation Packages](https://bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf) [@Carlson2020-vignette] -- See our [microarray gene ID conversion notebook](https://alexslemonade.github.io/refinebio-examples/02-microarray/gene-id-annotation_microarray_01_ensembl.html) as another applicable example, since the steps for this workflow do not change with technology [@gene-id-annotation-microarray]. +- See our [microarray gene ID conversion notebook](https://alexslemonade.github.io/refinebio-examples/02-microarray/gene-id-annotation_microarray_01_ensembl.html) as another applicable example, since the steps for this workflow do not change with technology. 
# Session info diff --git a/03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.html b/03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.html index 211918a8..1e9648d6 100644 --- a/03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.html +++ b/03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.html @@ -1263,25 +1263,22 @@ }; - - + + code.sourceCode > span { display: inline-block; line-height: 1.25; } + code.sourceCode > span { color: inherit; text-decoration: inherit; } + code.sourceCode > span:empty { height: 1.2em; } + .sourceCode { overflow: visible; } + code.sourceCode { white-space: pre; position: relative; } + div.sourceCode { margin: 1em 0; } + pre.sourceCode { margin: 0; } + @media screen { + div.sourceCode { overflow: auto; } + } + @media print { + code.sourceCode { white-space: pre-wrap; } + code.sourceCode > span { text-indent: -5em; padding-left: 5em; } + } + pre.numberSource code + { counter-reset: source-line 0; } + pre.numberSource code > span + { position: relative; left: -4em; counter-increment: source-line; } + pre.numberSource code > span > a:first-child::before + { content: counter(source-line); + position: relative; left: -1em; text-align: right; vertical-align: baseline; + border: none; display: inline-block; + -webkit-touch-callout: none; -webkit-user-select: none; + -khtml-user-select: none; -moz-user-select: none; + -ms-user-select: none; user-select: none; + padding: 0 4px; width: 4em; + color: #aaaaaa; + } + pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; } + div.sourceCode + { } + @media screen { + code.sourceCode > span > a:first-child::before { text-decoration: underline; } + } + code span.al { color: #ff0000; } /* Alert */ + code span.an { color: #008000; } /* Annotation */ + code span.at { } /* Attribute */ + code span.bu { } /* BuiltIn */ + code span.cf { color: #0000ff; } /* ControlFlow */ + code span.ch { color: #008080; } /* Char */ + code span.cn { } /* Constant */ + code span.co { color: #008000; } /* 
Comment */ + code span.cv { color: #008000; } /* CommentVar */ + code span.do { color: #008000; } /* Documentation */ + code span.er { color: #ff0000; font-weight: bold; } /* Error */ + code span.ex { } /* Extension */ + code span.im { } /* Import */ + code span.in { color: #008000; } /* Information */ + code span.kw { color: #0000ff; } /* Keyword */ + code span.op { } /* Operator */ + code span.ot { color: #ff4000; } /* Other */ + code span.pp { color: #ff4000; } /* Preprocessor */ + code span.sc { color: #008080; } /* SpecialChar */ + code span.ss { color: #008080; } /* SpecialString */ + code span.st { color: #008080; } /* String */ + code span.va { } /* Variable */ + code span.vs { color: #008080; } /* VerbatimString */ + code span.wa { color: #008000; font-weight: bold; } /* Warning */ + + + + - - + @@ -2874,15 +3686,20 @@ @@ -2955,7 +3781,7 @@

    October 2020

    1 Purpose of this analysis

    -

    The purpose of this notebook is to provide an example of mapping gene IDs for RNA-seq data obtained from refine.bio using AnnotationDbi packages (Carlson 2020a).

    +

    The purpose of this notebook is to provide an example of mapping gene IDs for RNA-seq data obtained from refine.bio using AnnotationDbi packages (Pagès et al. 2020).

    ⬇️ Jump to the analysis code ⬇️

    @@ -2971,26 +3797,26 @@

    2.1 Obtain the .Rmd

    2.2 Set up your analysis folders

    Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

    If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

    -
    # Create the data folder if it doesn't exist
    -if (!dir.exists("data")) {
    -  dir.create("data")
    -}
    -
    -# Define the file path to the plots directory
    -plots_dir <- "plots" # Can replace with path to desired output plots directory
    -
    -# Create the plots folder if it doesn't exist
    -if (!dir.exists(plots_dir)) {
    -  dir.create(plots_dir)
    -}
    -
    -# Define the file path to the results directory
    -results_dir <- "results" # Can replace with path to desired output results directory
    -
    -# Create the results folder if it doesn't exist
    -if (!dir.exists(results_dir)) {
    -  dir.create(results_dir)
    -}
    +
    # Create the data folder if it doesn't exist
    +if (!dir.exists("data")) {
    +  dir.create("data")
    +}
    +
    +# Define the file path to the plots directory
    +plots_dir <- "plots"
    +
    +# Create the plots folder if it doesn't exist
    +if (!dir.exists(plots_dir)) {
    +  dir.create(plots_dir)
    +}
    +
    +# Define the file path to the results directory
    +results_dir <- "results"
    +
    +# Create the results folder if it doesn't exist
    +if (!dir.exists(results_dir)) {
    +  dir.create(results_dir)
    +}

    In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

    @@ -3041,20 +3867,25 @@

    2.6 Check out our file structure!

    In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

    First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.

    -
    # Define the file path to the data directory
    -data_dir <- file.path("data", "SRP040561") # Replace with accession number which will be the name of the folder the files will be in
    -
    -# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
    -data_file <- file.path(data_dir, "SRP040561.tsv") # Replace with file path to your dataset
    -
    -# Declare the file path to the metadata file using the data directory saved as `data_dir`
    -metadata_file <- file.path(data_dir, "metadata_SRP040561.tsv") # Replace with file path to your metadata
    +
    # Define the file path to the data directory
    +# Replace with the path of the folder the files will be in
    +data_dir <- file.path("data", "SRP040561")
    +
    +# Declare the file path to the gene expression matrix file
    +# inside directory saved as `data_dir`
    +# Replace with the path to your dataset file
    +data_file <- file.path(data_dir, "SRP040561.tsv")
    +
    +# Declare the file path to the metadata file
    +# inside the directory saved as `data_dir`
    +# Replace with the path to your metadata file
    +metadata_file <- file.path(data_dir, "metadata_SRP040561.tsv")

    Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

    -
    # Check if the gene expression matrix file is at the file path stored in `data_file`
    -file.exists(data_file)
    +
    # Check if the gene expression matrix file is at the path stored in `data_file`
    +file.exists(data_file)
    ## [1] TRUE
    -
    # Check if the metadata file is at the file path stored in `metadata_file`
    -file.exists(metadata_file)
    +
    # Check if the metadata file is at the file path stored in `metadata_file`
    +file.exists(metadata_file)
    ## [1] TRUE

    If the chunk above printed out FALSE to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.

    If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.
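The two `file.exists()` checks above can also be folded into a single guard that stops with a clear message. The following is a minimal sketch in base R, not part of the notebook: `check_files_exist()` is a hypothetical helper, and the temporary directory and empty files only stand in for the real data and metadata files.

```r
# Hypothetical helper (not part of the original notebook):
# stop early and name every path that is missing
check_files_exist <- function(paths) {
  missing_paths <- paths[!file.exists(paths)]
  if (length(missing_paths) > 0) {
    stop("Missing file(s): ", paste(missing_paths, collapse = ", "))
  }
  invisible(TRUE)
}

# A temporary directory stands in for `data_dir` in this demonstration
demo_dir <- file.path(tempdir(), "SRP040561_demo")
unlink(demo_dir, recursive = TRUE) # start from a clean slate
dir.create(demo_dir)
demo_data_file <- file.path(demo_dir, "SRP040561.tsv")
demo_metadata_file <- file.path(demo_dir, "metadata_SRP040561.tsv")
file.create(demo_data_file)

# Only the data file exists so far, so the check reports the metadata file
result <- tryCatch(
  check_files_exist(c(demo_data_file, demo_metadata_file)),
  error = function(e) conditionMessage(e)
)
print(result)

# Once both files exist, the check passes silently
file.create(demo_metadata_file)
check_ok <- check_files_exist(c(demo_data_file, demo_metadata_file))
```

In a real analysis you would call the helper once on `c(data_file, metadata_file)` so the notebook fails at the top instead of partway through.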

    @@ -3063,73 +3894,39 @@

    2.6 Check out our file structure!

    3 Using a different refine.bio dataset with this analysis?

    If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/ directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.

    -

    refine.bio data comes with gene level data with Ensembl IDs. Although this example script uses Ensembl IDs from Zebrafish, (Danio rerio), to obtain Entrez IDs this script can be easily converted for use with different species or annotation types e.g. protein IDs, gene ontology, accession numbers.

    -

    For different species, wherever the abbreviation org.Dr.eg.db or Dr is written, it must be replaced with the respective species abbreviation e.g. for Homo sapiens org.Hs.eg.db or Hs would be used. In the case of our microarray gene identifier annotation example notebook, a Mouse (Mus musculus) dataset is used, meaning org.Mm.eg.db or Mm would also need to be used there. A full list of the annotation R packages from Bioconductor is at this link (R Bioconductor Team 2003).

    +

    refine.bio data comes with gene-level data identified by Ensembl IDs. Although this example notebook uses Ensembl IDs from zebrafish (Danio rerio) to obtain Entrez IDs, it can be easily adapted for use with different species or annotation types, e.g., protein IDs, gene ontology, or accession numbers.

    +

    For different species, wherever the abbreviation org.Dr.eg.db or Dr is written, it must be replaced with the respective species abbreviation; e.g., for Homo sapiens, org.Hs.eg.db or Hs would be used. In the case of our microarray gene identifier annotation example notebook, a mouse (Mus musculus) dataset is used, meaning org.Mm.eg.db or Mm is used there. A full list of the annotation R packages from Bioconductor is at this link.
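The naming pattern described here (`org.` + species abbreviation + `.eg.db`) can be sketched in a few lines of base R. This is only an illustration: `species_abbreviations` and `annotation_package_name()` are hypothetical names, and the lookup covers just the three species mentioned in this section.

```r
# Illustrative lookup covering only the species named in this section
species_abbreviations <- c(
  "Danio rerio" = "Dr",
  "Homo sapiens" = "Hs",
  "Mus musculus" = "Mm"
)

# Hypothetical helper: build the annotation package name for a species
annotation_package_name <- function(species) {
  abbreviation <- species_abbreviations[[species]]
  paste0("org.", abbreviation, ".eg.db")
}

human_package <- annotation_package_name("Homo sapiens")
print(human_package)

# To install it, you could then run (not executed here):
# BiocManager::install(human_package, update = FALSE)
```

Remember that the package object passed to `mapIds()` and the name passed to `library()` both need to change along with the species.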


     

    4 Obtaining Annotation for Ensembl IDs - RNA-seq

    -

    Ensembl IDs can be used to obtain various different annotations at the gene/transcript level. Let’s get ready to use the Ensembl IDs from our zebrafish dataset to obtain the associated Entrez IDs.

    +

    refine.bio uses Ensembl IDs as the primary gene identifier in its data sets. While this is a consistent and useful identifier, a string of apparently random letters and numbers is not the most user-friendly or informative for interpretation. Luckily, we can use the Ensembl IDs that we have to obtain various different annotations at the gene/transcript level. Let’s get ready to use the Ensembl IDs from our zebrafish dataset to obtain the associated Entrez IDs.

    4.1 Install libraries

    See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

    -

    In this analysis, we will be using the org.Dr.eg.db R package (Carlson 2019).

    -
    # Install the Zebrafish package
    -if (!("org.Dr.eg.db" %in% installed.packages())) {
    -  # Install this package if it isn't installed yet
    -  BiocManager::install("org.Dr.eg.db", update = FALSE)
    -}
    -

    Attach the packages we need for this analysis.

    -
    # Attach the library
    -library(org.Dr.eg.db)
    -
    ## Loading required package: AnnotationDbi
    -
    ## Loading required package: stats4
    -
    ## Loading required package: BiocGenerics
    -
    ## Loading required package: parallel
    -
    ## 
    -## Attaching package: 'BiocGenerics'
    -
    ## The following objects are masked from 'package:parallel':
    -## 
    -##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    -##     clusterExport, clusterMap, parApply, parCapply, parLapply,
    -##     parLapplyLB, parRapply, parSapply, parSapplyLB
    -
    ## The following objects are masked from 'package:stats':
    -## 
    -##     IQR, mad, sd, var, xtabs
    -
    ## The following objects are masked from 'package:base':
    -## 
    -##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    -##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    -##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    -##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    -##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    -##     union, unique, unsplit, which, which.max, which.min
    -
    ## Loading required package: Biobase
    -
    ## Welcome to Bioconductor
    -## 
    -##     Vignettes contain introductory material; view with
    -##     'browseVignettes()'. To cite Bioconductor, see
    -##     'citation("Biobase")', and for packages 'citation("pkgname")'.
    -
    ## Loading required package: IRanges
    -
    ## Loading required package: S4Vectors
    -
    ## 
    -## Attaching package: 'S4Vectors'
    -
    ## The following object is masked from 'package:base':
    -## 
    -##     expand.grid
    -
    ## 
    -
    # We will need this so we can use the pipe: %>%
    -library(magrittr)
    +

    In this analysis, we will be using the org.Dr.eg.db R package (Carlson 2019), which is part of the Bioconductor AnnotationDbi framework (Pagès et al. 2020). Bioconductor compiles annotations from various sources, and these packages provide convenient methods to access and translate among those annotations. Other species can be used.

    +
    # Install the Zebrafish package
    +if (!("org.Dr.eg.db" %in% installed.packages())) {
    +  # Install this package if it isn't installed yet
    +  BiocManager::install("org.Dr.eg.db", update = FALSE)
    +}
    +

    Attach the packages we need for this analysis. Note that attaching org.Dr.eg.db will automatically also attach AnnotationDbi.

    +
    # Attach the library
    +library(org.Dr.eg.db)
    +
    +# We will need this so we can use the pipe: %>%
    +library(magrittr)

    4.2 Import and set up data

    Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read both TSV files and add them as data frames to your environment.

    We stored our file paths as objects named metadata_file and data_file in this previous step.

    -
    # Read in metadata TSV file
    -metadata <- readr::read_tsv(metadata_file)
    -
    ## Parsed with column specification:
    +
    # Read in metadata TSV file
    +metadata <- readr::read_tsv(metadata_file)
    +
    ## 
    +## ── Column specification ──────────────────────────────────────────────
     ## cols(
     ##   .default = col_logical(),
     ##   refinebio_accession_code = col_character(),
    @@ -3138,49 +3935,50 @@ 

    4.2 Import and set up data

     ##   refinebio_platform = col_character(),
     ##   refinebio_source_database = col_character(),
     ##   refinebio_title = col_character()
    -## )
    -
    ## See spec(...) for full column specifications.
    -
    # Read in data TSV file
    -df <- readr::read_tsv(data_file) %>%
    -  # Tuck away the Gene ID column as rownames
    -  tibble::column_to_rownames("Gene")
    -
    ## Parsed with column specification:
    +## )
    +## ℹ Use `spec()` for the full column specifications.
    +
    # Read in data TSV file
    +expression_df <- readr::read_tsv(data_file) %>%
    +  # Tuck away the Gene ID column as row names
    +  tibble::column_to_rownames("Gene")
    +
    ## 
    +## ── Column specification ──────────────────────────────────────────────
     ## cols(
     ##   .default = col_double(),
     ##   Gene = col_character()
     ## )
    -## See spec(...) for full column specifications.
    +## ℹ Use `spec()` for the full column specifications.

    Let’s ensure that the metadata and data are in the same sample order.

    -
    # Make the data in the order of the metadata
    -df <- df %>%
    -  dplyr::select(metadata$refinebio_accession_code)
    -
    -# Check if this is in the same order
    -all.equal(colnames(df), metadata$refinebio_accession_code)
    +
    # Make the data in the order of the metadata
    +expression_df <- expression_df %>%
    +  dplyr::select(metadata$refinebio_accession_code)
    +
    +# Check if this is in the same order
    +all.equal(colnames(expression_df), metadata$refinebio_accession_code)
    ## [1] TRUE
    -
    # Bring back the "Gene" column in preparation for mapping
    -df <- df %>%
    -  tibble::rownames_to_column("Gene")
    +
    # Bring back the "Gene" column in preparation for mapping
    +expression_df <- expression_df %>%
    +  tibble::rownames_to_column("Gene")

    4.3 Map Ensembl IDs to Entrez IDs

    The mapIds() function has a multiVals argument which denotes what to do when there are multiple mapped values for a single gene identifier. The default behavior is to return just the first mapped value. It is good to keep in mind that various downstream analyses may benefit from varied strategies at this step. Use ?mapIds to see more options or strategies.

    In the next chunk, we will run the mapIds() function and supply the multiVals argument with the "list" option in order to get a large list with all the mapped values found for each gene identifier.

    -
    # Map Ensembl IDs to their associated Entrez IDs
    -mapped_list <- mapIds(
    -  org.Dr.eg.db, # Replace with annotation package for the organism relevant to your data
    -  keys = df$Gene,
    -  column = "ENTREZID", # Replace with the type of gene identifiers you would like to map to
    -  keytype = "ENSEMBL", # Replace with the type of gene identifiers in your data
    -  multiVals = "list"
    -)
    +
    # Map Ensembl IDs to their associated Entrez IDs
    +mapped_list <- mapIds(
    +  org.Dr.eg.db, # Replace with annotation package for your organism
    +  keys = expression_df$Gene,
    +  keytype = "ENSEMBL", # Replace with the type of gene identifiers in your data
    +  column = "ENTREZID", # The type of gene identifiers you would like to map to
    +  multiVals = "list"
    +)
    ## 'select()' returned 1:many mapping between keys and columns
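To make the difference between keeping only the first mapped value and `multiVals = "list"` concrete, here is a toy illustration in base R. It does not call `mapIds()`: the gene names and IDs below are made-up placeholders that only mimic the shape of the result.

```r
# Made-up example shaped like the output of `multiVals = "list"`:
# one element per key, each holding every mapped value (or NA)
toy_mapped_list <- list(
  GENE_A = "1001",
  GENE_B = c("2001", "2002"),
  GENE_C = NA_character_
)

# Keeping only the first value per key mimics the default "first" behavior
toy_first <- vapply(toy_mapped_list, function(ids) ids[1], character(1))
print(toy_first)

# Count how many values each key mapped to (an NA counts as zero)
toy_counts <- vapply(
  toy_mapped_list,
  function(ids) sum(!is.na(ids)),
  integer(1)
)
print(toy_counts)
```

With the list form, `GENE_B` keeps both of its IDs, whereas the first-value form silently drops the second one. That is why this notebook asks for the list and inspects it before deciding how to collapse it.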

    4.4 Explore gene ID conversion

    Now, let’s take a look at our mapped object to see how the mapping went.

    -
    # Let's use the `head()` function to take a preview at our mapped list
    -head(mapped_list)
    +
    # Let's use the `head()` function for a preview of our mapped list
    +head(mapped_list)
    ## $ENSDARG00000000001
     ## [1] "368418"
     ## 
    @@ -3199,58 +3997,38 @@ 

    4.4 Explore gene ID conversion

    It looks like we have Entrez IDs that were successfully mapped to the Ensembl IDs we provided. However, the data is now in a list object, making it a little more difficult to explore. We are going to turn our list object into a data frame object in the next chunk.

    -
    # Let's make our object a bit more manageable for exploration by turning it into a data frame
    -mapped_df <- mapped_list %>%
    -  tibble::enframe(name = "Ensembl", value = "Entrez") %>%
    -  # enframe makes a `list` column, so we will convert that to simpler format with `unnest()
    -  # This will result in one row of our data frame per list item
    -  tidyr::unnest(cols = Entrez)
    +
    # Let's make our list a bit more manageable by turning it into a data frame
    +mapped_df <- mapped_list %>%
    +  tibble::enframe(name = "Ensembl", value = "Entrez") %>%
    +  # enframe() makes a `list` column; we will simplify it with unnest()
    +  # This will result in one row of our data frame per list item
    +  tidyr::unnest(cols = Entrez)

    Now let’s take a peek at our data frame.

    -
    head(mapped_df)
    +
    head(mapped_df)

    We can see that our data frame has a new column Entrez. Let’s get a summary of the values returned in the Entrez column of our mapped data frame.

    -
    # We can use the `summary()` function to get a better idea of the distribution of values in the `Entrez` column
    -summary(as.factor(mapped_df$Entrez)) # We need to use the `as.factor()` function here in order to get the count of each unique value returned
    -
    ## 100126027 100150038 100331412    794549    554097    794406 100536331 100148591 
    -##        28        28        28        28        11         7         6         5 
    -## 100329290 100334559 100333446    393541    562179    795394 100000956 100002012 
    -##         5         5         3         3         3         3         2         2 
    -## 100002312 100002647 100002917 100004335 100005482 100006972 100007523 100007836 
    -##         2         2         2         2         2         2         2         2 
    -## 100034531 100037365 100124612 100126123 100144555 100150193 100150195 100151764 
    -##         2         2         2         2         2         2         2         2 
    -## 100170660 100170822 100191016 100318255 100330526 100330579 100334190 100334824 
    -##         2         2         2         2         2         2         2         2 
    -## 100422734 100535828 100536566 100537891 101883724 101885111 101885541 101887034 
    -##         2         2         2         2         2         2         2         2 
    -## 101887190 103908643 103908654 103909414 108182725 108182861 108190101 108190662 
    -##         2         2         2         2         2         2         2         2 
    -## 108191494 108191539 110437858 110437953 110438456 110438861 110438889 110438898 
    -##         2         2         2         2         2         2         2         2 
    -## 110438962 110438964 110439267 110439798 110440092     30459    323832    323951 
    -##         2         2         2         2         2         2         2         2 
    -##    327020    336702    368925    378480    393135    393320    393518    405787 
    -##         2         2         2         2         2         2         2         2 
    -##    415097    445235    449796    541321    550342    550507    553461    553501 
    -##         2         2         2         2         2         2         2         2 
    -##    559771    561678    563398    563587    564790    566030    566190    568680 
    -##         2         2         2         2         2         2         2         2 
    -##    569452    570815   (Other)      NA's 
    -##         2         2     22821      9102
    -

    There are 9102 NAs in our data frame, which means that 9102 out of the 31882 Ensembl IDs did not map to Entrez IDs. This means if you are depending on Entrez IDs for your downstream analyses, you may not be able to use the 9102 unmapped genes.

    +
    # Use the `summary()` function to show the distribution of Entrez values
    +# We need to use `as.factor()` here to get the count of unique values
    +# `maxsum = 10` limits the summary to 10 distinct values
    +summary(as.factor(mapped_df$Entrez), maxsum = 10)
    +
    ## 100126027 100150038 100331412    794549    554097    794406 100536331 
    +##        28        28        28        28        11         7         6 
    +## 100148591   (Other)      NA's 
    +##         5     23089      9026
    +

    There are 9026 NAs in our data frame, which means that 9026 out of the 31882 Ensembl IDs did not map to Entrez IDs. This means if you are depending on Entrez IDs for your downstream analyses, you may not be able to use the 9026 unmapped genes.
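If you would rather compute the number of unmapped IDs directly than read it off the `summary()` output, `is.na()` does the job. The sketch below uses a made-up data frame with the same two columns as `mapped_df`; the names prefixed with `toy_` and the IDs in it are placeholders, not values from this dataset.

```r
# Made-up data frame shaped like `mapped_df`
toy_mapped_df <- data.frame(
  Ensembl = c("GENE_A", "GENE_B", "GENE_B", "GENE_C"),
  Entrez = c("1001", "2001", "2002", NA),
  stringsAsFactors = FALSE
)

# Number of rows with no Entrez ID
n_unmapped <- sum(is.na(toy_mapped_df$Entrez))

# Proportion of distinct Ensembl IDs that were left unmapped
unmapped_ids <- unique(toy_mapped_df$Ensembl[is.na(toy_mapped_df$Entrez)])
prop_unmapped <- length(unmapped_ids) / length(unique(toy_mapped_df$Ensembl))

print(n_unmapped)
print(prop_unmapped)
```

Running the same two expressions on `mapped_df` would give the unmapped count and fraction for the real data.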

    Now let’s check to see how many genes we have that were mapped to multiple IDs.

    -
    multi_mapped_df <- mapped_df %>%
    -  # Let's count the number of times each Ensembl ID appears in `Ensembl` column
    -  dplyr::count(Ensembl, name = "entrez_id_count") %>%
    -  # Arrange by the genes with the highest number of Entrez IDs mapped
    -  dplyr::arrange(desc(entrez_id_count))
    -
    -# Let's look at the first 6 rows of our `multi_mapped` object
    -head(multi_mapped_df)
    +
    multi_mapped <- mapped_df %>%
    +  # Let's count the number of times each Ensembl ID appears in `Ensembl` column
    +  dplyr::count(Ensembl, name = "entrez_id_count") %>%
    +  # Arrange by the genes with the highest number of Entrez IDs mapped
    +  dplyr::arrange(desc(entrez_id_count))
    +
    +# Let's look at the first 6 rows of our `multi_mapped` object
    +head(multi_mapped)

    Now let’s write our mapped results and data to file!

    @@ -3314,26 +4095,26 @@

    4.4.2 Map Ensembl IDs to gene sym

    4.5 Write mapped results to file

    -
    # Write mapped and annotated data frame to output file
    -readr::write_tsv(final_mapped_df, file.path(
    -  results_dir,
    -  "SRP040561_Entrez_IDs.tsv" # Replace with a relevant output file name
    -))
    +
    # Write mapped and annotated data frame to output file
    +readr::write_tsv(final_mapped_df, file.path(
    +  results_dir,
    +  "SRP040561_Entrez_IDs.tsv" # Replace with a relevant output file name
    +))

    5 Resources for further learning

    6 Session info

    At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording which versions of software and packages you used to run the analysis.

    -
    # Print session info
    -sessioninfo::session_info()
    -
    ## ─ Session info ───────────────────────────────────────────────────────────────
    +
    # Print session info
    +sessioninfo::session_info()
    +
    ## ─ Session info ─────────────────────────────────────────────────────
     ##  setting  value                       
     ##  version  R version 4.0.2 (2020-06-22)
     ##  os       Ubuntu 20.04 LTS            
    @@ -3343,19 +4124,19 @@ 

    6 Session info

     ##  collate  en_US.UTF-8
     ##  ctype    en_US.UTF-8
     ##  tz       Etc/UTC
    -##  date     2020-10-16
    +##  date     2020-12-21
     ##
    -## ─ Packages ───────────────────────────────────────────────────────────────────
    +## ─ Packages ─────────────────────────────────────────────────────────
     ##  package       * version  date       lib source
    -##  AnnotationDbi * 1.50.3   2020-07-25 [1] Bioconductor
    +##  AnnotationDbi * 1.52.0   2020-10-27 [1] Bioconductor
     ##  assertthat      0.2.1    2019-03-21 [1] RSPM (R 4.0.0)
     ##  backports       1.1.10   2020-09-15 [1] RSPM (R 4.0.2)
    -##  Biobase       * 2.48.0   2020-04-27 [1] Bioconductor
    -##  BiocGenerics  * 0.34.0   2020-04-27 [1] Bioconductor
    +##  Biobase       * 2.50.0   2020-10-27 [1] Bioconductor
    +##  BiocGenerics  * 0.36.0   2020-10-27 [1] Bioconductor
     ##  bit             4.0.4    2020-08-04 [1] RSPM (R 4.0.2)
     ##  bit64           4.0.5    2020-08-30 [1] RSPM (R 4.0.2)
     ##  blob            1.2.1    2020-01-20 [1] RSPM (R 4.0.0)
    -##  cli             2.0.2    2020-02-28 [1] RSPM (R 4.0.0)
    +##  cli             2.1.0    2020-10-12 [1] RSPM (R 4.0.2)
     ##  crayon          1.3.4    2017-09-16 [1] RSPM (R 4.0.0)
     ##  DBI             1.1.0    2019-12-15 [1] RSPM (R 4.0.0)
     ##  digest          0.6.25   2020-02-23 [1] RSPM (R 4.0.0)
    @@ -3368,16 +4149,17 @@

    6 Session info

     ##  glue            1.4.2    2020-08-27 [1] RSPM (R 4.0.2)
     ##  hms             0.5.3    2020-01-08 [1] RSPM (R 4.0.0)
     ##  htmltools       0.5.0    2020-06-16 [1] RSPM (R 4.0.1)
    -##  IRanges       * 2.22.2   2020-05-21 [1] Bioconductor
    +##  IRanges       * 2.24.1   2020-12-12 [1] Bioconductor
     ##  jsonlite        1.7.1    2020-09-07 [1] RSPM (R 4.0.2)
     ##  knitr           1.30     2020-09-22 [1] RSPM (R 4.0.2)
     ##  lifecycle       0.2.0    2020-03-06 [1] RSPM (R 4.0.0)
     ##  magrittr      * 1.5      2014-11-22 [1] RSPM (R 4.0.0)
     ##  memoise         1.1.0    2017-04-21 [1] RSPM (R 4.0.0)
     ##  optparse      * 1.6.6    2020-04-16 [1] RSPM (R 4.0.0)
    -##  org.Dr.eg.db  * 3.11.4   2020-10-06 [1] Bioconductor
    +##  org.Dr.eg.db  * 3.12.0   2020-12-16 [1] Bioconductor
     ##  pillar          1.4.6    2020-07-10 [1] RSPM (R 4.0.2)
     ##  pkgconfig       2.0.3    2019-09-22 [1] RSPM (R 4.0.0)
    +##  ps              1.4.0    2020-10-07 [1] RSPM (R 4.0.2)
     ##  purrr           0.3.4    2020-04-17 [1] RSPM (R 4.0.0)
     ##  R.cache         0.14.0   2019-12-06 [1] RSPM (R 4.0.0)
     ##  R.methodsS3     1.8.1    2020-08-26 [1] RSPM (R 4.0.2)
    @@ -3385,18 +4167,18 @@

    6 Session info

     ##  R.utils         2.10.1   2020-08-26 [1] RSPM (R 4.0.2)
     ##  R6              2.4.1    2019-11-12 [1] RSPM (R 4.0.0)
     ##  Rcpp            1.0.5    2020-07-06 [1] RSPM (R 4.0.2)
    -##  readr           1.3.1    2018-12-21 [1] RSPM (R 4.0.2)
    +##  readr           1.4.0    2020-10-05 [1] RSPM (R 4.0.2)
     ##  rematch2        2.1.2    2020-05-01 [1] RSPM (R 4.0.0)
    -##  rlang           0.4.7    2020-07-09 [1] RSPM (R 4.0.2)
    +##  rlang           0.4.8    2020-10-08 [1] RSPM (R 4.0.2)
     ##  rmarkdown       2.4      2020-09-30 [1] RSPM (R 4.0.2)
    -##  RSQLite         2.2.0    2020-01-07 [1] RSPM (R 4.0.2)
    +##  RSQLite         2.2.1    2020-09-30 [1] RSPM (R 4.0.2)
     ##  rstudioapi      0.11     2020-02-07 [1] RSPM (R 4.0.0)
    -##  S4Vectors     * 0.26.1   2020-05-16 [1] Bioconductor
    +##  S4Vectors     * 0.28.1   2020-12-09 [1] Bioconductor
     ##  sessioninfo     1.1.1    2018-11-05 [1] RSPM (R 4.0.0)
     ##  stringi         1.5.3    2020-09-09 [1] RSPM (R 4.0.2)
     ##  stringr         1.4.0    2019-02-10 [1] RSPM (R 4.0.0)
     ##  styler          1.3.2    2020-02-23 [1] RSPM (R 4.0.0)
    -##  tibble          3.0.3    2020-07-10 [1] RSPM (R 4.0.2)
    +##  tibble          3.0.4    2020-10-12 [1] RSPM (R 4.0.2)
     ##  tidyr           1.1.2    2020-08-27 [1] RSPM (R 4.0.2)
     ##  tidyselect      1.1.0    2020-05-11 [1] RSPM (R 4.0.0)
     ##  vctrs           0.3.4    2020-08-29 [1] RSPM (R 4.0.2)
    @@ -3411,23 +4193,22 @@

    6 Session info

    References

    -

    Carlson M., 2019 Genome wide annotation for zebrafish

    -
    -
    -

    Carlson M., 2020a AnnotationDbi

    +

    Carlson M., 2019 Genome wide annotation for zebrafish. https://bioconductor.org/packages/release/data/annotation/html/org.Dr.eg.db.html

    -

    Carlson M., 2020b AnnotationDbi: Introduction to bioconductor annotation packages

    +

    Carlson M., 2020 AnnotationDbi: Introduction to bioconductor annotation packages. https://bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf

    -
    -

    CCDL, 2020 Obtaining annotation for ensembl ids - microarray.

    -
    -
    -

    R Bioconductor Team, 2003 Packages found under annotationdata

    +
    +

    Pagès H., M. Carlson, S. Falcon, and N. Li, 2020 AnnotationDbi: Manipulation of SQLite-based annotations in Bioconductor. https://bioconductor.org/packages/release/bioc/html/AnnotationDbi.html

    +
    diff --git a/03-rnaseq/ortholog-mapping_rnaseq_01_ensembl.Rmd b/03-rnaseq/ortholog-mapping_rnaseq_01_ensembl.Rmd index 21f51e12..8a39cb64 100644 --- a/03-rnaseq/ortholog-mapping_rnaseq_01_ensembl.Rmd +++ b/03-rnaseq/ortholog-mapping_rnaseq_01_ensembl.Rmd @@ -1,7 +1,7 @@ --- title: "Ortholog Mapping - RNA-seq" author: "CCDL for ALSF" -date: "October 2020" +date: "December 2020" output: html_notebook: toc: true @@ -33,7 +33,7 @@ You can open this `.Rmd` file in RStudio and follow the rest of these steps from ## Set up your analysis folders Good file organization is helpful for keeping your data analysis project on track! -We have set up some code that will automatically set up a folder structure for you. +We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders! If you have trouble running this chunk, see our [introduction to using `.Rmd`s](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) for more resources and explanations. @@ -45,7 +45,7 @@ if (!dir.exists("data")) { } # Define the file path to the plots directory -plots_dir <- "plots" # Can replace with path to desired output plots directory +plots_dir <- "plots" # Create the plots folder if it doesn't exist if (!dir.exists(plots_dir)) { @@ -53,7 +53,7 @@ if (!dir.exists(plots_dir)) { } # Define the file path to the results directory -results_dir <- "results" # Can replace with path to desired output results directory +results_dir <- "results" # Create the results folder if it doesn't exist if (!dir.exists(results_dir)) { @@ -92,7 +92,7 @@ For this example analysis, we will use this [acute myeloid leukemia sample datas The data that we downloaded from refine.bio for this analysis has 19 samples (obtained from 19 acute myeloid leukemia (AML) mouse models), containing RNA-sequencing results for types of AML under controlled treatment conditions. 
More specifically, IDH2-mutant AML mouse models were treated with either vehicle or AG-221 (the first small molecule in-vivo inhibitor of IDH2 to enter clinical trials). -While, the TET2-mutant AML mouse models were treated with vehicle or 5-Azacytidine (Decitabine, hypomethylating agent). +The TET2-mutant AML mouse models were treated with vehicle or 5-Azacytidine (Decitabine, hypomethylating agent). ## Place the dataset in your new `data/` folder @@ -133,19 +133,24 @@ This is handy to do because if we want to switch the dataset (see next section f ```{r} # Define the file path to the data directory -data_dir <- file.path("data", "SRP070849") # Replace with accession number which will be the name of the folder the files will be in - -# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir` -data_file <- file.path(data_dir, "SRP070849.tsv") # Replace with file path to your dataset - -# Declare the file path to the metadata file using the data directory saved as `data_dir` -metadata_file <- file.path(data_dir, "metadata_SRP070849.tsv") # Replace with file path to your metadata +# Replace with the path of the folder the files will be in +data_dir <- file.path("data", "SRP070849") + +# Declare the file path to the gene expression matrix file +# inside directory saved as `data_dir` +# Replace with the path to your dataset file +data_file <- file.path(data_dir, "SRP070849.tsv") + +# Declare the file path to the metadata file +# inside the directory saved as `data_dir` +# Replace with the path to your metadata file +metadata_file <- file.path(data_dir, "metadata_SRP070849.tsv") ``` Now that our file paths are declared, we can use the `file.exists()` function to check that the files are where we specified above. 
```{r} -# Check if the gene expression matrix file is at the file path stored in `data_file` +# Check if the gene expression matrix file is at the path stored in `data_file` file.exists(data_file) # Check if the metadata file is at the file path stored in `metadata_file` @@ -175,7 +180,7 @@ See our Getting Started page with [instructions for package installation](https: Attach a package we need for this analysis. -```{r} +```{r message=FALSE} # We will need this so we can use the pipe: %>% library(magrittr) ``` @@ -187,38 +192,51 @@ The [HGNC database](https://www.genenames.org/) currently contains over 39,000 p The [HGNC Comparison of Orthology Predictions (HCOP)](https://www.genenames.org/tools/hcop/) is a search tool that combines orthology predictions for a specified human gene, or set of human genes from a variety of sources, including Ensembl Compara, HGNC, and NCBI Gene Orthology [@Wright2005]. In general, an orthology prediction where most of the databases concur would be considered the reliable, and we will use this to prioritize mapping in cases where there is more than one possible ortholog for a gene. -HCOP was originally designed to show orthology predictions between human and mouse, but has been expanded to include data from 18 genomes [@hcop-help]. +HCOP was originally designed to show orthology predictions between human and mouse, but has been expanded to include data from 18 genomes, including mouse, which we will use in this notebook [@hcop-help]. -First, we need to download the file from the server holding the HGNC data. -Go to this [directory page of the HGNC Comparison of Orthology Predictions (HCOP) files](ftp://ftp.ebi.ac.uk/pub/databases/genenames/hcop/). +We can download the human mouse file we need for this example using `download.file()` command. +For this notebook, we want to download the file named `human_mouse_hcop_fifteen_column.txt.gz`. 
-This is where the files that reflect the data provided via the [HGNC database](https://www.genenames.org/) are maintained. -Ortholog species files with the '6 Column' output returns the raw assertions, Ensembl gene IDs and Entrez Gene IDs for human and one other species, while the '15 Column' output includes additional information such as the chromosomal location, accession numbers and the databases that support the assertions. +First we'll declare a sensible file path for this. -*Note:* If you are using Safari (or the above FTP server link does not open in a web browser), you may need to go to the [link for the HCOP search tool](https://www.genenames.org/tools/hcop/) and scroll down to "Bulk Downloads" to choose a file to download. -Here, you can find the same files you would find at the server linked above. +```{r} +# Declare what we want the downloaded file to be called and its location +mouse_hgnc_file <- file.path( + data_dir, + # The name the file will have locally + "human_mouse_hcop_fifteen_column.txt.gz" +) +``` -To download a file, click the file name. -For this notebook, you will want to download the file named `human_mouse_hcop_fifteen_column.txt.gz`. -If you are using a different dataset, you can replace `mouse` in `human_mouse_hcop_fifteen_column.txt.gz` with the name of the species you have data for, and click on that file to download. +Using the file path we just declared, we can use the `destfile` argument to download the file we need to this directory and use this file name. - +We are downloading this orthology predictions file from the [HGNC database](https://www.genenames.org/). +If you are looking for a different species, see the [directory page of the HGNC Comparison of Orthology Predictions (HCOP) files](http://ftp.ebi.ac.uk/pub/databases/genenames/hcop/) and find the file name of the species you are looking for. -Next, move the `human_mouse_hcop_fifteen_column.txt.gz` file into your `data/` folder. 
+```{r} +download.file( + paste0( + "http://ftp.ebi.ac.uk/pub/databases/genenames/hcop/", + # Replace with the file name for the species conversion you want + "human_mouse_hcop_fifteen_column.txt.gz" + ), + # The file will be saved to the name and location we defined earlier + destfile = mouse_hgnc_file +) +``` -*Note:* If you are using Safari, this file will automatically be decompressed, so the name of the file would instead be `human_mouse_hcop_fifteen_column.txt` (don't forget to change the file name in the chunk below if this is the case). +If you are using a different dataset, in the last chunk you can replace `mouse` in `human_mouse_hcop_fifteen_column.txt.gz` with the name of the species you have data for (if you see it listed in the directory). -Now let's double check that the file is in the right place. +Ortholog species files with the '6 Column' output returns the raw assertions, Ensembl gene IDs and Entrez Gene IDs for human and one other species, while the '15 Column' output includes additional information such as the chromosomal location, accession numbers and the databases that support the assertions. -```{r} -# Define the file path to organism orthology file downloaded from the HGNC database -mouse_hgnc_file <- file.path("data", "human_mouse_hcop_fifteen_column.txt.gz") +Now let's double check that the mouse ortholog file is in the right place. +```{r} # Check if the organism orthology file file is in the `data` directory file.exists(mouse_hgnc_file) ``` -In the next chunk, we will read in the orthology file that was just downloaded. +Now we can read in the orthology file that we downloaded. 
```{r} # Read in the data from HGNC @@ -250,8 +268,8 @@ We stored our data file path as an object named `data_file` in [this previous st ```{r} # Read in data TSV file mouse_genes <- readr::read_tsv(data_file) %>% - # We only want the gene IDs so let's pull the `Gene` column - dplyr::pull("Gene") + # We only want the gene IDs so let's keep only the `Gene` column + dplyr::select("Gene") ``` ## Mapping mouse Ensembl gene IDs to human Ensembl gene IDs @@ -259,7 +277,7 @@ mouse_genes <- readr::read_tsv(data_file) %>% refine.bio data uses Ensembl gene identifiers, which will be in the first column. ```{r} -# Let's take a look at the first 6 items of `mouse_genes` +# Let's take a look at the first 6 rows of `mouse_genes` head(mouse_genes) ``` @@ -287,8 +305,8 @@ We don't want to handle duplicate data, so let's remove those duplicates before ```{r} human_mouse_key <- human_mouse_key %>% - # We need to use the `distinct()` function to remove duplicates resulted from - # ignoring the additional columns in the `mouse` object + # Use the `distinct()` function to remove duplicates resulting from + # dropping the additional columns in the `mouse` data frame dplyr::distinct() ``` @@ -296,40 +314,41 @@ Now let's join the mapped data from `human_mouse_key` with the gene data in `mou ```{r} # First, we need to convert our vector of mouse genes into a data frame -human_mouse_mapped_df <- data.frame("Gene" = mouse_genes) %>% +human_mouse_mapped_df <- mouse_genes %>% # Now we can join the mapped data dplyr::left_join(human_mouse_key, by = c("Gene" = "mouse_ensembl")) ``` Here's what the new data frame looks like: -```{r} -head(human_mouse_mapped_df, n = 25) +```{r rownames.print = FALSE} +head(human_mouse_mapped_df, n = 10) ``` +Looks like we have mapped IDs! + +So now we have all the mouse genes mapped to human, but there might be places where there are multiple mouse genes that are orthologous to the same human gene, or vice versa. 
+ Let's get a summary of the Ensembl IDs returned in the `human_ensembl` column of our mapped data frame. ```{r} -# We can use this `count()` function after `group_by()`to get a count of how many +# We can use this `count()` function to get a tally of how many # `mouse_ensembl` IDs there are per `human_ensembl` IDs human_mouse_mapped_df %>% - dplyr::group_by(human_ensembl) %>% - dplyr::count() %>% - # Sort by highest `n` which would be the human Ensembl ID with the most mapped + # Count the number of rows per human gene + dplyr::count(human_ensembl) %>% + # Sort by highest `n` which will be the human Ensembl ID with the most mapped # mouse Ensembl IDs dplyr::arrange(desc(n)) ``` -Looks like we have mapped IDs! - -Now, let's get an idea of how many mouse Ensembl IDs we have that were not mapped to human Ensembl IDs. +There are certainly a good number of places where we mapped multiple mouse Ensembl IDs to the same human Ensembl ID! +We'll look at this in a bit. -```{r} -sum(is.na(human_mouse_mapped_df$human_ensembl)) -``` +We can also see that there are 19,126 mouse Ensembl IDs that did not map to a human Ensembl ID. +These are the rows with a value of NA. +This seems like a lot, but most of these are likely to be non-protein-coding genes that do not map easily across species. -We have 18,801 NAs, which means we have 18,801 mouse Ensembl IDs out of the 64,816 in `human_mouse_mapped_df`, that were not mapped to human Ensembl IDs. -This is okay because we do not expect everything to map across species. 
## Take a look at some multi-mappings @@ -360,9 +379,16 @@ In the next chunk, we show how we can collapse all the human Ensembl IDs into on collapsed_human_ensembl_df <- human_mouse_mapped_df %>% # Group by mouse Ensembl IDs dplyr::group_by(Gene) %>% - # Collapse the mapped values in `human_mouse_mapped_df` into one column named - # `all_human_ensembls` -- note that we will lose the `support` column in this summarizing step - dplyr::summarize(all_human_ensembls = paste(human_ensembl, collapse = ";")) + # Collapse the mapped values in `human_mouse_mapped_df` to a + # `all_human_ensembl` column, removing any duplicated human symbols + # note that we will lose the `support` column in this summarizing step + dplyr::summarize( + # combine unique symbols with semicolons between them + all_human_ensembl = paste( + sort(unique(human_ensembl)), + collapse = ";" + ) + ) head(collapsed_human_ensembl_df) ``` @@ -407,26 +433,22 @@ Before we do, let's take a look how many multi-mapped genes there are in the dat ```{r} human_mouse_mapped_df %>% - # Group by human Ensembl IDs - dplyr::group_by(human_ensembl) %>% # Count the number of rows in the data frame for each Ensembl ID - dplyr::count() %>% + dplyr::count(human_ensembl) %>% # Filter out the symbols without multimapped genes dplyr::filter(n > 1) ``` -Looks like we have 6,608 human gene Ensembl IDs with multiple mappings. +Looks like we have 6,971 human gene Ensembl IDs with multiple mappings. Now let's filter out the less reliable mappings. 
```{r} filtered_mouse_ensembl_df <- human_mouse_mapped_df %>% - # Count the number of databases in the support column for each prediction + # Count the number of databases in the support column + # by using the number of commas that separate the databases dplyr::mutate(n_databases = stringr::str_count(support, ",") + 1) %>% - # Group by human Ensembl IDs - dplyr::group_by(human_ensembl) %>% - # Now filter for the rows with more than one database in support for each - # human Ensembl ID + # Now filter to the rows where more than one database supports the mapping dplyr::filter(n_databases > 1) head(filtered_mouse_ensembl_df) @@ -444,7 +466,7 @@ filtered_mouse_ensembl_df %>% dplyr::filter(n > 1) ``` -Now we only have 2,532 multi-mapped genes, compared to the 6,608 that we began with. +Now we only have 2,702 multi-mapped genes, compared to the 6,971 that we began with. Although we haven't filtered down to zero multi-mapped genes, we have hopefully removed some of the less _reliable_ mappings. 
### Write results to file diff --git a/03-rnaseq/ortholog-mapping_rnaseq_01_ensembl.html b/03-rnaseq/ortholog-mapping_rnaseq_01_ensembl.html index 0efd979d..10a4dc31 100644 --- a/03-rnaseq/ortholog-mapping_rnaseq_01_ensembl.html +++ b/03-rnaseq/ortholog-mapping_rnaseq_01_ensembl.html @@ -1263,25 +1263,22 @@ }; - - + + code.sourceCode > span { display: inline-block; line-height: 1.25; } + code.sourceCode > span { color: inherit; text-decoration: inherit; } + code.sourceCode > span:empty { height: 1.2em; } + .sourceCode { overflow: visible; } + code.sourceCode { white-space: pre; position: relative; } + div.sourceCode { margin: 1em 0; } + pre.sourceCode { margin: 0; } + @media screen { + div.sourceCode { overflow: auto; } + } + @media print { + code.sourceCode { white-space: pre-wrap; } + code.sourceCode > span { text-indent: -5em; padding-left: 5em; } + } + pre.numberSource code + { counter-reset: source-line 0; } + pre.numberSource code > span + { position: relative; left: -4em; counter-increment: source-line; } + pre.numberSource code > span > a:first-child::before + { content: counter(source-line); + position: relative; left: -1em; text-align: right; vertical-align: baseline; + border: none; display: inline-block; + -webkit-touch-callout: none; -webkit-user-select: none; + -khtml-user-select: none; -moz-user-select: none; + -ms-user-select: none; user-select: none; + padding: 0 4px; width: 4em; + color: #aaaaaa; + } + pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; } + div.sourceCode + { } + @media screen { + code.sourceCode > span > a:first-child::before { text-decoration: underline; } + } + code span.al { color: #ff0000; } /* Alert */ + code span.an { color: #008000; } /* Annotation */ + code span.at { } /* Attribute */ + code span.bu { } /* BuiltIn */ + code span.cf { color: #0000ff; } /* ControlFlow */ + code span.ch { color: #008080; } /* Char */ + code span.cn { } /* Constant */ + code span.co { color: #008000; } /* 
Comment */ + code span.cv { color: #008000; } /* CommentVar */ + code span.do { color: #008000; } /* Documentation */ + code span.er { color: #ff0000; font-weight: bold; } /* Error */ + code span.ex { } /* Extension */ + code span.im { } /* Import */ + code span.in { color: #008000; } /* Information */ + code span.kw { color: #0000ff; } /* Keyword */ + code span.op { } /* Operator */ + code span.ot { color: #ff4000; } /* Other */ + code span.pp { color: #ff4000; } /* Preprocessor */ + code span.sc { color: #008080; } /* SpecialChar */ + code span.ss { color: #008080; } /* SpecialString */ + code span.st { color: #008080; } /* String */ + code span.va { } /* Variable */ + code span.vs { color: #008080; } /* VerbatimString */ + code span.wa { color: #008000; font-weight: bold; } /* Warning */ + + + + - - + @@ -2874,15 +3686,20 @@ @@ -2948,7 +3774,7 @@

    Ortholog Mapping - RNA-seq

    CCDL for ALSF

    -

    October 2020

    +

    December 2020

    @@ -2971,26 +3797,26 @@

    2.1 Obtain the .Rmd

    2.2 Set up your analysis folders

    Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

    If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

    -
    # Create the data folder if it doesn't exist
    -if (!dir.exists("data")) {
    -  dir.create("data")
    -}
    -
    -# Define the file path to the plots directory
    -plots_dir <- "plots" # Can replace with path to desired output plots directory
    -
    -# Create the plots folder if it doesn't exist
    -if (!dir.exists(plots_dir)) {
    -  dir.create(plots_dir)
    -}
    -
    -# Define the file path to the results directory
    -results_dir <- "results" # Can replace with path to desired output results directory
    -
    -# Create the results folder if it doesn't exist
    -if (!dir.exists(results_dir)) {
    -  dir.create(results_dir)
    -}
    +
    # Create the data folder if it doesn't exist
    +if (!dir.exists("data")) {
    +  dir.create("data")
    +}
    +
    +# Define the file path to the plots directory
    +plots_dir <- "plots"
    +
    +# Create the plots folder if it doesn't exist
    +if (!dir.exists(plots_dir)) {
    +  dir.create(plots_dir)
    +}
    +
    +# Define the file path to the results directory
    +results_dir <- "results"
    +
    +# Create the results folder if it doesn't exist
    +if (!dir.exists(results_dir)) {
    +  dir.create(results_dir)
    +}

    In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

    @@ -3008,7 +3834,7 @@

    2.3 Obtain the dataset from refin

    2.4 About the dataset we are using for this example

    For this example analysis, we will use this acute myeloid leukemia sample dataset.

    -

    The data that we downloaded from refine.bio for this analysis has 19 samples (obtained from 19 acute myeloid leukemia (AML) mouse models), containing RNA-sequencing results for types of AML under controlled treatment conditions. More specifically, IDH2-mutant AML mouse models were treated with either vehicle or AG-221 (the first small molecule in-vivo inhibitor of IDH2 to enter clinical trials). While, the TET2-mutant AML mouse models were treated with vehicle or 5-Azacytidine (Decitabine, hypomethylating agent).

    +

    The data that we downloaded from refine.bio for this analysis has 19 samples (obtained from 19 acute myeloid leukemia (AML) mouse models), containing RNA-sequencing results for types of AML under controlled treatment conditions. More specifically, IDH2-mutant AML mouse models were treated with either vehicle or AG-221 (the first small molecule in-vivo inhibitor of IDH2 to enter clinical trials). The TET2-mutant AML mouse models were treated with vehicle or 5-Azacytidine (Decitabine, hypomethylating agent).

    2.5 Place the dataset in your new data/ folder

    @@ -3042,20 +3868,25 @@

    2.6 Check out our file structure!

    In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

    First we will declare the file paths to our data and metadata files, which should be in our data directory. This is handy because if we want to switch the dataset we are using for this analysis (see the next section for more on this), we will only have to change the file paths here to get started.

    -
    # Define the file path to the data directory
    -data_dir <- file.path("data", "SRP070849") # Replace with accession number which will be the name of the folder the files will be in
    -
    -# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
    -data_file <- file.path(data_dir, "SRP070849.tsv") # Replace with file path to your dataset
    -
    -# Declare the file path to the metadata file using the data directory saved as `data_dir`
    -metadata_file <- file.path(data_dir, "metadata_SRP070849.tsv") # Replace with file path to your metadata
    +
    # Define the file path to the data directory
    +# Replace with the path of the folder the files will be in
    +data_dir <- file.path("data", "SRP070849")
    +
    +# Declare the file path to the gene expression matrix file
    +# inside the directory saved as `data_dir`
    +# Replace with the path to your dataset file
    +data_file <- file.path(data_dir, "SRP070849.tsv")
    +
    +# Declare the file path to the metadata file
    +# inside the directory saved as `data_dir`
    +# Replace with the path to your metadata file
    +metadata_file <- file.path(data_dir, "metadata_SRP070849.tsv")

    Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

    -
    # Check if the gene expression matrix file is at the file path stored in `data_file`
    -file.exists(data_file)
    +
    # Check if the gene expression matrix file is at the path stored in `data_file`
    +file.exists(data_file)
    ## [1] TRUE
    -
    # Check if the metadata file is at the file path stored in `metadata_file`
    -file.exists(metadata_file)
    +
    # Check if the metadata file is at the file path stored in `metadata_file`
    +file.exists(metadata_file)
    ## [1] TRUE

    If the chunk above printed out FALSE to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
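The two checks above can also be run in a single call. The following is a minimal sketch, assuming the same paths declared earlier; the `input_files` and `missing_files` names are illustrative and not part of the original notebook.

```r
# Paths as declared earlier in this notebook
data_dir <- file.path("data", "SRP070849")
data_file <- file.path(data_dir, "SRP070849.tsv")
metadata_file <- file.path(data_dir, "metadata_SRP070849.tsv")

# `file.exists()` is vectorized, so several paths can be tested at once
input_files <- c(data_file, metadata_file)
missing_files <- input_files[!file.exists(input_files)]

# Name the missing files rather than only printing TRUE or FALSE
if (length(missing_files) > 0) {
  message("Missing input files: ", paste(missing_files, collapse = ", "))
}
```

Listing the missing paths by name can make it quicker to see which file needs to be moved.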

    If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.

    @@ -3074,31 +3905,43 @@

    4 Ortholog Mapping - RNA-seq

    4.1 Install libraries

    See our Getting Started page with instructions for package installation for a list of the software you will need, as well as more tips and resources.

    Attach a package we need for this analysis.

    -
    # We will need this so we can use the pipe: %>%
    -library(magrittr)
    +
    # We will need this so we can use the pipe: %>%
    +library(magrittr)

    4.2 Import data from HGNC

    -

    The HUGO Gene Nomenclature Committee (HGNC) assigns a unique and ideally meaningful name and symbol to every human gene. The HGNC database currently contains over 39,000 public records containing approved human gene nomenclature and associated gene information (Gray et al. 2015).

    -

    The HGNC Comparison of Orthology Predictions (HCOP) is a search tool that combines orthology predictions for a specified human gene, or set of human genes from a variety of sources, including Ensembl Compara, HGNC, and NCBI Gene Orthology (Wright et al. 2005). In general, an orthology prediction where most of the databases concur would be considered the reliable, and we will use this to prioritize mapping in cases where there is more than one possible ortholog for a gene. HCOP was originally designed to show orthology predictions between human and mouse, but has been expanded to include data from 18 genomes (HGNC team 2020).

    -

    First, we need to download the file from the server holding the HGNC data. Go to this directory page of the HGNC Comparison of Orthology Predictions (HCOP) files.

    -

    This is where the files that reflect the data provided via the HGNC database are maintained. Ortholog species files with the ‘6 Column’ output returns the raw assertions, Ensembl gene IDs and Entrez Gene IDs for human and one other species, while the ‘15 Column’ output includes additional information such as the chromosomal location, accession numbers and the databases that support the assertions.

    -

    Note: If you are using Safari (or the above FTP server link does not open in a web browser), you may need to go to the link for the HCOP search tool and scroll down to “Bulk Downloads” to choose a file to download. Here, you can find the same files you would find at the server linked above.

    -

    To download a file, click the file name. For this notebook, you will want to download the file named human_mouse_hcop_fifteen_column.txt.gz. If you are using a different dataset, you can replace mouse in human_mouse_hcop_fifteen_column.txt.gz with the name of the species you have data for, and click on that file to download.

    -

    -

    Next, move the human_mouse_hcop_fifteen_column.txt.gz file into your data/ folder.

    -

    Note: If you are using Safari, this file will automatically be decompressed, so the name of the file would instead be human_mouse_hcop_fifteen_column.txt (don’t forget to change the file name in the chunk below if this is the case).

    -

    Now let’s double check that the file is in the right place.

    -
    # Define the file path to organism orthology file downloaded from the HGNC database
    -mouse_hgnc_file <- file.path("data", "human_mouse_hcop_fifteen_column.txt.gz")
    -
    -# Check if the organism orthology file file is in the `data` directory
    -file.exists(mouse_hgnc_file)
    +

    The HUGO Gene Nomenclature Committee (HGNC) assigns a unique and ideally meaningful name and symbol to every human gene. The HGNC database currently contains over 39,000 public records containing approved human gene nomenclature and associated gene information (Gray et al. 2015).

    +

    The HGNC Comparison of Orthology Predictions (HCOP) is a search tool that combines orthology predictions for a specified human gene, or set of human genes, from a variety of sources, including Ensembl Compara, HGNC, and NCBI Gene Orthology (Wright et al. 2005). In general, an orthology prediction where most of the databases concur would be considered the most reliable, and we will use this to prioritize mapping in cases where there is more than one possible ortholog for a gene. HCOP was originally designed to show orthology predictions between human and mouse, but has been expanded to include data from 18 genomes, including mouse, which we will use in this notebook (HGNC Team 2020).

    +

    We can download the human mouse file we need for this example using the download.file() command. For this notebook, we want to download the file named human_mouse_hcop_fifteen_column.txt.gz.

    +

    First we’ll declare a sensible file path for this.

    +
    # Declare what we want the downloaded file to be called and its location
    +mouse_hgnc_file <- file.path(
    +  data_dir,
    +  # The name the file will have locally
    +  "human_mouse_hcop_fifteen_column.txt.gz"
    +)
    +

    Using the file path we just declared, we can use the destfile argument to download the file we need to this directory and use this file name.

    +

    We are downloading this orthology predictions file from the HGNC database. If you are looking for a different species, see the directory page of the HGNC Comparison of Orthology Predictions (HCOP) files and find the file name of the species you are looking for.

    +
    download.file(
    +  paste0(
    +    "http://ftp.ebi.ac.uk/pub/databases/genenames/hcop/",
    +    # Replace with the file name for the species conversion you want
    +    "human_mouse_hcop_fifteen_column.txt.gz"
    +  ),
    +  # The file will be saved to the name and location we defined earlier
    +  destfile = mouse_hgnc_file
    +)
    +

    If you are using a different dataset, in the last chunk you can replace mouse in human_mouse_hcop_fifteen_column.txt.gz with the name of the species you have data for (if you see it listed in the directory).
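When a notebook is re-run often, it can be convenient to skip the download if the file is already present. The sketch below shows one way to do that; `download_if_missing()` is a hypothetical helper written for illustration, not a function from the original notebook or from any package.

```r
# Hypothetical helper (an assumption, not part of the original notebook)
download_if_missing <- function(url, destfile) {
  # Make sure the destination folder exists before downloading
  dir.create(dirname(destfile), recursive = TRUE, showWarnings = FALSE)

  if (!file.exists(destfile)) {
    # mode = "wb" keeps the gzipped file intact on Windows
    download.file(url, destfile = destfile, mode = "wb")
  }

  # Return the path invisibly so the result can be assigned if desired
  invisible(destfile)
}
```

It would be called with the same URL and `mouse_hgnc_file` path used in the chunk above, for example `download_if_missing(paste0("http://ftp.ebi.ac.uk/pub/databases/genenames/hcop/", "human_mouse_hcop_fifteen_column.txt.gz"), mouse_hgnc_file)`.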

    +

    Ortholog species files with the ‘6 Column’ output return the raw assertions, Ensembl gene IDs and Entrez Gene IDs for human and one other species, while the ‘15 Column’ output includes additional information such as the chromosomal location, accession numbers and the databases that support the assertions.

    +

    Now let’s double check that the mouse ortholog file is in the right place.

    +
    # Check if the organism orthology file is in the `data` directory
    +file.exists(mouse_hgnc_file)
    ## [1] TRUE
    -

    In the next chunk, we will read in the orthology file that was just downloaded.

    -
    # Read in the data from HGNC
    -mouse <- readr::read_tsv(mouse_hgnc_file)
    -
    ## Parsed with column specification:
    +

    Now we can read in the orthology file that we downloaded.

    +
    # Read in the data from HGNC
    +mouse <- readr::read_tsv(mouse_hgnc_file)
    +
    ## 
    +## ── Column specification ──────────────────────────────────────────────
     ## cols(
     ##   human_entrez_gene = col_character(),
     ##   human_ensembl_gene = col_character(),
    @@ -3117,104 +3960,107 @@ 

    4.2 Import data from HGNC

    ## support = col_character() ## )

    Let’s take a look at what mouse looks like.

    -
    mouse
    +
    mouse

    We are going to manipulate some of the column names to make things easier when calling them downstream.

    -
    mouse <- mouse %>%
    -  set_names(names(.) %>%
    -    # Removing extra word in some of the column names
    -    gsub("_gene$", "", .))
    +
    mouse <- mouse %>%
    +  set_names(names(.) %>%
    +    # Removing extra word in some of the column names
    +    gsub("_gene$", "", .))
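To see what the renaming step does in isolation, here is a small sketch on a handful of column names in the style of the HCOP file. The `example_names` vector is an illustrative subset, not the full set of columns.

```r
# A few column names in the style of the HCOP file (illustrative subset)
example_names <- c(
  "human_entrez_gene",
  "human_ensembl_gene",
  "mouse_ensembl_gene",
  "support"
)

# `$` anchors the pattern to the end of each name, so only a trailing
# "_gene" is removed and names without it are left unchanged
cleaned_names <- gsub("_gene$", "", example_names)
cleaned_names
```

This is why the later chunks can refer to `human_ensembl` and `mouse_ensembl` directly.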

    4.3 Import and set up data

    Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read the data TSV file and add it as an object to your environment.

    We stored our data file path as an object named data_file in this previous step.

    -
    # Read in data TSV file
    -mouse_genes <- readr::read_tsv(data_file) %>%
    -  # We only want the gene IDs so let's pull the `Gene` column
    -  dplyr::pull("Gene")
    -
    ## Parsed with column specification:
    +
    # Read in data TSV file
    +mouse_genes <- readr::read_tsv(data_file) %>%
    +  # We only want the gene IDs so let's keep only the `Gene` column
    +  dplyr::select("Gene")
    +
    ## 
    +## ── Column specification ──────────────────────────────────────────────
     ## cols(
     ##   .default = col_double(),
     ##   Gene = col_character()
    -## )
    -
    ## See spec(...) for full column specifications.
    +## ) +## ℹ Use `spec()` for the full column specifications.

    4.4 Mapping mouse Ensembl gene IDs to human Ensembl gene IDs

    refine.bio data uses Ensembl gene identifiers, which will be in the first column.

    -
    # Let's take a look at the first 6 items of `mouse_genes`
    -head(mouse_genes)
    -
    ## [1] "ENSMUSG00000000001" "ENSMUSG00000000003" "ENSMUSG00000000028"
    -## [4] "ENSMUSG00000000031" "ENSMUSG00000000037" "ENSMUSG00000000049"
    +
    # Let's take a look at the first 6 rows of `mouse_genes`
    +head(mouse_genes)
    +
    + +

    Ensembl gene identifiers have different species-specific prefixes. In mouse, Ensembl gene identifiers begin with ENSMUSG (in human, ENSG, etc.).

    Now let’s do the mapping!

    We’re interested in the human_ensembl, mouse_ensembl, and support columns specifically. The support column contains a list of associated databases that support each assertion. This column may assist with addressing some of the multi-mappings that we will talk about later.

    -
    human_mouse_key <- mouse %>%
    -  # We'll want to subset mouse to only the columns we're interested in
    -  dplyr::select(mouse_ensembl, human_ensembl, support)
    -
    -# Since we ignored the additional columns in `mouse`, let's check to see if we
    -# have any duplicates in our `human_mouse_key`
    -any(duplicated(human_mouse_key))
    +
    human_mouse_key <- mouse %>%
    +  # We'll want to subset mouse to only the columns we're interested in
    +  dplyr::select(mouse_ensembl, human_ensembl, support)
    +
    +# Since we ignored the additional columns in `mouse`, let's check to see if we
    +# have any duplicates in our `human_mouse_key`
    +any(duplicated(human_mouse_key))
    ## [1] TRUE

    We do have duplicates! We don’t want to handle duplicate data, so let’s remove those duplicates before moving forward.

    -
    human_mouse_key <- human_mouse_key %>%
    -  # We need to use the `distinct()` function to remove duplicates resulted from
    -  # ignoring the additional columns in the `mouse` object
    -  dplyr::distinct()
    +
    human_mouse_key <- human_mouse_key %>%
    +  # Use the `distinct()` function to remove duplicates resulting from
    +  # dropping the additional columns in the `mouse` data frame
    +  dplyr::distinct()
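The reason duplicates appear can be sketched with a toy table: two rows that differ only in a column we drop become identical once that column is gone. The gene IDs and the `extra` column below are made up for illustration.

```r
# Toy orthology table; IDs and the `extra` column are made up
toy_orthologs <- data.frame(
  mouse_ensembl = c("ENSMUSG_A", "ENSMUSG_A", "ENSMUSG_B"),
  human_ensembl = c("ENSG_1", "ENSG_1", "ENSG_2"),
  support = c("Ensembl,NCBI", "Ensembl,NCBI", "OrthoDB"),
  extra = c("x", "y", "z")
)

# Dropping `extra` makes the first two rows identical
toy_key <- dplyr::select(toy_orthologs, mouse_ensembl, human_ensembl, support)
has_duplicates <- any(duplicated(toy_key))

# `distinct()` keeps one copy of each unique row
toy_key <- dplyr::distinct(toy_key)
n_unique_rows <- nrow(toy_key)
```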

    Now let’s join the mapped data from human_mouse_key with the gene data in mouse_genes.

    -
    # First, we need to convert our vector of mouse genes into a data frame
    -human_mouse_mapped_df <- data.frame("Gene" = mouse_genes) %>%
    -  # Now we can join the mapped data
    -  dplyr::left_join(human_mouse_key, by = c("Gene" = "mouse_ensembl"))
    +
    # `mouse_genes` is already a data frame with a `Gene` column
    +human_mouse_mapped_df <- mouse_genes %>%
    +  # Now we can join the mapped data
    +  dplyr::left_join(human_mouse_key, by = c("Gene" = "mouse_ensembl"))

    Here’s what the new data frame looks like:

    -
    head(human_mouse_mapped_df, n = 25)
    +
    head(human_mouse_mapped_df, n = 10)
    +

    Looks like we have mapped IDs!

    +

    So now we have all the mouse genes mapped to human, but there might be places where there are multiple mouse genes that are orthologous to the same human gene, or vice versa.
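A toy example may help show how `left_join()` behaves here: every row of the gene table is kept, unmapped genes get `NA`, and genes with several orthologs are repeated. The IDs below are made up for illustration.

```r
# Toy gene list and mapping key; IDs are made up
toy_genes <- data.frame(Gene = c("ENSMUSG_A", "ENSMUSG_B", "ENSMUSG_C"))
toy_key <- data.frame(
  mouse_ensembl = c("ENSMUSG_A", "ENSMUSG_A", "ENSMUSG_B"),
  human_ensembl = c("ENSG_1", "ENSG_2", "ENSG_3")
)

# Keep all rows of `toy_genes`, adding matches from `toy_key`
toy_mapped <- dplyr::left_join(
  toy_genes,
  toy_key,
  by = c("Gene" = "mouse_ensembl")
)
toy_mapped
```

Here `ENSMUSG_A` appears twice because it has two human matches, and `ENSMUSG_C` has `NA` in `human_ensembl` because it has none.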

    Let’s get a summary of the Ensembl IDs returned in the human_ensembl column of our mapped data frame.

    -
    # We can use this `count()` function after `group_by()`to get a count of how many
    -# `mouse_ensembl` IDs there are per `human_ensembl` IDs
    -human_mouse_mapped_df %>%
    -  dplyr::group_by(human_ensembl) %>%
    -  dplyr::count() %>%
    -  # Sort by highest `n` which would be the human Ensembl ID with the most mapped
    -  # mouse Ensembl IDs
    -  dplyr::arrange(desc(n))
    +
    # We can use this `count()` function to get a tally of how many
    +# `mouse_ensembl` IDs there are per `human_ensembl` IDs
    +human_mouse_mapped_df %>%
    +  # Count the number of rows per human gene
    +  dplyr::count(human_ensembl) %>%
    +  # Sort by highest `n` which will be the human Ensembl ID with the most mapped
    +  # mouse Ensembl IDs
    +  dplyr::arrange(desc(n))
    -

    Looks like we have mapped IDs!

    -

    Now, let’s get an idea of how many mouse Ensembl IDs we have that were not mapped to human Ensembl IDs.

    -
    sum(is.na(human_mouse_mapped_df$human_ensembl))
    -
    ## [1] 18801
    -

    We have 18,801 NAs, which means we have 18,801 mouse Ensembl IDs out of the 64,816 in human_mouse_mapped_df, that were not mapped to human Ensembl IDs. This is okay because we do not expect everything to map across species.

    +

    There are certainly a good number of places where we mapped multiple mouse Ensembl IDs to the same human Ensembl ID! We’ll look at this in a bit.

    +

    We can also see that there are 19,126 mouse Ensembl IDs that did not map to a human Ensembl ID. These are the rows with a value of NA. This seems like a lot, but most of these are likely to be non-protein-coding genes that do not map easily across species.
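The unmapped total can be read off the `NA` row of the count table, or computed directly. A minimal sketch on a made-up data frame:

```r
# Toy mapped data frame; IDs are made up
toy_mapped <- data.frame(
  Gene = c("ENSMUSG_A", "ENSMUSG_B", "ENSMUSG_C", "ENSMUSG_D", "ENSMUSG_E"),
  human_ensembl = c("ENSG_1", "ENSG_1", NA, NA, "ENSG_2")
)

# `count()` treats NA as its own group, so unmapped genes get a row too
toy_counts <- dplyr::arrange(
  dplyr::count(toy_mapped, human_ensembl),
  dplyr::desc(n)
)

# The same unmapped total, computed directly
n_unmapped <- sum(is.na(toy_mapped$human_ensembl))
```

In this toy data, `n_unmapped` equals the `n` value in the `NA` row of `toy_counts`.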

    4.5 Take a look at some multi-mappings

    If a mouse Ensembl gene ID maps to multiple human Ensembl IDs, the associated values will get duplicated. Let’s look at the ENSMUSG00000000290 example below.

    -
    human_mouse_mapped_df %>%
    -  dplyr::filter(Gene == "ENSMUSG00000000290")
    +
    human_mouse_mapped_df %>%
    +  dplyr::filter(Gene == "ENSMUSG00000000290")

    On the other hand, if you were to look at the original data associated with the mouse Ensembl IDs, when a human Ensembl ID maps to multiple mouse Ensembl IDs, the values will not get duplicated, but you will have multiple rows associated with that human Ensembl ID. Let’s look at the ENSG00000001617 example below.

    -
    human_mouse_mapped_df %>%
    -  dplyr::filter(human_ensembl == "ENSG00000001617")
    +
    human_mouse_mapped_df %>%
    +  dplyr::filter(human_ensembl == "ENSG00000001617")
    @@ -3222,118 +4068,121 @@

    4.5 Take a look at some multi-map

    4.6 Collapse mouse genes mapping to multiple human genes

    Remember that if a mouse Ensembl gene ID maps to multiple human Ensembl IDs, the values get duplicated. We can collapse the multi-mapped values into a list for each mouse Ensembl ID so that we do not have duplicate values in our data frame.

    In the next chunk, we show how we can collapse all the human Ensembl IDs into one column where we store them all for each unique mouse Ensembl ID.

    -
    collapsed_human_ensembl_df <- human_mouse_mapped_df %>%
    -  # Group by mouse Ensembl IDs
    -  dplyr::group_by(Gene) %>%
    -  # Collapse the mapped values in `human_mouse_mapped_df` into one column named
    -  # `all_human_ensembls` -- note that we will lose the `support` column in this summarizing step
    -  dplyr::summarize(all_human_ensembls = paste(human_ensembl, collapse = ";"))
    +
    collapsed_human_ensembl_df <- human_mouse_mapped_df %>%
    +  # Group by mouse Ensembl IDs
    +  dplyr::group_by(Gene) %>%
    +  # Collapse the mapped values in `human_mouse_mapped_df` to a
    +  # `all_human_ensembl` column, removing any duplicated human Ensembl IDs
    +  # note that we will lose the `support` column in this summarizing step
    +  dplyr::summarize(
    +    # combine unique Ensembl IDs with semicolons between them
    +    all_human_ensembl = paste(
    +      sort(unique(human_ensembl)),
    +      collapse = ";"
    +    )
    +  )
    ## `summarise()` ungrouping output (override with `.groups` argument)
    -
    head(collapsed_human_ensembl_df)
    +
    head(collapsed_human_ensembl_df)
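The collapsing step can be sketched on a toy data frame with made-up IDs. One detail worth noting: `sort()` drops `NA` values by default, so a mouse gene with no human match ends up with an empty string rather than `NA`.

```r
# Toy mapped data frame; IDs are made up
toy_mapped <- data.frame(
  Gene = c("ENSMUSG_A", "ENSMUSG_A", "ENSMUSG_A", "ENSMUSG_B", "ENSMUSG_C"),
  human_ensembl = c("ENSG_2", "ENSG_1", "ENSG_2", "ENSG_3", NA)
)

# One row per mouse gene, with unique human IDs joined by semicolons
toy_collapsed <- dplyr::summarize(
  dplyr::group_by(toy_mapped, Gene),
  all_human_ensembl = paste(sort(unique(human_ensembl)), collapse = ";")
)
toy_collapsed
```

If empty strings are not wanted downstream, they could be converted back to `NA`, for example with `dplyr::na_if(all_human_ensembl, "")`.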

    4.6.1 Write results to file

    Now let’s write our results to file.

    -
    readr::write_tsv(
    -  collapsed_human_ensembl_df,
    -  file.path(
    -    results_dir,
    -    # Replace with a relevant output file name
    -    "SRP070849_mouse_ensembl_to_collapsed_human_gene_symbol.tsv"
    -  )
    -)
    +
    readr::write_tsv(
    +  collapsed_human_ensembl_df,
    +  file.path(
    +    results_dir,
    +    # Replace with a relevant output file name
    +    "SRP070849_mouse_ensembl_to_collapsed_human_gene_symbol.tsv"
    +  )
    +)

    4.7 Collapse human genes mapping to multiple mouse genes

    -

    Since multiple mouse Ensembl gene IDs map to the same human Ensembl gene ID, we may want to identify which one of these mappings represents the “true” ortholog, i.e. which mouse gene is most similar to the human gene we are interested in. This is not at all straightforward! (see this paper for just one example) (Stamboulian et al. 2020). Gene duplications along the mouse lineage may result in complicated relationships among genes, especially with regard to divisions of function.

    +

    Since multiple mouse Ensembl gene IDs map to the same human Ensembl gene ID, we may want to identify which one of these mappings represents the “true” ortholog, i.e. which mouse gene is most similar to the human gene we are interested in. This is not at all straightforward! (see this paper for just one example) (Stamboulian et al. 2020). Gene duplications along the mouse lineage may result in complicated relationships among genes, especially with regard to divisions of function.

    Simply combining values across mouse transcripts using an average may result in the loss of a lot of data and will likely not be representative of the mouse biology. One thing we might do to make the problem somewhat simpler is to reduce the number of multi-mapped genes by requiring a certain level of support for each mapping from across the various databases included in HCOP. This will not fully solve the problem (and may not even be desirable in some cases), but we present it here as an example of an approach one might take.

    Therefore, we will use the support column to decide which mappings to retain. Let’s take a look at support.

    -
    head(human_mouse_mapped_df$support)
    -
    ## [1] "OrthoMCL,OrthoDB"                                                                      
    -## [2] "OrthoDB"                                                                               
    -## [3] "Inparanoid,PhylomeDB,NCBI,HomoloGene,HGNC,Treefam,OrthoMCL,OMA,Panther,Ensembl,OrthoDB"
    -## [4] NA                                                                                      
    -## [5] "Inparanoid,PhylomeDB,HomoloGene,EggNOG,NCBI,HGNC,Treefam,OMA,Panther,Ensembl,OrthoDB"  
    +
    head(human_mouse_mapped_df$support)
    +
    ## [1] "OrthoDB,OrthoMCL"                                                                             
    +## [2] "OrthoDB,OrthoMCL"                                                                             
    +## [3] "HomoloGene,Inparanoid,PhylomeDB,Ensembl,Treefam,OMA,Panther,HGNC,NCBI,OrthoDB,OrthoMCL"       
    +## [4] NA                                                                                             
    +## [5] "HomoloGene,Inparanoid,PhylomeDB,Ensembl,EggNOG,Treefam,OMA,Panther,HGNC,NCBI,OrthoDB,OrthoMCL"
     ## [6] "NCBI"

    Looks like we have a variety of databases supporting many of the mappings, but we do have some instances where only one database supports the mapping, and some where no support is recorded at all (NA). As we noted earlier, an orthology prediction on which more than one database concurs would be considered reliable. Therefore, where we have multi-mapped mouse Ensembl gene IDs, we will keep only the mappings supported by more than one database.

    Before we do, let’s take a look at how many multi-mapped genes there are in the data frame.

    -
    human_mouse_mapped_df %>%
    -  # Group by human Ensembl IDs
    -  dplyr::group_by(human_ensembl) %>%
    -  # Count the number of rows in the data frame for each Ensembl ID
    -  dplyr::count() %>%
    -  # Filter out the symbols without multimapped genes
    -  dplyr::filter(n > 1)
    +
    human_mouse_mapped_df %>%
    +  # Count the number of rows in the data frame for each Ensembl ID
    +  dplyr::count(human_ensembl) %>%
    +  # Keep only the human Ensembl IDs that appear in more than one row
    +  dplyr::filter(n > 1)
    -

    Looks like we have 6,608 human gene Ensembl IDs with multiple mappings.

    +

    Looks like we have 6,971 human gene Ensembl IDs with multiple mappings.
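As a small illustration of what "multi-mapped" means here, the sketch below counts how often each ID occurs in a toy vector and keeps the IDs that occur more than once. The IDs and the `toy_` names are hypothetical, and base R's `table()` stands in for `dplyr::count()` so the example has no package dependencies.

```r
# Toy column of human IDs, one entry per mapping (hypothetical values)
toy_human_ids <- c(
  "human_1", "human_1", "human_2",
  "human_3", "human_3", "human_3"
)

# Count the number of rows for each ID
id_counts <- table(toy_human_ids)

# Keep only the IDs that appear in more than one row
multi_mapped <- id_counts[id_counts > 1]

# The number of multi-mapped IDs is the number of entries that remain
n_multi_mapped <- length(multi_mapped)
n_multi_mapped
```

The number reported in the text is the equivalent of `n_multi_mapped`: the number of distinct human IDs with more than one row, not the total number of rows involved.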

    Now let’s filter out the less reliable mappings.

    -
    filtered_mouse_ensembl_df <- human_mouse_mapped_df %>%
    -  # Count the number of databases in the support column for each prediction
    -  dplyr::mutate(n_databases = stringr::str_count(support, ",") + 1) %>%
    -  # Group by human Ensembl IDs
    -  dplyr::group_by(human_ensembl) %>%
    -  # Now filter for the rows with more than one database in support for each
    -  # human Ensembl ID
    -  dplyr::filter(n_databases > 1)
    -
    -head(filtered_mouse_ensembl_df)
    +
    filtered_mouse_ensembl_df <- human_mouse_mapped_df %>%
    +  # Count the number of databases in the support column
    +  # by using the number of commas that separate the databases
    +  dplyr::mutate(n_databases = stringr::str_count(support, ",") + 1) %>%
    +  # Now filter to the rows where more than one database supports the mapping
    +  dplyr::filter(n_databases > 1)
    +
    +head(filtered_mouse_ensembl_df)
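The comma-counting trick, and what happens to mappings with no recorded support, can be sketched on a toy vector. The strings below are hypothetical but modeled on the `support` output shown earlier, and base R is used in place of `stringr::str_count()` so the example runs on its own.

```r
# Hypothetical support strings, including one missing value
toy_support <- c("OrthoDB,OrthoMCL", "NCBI", NA, "HGNC,NCBI,Panther")

# Remove everything except commas, then count the characters that remain
n_commas <- nchar(gsub("[^,]", "", toy_support))

# Number of databases = number of commas + 1; a missing value stays missing
n_databases <- n_commas + 1

# which() ignores NA comparisons, so the entry with no support is dropped
# together with the single-database entry; dplyr::filter() likewise drops
# rows where the condition evaluates to NA
keep <- which(n_databases > 1)
toy_support[keep]
```

One consequence worth keeping in mind is that rows with `NA` in `support` do not survive the filter, so they are treated the same way as mappings backed by a single database.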

    Let’s count how many multi-mapped genes we have now.

    -
    filtered_mouse_ensembl_df %>%
    -  # Group by human Ensembl IDs
    -  dplyr::group_by(human_ensembl) %>%
    -  # Count the number of rows in the data frame for each Ensembl ID
    -  dplyr::count() %>%
    -  # Filter out the symbols without multimapped genes
    -  dplyr::filter(n > 1)
    +
    filtered_mouse_ensembl_df %>%
    +  # Group by human Ensembl IDs
    +  dplyr::group_by(human_ensembl) %>%
    +  # Count the number of rows in the data frame for each Ensembl ID
    +  dplyr::count() %>%
    +  # Keep only the human Ensembl IDs that appear in more than one row
    +  dplyr::filter(n > 1)
    -

    Now we only have 2,532 multi-mapped genes, compared to the 6,608 that we began with. Although we haven’t filtered down to zero multi-mapped genes, we have hopefully removed some of the less reliable mappings.

    +

    Now we only have 2,702 multi-mapped genes, compared to the 6,971 that we began with. Although we haven’t filtered down to zero multi-mapped genes, we have hopefully removed some of the less reliable mappings.

    4.7.1 Write results to file

    Now let’s write our filtered_mouse_ensembl_df object, with the reliable mouse Ensembl IDs for each unique human gene Ensembl ID, to file.

    -
    readr::write_tsv(
    -  filtered_mouse_ensembl_df,
    -  file.path(
    -    results_dir,
    -    # Replace with a relevant output file name
    -    "SRP070849_filtered_mouse_ensembl_to_human_gene_symbol.tsv"
    -  )
    -)
    +
    readr::write_tsv(
    +  filtered_mouse_ensembl_df,
    +  file.path(
    +    results_dir,
    +    # Replace with a relevant output file name
    +    "SRP070849_filtered_mouse_ensembl_to_human_gene_symbol.tsv"
    +  )
    +)

    5 Resources for further learning

    6 Session info

    At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

    -
    # Print session info
    -sessioninfo::session_info()
    -
    ## ─ Session info ───────────────────────────────────────────────────────────────
    +
    # Print session info
    +sessioninfo::session_info()
    +
    ## ─ Session info ─────────────────────────────────────────────────────
     ##  setting  value                       
     ##  version  R version 4.0.2 (2020-06-22)
     ##  os       Ubuntu 20.04 LTS            
    @@ -3343,13 +4192,13 @@ 

    6 Session info

    ## collate en_US.UTF-8 ## ctype en_US.UTF-8 ## tz Etc/UTC -## date 2020-10-16 +## date 2020-12-21 ## -## ─ Packages ─────────────────────────────────────────────────────────────────── +## ─ Packages ───────────────────────────────────────────────────────── ## package * version date lib source ## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.0) ## backports 1.1.10 2020-09-15 [1] RSPM (R 4.0.2) -## cli 2.0.2 2020-02-28 [1] RSPM (R 4.0.0) +## cli 2.1.0 2020-10-12 [1] RSPM (R 4.0.2) ## crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.0) ## digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0) ## dplyr 1.0.2 2020-08-18 [1] RSPM (R 4.0.2) @@ -3368,23 +4217,23 @@

    6 Session info

    ## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0) ## pillar 1.4.6 2020-07-10 [1] RSPM (R 4.0.2) ## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.0) +## ps 1.4.0 2020-10-07 [1] RSPM (R 4.0.2) ## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.0) ## R.cache 0.14.0 2019-12-06 [1] RSPM (R 4.0.0) ## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.2) ## R.oo 1.24.0 2020-08-26 [1] RSPM (R 4.0.2) ## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.2) ## R6 2.4.1 2019-11-12 [1] RSPM (R 4.0.0) -## Rcpp 1.0.5 2020-07-06 [1] RSPM (R 4.0.2) -## readr 1.3.1 2018-12-21 [1] RSPM (R 4.0.2) +## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.2) ## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.0) -## rlang 0.4.7 2020-07-09 [1] RSPM (R 4.0.2) +## rlang 0.4.8 2020-10-08 [1] RSPM (R 4.0.2) ## rmarkdown 2.4 2020-09-30 [1] RSPM (R 4.0.2) ## rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0) ## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.0) ## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.2) ## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.0) ## styler 1.3.2 2020-02-23 [1] RSPM (R 4.0.0) -## tibble 3.0.3 2020-07-10 [1] RSPM (R 4.0.2) +## tibble 3.0.4 2020-10-12 [1] RSPM (R 4.0.2) ## tidyselect 1.1.0 2020-05-11 [1] RSPM (R 4.0.0) ## vctrs 0.3.4 2020-08-29 [1] RSPM (R 4.0.2) ## withr 2.3.0 2020-09-22 [1] RSPM (R 4.0.2) @@ -3398,20 +4247,25 @@

    6 Session info

    References

    -

    Gray K. A., B. Yates, R. L. Seal, M. W. Wright, and E. A. Bruford, 2015 Genenames.org: The hgnc resources in 2015. Nucleic Acids Res 43. https://doi.org/10.1038/nature11327

    +

    Gray K. A., B. Yates, R. L. Seal, M. W. Wright, and E. A. Bruford, 2015 Genenames.org: The HGNC resources in 2015. Nucleic Acids Research 43. https://doi.org/10.1093/nar/gku1071

    -

    HGNC team, 2020 HCOP help

    +

    HGNC Team, 2020 HCOP help. https://www.genenames.org/help/hcop/

    Stamboulian M., R. F. Guerrero, M. W. Hahn, and P. Radivojac, 2020 The ortholog conjecture revisited: The value of orthologs and paralogs in function prediction. Bioinformatics 36: i219–i226. https://doi.org/10.1093/bioinformatics/btaa468

    -

    Wright M. W., T. A. Eyre, M. J. Lush, S. Povey, and E. A. Bruford, 2005 HCOP: The hgnc comparison of orthology predictions search tool. Mammalian Genome 16: 827–8. https://doi.org/10.1007/s00335-005-0103-2

    +

    Wright M. W., T. A. Eyre, M. J. Lush, S. Povey, and E. A. Bruford, 2005 HCOP: The HGNC comparison of orthology predictions search tool. Mammalian Genome 16: 827–8. https://doi.org/10.1007/s00335-005-0103-2

    + diff --git a/03-rnaseq/pathway-analysis_rnaseq_01_ora.Rmd b/03-rnaseq/pathway-analysis_rnaseq_01_ora.Rmd new file mode 100644 index 00000000..022f2249 --- /dev/null +++ b/03-rnaseq/pathway-analysis_rnaseq_01_ora.Rmd @@ -0,0 +1,542 @@ +--- +title: "Over-representation analysis - RNA-Seq" +author: "CCDL for ALSF" +date: "December 2020" +output: + html_notebook: + toc: true + toc_float: true + number_sections: true +--- + +# Purpose of this analysis + +This example is one of pathway analysis module set, we recommend looking at the [pathway analysis table below](#how-to-choose-a-pathway-analysis) to help you determine which pathway analysis method is best suited for your purposes. + +This particular example analysis shows how you can use over-representation analysis (ORA) to determine if a set of genes shares more or fewer genes with gene sets/pathways than we would expect by chance. + +ORA is a broadly applicable technique that may be good in scenarios where your dataset or scientific questions don't fit the requirements of other pathway analyses methods. +It also does not require any particular sample size, since the only input from your dataset is a set of genes of interest [@Yaari2013]. + +If you have differential expression results or something with a gene-level ranking and a two-group comparison, we recommend considering [GSEA](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html) for your pathway analysis questions. +⬇️ [**Jump to the analysis code**](#analysis) ⬇️ + +### What is pathway analysis? + +Pathway analysis refers to any one of many techniques that uses predetermined sets of genes that are related or coordinated in their expression in some way (e.g., participate in the same molecular process, are regulated by the same transcription factor) to interpret a high-throughput experiment. 
+In the context of [refine.bio](https://www.refine.bio/), we use these techniques to analyze and interpret genome-wide gene expression experiments. +The rationale for performing pathway analysis is that looking at the pathway-level may be more biologically meaningful than considering individual genes, especially if a large number of genes are differentially expressed between conditions of interest. +In addition, many relatively small changes in the expression values of genes in the same pathway could lead to a phenotypic outcome and these small changes may go undetected in differential gene expression analysis. + +We highly recommend taking a look at [Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002375) from @Khatri2012 for a more comprehensive overview. +We have provided primary publications and documentation of the methods we will introduce below as well as some recommended reading in the [`Resources for further learning` section](#resources-for-further-learning). + +### How to choose a pathway analysis? + +This table summarizes the pathway analyses examples in this module. + +|Analysis|What is required for input|What output looks like |✅ Pros| ⚠️ Cons| +|--------|--------------------------|-----------------------|-------|-------| +|[**ORA (Over-representation Analysis)**](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_01_ora.html)|A list of gene IDs (no stats needed)|A per-pathway hypergeometric test result|- Simple

    - Inexpensive computationally to calculate p-values| - Requires arbitrary thresholds and ignores any statistics associated with a gene

    - Assumes independence of genes and pathways| +|[**GSEA (Gene Set Enrichment Analysis)**](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html)|A list of genes IDs with gene-level summary statistics|A per-pathway enrichment score|- Includes all genes (no arbitrary threshold!)

    - Attempts to measure coordination of genes|- Permutations can be expensive

    - Does not account for pathway overlap

    - Two-group comparisons not always appropriate/feasible| +|[**GSVA (Gene Set Variation Analysis)**](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_03_gsva.html)|A gene expression matrix (like what you get from refine.bio directly)|Pathway-level scores on a per-sample basis|- Does not require two groups to compare upfront

    - Normally distributed scores|- Scores are not a good fit for gene sets that contain genes that go up AND down

    - Method doesn’t assign statistical significance itself

    - Recommended sample size n > 10| + +# How to run this example + +For general information about our tutorials and the basic software packages you will need, please see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-this-tutorial-is-structured). +We recommend taking a look at our [Resources for Learning R](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#resources-for-learning-r) if you have not written code in R before. + +## Obtain the `.Rmd` file + +To run this example yourself, [download the `.Rmd` for this analysis by clicking this link](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_01_ora.Rmd). + +Clicking this link will most likely send this to your downloads folder on your computer. +Move this `.Rmd` file to where you would like this example and its files to be stored. + +You can open this `.Rmd` file in RStudio and follow the rest of these steps from there. (See our [section about getting started with R notebooks](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) if you are unfamiliar with `.Rmd` files.) + +## Set up your analysis folders + +Good file organization is helpful for keeping your data analysis project on track! +We have set up some code that will automatically set up a folder structure for you. +Run this next chunk to set up your folders! + +If you have trouble running this chunk, see our [introduction to using `.Rmd`s](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) for more resources and explanations. 
+ +```{r} +# Create the data folder if it doesn't exist +if (!dir.exists("data")) { + dir.create("data") +} + +# Define the file path to the plots directory +plots_dir <- "plots" + +# Create the plots folder if it doesn't exist +if (!dir.exists(plots_dir)) { + dir.create(plots_dir) +} + +# Define the file path to the results directory +results_dir <- "results" + +# Create the results folder if it doesn't exist +if (!dir.exists(results_dir)) { + dir.create(results_dir) +} +``` + +In the same place you put this `.Rmd` file, you should now have three new empty folders called `data`, `plots`, and `results`! + +## About the dataset we are using for this example + +For this example analysis, we will use this [acute viral bronchiolitis dataset](https://www.refine.bio/experiments/SRP140558). +The data that we downloaded from refine.bio for this analysis has 62 paired peripheral blood mononuclear cell RNA-seq samples obtained from 31 patients. +Samples were collected at two time points: during their first, acute bronchiolitis visit (abbreviated "AV") and their recovery, their post-convalescence visit (abbreviated "CV"). + +We used this dataset to identify modules of co-expressed genes in an [example analysis](https://alexslemonade.github.io/refinebio-examples/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html) using [`WGCNA`](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/) [@Langfelder2008]. + +We have provided this file for you and the code in this notebook will read in the results that are stored online. +If you'd like to follow the steps for creating this results file from the refine.bio data, we suggest going through [that WGCNA example](https://alexslemonade.github.io/refinebio-examples/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html). + +Module 19 was the most differentially expressed between the datasets' two time points (during illness and recovering from illness). 
+ +The heatmap below summarizes the expression of the genes that make up module 19. + + + +Each row is a gene that is a member of module 19, and the composite expression of these genes, called an eigengene, is shown in the barplot below. +This plot demonstrates how these genes, together as a module, are differentially expressed between the two time points. + +## Check out our file structure! + +Your new analysis folder should contain: + +- The example analysis `.Rmd` you downloaded +- A folder called `data` (currently empty) +- A folder for `plots` (currently empty) +- A folder for `results` (currently empty) + +Your example analysis folder should contain your `.Rmd` and three empty folders (which won't be empty for long!). + +If the concept of a "file path" is unfamiliar to you, we recommend taking a look at our [section about file paths](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#an-important-note-about-file-paths-and-Rmds). + +# Using a different refine.bio dataset with this analysis? + +The file we use here has two columns from our WGCNA module results: the ID of each gene and the module it is part of. +If you'd like to adapt an example analysis to use a different dataset from [refine.bio](https://www.refine.bio/), we recommend replacing the `gene_module_url` with a different file path to read in a similar table of genes with the information that you are interested in. +If your gene table differs, many steps will need to be changed or deleted entirely depending on the contents of that file (particularly in the [`Determine our genes of interest list` section](#determine-our-genes-of-interest-list)). + +We suggest saving plots and results to `plots/` and `results/` directories, respectively, as these are automatically created by the notebook. +From here you can customize this analysis example to fit your own scientific questions and preferences. 
+ +*** + +   + +# Over-Representation Analysis with `clusterProfiler` - RNA-seq + +## Install libraries + +See our Getting Started page with [instructions for package installation](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#what-you-need-to-install) for a list of the other software you will need, as well as more tips and resources. + +In this analysis, we will be using [`clusterProfiler`](https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html) package to perform ORA and the [`msigdbr`](https://cran.r-project.org/web/packages/msigdbr/index.html) package which contains gene sets from the [Molecular Signatures Database (MSigDB)](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp) already in the tidy format required by `clusterProfiler` [@Yu2012; @Dolgalev2020; @Subramanian2005; @Liberzon2011]. + +We will also need the [`org.Hs.eg.db`](https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html) package [@Carlson2020-human] to perform gene identifier conversion and [`ggupset`](https://cran.r-project.org/web/packages/ggupset/readme/README.html) to make an UpSet plot [@Ahlmann-Eltze2020]. + +```{r} +if (!("clusterProfiler" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("clusterProfiler", update = FALSE) +} + +# This is required to make one of the plots that clusterProfiler will make +if (!("ggupset" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("ggupset", update = FALSE) +} + +if (!("msigdbr" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("msigdbr", update = FALSE) +} + +if (!("org.Hs.eg.db" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("org.Hs.eg.db", update = FALSE) +} +``` + +Attach the packages we need for this analysis. 
+ +```{r message=FALSE} +# Attach the library +library(clusterProfiler) + +# Package that contains MSigDB gene sets in tidy format +library(msigdbr) + +# Homo sapiens annotation package we'll use for gene identifier conversion +library(org.Hs.eg.db) + +# We will need this so we can use the pipe: %>% +library(magrittr) +``` + +## Download data file + +For ORA, we only need a list of genes of interest and a background gene set as our input, so this example can work for any situation where you have a gene list and want to know more about what biological pathways are significantly represented. + +For this example, we will read in results from a co-expression network analysis that we have already performed. +Rather than reading from a local file, we will download the results table directly from a URL. +These results are from an acute bronchiolitis experiment we used for [an example WGCNA analysis](https://alexslemonade.github.io/refinebio-examples/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html) [@Langfelder2008]. + +The table contains two columns, one with Ensembl gene IDs, and the other with the name of the module they are a part of. +We will perform ORA on one of the modules we identified in the WGCNA analysis, and all of the genes in the table will make up our "background genes". + +Instead of using this URL below, you can use a file path to a TSV file with your desired gene list. +First we will assign the URL to its own variable called `gene_module_url`. + +```{r} +# Define the url to your gene list file +gene_module_url <- "https://refinebio-examples.s3.us-east-2.amazonaws.com/04-advanced-topics/results/SRP140558_wgcna_gene_to_module.tsv" +``` + +We will also declare a file path to where we want this file to be downloaded, and we can use the same file path later for reading the file into R. 
+ +```{r} +gene_module_file <- file.path( + results_dir, + "SRP140558_wgcna_gene_to_module.tsv" +) +``` + +Using the URL (`gene_module_url`) and file path (`gene_module_file`) we can download the file and use the `destfile` argument to specify where it should be saved. + +```{r} +download.file( + gene_module_url, + # The file will be saved to this location and with this name + destfile = gene_module_file +) +``` + +Now let's double check that the results file is in the right place. + +```{r} +# Check if the file exists +file.exists(gene_module_file) +``` + +## Import data + +Read in the file that has WGCNA gene and module results. + +```{r} +# Read in the contents of the WGCNA gene modules file +gene_module_df <- readr::read_tsv(gene_module_file) +``` + +Note that `read_tsv()` can also read TSV files directly from a URL and doesn't necessarily require you download the file first. +If you wanted to use that feature, you could replace the call above with `readr::read_tsv(gene_module_url)` and skip the download steps. + +Let's take a look at this gene module table. + +```{r} +gene_module_df +``` + +## Getting familiar with MSigDB gene sets available via `msigdbr` + +The Molecular Signatures Database (MSigDB) is a resource that contains annotated gene sets that can be used for pathway or gene set analyses [@Subramanian2005; @Liberzon2011]. +We can use the `msigdbr` package to access these gene sets in a format compatible with the package we'll use for analysis, `clusterProfiler` [@Dolgalev2020; @Yu2012]. + +The gene sets available directly from MSigDB are applicable to human studies. +`msigdbr` also supports commonly studied model organisms. + +Let's take a look at what organisms the package supports with `msigdbr_species()`. + +```{r} +msigdbr_species() +``` + +The data we're interested in here comes from human samples, so we can obtain only the gene sets relevant to _H. sapiens_ with the `species` argument to `msigdbr()`. 
+ +```{r} +hs_msigdb_df <- msigdbr(species = "Homo sapiens") +``` + +MSigDB contains [8 different gene set collections](https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp) [@Subramanian2005; @Liberzon2011] that are distinguished by how they are derived (e.g., computationally mined, curated). +In this example, we will use pathways that are gene sets considered to be "canonical representations of a biological process compiled by domain experts" and are a subset of `C2: curated gene sets` [@Subramanian2005; @Liberzon2011]. + +Specifically, we will use the [KEGG (Kyoto Encyclopedia of Genes and Genomes)](https://www.genome.jp/kegg/) pathways [@Kanehisa2000]. + +First, let's take a look at what information is included in this data frame. + +```{r rownames.print=FALSE} +head(hs_msigdb_df) +``` + +We will need to use the `gs_cat` and `gs_subcat` columns to construct a filter step that will only keep curated gene sets and KEGG pathways. + +```{r} +# Filter the human data frame to the KEGG pathways that are included in the +# curated gene sets +hs_kegg_df <- hs_msigdb_df %>% + dplyr::filter( + gs_cat == "C2", # This is to filter only to the C2 curated gene sets + gs_subcat == "CP:KEGG" # This is because we only want KEGG pathways + ) +``` + +The `enricher()` function from `clusterProfiler` that we will use requires a data frame with two columns, where one column contains the term identifier or name and one column contains gene identifiers that match the gene lists we want to check for enrichment. + +Our data frame with KEGG terms contains Entrez IDs and gene symbols. + +In our WGCNA module results data frame, `gene_module_df`, we have Ensembl gene identifiers. +So we will need to convert our Ensembl IDs into either gene symbols or Entrez IDs. + +## Gene identifier conversion + +We're going to convert our identifiers in `gene_module_df` to Entrez IDs, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers! 
+ +The annotation package `org.Hs.eg.db` contains information for different identifiers [@Carlson2020-human]. +`org.Hs.eg.db` is specific to _Homo sapiens_ -- this is what the `Hs` in the package name is referencing. + +Take a look at our other gene identifier conversion examples for examples with different species and gene ID types: +[the microarray example](https://alexslemonade.github.io/refinebio-examples/02-microarray/gene-id-annotation_microarray_01_ensembl.html) and [the RNA-seq example](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.html). + +We can see what types of IDs are available to us in an annotation package with `keytypes()`. + +```{r} +keytypes(org.Hs.eg.db) +``` + +Even though we'll use this package to convert from Ensembl gene IDs (`ENSEMBL`) to Entrez IDs (`ENTREZID`), we could just as easily use it to convert from an Ensembl transcript ID (`ENSEMBLTRANS`) to gene symbols (`SYMBOL`). + +The function we will use to map from Ensembl gene IDs to Entrez IDs is called `mapIds()` and comes from the `AnnotationDbi` package. + +```{r} +# This returns a named vector which we can convert to a data frame, where +# the keys (Ensembl IDs) are the names +entrez_vector <- mapIds( + # Replace with annotation package for the organism relevant to your data + org.Hs.eg.db, + # The vector of gene identifiers we want to map + keys = gene_module_df$gene, + # Replace with the type of gene identifiers in your data + keytype = "ENSEMBL", + # Replace with the type of gene identifiers you would like to map to + column = "ENTREZID", + # In the case of 1:many mappings, return the + # first one. This is default behavior! + multiVals = "first" +) +``` + +This message is letting us know that sometimes Ensembl gene identifiers will map to multiple Entrez IDs. +In this case, it's also possible that an Entrez ID will map to multiple Ensembl IDs. 
+For more about how to explore this, take a look at our [RNA-seq gene ID conversion example](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.html). + +Let's create a two column data frame that shows the Entrez IDs and their Ensembl IDs side-by-side. + +```{r} +# We would like a data frame we can join to the module labels +gene_key_df <- data.frame( + ensembl_id = names(entrez_vector), + entrez_id = entrez_vector, + stringsAsFactors = FALSE +) %>% + # If an Ensembl gene identifier doesn't map to an Entrez ID, drop that + # from the data frame + dplyr::filter(!is.na(entrez_id)) +``` + +Let's see a preview of `gene_key_df`. + +```{r rownames.print=FALSE} +head(gene_key_df) +``` + +Now we are ready to add the `gene_key_df` to our data frame with the module labels, `gene_module_df`. +Here we're using a `dplyr::left_join()` with `gene_key_df` on the left because we only want to retain the genes that have Entrez IDs; anything in our `gene_module_df` that does not have an Entrez ID will be left out when we join using the Ensembl gene identifiers. + +```{r} +module_annot_df <- gene_key_df %>% + # Using a left join leaves out the rows without Entrez IDs because those rows + # have already been removed in `gene_key_df` + dplyr::left_join(gene_module_df, + # The name of the column that contains the Ensembl gene IDs + # in the left data frame and right data frame + by = c("ensembl_id" = "gene") + ) +``` + +Let's take a look at what this data frame looks like. + +```{r rownames.print=FALSE} +# Print out a preview +head(module_annot_df) +``` + +## Over-representation Analysis (ORA) + +Over-representation testing using `clusterProfiler` is based on a hypergeometric test (equivalent to a one-sided Fisher's exact test) [@clusterProfiler-book]. 
+For more background on hypergeometric tests, this [handy tutorial](https://dputhier.github.io/ASG/practicals/go_statistics_td/go_statistics_td_2015.html) explains more about how hypergeometric tests work [@Puthier2015]. + +We will need to provide to `clusterProfiler` two genes lists: + +1) Our genes of interest +2) What genes were in our total background set. (All genes that originally had an opportunity to be measured). + +## Determine our genes of interest list + +This step is highly variable depending on the format of your gene table, what information it contains and what your goals are. +You may want to delete this next chunk entirely if you supply an already determined list of genes OR you may need to introduce cutoffs and filters that we don't need here, given the nature of our data. + +Here, we will focus on one module, module 19, to identify pathways associated with it. +We previously identified this module as differentially expressed between our dataset's two time points (during acute illness and during recovery). +See [the previous section](#about-the-dataset-we-are-using-for-this-example) for more background on the structure and content of the data table we are using. + +```{r} +module_19_genes <- module_annot_df %>% + dplyr::filter(module == "ME19") %>% + dplyr::pull("entrez_id") +``` + +Because one `entrez_id` may map to multiple Ensembl IDs, we need to make sure we have no repeated Entrez IDs in this list. + +```{r} +# Reduce to only unique Entrez IDs +genes_of_interest <- unique(as.character(module_19_genes)) + +# Let's print out some of these genes +head(genes_of_interest) +``` + +## Determine our background set gene list + +Sometimes folks consider genes from the entire genome to comprise the background, but for this RNA-seq example, we will consider all detectable genes as our background set. 
+The dataset that these genes were selected from already had unreliably detected, [low count genes removed](https://alexslemonade.github.io/refinebio-examples/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html#prepare-data-for-deseq2).
+Because of this, we can obtain our detected genes list from our data frame, `module_annot_df` (which we have not done any further filtering on in this notebook).
+
+```{r}
+# Remove any duplicated entrez_ids
+background_set <- unique(as.character(module_annot_df$entrez_id))
+```
+
+## Run ORA using the `enricher()` function
+
+Now that we have our background set, our genes of interest, and our pathway information, we're ready to run ORA using the `enricher()` function.
+
+```{r}
+kegg_ora_results <- enricher(
+  gene = genes_of_interest, # A vector of your genes of interest
+  pvalueCutoff = 0.1, # Can choose an FDR cutoff
+  pAdjustMethod = "BH", # Method to be used for multiple testing correction
+  universe = background_set, # A vector containing your background set genes
+  # The pathway information should be a data frame with a term name or
+  # identifier and the gene identifiers
+  TERM2GENE = dplyr::select(
+    hs_kegg_df,
+    gs_name,
+    human_entrez_gene
+  )
+)
+```
+
+*Note: using `enrichKEGG()` is a shortcut for doing ORA using KEGG, but the approach we covered here can be used with any gene sets you'd like!*
+
+The information we're most likely interested in is in the `result` slot.
+Let's convert this into a data frame that we can write to file.
+
+```{r}
+kegg_result_df <- data.frame(kegg_ora_results@result)
+```
+
+Let's print out a sneak peek of the results here and take a look at how many gene sets we have using an FDR cutoff of `0.1`.
+
+```{r}
+kegg_result_df %>%
+  dplyr::filter(p.adjust < 0.1)
+```
+
+Looks like there are four KEGG sets returned as significant at FDR of `0.1`.
+
+## Visualizing results
+
+We can use a dot plot to visualize our significant enrichment results.
+The `enrichplot::dotplot()` function will only plot gene sets that are significant according to the multiple testing corrected p values (in the `p.adjust` column) and the `pvalueCutoff` you provided in the [`enricher()` step](#run-ora-using-the-enricher-function).
+
+```{r}
+enrich_plot <- enrichplot::dotplot(kegg_ora_results)
+
+# Print out the plot here
+enrich_plot
+```
+
+Use `?enrichplot::dotplot` to see the help page for more about how to use this function.
+
+This plot is arguably more useful when we have a large number of significant pathways.
+
+Let's save it to a PNG.
+
+```{r}
+ggplot2::ggsave(file.path(plots_dir, "SRP140558_ora_enrich_plot_module_19.png"),
+  plot = enrich_plot
+)
+```
+
+We can use an [UpSet plot](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4720993/) to visualize the **overlap** between the gene sets that were returned as significant.
+
+```{r}
+upset_plot <- enrichplot::upsetplot(kegg_ora_results)
+
+# Print out the plot here
+upset_plot
+```
+
+See that `KEGG_CELL_CYCLE` and `KEGG_OOCYTE_MEIOSIS` have genes in common, as do `KEGG_CELL_CYCLE` and `KEGG_DNA_REPLICATION`.
+Gene sets or pathways aren't independent!
+Based on the context of your samples, you may be able to narrow down which ones make sense.
+In this instance, we are dealing with PBMCs, so the oocyte meiosis pathway is not relevant to the biology of the samples at hand, and all of the identified genes in that pathway are also part of the cell cycle pathway.
+
+Let's also save this to a PNG.
+
+```{r}
+ggplot2::ggsave(file.path(plots_dir, "SRP140558_ora_upset_plot_module_19.png"),
+  plot = upset_plot
+)
+```
+
+## Write results to file
+
+```{r}
+readr::write_tsv(
+  kegg_result_df,
+  file.path(
+    results_dir,
+    "SRP140558_module_19_pathway_analysis_results.tsv"
+  )
+)
+```
+
+# Resources for further learning
+
+- [Hypergeometric test exercises](https://dputhier.github.io/ASG/practicals/go_statistics_td/go_statistics_td_2015.html) [@Puthier2015].
+- [clusterProfiler ORA tutorial](https://learn.gencore.bio.nyu.edu/rna-seq-analysis/over-representation-analysis/#:~:text=Over%2Drepresentation%20(or%20enrichment),a%20subset%20of%20your%20data.) +- [clusterProfiler paper](https://doi.org/10.1089/omi.2011.0118) [@Yu2012]. +- [clusterProfiler book](https://yulab-smu.github.io/clusterProfiler-book/index.html) [@clusterProfiler-book]. +- [This handy review](https://doi.org/10.1371/journal.pcbi.1002375) which summarizes the different types of pathway analysis and their limitations [@Khatri2012]. + +# Session info + +At the end of every analysis, before saving your notebook, we recommend printing out your session info. +This helps make your code more reproducible by recording what versions of software and packages you used to run this. + +```{r} +# Print session info +sessioninfo::session_info() +``` + +# References diff --git a/03-rnaseq/pathway-analysis_rnaseq_01_ora.html b/03-rnaseq/pathway-analysis_rnaseq_01_ora.html new file mode 100644 index 00000000..4f0fb71c --- /dev/null +++ b/03-rnaseq/pathway-analysis_rnaseq_01_ora.html @@ -0,0 +1,4445 @@ + + + + + + + + + + + + + + +Over-representation analysis - RNA-Seq + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    + + + +
    +
    +
    +
    +
    + +
    + + + + + + + + + +
    +

    1 Purpose of this analysis

    +

    This example is one of a set of pathway analysis modules. We recommend looking at the pathway analysis table below to help you determine which pathway analysis method is best suited for your purposes.

    +

    This particular example analysis shows how you can use over-representation analysis (ORA) to determine if a set of genes shares more or fewer genes with gene sets/pathways than we would expect by chance.

    +

    ORA is a broadly applicable technique that may be good in scenarios where your dataset or scientific questions don’t fit the requirements of other pathway analysis methods. It also does not require any particular sample size, since the only input from your dataset is a set of genes of interest (Yaari et al. 2013).

    +

    If you have differential expression results or something with a gene-level ranking and a two-group comparison, we recommend considering GSEA for your pathway analysis questions. ⬇️ Jump to the analysis code ⬇️

    +
    +

    1.0.1 What is pathway analysis?

    +

    Pathway analysis refers to any one of many techniques that uses predetermined sets of genes that are related or coordinated in their expression in some way (e.g., participate in the same molecular process, are regulated by the same transcription factor) to interpret a high-throughput experiment. In the context of refine.bio, we use these techniques to analyze and interpret genome-wide gene expression experiments. The rationale for performing pathway analysis is that looking at the pathway-level may be more biologically meaningful than considering individual genes, especially if a large number of genes are differentially expressed between conditions of interest. In addition, many relatively small changes in the expression values of genes in the same pathway could lead to a phenotypic outcome and these small changes may go undetected in differential gene expression analysis.

    +

    We highly recommend taking a look at Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges from Khatri et al. (2012) for a more comprehensive overview. We have provided primary publications and documentation of the methods we will introduce below as well as some recommended reading in the Resources for further learning section.

    +
    +
    +

    1.0.2 How to choose a pathway analysis?

    +

    This table summarizes the pathway analysis examples in this module.

    | Analysis | What is required for input | What output looks like | ✅ Pros | ⚠️ Cons |
    |----------|----------------------------|------------------------|---------|---------|
    | ORA (Over-representation Analysis) | A list of gene IDs (no stats needed) | A per-pathway hypergeometric test result | Simple; inexpensive computationally to calculate p-values | Requires arbitrary thresholds and ignores any statistics associated with a gene; assumes independence of genes and pathways |
    | GSEA (Gene Set Enrichment Analysis) | A list of gene IDs with gene-level summary statistics | A per-pathway enrichment score | Includes all genes (no arbitrary threshold!); attempts to measure coordination of genes | Permutations can be expensive; does not account for pathway overlap; two-group comparisons not always appropriate/feasible |
    | GSVA (Gene Set Variation Analysis) | A gene expression matrix (like what you get from refine.bio directly) | Pathway-level scores on a per-sample basis | Does not require two groups to compare upfront; normally distributed scores | Scores are not a good fit for gene sets that contain genes that go up AND down; method doesn’t assign statistical significance itself; recommended sample size n > 10 |
    +
    +
    +
    +

    2 How to run this example

    +

    For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.

    +
    +

    2.1 Obtain the .Rmd file

    +

    To run this example yourself, download the .Rmd for this analysis by clicking this link.

    +

    Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd file to where you would like this example and its files to be stored.

    +

    You can open this .Rmd file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd files.)

    +
    +
    +

    2.2 Set up your analysis folders

    +

    Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

    +

    If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

    +
    # Create the data folder if it doesn't exist
    +if (!dir.exists("data")) {
    +  dir.create("data")
    +}
    +
    +# Define the file path to the plots directory
    +plots_dir <- "plots"
    +
    +# Create the plots folder if it doesn't exist
    +if (!dir.exists(plots_dir)) {
    +  dir.create(plots_dir)
    +}
    +
    +# Define the file path to the results directory
    +results_dir <- "results"
    +
    +# Create the results folder if it doesn't exist
    +if (!dir.exists(results_dir)) {
    +  dir.create(results_dir)
    +}
    +

    In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!
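The folder setup above repeats the same check for each folder. As a minimal sketch of the same idea written as a loop, the example below creates the three folders inside a temporary directory. The `base_dir` and `analysis_dirs` names are assumptions made for illustration and are not part of the original notebook.

```r
# Folder names used in this example
analysis_dirs <- c("data", "plots", "results")

# Work inside a temporary directory so this sketch does not touch your project
# (`base_dir` is a hypothetical name, not used elsewhere in the notebook)
base_dir <- tempfile("ora_example_")
dir.create(base_dir)

for (dir_name in analysis_dirs) {
  dir_path <- file.path(base_dir, dir_name)
  # Create the folder only if it doesn't exist yet
  if (!dir.exists(dir_path)) {
    dir.create(dir_path)
  }
}

# Check that all three folders now exist
dir.exists(file.path(base_dir, analysis_dirs))
```

In your own project you would drop `base_dir` and create the folders next to your `.Rmd` file, as the chunk above does.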

    +
    +
    +

    2.3 About the dataset we are using for this example

    +

    For this example analysis, we will use this acute viral bronchiolitis dataset. The data that we downloaded from refine.bio for this analysis has 62 paired peripheral blood mononuclear cell RNA-seq samples obtained from 31 patients. Samples were collected at two time points: during the patients’ first, acute bronchiolitis visit (abbreviated “AV”) and during recovery, at their post-convalescence visit (abbreviated “CV”).

    +

    We used this dataset to identify modules of co-expressed genes in an example analysis using WGCNA (Langfelder and Horvath 2008).

    +

    We have provided this file for you and the code in this notebook will read in the results that are stored online. If you’d like to follow the steps for creating this results file from the refine.bio data, we suggest going through that WGCNA example.

    +

    Module 19 was the most differentially expressed module between the dataset’s two time points (during illness and during recovery).

    +

    The heatmap below summarizes the expression of the genes that make up module 19.

    +

    +

    Each row is a gene that is a member of module 19, and the composite expression of these genes, called an eigengene, is shown in the barplot below. This plot demonstrates how these genes, together as a module, are differentially expressed between the two time points.

    +
    +
    +

    2.4 Check out our file structure!

    +

    Your new analysis folder should contain:

    +
      +
    • The example analysis .Rmd you downloaded
    • A folder called data (currently empty)
    • A folder for plots (currently empty)
    • A folder for results (currently empty)
    +

    Your example analysis folder should contain your .Rmd and three empty folders (which won’t be empty for long!).

    +

    If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.

    +
    +
    +
    +

    3 Using a different refine.bio dataset with this analysis?

    +

    The file we use here has two columns from our WGCNA module results: the id of each gene and the module it is part of. If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend replacing the gene_module_url with a different file path to read in a similar table of genes with the information that you are interested in. If your gene table differs, many steps will need to be changed or deleted entirely depending on the contents of that file (particularly in the Determine our genes of interest list section).

    +

    We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.

    +
    + +

     

    +
    +
    +

    4 Over-Representation Analysis with clusterProfiler - RNA-seq

    +
    +

    4.1 Install libraries

    +

    See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

    +

    In this analysis, we will be using the clusterProfiler package to perform ORA and the msigdbr package, which contains gene sets from the Molecular Signatures Database (MSigDB) already in the tidy format required by clusterProfiler (Yu et al. 2012; Dolgalev 2020; Subramanian et al. 2005; Liberzon et al. 2011).

    +

    We will also need the org.Hs.eg.db package (Carlson 2020) to perform gene identifier conversion and ggupset to make an UpSet plot (Ahlmann-Eltze 2020).

    +
    if (!("clusterProfiler" %in% installed.packages())) {
    +  # Install this package if it isn't installed yet
    +  BiocManager::install("clusterProfiler", update = FALSE)
    +}
    +
    +# This is required to make one of the plots that clusterProfiler will make
    +if (!("ggupset" %in% installed.packages())) {
    +  # Install this package if it isn't installed yet
    +  BiocManager::install("ggupset", update = FALSE)
    +}
    +
    +if (!("msigdbr" %in% installed.packages())) {
    +  # Install this package if it isn't installed yet
    +  BiocManager::install("msigdbr", update = FALSE)
    +}
    +
    +if (!("org.Hs.eg.db" %in% installed.packages())) {
    +  # Install this package if it isn't installed yet
    +  BiocManager::install("org.Hs.eg.db", update = FALSE)
    +}
    +

    Attach the packages we need for this analysis.

    +
    # Attach the library
    +library(clusterProfiler)
    +
    +# Package that contains MSigDB gene sets in tidy format
    +library(msigdbr)
    +
    +# Homo sapiens annotation package we'll use for gene identifier conversion
    +library(org.Hs.eg.db)
    +
    +# We will need this so we can use the pipe: %>%
    +library(magrittr)
    +
    +
    +

    4.2 Download data file

    +

    For ORA, we only need a list of genes of interest and a background gene set as our input, so this example can work for any situation where you have a gene list and want to know more about what biological pathways are significantly represented.

    +

    For this example, we will read in results from a co-expression network analysis that we have already performed. Rather than reading from a local file, we will download the results table directly from a URL. These results are from an acute bronchiolitis experiment we used for an example WGCNA analysis (Langfelder and Horvath 2008).

    +

    The table contains two columns, one with Ensembl gene IDs, and the other with the name of the module they are a part of. We will perform ORA on one of the modules we identified in the WGCNA analysis but the rest of the genes will be used as “background genes”.

    +

    Instead of using the URL below, you can use a file path to a TSV file with your desired gene list. First we will assign the URL to its own variable called gene_module_url.

    +
    # Define the url to your gene list file
    +gene_module_url <- "https://refinebio-examples.s3.us-east-2.amazonaws.com/04-advanced-topics/results/SRP140558_wgcna_gene_to_module.tsv"
    +

    We will also declare a file path to where we want this file to be downloaded to and we can use the same file path later for reading the file into R.

    +
    gene_module_file <- file.path(
    +  results_dir,
    +  "SRP140558_wgcna_gene_to_module.tsv"
    +)
    +

    Using the URL (gene_module_url) and file path (gene_module_file) we can download the file and use the destfile argument to specify where it should be saved.

    +
    download.file(
    +  gene_module_url,
    +  # The file will be saved to this location and with this name
    +  destfile = gene_module_file
    +)
    +

    Now let’s double check that the results file is in the right place.

    +
    # Check if the file exists
    +file.exists(gene_module_file)
    +
    ## [1] TRUE
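If you rerun this notebook often, you may prefer not to download the file every time. Here is a minimal sketch of one way to do that. The helper name `download_if_missing()` is hypothetical and is not part of the original notebook; it only wraps the `download.file()` and `file.exists()` calls used above.

```r
# Hypothetical helper (not part of the original notebook): download a file
# only when it is not already present at the destination path
download_if_missing <- function(url, destfile) {
  if (!file.exists(destfile)) {
    download.file(url, destfile = destfile)
  }
  # Report whether the file is now in place
  file.exists(destfile)
}

# Usage with the variables defined above (left commented out so this sketch
# does not trigger a download on its own):
# download_if_missing(gene_module_url, gene_module_file)
```

The function returns `TRUE` when the file is in place, so it can stand in for the separate `file.exists()` check.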
    +
    +
    +

    4.3 Import data

    +

    Read in the file that has WGCNA gene and module results.

    +
    # Read in the contents of the WGCNA gene modules file
    +gene_module_df <- readr::read_tsv(gene_module_file)
    +
    ## 
    +## ── Column specification ──────────────────────────────────────────────
    +## cols(
    +##   gene = col_character(),
    +##   module = col_character()
    +## )
    +

    Note that read_tsv() can also read TSV files directly from a URL and doesn’t necessarily require you to download the file first. If you wanted to use that feature, you could replace the call above with readr::read_tsv(gene_module_url) and skip the download steps.

    +

    Let’s take a look at this gene module table.

    +
    gene_module_df
    +
    + +
    +
    +
    +

    4.4 Getting familiar with MSigDB gene sets available via msigdbr

    +

    The Molecular Signatures Database (MSigDB) is a resource that contains annotated gene sets that can be used for pathway or gene set analyses (Subramanian et al. 2005; Liberzon et al. 2011). We can use the msigdbr package to access these gene sets in a format compatible with the package we’ll use for analysis, clusterProfiler (Yu et al. 2012; Dolgalev 2020).

    +

    The gene sets available directly from MSigDB are applicable to human studies. msigdbr also supports commonly studied model organisms.

    +

    Let’s take a look at what organisms the package supports with msigdbr_species().

    +
    msigdbr_species()
    +
    + +
    +

    The data we’re interested in here comes from human samples, so we can obtain only the gene sets relevant to H. sapiens with the species argument to msigdbr().

    +
    hs_msigdb_df <- msigdbr(species = "Homo sapiens")
    +

    MSigDB contains 8 different gene set collections (Subramanian et al. 2005; Liberzon et al. 2011) that are distinguished by how they are derived (e.g., computationally mined, curated). In this example, we will use pathways that are gene sets considered to be “canonical representations of a biological process compiled by domain experts” and are a subset of C2: curated gene sets (Subramanian et al. 2005; Liberzon et al. 2011).

    +

    Specifically, we will use the KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways (Kanehisa and Goto 2000).

    +

    First, let’s take a look at what information is included in this data frame.

    +
    head(hs_msigdb_df)
    +
    + +
    +

    We will need to use the gs_cat and gs_subcat columns to construct a filter step that will only keep curated gene sets and KEGG pathways.

    +
    # Filter the human data frame to the KEGG pathways that are included in the
    +# curated gene sets
    +hs_kegg_df <- hs_msigdb_df %>%
    +  dplyr::filter(
    +    gs_cat == "C2", # This is to filter only to the C2 curated gene sets
    +    gs_subcat == "CP:KEGG" # This is because we only want KEGG pathways
    +  )
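Before filtering, it can help to see which collection and subcollection labels are present. The sketch below tallies them on a small made-up data frame, `toy_msigdb_df`, which stands in for `hs_msigdb_df`. Its rows are invented, and it assumes the `gs_cat` and `gs_subcat` column names used in this example; those names may differ in other versions of `msigdbr`.

```r
# Toy stand-in for `hs_msigdb_df` (made-up rows; only the column names mirror
# the msigdbr output used above)
toy_msigdb_df <- data.frame(
  gs_cat = c("C2", "C2", "C2", "H", "C5"),
  gs_subcat = c("CP:KEGG", "CP:KEGG", "CP:REACTOME", "", "GO:BP"),
  gs_name = c(
    "KEGG_CELL_CYCLE", "KEGG_DNA_REPLICATION",
    "REACTOME_CELL_CYCLE", "HALLMARK_E2F_TARGETS",
    "GOBP_CELL_DIVISION"
  ),
  stringsAsFactors = FALSE
)

# Count rows for each collection/subcollection pair
collection_counts <- as.data.frame(
  table(gs_cat = toy_msigdb_df$gs_cat, gs_subcat = toy_msigdb_df$gs_subcat),
  stringsAsFactors = FALSE
)

# Keep only the combinations that actually occur
collection_counts <- collection_counts[collection_counts$Freq > 0, ]
collection_counts
```

In the real table each row is a gene set and gene pair, so these counts are numbers of rows, not numbers of gene sets.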
    +

    The clusterProfiler function we will use, enricher(), requires a data frame with two columns, where one column contains the term identifier or name and one column contains gene identifiers that match the gene lists we want to check for enrichment.

    +

    Our data frame with KEGG terms contains Entrez IDs and gene symbols.

    +

    In our gene module data frame, gene_module_df, we have Ensembl gene identifiers. So we will need to convert our Ensembl IDs into either gene symbols or Entrez IDs.

    +
    +
    +

    4.5 Gene identifier conversion

    +

    We’re going to convert our identifiers in gene_module_df to Entrez IDs, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers!

    +

    The annotation package org.Hs.eg.db contains information for different identifiers (Carlson 2020). org.Hs.eg.db is specific to Homo sapiens – this is what the Hs in the package name is referencing.

    +

    Take a look at our other gene identifier conversion examples for examples with different species and gene ID types: the microarray example and the RNA-seq example.

    +

    We can see what types of IDs are available to us in an annotation package with keytypes().

    +
    keytypes(org.Hs.eg.db)
    +
    ##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
    +##  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
    +##  [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
    +## [13] "IPI"          "MAP"          "OMIM"         "ONTOLOGY"    
    +## [17] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
    +## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
    +## [25] "UNIGENE"      "UNIPROT"
    +

    Even though we’ll use this package to convert from Ensembl gene IDs (ENSEMBL) to Entrez IDs (ENTREZID), we could just as easily use it to convert from an Ensembl transcript ID (ENSEMBLTRANS) to gene symbols (SYMBOL).

    +

    The function we will use to map from Ensembl gene IDs to Entrez IDs is called mapIds() and comes from the AnnotationDbi package.

    +
    # This returns a named vector which we can convert to a data frame, where
    +# the keys (Ensembl IDs) are the names
    +entrez_vector <- mapIds(
    +  # Replace with annotation package for the organism relevant to your data
    +  org.Hs.eg.db,
    +  # The vector of gene identifiers we want to map
    +  keys = gene_module_df$gene,
    +  # Replace with the type of gene identifiers in your data
    +  keytype = "ENSEMBL",
    +  # Replace with the type of gene identifiers you would like to map to
    +  column = "ENTREZID",
    +  # In the case of 1:many mappings, return the
    +  # first one. This is default behavior!
    +  multiVals = "first"
    +)
    +
    ## 'select()' returned 1:many mapping between keys and columns
    +

    This message is letting us know that sometimes Ensembl gene identifiers will map to multiple Entrez IDs. In this case, it’s also possible that an Entrez ID will map to multiple Ensembl IDs. For more about how to explore this, take a look at our RNA-seq gene ID conversion example.
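To make the 1:many situation concrete, here is a minimal sketch on a made-up mapping table. It uses base R only and does not call `mapIds()`; the identifiers and the `toy_` names are invented for illustration. It only mimics the idea of keeping the first match for each key.

```r
# Toy mapping table (made-up identifiers) where one Ensembl ID maps to two
# Entrez IDs and another has no Entrez ID at all
toy_map <- data.frame(
  ensembl_id = c("ENSG_A", "ENSG_A", "ENSG_B", "ENSG_C"),
  entrez_id = c("101", "102", "201", NA),
  stringsAsFactors = FALSE
)

# Keep the first Entrez ID listed for each Ensembl ID, which mimics the idea
# behind `multiVals = "first"`
first_match <- toy_map[!duplicated(toy_map$ensembl_id), ]

# Build a named vector shaped like the `mapIds()` output: values are Entrez
# IDs and names are Ensembl IDs
toy_entrez_vector <- setNames(first_match$entrez_id, first_match$ensembl_id)
toy_entrez_vector
```

The unmapped identifier is kept with an `NA` value, which is why the next step filters out missing Entrez IDs.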

    +

    Let’s create a two column data frame that shows the Ensembl IDs and their Entrez IDs side-by-side.

    +
    # We would like a data frame we can join to the gene module data frame
    +gene_key_df <- data.frame(
    +  ensembl_id = names(entrez_vector),
    +  entrez_id = entrez_vector,
    +  stringsAsFactors = FALSE
    +) %>%
    +  # If an Ensembl gene identifier doesn't map to an Entrez ID, drop that
    +  # from the data frame
    +  dplyr::filter(!is.na(entrez_id))
    +

    Let’s see a preview of gene_key_df.

    +
    head(gene_key_df)
    +
    + +
    +

    Now we are ready to join gene_key_df to our data frame with the module labels, gene_module_df. We use dplyr::left_join() with gene_key_df on the left because we only want to retain the genes that have Entrez IDs; joining on the Ensembl gene identifiers drops any row of gene_module_df that does not have an Entrez ID.

    +
    module_annot_df <- gene_key_df %>%
    +  # Using a left join removes the rows without Entrez IDs because those
    +  # rows have already been removed in `gene_key_df`
    +  dplyr::left_join(gene_module_df,
    +    # The name of the column that contains the Ensembl gene IDs
    +    # in the left data frame and right data frame
    +    by = c("ensembl_id" = "gene")
    +  )
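The effect of the join direction is easier to see on a small example. The sketch below uses made-up identifiers and base R's `merge()` with `all.x = TRUE`, which keeps every row of the left data frame in the same way a left join does. The `toy_` data frames are invented for illustration, and `merge()` is swapped in for `dplyr::left_join()` only to keep the sketch free of package dependencies.

```r
# Toy versions of the two data frames (made-up identifiers)
toy_gene_key_df <- data.frame(
  ensembl_id = c("ENSG_A", "ENSG_B"),
  entrez_id = c("101", "201"),
  stringsAsFactors = FALSE
)
toy_gene_module_df <- data.frame(
  gene = c("ENSG_A", "ENSG_B", "ENSG_C"),
  module = c("ME19", "ME2", "ME19"),
  stringsAsFactors = FALSE
)

# `all.x = TRUE` keeps every row of the left data frame, like a left join;
# ENSG_C has no row in the left data frame, so it is dropped
toy_module_annot_df <- merge(
  toy_gene_key_df,
  toy_gene_module_df,
  by.x = "ensembl_id",
  by.y = "gene",
  all.x = TRUE
)
toy_module_annot_df
```

Swapping which data frame is on the left would instead keep `ENSG_C` with a missing Entrez ID.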
    +

    Let’s take a look at what this data frame looks like.

    +
    # Print out a preview
    +head(module_annot_df)
    +
    + +
    +
    +
    +

    4.6 Over-representation Analysis (ORA)

    +

    Over-representation testing using clusterProfiler is based on a hypergeometric test, which is equivalent to a one-sided Fisher’s exact test (Yu 2020). For more background, this handy tutorial explains how hypergeometric tests work (Puthier and van Helden 2015).
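As a rough illustration of the test for a single pathway, the sketch below computes an over-representation p-value two ways with base R: from the upper tail of the hypergeometric distribution and from a one-sided Fisher's exact test. All counts are made up for illustration, and this is not the code that `clusterProfiler` runs internally.

```r
# Made-up counts for one hypothetical pathway
n_background <- 1000 # genes in the background set
n_pathway <- 50 # background genes annotated to the pathway
n_interest <- 100 # genes of interest
n_overlap <- 12 # genes of interest that are in the pathway

# Probability of seeing 12 or more pathway genes among 100 genes drawn at
# random from the background (upper tail of the hypergeometric distribution)
p_hyper <- phyper(
  n_overlap - 1,
  n_pathway,
  n_background - n_pathway,
  n_interest,
  lower.tail = FALSE
)

# The same question posed as a one-sided Fisher's exact test on a 2 x 2 table
contingency <- matrix(
  c(
    n_overlap, n_interest - n_overlap,
    n_pathway - n_overlap,
    n_background - n_pathway - n_interest + n_overlap
  ),
  nrow = 2,
  dimnames = list(
    pathway = c("in_pathway", "not_in_pathway"),
    gene_list = c("of_interest", "not_of_interest")
  )
)
p_fisher <- fisher.test(contingency, alternative = "greater")$p.value

# The two p-values agree
all.equal(p_hyper, p_fisher)
```

With these counts we would expect about 5 pathway genes by chance (100 x 50 / 1000), so observing 12 gives a small p-value.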

    +

    We will need to provide two gene lists to clusterProfiler:

    +
      +
    1. Our genes of interest
    2. The genes in our total background set (all genes that originally had an opportunity to be measured).
    +
    +
    +

    4.7 Determine our genes of interest list

    +

    This step is highly variable depending on the format of your gene table, what information it contains and what your goals are. You may want to delete this next chunk entirely if you supply an already determined list of genes OR you may need to introduce cutoffs and filters that we don’t need here, given the nature of our data.

    +

    Here, we will focus on one module, module 19, to identify pathways associated with it. We previously identified this module as differentially expressed between our dataset’s two time points (during acute illness and during recovery). See the previous section for more background on the structure and content of the data table we are using.

    +
    module_19_genes <- module_annot_df %>%
    +  dplyr::filter(module == "ME19") %>%
    +  dplyr::pull("entrez_id")
    +

    Because one entrez_id may map to multiple Ensembl IDs, we need to make sure we have no repeated Entrez IDs in this list.

    +
    # Reduce to only unique Entrez IDs
    +genes_of_interest <- unique(as.character(module_19_genes))
    +
    +# Let's print out some of these genes
    +head(genes_of_interest)
    +
    ## [1] "5704"  "578"   "23471" "5255"  "4171"  "8898"
    +
    +
    +

    4.8 Determine our background set gene list

    +

    Sometimes folks consider genes from the entire genome to comprise the background, but for this RNA-seq example, we will consider all detectable genes as our background set. The dataset that these genes were selected from already had unreliably detected, low count genes removed. Because of this, we can obtain our detected genes list from our data frame, module_annot_df (which we have not done any further filtering on in this notebook).

    +
    # Remove any duplicated entrez_ids
    +background_set <- unique(as.character(module_annot_df$entrez_id))
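Because the genes of interest are drawn from the same table as the background, every gene of interest should also be in the background set. A minimal sketch of that sanity check, using made-up identifiers in place of the real `genes_of_interest` and `background_set` vectors:

```r
# Toy identifiers (made up) standing in for the real vectors
toy_genes_of_interest <- c("101", "201", "301")
toy_background_set <- c("101", "201", "301", "401", "501")

# Every gene of interest should also be in the background set
all_in_background <- all(toy_genes_of_interest %in% toy_background_set)
all_in_background

# List any genes of interest that are missing from the background
missing_genes <- setdiff(toy_genes_of_interest, toy_background_set)
missing_genes
```

If you supply your own gene list and background, a non-empty `missing_genes` vector is a sign that the two lists came from different sources or use different identifier types.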
    +
    +
    +

    4.9 Run ORA using the enricher() function

    +

    Now that we have our background set, our genes of interest, and our pathway information, we’re ready to run ORA using the enricher() function.

    +
    kegg_ora_results <- enricher(
    +  gene = genes_of_interest, # A vector of your genes of interest
    +  pvalueCutoff = 0.1, # Can choose an FDR cutoff
    +  pAdjustMethod = "BH", # Method to be used for multiple testing correction
    +  universe = background_set, # A vector containing your background set genes
    +  # The pathway information should be a data frame with a term name or
    +  # identifier and the gene identifiers
    +  TERM2GENE = dplyr::select(
    +    hs_kegg_df,
    +    gs_name,
    +    human_entrez_gene
    +  )
    +)
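To show what the `pAdjustMethod = "BH"` setting refers to, here is a small sketch of the Benjamini-Hochberg adjustment using base R's `p.adjust()`. The raw p-values are made up and do not come from this analysis.

```r
# Made-up raw p-values for five hypothetical gene sets
raw_p <- c(0.001, 0.01, 0.03, 0.04, 0.5)

# Benjamini-Hochberg adjustment, the method named by `pAdjustMethod = "BH"`
adj_p <- p.adjust(raw_p, method = "BH")
adj_p

# Positions of the gene sets that pass a cutoff of 0.1 on the adjusted values
passing <- which(adj_p < 0.1)
passing
```

The adjusted values are never smaller than the raw ones, so a gene set can pass a raw p-value cutoff and still fail the FDR cutoff.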
    +

    Note: using enrichKEGG() is a shortcut for doing ORA using KEGG, but the approach we covered here can be used with any gene sets you’d like!

    +

    The information we’re most likely interested in is in the result slot. Let’s convert this into a data frame that we can write to file.

    +
    kegg_result_df <- data.frame(kegg_ora_results@result)
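The results table stores `GeneRatio` and `BgRatio` as text such as `"12/100"`. If you would like a numeric fold enrichment, one option is sketched below on a made-up stand-in for `kegg_result_df`. The helper name `ratio_to_numeric()` and the `fold_enrichment` column are hypothetical and not part of `clusterProfiler` or the original notebook.

```r
# Toy stand-in for `kegg_result_df` with made-up values; the real table has
# more columns
toy_result_df <- data.frame(
  ID = c("KEGG_CELL_CYCLE", "KEGG_DNA_REPLICATION"),
  GeneRatio = c("12/100", "6/100"),
  BgRatio = c("50/1000", "20/1000"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn "numerator/denominator" strings into numbers
ratio_to_numeric <- function(ratio) {
  parts <- strsplit(ratio, "/", fixed = TRUE)
  vapply(parts, function(x) as.numeric(x[1]) / as.numeric(x[2]), numeric(1))
}

# Fold enrichment: fraction of our genes in the set divided by the fraction
# of background genes in the set
toy_result_df$fold_enrichment <-
  ratio_to_numeric(toy_result_df$GeneRatio) /
    ratio_to_numeric(toy_result_df$BgRatio)
toy_result_df
```

A fold enrichment above 1 means the gene set makes up a larger share of the genes of interest than of the background.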
    +

    Let’s print out a sneak peek of the results here and take a look at how many gene sets we have using an FDR cutoff of 0.1.

    +
    kegg_result_df %>%
    +  dplyr::filter(p.adjust < 0.1)
    +
    + +
    +

    Looks like there are four KEGG sets returned as significant at FDR of 0.1.

    +
    +
    +

    4.10 Visualizing results

    +

    We can use a dot plot to visualize our significant enrichment results. The enrichplot::dotplot() function will only plot gene sets that are significant according to the multiple testing corrected p values (in the p.adjust column) and the pvalueCutoff you provided in the enricher() step.

    +
    enrich_plot <- enrichplot::dotplot(kegg_ora_results)
    +
    ## wrong orderBy parameter; set to default `orderBy = "x"`
    +
    # Print out the plot here
    +enrich_plot
    +

    +

    Use ?enrichplot::dotplot to see the help page for more about how to use this function.

    +

    This plot is arguably more useful when we have a large number of significant pathways.

    +

    Let’s save it to a PNG.

    +
    ggplot2::ggsave(file.path(plots_dir, "SRP140558_ora_enrich_plot_module_19.png"),
    +  plot = enrich_plot
    +)
    +
    ## Saving 7 x 5 in image
    +

    We can use an UpSet plot to visualize the overlap between the gene sets that were returned as significant.

    +
    upset_plot <- enrichplot::upsetplot(kegg_ora_results)
    +
    +# Print out the plot here
    +upset_plot
    +

    +

    See that KEGG_CELL_CYCLE and KEGG_OOCYTE_MEIOSIS have genes in common, as do KEGG_CELL_CYCLE and KEGG_DNA_REPLICATION. Gene sets or pathways aren’t independent! Based on the context of your samples, you may be able to narrow down which ones make sense. In this instance, we are dealing with PBMCs, so the oocyte meiosis pathway is not relevant to the biology of the samples at hand, and all of the identified genes in that pathway are also part of the cell cycle pathway.
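If you want to put a number on how much two gene sets overlap, one simple option is the Jaccard index. The sketch below uses two made-up gene sets, not the KEGG sets from this analysis, and the `toy_gene_sets` name is invented for illustration.

```r
# Made-up gene identifiers for two hypothetical gene sets
toy_gene_sets <- list(
  SET_A = c("101", "102", "103", "104"),
  SET_B = c("103", "104", "105")
)

# Genes shared by both sets
shared_genes <- intersect(toy_gene_sets$SET_A, toy_gene_sets$SET_B)
shared_genes

# Jaccard index: shared genes divided by all distinct genes across both sets
jaccard <- length(shared_genes) /
  length(union(toy_gene_sets$SET_A, toy_gene_sets$SET_B))
jaccard
```

A value near 1 means the two sets are almost the same list of genes, which is one reason overlapping pathways tend to show up as significant together.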

    +

    Let’s also save this to a PNG.

    +
    ggplot2::ggsave(file.path(plots_dir, "SRP140558_ora_upset_plot_module_19.png"),
    +  plot = upset_plot
    +)
    +
    ## Saving 7 x 5 in image
    +
    +
    +

    4.11 Write results to file

    +
    readr::write_tsv(
    +  kegg_result_df,
    +  file.path(
    +    results_dir,
    +    "SRP140558_module_19_pathway_analysis_results.tsv"
    +  )
    +)
    +
    +
    +
    +

    5 Resources for further learning

    + +
    +
    +

    6 Session info

    +

    At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

    +
    # Print session info
    +sessioninfo::session_info()
    +
    ## ─ Session info ─────────────────────────────────────────────────────
    +##  setting  value                       
    +##  version  R version 4.0.2 (2020-06-22)
    +##  os       Ubuntu 20.04 LTS            
    +##  system   x86_64, linux-gnu           
    +##  ui       X11                         
    +##  language (EN)                        
    +##  collate  en_US.UTF-8                 
    +##  ctype    en_US.UTF-8                 
    +##  tz       Etc/UTC                     
    +##  date     2020-12-21                  
    +## 
    +## ─ Packages ─────────────────────────────────────────────────────────
    +##  package         * version  date       lib source        
    +##  AnnotationDbi   * 1.52.0   2020-10-27 [1] Bioconductor  
    +##  assertthat        0.2.1    2019-03-21 [1] RSPM (R 4.0.0)
    +##  backports         1.1.10   2020-09-15 [1] RSPM (R 4.0.2)
    +##  Biobase         * 2.50.0   2020-10-27 [1] Bioconductor  
    +##  BiocGenerics    * 0.36.0   2020-10-27 [1] Bioconductor  
    +##  BiocManager       1.30.10  2019-11-16 [1] RSPM (R 4.0.0)
    +##  BiocParallel      1.24.1   2020-11-06 [1] Bioconductor  
    +##  bit               4.0.4    2020-08-04 [1] RSPM (R 4.0.2)
    +##  bit64             4.0.5    2020-08-30 [1] RSPM (R 4.0.2)
    +##  blob              1.2.1    2020-01-20 [1] RSPM (R 4.0.0)
    +##  cli               2.1.0    2020-10-12 [1] RSPM (R 4.0.2)
    +##  clusterProfiler * 3.18.0   2020-10-27 [1] Bioconductor  
    +##  colorspace        1.4-1    2019-03-18 [1] RSPM (R 4.0.0)
    +##  cowplot           1.1.0    2020-09-08 [1] RSPM (R 4.0.2)
    +##  crayon            1.3.4    2017-09-16 [1] RSPM (R 4.0.0)
    +##  data.table        1.13.0   2020-07-24 [1] RSPM (R 4.0.2)
    +##  DBI               1.1.0    2019-12-15 [1] RSPM (R 4.0.0)
    +##  digest            0.6.25   2020-02-23 [1] RSPM (R 4.0.0)
    +##  DO.db             2.9      2020-12-16 [1] Bioconductor  
    +##  DOSE              3.16.0   2020-10-27 [1] Bioconductor  
    +##  downloader        0.4      2015-07-09 [1] RSPM (R 4.0.0)
    +##  dplyr             1.0.2    2020-08-18 [1] RSPM (R 4.0.2)
    +##  ellipsis          0.3.1    2020-05-15 [1] RSPM (R 4.0.0)
    +##  enrichplot        1.10.1   2020-11-14 [1] Bioconductor  
    +##  evaluate          0.14     2019-05-28 [1] RSPM (R 4.0.0)
    +##  fansi             0.4.1    2020-01-08 [1] RSPM (R 4.0.0)
    +##  farver            2.0.3    2020-01-16 [1] RSPM (R 4.0.0)
    +##  fastmatch         1.1-0    2017-01-28 [1] RSPM (R 4.0.0)
    +##  fgsea             1.16.0   2020-10-27 [1] Bioconductor  
    +##  generics          0.0.2    2018-11-29 [1] RSPM (R 4.0.0)
    +##  getopt            1.20.3   2019-03-22 [1] RSPM (R 4.0.0)
    +##  ggforce           0.3.2    2020-06-23 [1] RSPM (R 4.0.2)
    +##  ggplot2           3.3.2    2020-06-19 [1] RSPM (R 4.0.1)
    +##  ggraph            2.0.3    2020-05-20 [1] RSPM (R 4.0.2)
    +##  ggrepel           0.8.2    2020-03-08 [1] RSPM (R 4.0.2)
    +##  ggupset           0.3.0    2020-05-05 [1] RSPM (R 4.0.0)
    +##  glue              1.4.2    2020-08-27 [1] RSPM (R 4.0.2)
    +##  GO.db             3.12.1   2020-12-16 [1] Bioconductor  
    +##  GOSemSim          2.16.1   2020-10-29 [1] Bioconductor  
    +##  graphlayouts      0.7.0    2020-04-25 [1] RSPM (R 4.0.2)
    +##  gridExtra         2.3      2017-09-09 [1] RSPM (R 4.0.0)
    +##  gtable            0.3.0    2019-03-25 [1] RSPM (R 4.0.0)
    +##  hms               0.5.3    2020-01-08 [1] RSPM (R 4.0.0)
    +##  htmltools         0.5.0    2020-06-16 [1] RSPM (R 4.0.1)
    +##  igraph            1.2.6    2020-10-06 [1] RSPM (R 4.0.2)
    +##  IRanges         * 2.24.1   2020-12-12 [1] Bioconductor  
    +##  jsonlite          1.7.1    2020-09-07 [1] RSPM (R 4.0.2)
    +##  knitr             1.30     2020-09-22 [1] RSPM (R 4.0.2)
    +##  labeling          0.3      2014-08-23 [1] RSPM (R 4.0.0)
    +##  lattice           0.20-41  2020-04-02 [2] CRAN (R 4.0.2)
    +##  lifecycle         0.2.0    2020-03-06 [1] RSPM (R 4.0.0)
    +##  magrittr        * 1.5      2014-11-22 [1] RSPM (R 4.0.0)
    +##  MASS              7.3-51.6 2020-04-26 [2] CRAN (R 4.0.2)
    +##  Matrix            1.2-18   2019-11-27 [2] CRAN (R 4.0.2)
    +##  memoise           1.1.0    2017-04-21 [1] RSPM (R 4.0.0)
    +##  msigdbr         * 7.2.1    2020-10-02 [1] RSPM (R 4.0.2)
    +##  munsell           0.5.0    2018-06-12 [1] RSPM (R 4.0.0)
    +##  optparse        * 1.6.6    2020-04-16 [1] RSPM (R 4.0.0)
    +##  org.Hs.eg.db    * 3.12.0   2020-12-16 [1] Bioconductor  
    +##  pillar            1.4.6    2020-07-10 [1] RSPM (R 4.0.2)
    +##  pkgconfig         2.0.3    2019-09-22 [1] RSPM (R 4.0.0)
    +##  plyr              1.8.6    2020-03-03 [1] RSPM (R 4.0.2)
    +##  polyclip          1.10-0   2019-03-14 [1] RSPM (R 4.0.0)
    +##  ps                1.4.0    2020-10-07 [1] RSPM (R 4.0.2)
    +##  purrr             0.3.4    2020-04-17 [1] RSPM (R 4.0.0)
    +##  qvalue            2.22.0   2020-10-27 [1] Bioconductor  
    +##  R.cache           0.14.0   2019-12-06 [1] RSPM (R 4.0.0)
    +##  R.methodsS3       1.8.1    2020-08-26 [1] RSPM (R 4.0.2)
    +##  R.oo              1.24.0   2020-08-26 [1] RSPM (R 4.0.2)
    +##  R.utils           2.10.1   2020-08-26 [1] RSPM (R 4.0.2)
    +##  R6                2.4.1    2019-11-12 [1] RSPM (R 4.0.0)
    +##  RColorBrewer      1.1-2    2014-12-07 [1] RSPM (R 4.0.0)
    +##  Rcpp              1.0.5    2020-07-06 [1] RSPM (R 4.0.2)
    +##  readr             1.4.0    2020-10-05 [1] RSPM (R 4.0.2)
    +##  rematch2          2.1.2    2020-05-01 [1] RSPM (R 4.0.0)
    +##  reshape2          1.4.4    2020-04-09 [1] RSPM (R 4.0.2)
    +##  rlang             0.4.8    2020-10-08 [1] RSPM (R 4.0.2)
    +##  rmarkdown         2.4      2020-09-30 [1] RSPM (R 4.0.2)
    +##  RSQLite           2.2.1    2020-09-30 [1] RSPM (R 4.0.2)
    +##  rstudioapi        0.11     2020-02-07 [1] RSPM (R 4.0.0)
    +##  rvcheck           0.1.8    2020-03-01 [1] RSPM (R 4.0.0)
    +##  S4Vectors       * 0.28.1   2020-12-09 [1] Bioconductor  
    +##  scales            1.1.1    2020-05-11 [1] RSPM (R 4.0.0)
    +##  scatterpie        0.1.5    2020-09-09 [1] RSPM (R 4.0.2)
    +##  sessioninfo       1.1.1    2018-11-05 [1] RSPM (R 4.0.0)
    +##  shadowtext        0.0.7    2019-11-06 [1] RSPM (R 4.0.0)
    +##  stringi           1.5.3    2020-09-09 [1] RSPM (R 4.0.2)
    +##  stringr           1.4.0    2019-02-10 [1] RSPM (R 4.0.0)
    +##  styler            1.3.2    2020-02-23 [1] RSPM (R 4.0.0)
    +##  tibble            3.0.4    2020-10-12 [1] RSPM (R 4.0.2)
    +##  tidygraph         1.2.0    2020-05-12 [1] RSPM (R 4.0.2)
    +##  tidyr             1.1.2    2020-08-27 [1] RSPM (R 4.0.2)
    +##  tidyselect        1.1.0    2020-05-11 [1] RSPM (R 4.0.0)
    +##  tweenr            1.0.1    2018-12-14 [1] RSPM (R 4.0.2)
    +##  vctrs             0.3.4    2020-08-29 [1] RSPM (R 4.0.2)
    +##  viridis           0.5.1    2018-03-29 [1] RSPM (R 4.0.0)
    +##  viridisLite       0.3.0    2018-02-01 [1] RSPM (R 4.0.0)
    +##  withr             2.3.0    2020-09-22 [1] RSPM (R 4.0.2)
    +##  xfun              0.18     2020-09-29 [1] RSPM (R 4.0.2)
    +##  yaml              2.2.1    2020-02-01 [1] RSPM (R 4.0.0)
    +## 
    +## [1] /usr/local/lib/R/site-library
    +## [2] /usr/local/lib/R/library
    +
    +
    +

    References

    +
    +
    +

    Ahlmann-Eltze C., 2020 ggupset: Combination matrix axis for ’ggplot2’ to create ’upset’ plots. https://github.com/const-ae/ggupset

    +
    +
    +

    Carlson M., 2020 org.Hs.eg.db: Genome wide annotation for human. http://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html

    +
    +
    +

    Dolgalev I., 2020 msigdbr: MSigDB gene sets for multiple organisms in a tidy data format. https://cran.r-project.org/web/packages/msigdbr/index.html

    +
    +
    +

    Kanehisa M., and S. Goto, 2000 KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research 28: 27–30. https://doi.org/10.1093/nar/28.1.27

    +
    +
    +

    Khatri P., M. Sirota, and A. J. Butte, 2012 Ten years of pathway analysis: Current approaches and outstanding challenges. PLOS Computational Biology 8: e1002375. https://doi.org/10.1371/journal.pcbi.1002375

    +
    +
    +

    Langfelder P., and S. Horvath, 2008 WGCNA: An R package for weighted correlation network analysis. BMC Bioinformatics 9. https://doi.org/10.1186/1471-2105-9-559

    +
    +
    +

    Liberzon A., A. Subramanian, R. Pinchback, H. Thorvaldsdóttir, and P. Tamayo et al., 2011 Molecular signatures database (MSigDB) 3.0. Bioinformatics 27: 1739–1740. https://doi.org/10.1093/bioinformatics/btr260

    +
    +
    +

    Puthier D., and J. van Helden, 2015 Statistics for Bioinformatics - Practicals - Gene enrichment statistics. https://dputhier.github.io/ASG/practicals/go_statistics_td/go_statistics_td_2015.html

    +
    +
    +

    Subramanian A., P. Tamayo, V. K. Mootha, S. Mukherjee, and B. L. Ebert et al., 2005 Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 102: 15545–15550. https://doi.org/10.1073/pnas.0506580102

    +
    +
    +

    Yaari G., C. R. Bolen, J. Thakar, and S. H. Kleinstein, 2013 Quantitative set analysis for gene expression: A method to quantify gene set differential expression including gene-gene correlations. Nucleic Acids Research 41: e170. https://doi.org/10.1093/nar/gkt660

    +
    +
    +

    Yu G., L.-G. Wang, Y. Han, and Q.-Y. He, 2012 clusterProfiler: An R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 16: 284–287. https://doi.org/10.1089/omi.2011.0118

    +
    +
    +

    Yu G., 2020 clusterProfiler: Universal enrichment tool for functional and comparative study. http://yulab-smu.top/clusterProfiler-book/index.html

    +
    +
    +
    + + + + +
    +
    + +
    + + + + + + + + + + + + + + + + diff --git a/03-rnaseq/pathway-analysis_rnaseq_02_gsea.Rmd b/03-rnaseq/pathway-analysis_rnaseq_02_gsea.Rmd new file mode 100644 index 00000000..d8c4315a --- /dev/null +++ b/03-rnaseq/pathway-analysis_rnaseq_02_gsea.Rmd @@ -0,0 +1,589 @@ +--- +title: "Gene set enrichment analysis - RNA-seq" +author: "CCDL for ALSF" +date: "December 2020" +output: + html_notebook: + toc: true + toc_float: true + number_sections: true +--- + +# Purpose of this analysis + +This example is part of the pathway analysis module set; we recommend looking at the [pathway analysis table below](#how-to-choose-a-pathway-analysis) to help you determine which pathway analysis method is best suited for your purposes. + +This particular example analysis shows how you can use Gene Set Enrichment Analysis (GSEA) to detect situations where genes in a predefined gene set or pathway change in a coordinated way between two conditions [@Subramanian2005]. +Changes at the pathway level may be statistically significant, and contribute to phenotypic differences, even if the changes in the expression level of individual genes are small. + +⬇️ [**Jump to the analysis code**](#analysis) ⬇️ + +### What is pathway analysis? + +Pathway analysis refers to any one of many techniques that use predetermined sets of genes that are related or coordinated in their expression in some way (e.g., participate in the same molecular process, are regulated by the same transcription factor) to interpret a high-throughput experiment. +In the context of [refine.bio](https://www.refine.bio/), we use these techniques to analyze and interpret genome-wide gene expression experiments. +The rationale for performing pathway analysis is that looking at the pathway level may be more biologically meaningful than considering individual genes, especially if a large number of genes are differentially expressed between conditions of interest. 
+In addition, many relatively small changes in the expression values of genes in the same pathway could lead to a phenotypic outcome and these small changes may go undetected in differential gene expression analysis. + +We highly recommend taking a look at [Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002375) from @Khatri2012 for a more comprehensive overview. +We have provided primary publications and documentation of the methods we will introduce below as well as some recommended reading in the [`Resources for further learning` section](#resources-for-further-learning). + +### How to choose a pathway analysis? + +This table summarizes the pathway analysis examples in this module.
+
+|Analysis|What is required for input|What output looks like |✅ Pros| ⚠️ Cons|
+|--------|--------------------------|-----------------------|-------|-------|
+|[**ORA (Over-representation Analysis)**](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_01_ora.html)|A list of gene IDs (no stats needed)|A per-pathway hypergeometric test result|- Simple<br>- Inexpensive computationally to calculate p-values|- Requires arbitrary thresholds and ignores any statistics associated with a gene<br>- Assumes independence of genes and pathways|
+|[**GSEA (Gene Set Enrichment Analysis)**](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html)|A list of gene IDs with gene-level summary statistics|A per-pathway enrichment score|- Includes all genes (no arbitrary threshold!)<br>- Attempts to measure coordination of genes|- Permutations can be expensive<br>- Does not account for pathway overlap<br>- Two-group comparisons not always appropriate/feasible|
+|[**GSVA (Gene Set Variation Analysis)**](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_03_gsva.html)|A gene expression matrix (like what you get from refine.bio directly)|Pathway-level scores on a per-sample basis|- Does not require two groups to compare upfront<br>- Normally distributed scores|- Scores are not a good fit for gene sets that contain genes that go up AND down<br>- Method doesn’t assign statistical significance itself<br>- Recommended sample size n > 10|
+
+# How to run this example + +For general information about our tutorials and the basic software packages you will need, please see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-this-tutorial-is-structured). +We recommend taking a look at our [Resources for Learning R](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#resources-for-learning-r) if you have not written code in R before. + +## Obtain the `.Rmd` file + +To run this example yourself, [download the `.Rmd` for this analysis by clicking this link](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_02_gsea.Rmd). + +Clicking this link will most likely send this to your downloads folder on your computer. +Move this `.Rmd` file to where you would like this example and its files to be stored. + +You can open this `.Rmd` file in RStudio and follow the rest of these steps from there. (See our [section about getting started with R notebooks](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) if you are unfamiliar with `.Rmd` files.) + +## Set up your analysis folders + +Good file organization is helpful for keeping your data analysis project on track! +We have set up some code that will automatically set up a folder structure for you. +Run this next chunk to set up your folders! + +If you have trouble running this chunk, see our [introduction to using `.Rmd`s](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) for more resources and explanations. 
+ +```{r} +# Create the data folder if it doesn't exist +if (!dir.exists("data")) { + dir.create("data") +} + +# Define the file path to the plots directory +plots_dir <- "plots" # Can replace with path to desired output plots directory + +# Create the plots folder if it doesn't exist +if (!dir.exists(plots_dir)) { + dir.create(plots_dir) +} + +# Define the file path to the results directory +results_dir <- "results" # Can replace with path to desired output results directory + +# Create the results folder if it doesn't exist +if (!dir.exists(results_dir)) { + dir.create(results_dir) +} +``` + +In the same place you put this `.Rmd` file, you should now have three new empty folders called `data`, `plots`, and `results`! + +## Obtain the gene set for this example + +In this example, we are using the differential expression results table we obtained from an [example analysis of an acute lymphoblastic leukemia (ALL) mouse lymphoid cell model](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/differential-expression_rnaseq_01.html) using the `DESeq2` package [@Love2014]. +The table contains summary statistics including Ensembl gene IDs, log2 fold change values, and adjusted p-values (FDR in this case). + +We have provided this file for you and the code in this notebook will read in the results that are stored online, but if you'd like to follow the steps for obtaining this results file yourself, we suggest going through [that differential expression analysis example](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/differential-expression_rnaseq_01.html). + +## About the dataset we are using for this example + +For this example analysis, we are using RNA-seq data from an [acute lymphoblastic leukemia (ALL) mouse lymphoid cell model](https://www.refine.bio/experiments/SRP123625) [@Kampen2019]. +All lymphoid mouse cells have human RPL10 but three of the mice have a knock-in R98S mutated RPL10 and three have the human reference RPL10. 
+Differential expression was performed using these mutated and reference RPL10 gene designations. + +## Check out our file structure! + +Your new analysis folder should contain: + +- The example analysis `.Rmd` you downloaded +- A folder called `data` (currently empty) +- A folder for `plots` (currently empty) +- A folder for `results` (currently empty) + +Your example analysis folder should contain your `.Rmd` and three empty folders (which won't be empty for long!). + +If the concept of a "file path" is unfamiliar to you, we recommend taking a look at our [section about file paths](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#an-important-note-about-file-paths-and-Rmds). + +# Using a different refine.bio dataset with this analysis? + +If you'd like to adapt an example analysis to use a different dataset from [refine.bio](https://www.refine.bio/), we recommend placing the files in the `data/` directory you created and changing the filenames and paths in the notebook to match these files (we've put comments to signify where you would need to change the code). +We suggest saving plots and results to `plots/` and `results/` directories, respectively, as these are automatically created by the notebook. +From here you can customize this analysis example to fit your own scientific questions and preferences. + +*** + +   + +# Gene set enrichment analysis - RNA-seq + +## Install libraries + +See our Getting Started page with [instructions for package installation](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#what-you-need-to-install) for a list of the other software you will need, as well as more tips and resources. 
+ +In this analysis, we will be using the [`clusterProfiler`](https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html) package to perform GSEA and the [`msigdbr`](https://cran.r-project.org/web/packages/msigdbr/index.html) package, which contains gene sets from the [Molecular Signatures Database (MSigDB)](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp) already in the tidy format required by `clusterProfiler` [@Yu2012; @Dolgalev2020; @Subramanian2005; @Liberzon2011]. + +We will also need the [`org.Mm.eg.db`](https://bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html) package to perform gene identifier conversion [@Carlson2019-mouse]. + +```{r} +if (!("clusterProfiler" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("clusterProfiler", update = FALSE) +} + +if (!("msigdbr" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("msigdbr", update = FALSE) +} + +if (!("org.Mm.eg.db" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("org.Mm.eg.db", update = FALSE) +} +``` + +Attach the packages we need for this analysis. + +```{r message=FALSE} +# Attach the library +library(clusterProfiler) + +# Package that contains MSigDB gene sets in tidy format +library(msigdbr) + +# Mouse annotation package we'll use for gene identifier conversion +library(org.Mm.eg.db) + +# We will need this so we can use the pipe: %>% +library(magrittr) +``` + +## Download data file + +We will download the differential expression results from an online source and then read them in. +These results are from an acute lymphoblastic leukemia (ALL) mouse lymphoid cell model we used for [differential expression analysis](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/differential-expression_rnaseq_01.html) using `DESeq2` [@Love2014]. 
+The table contains summary statistics including Ensembl gene IDs, log2 fold change values, and adjusted p-values (FDR in this case). +We will rank all of the genes in these results by their log2 fold change and use that ranked list as input to GSEA. + +Instead of using the URL below, you can use a file path to a TSV file with your own differential expression results. +First we will assign the URL to its own variable called `dge_url`. + +```{r} +# Define the url to your differential expression results file +dge_url <- "https://refinebio-examples.s3.us-east-2.amazonaws.com/03-rnaseq/results/SRP123625/SRP123625_differential_expression_results.tsv" +``` + +We will also declare a file path for where we want this file to be downloaded; we can use the same file path later for reading the file into R. + +```{r} +dge_results_file <- file.path( + results_dir, + "SRP123625_differential_expression_results.tsv" +) +``` + +Using the URL (`dge_url`) and file path (`dge_results_file`) we can download the file and use the `destfile` argument to specify where it should be saved. + +```{r} +download.file( + dge_url, + # The file will be saved to this location and with this name + destfile = dge_results_file +) +``` + +Now let's double check that the results file is in the right place. + +```{r} +# Check if the results file exists +file.exists(dge_results_file) +``` + +## Import data + +Read in the file that has differential expression results. + +```{r} +# Read in the contents of the differential expression results file +dge_df <- readr::read_tsv(dge_results_file) +``` + +Note that `read_tsv()` can also read TSV files directly from a URL and doesn't necessarily require you to download the file first. +If you wanted to use that feature, you could replace the call above with `readr::read_tsv(dge_url)` and skip the download steps. + +Let's take a look at what the contrast results from the differential expression analysis look like. 
+ +```{r} +dge_df +``` + +## Getting familiar with MSigDB gene sets available via `msigdbr` + +The Molecular Signatures Database (MSigDB) is a resource that contains annotated gene sets that can be used for pathway or gene set analyses [@Subramanian2005; @Liberzon2011]. +We can use the `msigdbr` package to access these gene sets in a format compatible with the package we'll use for analysis, `clusterProfiler` [@Dolgalev2020; @Yu2012]. + +The gene sets available directly from MSigDB are applicable to human studies. +`msigdbr` also supports commonly studied model organisms. + +Let's take a look at what organisms the package supports with `msigdbr_species()`. + +```{r} +msigdbr_species() +``` + +MSigDB contains [8 different gene set collections](https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp) [@Subramanian2005; @Liberzon2011] that are distinguished by how they are derived (e.g., computationally mined, curated). + +In this example, we will use a collection called Hallmark gene sets for GSEA [@Liberzon2015]. +Here's an excerpt of [the collection description from MSigDB](https://www.gsea-msigdb.org/gsea/msigdb/collection_details.jsp#H): + +> Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. +> These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression. +> The hallmarks reduce noise and redundancy and provide a better delineated biological space for GSEA. + +Notably, there are only 50 gene sets included in this collection. +The fewer gene sets we test, the lower our multiple hypothesis testing burden. + +The data we're interested in here comes from mouse samples, so we can obtain only the Hallmarks gene sets relevant to _M. musculus_ by specifying `category = "H"` and `species = "Mus musculus"`, respectively, to the `msigdbr()` function. 
+ +```{r} +mm_hallmark_sets <- msigdbr( + species = "Mus musculus", # Replace with species name relevant to your data + category = "H" +) +``` + +If you run the chunk above without specifying a `category` to the `msigdbr()` function, it will return all of the MSigDB gene sets for mouse. +See `?msigdbr` for more options. + +Let's preview what's in `mm_hallmark_sets`. + +```{r rownames.print=FALSE} +head(mm_hallmark_sets) +``` + +Looks like we have a data frame of gene sets with associated gene symbols and Entrez IDs. + +In our differential expression results data frame, `dge_df`, we have Ensembl gene identifiers. +So we will need to convert our Ensembl IDs into either gene symbols or Entrez IDs for GSEA. + +## Gene identifier conversion + +We're going to convert our identifiers in `dge_df` to gene symbols, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers! + +The annotation package `org.Mm.eg.db` contains information for different identifiers [@Carlson2019-mouse]. +`org.Mm.eg.db` is specific to _Mus musculus_ -- this is what the `Mm` in the package name is referencing. + +We can see what types of IDs are available to us in an annotation package with `keytypes()`. + +```{r} +keytypes(org.Mm.eg.db) +``` + +Even though we'll use this package to convert from Ensembl gene IDs (`ENSEMBL`) to gene symbols (`SYMBOL`), we could just as easily use it to convert to and from any of these `keytypes()` listed above. + +The function we will use to map from Ensembl gene IDs to gene symbols is called `mapIds()` and comes from the `AnnotationDbi` package. + +Let's create a data frame that shows the mapped gene symbols along with the differential expression stats for the respective Ensembl IDs. 
+ +```{r} +# First let's create a mapped data frame we can join to the differential +# expression stats +dge_mapped_df <- data.frame( + gene_symbol = mapIds( + # Replace with annotation package for the organism relevant to your data + org.Mm.eg.db, + keys = dge_df$Gene, + # Replace with the type of gene identifiers in your data + keytype = "ENSEMBL", + # Replace with the type of gene identifiers you would like to map to + column = "SYMBOL", + # This will keep only the first mapped value for each Ensembl ID + multiVals = "first" + ) +) %>% + # If an Ensembl gene identifier doesn't map to a gene symbol, drop that + # from the data frame + dplyr::filter(!is.na(gene_symbol)) %>% + # Make an `Ensembl` column to store the rownames + tibble::rownames_to_column("Ensembl") %>% + # Now let's join the rest of the expression data + dplyr::inner_join(dge_df, by = c("Ensembl" = "Gene")) +``` + +This `1:many mapping between keys and columns` message means that some Ensembl gene identifiers map to multiple gene symbols. +In this case, it's also possible that a gene symbol will map to multiple Ensembl IDs. +For the purpose of performing GSEA later in this notebook, we keep only the first mapped IDs. +Take a look at our other gene identifier conversion examples for examples with different species and gene ID types: +[the microarray example](https://alexslemonade.github.io/refinebio-examples/02-microarray/gene-id-annotation_microarray_01_ensembl.html) and [the RNA-seq example](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.html). + +Let's see a preview of `dge_mapped_df`. + +```{r rownames.print=FALSE} +head(dge_mapped_df) +``` + +## Perform gene set enrichment analysis (GSEA) + +The goal of GSEA is to detect situations where many genes in a gene set change in a coordinated way, even when individual changes are small in magnitude [@Subramanian2005]. 
+ +GSEA calculates a pathway-level metric, called an enrichment score (sometimes abbreviated as ES), by ranking genes by a gene-level statistic. +This score reflects whether or not a gene set or pathway is overrepresented at the top or bottom of the gene rankings [@Subramanian2005; @clusterProfiler-book]. +Specifically, genes are ranked from most positive to most negative based on their statistic. +A running sum is then calculated, starting with the most highly ranked genes: the score increases when a gene is in the pathway and decreases when it is not. +In this example, the enrichment score for a pathway is the running sum's maximum deviation from zero. +GSEA also assesses statistical significance of the scores for each pathway through permutation testing. +As a result, each input pathway will have a p-value associated with it that is then corrected for multiple hypothesis testing [@Subramanian2005; @clusterProfiler-book]. + +The implementation of GSEA we use in this example requires a gene list ordered by some statistic (here we'll use log2 fold changes calculated as part of differential gene expression analysis) and input gene sets (Hallmark collection). +When you use previously computed gene-level statistics with GSEA, it is called GSEA pre-ranked. + +### Determine our pre-ranked genes list + +The `GSEA()` function takes a pre-ranked and sorted named vector of statistics, where the names in the vector are gene identifiers. +It requires _unique gene identifiers_ to produce the most accurate results, so we will need to resolve any duplicates found in our dataset. +(The `GSEA()` function will throw a warning if we do not do this ahead of time.) + +Let's check to see if we have any gene symbols that mapped to multiple Ensembl IDs. + +```{r} +any(duplicated(dge_mapped_df$gene_symbol)) +``` + +Looks like we do have duplicated gene symbols. +Let's find out which ones. 
+ +```{r} +dup_gene_symbols <- dge_mapped_df %>% + dplyr::filter(duplicated(gene_symbol)) %>% + dplyr::pull(gene_symbol) +``` + +Now let's take a look at the rows associated with the duplicated gene symbols. + +```{r} +dge_mapped_df %>% + dplyr::filter(gene_symbol %in% dup_gene_symbols) %>% + dplyr::arrange(gene_symbol) +``` + +We can see that the associated values vary for each row. + +As we mentioned earlier, we will want to remove duplicated gene identifiers in preparation for the `GSEA()` step. +For each duplicated gene symbol, let's keep the row associated with the higher absolute value of the log2 fold change. +GSEA relies on genes' rankings on the basis of a gene-level statistic and the enrichment score that is calculated reflects the degree to which genes in a gene set are overrepresented in the top or bottom of the rankings [@Subramanian2005; @clusterProfiler-book]. + +Retaining the instance of the gene symbol with the higher absolute value of a gene-level statistic means that we will retain the value that is likely to be more highly- or lowly-ranked or, put another way, the values less likely to be towards the middle of the ranked gene list. +We should keep this decision in mind when interpreting our results. +For example, if all the duplicate identifiers happened to be in a particular gene set, we may get an overly optimistic view of how perturbed that gene set is because we preferentially selected instances of the identifier that have a higher absolute value of the statistic used for ranking. + +We are removing values for 33 out of thousands of genes here, so it is unlikely to have a considerable impact on our results. + +In the next chunk, we are going to filter out the duplicated rows using the `dplyr::distinct()` function. +Because we first sort by the absolute value of the log2 fold change, this will keep the first row for each duplicated gene symbol, which is the row with the highest absolute value of the log2 fold change. 
+ +```{r} +filtered_dge_mapped_df <- dge_mapped_df %>% + # Sort so that the highest absolute values of the log2 fold change are at the + # top + dplyr::arrange(dplyr::desc(abs(log2FoldChange))) %>% + # Filter out the duplicated rows using `dplyr::distinct()` + dplyr::distinct(gene_symbol, .keep_all = TRUE) +``` + +_Note that the log2 fold change estimates we use here have been subject to shrinkage to account for genes with low counts or highly variable counts. +See the [`DESeq2` package vignette](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) for more information on how `DESeq2` handles the log2 fold change values with the `lfcShrink()` function._ + +Let's check to see that we removed the duplicate gene symbols and kept the rows with the higher absolute value of the log2 fold change. + +```{r} +any(duplicated(filtered_dge_mapped_df$gene_symbol)) +``` + +Looks like we were able to successfully get rid of the duplicate gene identifiers and keep the observations with the higher absolute value of the log2 fold change! + +In the next chunk, we will create a named vector ranked based on the gene-level log2 fold change values. + +```{r} +# Let's create a named vector ranked based on the log2 fold change values +lfc_vector <- filtered_dge_mapped_df$log2FoldChange +names(lfc_vector) <- filtered_dge_mapped_df$gene_symbol + +# We need to sort the log2 fold change values in descending order here +lfc_vector <- sort(lfc_vector, decreasing = TRUE) +``` + +Let's preview our pre-ranked named vector. + +```{r} +# Look at first entries of the ranked log2 fold change vector +head(lfc_vector) +``` + +### Run GSEA using the `GSEA()` function + +Genes were ranked from most positive to most negative, weighted according to their gene-level statistic, in the previous section. +In this section, we will implement GSEA to calculate the enrichment score for each gene set using our pre-ranked gene list. 
+ +The GSEA algorithm utilizes random sampling so we are going to set the seed to make our results reproducible. + +```{r} +# Set the seed so our results are reproducible: +set.seed(2020) +``` + +We can use the `GSEA()` function to perform GSEA with any generic set of gene sets, but there are several functions for using specific, commonly used gene sets (e.g., `gseKEGG()`). + +Significance is assessed by permuting the gene labels of the pre-ranked gene list and recomputing the enrichment scores of the gene set for the permuted data, which generates a null distribution for the enrichment score. +The `pAdjustMethod` argument to `GSEA()` below specifies what method to use for adjusting the p-values to account for multiple hypothesis testing; the `pvalueCutoff` argument tells the function to only return pathways with adjusted p-values less than that threshold in the `result` slot. + +```{r} +gsea_results <- GSEA( + geneList = lfc_vector, # Ordered ranked gene list + minGSSize = 25, # Minimum gene set size + maxGSSize = 500, # Maximum gene set size + pvalueCutoff = 0.05, # p-value cutoff + eps = 0, # Boundary for calculating the p value + seed = TRUE, # Set seed to make results reproducible + pAdjustMethod = "BH", # Benjamini-Hochberg correction + TERM2GENE = dplyr::select( + mm_hallmark_sets, + gs_name, + gene_symbol + ) +) +``` + +The warning message above tells us that there are a few genes that have the same log2 fold change value and they are therefore ranked equally. +`fgsea`, the method that underlies `GSEA()`, will arbitrarily choose which comes first in the ranked list [@gene-set-testing-rnaseq]. +This percentage of `0.19` is small, so we are not concerned that it will significantly affect our results. +If the percentage were much larger, on the other hand, we would be concerned about the log2 fold change results. + +Let's take a look at the table in the `result` slot of `gsea_results`. 
+ +```{r rownames.print=FALSE} +# We can access the results from our `gsea_results` object using `@result` +head(gsea_results@result) +``` + +Looks like we have gene sets returned as significant at FDR (false discovery rate) of `0.05`. +If we did not have results that met the `pvalueCutoff` condition, this table would be empty. +If we wanted all results returned we would need to set the `pvalueCutoff = 1`. + +The `NES` column contains the normalized enrichment score, which normalizes for the gene set size, for that pathway. + +Let's convert the contents of `result` into a data frame that we can use for further analysis and write to a file later. + +```{r} +gsea_result_df <- data.frame(gsea_results@result) +``` + +## Visualizing results + +We can visualize GSEA results for individual pathways or gene sets using `enrichplot::gseaplot()`. +Let's take a look at 2 different pathways -- one with a highly positive NES and one with a highly negative NES -- to get more insight into how ES are calculated. + +### Most Positive NES + +Let's look at the 3 gene sets with the most positive NES. + +```{r rownames.print=FALSE} +gsea_result_df %>% + # This returns the 3 rows with the largest NES values + dplyr::slice_max(NES, n = 3) +``` + +The gene set `HALLMARK_MYC_TARGETS_V2` has the most positive NES score. + +```{r} +most_positive_nes_plot <- enrichplot::gseaplot( + gsea_results, + geneSetID = "HALLMARK_MYC_TARGETS_V2", + title = "HALLMARK_MYC_TARGETS_V2", + color.line = "#0d76ff" +) +most_positive_nes_plot +``` + +Notice how the genes that are in the gene set, indicated by the black bars, tend to be on the left side of the graph indicating that they have positive gene-level scores. +The red dashed line indicates the enrichment score, which is the maximum deviation from zero. 
+As mentioned earlier, an enrichment score is calculated by starting with the most highly ranked genes (according to the gene-level log2 fold change values) and increasing the score when a gene is in the pathway and decreasing the score when a gene is not in the pathway. + +The plots returned by `enrichplot::gseaplot` are ggplots, so we can use `ggplot2::ggsave()` to save them to file. + +Let's save to PNG. + +```{r} +ggplot2::ggsave(file.path(plots_dir, "SRP123625_gsea_enrich_positive_plot.png"), + plot = most_positive_nes_plot +) +``` + +### Most Negative NES + +Let's look for the 3 gene sets with the most negative NES. + +```{r rownames.print = FALSE} +gsea_result_df %>% + # Return the 3 rows with the smallest (most negative) NES values + dplyr::slice_min(NES, n = 3) +``` + +The gene set `HALLMARK_HYPOXIA` has the most negative NES. + +```{r} +most_negative_nes_plot <- enrichplot::gseaplot( + gsea_results, + geneSetID = "HALLMARK_HYPOXIA", + title = "HALLMARK_HYPOXIA", + color.line = "#0d76ff" +) +most_negative_nes_plot +``` + +This gene set shows the opposite pattern -- genes in the pathway tend to be on the right side of the graph. +Again, the red dashed line here indicates the maximum deviation from zero, in other words, the enrichment score. +A _negative_ enrichment score will be returned when many genes are near the bottom of the ranked list. + +Let's save this plot to PNG as well. + +```{r} +ggplot2::ggsave(file.path(plots_dir, "SRP123625_gsea_enrich_negative_plot.png"), + plot = most_negative_nes_plot +) +``` + +## Write results to file + +```{r} +readr::write_tsv( + gsea_result_df, + file.path( + results_dir, + "SRP123625_gsea_results.tsv" + ) +) +``` + +# Resources for further learning + +- [clusterProfiler paper](https://doi.org/10.1089/omi.2011.0118) [@Yu2012]. +- [clusterProfiler book](https://yulab-smu.github.io/clusterProfiler-book/index.html) [@clusterProfiler-book]. 
+- [This handy review](https://doi.org/10.1371/journal.pcbi.1002375) which summarizes the different types of pathway analysis and their limitations [@Khatri2012]. +- See this [Broad Institute page](https://www.gsea-msigdb.org/gsea/index.jsp) for more on GSEA and MSigDB [@GSEA-broad-institute]. + +# Session info + +At the end of every analysis, before saving your notebook, we recommend printing out your session info. +This helps make your code more reproducible by recording what versions of software and packages you used to run this. + +```{r} +# Print session info +sessioninfo::session_info() +``` + +# References diff --git a/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html b/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html new file mode 100644 index 00000000..22a664a4 --- /dev/null +++ b/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html @@ -0,0 +1,4483 @@ + + + + + + + + + + + + + + +Gene set enrichment analysis - RNA-seq + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    + + + +
    +
    +
    +
    +
    + +
    + + + + + + + + + +
    +

    1 Purpose of this analysis

    +

    This example is part of the pathway analysis module set. We recommend looking at the pathway analysis table below to help you determine which pathway analysis method is best suited for your purposes.

    +

    This particular example analysis shows how you can use Gene Set Enrichment Analysis (GSEA) to detect situations where genes in a predefined gene set or pathway change in a coordinated way between two conditions (Subramanian et al. 2005). Changes at the pathway-level may be statistically significant, and contribute to phenotypic differences, even if the changes in the expression level of individual genes are small.

    +

    ⬇️ Jump to the analysis code ⬇️

    +
    +

    1.0.1 What is pathway analysis?

    +

    Pathway analysis refers to any one of many techniques that uses predetermined sets of genes that are related or coordinated in their expression in some way (e.g., participate in the same molecular process, are regulated by the same transcription factor) to interpret a high-throughput experiment. In the context of refine.bio, we use these techniques to analyze and interpret genome-wide gene expression experiments. The rationale for performing pathway analysis is that looking at the pathway-level may be more biologically meaningful than considering individual genes, especially if a large number of genes are differentially expressed between conditions of interest. In addition, many relatively small changes in the expression values of genes in the same pathway could lead to a phenotypic outcome and these small changes may go undetected in differential gene expression analysis.

    +

    We highly recommend taking a look at Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges from Khatri et al. (2012) for a more comprehensive overview. We have provided primary publications and documentation of the methods we will introduce below as well as some recommended reading in the Resources for further learning section.

    +
    +
    +

    1.0.2 How to choose a pathway analysis?

    +

    This table summarizes the pathway analysis examples in this module.

    + +++++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    | Analysis | What is required for input | What output looks like | ✅ Pros | ⚠️ Cons |
    |----------|----------------------------|------------------------|---------|---------|
    | ORA (Over-representation Analysis) | A list of gene IDs (no stats needed) | A per-pathway hypergeometric test result | - Simple<br>- Inexpensive computationally to calculate p-values | - Requires arbitrary thresholds and ignores any statistics associated with a gene<br>- Assumes independence of genes and pathways |
    | GSEA (Gene Set Enrichment Analysis) | A list of gene IDs with gene-level summary statistics | A per-pathway enrichment score | - Includes all genes (no arbitrary threshold!)<br>- Attempts to measure coordination of genes | - Permutations can be expensive<br>- Does not account for pathway overlap<br>- Two-group comparisons not always appropriate/feasible |
    | GSVA (Gene Set Variation Analysis) | A gene expression matrix (like what you get from refine.bio directly) | Pathway-level scores on a per-sample basis | - Does not require two groups to compare upfront<br>- Normally distributed scores | - Scores are not a good fit for gene sets that contain genes that go up AND down<br>- Method doesn’t assign statistical significance itself<br>- Recommended sample size n > 10 |
    +
    +
    +
    +

    2 How to run this example

    +

    For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.

    +
    +

    2.1 Obtain the .Rmd file

    +

    To run this example yourself, download the .Rmd for this analysis by clicking this link.

    +

    Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd file to where you would like this example and its files to be stored.

    +

    You can open this .Rmd file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd files.)

    +
    +
    +

    2.2 Set up your analysis folders

    +

    Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

    +

    If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

    +
    # Create the data folder if it doesn't exist
    +if (!dir.exists("data")) {
    +  dir.create("data")
    +}
    +
    +# Define the file path to the plots directory
    +plots_dir <- "plots" # Can replace with path to desired output plots directory
    +
    +# Create the plots folder if it doesn't exist
    +if (!dir.exists(plots_dir)) {
    +  dir.create(plots_dir)
    +}
    +
    +# Define the file path to the results directory
    +results_dir <- "results" # Can replace with path to desired output results directory
    +
    +# Create the results folder if it doesn't exist
    +if (!dir.exists(results_dir)) {
    +  dir.create(results_dir)
    +}
    +

    In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

    +
    +
    +

    2.3 Obtain the gene set for this example

    +

    In this example, we are using the differential expression results table we obtained from an example analysis of an acute lymphoblastic leukemia (ALL) mouse lymphoid cell model using the DESeq2 package (Love et al. 2014). The table contains summary statistics including Ensembl gene IDs, log2 fold change values, and adjusted p-values (FDR in this case).

    +

    We have provided this file for you and the code in this notebook will read in the results that are stored online, but if you’d like to follow the steps for obtaining this results file yourself, we suggest going through that differential expression analysis example.

    +
    +
    +

    2.4 About the dataset we are using for this example

    +

    For this example analysis, we are using RNA-seq data from an acute lymphoblastic leukemia (ALL) mouse lymphoid cell model (Kampen et al. 2019). All lymphoid mouse cells have human RPL10 but three of the mice have a knock-in R98S mutated RPL10 and three have the human reference RPL10. Differential expression was performed using these mutated and reference RPL10 gene designations.

    +
    +
    +

    2.5 Check out our file structure!

    +

    Your new analysis folder should contain:

    +
      +
    • The example analysis .Rmd you downloaded
      +
    • +
    • A folder called data (currently empty)
    • +
    • A folder for plots (currently empty)
      +
    • +
    • A folder for results (currently empty)
    • +
    +

    Your example analysis folder should contain your .Rmd and three empty folders (which won’t be empty for long!).

    +

    If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.

    +
    +
    +
    +

    3 Using a different refine.bio dataset with this analysis?

    +

    If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/ directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.

    +
    + +

     

    +
    +
    +

    4 Gene set enrichment analysis - RNA-seq

    +
    +

    4.1 Install libraries

    +

    See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

    +

    In this analysis, we will be using the clusterProfiler package to perform GSEA and the msigdbr package which contains gene sets from the Molecular Signatures Database (MSigDB) already in the tidy format required by clusterProfiler (Yu et al. 2012; Dolgalev 2020; Subramanian et al. 2005; Liberzon et al. 2011).

    +

    We will also need the org.Mm.eg.db package to perform gene identifier conversion (Carlson 2019).

    +
    if (!("clusterProfiler" %in% installed.packages())) {
    +  # Install this package if it isn't installed yet
    +  BiocManager::install("clusterProfiler", update = FALSE)
    +}
    +
    +if (!("msigdbr" %in% installed.packages())) {
    +  # Install this package if it isn't installed yet
    +  BiocManager::install("msigdbr", update = FALSE)
    +}
    +
    +if (!("org.Mm.eg.db" %in% installed.packages())) {
    +  # Install this package if it isn't installed yet
    +  BiocManager::install("org.Mm.eg.db", update = FALSE)
    +}
    +

    Attach the packages we need for this analysis.

    +
    # Attach the library
    +library(clusterProfiler)
    +
    +# Package that contains MSigDB gene sets in tidy format
    +library(msigdbr)
    +
    +# Mouse annotation package we'll use for gene identifier conversion
    +library(org.Mm.eg.db)
    +
    +# We will need this so we can use the pipe: %>%
    +library(magrittr)
    +
    +
    +

    4.2 Download data file

    +

    We will download the differential expression results from online and then read them in. These results are from an acute lymphoblastic leukemia (ALL) mouse lymphoid cell model we used for differential expression analysis using DESeq2 (Love et al. 2014). The table contains summary statistics including Ensembl gene IDs, log2 fold change values, and adjusted p-values (FDR in this case). We will use the gene-level statistics in this table to rank genes as input to GSEA.

    +

    Instead of using the URL below, you can use a file path to a TSV file with your desired gene list results. First we will assign the URL to its own variable called dge_url.

    +
    # Define the url to your differential expression results file
    +dge_url <- "https://refinebio-examples.s3.us-east-2.amazonaws.com/03-rnaseq/results/SRP123625/SRP123625_differential_expression_results.tsv"
    +

    We will also declare a file path to where we want this file to be downloaded to and we can use the same file path later for reading the file into R.

    +
    dge_results_file <- file.path(
    +  results_dir,
    +  "SRP123625_differential_expression_results.tsv"
    +)
    +

    Using the URL (dge_url) and file path (dge_results_file) we can download the file and use the destfile argument to specify where it should be saved.

    +
    download.file(
    +  dge_url,
    +  # The file will be saved to this location and with this name
    +  destfile = dge_results_file
    +)
    +

    Now let’s double check that the results file is in the right place.

    +
    # Check if the results file exists
    +file.exists(dge_results_file)
    +
    ## [1] TRUE
    +
    +
    +

    4.3 Import data

    +

    Read in the file that has differential expression results.

    +
    # Read in the contents of the differential expression results file
    +dge_df <- readr::read_tsv(dge_results_file)
    +
    ## 
    +## ── Column specification ──────────────────────────────────────────────
    +## cols(
    +##   Gene = col_character(),
    +##   baseMean = col_double(),
    +##   log2FoldChange = col_double(),
    +##   lfcSE = col_double(),
    +##   pvalue = col_double(),
    +##   padj = col_double(),
    +##   threshold = col_logical()
    +## )
    +

    Note that read_tsv() can also read TSV files directly from a URL and doesn’t necessarily require you download the file first. If you wanted to use that feature, you could replace the call above with readr::read_tsv(dge_url) and skip the download steps.

    +

    Let’s take a look at what the contrast results from the differential expression analysis look like.

    +
    dge_df
    +
    + +
    +
    +
    +

    4.4 Getting familiar with MSigDB gene sets available via msigdbr

    +

    The Molecular Signatures Database (MSigDB) is a resource that contains annotated gene sets that can be used for pathway or gene set analyses (Subramanian et al. 2005; Liberzon et al. 2011). We can use the msigdbr package to access these gene sets in a format compatible with the package we’ll use for analysis, clusterProfiler (Yu et al. 2012; Dolgalev 2020).

    +

    The gene sets available directly from MSigDB are applicable to human studies. msigdbr also supports commonly studied model organisms.

    +

    Let’s take a look at what organisms the package supports with msigdbr_species().

    +
    msigdbr_species()
    +
    + +
    +

    MSigDB contains 8 different gene set collections (Subramanian et al. 2005; Liberzon et al. 2011) that are distinguished by how they are derived (e.g., computationally mined, curated).

    +

    In this example, we will use a collection called Hallmark gene sets for GSEA (Liberzon et al. 2015). Here’s an excerpt of the collection description from MSigDB:

    +
    +

    Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression. The hallmarks reduce noise and redundancy and provide a better delineated biological space for GSEA.

    +
    +

    Notably, there are only 50 gene sets included in this collection. The fewer gene sets we test, the lower our multiple hypothesis testing burden.

    +

    The data we’re interested in here comes from mouse samples, so we can obtain only the Hallmark gene sets relevant to M. musculus by specifying category = "H" and species = "Mus musculus", respectively, to the msigdbr() function.

    +
    mm_hallmark_sets <- msigdbr(
    +  species = "Mus musculus", # Replace with species name relevant to your data
    +  category = "H"
    +)
    +

    If you run the chunk above without specifying a category to the msigdbr() function, it will return all of the MSigDB gene sets for mouse. See ?msigdbr for more options.

    +

    Let’s preview what’s in mm_hallmark_sets.

    +
    head(mm_hallmark_sets)
    +
    + +
    +

    Looks like we have a data frame of gene sets with associated gene symbols and Entrez IDs.

    +

    In our differential expression results data frame, dge_df, we have Ensembl gene identifiers, so we will need to convert our Ensembl IDs into either gene symbols or Entrez IDs for GSEA.

    +
    +
    +

    4.5 Gene identifier conversion

    +

    We’re going to convert our identifiers in dge_df to gene symbols, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers!

    +

    The annotation package org.Mm.eg.db contains information for different identifiers (Carlson 2019). org.Mm.eg.db is specific to Mus musculus – this is what the Mm in the package name is referencing.

    +

    We can see what types of IDs are available to us in an annotation package with keytypes().

    +
    keytypes(org.Mm.eg.db)
    +
    ##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
    +##  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
    +##  [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
    +## [13] "IPI"          "MGI"          "ONTOLOGY"     "ONTOLOGYALL" 
    +## [17] "PATH"         "PFAM"         "PMID"         "PROSITE"     
    +## [21] "REFSEQ"       "SYMBOL"       "UNIGENE"      "UNIPROT"
    +

    Even though we’ll use this package to convert from Ensembl gene IDs (ENSEMBL) to gene symbols (SYMBOL), we could just as easily use it to convert to and from any of these keytypes() listed above.

    +

    The function we will use to map from Ensembl gene IDs to gene symbols is called mapIds() and comes from the AnnotationDbi package.

    +

    Let’s create a data frame that shows the mapped gene symbols along with the differential expression stats for the respective Ensembl IDs.

    +
    # First let's create a mapped data frame we can join to the differential
    +# expression stats
    +dge_mapped_df <- data.frame(
    +  gene_symbol = mapIds(
    +    # Replace with annotation package for the organism relevant to your data
    +    org.Mm.eg.db,
    +    keys = dge_df$Gene,
    +    # Replace with the type of gene identifiers in your data
    +    keytype = "ENSEMBL",
    +    # Replace with the type of gene identifiers you would like to map to
    +    column = "SYMBOL",
    +    # This will keep only the first mapped value for each Ensembl ID
    +    multiVals = "first"
    +  )
    +) %>%
    +  # If an Ensembl gene identifier doesn't map to a gene symbol, drop that
    +  # from the data frame
    +  dplyr::filter(!is.na(gene_symbol)) %>%
    +  # Make an `Ensembl` column to store the rownames
    +  tibble::rownames_to_column("Ensembl") %>%
    +  # Now let's join the rest of the expression data
    +  dplyr::inner_join(dge_df, by = c("Ensembl" = "Gene"))
    +
    ## 'select()' returned 1:many mapping between keys and columns
    +

    This 1:many mapping between keys and columns message means that some Ensembl gene identifiers map to multiple gene symbols. In this case, it’s also possible that a gene symbol will map to multiple Ensembl IDs. For the purpose of performing GSEA later in this notebook, we keep only the first mapped IDs. Take a look at our other gene identifier conversion examples for examples with different species and gene ID types: the microarray example and the RNA-seq example.
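To make the consequence of keeping only the first mapping more concrete, here is a minimal base R sketch on a toy table. The identifiers below are made up for illustration, and `!duplicated()` only mimics the effect of `multiVals = "first"`; it is not how `mapIds()` is implemented.

```r
# Toy mapping table; these Ensembl IDs and symbols are hypothetical
toy_map <- data.frame(
  ensembl = c("ENSMUSG_A", "ENSMUSG_A", "ENSMUSG_B", "ENSMUSG_C"),
  symbol = c("Sym1", "Sym1b", "Sym2", "Sym2"),
  stringsAsFactors = FALSE
)

# Keep only the first symbol listed for each Ensembl ID
first_map <- toy_map[!duplicated(toy_map$ensembl), ]

# Each Ensembl ID now appears only once...
any(duplicated(first_map$ensembl))

# ...but one symbol can still be shared by several Ensembl IDs
any(duplicated(first_map$symbol))
```

In this toy table, two different Ensembl IDs still share the symbol `Sym2` after keeping first mappings, which is the kind of duplication we need to watch for before running GSEA.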

    +

    Let’s see a preview of dge_mapped_df.

    +
    head(dge_mapped_df)
    +
    + +
    +
    +
    +

    4.6 Perform gene set enrichment analysis (GSEA)

    +

    The goal of GSEA is to detect situations where many genes in a gene set change in a coordinated way, even when individual changes are small in magnitude (Subramanian et al. 2005).

    +

    GSEA calculates a pathway-level metric, called an enrichment score (sometimes abbreviated as ES), by ranking genes by a gene-level statistic. This score reflects whether or not a gene set or pathway is overrepresented at the top or bottom of the gene rankings (Yu 2020; Subramanian et al. 2005). Specifically, genes are ranked from most positive to most negative based on their statistic and a running sum is calculated by starting with the most highly ranked genes and increasing the score when a gene is in the pathway and decreasing the score when a gene is not. In this example, the enrichment score for a pathway is the running sum’s maximum deviation from zero. GSEA also assesses statistical significance of the scores for each pathway through permutation testing. As a result, each input pathway will have a p-value associated with it that is then corrected for multiple hypothesis testing (Yu 2020; Subramanian et al. 2005).
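The running sum described above can be sketched in a few lines of base R. This is a simplified, unweighted illustration with made-up gene names and statistics; we are assuming equal step sizes here, whereas the GSEA implementation used later in this notebook also weights steps by the gene-level statistic.

```r
# Toy ranked statistics, already sorted from most positive to most negative
ranked_stats <- c(
  geneA = 3.1, geneB = 2.4, geneC = 1.0,
  geneD = -0.5, geneE = -1.8, geneF = -2.9
)

# Hypothetical gene set
toy_gene_set <- c("geneA", "geneB", "geneD")

# Which ranked genes are in the gene set?
in_set <- names(ranked_stats) %in% toy_gene_set

# Step up when a gene is in the set, step down when it is not
step <- ifelse(in_set, 1 / sum(in_set), -1 / sum(!in_set))
running_sum <- cumsum(step)

# The enrichment score is the running sum's maximum deviation from zero
toy_es <- running_sum[which.max(abs(running_sum))]
toy_es
```

Because the genes in the toy set sit mostly at the top of the ranking, the running sum peaks early and the toy enrichment score is positive.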

    +

    The implementation of GSEA we use in this example requires a gene list ordered by some statistic (here we’ll use log2 fold changes calculated as part of differential gene expression analysis) and input gene sets (Hallmark collection). When you use previously computed gene-level statistics with GSEA, it is called GSEA pre-ranked.

    +
    +

    4.6.1 Determine our pre-ranked genes list

    +

    The GSEA() function takes a pre-ranked and sorted named vector of statistics, where the names in the vector are gene identifiers. It requires unique gene identifiers to produce the most accurate results, so we will need to resolve any duplicates found in our dataset. (The GSEA() function will throw a warning if we do not do this ahead of time.)

    +

    Let’s check to see if we have any gene symbols that mapped to multiple Ensembl IDs.

    +
    any(duplicated(dge_mapped_df$gene_symbol))
    +
    ## [1] TRUE
    +

    Looks like we do have duplicated gene symbols. Let’s find out which ones.

    +
    dup_gene_symbols <- dge_mapped_df %>%
    +  dplyr::filter(duplicated(gene_symbol)) %>%
    +  dplyr::pull(gene_symbol)
    +

    Now let’s take a look at the rows associated with the duplicated gene symbols.

    +
    dge_mapped_df %>%
    +  dplyr::filter(gene_symbol %in% dup_gene_symbols) %>%
    +  dplyr::arrange(gene_symbol)
    +
    + +
    +

    We can see that the associated values vary for each row.

    +

    As we mentioned earlier, we will want to remove duplicated gene identifiers in preparation for the GSEA() step. Let’s keep the instance of each duplicated gene symbol associated with the higher absolute value of the log2 fold change. GSEA relies on genes’ rankings on the basis of a gene-level statistic and the enrichment score that is calculated reflects the degree to which genes in a gene set are overrepresented in the top or bottom of the rankings (Yu 2020; Subramanian et al. 2005).

    +

    Retaining the instance of the gene symbol with the higher absolute value of a gene-level statistic means that we will retain the value that is likely to be more highly- or lowly-ranked or, put another way, the values less likely to be towards the middle of the ranked gene list. We should keep this decision in mind when interpreting our results. For example, if all the duplicate identifiers happened to be in a particular gene set, we may get an overly optimistic view of how perturbed that gene set is because we preferentially selected instances of the identifier that have a higher absolute value of the statistic used for ranking.

    +

    We are removing values for 33 out of thousands of genes here, so it is unlikely to have a considerable impact on our results.

    +

    In the next chunk, we are going to filter out the duplicated rows using the dplyr::distinct() function. Because we first sort by the absolute value of the log2 fold change, this keeps the first row for each duplicated gene symbol, which is the row with the highest absolute value of the log2 fold change.

    +
    filtered_dge_mapped_df <- dge_mapped_df %>%
    +  # Sort so that the highest absolute values of the log2 fold change are at the
    +  # top
    +  dplyr::arrange(dplyr::desc(abs(log2FoldChange))) %>%
    +  # Filter out the duplicated rows using `dplyr::distinct()`
    +  dplyr::distinct(gene_symbol, .keep_all = TRUE)
    +

    Note that the log2 fold change estimates we use here have been subject to shrinkage to account for genes with low counts or highly variable counts. See the DESeq2 package vignette for more information on how DESeq2 handles the log2 fold change values with the lfcShrink() function.

    +

    Let’s check to see that we removed the duplicate gene symbols and kept the rows with the higher absolute value of the log2 fold change.

    +
    any(duplicated(filtered_dge_mapped_df$gene_symbol))
    +
    ## [1] FALSE
    +

    Looks like we were able to successfully get rid of the duplicate gene identifiers and keep the observations with the higher absolute value of the log2 fold change!
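If you'd like to convince yourself that sorting before dropping duplicates keeps the row we want, here is a small base R sketch on a toy data frame. The gene symbols and values are made up, and `order()` plus `!duplicated()` stand in for the `dplyr::arrange()` and `dplyr::distinct()` calls used above.

```r
# Toy differential expression table with one duplicated (hypothetical) symbol
toy_dge <- data.frame(
  gene_symbol = c("GeneX", "GeneX", "GeneY"),
  log2FoldChange = c(0.4, -2.5, 1.2),
  stringsAsFactors = FALSE
)

# Sort so the largest absolute log2 fold changes are at the top
sorted_dge <- toy_dge[order(abs(toy_dge$log2FoldChange), decreasing = TRUE), ]

# Keep only the first row for each gene symbol
dedup_dge <- sorted_dge[!duplicated(sorted_dge$gene_symbol), ]
dedup_dge
```

For `GeneX`, the row with a log2 fold change of `-2.5` is retained rather than the row with `0.4`, because the sign is ignored when sorting.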

    +

    In the next chunk, we will create a named vector ranked based on the gene-level log2 fold change values.

    +
    # Let's create a named vector ranked based on the log2 fold change values
    +lfc_vector <- filtered_dge_mapped_df$log2FoldChange
    +names(lfc_vector) <- filtered_dge_mapped_df$gene_symbol
    +
    +# We need to sort the log2 fold change values in descending order here
    +lfc_vector <- sort(lfc_vector, decreasing = TRUE)
    +

    Let’s preview our pre-ranked named vector.

    +
    # Look at first entries of the ranked log2 fold change vector
    +head(lfc_vector)
    +
    ##   Lpgat1   Lgals7    Gm973     Bbs7     Clnk   Zfp575 
    +## 13.34941 12.64196 12.51824 12.19278 11.52481 10.20900
    +
    +
    +

    4.6.2 Run GSEA using the GSEA() function

    +

    Genes were ranked from most positive to most negative, weighted according to their gene-level statistic, in the previous section. In this section, we will implement GSEA to calculate the enrichment score for each gene set using our pre-ranked gene list.

    +

    The GSEA algorithm utilizes random sampling so we are going to set the seed to make our results reproducible.

    +
    # Set the seed so our results are reproducible:
    +set.seed(2020)
    +

    We can use the GSEA() function to perform GSEA with any generic set of gene sets, but there are several functions for using specific, commonly used gene sets (e.g., gseKEGG()).

    +

    Significance is assessed by permuting the gene labels of the pre-ranked gene list and recomputing the enrichment scores of the gene set for the permuted data, which generates a null distribution for the enrichment score. The pAdjustMethod argument to GSEA() below specifies what method to use for adjusting the p-values to account for multiple hypothesis testing; the pvalueCutoff argument tells the function to only return pathways with adjusted p-values less than that threshold in the results slot.
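To make the multiple testing adjustment and the cutoff more concrete, here is a small sketch using base R's `p.adjust()` on made-up p-values. It assumes the adjustment inside `GSEA()` behaves like `p.adjust()` with the same method name; the object names (`toy_pvalues`, `toy_padj`, `toy_cutoff`) and values are hypothetical.

```r
# Made-up nominal p-values for five hypothetical gene sets
toy_pvalues <- c(set1 = 0.001, set2 = 0.01, set3 = 0.02, set4 = 0.2, set5 = 0.5)

# Benjamini-Hochberg adjustment, the same method name we pass to `pAdjustMethod`
toy_padj <- p.adjust(toy_pvalues, method = "BH")

# Keep only the gene sets whose adjusted p-value is below the cutoff
toy_cutoff <- 0.05
toy_padj[toy_padj < toy_cutoff]
```

In this toy case, three of the five gene sets pass the cutoff after adjustment; the other two would be dropped from the returned table.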

    +
    gsea_results <- GSEA(
    +  geneList = lfc_vector, # Ordered ranked gene list
    +  minGSSize = 25, # Minimum gene set size
    +  maxGSSize = 500, # Maximum gene set size
    +  pvalueCutoff = 0.05, # p-value cutoff
    +  eps = 0, # Boundary for calculating the p value
    +  seed = TRUE, # Set seed to make results reproducible
    +  pAdjustMethod = "BH", # Benjamini-Hochberg correction
    +  TERM2GENE = dplyr::select(
    +    mm_hallmark_sets,
    +    gs_name,
    +    gene_symbol
    +  )
    +)
    +
    ## preparing geneSet collections...
    +
    ## GSEA analysis...
    +
    ## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.19% of the list).
    +## The order of those tied genes will be arbitrary, which may produce unexpected results.
    +
    ## leading edge analysis...
    +
    ## done...
    +

    The warning message above tells us that there are a few genes that have the same log2 fold change value and they are therefore ranked equally. fgsea, the method that underlies GSEA(), will arbitrarily choose which comes first in the ranked list (Ballereau et al. 2018). This percentage of 0.19 is small, so we are not concerned that it will significantly affect our results. If the percentage were much larger, on the other hand, we would be concerned about the log2 fold change results.
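If you would like to inspect ties yourself, one simple approach is sketched below on a made-up ranked vector. This is only an illustration: the names `toy_ranks`, `is_tied`, and `percent_tied` are hypothetical, and this way of counting may not match exactly how fgsea arrives at the percentage in its warning.

```r
# Made-up ranked statistics: geneB and geneC share the same value
toy_ranks <- c(geneA = 2.5, geneB = 1.0, geneC = 1.0, geneD = -0.7, geneE = -3.1)

# Flag every entry whose value occurs more than once in the vector
is_tied <- toy_ranks %in% toy_ranks[duplicated(toy_ranks)]

# Percentage of the ranked list that is involved in a tie
percent_tied <- 100 * mean(is_tied)
percent_tied
```

The same lines could be run on `lfc_vector` to see which genes share a log2 fold change value.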

    +

    Let’s take a look at the table in the result slot of gsea_results.

    +
    # We can access the results from our `gsea_results` object using `@result`
    +head(gsea_results@result)
    +
    + +
    +

    Looks like we have gene sets returned as significant at an FDR (false discovery rate) of 0.05. If we did not have results that met the pvalueCutoff condition, this table would be empty. If we wanted all results returned, we would need to set pvalueCutoff = 1.

    +

    The NES column contains the normalized enrichment score, which normalizes for the gene set size, for that pathway.

    +

    Let’s convert the contents of result into a data frame that we can use for further analysis and write to a file later.

    +
    gsea_result_df <- data.frame(gsea_results@result)
    +
    +
    +
    +

    4.7 Visualizing results

    +

    We can visualize GSEA results for individual pathways or gene sets using enrichplot::gseaplot(). Let’s take a look at 2 different pathways – one with a highly positive NES and one with a highly negative NES – to get more insight into how ES are calculated.

    +
    +

    4.7.1 Most Positive NES

    +

    Let’s look at the 3 gene sets with the most positive NES.

    +
    gsea_result_df %>%
    +  # This returns the 3 rows with the largest NES values
    +  dplyr::slice_max(NES, n = 3)
    +
    + +
    +

    The gene set HALLMARK_MYC_TARGETS_V2 has the most positive NES.

    +
    most_positive_nes_plot <- enrichplot::gseaplot(
    +  gsea_results,
    +  geneSetID = "HALLMARK_MYC_TARGETS_V2",
    +  title = "HALLMARK_MYC_TARGETS_V2",
    +  color.line = "#0d76ff"
    +)
    +most_positive_nes_plot
    +

    +

    Notice how the genes that are in the gene set, indicated by the black bars, tend to be on the left side of the graph, indicating that they have positive gene-level scores. The red dashed line indicates the enrichment score, which is the maximum deviation from zero. As mentioned earlier, an enrichment score is calculated by starting with the most highly ranked genes (according to the gene-level log2 fold change values) and increasing the score when a gene is in the pathway and decreasing the score when a gene is not in the pathway.
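The running-sum idea described above can be sketched in a few lines of R. Note that this is a simplified, unweighted version written only for illustration: GSEA itself weights the steps by the gene-level statistic, and the function name `toy_running_es()` and the gene names are hypothetical.

```r
# Simplified, unweighted running-sum enrichment score (illustration only)
toy_running_es <- function(ranked_genes, gene_set) {
  in_set <- ranked_genes %in% gene_set
  n_hit <- sum(in_set)
  n_miss <- length(ranked_genes) - n_hit
  # The sketch needs at least one gene inside and one outside the set
  stopifnot(n_hit > 0, n_miss > 0)

  # Step up when a gene is in the set, step down when it is not
  steps <- ifelse(in_set, 1 / n_hit, -1 / n_miss)
  running_sum <- cumsum(steps)

  # The enrichment score is the maximum deviation from zero
  running_sum[which.max(abs(running_sum))]
}

# Six made-up genes, already ordered from most positive to most negative
toy_ranked_genes <- paste0("g", 1:6)

# A set at the top of the list gives a positive score
es_top <- toy_running_es(toy_ranked_genes, c("g1", "g2"))

# A set at the bottom of the list gives a negative score
es_bottom <- toy_running_es(toy_ranked_genes, c("g5", "g6"))
```

With these toy inputs, the set at the top of the ranked list reaches its largest deviation on the positive side and the set at the bottom reaches it on the negative side, which mirrors the two plots in this section.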

    +

    The plots returned by enrichplot::gseaplot are ggplots, so we can use ggplot2::ggsave() to save them to file.

    +

    Let’s save to PNG.

    +
    ggplot2::ggsave(file.path(plots_dir, "SRP123625_gsea_enrich_positive_plot.png"),
    +  plot = most_positive_nes_plot
    +)
    +
    ## Saving 7 x 5 in image
    +
    +
    +

    4.7.2 Most Negative NES

    +

    Let’s look for the 3 gene sets with the most negative NES.

    +
    gsea_result_df %>%
    +  # Return the 3 rows with the smallest (most negative) NES values
    +  dplyr::slice_min(NES, n = 3)
    +
    + +
    +

    The gene set HALLMARK_HYPOXIA has the most negative NES.

    +
    most_negative_nes_plot <- enrichplot::gseaplot(
    +  gsea_results,
    +  geneSetID = "HALLMARK_HYPOXIA",
    +  title = "HALLMARK_HYPOXIA",
    +  color.line = "#0d76ff"
    +)
    +most_negative_nes_plot
    +

    +

    This gene set shows the opposite pattern – genes in the pathway tend to be on the right side of the graph. Again, the red dashed line here indicates the maximum deviation from zero, in other words, the enrichment score. A negative enrichment score will be returned when many genes are near the bottom of the ranked list.

    +

    Let’s save this plot to PNG as well.

    +
    ggplot2::ggsave(file.path(plots_dir, "SRP123625_gsea_enrich_negative_plot.png"),
    +  plot = most_negative_nes_plot
    +)
    +
    ## Saving 7 x 5 in image
    +
    +
    +
    +

    4.8 Write results to file

    +
    readr::write_tsv(
    +  gsea_result_df,
    +  file.path(
    +    results_dir,
    +    "SRP123625_gsea_results.tsv"
    +  )
    +)
    +
    +
    +
    +

    5 Resources for further learning

    + +
    +
    +

    6 Session info

    +

    At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

    +
    # Print session info
    +sessioninfo::session_info()
    +
    ## ─ Session info ─────────────────────────────────────────────────────
    +##  setting  value                       
    +##  version  R version 4.0.2 (2020-06-22)
    +##  os       Ubuntu 20.04 LTS            
    +##  system   x86_64, linux-gnu           
    +##  ui       X11                         
    +##  language (EN)                        
    +##  collate  en_US.UTF-8                 
    +##  ctype    en_US.UTF-8                 
    +##  tz       Etc/UTC                     
    +##  date     2020-12-21                  
    +## 
    +## ─ Packages ─────────────────────────────────────────────────────────
    +##  package         * version  date       lib source        
    +##  AnnotationDbi   * 1.52.0   2020-10-27 [1] Bioconductor  
    +##  assertthat        0.2.1    2019-03-21 [1] RSPM (R 4.0.0)
    +##  backports         1.1.10   2020-09-15 [1] RSPM (R 4.0.2)
    +##  Biobase         * 2.50.0   2020-10-27 [1] Bioconductor  
    +##  BiocGenerics    * 0.36.0   2020-10-27 [1] Bioconductor  
    +##  BiocManager       1.30.10  2019-11-16 [1] RSPM (R 4.0.0)
    +##  BiocParallel      1.24.1   2020-11-06 [1] Bioconductor  
    +##  bit               4.0.4    2020-08-04 [1] RSPM (R 4.0.2)
    +##  bit64             4.0.5    2020-08-30 [1] RSPM (R 4.0.2)
    +##  blob              1.2.1    2020-01-20 [1] RSPM (R 4.0.0)
    +##  cli               2.1.0    2020-10-12 [1] RSPM (R 4.0.2)
    +##  clusterProfiler * 3.18.0   2020-10-27 [1] Bioconductor  
    +##  colorspace        1.4-1    2019-03-18 [1] RSPM (R 4.0.0)
    +##  cowplot           1.1.0    2020-09-08 [1] RSPM (R 4.0.2)
    +##  crayon            1.3.4    2017-09-16 [1] RSPM (R 4.0.0)
    +##  data.table        1.13.0   2020-07-24 [1] RSPM (R 4.0.2)
    +##  DBI               1.1.0    2019-12-15 [1] RSPM (R 4.0.0)
    +##  digest            0.6.25   2020-02-23 [1] RSPM (R 4.0.0)
    +##  DO.db             2.9      2020-12-16 [1] Bioconductor  
    +##  DOSE              3.16.0   2020-10-27 [1] Bioconductor  
    +##  downloader        0.4      2015-07-09 [1] RSPM (R 4.0.0)
    +##  dplyr             1.0.2    2020-08-18 [1] RSPM (R 4.0.2)
    +##  ellipsis          0.3.1    2020-05-15 [1] RSPM (R 4.0.0)
    +##  enrichplot        1.10.1   2020-11-14 [1] Bioconductor  
    +##  evaluate          0.14     2019-05-28 [1] RSPM (R 4.0.0)
    +##  fansi             0.4.1    2020-01-08 [1] RSPM (R 4.0.0)
    +##  farver            2.0.3    2020-01-16 [1] RSPM (R 4.0.0)
    +##  fastmatch         1.1-0    2017-01-28 [1] RSPM (R 4.0.0)
    +##  fgsea             1.16.0   2020-10-27 [1] Bioconductor  
    +##  generics          0.0.2    2018-11-29 [1] RSPM (R 4.0.0)
    +##  getopt            1.20.3   2019-03-22 [1] RSPM (R 4.0.0)
    +##  ggforce           0.3.2    2020-06-23 [1] RSPM (R 4.0.2)
    +##  ggplot2           3.3.2    2020-06-19 [1] RSPM (R 4.0.1)
    +##  ggraph            2.0.3    2020-05-20 [1] RSPM (R 4.0.2)
    +##  ggrepel           0.8.2    2020-03-08 [1] RSPM (R 4.0.2)
    +##  glue              1.4.2    2020-08-27 [1] RSPM (R 4.0.2)
    +##  GO.db             3.12.1   2020-12-16 [1] Bioconductor  
    +##  GOSemSim          2.16.1   2020-10-29 [1] Bioconductor  
    +##  graphlayouts      0.7.0    2020-04-25 [1] RSPM (R 4.0.2)
    +##  gridExtra         2.3      2017-09-09 [1] RSPM (R 4.0.0)
    +##  gtable            0.3.0    2019-03-25 [1] RSPM (R 4.0.0)
    +##  hms               0.5.3    2020-01-08 [1] RSPM (R 4.0.0)
    +##  htmltools         0.5.0    2020-06-16 [1] RSPM (R 4.0.1)
    +##  igraph            1.2.6    2020-10-06 [1] RSPM (R 4.0.2)
    +##  IRanges         * 2.24.1   2020-12-12 [1] Bioconductor  
    +##  jsonlite          1.7.1    2020-09-07 [1] RSPM (R 4.0.2)
    +##  knitr             1.30     2020-09-22 [1] RSPM (R 4.0.2)
    +##  labeling          0.3      2014-08-23 [1] RSPM (R 4.0.0)
    +##  lattice           0.20-41  2020-04-02 [2] CRAN (R 4.0.2)
    +##  lifecycle         0.2.0    2020-03-06 [1] RSPM (R 4.0.0)
    +##  magrittr        * 1.5      2014-11-22 [1] RSPM (R 4.0.0)
    +##  MASS              7.3-51.6 2020-04-26 [2] CRAN (R 4.0.2)
    +##  Matrix            1.2-18   2019-11-27 [2] CRAN (R 4.0.2)
    +##  memoise           1.1.0    2017-04-21 [1] RSPM (R 4.0.0)
    +##  msigdbr         * 7.2.1    2020-10-02 [1] RSPM (R 4.0.2)
    +##  munsell           0.5.0    2018-06-12 [1] RSPM (R 4.0.0)
    +##  optparse        * 1.6.6    2020-04-16 [1] RSPM (R 4.0.0)
    +##  org.Mm.eg.db    * 3.12.0   2020-12-16 [1] Bioconductor  
    +##  pillar            1.4.6    2020-07-10 [1] RSPM (R 4.0.2)
    +##  pkgconfig         2.0.3    2019-09-22 [1] RSPM (R 4.0.0)
    +##  plyr              1.8.6    2020-03-03 [1] RSPM (R 4.0.2)
    +##  polyclip          1.10-0   2019-03-14 [1] RSPM (R 4.0.0)
    +##  ps                1.4.0    2020-10-07 [1] RSPM (R 4.0.2)
    +##  purrr             0.3.4    2020-04-17 [1] RSPM (R 4.0.0)
    +##  qvalue            2.22.0   2020-10-27 [1] Bioconductor  
    +##  R.cache           0.14.0   2019-12-06 [1] RSPM (R 4.0.0)
    +##  R.methodsS3       1.8.1    2020-08-26 [1] RSPM (R 4.0.2)
    +##  R.oo              1.24.0   2020-08-26 [1] RSPM (R 4.0.2)
    +##  R.utils           2.10.1   2020-08-26 [1] RSPM (R 4.0.2)
    +##  R6                2.4.1    2019-11-12 [1] RSPM (R 4.0.0)
    +##  RColorBrewer      1.1-2    2014-12-07 [1] RSPM (R 4.0.0)
    +##  Rcpp              1.0.5    2020-07-06 [1] RSPM (R 4.0.2)
    +##  readr             1.4.0    2020-10-05 [1] RSPM (R 4.0.2)
    +##  rematch2          2.1.2    2020-05-01 [1] RSPM (R 4.0.0)
    +##  reshape2          1.4.4    2020-04-09 [1] RSPM (R 4.0.2)
    +##  rlang             0.4.8    2020-10-08 [1] RSPM (R 4.0.2)
    +##  rmarkdown         2.4      2020-09-30 [1] RSPM (R 4.0.2)
    +##  RSQLite           2.2.1    2020-09-30 [1] RSPM (R 4.0.2)
    +##  rstudioapi        0.11     2020-02-07 [1] RSPM (R 4.0.0)
    +##  rvcheck           0.1.8    2020-03-01 [1] RSPM (R 4.0.0)
    +##  S4Vectors       * 0.28.1   2020-12-09 [1] Bioconductor  
    +##  scales            1.1.1    2020-05-11 [1] RSPM (R 4.0.0)
    +##  scatterpie        0.1.5    2020-09-09 [1] RSPM (R 4.0.2)
    +##  sessioninfo       1.1.1    2018-11-05 [1] RSPM (R 4.0.0)
    +##  shadowtext        0.0.7    2019-11-06 [1] RSPM (R 4.0.0)
    +##  stringi           1.5.3    2020-09-09 [1] RSPM (R 4.0.2)
    +##  stringr           1.4.0    2019-02-10 [1] RSPM (R 4.0.0)
    +##  styler            1.3.2    2020-02-23 [1] RSPM (R 4.0.0)
    +##  tibble            3.0.4    2020-10-12 [1] RSPM (R 4.0.2)
    +##  tidygraph         1.2.0    2020-05-12 [1] RSPM (R 4.0.2)
    +##  tidyr             1.1.2    2020-08-27 [1] RSPM (R 4.0.2)
    +##  tidyselect        1.1.0    2020-05-11 [1] RSPM (R 4.0.0)
    +##  tweenr            1.0.1    2018-12-14 [1] RSPM (R 4.0.2)
    +##  vctrs             0.3.4    2020-08-29 [1] RSPM (R 4.0.2)
    +##  viridis           0.5.1    2018-03-29 [1] RSPM (R 4.0.0)
    +##  viridisLite       0.3.0    2018-02-01 [1] RSPM (R 4.0.0)
    +##  withr             2.3.0    2020-09-22 [1] RSPM (R 4.0.2)
    +##  xfun              0.18     2020-09-29 [1] RSPM (R 4.0.2)
    +##  yaml              2.2.1    2020-02-01 [1] RSPM (R 4.0.0)
    +## 
    +## [1] /usr/local/lib/R/site-library
    +## [2] /usr/local/lib/R/library
    +
    +
    +

    References

    +
    +
    +

    Ballereau S., M. Dunning, A. Edwards, O. Rueda, and A. Sawle, 2018 RNA-seq analysis in R: Gene set testing for RNA-seq. https://bioinformatics-core-shared-training.github.io/cruk-summer-school-2018/RNASeq2018/html/06_Gene_set_testing.nb.html

    +
    +
    +

    Carlson M., 2019 Genome wide annotation for mouse. https://bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html

    +
    +
    +

    Dolgalev I., 2020 msigdbr: MSigDB gene sets for multiple organisms in a tidy data format. https://cran.r-project.org/web/packages/msigdbr/index.html

    +
    +
    +

    Kampen K. R., L. Fancello, T. Girardi, G. Rinaldi, and M. Planque et al., 2019 Translatome analysis reveals altered serine and glycine metabolism in t-cell acute lymphoblastic leukemia cells. Nature Communications 10. https://doi.org/10.1038/s41467-019-10508-2

    +
    +
    +

    Khatri P., M. Sirota, and A. J. Butte, 2012 Ten years of pathway analysis: Current approaches and outstanding challenges. PLOS Computational Biology 8: e1002375. https://doi.org/10.1371/journal.pcbi.1002375

    +
    +
    +

    Liberzon A., C. Birger, H. Thorvaldsdóttir, M. Ghandi, and J. P. Mesirov et al., 2015 The molecular signatures database hallmark gene set collection. Cell Systems 1. https://doi.org/10.1016/j.cels.2015.12.004

    +
    +
    +

    Liberzon A., A. Subramanian, R. Pinchback, H. Thorvaldsdóttir, and P. Tamayo et al., 2011 Molecular signatures database (MSigDB) 3.0. Bioinformatics 27: 1739–1740. https://doi.org/10.1093/bioinformatics/btr260

    +
    +
    +

    Love M. I., W. Huber, and S. Anders, 2014 Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biology 15. https://doi.org/10.1186/s13059-014-0550-8

    +
    +
    +

    Subramanian A., P. Tamayo, V. K. Mootha, S. Mukherjee, and B. L. Ebert et al., 2005 Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 102: 15545–15550. https://doi.org/10.1073/pnas.0506580102

    +
    +
    +

    UC San Diego and Broad Institute Team, GSEA: Gene set enrichment analysis. https://www.gsea-msigdb.org/gsea/index.jsp

    +
    +
    +

    Yu G., L.-G. Wang, Y. Han, and Q.-Y. He, 2012 clusterProfiler: An R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 16: 284–287. https://doi.org/10.1089/omi.2011.0118

    +
    +
    +

    Yu G., 2020 clusterProfiler: Universal enrichment tool for functional and comparative study. http://yulab-smu.top/clusterProfiler-book/index.html

    +
    +
    +
    + + + + +
    +
    + +
    + + + + + + + + + + + + + + + + diff --git a/03-rnaseq/pathway-analysis_rnaseq_03_gsva.Rmd b/03-rnaseq/pathway-analysis_rnaseq_03_gsva.Rmd new file mode 100644 index 00000000..2437753d --- /dev/null +++ b/03-rnaseq/pathway-analysis_rnaseq_03_gsva.Rmd @@ -0,0 +1,699 @@ +--- +title: "Gene set variation analysis - RNA-seq" +author: "CCDL for ALSF" +date: "December 2020" +output: + html_notebook: + toc: true + toc_float: true + number_sections: true +--- + +# Purpose of this analysis + +This example is one of pathway analysis module set, we recommend looking at the [pathway analysis table below](#how-to-choose-a-pathway-analysis) to help you determine which pathway analysis method is best suited for your purposes. + +In this example we will cover a method called Gene Set Variation Analysis (GSVA) to calculate gene set or pathway scores on a per-sample basis [@Hanzelmann2013]. +GSVA transforms a gene by sample gene expression matrix into a gene set by sample pathway enrichment matrix [@Hanzelmann-github]. +We'll make a heatmap of the enrichment matrix, but you can use the GSVA scores for a number of other downstream analyses such as differential expression analysis. + +⬇️ [**Jump to the analysis code**](#analysis) ⬇️ + +### What is pathway analysis? + +Pathway analysis refers to any one of many techniques that uses predetermined sets of genes that are related or coordinated in their expression in some way (e.g., participate in the same molecular process, are regulated by the same transcription factor) to interpret a high-throughput experiment. +In the context of [refine.bio](https://www.refine.bio/), we use these techniques to analyze and interpret genome-wide gene expression experiments. +The rationale for performing pathway analysis is that looking at the pathway-level may be more biologically meaningful than considering individual genes, especially if a large number of genes are differentially expressed between conditions of interest. 
+In addition, many relatively small changes in the expression values of genes in the same pathway could lead to a phenotypic outcome and these small changes may go undetected in differential gene expression analysis. + +We highly recommend taking a look at [Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002375) from @Khatri2012 for a more comprehensive overview. +We have provided primary publications and documentation of the methods we will introduce below as well as some recommended reading in the [`Resources for further learning` section](#resources-for-further-learning). + +### How to choose a pathway analysis? + +This table summarizes the pathway analyses examples in this module. + +|Analysis|What is required for input|What output looks like |✅ Pros| ⚠️ Cons| +|--------|--------------------------|-----------------------|-------|-------| +|[**ORA (Over-representation Analysis)**](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_01_ora.html)|A list of gene IDs (no stats needed)|A per-pathway hypergeometric test result|- Simple

    - Inexpensive computationally to calculate p-values| - Requires arbitrary thresholds and ignores any statistics associated with a gene

    - Assumes independence of genes and pathways| +|[**GSEA (Gene Set Enrichment Analysis)**](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html)|A list of gene IDs with gene-level summary statistics|A per-pathway enrichment score|- Includes all genes (no arbitrary threshold!)

    - Attempts to measure coordination of genes|- Permutations can be expensive

    - Does not account for pathway overlap

    - Two-group comparisons not always appropriate/feasible| +|[**GSVA (Gene Set Variation Analysis)**](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_03_gsva.html)|A gene expression matrix (like what you get from refine.bio directly)|Pathway-level scores on a per-sample basis|- Does not require two groups to compare upfront

    - Normally distributed scores|- Scores are not a good fit for gene sets that contain genes that go up AND down

    - Method doesn’t assign statistical significance itself

    - Recommended sample size n > 10| + +# How to run this example + +For general information about our tutorials and the basic software packages you will need, please see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-this-tutorial-is-structured). +We recommend taking a look at our [Resources for Learning R](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#resources-for-learning-r) if you have not written code in R before. + +## Obtain the `.Rmd` file + +To run this example yourself, [download the `.Rmd` for this analysis by clicking this link](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/pathway-analysis_rnaseq_03_gsva.Rmd). + +Clicking this link will most likely send this to your downloads folder on your computer. +Move this `.Rmd` file to where you would like this example and its files to be stored. + +You can open this `.Rmd` file in RStudio and follow the rest of these steps from there. (See our [section about getting started with R notebooks](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) if you are unfamiliar with `.Rmd` files.) + +## Set up your analysis folders + +Good file organization is helpful for keeping your data analysis project on track! +We have set up some code that will automatically set up a folder structure for you. +Run this next chunk to set up your folders! + +If you have trouble running this chunk, see our [introduction to using `.Rmd`s](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) for more resources and explanations. 
+ +```{r} +# Create the data folder if it doesn't exist +if (!dir.exists("data")) { + dir.create("data") +} + +# Define the file path to the plots directory +plots_dir <- "plots" + +# Create the plots folder if it doesn't exist +if (!dir.exists(plots_dir)) { + dir.create(plots_dir) +} + +# Define the file path to the results directory +results_dir <- "results" + +# Create the results folder if it doesn't exist +if (!dir.exists(results_dir)) { + dir.create(results_dir) +} +``` + +In the same place you put this `.Rmd` file, you should now have three new empty folders called `data`, `plots`, and `results`! + +## Obtain the dataset from refine.bio + +For general information about downloading data for these examples, see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-the-data). + +Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/SRP140558). + +Click the "Download Now" button on the right side of this screen. + + + +Fill out the pop up window with your email and our Terms and Conditions: + + + +It may take a few minutes for the dataset to process. +You will get an email when it is ready. + +## About the dataset we are using for this example + +For this example analysis, we will use this [acute viral bronchiolitis dataset](https://www.refine.bio/experiments/SRP140558). +The data that we downloaded from refine.bio for this analysis has 62 paired peripheral blood mononuclear cell RNA-seq samples obtained from 31 patients. +Samples were collected at two time points: during their first, acute bronchiolitis visit (abbreviated "AV") and their recovery, their post-convalescence visit (abbreviated "CV"). + +## Place the dataset in your new `data/` folder + +refine.bio will send you a download button in the email when it is ready. +Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in `.zip`. 
+Double clicking should unzip this for you and create a folder of the same name. + + + +For more details on the contents of this folder see [these docs on refine.bio](http://docs.refine.bio/en/latest/main_text.html#downloadable-files). + +The `SRP140558` folder has the data and metadata TSV files you will need for this example analysis. +Experiment accession ids usually look something like `GSE1235` or `SRP12345`. + +Copy and paste the `SRP140558` folder into your newly created `data/` folder. + +## Check out our file structure! + +Your new analysis folder should contain: + +- The example analysis `.Rmd` you downloaded +- A folder called "data" which contains: + - The `SRP140558` folder which contains: + - The gene expression + - The metadata TSV +- A folder for `plots` (currently empty) +- A folder for `results` (currently empty) + +Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using): + + + +In order for our example here to run without a hitch, we need these files to be in these locations so we've constructed a test to check before we get started with the analysis. +These chunks will declare your file paths and double check that your files are in the right place. + +First we will declare our file paths to our data and metadata files, which should be in our data directory. +This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started. 
+ +```{r} +# Define the file path to the data directory +# Replace with the path of the folder the files will be in +data_dir <- file.path("data", "SRP140558") + +# Declare the file path to the gene expression matrix file +# inside directory saved as `data_dir` +# Replace with the path to your dataset file +data_file <- file.path(data_dir, "SRP140558.tsv") + +# Declare the file path to the metadata file +# inside the directory saved as `data_dir` +# Replace with the path to your metadata file +metadata_file <- file.path(data_dir, "metadata_SRP140558.tsv") +``` + +Now that our file paths are declared, we can use the `file.exists()` function to check that the files are where we specified above. + +```{r} +# Check if the gene expression matrix file is at the path stored in `data_file` +file.exists(data_file) + +# Check if the metadata file is at the file path stored in `metadata_file` +file.exists(metadata_file) +``` + +If the chunk above printed out `FALSE` to either of those tests, you won't be able to run this analysis _as is_ until those files are in the appropriate place. + +If the concept of a "file path" is unfamiliar to you; we recommend taking a look at our [section about file paths](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#an-important-note-about-file-paths-and-Rmds). + +# Using a different refine.bio dataset with this analysis? + +If you'd like to adapt an example analysis to use a different dataset from [refine.bio](https://www.refine.bio/), we recommend placing the files in the `data/` directory you created and changing the filenames and paths in the notebook to match these files (we've put comments to signify where you would need to change the code). +We suggest saving plots and results to `plots/` and `results/` directories, respectively, as these are automatically created by the notebook. +From here you can customize this analysis example to fit your own scientific questions and preferences. 
+ +*** + +   + +# Gene set variation analysis - RNA-Seq + +## Install libraries + +See our Getting Started page with [instructions for package installation](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#what-you-need-to-install) for a list of the other software you will need, as well as more tips and resources. + +We will be using `DESeq2` to normalize and transform our RNA-seq data before running GSVA, so we will need to install that [@Love2014]. + +In this analysis, we will be using the [`GSVA`](https://www.bioconductor.org/packages/release/bioc/html/GSVA.html) package to perform GSVA and the [`qusage`](https://www.bioconductor.org/packages/release/bioc/html/qusage.html) package to read in the GMT file containing the gene set data [@Hanzelmann2013; @Yaari2013]. + +We will also need the [`org.Hs.eg.db`](https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html) package to perform gene identifier conversion [@Carlson2020-human]. + +We'll create a heatmap from our pathway analysis results using `pheatmap` [@Slowikowski2017]. + +```{r} +if (!("DESeq2" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("DESeq2", update = FALSE) +} + +if (!("GSVA" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("GSVA", update = FALSE) +} + +if (!("qusage" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("qusage", update = FALSE) +} + +if (!("org.Hs.eg.db" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("org.Hs.eg.db", update = FALSE) +} + +if (!("pheatmap" %in% installed.packages())) { + # Install pheatmap + install.packages("pheatmap", update = FALSE) +} +``` + +Attach the packages we need for this analysis. 
+ +```{r} +# Attach the DESeq2 library +library(DESeq2) + +# Attach the `qusage` library +library(qusage) + +# Attach the `GSVA` library +library(GSVA) + +# Human annotation package we'll use for gene identifier conversion +library(org.Hs.eg.db) + +# We will need this so we can use the pipe: %>% +library(magrittr) +``` + +## Import and set up data + +Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. +This chunk of code will read both TSV files and add them as data frames to your environment. + +We stored our file paths as objects named `metadata_file` and `data_file` in [this previous step](#check-out-our-file-structure). + +```{r} +# Read in metadata TSV file +metadata <- readr::read_tsv(metadata_file) + +# Read in data TSV file +expression_df <- readr::read_tsv(data_file) %>% + # Here we are going to store the gene IDs as row names so that we can have a numeric matrix to perform calculations on later + tibble::column_to_rownames("Gene") +``` + +Let's ensure that the metadata and data are in the same sample order. + +```{r} +# Make the data in the order of the metadata +expression_df <- expression_df %>% + dplyr::select(metadata$refinebio_accession_code) + +# Check if this is in the same order +all.equal(colnames(expression_df), metadata$refinebio_accession_code) +``` + +### Prepare data for `DESeq2` + +There are two things we need to do to prep our expression data for DESeq2. + +First, we need to make sure all of the values in our data are converted to integers as required by a `DESeq2` function we will use later. + +Then, we need to filter out the genes that have not been expressed or that have low expression counts since we can not be as confident in those genes being reliably measured. +We are going to do some pre-filtering to keep only genes with 50 or more reads in total across the samples. 
+ +```{r} +expression_df <- expression_df %>% + # Only keep rows that have total counts above the cutoff + dplyr::filter(rowSums(.) >= 50) %>% + # The next DESeq2 functions need the values to be converted to integers + round() +``` + +## Create a DESeqDataset + +We will be using the `DESeq2` package for [normalizing and transforming our data](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#deseq2-transformation-methods), which requires us to format our data into a `DESeqDataSet` object. +We turn the data frame (or matrix) into a [`DESeqDataSet` object](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#02_About_DESeq2) and specify which variable labels our experimental groups using the [`design` argument](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#multi-factor-designs) [@Love2014]. +In this chunk of code, we will not provide a specific model to the `design` argument because we are not performing a differential expression analysis. + +```{r} +# Create a `DESeqDataSet` object +dds <- DESeqDataSetFromMatrix( + countData = expression_df, # Our prepped data frame with counts + colData = metadata, # Data frame with annotation for our samples + design = ~1 # Here we are not specifying a model +) +``` + +## Perform DESeq2 normalization and transformation + +We often suggest normalizing and transforming your data for various applications including for GSVA. +We are going to use the `vst()` function from the `DESeq2` package to normalize and transform the data. +For more information about these transformation methods, [see here](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#deseq2-transformation-methods). 
+ +```{r} +# Normalize and transform the data in the `DESeqDataSet` object using the `vst()` +# function from the `DESeq2` R package +dds_norm <- vst(dds) +``` + +At this point, if your data set has any outlier samples, you should look into removing them as they can affect your results. +For this example data set, we will skip this step (there are no obvious outliers) and proceed. + +Now we are ready to format our dataset for input into `GSVA::gsva()`. +We need to extract the normalized counts to a matrix and make it into a data frame so we can use it with tidyverse functions later. + +```{r} +# Retrieve the normalized data from the `DESeqDataSet` +vst_df <- assay(dds_norm) %>% + as.data.frame() %>% # Make into a data frame + tibble::rownames_to_column("ensembl_id") # Make Gene IDs into their own column +``` + +### Import Gene Sets + +The Molecular Signatures Database (MSigDB) is a resource that contains annotated gene sets that can be used for pathway or gene set analyses [@Subramanian2005; @Liberzon2011]. +MSigDB contains [8 different gene set collections](https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp) [@Subramanian2005; @Liberzon2011] that are distinguished by how they are derived (e.g., computationally mined, curated). + +In this example, we will use a collection called Hallmark gene sets for GSVA [@Liberzon2015]. +Here's an excerpt of [the collection description from MSigDB](https://www.gsea-msigdb.org/gsea/msigdb/collection_details.jsp#H): + +> Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. +> These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression. + +Here we are obtaining the pathway information from the main function of the [`msigdbr`](https://cran.r-project.org/web/packages/msigdbr/index.html) package [@Dolgalev2020]. 
+Because we are using human data in this example, we supply the formal organism name to the `species` argument. +We will want only the hallmark pathways, so we use the `category = "H"` argument. + +```{r} +hallmark_gene_sets <- msigdbr::msigdbr( + species = "Homo sapiens", # Can change this to what species you need + category = "H" # Only hallmark gene sets +) +``` + +Let's take a look at the format of `hallmark_gene_sets`. + +```{r rownames.print=FALSE} +head(hallmark_gene_sets) +``` + +We can see this object is in a tabular format; each row corresponds to a gene and gene set pair. +A row exists if that gene (`entrez_gene`, `gene_symbol`) belongs to a gene set (`gs_name`). + +The function that we will use to run GSVA wants the gene sets to be in a list, where each entry in the list is a vector of genes that comprise the pathway the element is named for. +In the next step, we'll demonstrate how to go from this data frame format to a list. + +For this example we will use Entrez IDs (but note that there are gene symbols we could use just as easily). +The info we need is in two columns: `entrez_gene` contains the gene ID and `gs_name` contains the name of the pathway that the gene is a part of. + +To make this into the list format we need, we can use the `split()` function. +We want a list where each element of the list is a vector that contains the Entrez gene IDs that are in a particular pathway set. + +```{r} +hallmarks_list <- split( + hallmark_gene_sets$entrez_gene, # The genes we want split into pathways + hallmark_gene_sets$gs_name # The pathways made as the higher levels of the list +) +``` + +What does this `hallmarks_list` look like? + +```{r} +head(hallmarks_list, n = 2) +``` + +Looks like we have a list of gene sets with associated Entrez IDs. + +In our gene expression data frame, `expression_df`, we have Ensembl gene identifiers. +So we will need to convert our Ensembl IDs into Entrez IDs for GSVA. 
+ +### Gene identifier conversion + +We're going to convert our identifiers in `expression_df` to Entrez IDs, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers! + +The annotation package `org.Hs.eg.db` contains information for different identifiers [@Carlson2020-human]. +`org.Hs.eg.db` is specific to _Homo sapiens_ -- this is what the `Hs` in the package name is referencing. + +We can see what types of IDs are available to us in an annotation package with `keytypes()`. + +```{r} +keytypes(org.Hs.eg.db) +``` + +We'll use this package to convert from Ensembl gene IDs (`ENSEMBL`) to Entrez IDs (`ENTREZID`) -- since these are the IDs we used in our `hallmarks_list` in the previous step. +But, we could just as easily use it to convert to gene symbols (`SYMBOL`) if we had built `hallmarks_list` using gene symbols. + +The function we will use to map from Ensembl gene IDs to Entrez gene IDs is called `mapIds()` and comes from the `AnnotationDbi` package. + +Let's create a data frame that shows the mapped Entrez IDs along with the gene expression values for the respective Ensembl IDs. 
+ +```{r} +# First let's create a mapped data frame we can join to the gene expression values +mapped_df <- data.frame( + "entrez_id" = mapIds( + # Replace with annotation package for the organism relevant to your data + org.Hs.eg.db, + keys = vst_df$ensembl_id, + # Replace with the type of gene identifiers in your data + keytype = "ENSEMBL", + # Replace with the type of gene identifiers you would like to map to + column = "ENTREZID", + # This will keep only the first mapped value for each Ensembl ID + multiVals = "first" + ) +) %>% + # If an Ensembl gene identifier doesn't map to an Entrez gene identifier, + # drop that from the data frame + dplyr::filter(!is.na(entrez_id)) %>% + # Make an `Ensembl` column to store the row names + tibble::rownames_to_column("Ensembl") %>% + # Now let's join the rest of the expression data + dplyr::inner_join(vst_df, by = c("Ensembl" = "ensembl_id")) +``` + +This `1:many mapping between keys and columns` message means that some Ensembl gene identifiers map to multiple Entrez IDs. +In this case, it's also possible that an Entrez ID will map to multiple Ensembl IDs. +For the purpose of performing GSVA later in this notebook, we keep only the first mapped IDs. + +For more info on gene ID conversion, take a look at our other examples: +[the microarray example](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.html) and [the RNA-seq example](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.html). + +Let's see a preview of `mapped_df`. + +```{r rownames.print=FALSE} +head(mapped_df) +``` + +We will want to keep in mind that GSVA requires that data is in a matrix with the gene identifiers as row names. +In order to successfully turn our data frame into a matrix, we will need to ensure that we do not have any duplicate gene identifiers. + +Let's count up how many Entrez IDs mapped to multiple Ensembl IDs. 
+ +```{r} +sum(duplicated(mapped_df$entrez_id)) +``` + +Looks like we have 68 duplicated Entrez IDs. + +#### Handling duplicate gene identifiers + +As we mentioned earlier, we will not want any duplicate gene identifiers in our data frame when we convert it into a matrix in preparation for running GSVA. + +For RNA-seq processing in refine.bio, transcripts were quantified (Ensembl transcript IDs) and aggregated to the gene-level (Ensembl gene IDs). +For a single Entrez ID that maps to multiple Ensembl gene IDs, we will use the values associated with the Ensembl gene ID that seems to be most highly expressed. +Specifically, we're going to retain the Ensembl gene ID with the maximum mean expression value. +We expect that this approach may be a better reflection of the reads that were quantified than taking the mean or median of the values for multiple Ensembl gene IDs would be. + +Our example doesn't contain too many duplicates; ultimately we only are losing 68 rows of data. +If you find yourself using a dataset that has a large proportion of duplicates, we'd recommend exercising some caution and exploring how well values for multiple gene IDs are correlated and the identity of those genes. + +First, we need to calculate the gene means, but we'll need to move our non-numeric variables (the gene ID columns) out of the way for that calculation. + +```{r} +# First let's determine the gene means +gene_means <- rowMeans(mapped_df %>% dplyr::select(-Ensembl, -entrez_id)) + +# Let's add this as a column in our `mapped_df`. +mapped_df <- mapped_df %>% + # Add gene_means as a column called gene_means + dplyr::mutate(gene_means) %>% + # Reorder the columns so `gene_means` column is upfront + dplyr::select(Ensembl, entrez_id, gene_means, dplyr::everything()) +``` + +Now we can filter out the duplicate gene identifiers using the gene mean values. +First, we'll use `dplyr::arrange()` by `gene_means` such that the rows will be in order of highest gene mean to lowest gene mean. 
+For the duplicate values of `entrez_id`, the row with the lower index will be the one that's kept by `dplyr::distinct()`. +In practice, this means that we'll keep the instance of the Entrez ID with the highest gene mean value as intended. + +```{r} +filtered_mapped_df <- mapped_df %>% + # Sort so that the highest mean expression values are at the top + dplyr::arrange(dplyr::desc(gene_means)) %>% + # Filter out the duplicated rows using `dplyr::distinct()` + dplyr::distinct(entrez_id, .keep_all = TRUE) +``` + +Let's do our check again to see if we still have duplicates. + +```{r} +sum(duplicated(filtered_mapped_df$entrez_id)) +``` + +We now have `0` duplicates which is what we want. All set! + +Now we should prep this data so GSVA can use it. + +```{r} +filtered_mapped_matrix <- filtered_mapped_df %>% + # GSVA can't use the Ensembl IDs so we should drop this column as well as the means + dplyr::select(-Ensembl, -gene_means) %>% + # We need to store our gene identifiers as row names + tibble::column_to_rownames("entrez_id") %>% + # Now we can convert our object into a matrix + as.matrix() +``` + +Note that if we had duplicate gene identifiers here, we would not be able to set them as row names. + +## Gene Set Variation Analysis + +GSVA fits a model and ranks genes based on their expression level relative to the sample distribution [@Hanzelmann2013]. +The pathway-level score calculated is a way of asking how genes _within_ a gene set vary as compared to genes that are _outside_ of that gene set [@Malhotra2018]. + +The idea here is that we will get pathway-level scores for each sample that indicate if genes in a pathway vary concordantly in one direction (over-expressed or under-expressed relative to the overall population) [@Hanzelmann2013]. +This means that GSVA scores will depend on the samples included in the dataset when you run GSVA; if you added more samples and ran GSVA again, you would expect the scores to change [@Hanzelmann2013]. 
+ +The output is a gene set by sample matrix of GSVA scores. + +### Perform GSVA + +Let's perform GSVA using the `gsva()` function. +See `?gsva` for more options. + +```{r} +gsva_results <- gsva( + filtered_mapped_matrix, + hallmarks_list, + method = "gsva", + # Appropriate for our vst transformed data + kcdf = "Gaussian", + # Minimum gene set size + min.sz = 15, + # Maximum gene set size + max.sz = 500, + # Compute Gaussian-distributed scores + mx.diff = TRUE, + # Don't print out the progress bar + verbose = FALSE +) +``` + +Note that the `gsva()` function documentation says we can use `kcdf = "Gaussian"` if we have expression values that are continuous such as log-CPMs, log-RPKMs or log-TPMs, but we would use `kcdf = "Poisson"` on integer counts. +Our `vst()` transformed data is on a log2-like scale, so `Gaussian` works for us. + +Let's explore what the output of `gsva()` looks like. + +```{r} +# Print the first 6 rows and the first 10 columns +head(gsva_results[, 1:10]) +``` + +## Write results to file + +Let's write all of our GSVA results to file. + +```{r} +gsva_results %>% + as.data.frame() %>% + tibble::rownames_to_column("pathway") %>% + readr::write_tsv(file.path( + results_dir, + "SRP140558_gsva_results.tsv" + )) +``` + +## Visualizing results with a heatmap + +Let's make a heatmap for our pathways! + +### Neaten up our metadata labels + +We will want our heatmap to include some information about the sample labels, but unfortunately some of the metadata for this dataset are not set up into separate, neat columns. + +The most salient information for these samples is combined into one column, `refinebio_title`. +Let's preview what this column looks like. + +```{r} +head(metadata$refinebio_title) +``` + +If we used these labels as is, they wouldn't be very informative! 
+ +Looking at the authors' descriptions, PBMCs were collected at two time points: during the patients' first, acute bronchiolitis visit (abbreviated "AV") and their recovery visit (called post-convalescence and abbreviated "CV"). + +We can create a new variable, `time_point`, that states this info more clearly. +This new `time_point` variable will have two labels: `acute illness` and `recovering` based on the `AV` or `CV` coding located in the `refinebio_title` string variable. + +```{r} +annot_df <- metadata %>% + # We need the sample IDs and the main column that contains the metadata info + dplyr::select( + refinebio_accession_code, + refinebio_title + ) %>% + # Create our `time_point` variable based on `refinebio_title` + dplyr::mutate( + time_point = dplyr::case_when( + # Create our new variable based on whether the refinebio_title column + # contains _AV_ or _CV_ + stringr::str_detect(refinebio_title, "_AV_") ~ "acute illness", + stringr::str_detect(refinebio_title, "_CV_") ~ "recovering" + ) + ) %>% + # We don't need the older version of the variable anymore + dplyr::select(-refinebio_title) +``` + +These time point samples are paired, so you could also add the `refinebio_subject` to the labels. +For simplicity, we've left them off for now. + +The `pheatmap::pheatmap()` function will want the annotation data frame to have row names that match the column names of the data we supply it (which is our `gsva_results`). + +```{r} +annot_df <- annot_df %>% + # pheatmap will want the sample names that match our data as row names + tibble::column_to_rownames("refinebio_accession_code") +``` + +### Set up the heatmap itself + +Great! We're all set. +Our GSVA results are in a wide format, with the GSVA scores for each sample spread across a row associated with each pathway. + +```{r} +pathway_heatmap <- pheatmap::pheatmap(gsva_results, + annotation_col = annot_df, # Add metadata labels! 
+ show_colnames = FALSE, # Don't show sample labels + fontsize_row = 6 # Shrink the pathway labels a tad +) + +# Print out heatmap here +pathway_heatmap +``` + +Here we've used clustering and can see that samples somewhat cluster by `time_point`. + +We can also see that some pathways that share biology seem to cluster together (e.g. `HALLMARK_INTERFERON_ALPHA_RESPONSE` and `HALLMARK_INTERFERON_GAMMA_RESPONSE`). +Pathways may cluster together, or have similar GSVA scores, because the genes in those pathways overlap. + +Taking this example, we can look at how many genes are in common for `HALLMARK_INTERFERON_ALPHA_RESPONSE` and `HALLMARK_INTERFERON_GAMMA_RESPONSE`. + +```{r} +length(intersect( + hallmarks_list$HALLMARK_INTERFERON_ALPHA_RESPONSE, + hallmarks_list$HALLMARK_INTERFERON_GAMMA_RESPONSE +)) +``` + +The `73` genes shared between `HALLMARK_INTERFERON_ALPHA_RESPONSE`'s `r length(hallmarks_list$HALLMARK_INTERFERON_ALPHA_RESPONSE)` and `HALLMARK_INTERFERON_GAMMA_RESPONSE`'s `r length(hallmarks_list$HALLMARK_INTERFERON_GAMMA_RESPONSE)` are probably why these pathways cluster together. + +The pathways share genes and are not independent! + +Now, let's save this plot to PNG. + +```{r} +# Replace file name with a relevant output plot name +heatmap_png_file <- file.path(plots_dir, "SRP140558_heatmap.png") + +# Open a PNG file - width and height arguments control the size of the output +png(heatmap_png_file, width = 1000, height = 800) + +# Print your heatmap +pathway_heatmap + +# Close the PNG file: +dev.off() +``` + +# Resources for further learning + +- [GSVA Paper](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-7) [@Hanzelmann2013] +- [Gene Set Enrichment Analysis (GSEA) User guide](https://www.gsea-msigdb.org/gsea/doc/GSEAUserGuideFrame.html) [@GSEA-user-guide] 
+- [Decoding Gene Set Variation Analysis](https://towardsdatascience.com/decoding-gene-set-variation-analysis-8193a0cfda3) [@Malhotra2018] +- [Make heatmaps in R with pheatmap](https://slowkow.com/notes/pheatmap-tutorial/) [@Slowikowski2017] +- To customize heatmaps even further than the functions in the `pheatmap` package allow, see the [ComplexHeatmap Complete Reference Manual](https://jokergoo.github.io/ComplexHeatmap-reference/book/) [@Gu2016] + +# Session info + +At the end of every analysis, before saving your notebook, we recommend printing out your session info. +This helps make your code more reproducible by recording what versions of software and packages you used to run this. + +```{r} +# Print session info +sessioninfo::session_info() +``` + +# References diff --git a/03-rnaseq/pathway-analysis_rnaseq_03_gsva.html b/03-rnaseq/pathway-analysis_rnaseq_03_gsva.html new file mode 100644 index 00000000..68cbbe28 --- /dev/null +++ b/03-rnaseq/pathway-analysis_rnaseq_03_gsva.html @@ -0,0 +1,4723 @@ + + + + + + + + + + + + + + +Gene set variation analysis - RNA-seq + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    + + + +
    +
    +
    +
    +
    + +
    + + + + + + + + + +
    +

    1 Purpose of this analysis

    +

    This example is one of a set of pathway analysis modules. We recommend looking at the pathway analysis table below to help you determine which pathway analysis method is best suited for your purposes.

    +

    In this example we will cover a method called Gene Set Variation Analysis (GSVA) to calculate gene set or pathway scores on a per-sample basis (Hänzelmann et al. 2013). GSVA transforms a gene by sample gene expression matrix into a gene set by sample pathway enrichment matrix (Hänzelmann et al. 2013). We’ll make a heatmap of the enrichment matrix, but you can use the GSVA scores for a number of other downstream analyses such as differential expression analysis.
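
To make the change of shape concrete, here is a minimal sketch in base R. It does not run the GSVA algorithm: the per-set score is a stand-in (the mean of per-gene z-scores), and the matrix values, gene names, and gene sets are all made up for illustration. The only point is that a gene by sample matrix goes in and a gene set by sample matrix comes out.

```r
# Toy data: 6 genes x 4 samples (values are made up for illustration)
set.seed(1)
expr <- matrix(
  rnorm(24),
  nrow = 6,
  dimnames = list(paste0("gene", 1:6), paste0("sample", 1:4))
)

# Two hypothetical gene sets
gene_sets <- list(
  set_A = c("gene1", "gene2", "gene3"),
  set_B = c("gene4", "gene5", "gene6")
)

# Stand-in score (NOT the GSVA algorithm): z-score each gene across samples,
# then average the z-scores of the genes in each set, per sample
z <- t(scale(t(expr)))
scores <- t(sapply(
  gene_sets,
  function(genes) colMeans(z[genes, , drop = FALSE])
))

dim(expr) # genes x samples
dim(scores) # gene sets x samples
```

In the analysis itself, the scores come from `gsva()` rather than from this simple average.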

    +

    ⬇️ Jump to the analysis code ⬇️

    +
    +

    1.0.1 What is pathway analysis?

    +

    Pathway analysis refers to any one of many techniques that uses predetermined sets of genes that are related or coordinated in their expression in some way (e.g., participate in the same molecular process, are regulated by the same transcription factor) to interpret a high-throughput experiment. In the context of refine.bio, we use these techniques to analyze and interpret genome-wide gene expression experiments. The rationale for performing pathway analysis is that looking at the pathway-level may be more biologically meaningful than considering individual genes, especially if a large number of genes are differentially expressed between conditions of interest. In addition, many relatively small changes in the expression values of genes in the same pathway could lead to a phenotypic outcome and these small changes may go undetected in differential gene expression analysis.

    +

    We highly recommend taking a look at Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges from Khatri et al. (2012) for a more comprehensive overview. We have provided primary publications and documentation of the methods we will introduce below as well as some recommended reading in the Resources for further learning section.

    +
    +
    +

    1.0.2 How to choose a pathway analysis?

    +

    This table summarizes the pathway analysis examples in this module.

| Analysis | What is required for input | What output looks like | ✅ Pros | ⚠️ Cons |
|---|---|---|---|---|
| ORA (Over-representation Analysis) | A list of gene IDs (no stats needed) | A per-pathway hypergeometric test result | - Simple<br>- Inexpensive computationally to calculate p-values | - Requires arbitrary thresholds and ignores any statistics associated with a gene<br>- Assumes independence of genes and pathways |
| GSEA (Gene Set Enrichment Analysis) | A list of gene IDs with gene-level summary statistics | A per-pathway enrichment score | - Includes all genes (no arbitrary threshold!)<br>- Attempts to measure coordination of genes | - Permutations can be expensive<br>- Does not account for pathway overlap<br>- Two-group comparisons not always appropriate/feasible |
| GSVA (Gene Set Variation Analysis) | A gene expression matrix (like what you get from refine.bio directly) | Pathway-level scores on a per-sample basis | - Does not require two groups to compare upfront<br>- Normally distributed scores | - Scores are not a good fit for gene sets that contain genes that go up AND down<br>- Method doesn’t assign statistical significance itself<br>- Recommended sample size n > 10 |
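
As a rough illustration of the "per-pathway hypergeometric test" listed for ORA in the table, the sketch below uses base R's `phyper()` on made-up counts. All four numbers are assumptions chosen for the example and do not come from this dataset.

```r
# Made-up counts for illustration only
universe_size <- 10000 # genes measured in the experiment
pathway_size <- 100 # genes in the pathway (all within the universe)
gene_list_size <- 200 # genes of interest (e.g., those passing some cutoff)
overlap <- 8 # genes of interest that are also in the pathway

# Overlap we would expect by chance alone
expected_overlap <- gene_list_size * pathway_size / universe_size

# P(overlap >= 8): upper tail of the hypergeometric distribution.
# `q = overlap - 1` because `lower.tail = FALSE` returns P(X > q).
p_value <- phyper(
  q = overlap - 1,
  m = pathway_size,
  n = universe_size - pathway_size,
  k = gene_list_size,
  lower.tail = FALSE
)

expected_overlap
p_value
```

The gene list cutoff is the "arbitrary threshold" mentioned in the Cons column: changing it changes `gene_list_size` and `overlap`, and therefore the p-value.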
    +
    +
    +
    +

    2 How to run this example

    +

    For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.

    +
    +

    2.1 Obtain the .Rmd file

    +

    To run this example yourself, download the .Rmd for this analysis by clicking this link.

    +

    Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd file to where you would like this example and its files to be stored.

    +

    You can open this .Rmd file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd files.)

    +
    +
    +

    2.2 Set up your analysis folders

    +

    Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

    +

    If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

    +
    # Create the data folder if it doesn't exist
    +if (!dir.exists("data")) {
    +  dir.create("data")
    +}
    +
    +# Define the file path to the plots directory
    +plots_dir <- "plots"
    +
    +# Create the plots folder if it doesn't exist
    +if (!dir.exists(plots_dir)) {
    +  dir.create(plots_dir)
    +}
    +
    +# Define the file path to the results directory
    +results_dir <- "results"
    +
    +# Create the results folder if it doesn't exist
    +if (!dir.exists(results_dir)) {
    +  dir.create(results_dir)
    +}
    +

    In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

    +
    +
    +

    2.3 Obtain the dataset from refine.bio

    +

    For general information about downloading data for these examples, see our ‘Getting Started’ section.

    +

    Go to this dataset’s page on refine.bio.

    +

    Click the “Download Now” button on the right side of this screen.

    +

    +

    Fill out the pop up window with your email and our Terms and Conditions:

    +

    +

    It may take a few minutes for the dataset to process. You will get an email when it is ready.

    +
    +
    +

    2.4 About the dataset we are using for this example

    +

    For this example analysis, we will use this acute viral bronchiolitis dataset. The data that we downloaded from refine.bio for this analysis has 62 paired peripheral blood mononuclear cell RNA-seq samples obtained from 31 patients. Samples were collected at two time points: during the patients’ first, acute bronchiolitis visit (abbreviated “AV”) and during their recovery, or post-convalescence, visit (abbreviated “CV”).

    +
    +
    +

    2.5 Place the dataset in your new data/ folder

    +

    refine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip. Double clicking should unzip this for you and create a folder of the same name.

    +

    +

    For more details on the contents of this folder see these docs on refine.bio.

    +

    The SRP140558 folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235 or SRP12345.

    +

    Copy and paste the SRP140558 folder into your newly created data/ folder.

    +
    +
    +

    2.6 Check out our file structure!

    +

    Your new analysis folder should contain:

    +
      +
    • The example analysis .Rmd you downloaded
    • A folder called “data” which contains:
      • The SRP140558 folder which contains:
        • The gene expression TSV
        • The metadata TSV
    • A folder for plots (currently empty)
    • A folder for results (currently empty)
    +

    Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):

    +

    +

    In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

    +

    First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.

    +
    # Define the file path to the data directory
    +# Replace with the path of the folder the files will be in
    +data_dir <- file.path("data", "SRP140558")
    +
    +# Declare the file path to the gene expression matrix file
    +# inside directory saved as `data_dir`
    +# Replace with the path to your dataset file
    +data_file <- file.path(data_dir, "SRP140558.tsv")
    +
    +# Declare the file path to the metadata file
    +# inside the directory saved as `data_dir`
    +# Replace with the path to your metadata file
    +metadata_file <- file.path(data_dir, "metadata_SRP140558.tsv")
    +

    Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

    +
    # Check if the gene expression matrix file is at the path stored in `data_file`
    +file.exists(data_file)
    +
    ## [1] TRUE
    +
    # Check if the metadata file is at the file path stored in `metadata_file`
    +file.exists(metadata_file)
    +
    ## [1] TRUE
    +

    If the chunk above printed out FALSE to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.

    +

    If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.
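
If you would rather see which file is missing than read two separate `TRUE`/`FALSE` lines, the check can be written as a single step. This is only a sketch: `files_to_check` and `missing_files` are names introduced here for illustration, and the paths repeat the ones declared above.

```r
# Paths built the same way as `data_file` and `metadata_file` above
data_dir <- file.path("data", "SRP140558")
files_to_check <- c(
  data = file.path(data_dir, "SRP140558.tsv"),
  metadata = file.path(data_dir, "metadata_SRP140558.tsv")
)

# Keep only the paths that do not exist on disk
missing_files <- files_to_check[!file.exists(files_to_check)]

if (length(missing_files) > 0) {
  message("Missing file(s): ", paste(missing_files, collapse = ", "))
} else {
  message("All files found!")
}
```

Swapping `message()` for `stop()` in the first branch would halt the notebook early instead of failing later when the files are read.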

    +
    +
    +
    +

    3 Using a different refine.bio dataset with this analysis?

    +

    If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/ directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.
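
One way to keep those changes in a single place is to store the experiment accession ID in a variable and build every path from it. The sketch below assumes your download follows the same naming pattern as this example (`<accession>.tsv` and `metadata_<accession>.tsv` inside a folder named for the accession); `accession` and `results_file` are names introduced here for illustration.

```r
# Hypothetical accession ID; replace with the one for your dataset
accession <- "SRP140558"

# Input paths, assuming the same naming pattern as this example
data_dir <- file.path("data", accession)
data_file <- file.path(data_dir, paste0(accession, ".tsv"))
metadata_file <- file.path(data_dir, paste0("metadata_", accession, ".tsv"))

# Output file names can reuse the same ID
results_file <- file.path("results", paste0(accession, "_gsva_results.tsv"))
```

With this setup, switching datasets means editing one line rather than every file name in the notebook.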

    +
    + +

     

    +
    +
    +

    4 Gene set variation analysis - RNA-Seq

    +
    +

    4.1 Install libraries

    +

    See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

    +

    We will be using DESeq2 to normalize and transform our RNA-seq data before running GSVA, so we will need to install that (Love et al. 2014).

    +

    In this analysis, we will be using the GSVA package to perform GSVA and the qusage package to read in the GMT file containing the gene set data (Hänzelmann et al. 2013; Yaari et al. 2013).

    +

    We will also need the org.Hs.eg.db package to perform gene identifier conversion (Carlson 2020).

    +

    We’ll create a heatmap from our pathway analysis results using pheatmap (Slowikowski 2017).

    +
    if (!("DESeq2" %in% installed.packages())) {
    +  # Install this package if it isn't installed yet
    +  BiocManager::install("DESeq2", update = FALSE)
    +}
    +
    +if (!("GSVA" %in% installed.packages())) {
    +  # Install this package if it isn't installed yet
    +  BiocManager::install("GSVA", update = FALSE)
    +}
    +
    +if (!("qusage" %in% installed.packages())) {
    +  # Install this package if it isn't installed yet
    +  BiocManager::install("qusage", update = FALSE)
    +}
    +
    +if (!("org.Hs.eg.db" %in% installed.packages())) {
    +  # Install this package if it isn't installed yet
    +  BiocManager::install("org.Hs.eg.db", update = FALSE)
    +}
    +
    +if (!("pheatmap" %in% installed.packages())) {
    +  # Install pheatmap
    +  install.packages("pheatmap", update = FALSE)
    +}
    +

    Attach the packages we need for this analysis.

    +
    # Attach the DESeq2 library
    +library(DESeq2)
    +
    ## Loading required package: S4Vectors
    +
    ## Loading required package: stats4
    +
    ## Loading required package: BiocGenerics
    +
    ## Loading required package: parallel
    +
    ## 
    +## Attaching package: 'BiocGenerics'
    +
    ## The following objects are masked from 'package:parallel':
    +## 
    +##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    +##     clusterExport, clusterMap, parApply, parCapply, parLapply,
    +##     parLapplyLB, parRapply, parSapply, parSapplyLB
    +
    ## The following objects are masked from 'package:stats':
    +## 
    +##     IQR, mad, sd, var, xtabs
    +
    ## The following objects are masked from 'package:base':
    +## 
    +##     anyDuplicated, append, as.data.frame, basename, cbind,
    +##     colnames, dirname, do.call, duplicated, eval, evalq,
    +##     Filter, Find, get, grep, grepl, intersect, is.unsorted,
    +##     lapply, Map, mapply, match, mget, order, paste, pmax,
    +##     pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce,
    +##     rownames, sapply, setdiff, sort, table, tapply, union,
    +##     unique, unsplit, which.max, which.min
    +
    ## 
    +## Attaching package: 'S4Vectors'
    +
    ## The following object is masked from 'package:base':
    +## 
    +##     expand.grid
    +
    ## Loading required package: IRanges
    +
    ## Loading required package: GenomicRanges
    +
    ## Loading required package: GenomeInfoDb
    +
    ## Loading required package: SummarizedExperiment
    +
    ## Loading required package: MatrixGenerics
    +
    ## Loading required package: matrixStats
    +
    ## 
    +## Attaching package: 'MatrixGenerics'
    +
    ## The following objects are masked from 'package:matrixStats':
    +## 
    +##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet,
    +##     colCollapse, colCounts, colCummaxs, colCummins,
    +##     colCumprods, colCumsums, colDiffs, colIQRDiffs, colIQRs,
    +##     colLogSumExps, colMadDiffs, colMads, colMaxs, colMeans2,
    +##     colMedians, colMins, colOrderStats, colProds,
    +##     colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
    +##     colSums2, colTabulates, colVarDiffs, colVars,
    +##     colWeightedMads, colWeightedMeans, colWeightedMedians,
    +##     colWeightedSds, colWeightedVars, rowAlls, rowAnyNAs,
    +##     rowAnys, rowAvgsPerColSet, rowCollapse, rowCounts,
    +##     rowCummaxs, rowCummins, rowCumprods, rowCumsums, rowDiffs,
    +##     rowIQRDiffs, rowIQRs, rowLogSumExps, rowMadDiffs, rowMads,
    +##     rowMaxs, rowMeans2, rowMedians, rowMins, rowOrderStats,
    +##     rowProds, rowQuantiles, rowRanges, rowRanks, rowSdDiffs,
    +##     rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
    +##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
    +##     rowWeightedSds, rowWeightedVars
    +
    ## Loading required package: Biobase
    +
    ## Welcome to Bioconductor
    +## 
    +##     Vignettes contain introductory material; view with
    +##     'browseVignettes()'. To cite Bioconductor, see
    +##     'citation("Biobase")', and for packages
    +##     'citation("pkgname")'.
    +
    ## 
    +## Attaching package: 'Biobase'
    +
    ## The following object is masked from 'package:MatrixGenerics':
    +## 
    +##     rowMedians
    +
    ## The following objects are masked from 'package:matrixStats':
    +## 
    +##     anyMissing, rowMedians
    +
    # Attach the `qusage` library
    +library(qusage)
    +
    ## Loading required package: limma
    +
    ## 
    +## Attaching package: 'limma'
    +
    ## The following object is masked from 'package:DESeq2':
    +## 
    +##     plotMA
    +
    ## The following object is masked from 'package:BiocGenerics':
    +## 
    +##     plotMA
    +
    # Attach the `GSVA` library
    +library(GSVA)
    +
    +# Human annotation package we'll use for gene identifier conversion
    +library(org.Hs.eg.db)
    +
    ## Loading required package: AnnotationDbi
    +
    ## 
    +
    # We will need this so we can use the pipe: %>%
    +library(magrittr)
    +
    +
    +

    4.2 Import and set up data

    +

    Data downloaded from refine.bio include a metadata tab-separated values (TSV) file and a data TSV file. This chunk of code will read both TSV files and add them as data frames to your environment.

    +

    We stored our file paths as objects named metadata_file and data_file in this previous step.

    +
    # Read in metadata TSV file
    +metadata <- readr::read_tsv(metadata_file)
    +
    ## 
    +## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────
    +## cols(
    +##   .default = col_logical(),
    +##   refinebio_accession_code = col_character(),
    +##   experiment_accession = col_character(),
    +##   refinebio_organism = col_character(),
    +##   refinebio_platform = col_character(),
    +##   refinebio_source_database = col_character(),
    +##   refinebio_subject = col_character(),
    +##   refinebio_title = col_character()
    +## )
    +## ℹ Use `spec()` for the full column specifications.
    +
    # Read in data TSV file
    +expression_df <- readr::read_tsv(data_file) %>%
    +  # Here we are going to store the gene IDs as row names so that we can have a numeric matrix to perform calculations on later
    +  tibble::column_to_rownames("Gene")
    +
    ## 
    +## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────
    +## cols(
    +##   .default = col_double(),
    +##   Gene = col_character()
    +## )
    +## ℹ Use `spec()` for the full column specifications.
    +

    Let’s ensure that the metadata and data are in the same sample order.

    +
    # Make the data in the order of the metadata
    +expression_df <- expression_df %>%
    +  dplyr::select(metadata$refinebio_accession_code)
    +
    +# Check if this is in the same order
    +all.equal(colnames(expression_df), metadata$refinebio_accession_code)
    +
    ## [1] TRUE
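To make the reordering step concrete, here is a minimal base R sketch on a toy data frame. The objects `toy_expression` and `toy_metadata` and their values are made up for illustration and are not part of the example dataset.

```r
# Toy expression data frame: genes as rows, samples as columns (hypothetical values)
toy_expression <- data.frame(
  S3 = c(5, 0), S1 = c(10, 2), S2 = c(7, 1),
  row.names = c("gene_a", "gene_b")
)

# Toy metadata listing the samples in the order we want
toy_metadata <- data.frame(
  refinebio_accession_code = c("S1", "S2", "S3"),
  stringsAsFactors = FALSE
)

# Stop early if any sample in the metadata is missing from the expression data
stopifnot(all(toy_metadata$refinebio_accession_code %in% colnames(toy_expression)))

# Reorder the columns to follow the metadata order
toy_expression <- toy_expression[, toy_metadata$refinebio_accession_code]

# identical() is a strict check that always returns a single TRUE or FALSE
same_order <- identical(
  colnames(toy_expression),
  toy_metadata$refinebio_accession_code
)
```

Unlike `all.equal()`, which returns a description of the differences when the two vectors do not match, `identical()` always returns `TRUE` or `FALSE`, so it is convenient inside `stopifnot()`.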
    +
    +

    4.2.1 Prepare data for DESeq2

    +

    There are two things we need to do to prep our expression data for DESeq2.

    +

    First, we need to make sure all of the values in our data are converted to integers as required by a DESeq2 function we will use later.

    +

    Then, we need to filter out the genes that have not been expressed or that have low expression counts, since we cannot be as confident that those genes were reliably measured. We are going to do some pre-filtering to keep only genes with 50 or more reads in total across the samples.

    +
    expression_df <- expression_df %>%
    +  # Only keep rows that have total counts above the cutoff
    +  dplyr::filter(rowSums(.) >= 50) %>%
    +  # The next DESeq2 functions need the values to be converted to integers
    +  round()
    +
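The same filter-then-round logic can be checked on a small example. This is a base R sketch with invented counts; `toy_counts` and `min_total_reads` are illustrative names, and the cutoff of 50 simply mirrors the one used above.

```r
# Toy count data with fractional values (hypothetical)
toy_counts <- data.frame(
  sample_1 = c(30.4, 0, 12.6),
  sample_2 = c(25.2, 3.1, 40.7),
  row.names = c("gene_a", "gene_b", "gene_c")
)

# Minimum total number of reads a gene needs across all samples
min_total_reads <- 50

# Total reads per gene across samples
gene_totals <- rowSums(toy_counts)

# Keep only genes at or above the cutoff, then round to whole numbers
keep <- gene_totals >= min_total_reads
toy_filtered <- round(toy_counts[keep, , drop = FALSE])

# How many genes were dropped by the filter?
n_dropped <- sum(!keep)
```

Reporting how many genes the filter removes (`n_dropped` here) is a quick way to judge whether a cutoff is too aggressive for a given dataset.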
    +
    +
    +

    4.3 Create a DESeqDataSet

    +

    We will be using the DESeq2 package for normalizing and transforming our data, which requires us to format our data into a DESeqDataSet object. We turn the data frame (or matrix) into a DESeqDataSet object and specify which variable labels our experimental groups using the design argument (Love et al. 2014). In this chunk of code, we will not provide a specific model to the design argument because we are not performing a differential expression analysis.

    +
    # Create a `DESeqDataSet` object
    +dds <- DESeqDataSetFromMatrix(
    +  countData = expression_df, # Our prepped data frame with counts
    +  colData = metadata, # Data frame with annotation for our samples
    +  design = ~1 # Here we are not specifying a model
    +)
    +
    ## converting counts to integer mode
    +
    +
    +

    4.4 Perform DESeq2 normalization and transformation

    +

    We often suggest normalizing and transforming your data for various applications including for GSVA. We are going to use the vst() function from the DESeq2 package to normalize and transform the data. For more information about these transformation methods, see here.

    +
    # Normalize and transform the data in the `DESeqDataSet` object using the `vst()`
    +# function from the `DESeq2` R package
    +dds_norm <- vst(dds)
    +

    At this point, if your data set has any outlier samples, you should look into removing them as they can affect your results. For this example data set, we will skip this step (there are no obvious outliers) and proceed.

    +

    Now we are ready to format our dataset for input into GSVA::gsva(). We need to extract the normalized counts to a matrix and make it into a data frame so we can use it with tidyverse functions later.

    +
    # Retrieve the normalized data from the `DESeqDataSet`
    +vst_df <- assay(dds_norm) %>%
    +  as.data.frame() %>% # Make into a data frame
    +  tibble::rownames_to_column("ensembl_id") # Make Gene IDs into their own column
    +
    +

    4.4.1 Import Gene Sets

    +

    The Molecular Signatures Database (MSigDB) is a resource that contains annotated gene sets that can be used for pathway or gene set analyses (Subramanian et al. 2005; Liberzon et al. 2011). MSigDB contains 8 different gene set collections (Subramanian et al. 2005; Liberzon et al. 2011) that are distinguished by how they are derived (e.g., computationally mined, curated).

    +

    In this example, we will use a collection called Hallmark gene sets for GSVA (Liberzon et al. 2015). Here’s an excerpt of the collection description from MSigDB:

    +
    +

    Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression.

    +
    +

    Here we are obtaining the pathway information from the main function of the msigdbr package (Dolgalev 2020). Because we are using human data in this example, we supply the formal organism name to the species argument. We will want only the hallmark pathways, so we use the category = "H" argument.

    +
    hallmark_gene_sets <- msigdbr::msigdbr(
    +  species = "Homo sapiens", # Can change this to what species you need
    +  category = "H" # Only hallmark gene sets
    +)
    +

    Let’s take a look at the format of hallmark_gene_sets.

    +
    head(hallmark_gene_sets)
    +
    + +
    +

    We can see this object is in a tabular format; each row corresponds to a gene and gene set pair. A row exists if that gene (entrez_gene, gene_symbol) belongs to a gene set (gs_name).

    +

    The function that we will use to run GSVA wants the gene sets to be in a list, where each entry in the list is a vector of genes that comprise the pathway the element is named for. In the next step, we’ll demonstrate how to go from this data frame format to a list.

    +

    For this example we will use Entrez IDs (but note that there are gene symbols we could use just as easily). The info we need is in two columns: entrez_gene contains the gene ID and gs_name contains the name of the pathway that the gene is a part of.

    +

    To make this into the list format we need, we can use the split() function. We want a list where each element of the list is a vector that contains the Entrez gene IDs that are in a particular pathway set.

    +
    hallmarks_list <- split(
    +  hallmark_gene_sets$entrez_gene, # The genes we want split into pathways
    +  hallmark_gene_sets$gs_name # The pathways made as the higher levels of the list
    +)
    +
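To see what `split()` does here, consider a small table in the same long format. The pathway names and gene IDs below are placeholders chosen for illustration, not real MSigDB content.

```r
# Toy gene set table: one row per gene and gene set pair (hypothetical)
toy_gene_sets <- data.frame(
  gs_name = c("PATHWAY_A", "PATHWAY_A", "PATHWAY_B", "PATHWAY_B", "PATHWAY_B"),
  entrez_gene = c(19, 33, 33, 47, 50),
  stringsAsFactors = FALSE
)

# Split the gene IDs into one vector per pathway; list names come from gs_name
toy_list <- split(toy_gene_sets$entrez_gene, toy_gene_sets$gs_name)

# Number of genes in each pathway
set_sizes <- lengths(toy_list)
```

Note that a gene can appear in more than one element of the list (gene `33` above), which is expected when pathways overlap.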

    What does this hallmarks_list look like?

    +
    head(hallmarks_list, n = 2)
    +
    ## $HALLMARK_ADIPOGENESIS
    +##   [1]     19  11194  10449     33     34     35     47     50     51
    +##  [10]    112 149685   9370  79602  56894   9131    204    217    226
    +##  [19]    284  51129    334    348    369  10124  64225    483    539
    +##  [28]  11176    593  23786    604    718    847 284119   8436    901
    +##  [37]    977   9936    948   1031 400916   1147   1149 134147  51727
    +##  [46]   1306   1282  51805  84274  57017   1337   1349   1351   1376
    +##  [55]   1384   1431   1537   1580   1629   1652   1666   8694   1717
    +##  [64]  51635  25979   1737   1738   4189  29103 128338   1891   1892
    +##  [73]  84173  79071   5168   2053   2101  23344   2109   2167   2184
    +##  [82]   8322   9908   1647   2632  27069  57678 137964   2820  10243
    +##  [91]   2878   2879  80273   3033  26275  26353   3417   3419   3421
    +## [100]   3459  10989   3679  80760   6453  84522   3910   3952   3977
    +## [109]   3991  10162   4023   4056   8491  56922   4191   4199  11343
    +## [118]   4259  84895  56246  29088  54996  23788   4638  64859   4698
    +## [127]   4706   4713   4722  28512   4836   4958   5004  27250  10400
    +## [136]   5195   5209   5211   5236  23187   5264 415116    123   5447
    +## [145]   5468   5495  84919  10935  10113  55037   5733   5860  83871
    +## [154]   7905  92840  56729  54884   8780  55177  26994   6239  10313
    +## [163]  25813    949   6342   6390   6391   6573   6510   6576   1468
    +## [172] 376497   8884 130814   6623   6647  10580  65124   8404  58472
    +## [181]   8082   6776   2040   8802   6817   6888  10010   7086  10140
    +## [190]   7263   7316  29979  83549   7351  29796  10975   7384  27089
    +## [199]   7423   7532
    +## 
    +## $HALLMARK_ALLOGRAFT_REJECTION
    +##   [1]     16   6059  10006     43     92    207    322    567    586
    +##  [10]   8915    602    672    717    822   9607   6356   6357   6363
    +##  [19]   6347   6367   6351   6352   6354    894    896   1230 729230
    +##  [28]   1234    912    914    919    940    915    916    917    920
    +##  [37]    958    959    961    924    972    973    941    942    925
    +##  [46]    926  10225   1029   5199  56253   1435   1445   1520  10563
    +##  [55]   4283   2833   1615   8560   8444   1956   8661   8664   8669
    +##  [64]   8672   1984   1991   2000   2069   2113   2147   2149    355
    +##  [73]    356   2213   2268   2316   2533   2589   2634   2650  11146
    +##  [82]   8477   3001   3002   3059   9734   3091   3105   3108   3109
    +##  [91]   3111   3112   3117   3122   3133   3135   3383  23308   3455
    +## [100]   3458   3459   3460  10261   3551   3586   3589   3592   3593
    +## [109]   3594   3596   3600   3603   3606   8807   3553   3558   9466
    +## [118]   3559   3560   3561   3565   3566   3569   3574   3578   3624
    +## [127]   3625   3662   3665   3394   3683   3689   3702   3717   3824
    +## [136]   3848   3932   3937   3976   4050   4065   9450   4067   6885
    +## [145]  11184   4153   4318  11222   4528   4689   4690   9437 114548
    +## [154]   4830   4843   4869   5196   5551   5579   5582   5699   5777
    +## [163]   5788   5917   8767   6170   6123   6133   6223   6189   6203
    +## [172]  27240   8651   9655   6688   5552   7903  23166   6772   6775
    +## [181]   6890   6891   6892   7040   7042   7070   7076   7096   7097
    +## [190]   7098  10333   7124   7163   7186  50852   7321   7334   7453
    +## [199]   7454   7535
    +

    Looks like we have a list of gene sets with associated Entrez IDs.

    +

    In our gene expression data frame, vst_df, we have Ensembl gene identifiers. So we will need to convert our Ensembl IDs into Entrez IDs for GSVA.

    +
    +
    +

    4.4.2 Gene identifier conversion

    +

    We’re going to convert our identifiers in vst_df to Entrez IDs, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers!

    +

    The annotation package org.Hs.eg.db contains information for different identifiers (Carlson 2020). org.Hs.eg.db is specific to Homo sapiens – this is what the Hs in the package name is referencing.

    +

    We can see what types of IDs are available to us in an annotation package with keytypes().

    +
    keytypes(org.Hs.eg.db)
    +
    ##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
    +##  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
    +##  [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
    +## [13] "IPI"          "MAP"          "OMIM"         "ONTOLOGY"    
    +## [17] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
    +## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
    +## [25] "UNIGENE"      "UNIPROT"
    +

    We’ll use this package to convert from Ensembl gene IDs (ENSEMBL) to Entrez IDs (ENTREZID), since these are the IDs we used in our hallmarks_list in the previous step. But we could just as easily use it to convert to gene symbols (SYMBOL) if we had built hallmarks_list using gene symbols.

    +

    The function we will use to map from Ensembl gene IDs to Entrez gene IDs is called mapIds() and comes from the AnnotationDbi package.

    +

    Let’s create a data frame that shows the mapped Entrez IDs along with the gene expression values for the respective Ensembl IDs.

    +
    # First let's create a mapped data frame we can join to the gene expression values
    +mapped_df <- data.frame(
    +  "entrez_id" = mapIds(
    +    # Replace with annotation package for the organism relevant to your data
    +    org.Hs.eg.db,
    +    keys = vst_df$ensembl_id,
    +    # Replace with the type of gene identifiers in your data
    +    keytype = "ENSEMBL",
    +    # Replace with the type of gene identifiers you would like to map to
    +    column = "ENTREZID",
    +    # This will keep only the first mapped value for each Ensembl ID
    +    multiVals = "first"
    +  )
    +) %>%
    +  # If an Ensembl gene identifier doesn't map to an Entrez gene identifier,
    +  # drop that from the data frame
    +  dplyr::filter(!is.na(entrez_id)) %>%
    +  # Make an `Ensembl` column to store the row names
    +  tibble::rownames_to_column("Ensembl") %>%
    +  # Now let's join the rest of the expression data
    +  dplyr::inner_join(vst_df, by = c("Ensembl" = "ensembl_id"))
    +
    ## 'select()' returned 1:many mapping between keys and columns
    +

    This 1:many mapping between keys and columns message means that some Ensembl gene identifiers map to multiple Entrez IDs. In this case, it’s also possible that an Entrez ID will map to multiple Ensembl IDs. For the purpose of performing GSVA later in this notebook, we keep only the first mapped Entrez ID for each Ensembl ID.
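The effect of keeping only the first mapped value can be mimicked in base R. In this sketch the lookup table and its IDs are invented placeholders, and `match()` only stands in for the idea behind `multiVals = "first"`; it is not how AnnotationDbi implements `mapIds()`.

```r
# Toy lookup table: ENSG_A maps to two Entrez IDs, and Entrez ID 201 is
# shared by two Ensembl IDs (all identifiers are hypothetical)
toy_lookup <- data.frame(
  ensembl = c("ENSG_A", "ENSG_A", "ENSG_B", "ENSG_C"),
  entrez = c("101", "102", "201", "201"),
  stringsAsFactors = FALSE
)

# The IDs we want to convert; ENSG_D has no entry in the lookup table
keys <- c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D")

# match() returns the position of the first hit for each key (NA if none)
first_hit <- toy_lookup$entrez[match(keys, toy_lookup$ensembl)]
names(first_hit) <- keys

# Drop keys that did not map to anything
mapped <- first_hit[!is.na(first_hit)]

# Count Entrez IDs that now appear more than once
n_duplicated_entrez <- sum(duplicated(mapped))
```

Even after taking the first match per Ensembl ID, the same Entrez ID can still show up for several Ensembl IDs, which is why duplicates need to be handled separately.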

    +

    For more info on gene ID conversion, take a look at our other examples: the microarray example and the RNA-seq example.

    +

    Let’s see a preview of mapped_df.

    +
    head(mapped_df)
    +
    + +
    +

    We will want to keep in mind that GSVA requires that data is in a matrix with the gene identifiers as row names. In order to successfully turn our data frame into a matrix, we will need to ensure that we do not have any duplicate gene identifiers.

    +

    Let’s count up how many Entrez IDs mapped to multiple Ensembl IDs.

    +
    sum(duplicated(mapped_df$entrez_id))
    +
    ## [1] 68
    +

    Looks like we have 68 duplicated Entrez IDs.

    +
    +

    4.4.2.1 Handling duplicate gene identifiers

    +

    As we mentioned earlier, we will not want any duplicate gene identifiers in our data frame when we convert it into a matrix in preparation for running GSVA.

    +

    For RNA-seq processing in refine.bio, transcripts were quantified (Ensembl transcript IDs) and aggregated to the gene level (Ensembl gene IDs). For a single Entrez ID that maps to multiple Ensembl gene IDs, we will use the values associated with the Ensembl gene ID that seems to be most highly expressed. Specifically, we’re going to retain the Ensembl gene ID with the maximum mean expression value. We expect that this approach may be a better reflection of the reads that were quantified than taking the mean or median of the values for multiple Ensembl gene IDs would be.

    +

    Our example doesn’t contain too many duplicates; ultimately we are only losing 68 rows of data. If you find yourself using a dataset that has a large proportion of duplicates, we’d recommend exercising some caution and exploring how well values for multiple gene IDs are correlated and the identity of those genes.

    +

    First, we need to calculate the gene means, but we’ll need to move our non-numeric variables (the gene ID columns) out of the way for that calculation.

    +
    # First let's determine the gene means
    +gene_means <- rowMeans(mapped_df %>% dplyr::select(-Ensembl, -entrez_id))
    +
    +# Let's add this as a column in our `mapped_df`.
    +mapped_df <- mapped_df %>%
    +  # Add gene_means as a column called gene_means
    +  dplyr::mutate(gene_means) %>%
    +  # Reorder the columns so `gene_means` column is upfront
    +  dplyr::select(Ensembl, entrez_id, gene_means, dplyr::everything())
    +

    Now we can filter out the duplicate gene identifiers using the gene mean values. First, we’ll use dplyr::arrange() by gene_means such that the rows will be in order of highest gene mean to lowest gene mean. For the duplicate values of entrez_id, the row with the lower index will be the one that’s kept by dplyr::distinct(). In practice, this means that we’ll keep the instance of the Entrez ID with the highest gene mean value as intended.

    +
    filtered_mapped_df <- mapped_df %>%
    +  # Sort so that the highest mean expression values are at the top
    +  dplyr::arrange(dplyr::desc(gene_means)) %>%
    +  # Filter out the duplicated rows using `dplyr::distinct()`
    +  dplyr::distinct(entrez_id, .keep_all = TRUE)
    +
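The sort-then-deduplicate pattern is easy to verify on a toy table. This base R sketch uses `order()` and `duplicated()` in place of `dplyr::arrange()` and `dplyr::distinct()`; all identifiers and values are invented.

```r
# Toy mapped data: Entrez ID 201 maps to two Ensembl IDs (hypothetical)
toy_mapped <- data.frame(
  Ensembl = c("ENSG_A", "ENSG_B", "ENSG_C"),
  entrez_id = c("101", "201", "201"),
  sample_1 = c(5, 2, 9),
  sample_2 = c(7, 4, 11),
  stringsAsFactors = FALSE
)

# Mean expression per row, computed on the numeric sample columns only
toy_mapped$gene_means <- rowMeans(toy_mapped[, c("sample_1", "sample_2")])

# Sort so the highest mean expression values are at the top
toy_sorted <- toy_mapped[order(toy_mapped$gene_means, decreasing = TRUE), ]

# duplicated() flags every occurrence after the first, so the first
# (highest mean) row for each Entrez ID is the one we keep
toy_dedup <- toy_sorted[!duplicated(toy_sorted$entrez_id), ]
```

For Entrez ID `201`, the row that survives is `ENSG_C` (mean 10) rather than `ENSG_B` (mean 3).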

    Let’s do our check again to see if we still have duplicates.

    +
    sum(duplicated(filtered_mapped_df$entrez_id))
    +
    ## [1] 0
    +

    We now have 0 duplicates which is what we want. All set!

    +

    Now we should prep this data so GSVA can use it.

    +
    filtered_mapped_matrix <- filtered_mapped_df %>%
    +  # GSVA can't use the Ensembl IDs so we should drop this column as well as the means
    +  dplyr::select(-Ensembl, -gene_means) %>%
    +  # We need to store our gene identifiers as row names
    +  tibble::column_to_rownames("entrez_id") %>%
    +  # Now we can convert our object into a matrix
    +  as.matrix()
    +

    Note that if we had duplicate gene identifiers here, we would not be able to set them as row names.

    +
    +
    +
    +
    +

    4.5 Gene Set Variation Analysis

    +

    GSVA fits a model and ranks genes based on their expression level relative to the sample distribution (Hänzelmann et al. 2013). The pathway-level score calculated is a way of asking how genes within a gene set vary as compared to genes that are outside of that gene set (Malhotra 2018).

    +

    The idea here is that we will get pathway-level scores for each sample that indicate if genes in a pathway vary concordantly in one direction (over-expressed or under-expressed relative to the overall population) (Hänzelmann et al. 2013). This means that GSVA scores will depend on the samples included in the dataset when you run GSVA; if you added more samples and ran GSVA again, you would expect the scores to change (Hänzelmann et al. 2013).

    +

    The output is a gene set by sample matrix of GSVA scores.

    +
    +

    4.5.1 Perform GSVA

    +

    Let’s perform GSVA using the gsva() function. See ?gsva for more options.

    +
    gsva_results <- gsva(
    +  filtered_mapped_matrix,
    +  hallmarks_list,
    +  method = "gsva",
    +  # Appropriate for our vst transformed data
    +  kcdf = "Gaussian",
    +  # Minimum gene set size
    +  min.sz = 15,
    +  # Maximum gene set size
    +  max.sz = 500,
    +  # Compute Gaussian-distributed scores
    +  mx.diff = TRUE,
    +  # Don't print out the progress bar
    +  verbose = FALSE
    +)
    +

    Note that the gsva() function documentation says we can use kcdf = "Gaussian" if we have expression values that are continuous such as log-CPMs, log-RPKMs or log-TPMs, but we would use kcdf = "Poisson" on integer counts. Our vst() transformed data is on a log2-like scale, so Gaussian works for us.
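As a rough sanity check before choosing `kcdf`, one could test whether a matrix looks like integer counts. The helper `suggest_kcdf()` below is hypothetical and not part of the GSVA package; it is only a heuristic sketch and does not replace knowing how your data were processed.

```r
# Hypothetical helper: suggest a kcdf value from the look of the data
suggest_kcdf <- function(expr_matrix) {
  stopifnot(is.matrix(expr_matrix), is.numeric(expr_matrix))

  # Integer counts are whole, non-negative numbers
  is_whole <- all(expr_matrix == round(expr_matrix), na.rm = TRUE)
  is_nonnegative <- all(expr_matrix >= 0, na.rm = TRUE)

  if (is_whole && is_nonnegative) "Poisson" else "Gaussian"
}

# Toy raw counts and a log2-scale version of them (values are made up)
raw_counts <- matrix(c(0, 12, 7, 340), nrow = 2)
log_scale <- log2(raw_counts + 1)

kcdf_for_counts <- suggest_kcdf(raw_counts)
kcdf_for_log <- suggest_kcdf(log_scale)
```

With these toy inputs the helper suggests `"Poisson"` for the raw counts and `"Gaussian"` for the log-scale values, in line with the guidance above.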

    +

    Let’s explore what the output of gsva() looks like.

    +
    # Print the first 6 rows and the first 10 columns
    +head(gsva_results[, 1:10])
    +
    ##                               SRR7011789  SRR7011790  SRR7011791
    +## HALLMARK_ADIPOGENESIS        -0.22774528 -0.36395241  0.22999820
    +## HALLMARK_ALLOGRAFT_REJECTION  0.22660346 -0.25407049 -0.04169663
    +## HALLMARK_ANDROGEN_RESPONSE    0.08568006 -0.13709858  0.32159028
    +## HALLMARK_ANGIOGENESIS        -0.33111804 -0.25529152  0.60728563
    +## HALLMARK_APICAL_JUNCTION     -0.11027645 -0.16642244  0.23723265
    +## HALLMARK_APICAL_SURFACE       0.01112321 -0.01699534  0.07994730
    +##                               SRR7011792 SRR7011793  SRR7011794
    +## HALLMARK_ADIPOGENESIS        -0.19727825 -0.2313671 -0.20810271
    +## HALLMARK_ALLOGRAFT_REJECTION -0.01823989 -0.1466423 -0.27374622
    +## HALLMARK_ANDROGEN_RESPONSE   -0.11634752  0.0743458 -0.09111262
    +## HALLMARK_ANGIOGENESIS        -0.28334284  0.4498812 -0.17887517
    +## HALLMARK_APICAL_JUNCTION      0.09654556 -0.2177673 -0.13366769
    +## HALLMARK_APICAL_SURFACE       0.25766960 -0.1668598 -0.15017936
    +##                               SRR7011795  SRR7011796   SRR7011797
    +## HALLMARK_ADIPOGENESIS        -0.00891876 -0.13059319 -0.072872699
    +## HALLMARK_ALLOGRAFT_REJECTION -0.14610335 -0.19305512 -0.191220843
    +## HALLMARK_ANDROGEN_RESPONSE    0.19100704  0.02244988  0.061162604
    +## HALLMARK_ANGIOGENESIS        -0.27122034 -0.10532059  0.238517354
    +## HALLMARK_APICAL_JUNCTION     -0.06955051 -0.07915702  0.005245755
    +## HALLMARK_APICAL_SURFACE       0.11007532 -0.08255951 -0.144939542
    +##                               SRR7011798
    +## HALLMARK_ADIPOGENESIS        -0.09425027
    +## HALLMARK_ALLOGRAFT_REJECTION -0.19280670
    +## HALLMARK_ANDROGEN_RESPONSE   -0.14700865
    +## HALLMARK_ANGIOGENESIS         0.22604015
    +## HALLMARK_APICAL_JUNCTION     -0.24149406
    +## HALLMARK_APICAL_SURFACE      -0.15649985
    +
    +
    +
    +

    4.6 Write results to file

    +

    Let’s write all of our GSVA results to file.

    +
    gsva_results %>%
    +  as.data.frame() %>%
    +  tibble::rownames_to_column("pathway") %>%
    +  readr::write_tsv(file.path(
    +    results_dir,
    +    "SRP140558_gsva_results.tsv"
    +  ))
    +
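If you need the scores again later, the TSV can be read back and turned into a matrix. The sketch below shows the round trip with base R on a toy matrix written to a temporary file; `toy_results` and `toy_file` are illustrative and do not refer to the real results file.

```r
# Toy pathway by sample score matrix (hypothetical values)
toy_results <- matrix(
  c(0.25, -0.5, 0.125, 0.75),
  nrow = 2,
  dimnames = list(c("PATHWAY_A", "PATHWAY_B"), c("S1", "S2"))
)

# Write to a temporary TSV with the pathway names as their own column
toy_file <- tempfile(fileext = ".tsv")
toy_df <- data.frame(
  pathway = rownames(toy_results),
  toy_results,
  check.names = FALSE,
  stringsAsFactors = FALSE
)
write.table(toy_df, toy_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the file back and restore the matrix with pathways as row names
read_back <- read.delim(toy_file, check.names = FALSE, stringsAsFactors = FALSE)
restored <- as.matrix(read_back[, -1])
rownames(restored) <- read_back$pathway
```

Setting `check.names = FALSE` keeps sample IDs exactly as written, which matters if they need to match the metadata later.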
    +
    +

    4.7 Visualizing results with a heatmap

    +

    Let’s make a heatmap for our pathways!

    +
    +

    4.7.1 Neaten up our metadata labels

    +

    We will want our heatmap to include some information about the sample labels, but unfortunately some of the metadata for this dataset are not set up into separate, neat columns.

    +

    The most salient information for these samples is combined into one column, refinebio_title. Let’s preview what this column looks like.

    +
    head(metadata$refinebio_title)
    +
    ## [1] "AVB_006_AV_PBMC" "AVB_006_CV_PBMC" "AVB_007_AV_PBMC"
    +## [4] "AVB_007_CV_PBMC" "AVB_012_AV_PBMC" "AVB_012_CV_PBMC"
    +

    If we used these labels as is, it wouldn’t be very informative!

    +

    Looking at the authors’ descriptions, PBMCs were collected at two time points: during the patients’ first, acute bronchiolitis visit (abbreviated “AV”) and during their recovery visit (called post-convalescence and abbreviated “CV”).

    +

    We can create a new variable, time_point, that states this info more clearly. This new time_point variable will have two labels: acute illness and recovering based on the AV or CV coding located in the refinebio_title string variable.

    +
    annot_df <- metadata %>%
    +  # We need the sample IDs and the main column that contains the metadata info
    +  dplyr::select(
    +    refinebio_accession_code,
    +    refinebio_title
    +  ) %>%
    +  # Create our `time_point` variable based on `refinebio_title`
    +  dplyr::mutate(
    +    time_point = dplyr::case_when(
    +      # Create our new variable based on whether the refinebio_title column
    +      # contains _AV_ or _CV_
    +      stringr::str_detect(refinebio_title, "_AV_") ~ "acute illness",
    +      stringr::str_detect(refinebio_title, "_CV_") ~ "recovering"
    +    )
    +  ) %>%
    +  # We don't need the older version of the variable anymore
    +  dplyr::select(-refinebio_title)
    +
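The labeling logic can be sketched in base R as well. The first two titles below follow the format previewed above; the third is a made-up title that matches neither pattern, included to show what happens to unmatched samples.

```r
# Two titles in the dataset's format plus one hypothetical unmatched title
toy_titles <- c("AVB_006_AV_PBMC", "AVB_006_CV_PBMC", "AVB_007_XX_PBMC")

# Start with missing labels so unmatched titles stay NA
time_point <- rep(NA_character_, length(toy_titles))

# fixed = TRUE treats the pattern as a literal string, not a regular expression
time_point[grepl("_AV_", toy_titles, fixed = TRUE)] <- "acute illness"
time_point[grepl("_CV_", toy_titles, fixed = TRUE)] <- "recovering"

# Count how many titles did not match either pattern
n_unlabeled <- sum(is.na(time_point))
```

`dplyr::case_when()` likewise returns `NA` when no condition matches, so checking for `NA` values in `time_point` is a simple way to catch titles with unexpected formatting.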

    These time point samples are paired, so you could also add the refinebio_subject to the labels. For simplicity, we’ve left them off for now.

    +

    pheatmap::pheatmap() will want the row names of the annotation data frame to match the column names of the data we supply it (which is our gsva_results).

    +
    annot_df <- annot_df %>%
    +  # pheatmap will want the sample IDs that match our data to be the row names
    +  tibble::column_to_rownames("refinebio_accession_code")
    +
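Before plotting, it can help to confirm that the annotation row names and the score matrix column names refer to the same samples. This is a minimal base R sketch with toy objects (`toy_scores` and `toy_annot` are hypothetical); putting the rows in the same order is an extra precaution in this sketch.

```r
# Toy score matrix: pathways as rows, samples as columns (hypothetical values)
toy_scores <- matrix(
  c(0.2, -0.1, 0.4, -0.3),
  nrow = 2,
  dimnames = list(c("PATHWAY_A", "PATHWAY_B"), c("S1", "S2"))
)

# Toy annotation data frame with sample IDs as row names, in a different order
toy_annot <- data.frame(
  time_point = c("recovering", "acute illness"),
  row.names = c("S2", "S1"),
  stringsAsFactors = FALSE
)

# Do both objects describe exactly the same set of samples?
all_present <- setequal(rownames(toy_annot), colnames(toy_scores))

# Put the annotation rows in the same order as the matrix columns
toy_annot <- toy_annot[colnames(toy_scores), , drop = FALSE]
```

Using `drop = FALSE` keeps `toy_annot` a data frame even though it has a single column.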
    +
    +

    4.7.2 Set up the heatmap itself

    +

    Great! We’re all set. Recall that our gsva_results are in a wide format, with the GSVA scores for each sample spread across a row associated with each pathway.

    +
    pathway_heatmap <- pheatmap::pheatmap(gsva_results,
    +  annotation_col = annot_df, # Add metadata labels!
    +  show_colnames = FALSE, # Don't show sample labels
    +  fontsize_row = 6 # Shrink the pathway labels a tad
    +)
    +
    +# Print out heatmap here
    +pathway_heatmap
    +

    +

    Here we’ve used clustering and can see that samples somewhat cluster by time_point.

    +

    We can also see that some pathways that share biology seem to cluster together (e.g. HALLMARK_INTERFERON_ALPHA_RESPONSE and HALLMARK_INTERFERON_GAMMA_RESPONSE). Pathways may cluster together, or have similar GSVA scores, because the genes in those pathways overlap.

    +

    Taking this example, we can look at how many genes are in common for HALLMARK_INTERFERON_ALPHA_RESPONSE and HALLMARK_INTERFERON_GAMMA_RESPONSE.

    +
    length(intersect(
    +  hallmarks_list$HALLMARK_INTERFERON_ALPHA_RESPONSE,
    +  hallmarks_list$HALLMARK_INTERFERON_GAMMA_RESPONSE
    +))
    +
    ## [1] 73
    +

    These 73 shared genes, out of 97 in HALLMARK_INTERFERON_ALPHA_RESPONSE and 200 in HALLMARK_INTERFERON_GAMMA_RESPONSE, are probably why those pathways cluster together.

    +

    The pathways share genes and are not independent!
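To put a number on the overlap between two gene sets, the shared count can be expressed as a proportion of each set and as a Jaccard index. The helper `overlap_summary()` below is a hypothetical convenience function, shown on toy vectors rather than the real hallmark sets.

```r
# Hypothetical helper: summarize the overlap between two gene sets
overlap_summary <- function(set_a, set_b) {
  shared <- length(intersect(set_a, set_b))
  c(
    shared = shared,
    # Share of each set's unique genes that also appear in the other set
    prop_of_a = shared / length(unique(set_a)),
    prop_of_b = shared / length(unique(set_b)),
    # Jaccard index: shared genes divided by all genes in either set
    jaccard = shared / length(union(set_a, set_b))
  )
}

# Toy gene sets (made-up IDs)
toy_a <- c(1, 2, 3, 4)
toy_b <- c(3, 4, 5, 6, 7, 8)

toy_overlap <- overlap_summary(toy_a, toy_b)
```

Applied to the counts reported above, 73 shared genes correspond to roughly 75% of the 97 genes in the interferon alpha set and about 37% of the 200 genes in the interferon gamma set.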

    +

    Now, let’s save this plot to PNG.

    +
    # Replace file name with a relevant output plot name
    +heatmap_png_file <- file.path(plots_dir, "SRP140558_heatmap.png")
    +
    +# Open a PNG file - width and height arguments control the size of the output
    +png(heatmap_png_file, width = 1000, height = 800)
    +
    +# Print your heatmap
    +pathway_heatmap
    +
    +# Close the PNG file:
    +dev.off()
    +
    ## png 
    +##   2
    +
    +
    +
    +
    +

    5 Resources for further learning

    + +
    +
    +

    6 Session info

    +

    At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

    +
    # Print session info
    +sessioninfo::session_info()
    +
    ## ─ Session info ─────────────────────────────────────────────────────
    +##  setting  value                       
    +##  version  R version 4.0.2 (2020-06-22)
    +##  os       Ubuntu 20.04 LTS            
    +##  system   x86_64, linux-gnu           
    +##  ui       X11                         
    +##  language (EN)                        
    +##  collate  en_US.UTF-8                 
    +##  ctype    en_US.UTF-8                 
    +##  tz       Etc/UTC                     
    +##  date     2020-12-21                  
    +## 
    +## ─ Packages ─────────────────────────────────────────────────────────
    +##  package              * version  date       lib source        
    +##  annotate               1.68.0   2020-10-27 [1] Bioconductor  
    +##  AnnotationDbi        * 1.52.0   2020-10-27 [1] Bioconductor  
    +##  assertthat             0.2.1    2019-03-21 [1] RSPM (R 4.0.0)
    +##  backports              1.1.10   2020-09-15 [1] RSPM (R 4.0.2)
    +##  Biobase              * 2.50.0   2020-10-27 [1] Bioconductor  
    +##  BiocGenerics         * 0.36.0   2020-10-27 [1] Bioconductor  
    +##  BiocParallel           1.24.1   2020-11-06 [1] Bioconductor  
    +##  bit                    4.0.4    2020-08-04 [1] RSPM (R 4.0.2)
    +##  bit64                  4.0.5    2020-08-30 [1] RSPM (R 4.0.2)
    +##  bitops                 1.0-6    2013-08-17 [1] RSPM (R 4.0.0)
    +##  blob                   1.2.1    2020-01-20 [1] RSPM (R 4.0.0)
    +##  cli                    2.1.0    2020-10-12 [1] RSPM (R 4.0.2)
    +##  coda                   0.19-4   2020-09-30 [1] RSPM (R 4.0.2)
    +##  colorspace             1.4-1    2019-03-18 [1] RSPM (R 4.0.0)
    +##  crayon                 1.3.4    2017-09-16 [1] RSPM (R 4.0.0)
    +##  DBI                    1.1.0    2019-12-15 [1] RSPM (R 4.0.0)
    +##  DelayedArray           0.16.0   2020-10-27 [1] Bioconductor  
    +##  DESeq2               * 1.30.0   2020-10-27 [1] Bioconductor  
    +##  digest                 0.6.25   2020-02-23 [1] RSPM (R 4.0.0)
    +##  dplyr                  1.0.2    2020-08-18 [1] RSPM (R 4.0.2)
    +##  ellipsis               0.3.1    2020-05-15 [1] RSPM (R 4.0.0)
    +##  emmeans                1.5.1    2020-09-18 [1] RSPM (R 4.0.2)
    +##  estimability           1.3      2018-02-11 [1] RSPM (R 4.0.0)
    +##  evaluate               0.14     2019-05-28 [1] RSPM (R 4.0.0)
    +##  fansi                  0.4.1    2020-01-08 [1] RSPM (R 4.0.0)
    +##  farver                 2.0.3    2020-01-16 [1] RSPM (R 4.0.0)
    +##  fastmatch              1.1-0    2017-01-28 [1] RSPM (R 4.0.0)
    +##  fftw                   1.0-6    2020-02-24 [1] RSPM (R 4.0.2)
    +##  genefilter             1.72.0   2020-10-27 [1] Bioconductor  
    +##  geneplotter            1.68.0   2020-10-27 [1] Bioconductor  
    +##  generics               0.0.2    2018-11-29 [1] RSPM (R 4.0.0)
    +##  GenomeInfoDb         * 1.26.1   2020-11-20 [1] Bioconductor  
    +##  GenomeInfoDbData       1.2.4    2020-12-01 [1] Bioconductor  
    +##  GenomicRanges        * 1.42.0   2020-10-27 [1] Bioconductor  
    +##  getopt                 1.20.3   2019-03-22 [1] RSPM (R 4.0.0)
    +##  ggplot2                3.3.2    2020-06-19 [1] RSPM (R 4.0.1)
    +##  glue                   1.4.2    2020-08-27 [1] RSPM (R 4.0.2)
    +##  graph                  1.68.0   2020-10-27 [1] Bioconductor  
    +##  GSEABase               1.52.1   2020-12-11 [1] Bioconductor  
    +##  GSVA                 * 1.38.0   2020-10-27 [1] Bioconductor  
    +##  gtable                 0.3.0    2019-03-25 [1] RSPM (R 4.0.0)
    +##  hms                    0.5.3    2020-01-08 [1] RSPM (R 4.0.0)
    +##  htmltools              0.5.0    2020-06-16 [1] RSPM (R 4.0.1)
    +##  httr                   1.4.2    2020-07-20 [1] RSPM (R 4.0.2)
    +##  IRanges              * 2.24.0   2020-10-27 [1] Bioconductor  
    +##  jsonlite               1.7.1    2020-09-07 [1] RSPM (R 4.0.2)
    +##  knitr                  1.30     2020-09-22 [1] RSPM (R 4.0.2)
    +##  lattice                0.20-41  2020-04-02 [2] CRAN (R 4.0.2)
    +##  lifecycle              0.2.0    2020-03-06 [1] RSPM (R 4.0.0)
    +##  limma                * 3.46.0   2020-10-27 [1] Bioconductor  
    +##  locfit                 1.5-9.4  2020-03-25 [1] RSPM (R 4.0.0)
    +##  magrittr             * 1.5      2014-11-22 [1] RSPM (R 4.0.0)
    +##  Matrix                 1.2-18   2019-11-27 [2] CRAN (R 4.0.2)
    +##  MatrixGenerics       * 1.2.0    2020-10-27 [1] Bioconductor  
    +##  matrixStats          * 0.57.0   2020-09-25 [1] RSPM (R 4.0.2)
    +##  memoise                1.1.0    2017-04-21 [1] RSPM (R 4.0.0)
    +##  msigdbr                7.2.1    2020-10-02 [1] RSPM (R 4.0.2)
    +##  munsell                0.5.0    2018-06-12 [1] RSPM (R 4.0.0)
    +##  mvtnorm                1.1-1    2020-06-09 [1] RSPM (R 4.0.0)
    +##  nlme                   3.1-148  2020-05-24 [2] CRAN (R 4.0.2)
    +##  optparse             * 1.6.6    2020-04-16 [1] RSPM (R 4.0.0)
    +##  org.Hs.eg.db         * 3.12.0   2020-12-01 [1] Bioconductor  
    +##  pheatmap               1.0.12   2019-01-04 [1] RSPM (R 4.0.0)
    +##  pillar                 1.4.6    2020-07-10 [1] RSPM (R 4.0.2)
    +##  pkgconfig              2.0.3    2019-09-22 [1] RSPM (R 4.0.0)
    +##  ps                     1.4.0    2020-10-07 [1] RSPM (R 4.0.2)
    +##  purrr                  0.3.4    2020-04-17 [1] RSPM (R 4.0.0)
    +##  qusage               * 2.24.0   2020-10-27 [1] Bioconductor  
    +##  R.cache                0.14.0   2019-12-06 [1] RSPM (R 4.0.0)
    +##  R.methodsS3            1.8.1    2020-08-26 [1] RSPM (R 4.0.2)
    +##  R.oo                   1.24.0   2020-08-26 [1] RSPM (R 4.0.2)
    +##  R.utils                2.10.1   2020-08-26 [1] RSPM (R 4.0.2)
    +##  R6                     2.4.1    2019-11-12 [1] RSPM (R 4.0.0)
    +##  RColorBrewer           1.1-2    2014-12-07 [1] RSPM (R 4.0.0)
    +##  Rcpp                   1.0.5    2020-07-06 [1] RSPM (R 4.0.2)
    +##  RCurl                  1.98-1.2 2020-04-18 [1] RSPM (R 4.0.0)
    +##  readr                  1.4.0    2020-10-05 [1] RSPM (R 4.0.2)
    +##  rematch2               2.1.2    2020-05-01 [1] RSPM (R 4.0.0)
    +##  rlang                  0.4.8    2020-10-08 [1] RSPM (R 4.0.2)
    +##  rmarkdown              2.4      2020-09-30 [1] RSPM (R 4.0.2)
    +##  RSQLite                2.2.1    2020-09-30 [1] RSPM (R 4.0.2)
    +##  rstudioapi             0.11     2020-02-07 [1] RSPM (R 4.0.0)
    +##  S4Vectors            * 0.28.0   2020-10-27 [1] Bioconductor  
    +##  scales                 1.1.1    2020-05-11 [1] RSPM (R 4.0.0)
    +##  sessioninfo            1.1.1    2018-11-05 [1] RSPM (R 4.0.0)
    +##  stringi                1.5.3    2020-09-09 [1] RSPM (R 4.0.2)
    +##  stringr                1.4.0    2019-02-10 [1] RSPM (R 4.0.0)
    +##  styler                 1.3.2    2020-02-23 [1] RSPM (R 4.0.0)
    +##  SummarizedExperiment * 1.20.0   2020-10-27 [1] Bioconductor  
    +##  survival               3.1-12   2020-04-10 [2] CRAN (R 4.0.2)
    +##  tibble                 3.0.4    2020-10-12 [1] RSPM (R 4.0.2)
    +##  tidyselect             1.1.0    2020-05-11 [1] RSPM (R 4.0.0)
    +##  vctrs                  0.3.4    2020-08-29 [1] RSPM (R 4.0.2)
    +##  withr                  2.3.0    2020-09-22 [1] RSPM (R 4.0.2)
    +##  xfun                   0.18     2020-09-29 [1] RSPM (R 4.0.2)
    +##  XML                    3.99-0.5 2020-07-23 [1] RSPM (R 4.0.2)
    +##  xtable                 1.8-4    2019-04-21 [1] RSPM (R 4.0.0)
    +##  XVector                0.30.0   2020-10-27 [1] Bioconductor  
    +##  yaml                   2.2.1    2020-02-01 [1] RSPM (R 4.0.0)
    +##  zlibbioc               1.36.0   2020-10-27 [1] Bioconductor  
    +## 
    +## [1] /usr/local/lib/R/site-library
    +## [2] /usr/local/lib/R/library
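
    If you also want a copy of the session info outside the rendered notebook, one option is to write it to a text file. This is a sketch under stated assumptions: it swaps in base R's `sessionInfo()` rather than `sessioninfo::session_info()` so that it runs without extra packages, and the output path is a hypothetical temporary file rather than a location this example prescribes.

```r
# Hypothetical output location; a temporary directory keeps the sketch portable
session_info_file <- file.path(tempdir(), "session_info.txt")

# capture.output() collects what would be printed as a character vector
session_info_lines <- utils::capture.output(utils::sessionInfo())

# Write one line of session info per line of the text file
writeLines(session_info_lines, session_info_file)

# Confirm that the file was created
file.exists(session_info_file)
```

    In a real analysis you would likely replace the temporary path with a file inside your results folder so the record is kept alongside the other outputs.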
    +
    +
    +

    References

    +
    +
    +

    Broad Institute Team, 2019 Gene set enrichment analysis (gsea) user guide. https://www.gsea-msigdb.org/gsea/doc/GSEAUserGuideFrame.html

    +
    +
    +

    Carlson M., 2020 org.Hs.eg.db: Genome wide annotation for human. http://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html

    +
    +
    +

    Dolgalev I., 2020 msigdbr: MSigDB gene sets for multiple organisms in a tidy data format. https://cran.r-project.org/web/packages/msigdbr/index.html

    +
    +
    +

    Gu Z., R. Eils, and M. Schlesner, 2016 Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics. https://doi.org/10.1093/bioinformatics/btw313

    +
    +
    +

    Hänzelmann S., R. Castelo, and J. Guinney, 2013 GSVA: Gene set variation analysis for microarray and RNA-Seq data. BMC Bioinformatics 14. https://doi.org/10.1186/1471-2105-14-7

    +
    +
    +

    Hänzelmann S., R. Castelo, and J. Guinney, 2013 GSVA. https://github.com/rcastelo/GSVA/blob/master/man/gsva.Rd

    +
    +
    +

    Khatri P., M. Sirota, and A. J. Butte, 2012 Ten years of pathway analysis: Current approaches and outstanding challenges. PLOS Computational Biology 8: e1002375. https://doi.org/10.1371/journal.pcbi.1002375

    +
    +
    +

    Liberzon A., C. Birger, H. Thorvaldsdóttir, M. Ghandi, and J. P. Mesirov et al., 2015 The molecular signatures database hallmark gene set collection. Cell Systems 1. https://doi.org/10.1016/j.cels.2015.12.004

    +
    +
    +

    Liberzon A., A. Subramanian, R. Pinchback, H. Thorvaldsdóttir, and P. Tamayo et al., 2011 Molecular signatures database (MSigDB) 3.0. Bioinformatics 27: 1739–1740. https://doi.org/10.1093/bioinformatics/btr260

    +
    +
    +

    Love M. I., W. Huber, and S. Anders, 2014 Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biology 15. https://doi.org/10.1186/s13059-014-0550-8

    +
    +
    +

    Malhotra S., 2018 Decoding gene set variation analysis. https://towardsdatascience.com/decoding-gene-set-variation-analysis-8193a0cfda3

    +
    +
    +

    Slowikowski K., 2017 Make heatmaps in R with pheatmap. https://slowkow.com/notes/pheatmap-tutorial/

    +
    +
    +

    Subramanian A., P. Tamayo, V. K. Mootha, S. Mukherjee, and B. L. Ebert et al., 2005 Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 102: 15545–15550. https://doi.org/10.1073/pnas.0506580102

    +
    +
    +

    Yaari G., C. R. Bolen, J. Thakar, and S. H. Kleinstein, 2013 Quantitative set analysis for gene expression: A method to quantify gene set differential expression including gene-gene correlations. Nucleic Acids Research 41: e170. https://doi.org/10.1093/nar/gkt660

    +
    +
    +
    + + + + +
    +
    + +
    + + + + + + + + + + + + + + + + diff --git a/04-advanced-topics/00-intro-to-advanced-topics.html b/04-advanced-topics/00-intro-to-advanced-topics.html index 29c3e038..8c75da02 100644 --- a/04-advanced-topics/00-intro-to-advanced-topics.html +++ b/04-advanced-topics/00-intro-to-advanced-topics.html @@ -1263,25 +1263,22 @@ }; - - + - - + + + + @@ -2599,6 +3201,73 @@ } + @@ -2865,15 +3534,20 @@ @@ -2943,6 +3626,11 @@

    CCDL for ALSF

    🚧 Advanced topics are coming soon: Under construction! 🚧

    + diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd index 6bee663b..bca779e0 100644 --- a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd +++ b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd @@ -53,7 +53,7 @@ if (!dir.exists("data")) { } # Define the file path to the plots directory -plots_dir <- "plots" # Can replace with path to desired output plots directory +plots_dir <- "plots" # Create the plots folder if it doesn't exist if (!dir.exists(plots_dir)) { @@ -61,7 +61,7 @@ if (!dir.exists(plots_dir)) { } # Define the file path to the results directory -results_dir <- "results" # Can replace with path to desired output results directory +results_dir <- "results" # Create the results folder if it doesn't exist if (!dir.exists(results_dir)) { @@ -75,7 +75,7 @@ In the same place you put this `.Rmd` file, you should now have three new empty For general information about downloading data for these examples, see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-the-data). -Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/SRP133573/identification-of-transcription-factor-relationships-associated-with-androgen-deprivation-therapy-response-and-metastatic-progression-in-prostate-cancer). +Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/SRP140558). Click the "Download Now" button on the right side of this screen. @@ -96,9 +96,9 @@ You will get an email when it is ready. ## About the dataset we are using for this example -For this example analysis, we will use this [prostate cancer dataset](https://www.refine.bio/experiments/SRP133573). -The data that we downloaded from refine.bio for this analysis has 175 RNA-seq samples obtained from 20 patients with prostate cancer. 
-Patients underwent androgen deprivation therapy (ADT) and RNA-seq samples include pre-ADT biopsies and post-ADT prostatectomy specimens. +For this example analysis, we will use this [acute viral bronchiolitis dataset](https://www.refine.bio/experiments/SRP140558). +The data that we downloaded from refine.bio for this analysis has 62 paired peripheral blood mononuclear cell RNA-seq samples obtained from 31 patients. +Samples were collected at two time points: during their first, acute bronchiolitis visit (abbreviated "AV") and their recovery, their post-convalescence visit (abbreviated "CV"). ## Place the dataset in your new `data/` folder @@ -113,7 +113,7 @@ For more details on the contents of this folder see [these docs on refine.bio](h The `` folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like `GSE1235` or `SRP12345`. -Copy and paste the `SRP133573` folder into your newly created `data/` folder. +Copy and paste the `SRP140558` folder into your newly created `data/` folder. ## Check out our file structure! 
@@ -121,7 +121,7 @@ Your new analysis folder should contain: - The example analysis `.Rmd` you downloaded - A folder called "data" which contains: - - The `SRP133573` folder which contains: + - The `SRP140558` folder which contains: - The gene expression - The metadata TSV - A folder for `plots` (currently empty) @@ -139,19 +139,24 @@ This is handy to do because if we want to switch the dataset (see next section f ```{r} # Define the file path to the data directory -data_dir <- file.path("data", "SRP133573") # Replace with accession number which will be the name of the folder the files will be in +# Replace with the path of the folder the files will be in +data_dir <- file.path("data", "SRP140558") -# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir` -data_file <- file.path(data_dir, "SRP133573.tsv") # Replace with file path to your dataset +# Declare the file path to the gene expression matrix file +# inside directory saved as `data_dir` +# Replace with the path to your dataset file +data_file <- file.path(data_dir, "SRP140558.tsv") -# Declare the file path to the metadata file using the data directory saved as `data_dir` -metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv") # Replace with file path to your metadata +# Declare the file path to the metadata file +# inside the directory saved as `data_dir` +# Replace with the path to your metadata file +metadata_file <- file.path(data_dir, "metadata_SRP140558.tsv") ``` Now that our file paths are declared, we can use the `file.exists()` function to check that the files are where we specified above. 
```{r} -# Check if the gene expression matrix file is at the file path stored in `data_file` +# Check if the gene expression matrix file is at the path stored in `data_file` file.exists(data_file) # Check if the metadata file is at the file path stored in `metadata_file` @@ -229,7 +234,7 @@ if (!("ComplexHeatmap" %in% installed.packages())) { Attach some of the packages we need for this analysis. -```{r} +```{r message=FALSE} # Attach the DESeq2 library library(DESeq2) @@ -256,7 +261,7 @@ metadata <- readr::read_tsv(metadata_file) # Read in data TSV file df <- readr::read_tsv(data_file) %>% - # Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later + # Here we are going to store the gene IDs as row names so that we can have a numeric matrix to perform calculations on later tibble::column_to_rownames("Gene") ``` @@ -291,23 +296,36 @@ df <- round(df) %>% dplyr::filter(rowSums(.) >= 50) ``` -Another thing we need to do is make sure our main experimental group label is set up. -In this case `refinebio_treatment` has two groups: `pre-adt` and `post-adt`. -To keep these two treatments in logical (rather than alphabetical) order, we will convert this to a factor with `pre-adt` as the first level. +Another thing we need to do is set up our main experimental group variable. +Unfortunately the metadata for this dataset are not set up into separate, neat columns, but we can accomplish that ourselves. + +For this study, PBMCs were collected at two time points: during the patients' first, acute bronchiolitis visit (abbreviated "AV") and their recovery visit, (called post-convalescence and abbreviated "CV"). + +For handier use of this information, we can create a new variable, `time_point`, that states this info more clearly. +This new `time_point` variable will have two labels: `acute illness` and `recovering` based on the `AV` or `CV` coding located in the `refinebio_title` string variable. 
```{r} metadata <- metadata %>% - dplyr::mutate(refinebio_treatment = factor(refinebio_treatment, - levels = c("pre-adt", "post-adt") - )) + dplyr::mutate( + time_point = dplyr::case_when( + # Create our new variable based on refinebio_title containing AV/CV + stringr::str_detect(refinebio_title, "_AV_") ~ "acute illness", + stringr::str_detect(refinebio_title, "_CV_") ~ "recovering" + ), + # It's easier for future items if this is already set up as a factor + time_point = as.factor(time_point) + ) ``` Let's double check that our factor set up is right. +We want `acute illness` to be the first level since it was the first time point collected. ```{r} -levels(metadata$refinebio_treatment) +levels(metadata$time_point) ``` +Great! We're all set. + ## Create a DESeqDataset We will be using the `DESeq2` package for [normalizing and transforming our data](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#deseq2-transformation-methods), which requires us to format our data into a `DESeqDataSet` object. @@ -384,7 +402,7 @@ ggplot(sft_df, aes(x = Power, y = model_fit, label = Power)) + # We will plot what WGCNA recommends as an R^2 cutoff geom_hline(yintercept = 0.80, col = "red") + # Just in case our values are low, we want to make sure we can still see the 0.80 level - ylim(c(min(sft_df$model_fit), 1)) + + ylim(c(min(sft_df$model_fit), 1.05)) + # We can add more sensible labels for our axis xlab("Soft Threshold (power)") + ylab("Scale Free Topology Model Fit, signed R^2") + @@ -399,14 +417,14 @@ WGCNA's authors recommend using a `power` that has an signed $R^2$ above `0.80`, If you have multiple power values with signed $R^2$ above `0.80`, then picking the one at an inflection point, in other words where the $R^2$ values seem to have reached their saturation [@Zhang2005]. You want to a `power` that gives you a big enough $R^2$ but is not excessively large. -So using the plot above, going with a power soft-threshold of `16`! 
+So using the plot above, going with a power soft-threshold of `7`! If you find you have all very low $R^2$ values this may be because there are too many genes with low expression values that are cluttering up the calculations. You can try returning to [gene filtering step](#define-a-minimum-counts-cutoff) and choosing a more stringent cutoff (you'll then need to re-run the transformation and subsequent steps to remake this plot to see if that helped). ## Run WGCNA! -We will use the `blockwiseModules()` function to find gene co-expression modules in WGCNA, using `16` for the `power` argument like we determined above. +We will use the `blockwiseModules()` function to find gene co-expression modules in WGCNA, using `7` for the `power` argument like we determined above. This next step takes some time to run. The `blockwise` part of the `blockwiseModules()` function name refers to that these calculations will be done on chunks of your data at a time to help with conserving computing resources. @@ -425,7 +443,7 @@ operating system and other running programs. 
bwnet <- blockwiseModules(normalized_counts, maxBlockSize = 5000, # What size chunks (how many genes) the calculations should be run in TOMType = "signed", # topological overlap matrix - power = 16, # soft threshold for network construction + power = 7, # soft threshold for network construction numericLabels = TRUE, # Let's use numbers instead of colors for module labels randomSeed = 1234, # there's some randomness associated with this calculation # so we should set a seed @@ -444,7 +462,7 @@ We will save our whole results object to an RDS file in case we want to return t ```{r} readr::write_rds(bwnet, - file = file.path("results", "SRP133573_wgcna_results.RDS") + file = file.path("results", "SRP140558_wgcna_results.RDS") ) ``` @@ -473,8 +491,8 @@ all.equal(metadata$refinebio_accession_code, rownames(module_eigengenes)) ``` ```{r} -# Create the design matrix from the refinebio_treatment variable -des_mat <- model.matrix(~ metadata$refinebio_treatment) +# Create the design matrix from the `time_point` variable +des_mat <- model.matrix(~ metadata$time_point) ``` Run linear model on each module. @@ -499,25 +517,25 @@ stats_df <- limma::topTable(fit, number = ncol(module_eigengenes)) %>% Let's take a look at the results. They are sorted with the most significant results at the top. -```{r} +```{r rownames.print = FALSE} head(stats_df) ``` -Module 52 seems to be the most differentially expressed across `refinebio_treatment` groups. +Module 19 seems to be the most differentially expressed across `time_point` groups. Now we can do some investigation into this module. -## Let's make plot of module 52 +## Let's make plot of module 19 -As a sanity check, let's use `ggplot` to see what module 52's eigengene looks like between treatment groups. +As a sanity check, let's use `ggplot` to see what module 19's eigengene looks like between treatment groups. First we need to set up the module eigengene for this module with the sample metadata labels we need. 
```{r} -module_52_df <- module_eigengenes %>% +module_19_df <- module_eigengenes %>% tibble::rownames_to_column("accession_code") %>% # Here we are performing an inner join with a subset of metadata dplyr::inner_join(metadata %>% - dplyr::select(refinebio_accession_code, refinebio_treatment), + dplyr::select(refinebio_accession_code, time_point), by = c("accession_code" = "refinebio_accession_code") ) ``` @@ -526,21 +544,24 @@ Now we are ready for plotting. ```{r} ggplot( - module_52_df, + module_19_df, aes( - x = refinebio_treatment, - y = ME52, - color = refinebio_treatment + x = time_point, + y = ME19, + color = time_point ) ) + - ggforce::geom_sina() + + # a boxplot with outlier points hidden (they will be in the sina plot) + geom_boxplot(width = 0.2, outlier.shape = NA) + + # A sina plot to show all of the individual data points + ggforce::geom_sina(maxwidth = 0.3) + theme_classic() ``` This makes sense! -Looks like module 52 has elevated expression post treatment. +Looks like module 19 has elevated expression during the acute illness but not when recovering. -## What genes are a part of module 52? +## What genes are a part of module 19? If you want to know which of your genes make up a modules, you can look at the `$colors` slot. This is a named list which associates the genes with the module they are a part of. @@ -552,18 +573,18 @@ gene_module_key <- tibble::enframe(bwnet$colors, name = "gene", value = "module" dplyr::mutate(module = paste0("ME", module)) ``` -Now we can find what genes are a part of module 52. +Now we can find what genes are a part of module 19. ```{r} gene_module_key %>% - dplyr::filter(module == "ME52") + dplyr::filter(module == "ME19") ``` Let's save this gene to module key to a TSV file for future use. 
```{r} readr::write_tsv(gene_module_key, - file = file.path("results", "SRP133573_wgcna_gene_to_module.tsv") + file = file.path("results", "SRP140558_wgcna_gene_to_module.tsv") ) ``` @@ -581,12 +602,12 @@ make_module_heatmap <- function(module_name, # Create a summary heatmap of a given module. # # Args: - # module_name: a character indicating what module should be plotted, e.g. "ME52" + # module_name: a character indicating what module should be plotted, e.g. "ME19" # expression_mat: The full gene expression matrix. Default is `normalized_counts`. - # metadata_df: a data frame with refinebio_accession_code and refinebio_treatment + # metadata_df: a data frame with refinebio_accession_code and time_point # as columns. Default is `metadata`. # gene_module_key: a data.frame indicating what genes are a part of what modules. Default is `gene_module_key`. - # module_eigengenes: a sample x eigengene data.frame with samples as rownames. Default is `module_eigengenes`. + # module_eigengenes: a sample x eigengene data.frame with samples as row names. Default is `module_eigengenes`. 
# # Returns: # A heatmap of expression matrix for a module's genes, with a barplot of the @@ -594,28 +615,28 @@ make_module_heatmap <- function(module_name, # Set up the module eigengene with its refinebio_accession_code module_eigengene <- module_eigengenes_df %>% - dplyr::select(module_name) %>% + dplyr::select(all_of(module_name)) %>% tibble::rownames_to_column("refinebio_accession_code") # Set up column annotation from metadata col_annot_df <- metadata_df %>% # Only select the treatment and sample ID columns - dplyr::select(refinebio_accession_code, refinebio_treatment) %>% + dplyr::select(refinebio_accession_code, time_point, refinebio_subject) %>% # Add on the eigengene expression by joining with sample IDs dplyr::inner_join(module_eigengene, by = "refinebio_accession_code") %>% - # Arrange by treatment - dplyr::arrange(refinebio_treatment, refinebio_accession_code) %>% + # Arrange by patient and time point + dplyr::arrange(time_point, refinebio_subject) %>% # Store sample tibble::column_to_rownames("refinebio_accession_code") # Create the ComplexHeatmap column annotation object col_annot <- ComplexHeatmap::HeatmapAnnotation( # Supply treatment labels - refinebio_treatment = col_annot_df$refinebio_treatment, + time_point = col_annot_df$time_point, # Add annotation barplot module_eigengene = ComplexHeatmap::anno_barplot(dplyr::select(col_annot_df, module_name)), - # Pick colors for each experimental group in refinebio_treatment - col = list(refinebio_treatment = c("post-adt" = "#f1a340", "pre-adt" = "#998ec3")) + # Pick colors for each experimental group in time_point + col = list(time_point = c("recovering" = "#f1a340", "acute illness" = "#998ec3")) ) # Get a vector of the Ensembl gene IDs that correspond to this module @@ -670,17 +691,17 @@ make_module_heatmap <- function(module_name, ## Make module heatmaps -Let's try out the custom heatmap function with module 52 (our most differentially expressed module). 
+Let's try out the custom heatmap function with module 19 (our most differentially expressed module). ```{r} -mod_52_heatmap <- make_module_heatmap(module_name = "ME52") +mod_19_heatmap <- make_module_heatmap(module_name = "ME19") # Print out the plot -mod_52_heatmap +mod_19_heatmap ``` -From the barplot portion of our plot, we can see `post-adt` samples have higher values for this eigengene for module 52. -In the heatmap portion, we can see how the individual genes that make up module 52 have more extreme values (very high or very low) in the `post-adt` samples. +From the barplot portion of our plot, we can see `acute illness` samples tend to have higher expression values for the module 19 eigengene. +In the heatmap portion, we can see how the individual genes that make up module 19 are overall higher than in the `recovering` samples. We can save this plot to PNG. @@ -693,15 +714,14 @@ dev.off() For comparison, let's try out the custom heatmap function with a different, _not_ differentially expressed module. ```{r} -mod_10_heatmap <- make_module_heatmap(module_name = "ME10") +mod_25_heatmap <- make_module_heatmap(module_name = "ME25") # Print out the plot -mod_10_heatmap +mod_25_heatmap ``` -In this non-significant module's heatmap, there's not a particularly strong pattern between pre and post ADT samples. -In general the expression of genes in module 10 does not vary much between groups, staying near the overall mean. -There are a few samples and some genes that show higher expression, but it is not surprising this does not results in a significant overall difference between the groups. +In this non-significant module's heatmap, there's not a particularly strong pattern between acute illness and recovery samples. +Though we can still see the genes in this module seem to be very correlated with each other (which is how we found them in the first place, so this makes sense!). Save this plot also. 
@@ -725,7 +745,7 @@ This helps make your code more reproducible by recording what versions of softwa ```{r} # Print session info -sessionInfo() +sessioninfo::session_info() ``` # References diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html index 6e72d2a6..cdb0a008 100644 --- a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html +++ b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html @@ -2570,7 +2570,7 @@ "vtp_value":"" },{ "function":"__c", - "vtp_value":false + "vtp_value":0 },{ "function":"__aev", "vtp_varType":"URL", @@ -2850,308 +2850,314 @@ Copyright The Closure Library Authors. SPDX-License-Identifier: Apache-2.0 */ -var aa,ba=function(a){var b=0;return function(){return bb)a=0,b=2147483647;return Math.floor(Math.random()*(b-a+1)+a)},Ga=function(a,b){for(var c=new Ea,d=0;dc.length&&d&&b.push(c)});return b.join(",")};/* +var ba,ca=function(a){var b=0;return function(){return bb)a=0,b=2147483647;return Math.floor(Math.random()*(b-a+1)+a)},Ha=function(a,b){for(var c=new Ga,d=0;dc.length&&d&&b.push(c)});return b.join(",")};/* jQuery v1.9.1 (c) 2005, 2012 jQuery Foundation, Inc. jquery.org/license. 
*/ -var Xa=/\[object (Boolean|Number|String|Function|Array|Date|RegExp)\]/,Ya=function(a){if(null==a)return String(a);var b=Xa.exec(Object.prototype.toString.call(Object(a)));return b?b[1].toLowerCase():"object"},Za=function(a,b){return Object.prototype.hasOwnProperty.call(Object(a),b)},$a=function(a){if(!a||"object"!=Ya(a)||a.nodeType||a==a.window)return!1;try{if(a.constructor&&!Za(a,"constructor")&&!Za(a.constructor.prototype,"isPrototypeOf"))return!1}catch(c){return!1}for(var b in a);return void 0=== -b||Za(a,b)},m=function(a,b){var c=b||("array"==Ya(a)?[]:{}),d;for(d in a)if(Za(a,d)){var e=a[d];"array"==Ya(e)?("array"!=Ya(c[d])&&(c[d]=[]),c[d]=m(e,c[d])):$a(e)?($a(c[d])||(c[d]={}),c[d]=m(e,c[d])):c[d]=e}return c};var ab=function(a){if(void 0===a||Aa(a)||$a(a))return!0;switch(typeof a){case "boolean":case "number":case "string":case "function":return!0}return!1};var zb; -var Ab=[],Bb=[],Cb=[],Db=[],Eb=[],Fb={},Gb,Hb,Ib,Jb=function(a,b){var c=a["function"];if(!c)throw Error("Error: No function name given for function call.");var d=Fb[c],e={},f;for(f in a)a.hasOwnProperty(f)&&0===f.indexOf("vtp_")&&(d&&b&&b.Te&&b.Te(a[f]),e[void 0!==d?f:f.substr(4)]=a[f]);return void 0!==d?d(e):zb(c,e,b)},Lb=function(a,b,c){c=c||[];var d={},e;for(e in a)a.hasOwnProperty(e)&&(d[e]=Kb(a[e],b,c));return d},Mb=function(a){var b=a["function"];if(!b)throw"Error: No function name given for function call."; -var c=Fb[b];return c?c.priorityOverride||0:0},Kb=function(a,b,c){if(Aa(a)){var d;switch(a[0]){case "function_id":return a[1];case "list":d=[];for(var e=1;ec&&(b["k"+c]=Yb(Vb(e,40)),b["v"+c]=Yb(f),c++))});var d=[];Ha(b,function(e,f){d.push(""+e+f)});return d.join("~")},Yb=function(a){return(""+a).replace(/~/g,function(){return"~~"})},Wb={item_id:"id",item_name:"nm",item_brand:"br",item_category:"ca",item_category2:"c2",item_category3:"c3",item_category4:"c4",item_category5:"c5", 
-item_variant:"va",price:"pr",quantity:"qt",coupon:"cp",item_list_name:"ln",index:"lp",item_list_id:"li",discount:"ds",affiliation:"af",promotion_id:"pi",promotion_name:"pn",creative_name:"cn",creative_slot:"cs",location_id:"lo"},Zb={id:"id",name:"nm",brand:"br",category:"ca",variant:"va",list_name:"ln",list_position:"lp",list:"ln",position:"lp",creative:"cn"};var ac=function(a){var b=[];Ha(a,function(c,d){null!=d&&b.push(encodeURIComponent(c)+"="+encodeURIComponent(String(d)))});return b.join("&")},bc=function(a,b,c){this.qa=a.qa;this.Na=a.Na;this.I=a.I;this.i=b;this.o=ac(a.qa);this.h=ac(a.I);this.L=c?this.h.length:0;if(16384this.events.length&&16384>a.L+this.h,c=this.qa===a.o&&this.i===a.i;return 0==this.events.length||b&&c}; -var dc=function(a,b){Ha(a,function(c,d){null!=d&&b.push(encodeURIComponent(c)+"="+encodeURIComponent(d))})},ec=function(a,b){var c=[];a.o&&c.push(a.o);b&&c.push("_s="+b);dc(a.Na,c);var d=!1;a.h&&(c.push(a.h),d=!0);var e=c.join("&"),f="",h=e.length+a.i.length+1;d&&2048x&&(v=w,x=A)});y==c.length&&(h[n]=v)});dc(h,d);b&&d.push("_s="+b);for(var k=d.join("&"),l=[],q={},r=0;rc&&(b["k"+c]=jc(hc(e,40)),b["v"+c]=jc(f),c++))});var d=[];Ia(b,function(e,f){d.push(""+e+f)});return d.join("~")},jc=function(a){return(""+a).replace(/~/g,function(){return"~~"})},ic={item_id:"id",item_name:"nm",item_brand:"br",item_category:"ca",item_category2:"c2",item_category3:"c3",item_category4:"c4",item_category5:"c5", +item_variant:"va",price:"pr",quantity:"qt",coupon:"cp",item_list_name:"ln",index:"lp",item_list_id:"li",discount:"ds",affiliation:"af",promotion_id:"pi",promotion_name:"pn",creative_name:"cn",creative_slot:"cs",location_id:"lo"},lc={id:"id",name:"nm",brand:"br",category:"ca",variant:"va",list_name:"ln",list_position:"lp",list:"ln",position:"lp",creative:"cn"};var nc=function(a){var b=[];Ia(a,function(c,d){null!=d&&b.push(encodeURIComponent(c)+"="+encodeURIComponent(String(d)))});return 
b.join("&")},oc=function(a,b,c){this.ra=a.ra;this.Oa=a.Oa;this.J=a.J;this.i=b;this.o=nc(a.ra);this.h=nc(a.J);this.L=c?this.h.length:0;if(16384this.events.length&&16384>a.L+this.h,c=this.ra===a.o&&this.i===a.i;return 0==this.events.length||b&&c}; +var qc=function(a,b){Ia(a,function(c,d){null!=d&&b.push(encodeURIComponent(c)+"="+encodeURIComponent(d))})},rc=function(a,b){var c=[];a.o&&c.push(a.o);b&&c.push("_s="+b);qc(a.Oa,c);var d=!1;a.h&&(c.push(a.h),d=!0);var e=c.join("&"),f="",h=e.length+a.i.length+1;d&&2048x&&(v=w,x=A)});y==c.length&&(h[p]=v)});qc(h,d);b&&d.push("_s="+b);for(var k=d.join("&"),l=[],r={},q=0;q"}else r=void 0===d?"undefined":null===d?"null":typeof d;xc("Argument is not a %s (or a non-Element, non-Location mock); got: %s", -"HTMLScriptElement",r)}var t;e instanceof Dc&&e.constructor===Dc?t=e.h:(xc("expected object of type TrustedResourceUrl, got '"+e+"' of type "+ua(e)),t="type_error:TrustedResourceUrl");d.src=t;var n=sa(d.ownerDocument&&d.ownerDocument.defaultView);n&&d.setAttribute("nonce",n);Wc(d,b);c&&(d.onerror=c);var u=sa();u&&d.setAttribute("nonce",u);var v=H.getElementsByTagName("script")[0]||H.body||H.head;v.parentNode.insertBefore(d,v);return d},Yc=function(){if(Uc){var a=Uc.toLowerCase();if(0===a.indexOf("https://"))return 2; -if(0===a.indexOf("http://"))return 3}return 1},Zc=function(a,b){var c=H.createElement("iframe");c.height="0";c.width="0";c.style.display="none";c.style.visibility="hidden";var d=H.body&&H.body.lastChild||H.body||H.head;d.parentNode.insertBefore(c,d);Wc(c,b);void 0!==a&&(c.src=a);return c},$c=function(a,b,c){var d=new Image(1,1);d.onload=function(){d.onload=null;b&&b()};d.onerror=function(){d.onerror=null;c&&c()};d.src=a;return d},ad=function(a,b,c,d){a.addEventListener?a.addEventListener(b,c,!!d): -a.attachEvent&&a.attachEvent("on"+b,c)},bd=function(a,b,c){a.removeEventListener?a.removeEventListener(b,c,!1):a.detachEvent&&a.detachEvent("on"+b,c)},I=function(a){G.setTimeout(a,0)},cd=function(a,b){return 
a&&b&&a.attributes&&a.attributes[b]?a.attributes[b].value:null},dd=function(a){var b=a.innerText||a.textContent||"";b&&" "!=b&&(b=b.replace(/^[\s\xa0]+|[\s\xa0]+$/g,""));b&&(b=b.replace(/(\xa0+|\s{2,}|\n|\r\t)/g," "));return b},ed=function(a){var b=H.createElement("div");Rc(b,Sc("A
    "+a+"
    ")); -b=b.lastChild;for(var c=[];b.firstChild;)c.push(b.removeChild(b.firstChild));return c},fd=function(a,b,c){c=c||100;for(var d={},e=0;ec?a.href:a.href.substr(0,c)}return b},Fe=function(a){var b=H.createElement("a");a&&(b.href=a);var c=b.pathname;"/"!==c[0]&&(a||tc("TAGGING",1),c="/"+c);var d=b.hostname.replace(ze,"");return{href:b.href,protocol:b.protocol,host:b.host,hostname:d,pathname:c,search:b.search,hash:b.hash,port:b.port}},Ge=function(a){function b(q){var r=q.split("=")[0];return 0>d.indexOf(r)?q:r+"=0"}function c(q){return q.split("&").map(b).filter(function(r){return void 0!=r}).join("&")}var d="gclid dclid gclaw gcldc gclgp gclha gclgf _gl".split(" "), -e=Fe(a),f=a.split(/[?#]/)[0],h=e.search,k=e.hash;"?"===h[0]&&(h=h.substring(1));"#"===k[0]&&(k=k.substring(1));h=c(h);k=c(k);""!==h&&(h="?"+h);""!==k&&(k="#"+k);var l=""+f+h+k;"/"===l[l.length-1]&&(l=l.substring(0,l.length-1));return l};function He(a,b,c){for(var d=[],e=b.split(";"),f=0;f>21:d;return[Math.round(2147483647*Math.random())^d&2147483647,Math.round(Na()/1E3)].join(".")},Ze=function(a,b,c,d,e){var f=Xe(b);return Me(a,f,Ye(c),d,e)},$e=function(a,b,c,d){var e=""+Xe(c),f=Ye(d);1>2,q=(f&3)<<4|h>>4,r=(h&15)<<2|k>>6,p=k&63;e||(p=64,d||(r=64));b.push(jf[l],jf[q],jf[r],jf[p])}return b.join("")} -function nf(a){function b(l){for(;d>4);64!=h&&(c+=String.fromCharCode(f<<4&240|h>>2),64!=k&&(c+=String.fromCharCode(h<<6&192|k)))}};var of;var sf=function(){var a=pf,b=qf,c=rf(),d=function(h){a(h.target||h.srcElement||{})},e=function(h){b(h.target||h.srcElement||{})};if(!c.init){ad(H,"mousedown",d);ad(H,"keyup",d);ad(H,"submit",e);var f=HTMLFormElement.prototype.submit;HTMLFormElement.prototype.submit=function(){b(this);f.call(this)};c.init=!0}},tf=function(a,b,c,d,e){var f={callback:a,domains:b,fragment:2===c,placement:c,forms:d,sameHost:e};rf().decorators.push(f)},uf=function(a,b,c){for(var d=rf().decorators,e={},f=0;ff;f++){for(var h=f,k=0;8>k;k++)h= -h&1?h>>>1^3988292384:h>>>1;e[f]=h}d=e}of=d;for(var 
l=4294967295,q=0;q>>8^of[(l^c.charCodeAt(q))&255];return((l^-1)>>>0).toString(36)},Df=function(){return function(a){var b=Fe(G.location.href),c=b.search.replace("?",""),d=Ae(c,"_gl",!0)||"";a.query=Cf(d)||{};var e=De(b,"fragment").match(zf("_gl"));a.fragment=Cf(e&&e[3]||"")||{}}},Ef=function(a){var b=Df(),c=rf();c.data||(c.data={query:{},fragment:{}},b(c.data));var d={},e=c.data;e&&(Qa(d,e.query),a&&Qa(d,e.fragment));return d},Cf= -function(a){var b;b=void 0===b?3:b;try{if(a){var c;a:{for(var d=a,e=0;3>e;++e){var f=vf.exec(d);if(f){c=f;break a}d=decodeURIComponent(d)}c=void 0}var h=c;if(h&&"1"===h[1]){var k=h[3],l;a:{for(var q=h[2],r=0;rr){q=!0;break b}q=!1}if(!q){var n=af(b,l,!0);n.Ka="ad_storage";Se(h,k,n)}}}}Zf(Xf(c.gclid,c.gclsrc),b)})},ag=function(a,b){var c=Qf[a];if(void 0!==c)return b+c},bg=function(a){var b=a.split(".");return 3!==b.length||"GCL"!==b[0]?0:1E3*(Number(b[1])|| -0)};function Tf(a){var b=a.split(".");if(3==b.length&&"GCL"==b[0]&&b[1])return b[2]} -var dg=function(a,b,c,d,e){if(Aa(b)){var f=Wf(e),h=function(){for(var k={},l=0;lb)){var c=a.substring(0,b);if(og.test(c)){for(var d=a.substring(b+1).split("/"),e=0;ek;k++){var l=h[k].src;if(l){l=l.toLowerCase();if(0===l.indexOf(e)){b=3;break a}1===f&&0===l.indexOf(d)&&(f=2)}}b=f}else b=a;return b}; -var xg=function(a,b,c){if(G[a.functionName])return b.vd&&I(b.vd),G[a.functionName];var d=wg();G[a.functionName]=d;if(a.jc)for(var e=0;ec.indexOf("Safari")||/Chrome|Coast|Opera|Edg|Silk|Android/.test(c)||11>((/Version\/([\d]+)/.exec(c)||[])[1]||"")?!1:!0}a=!b}if(a)return-1;var d=Ia("1");return ye(1,100)Ba(c,k))if(l&&0Ba(c,l[p])){E(11);r=!1;break a}}else{r=!1;break a}r=!0}q=r}var t=!1;if(d){var n=0<=Ba(e,k);if(n)t=n;else{var u=Ga(e,l||[]);u&&E(10);t=u}}var v=!q||t;v||!(0<=Ba(l,"sandboxedScripts"))||c&&-1!==Ba(c,"sandboxedScripts")||(v=Ga(e,Pg));return f[k]=v}}, -Qg=function(){return Mg.test(G.location&&G.location.hostname)};var Sg={active:!0,isAllowed:function(){return!0}},Tg=function(a){var b=L.zones;return 
b?b.checkState(Wd.B,a):Sg},Ug=function(a){var b=L.zones;!b&&a&&(b=L.zones=a());return b};var Vg=function(){},Wg=function(){};var Xg=!1,Yg=0,Zg=[];function $g(a){if(!Xg){var b=H.createEventObject,c="complete"==H.readyState,d="interactive"==H.readyState;if(!a||"readystatechange"!=a.type||c||!b&&d){Xg=!0;for(var e=0;eYg){Yg++;try{H.documentElement.doScroll("left"),$g()}catch(a){G.setTimeout(ah,50)}}}var bh=function(a){Xg?a():Zg.push(a)};var ch={},dh={},eh=function(a,b,c,d){if(!dh[a]||Zd[b]||"__zone"===b)return-1;var e={};$a(d)&&(e=m(d,e));e.id=c;e.status="timeout";return dh[a].tags.push(e)-1},fh=function(a,b,c,d){if(dh[a]){var e=dh[a].tags[b];e&&(e.status=c,e.executionTime=d)}};function gh(a){for(var b=ch[a]||[],c=0;c=c&&gh(a)})},Dg:function(){d=!0;b>=c&&gh(a)}}};var lh=function(){function a(d){return!za(d)||0>d?0:d}if(!L._li&&G.performance&&G.performance.timing){var b=G.performance.timing.navigationStart,c=za(pe.get("gtm.start"))?pe.get("gtm.start"):0;L._li={cst:a(c-b),cbt:a(ee-b)}}};var ph={},qh=function(){return G.GoogleAnalyticsObject&&G[G.GoogleAnalyticsObject]},rh=!1; -var sh=function(a){G.GoogleAnalyticsObject||(G.GoogleAnalyticsObject=a||"ga");var b=G.GoogleAnalyticsObject;if(G[b])G.hasOwnProperty(b)||E(12);else{var c=function(){c.q=c.q||[];c.q.push(arguments)};c.l=Number(new Date);G[b]=c}lh();return G[b]},th=function(a,b,c,d){b=String(b).replace(/\s+/g,"").split(",");var e=qh();e(a+"require","linker");e(a+"linker:autoLink",b,c,d)},uh=function(a){}; -var wh=function(a){},vh=function(){return G.GoogleAnalyticsObject||"ga"},xh=function(a,b){return function(){var c=qh(),d=c&&c.getByName&&c.getByName(a);if(d){var e=d.get("sendHitTask");d.set("sendHitTask",function(f){var h=f.get("hitPayload"),k=f.get("hitCallback"),l=0>h.indexOf("&tid="+b);l&&(f.set("hitPayload",h.replace(/&tid=UA-[0-9]+-[0-9]+/,"&tid="+ -b),!0),f.set("hitCallback",void 0,!0));e(f);l&&(f.set("hitPayload",h,!0),f.set("hitCallback",k,!0),f.set("_x_19",void 0,!0),e(f))})}}}; -var 
Ch=function(){return"&tc="+Db.filter(function(a){return a}).length},Fh=function(){2022<=Dh().length&&Eh()},Hh=function(){Gh||(Gh=G.setTimeout(Eh,500))},Eh=function(){Gh&&(G.clearTimeout(Gh),Gh=void 0);void 0===Ih||Jh[Ih]&&!Kh&&!Lh||(Mh[Ih]||Nh.sh()||0>=Oh--?(E(1),Mh[Ih]=!0):(Nh.Oh(),$c(Dh()),Jh[Ih]=!0,Ph=Qh=Rh=Lh=Kh=""))},Dh=function(){var a=Ih;if(void 0===a)return"";var b=uc("GTM"),c=uc("TAGGING");return[Sh,Jh[a]?"":"&es=1",Th[a],b?"&u="+b:"",c?"&ut="+c:"",Ch(),Kh,Lh,Rh?Rh:"",Qh,Ph,"&z=0"].join("")}, -Uh=function(){return[fe,"&v=3&t=t","&pid="+Da(),"&rv="+Wd.hc].join("")},Vh="0.005000">Math.random(),Sh=Uh(),Wh=function(){Sh=Uh()},Jh={},Kh="",Lh="",Ph="",Qh="",Rh="",Ih=void 0,Th={},Mh={},Gh=void 0,Nh=function(a,b){var c=0,d=0;return{sh:function(){if(c=b&&(c=0);return c>=a},Oh:function(){Na()-d>=b&&(c=0);c++;d=Na()}}}(2,1E3),Oh=1E3,Xh=function(a,b,c){if(Vh&&!Mh[a]&&b){a!==Ih&&(Eh(),Ih=a);var d,e=String(b[Ob.Ia]||"").replace(/_/g,"");0===e.indexOf("cvt")&&(e="cvt"); -d=e;var f=c+d;Kh=Kh?Kh+"."+f:"&tr="+f;var h=b["function"];if(!h)throw Error("Error: No function name given for function call.");var k=(Fb[h]?"1":"2")+d;Ph=Ph?Ph+"."+k:"&ti="+k;Hh();Fh()}},Yh=function(a,b,c){if(Vh&&!Mh[a]){a!==Ih&&(Eh(),Ih=a);var d=c+b;Lh=Lh?Lh+"."+d:"&epr="+d;Hh();Fh()}},Zh=function(a,b,c){}; -function $h(a,b,c,d){var e=Db[a],f=ai(a,b,c,d);if(!f)return null;var h=Kb(e[Ob.Ne],c,[]);if(h&&h.length){var k=h[0];f=$h(k.index,{H:f,F:1===k.Ye?b.terminate:f,terminate:b.terminate},c,d)}return f} -function ai(a,b,c,d){function e(){if(f[Ob.kg])k();else{var x=Lb(f,c,[]);var z=eh(c.id,String(f[Ob.Ia]),Number(f[Ob.Oe]),x[Ob.lg]),A=!1;x.vtp_gtmOnSuccess=function(){if(!A){A=!0;var F=Na()-D;Xh(c.id,Db[a],"5");fh(c.id,z,"success", -F);h()}};x.vtp_gtmOnFailure=function(){if(!A){A=!0;var F=Na()-D;Xh(c.id,Db[a],"6");fh(c.id,z,"failure",F);k()}};x.vtp_gtmTagId=f.tag_id;x.vtp_gtmEventId=c.id;Xh(c.id,f,"1");var B=function(){var F=Na()-D;Xh(c.id,f,"7");fh(c.id,z,"exception",F);A||(A=!0,k())};var 
D=Na();try{Jb(x,c)}catch(F){B(F)}}}var f=Db[a],h=b.H,k=b.F,l=b.terminate;if(c.qd(f))return null;var q=Kb(f[Ob.Pe],c,[]);if(q&&q.length){var r=q[0],p=$h(r.index,{H:h,F:k,terminate:l},c,d);if(!p)return null;h=p;k=2===r.Ye?l:p}if(f[Ob.Je]||f[Ob.og]){var t=f[Ob.Je]?Eb:c.Xh,n=h,u=k;if(!t[a]){e=Pa(e); -var v=bi(a,t,e);h=v.H;k=v.F}return function(){t[a](n,u)}}return e}function bi(a,b,c){var d=[],e=[];b[a]=ci(d,e,c);return{H:function(){b[a]=di;for(var f=0;fe?1:dk?1:hd;++d){var e;try{e=!(!c.frames||!c.frames[b])}catch(k){e=!1}if(e)return c;var f;a:{try{var h=c.parent;if(h&&h!=c){f=h;break a}}catch(k){}f=null}if(!(c=f))break}return null};var Ii=function(){};var Ji=function(a){if(jd("tteai")){if(void 0!==a.tcString&&"string"!==typeof a.tcString||void 0!==a.gdprApplies&&"boolean"!==typeof a.gdprApplies||void 0!==a.listenerId&&"number"!==typeof a.listenerId||void 0!==a.addtlConsent&&"string"!==typeof a.addtlConsent)return 2}else if(void 0!==a.addtlConsent&&"string"!==typeof a.addtlConsent&&(a.addtlConsent=void 0),void 0!==a.gdprApplies&&"boolean"!==typeof a.gdprApplies&&(a.gdprApplies=void 0),void 0!==a.tcString&&"string"!==typeof a.tcString||void 0!== -a.listenerId&&"number"!==typeof a.listenerId)return 2;return a.cmpStatus&&"error"!==a.cmpStatus?0:3},Ki=function(a,b){this.i=a;this.h=null;this.L={};this.oa=0;this.ia=void 0===b?500:b;this.o=null};ma(Ki,Ii);var Mi=function(a){return"function"===typeof a.i.__tcfapi||null!=Li(a)}; -Ki.prototype.addEventListener=function(a){var b={},c=zc(function(){return a(b)}),d=0;-1!==this.ia&&(d=setTimeout(function(){b.tcString="tcunavailable";b.internalErrorState=1;c()},this.ia));var e=function(f,h){clearTimeout(d);f?(b=f,b.internalErrorState=Ji(b),h&&0===b.internalErrorState||(b.tcString="tcunavailable",h||(b.internalErrorState=3))):(b.tcString="tcunavailable",b.internalErrorState=3);a(b)};try{Ni(this,"addEventListener",e)}catch(f){b.tcString="tcunavailable",b.internalErrorState=3,d&&(clearTimeout(d), 
-d=0),c()}};Ki.prototype.removeEventListener=function(a){a&&a.listenerId&&Ni(this,"removeEventListener",null,a.listenerId)}; -var Pi=function(a,b,c){var d;d=void 0===d?"755":d;var e;a:{if(a.publisher&&a.publisher.restrictions){var f=a.publisher.restrictions[b];if(void 0!==f){e=f[void 0===d?"755":d];break a}}e=void 0}var h=e;if(0===h)return!1;var k=c;2===c?(k=0,2===h&&(k=1)):3===c&&(k=1,1===h&&(k=0));var l;if(0===k)if(a.purpose&&a.vendor){var q=Oi(a.vendor.consents,void 0===d?"755":d);l=q&&"1"===b&&a.purposeOneTreatment&&"DE"===a.publisherCC?!0:q&&Oi(a.purpose.consents,b)}else l=!0;else l=1===k?a.purpose&&a.vendor?Oi(a.purpose.legitimateInterests, -b)&&Oi(a.vendor.legitimateInterests,void 0===d?"755":d):!0:!0;return l},Oi=function(a,b){return!(!a||!a[b])},Ni=function(a,b,c,d){c||(c=function(){});if("function"===typeof a.i.__tcfapi){var e=a.i.__tcfapi;e(b,2,c,d)}else if(Li(a)){Qi(a);var f=++a.oa;a.L[f]=c;if(a.h){var h={};a.h.postMessage((h.__tcfapiCall={command:b,version:2,callId:f,parameter:d},h),"*")}}else c({},!1)},Li=function(a){if(a.h)return a.h;a.h=Hi(a.i,"__tcfapiLocator");return a.h},Qi=function(a){a.o||(a.o=function(b){try{var c,d;"string"=== -typeof b.data?d=JSON.parse(b.data):d=b.data;c=d.__tcfapiReturn;a.L[c.callId](c.returnValue,c.success)}catch(e){}},Ei(a.i,a.o))};var Ri={1:0,3:0,4:0,7:3,9:3,10:3};function Si(a,b){if(""===a)return b;var c=Number(a);return isNaN(c)?b:c}var Ti=Si("",550),Ui=Si("",500);function Vi(){var a=L.tcf||{};return L.tcf=a} -var Wi=function(a,b){this.o=a;this.h=b;this.i=Na();},Xi=function(a){},Yi=function(a){},dj=function(){var a=Vi(),b=new Ki(G,3E3),c=new Wi(b,a);if((Zi()?!0===G.gtag_enable_tcf_support:!1!==G.gtag_enable_tcf_support)&&!a.active&&("function"===typeof G.__tcfapi||Mi(b))){a.active=!0;a.Kb={};$i();var d=setTimeout(function(){aj(a);bj(a);d=null},Ui);try{b.addEventListener(function(e){d&&(clearTimeout(d),d=null);if(0!==e.internalErrorState)aj(a),bj(a),Xi(c);else{var 
f;if(!1===e.gdprApplies)f=cj(),b.removeEventListener(e); -else if("tcloaded"===e.eventStatus||"useractioncomplete"===e.eventStatus||"cmpuishown"===e.eventStatus){var h={},k;for(k in Ri)if(Ri.hasOwnProperty(k))if("1"===k){var l=e,q=!0;q=void 0===q?!1:q;var r;var p=l;!1===p.gdprApplies?r=!0:(void 0===p.internalErrorState&&(p.internalErrorState=Ji(p)),r="error"===p.cmpStatus||0!==p.internalErrorState||"loaded"===p.cmpStatus&&("tcloaded"===p.eventStatus||"useractioncomplete"===p.eventStatus)?!0:!1);h["1"]=r?!1===l.gdprApplies||"tcunavailable"===l.tcString|| -void 0===l.gdprApplies&&!q||"string"!==typeof l.tcString||!l.tcString.length?!0:Pi(l,"1",0):!1}else h[k]=Pi(e,k,Ri[k]);f=h}f&&(a.tcString=e.tcString||"tcempty",a.Kb=f,bj(a),Xi(c))}}),Yi(c)}catch(e){d&&(clearTimeout(d),d=null),aj(a),bj(a)}}};function aj(a){a.type="e";a.tcString="tcunavailable";a.Kb=cj()}function $i(){var a={};yd((a.ad_storage="denied",a.wait_for_update=Ti,a))} -var Zi=function(){var a=!1;a=!0;return a};function cj(){var a={},b;for(b in Ri)Ri.hasOwnProperty(b)&&(a[b]=!0);return a}function bj(a){var b={};zd((b.ad_storage=a.Kb["1"]?"granted":"denied",b))} -var ej=function(){var a=Vi();if(a.active&&void 0!==a.loadTime)return Number(a.loadTime)},fj=function(){var a=Vi();return a.active?a.tcString||"":""},gj=function(a){if(!Ri.hasOwnProperty(String(a)))return!0;var b=Vi();return b.active&&b.Kb?!!b.Kb[String(a)]:!0};function hj(a,b,c){function d(r){var p;L.reported_gclid||(L.reported_gclid={});p=L.reported_gclid;var t=f+(r?"gcu":"gcs");if(!p[t]){p[t]=!0;var n=[],u=function(z,A){A&&n.push(z+"="+encodeURIComponent(A))},v="https://www.google.com";if(td()){var x=Ad(C.s);u("gcs",Bd());r&&u("gcu","1");L.dedupe_gclid||(L.dedupe_gclid=""+We());u("rnd",L.dedupe_gclid);if((!f||h&&"aw.ds"!==h)&&Ad(C.s)){var y=Vf("_gcl_aw");u("gclaw",y.join("."))}u("url",String(G.location).split(/[?#]/)[0]);u("dclid",ij(b,k));!x&&b&&(v= 
-"https://pagead2.googlesyndication.com")}u("gdpr_consent",fj());"1"===Ef(!1)._up&&u("gtm_up","1");u("gclid",ij(b,f));u("gclsrc",h);u("gtm",Di(!c));var w=v+"/pagead/landing?"+n.join("&");gd(w)}}var e=Yf(),f=e.gclid||"",h=e.gclsrc,k=e.dclid||"",l=!a&&(!f||h&&"aw.ds"!==h?!1:!0),q=td();if(l||q)q?Cd(function(){d();Ad(C.s)||wd(function(r){return d(!0,r.Ue)},C.s)},[C.s]):d()} -function ij(a,b){var c=a&&!Ad(C.s);return b&&c?"0":b}var jj=function(a){if(H.hidden)return!0;var b=a.getBoundingClientRect();if(b.top==b.bottom||b.left==b.right||!G.getComputedStyle)return!0;var c=G.getComputedStyle(a,null);if("hidden"===c.visibility)return!0;for(var d=a,e=c;d;){if("none"===e.display)return!0;var f=e.opacity,h=e.filter;if(h){var k=h.indexOf("opacity(");0<=k&&(h=h.substring(k+8,h.indexOf(")",k)),"%"==h.charAt(h.length-1)&&(h=h.substring(0,h.length-1)),f=Math.min(h,f))}if(void 0!==f&&0>=f)return!0;(d=d.parentElement)&&(e=G.getComputedStyle(d, +var C={wb:"_ee",fd:"_syn",ri:"_uei",ni:"_pci",Pc:"event_callback",Xb:"event_timeout",ca:"gtag.config"};C.Ga="gtag.get";C.ja="purchase";C.Va="refund";C.Fa="begin_checkout";C.Ta="add_to_cart";C.Ua="remove_from_cart";C.Kf="view_cart";C.$d="add_to_wishlist";C.wa="view_item";C.Zd="view_promotion";C.Yd="select_promotion";C.Kc="select_item";C.Tb="view_item_list";C.Xd="add_payment_info";C.Jf="add_shipping_info"; +C.za="value_key",C.ya="value_callback";C.da="allow_ad_personalization_signals";C.Xc="restricted_data_processing";C.kb="allow_google_signals";C.fa="cookie_expires";C.Wb="cookie_update";C.tb="session_duration";C.ma="user_properties";C.Ja="transport_url";C.M="ads_data_redaction";C.s="ad_storage";C.F="analytics_storage";C.Cf="region";C.Df="wait_for_update"; 
+C.Lc="page_view",C.ae="user_engagement",C.Ef="app_remove",C.Ff="app_store_refund",C.Gf="app_store_subscription_cancel",C.Hf="app_store_subscription_convert",C.If="app_store_subscription_renew",C.Lf="first_open",C.Mf="first_visit",C.Nf="in_app_purchase",C.Of="session_start",C.Pf="allow_custom_scripts",C.Qf="allow_display_features",C.Mc="allow_enhanced_conversions",C.se="enhanced_conversions",C.Wa="client_id",C.V="cookie_domain",C.Vb="cookie_name",C.Ha="cookie_path",C.ka="cookie_flags",C.xa="currency", +C.ke="custom_map",C.Tc="groups",C.Xa="language",C.ie="country",C.ji="non_interaction",C.rb="page_location",C.Aa="page_referrer",C.Wc="page_title",C.sb="send_page_view",C.Ia="send_to",C.Yc="session_engaged",C.$b="session_id",C.Zc="session_number",C.ig="tracking_id",C.la="linker",C.Ba="url_passthrough",C.ob="accept_incoming",C.D="domains",C.qb="url_position",C.pb="decorate_forms",C.xe="phone_conversion_number",C.ve="phone_conversion_callback",C.we="phone_conversion_css_class",C.ye="phone_conversion_options", +C.dg="phone_conversion_ids",C.cg="phone_conversion_country_code",C.be="aw_remarketing",C.ce="aw_remarketing_only",C.Ub="gclid",C.Ca="value",C.eg="quantity",C.Uf="affiliation",C.qe="tax",C.pe="shipping",C.Oc="list_name",C.oe="checkout_step",C.ne="checkout_option",C.Vf="coupon",C.Wf="promotions",C.ub="transaction_id",C.vb="user_id",C.fg="retoken",C.mb="conversion_linker",C.lb="conversion_cookie_prefix",C.X="cookie_prefix",C.U="items",C.he="aw_merchant_id",C.ee="aw_feed_country",C.fe="aw_feed_language", 
+C.de="discount",C.me="disable_merchant_reported_purchases",C.ue="new_customer",C.je="customer_lifetime_value",C.Tf="dc_natural_search",C.Sf="dc_custom_params",C.jg="trip_type",C.bg="passengers",C.$f="method",C.hg="search_term",C.Rf="content_type",C.ag="optimize_id",C.Xf="experiments",C.nb="google_signals",C.Sc="google_tld",C.ac="update",C.Rc="firebase_id",C.Yb="ga_restrict_domain",C.Qc="event_settings",C.Nc="dynamic_event_settings",C.gg="screen_name",C.Zf="_x_19",C.Yf="_x_20",C.Vc="internal_traffic_results", +C.ze="traffic_type",C.Zb="referral_exclusion_definition",C.Uc="ignore_referrer",C.$c="delivery_postal_code",C.te="estimated_delivery_date",C.kg=[C.da,C.Mc,C.kb,C.U,C.Xc,C.V,C.fa,C.ka,C.Vb,C.Ha,C.X,C.Wb,C.ke,C.Nc,C.Pc,C.Qc,C.Xb,C.Yb,C.nb,C.Sc,C.Tc,C.Vc,C.la,C.Zb,C.Ia,C.sb,C.tb,C.Ja,C.ac,C.ma,C.$c],C.Ae=[C.rb,C.Aa,C.Wc,C.Xa,C.gg,C.vb,C.Rc],C.lg=[C.Ef,C.Ff,C.Gf,C.Hf,C.If,C.Lf,C.Mf,C.Nf,C.Of,C.ae],C.Rd=[C.Ia,C.be,C.ce,C.sb,C.Xa,C.Ca,C.xa,C.ub,C.vb,C.mb,C.lb,C.X,C.V,C.fa,C.ka,C.rb,C.Aa,C.xe,C.ve,C.we, +C.ye,C.U,C.he,C.ee,C.fe,C.de,C.me,C.ue,C.je,C.da,C.Xc,C.ac,C.Rc,C.se,C.Ja,C.Ba,C.Mc,C.$c,C.te];C.Ce=[C.ja,C.Va,C.Fa,C.Ta,C.Ua,C.Kf,C.$d,C.wa,C.Zd,C.Yd,C.Tb,C.Kc,C.Xd,C.Jf];C.Be=[C.da,C.kb,C.Wb];C.De=[C.fa,C.Xb,C.tb];var Ec={},Fc=function(a,b){Ec[a]=Ec[a]||[];Ec[a][b]=!0},Gc=function(a){for(var b=[],c=Ec[a]||[],d=0;d"}else q=void 0===d?"undefined":null===d?"null":typeof d;Jc("Argument is not a %s (or a non-Element, non-Location mock); got: %s", +"HTMLScriptElement",q)}var t;e instanceof Pc&&e.constructor===Pc?t=e.h:(Jc("expected object of type TrustedResourceUrl, got '"+e+"' of type "+wa(e)),t="type_error:TrustedResourceUrl");d.src=t;var p=ua(d.ownerDocument&&d.ownerDocument.defaultView);p&&d.setAttribute("nonce",p);id(d,b);c&&(d.onerror=c);var u=ua();u&&d.setAttribute("nonce",u);var v=H.getElementsByTagName("script")[0]||H.body||H.head;v.parentNode.insertBefore(d,v);return d},kd=function(){if(gd){var a=gd.toLowerCase();if(0===a.indexOf("https://"))return 2; 
+if(0===a.indexOf("http://"))return 3}return 1},ld=function(a,b){var c=H.createElement("iframe");c.height="0";c.width="0";c.style.display="none";c.style.visibility="hidden";var d=H.body&&H.body.lastChild||H.body||H.head;d.parentNode.insertBefore(c,d);id(c,b);void 0!==a&&(c.src=a);return c},md=function(a,b,c){var d=new Image(1,1);d.onload=function(){d.onload=null;b&&b()};d.onerror=function(){d.onerror=null;c&&c()};d.src=a;return d},nd=function(a,b,c,d){a.addEventListener?a.addEventListener(b,c,!!d): +a.attachEvent&&a.attachEvent("on"+b,c)},od=function(a,b,c){a.removeEventListener?a.removeEventListener(b,c,!1):a.detachEvent&&a.detachEvent("on"+b,c)},I=function(a){G.setTimeout(a,0)},pd=function(a,b){return a&&b&&a.attributes&&a.attributes[b]?a.attributes[b].value:null},qd=function(a){var b=a.innerText||a.textContent||"";b&&" "!=b&&(b=b.replace(/^[\s\xa0]+|[\s\xa0]+$/g,""));b&&(b=b.replace(/(\xa0+|\s{2,}|\n|\r\t)/g," "));return b},rd=function(a){var b=H.createElement("div");dd(b,ed("A
    "+a+"
    ")); +b=b.lastChild;for(var c=[];b.firstChild;)c.push(b.removeChild(b.firstChild));return c},sd=function(a,b,c){c=c||100;for(var d={},e=0;e=f)return!0;(d=d.parentElement)&&(e=G.getComputedStyle(d, null))}return!1}; -var kj=function(){var a=H.body,b=H.documentElement||a&&a.parentElement,c,d;if(H.compatMode&&"BackCompat"!==H.compatMode)c=b?b.clientHeight:0,d=b?b.clientWidth:0;else{var e=function(f,h){return f&&h?Math.min(f,h):Math.max(f,h)};E(7);c=e(b?b.clientHeight:0,a?a.clientHeight:0);d=e(b?b.clientWidth:0,a?a.clientWidth:0)}return{width:d,height:c}},lj=function(a){var b=kj(),c=b.height,d=b.width,e=a.getBoundingClientRect(),f=e.bottom-e.top,h=e.right-e.left;return f&&h?(1-Math.min((Math.max(0-e.left,0)+Math.max(e.right- -d,0))/h,1))*(1-Math.min((Math.max(0-e.top,0)+Math.max(e.bottom-c,0))/f,1)):0};var sj=new RegExp(/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/i),tj=["SCRIPT","IMG","SVG","PATH","BR"],uj=["BR"];function vj(a){var b;if(a===H.body)b="body";else{var c;if(a.id)c="#"+a.id;else{var d;if(a.parentElement){var e;a:{var f=a.parentElement;if(f){for(var h=0;h:nth-child("+e+")"}else d="";c=d}b=c}return b} -function wj(){var a;var b=[],c=H.body;if(c){for(var d=c.querySelectorAll("*"),e=0;ee;e++){var f=d[e];if(!(0<=tj.indexOf(f.tagName.toUpperCase()))){for(var h=!1,k=0;kk;k++)if(!(0<=uj.indexOf(f.children[k].tagName.toUpperCase()))){h=!0;break}h||b.push(f)}}a={elements:b,status:1E4A;A++){var B=r[A].element;w.push({querySelector:vj(B),tagName:B.tagName,isVisible:!jj(B),type:1})}return{elements:w,status:z}} -var xj=function(a){var b=pi(a,"/pagead/conversion_async.js");if(b)return b;var c=-1!==navigator.userAgent.toLowerCase().indexOf("firefox"),d=vg("https://","http://","www.googleadservices.com");if(c||1===Kg())d="https://www.google.com";return d+"/pagead/conversion_async.js"},yj=!1,zj=[],Aj=["aw","dc"],Bj=function(a){var b=G.google_trackConversion,c=a.gtm_onFailure;"function"==typeof b?b(a)||c():c()},Cj=function(){for(;0c?a.href:a.href.substr(0,c)}return 
b},le=function(a){var b=H.createElement("a");a&&(b.href=a);var c=b.pathname;"/"!==c[0]&&(a||Fc("TAGGING",1),c="/"+c);var d=b.hostname.replace(fe,"");return{href:b.href,protocol:b.protocol,host:b.host,hostname:d,pathname:c,search:b.search,hash:b.hash,port:b.port}},me=function(a){function b(r){var q=r.split("=")[0];return 0>d.indexOf(q)?r:q+"=0"}function c(r){return r.split("&").map(b).filter(function(q){return void 0!=q}).join("&")}var d="gclid dclid gclaw gcldc gclgp gclha gclgf _gl".split(" "), +e=le(a),f=a.split(/[?#]/)[0],h=e.search,k=e.hash;"?"===h[0]&&(h=h.substring(1));"#"===k[0]&&(k=k.substring(1));h=c(h);k=c(k);""!==h&&(h="?"+h);""!==k&&(k="#"+k);var l=""+f+h+k;"/"===l[l.length-1]&&(l=l.substring(0,l.length-1));return l};var ne=new RegExp(/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/i),oe=["SCRIPT","IMG","SVG","PATH","BR"],pe=["BR"];function qe(a){var b;if(a===H.body)b="body";else{var c;if(a.id)c="#"+a.id;else{var d;if(a.parentElement){var e;a:{var f=a.parentElement;if(f){for(var h=0;h:nth-child("+e+")"}else d="";c=d}b=c}return b} +function re(){var a;var b=[],c=H.body;if(c){for(var d=c.querySelectorAll("*"),e=0;ee;e++){var f=d[e];if(!(0<=oe.indexOf(f.tagName.toUpperCase()))){for(var h=!1,k=0;kk;k++)if(!(0<=pe.indexOf(f.children[k].tagName.toUpperCase()))){h=!0;break}h||b.push(f)}}a={elements:b,status:1E4A;A++){var B=q[A].element;w.push({Od:q[A].Od,querySelector:qe(B),tagName:B.tagName,isVisible:!Xd(B),type:1})}return{elements:w,status:z}}var Fe={},L=null,Ge=Math.random();Fe.B="G-C59ZQK86CL";Fe.ic="bu0";Fe.ki="";var He={__cl:!0,__ecl:!0,__ehl:!0,__evl:!0,__fal:!0,__fil:!0,__fsl:!0,__hl:!0,__jel:!0,__lcl:!0,__sdl:!0,__tl:!0,__ytl:!0},Ie={__paused:!0,__tg:!0},Je;for(Je in He)He.hasOwnProperty(Je)&&(Ie[Je]=!0);var Ke="www.googletagmanager.com/gtm.js";Ke="www.googletagmanager.com/gtag/js"; +var Le=Ke,Me=Ma(""),Ne=null,Oe=null,Pe="//www.googletagmanager.com/a?id="+Fe.B+"&cv=1",Qe={},Re={},Se=function(){var a=L.sequence||1;L.sequence=a+1;return a};var Te={},Ue=new 
Ga,Ve={},We={},Ze={name:"dataLayer",set:function(a,b){m(cb(a,b),Ve);Xe()},get:function(a){return Ye(a,2)},reset:function(){Ue=new Ga;Ve={};Xe()}},Ye=function(a,b){return 2!=b?Ue.get(a):$e(a)},$e=function(a){var b,c=a.split(".");b=b||[];for(var d=Ve,e=0;e>21:d;return[Math.round(2147483647*Math.random())^d&2147483647,Math.round(Pa()/1E3)].join(".")},Bf=function(a,b,c,d,e){var f=zf(b);return pf(a,f,Af(c),d,e)},Cf=function(a,b,c,d){var e=""+zf(c),f=Af(d);1>2,r=(f&3)<<4|h>>4,q=(h&15)<<2|k>>6,n=k&63;e||(n=64,d||(q=64));b.push(Lf[l],Lf[r],Lf[q],Lf[n])}return b.join("")} +function Pf(a){function b(l){for(;d>4);64!=h&&(c+=String.fromCharCode(f<<4&240|h>>2),64!=k&&(c+=String.fromCharCode(h<<6&192|k)))}};var Qf;var Uf=function(){var a=Rf,b=Sf,c=Tf(),d=function(h){a(h.target||h.srcElement||{})},e=function(h){b(h.target||h.srcElement||{})};if(!c.init){nd(H,"mousedown",d);nd(H,"keyup",d);nd(H,"submit",e);var f=HTMLFormElement.prototype.submit;HTMLFormElement.prototype.submit=function(){b(this);f.call(this)};c.init=!0}},Vf=function(a,b,c,d,e){var f={callback:a,domains:b,fragment:2===c,placement:c,forms:d,sameHost:e};Tf().decorators.push(f)},Wf=function(a,b,c){for(var d=Tf().decorators,e={},f=0;ff;f++){for(var h=f,k=0;8>k;k++)h= +h&1?h>>>1^3988292384:h>>>1;e[f]=h}d=e}Qf=d;for(var l=4294967295,r=0;r>>8^Qf[(l^c.charCodeAt(r))&255];return((l^-1)>>>0).toString(36)},eg=function(){return function(a){var b=le(G.location.href),c=b.search.replace("?",""),d=ge(c,"_gl",!0)||"";a.query=dg(d)||{};var e=je(b,"fragment").match(ag("_gl"));a.fragment=dg(e&&e[3]||"")||{}}},fg=function(a){var b=eg(),c=Tf();c.data||(c.data={query:{},fragment:{}},b(c.data));var d={},e=c.data;e&&(Ua(d,e.query),a&&Ua(d,e.fragment));return d},dg= +function(a){var b;b=void 0===b?3:b;try{if(a){var c;a:{for(var d=a,e=0;3>e;++e){var f=Xf.exec(d);if(f){c=f;break a}d=decodeURIComponent(d)}c=void 0}var h=c;if(h&&"1"===h[1]){var k=h[3],l;a:{for(var r=h[2],q=0;qq){r=!0;break b}r=!1}if(!r){var 
@@ -3732,6 +3738,9 @@
  • Dimension Reduction - UMAP
  • Ensembl Gene ID Annotation
  • Ortholog Mapping
  • +
  • Pathway Analysis - ORA
  • +
  • Pathway Analysis - GSEA
  • +
  • Pathway Analysis - GSVA
  • @@ -3772,8 +3781,8 @@

    November 2020

    1 Purpose of this analysis

    -

    In this example, we use weighted gene co-expression network analysis (WGCNA) to identify co-expressed gene modules (Langfelder and Horvath 2008). WGCNA uses a series of correlations to identify sets of genes that are expressed together in your data set. This is a fairly intuitive approach to gene network analysis which can aid in interpretation of microarray & RNAseq data.

    -

    As output, WGCNA gives groups of co-expressed genes as well as an eigengene x sample matrix (where the values for each eigengene represent the summarized expression for a group of co-expressed genes) (Langfelder and Horvath 2007). This eigengene x sample data can, in many instances, be used as you would the original gene expression values. In this example, we use eigengene x sample data to identify differentially expressed modules between our treatment and control group

    +

    In this example, we use weighted gene co-expression network analysis (WGCNA) to identify co-expressed gene modules (Langfelder and Horvath 2008). WGCNA uses a series of correlations to identify sets of genes that are expressed together in your data set. This is a fairly intuitive approach to gene network analysis which can aid in interpretation of microarray & RNA-seq data.

    +

    As output, WGCNA gives groups of co-expressed genes as well as an eigengene x sample matrix (where the values for each eigengene represent the summarized expression for a group of co-expressed genes) (Langfelder and Horvath 2007). This eigengene x sample data can, in many instances, be used as you would the original gene expression values. In this example, we use eigengene x sample data to identify differentially expressed modules between our treatment and control groups.

    This method does require some computing power, but can still be run locally (on your own computer) for most refine.bio datasets. As with many clustering and network methods, there are some parameters that may need tweaking.
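    To make the eigengene idea above more concrete, here is a minimal sketch using only base R and simulated data. It assumes the usual definition of a module eigengene as the first principal component of a module's standardized expression values. The object names (`module_signal`, `module_expr`, `eigengene`) are made up for illustration and are not part of the analysis in this example; WGCNA calculates eigengenes for you.

```r
# Toy illustration only: simulate one "module" of co-expressed genes and
# summarize it with a single eigengene (first principal component).
set.seed(12345)

n_samples <- 10
n_genes <- 5

# Hypothetical shared signal that every gene in the module follows
module_signal <- rnorm(n_samples)

# Samples x genes matrix: each gene is the shared signal plus a little noise
module_expr <- sapply(seq_len(n_genes), function(i) {
  module_signal + rnorm(n_samples, sd = 0.3)
})
rownames(module_expr) <- paste0("sample_", seq_len(n_samples))
colnames(module_expr) <- paste0("gene_", seq_len(n_genes))

# Standardize each gene, then take the first left singular vector
scaled_expr <- scale(module_expr)
eigengene <- svd(scaled_expr)$u[, 1]

# The sign of a singular vector is arbitrary, so orient it to correlate
# positively with the average expression of the module
if (cor(eigengene, rowMeans(scaled_expr)) < 0) {
  eigengene <- -eigengene
}
names(eigengene) <- rownames(module_expr)

# One value per sample, summarizing the whole module
eigengene
```

    Stacking one such vector per module gives an eigengene x sample matrix like the one described above, which is why it can often stand in for the original gene expression values in downstream tests.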

    ⬇️ Jump to the analysis code ⬇️

    @@ -3796,7 +3805,7 @@

    2.2 Set up your analysis folders<
     }

     # Define the file path to the plots directory
    -plots_dir <- "plots" # Can replace with path to desired output plots directory
    +plots_dir <- "plots"

     # Create the plots folder if it doesn't exist
     if (!dir.exists(plots_dir)) {
    @@ -3804,7 +3813,7 @@

    2.2 Set up your analysis folders<
     }

     # Define the file path to the results directory
    -results_dir <- "results" # Can replace with path to desired output results directory
    +results_dir <- "results"

     # Create the results folder if it doesn't exist
     if (!dir.exists(results_dir)) {
    @@ -3815,7 +3824,7 @@

    2.2 Set up your analysis folders<

    2.3 Obtain the dataset from refine.bio

    For general information about downloading data for these examples, see our ‘Getting Started’ section.

    -

    Go to this dataset’s page on refine.bio.

    +

    Go to this dataset’s page on refine.bio.

    Click the “Download Now” button on the right side of this screen.

    Fill out the pop up window with your email and our Terms and Conditions:

    @@ -3826,7 +3835,7 @@

    2.3 Obtain the dataset from refin

    2.4 About the dataset we are using for this example

    -

    For this example analysis, we will use this prostate cancer dataset. The data that we downloaded from refine.bio for this analysis has 175 RNA-seq samples obtained from 20 patients with prostate cancer. Patients underwent androgen deprivation therapy (ADT) and RNA-seq samples include pre-ADT biopsies and post-ADT prostatectomy specimens.

    +

    For this example analysis, we will use this acute viral bronchiolitis dataset. The data that we downloaded from refine.bio for this analysis has 62 paired peripheral blood mononuclear cell RNA-seq samples obtained from 31 patients. Samples were collected at two time points: during each patient’s first, acute bronchiolitis visit (abbreviated “AV”) and during their recovery, at their post-convalescence visit (abbreviated “CV”).
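    A paired design like this one means each patient should contribute exactly one sample per time point. The sketch below shows one way to check that, using a small made-up metadata table. The column names `patient_id` and `time_point` are assumptions for illustration only and are not necessarily the column names in the refine.bio metadata file.

```r
# Hypothetical metadata: the column names `patient_id` and `time_point`
# are assumptions for illustration, not the real refine.bio column names
toy_metadata <- data.frame(
  sample_id = paste0("sample_", 1:6),
  patient_id = rep(c("patient_1", "patient_2", "patient_3"), each = 2),
  time_point = rep(c("AV", "CV"), times = 3)
)

# Count the samples each patient has at each time point
pairing_table <- table(toy_metadata$patient_id, toy_metadata$time_point)
pairing_table

# TRUE only if every patient has exactly one AV and one CV sample
all_paired <- all(pairing_table == 1)
all_paired
```

    With real metadata, the same `table()` call can help spot patients that are missing a time point before any paired comparison is set up.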

    2.5 Place the dataset in your new data/ folder

    @@ -3834,7 +3843,7 @@

    2.5 Place the dataset in your new

    For more details on the contents of this folder see these docs on refine.bio.

    The <experiment_accession_id> folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235 or SRP12345.

    -

    Copy and paste the SRP133573 folder into your newly created data/ folder.

    +

    Copy and paste the SRP140558 folder into your newly created data/ folder.

    2.6 Check out our file structure!

    @@ -3844,7 +3853,7 @@

    2.6 Check out our file structure!
  • A folder called “data” which contains:
      -
    • The SRP133573 folder which contains: +
    • The SRP140558 folder which contains:
      • The gene expression
      • @@ -3860,15 +3869,20 @@

        2.6 Check out our file structure!

        In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

        First, we will declare the file paths to our data and metadata files, which should be in our data directory. This is handy because if we want to switch the dataset we are using for this analysis (see next section for more on this), we will only have to change the file paths here to get started.

        # Define the file path to the data directory
        -data_dir <- file.path("data", "SRP133573") # Replace with accession number which will be the name of the folder the files will be in
        -
        -# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
        -data_file <- file.path(data_dir, "SRP133573.tsv") # Replace with file path to your dataset
        -
        -# Declare the file path to the metadata file using the data directory saved as `data_dir`
        -metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv") # Replace with file path to your metadata
        +# Replace with the path of the folder the files will be in
        +data_dir <- file.path("data", "SRP140558")
        +
        +# Declare the file path to the gene expression matrix file
        +# inside directory saved as `data_dir`
        +# Replace with the path to your dataset file
        +data_file <- file.path(data_dir, "SRP140558.tsv")
        +
        +# Declare the file path to the metadata file
        +# inside the directory saved as `data_dir`
        +# Replace with the path to your metadata file
        +metadata_file <- file.path(data_dir, "metadata_SRP140558.tsv")

  • Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

    -
    # Check if the gene expression matrix file is at the file path stored in `data_file`
    +
    # Check if the gene expression matrix file is at the path stored in `data_file`
     file.exists(data_file)
    ## [1] TRUE
    # Check if the metadata file is at the file path stored in `metadata_file`
    @@ -3899,9 +3913,9 @@ 

    4 Identifying co-expression gene

    4.1 Install libraries

    See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

    -

    We will be using DESeq2 to normalize and transform our RNA-seq data before running WGCNA, so we will need to install that (Love et al. 2014).

    -

    Of course, we will need the WGCNA package (Langfelder and Horvath 2008). But WGCNA also requires a package called impute that it sometimes has trouble installing so we recommend installing that first (Hastie et al. 2020).

    -

    For plotting purposes will be creating a sina plot and heatmaps which we will need a ggplot2 companion package for, called ggforce as well as the ComplexHeatmap package (Gu 2020).

    +

    We will be using DESeq2 to normalize and transform our RNA-seq data before running WGCNA, so we will need to install that (Love et al. 2014).

    +

    Of course, we will need the WGCNA package (Langfelder and Horvath 2008). But WGCNA also requires a package called impute that it sometimes has trouble installing so we recommend installing that first (Hastie et al. 2020).

    +

    For plotting, we will create a sina plot and heatmaps, which require ggforce (a ggplot2 companion package) as well as the ComplexHeatmap package (Gu 2020).

    if (!("DESeq2" %in% installed.packages())) {
       # Install this package if it isn't installed yet
       BiocManager::install("DESeq2", update = FALSE)
    @@ -3928,480 +3942,406 @@ 

    4.1 Install libraries

    }

    Attach some of the packages we need for this analysis.

    # Attach the DESeq2 library
    -library(DESeq2)
    -
    ## Loading required package: S4Vectors
    -
    ## Loading required package: stats4
    -
    ## Loading required package: BiocGenerics
    -
    ## Loading required package: parallel
    -
    ## 
    -## Attaching package: 'BiocGenerics'
    -
    ## The following objects are masked from 'package:parallel':
    -## 
    -##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    -##     clusterExport, clusterMap, parApply, parCapply, parLapply,
    -##     parLapplyLB, parRapply, parSapply, parSapplyLB
    -
    ## The following objects are masked from 'package:stats':
    -## 
    -##     IQR, mad, sd, var, xtabs
    -
    ## The following objects are masked from 'package:base':
    -## 
    -##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    -##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    -##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    -##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    -##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    -##     union, unique, unsplit, which.max, which.min
    -
    ## 
    -## Attaching package: 'S4Vectors'
    -
    ## The following object is masked from 'package:base':
    -## 
    -##     expand.grid
    -
    ## Loading required package: IRanges
    -
    ## Loading required package: GenomicRanges
    -
    ## Loading required package: GenomeInfoDb
    -
    ## Loading required package: SummarizedExperiment
    -
    ## Loading required package: MatrixGenerics
    -
    ## Loading required package: matrixStats
    -
    ## 
    -## Attaching package: 'MatrixGenerics'
    -
    ## The following objects are masked from 'package:matrixStats':
    -## 
    -##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
    -##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
    -##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
    -##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
    -##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
    -##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
    -##     colWeightedMeans, colWeightedMedians, colWeightedSds,
    -##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
    -##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
    -##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
    -##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
    -##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
    -##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
    -##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
    -##     rowWeightedSds, rowWeightedVars
    -
    ## Loading required package: Biobase
    -
    ## Welcome to Bioconductor
    -## 
    -##     Vignettes contain introductory material; view with
    -##     'browseVignettes()'. To cite Bioconductor, see
    -##     'citation("Biobase")', and for packages 'citation("pkgname")'.
    -
    ## 
    -## Attaching package: 'Biobase'
    -
    ## The following object is masked from 'package:MatrixGenerics':
    -## 
    -##     rowMedians
    -
    ## The following objects are masked from 'package:matrixStats':
    -## 
    -##     anyMissing, rowMedians
    -
    # We will need this so we can use the pipe: %>%
    -library(magrittr)
    -
    -# We'll need this for finding gene modules
    -library(WGCNA)
    -
    ## Loading required package: dynamicTreeCut
    -
    ## Loading required package: fastcluster
    -
    ## 
    -## Attaching package: 'fastcluster'
    -
    ## The following object is masked from 'package:stats':
    -## 
    -##     hclust
    -
    ## 
    -
    ## 
    -## Attaching package: 'WGCNA'
    -
    ## The following object is masked from 'package:IRanges':
    -## 
    -##     cor
    -
    ## The following object is masked from 'package:S4Vectors':
    -## 
    -##     cor
    -
    ## The following object is masked from 'package:stats':
    -## 
    -##     cor
    -
    # We'll be making some plots
    -library(ggplot2)
    +library(DESeq2)
    +
    +# We will need this so we can use the pipe: %>%
    +library(magrittr)
    +
    +# We'll need this for finding gene modules
    +library(WGCNA)
    +
    +# We'll be making some plots
    +library(ggplot2)

    4.2 Import and set up data

    Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read both TSV files and add them as data frames to your environment.

    We stored our file paths as objects named metadata_file and data_file in this previous step.

    -
    # Read in metadata TSV file
    -metadata <- readr::read_tsv(metadata_file)
    +
    # Read in metadata TSV file
    +metadata <- readr::read_tsv(metadata_file)
    ## 
    -## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────
    +## ── Column specification ──────────────────────────────────────────────────────────────────────────────────
     ## cols(
    -##   .default = col_character(),
    -##   refinebio_age = col_logical(),
    -##   refinebio_cell_line = col_logical(),
    -##   refinebio_compound = col_logical(),
    -##   refinebio_disease_stage = col_logical(),
    -##   refinebio_genetic_information = col_logical(),
    -##   refinebio_processed = col_logical(),
    -##   refinebio_sex = col_logical(),
    -##   refinebio_source_archive_url = col_logical(),
    -##   refinebio_specimen_part = col_logical(),
    -##   refinebio_time = col_logical()
    +##   .default = col_logical(),
    +##   refinebio_accession_code = col_character(),
    +##   experiment_accession = col_character(),
    +##   refinebio_organism = col_character(),
    +##   refinebio_platform = col_character(),
    +##   refinebio_source_database = col_character(),
    +##   refinebio_subject = col_character(),
    +##   refinebio_title = col_character()
     ## )
     ## ℹ Use `spec()` for the full column specifications.
    -
    # Read in data TSV file
    -df <- readr::read_tsv(data_file) %>%
    -  # Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later
    -  tibble::column_to_rownames("Gene")
    +
    # Read in data TSV file
    +df <- readr::read_tsv(data_file) %>%
    +  # Here we are going to store the gene IDs as row names so that we can have a numeric matrix to perform calculations on later
    +  tibble::column_to_rownames("Gene")
    ## 
    -## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────
    +## ── Column specification ──────────────────────────────────────────────────────────────────────────────────
     ## cols(
     ##   .default = col_double(),
     ##   Gene = col_character()
     ## )
     ## ℹ Use `spec()` for the full column specifications.

    Let’s ensure that the metadata and data are in the same sample order.

    -
    # Make the data in the order of the metadata
    -df <- df %>%
    -  dplyr::select(metadata$refinebio_accession_code)
    -
    -# Check if this is in the same order
    -all.equal(colnames(df), metadata$refinebio_accession_code)
    +
    # Make the data in the order of the metadata
    +df <- df %>%
    +  dplyr::select(metadata$refinebio_accession_code)
    +
    +# Check if this is in the same order
    +all.equal(colnames(df), metadata$refinebio_accession_code)
    ## [1] TRUE

    4.2.1 Prepare data for DESeq2

    -

    There are two things we neeed to do to prep our expression data for DESeq2.

    +

    There are two things we need to do to prep our expression data for DESeq2.

    First, we need to make sure all of the values in our data are converted to integers as required by a DESeq2 function we will use later.

    Then, we need to filter out the genes that have not been expressed or that have low expression counts. This is recommended by WGCNA docs for RNA-seq data. Removing low count genes can also help improve your WGCNA results. We are going to do some pre-filtering to keep only genes with 50 or more reads in total across the samples.

    -
    # The next DESeq2 functions need the values to be converted to integers
    -df <- round(df) %>%
    -  # The next steps require a data frame and round() returns a matrix
    -  as.data.frame() %>%
    -  # Only keep rows that have total counts above the cutoff
    -  dplyr::filter(rowSums(.) >= 50)
    -

    Another thing we need to do is make sure our main experimental group label is set up. In this case refinebio_treatment has two groups: pre-adt and post-adt. To keep these two treatments in logical (rather than alphabetical) order, we will convert this to a factor with pre-adt as the first level.

    -
    metadata <- metadata %>%
    -  dplyr::mutate(refinebio_treatment = factor(refinebio_treatment,
    -    levels = c("pre-adt", "post-adt")
    -  ))
    -

    Let’s double check that our factor set up is right.

    -
    levels(metadata$refinebio_treatment)
    -
    ## [1] "pre-adt"  "post-adt"
    +
    # The next DESeq2 functions need the values to be converted to integers
    +df <- round(df) %>%
    +  # The next steps require a data frame and round() returns a matrix
    +  as.data.frame() %>%
    +  # Only keep rows that have total counts above the cutoff
    +  dplyr::filter(rowSums(.) >= 50)
    +

    Another thing we need to do is set up our main experimental group variable. Unfortunately the metadata for this dataset are not set up into separate, neat columns, but we can accomplish that ourselves.

    +

    For this study, PBMCs were collected at two time points: during the patients’ first, acute bronchiolitis visit (abbreviated “AV”) and their recovery visit (called post-convalescence and abbreviated “CV”).

    +

    For handier use of this information, we can create a new variable, time_point, that states this info more clearly. This new time_point variable will have two labels: acute illness and recovering based on the AV or CV coding located in the refinebio_title string variable.

    +
    metadata <- metadata %>%
    +  dplyr::mutate(
    +    time_point = dplyr::case_when(
    +      # Create our new variable based on refinebio_title containing AV/CV
    +      stringr::str_detect(refinebio_title, "_AV_") ~ "acute illness",
    +      stringr::str_detect(refinebio_title, "_CV_") ~ "recovering"
    +    ),
    +    # It's easier for future items if this is already set up as a factor
    +    time_point = as.factor(time_point)
    +  )
    +

    Let’s double check that our factor set up is right. We want acute illness to be the first level since it was the first time point collected.

    +
    levels(metadata$time_point)
    +
    ## [1] "acute illness" "recovering"
    +

    Great! We’re all set.
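    The level order shown above works because `as.factor()` sorts levels alphabetically, and "acute illness" happens to sort before "recovering". The following is a minimal base R sketch of setting the levels explicitly so the order does not depend on alphabetical sorting. The `example_titles` values are made up for illustration and are not taken from the dataset.

```r
# Hypothetical sample titles (assumption: not real values from SRP140558)
example_titles <- c("PBMC_AV_01", "PBMC_CV_01", "PBMC_CV_02", "PBMC_AV_02")

# Label each sample based on whether its title contains "_AV_" or "_CV_"
time_point <- ifelse(
  grepl("_AV_", example_titles, fixed = TRUE), "acute illness",
  ifelse(grepl("_CV_", example_titles, fixed = TRUE), "recovering", NA)
)

# Set the level order explicitly instead of relying on alphabetical sorting
time_point <- factor(time_point, levels = c("acute illness", "recovering"))

levels(time_point)
```

    With explicit `levels`, the first time point stays the reference level even if the labels are later renamed.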

    4.3 Create a DESeqDataset

    -

    We will be using the DESeq2 package for normalizing and transforming our data, which requires us to format our data into a DESeqDataSet object. We turn the data frame (or matrix) into a DESeqDataSet object and specify which variable labels our experimental groups using the design argument (Love et al. 2014). In this chunk of code, we will not provide a specific model to the design argument because we are not performing a differential expression analysis.

    -
    # Create a `DESeqDataSet` object
    -dds <- DESeqDataSetFromMatrix(
    -  countData = df, # Our prepped data frame with counts
    -  colData = metadata, # Data frame with annotation for our samples
    -  design = ~1 # Here we are not specifying a model
    -)
    +

    We will be using the DESeq2 package for normalizing and transforming our data, which requires us to format our data into a DESeqDataSet object. We turn the data frame (or matrix) into a DESeqDataSet object and specify which variable labels our experimental groups using the design argument (Love et al. 2014). In this chunk of code, we will not provide a specific model to the design argument because we are not performing a differential expression analysis.

    +
    # Create a `DESeqDataSet` object
    +dds <- DESeqDataSetFromMatrix(
    +  countData = df, # Our prepped data frame with counts
    +  colData = metadata, # Data frame with annotation for our samples
    +  design = ~1 # Here we are not specifying a model
    +)
    ## converting counts to integer mode

    4.4 Perform DESeq2 normalization and transformation

    We often suggest normalizing and transforming your data for various applications and in this instance WGCNA’s authors suggest using variance stabilizing transformation before running WGCNA.
    We are going to use the vst() function from the DESeq2 package to normalize and transform the data. For more information about these transformation methods, see here.

    -
    # Normalize and transform the data in the `DESeqDataSet` object using the `vst()`
    -# function from the `DESEq2` R package
    -dds_norm <- vst(dds)
    -

    At this point, if your data has any outliers, you should look into removing them as they can affect your WGCNA results. WGCNA’s tutorial has an example of exploring your data for outliers you can reference.

    +
    # Normalize and transform the data in the `DESeqDataSet` object using the `vst()`
    +# function from the `DESeq2` R package
    +dds_norm <- vst(dds)
    +

    At this point, if your data set has any outlier samples, you should look into removing them as they can affect your WGCNA results.

    +

    WGCNA’s tutorial has an example of exploring your data for outliers you can reference.

    +

    For this example data set, we will skip this step (there are no obvious outliers) and proceed.

    4.5 Format normalized data for WGCNA

    Extract the normalized counts to a matrix and transpose it so we can pass it to WGCNA.

    -
    # Retrieve the normalized data from the `DESeqDataSet`
    -normalized_counts <- assay(dds_norm) %>%
    -  t() # Transpose this data
    +
    # Retrieve the normalized data from the `DESeqDataSet`
    +normalized_counts <- assay(dds_norm) %>%
    +  t() # Transpose this data

    4.6 Determine parameters for WGCNA

    To identify which genes are in the same modules, WGCNA first creates a weighted network to define which genes are near each other. The measure of “adjacency” it uses is based on the correlation matrix, but requires the definition of a threshold value, which in turn depends on a “power” parameter that defines the exponent used when transforming the correlation values. The choice of power parameter will affect the number of modules identified, and the WGCNA package provides the pickSoftThreshold() function to help identify good choices for this parameter.

    -
    sft <- pickSoftThreshold(normalized_counts,
    -  dataIsExpr = TRUE,
    -  corFnc = cor,
    -  networkType = "signed"
    -)
    -
    ## Warning: executing %dopar% sequentially: no parallel backend registered
    +
    sft <- pickSoftThreshold(normalized_counts,
    +  dataIsExpr = TRUE,
    +  corFnc = cor,
    +  networkType = "signed"
    +)
    +
    ## Warning: executing %dopar% sequentially: no parallel backend
    +## registered
    ##    Power SFT.R.sq  slope truncated.R.sq mean.k. median.k. max.k.
    -## 1      1  0.58200 12.200          0.957 13500.0   13600.0  15500
    -## 2      2  0.44500  5.130          0.972  7630.0    7650.0   9910
    -## 3      3  0.26300  2.570          0.985  4480.0    4450.0   6680
    -## 4      4  0.06480  0.914          0.985  2730.0    2680.0   4720
    -## 5      5  0.00662 -0.236          0.964  1720.0    1660.0   3450
    -## 6      6  0.15900 -1.010          0.965  1120.0    1060.0   2580
    -## 7      7  0.36500 -1.470          0.971   746.0     689.0   1980
    -## 8      8  0.50000 -1.730          0.972   509.0     459.0   1550
    -## 9      9  0.59700 -1.910          0.972   356.0     313.0   1220
    -## 10    10  0.67000 -2.060          0.973   253.0     217.0    982
    -## 11    12  0.74000 -2.260          0.970   135.0     110.0    651
    -## 12    14  0.79400 -2.320          0.978    76.9      58.6    447
    -## 13    16  0.82000 -2.350          0.981    45.9      32.7    315
    -## 14    18  0.83800 -2.360          0.985    28.6      18.9    227
    -## 15    20  0.84500 -2.350          0.987    18.5      11.2    167
    +## 1      1   0.0491  42.50          0.947 13400.0  13400.00  13600
    +## 2      2   0.8530 -12.60          0.871  7230.0   7080.00   8430
    +## 3      3   0.8800  -5.41          0.856  4120.0   3900.00   5840
    +## 4      4   0.8910  -3.28          0.864  2470.0   2230.00   4340
    +## 5      5   0.9060  -2.39          0.882  1560.0   1310.00   3380
    +## 6      6   0.9140  -1.96          0.895  1030.0    798.00   2740
    +## 7      7   0.9220  -1.72          0.908   706.0    496.00   2280
    +## 8      8   0.9190  -1.58          0.910   504.0    314.00   1940
    +## 9      9   0.9180  -1.48          0.917   371.0    203.00   1680
    +## 10    10   0.9080  -1.42          0.915   282.0    134.00   1470
    +## 11    12   0.9050  -1.34          0.927   174.0     60.40   1170
    +## 12    14   0.8870  -1.31          0.927   116.0     28.60    964
    +## 13    16   0.8660  -1.32          0.918    81.7     14.00    810
    +## 14    18   0.8560  -1.33          0.921    59.7      7.13    692
    +## 15    20   0.8570  -1.33          0.929    45.0      3.71    599

    This sft object has a lot of information; we will want to plot some of it to figure out what our power soft-threshold should be. We have to first calculate a measure of the model fit, the signed \(R^2\), and make that a new variable.

    -
    sft_df <- data.frame(sft$fitIndices) %>%
    -  dplyr::mutate(model_fit = -sign(slope) * SFT.R.sq)
    +
    sft_df <- data.frame(sft$fitIndices) %>%
    +  dplyr::mutate(model_fit = -sign(slope) * SFT.R.sq)
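    As a small worked illustration of the signed \(R^2\) calculation, the sketch below applies the same formula to three rows copied from the `pickSoftThreshold()` output printed above. The `fit_example` data frame is a hypothetical stand-in for `sft$fitIndices`, used here so the example runs without WGCNA.

```r
# Three rows copied from the printed fit indices (powers 1, 2, and 7)
fit_example <- data.frame(
  Power = c(1, 2, 7),
  SFT.R.sq = c(0.0491, 0.8530, 0.9220),
  slope = c(42.50, -12.60, -1.72)
)

# A positive slope flips the sign of R^2, so poorly behaved fits become negative
fit_example$model_fit <- -sign(fit_example$slope) * fit_example$SFT.R.sq

print(fit_example)
```

    Power 1 has a positive slope, so its model fit becomes negative, while powers 2 and 7 keep their \(R^2\) values as positive numbers.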

    Now, let’s plot the model fitting by the power soft threshold so we can decide on a soft-threshold for power.

    -
    ggplot(sft_df, aes(x = Power, y = model_fit, label = Power)) +
    -  # Plot the points
    -  geom_point() +
    -  # We'll put the Power labels slightly above the data points
    -  geom_text(nudge_y = 0.1) +
    -  # We will plot what WGCNA recommends as an R^2 cutoff
    -  geom_hline(yintercept = 0.80, col = "red") +
    -  # Just in case our values are low, we want to make sure we can still see the 0.80 level
    -  ylim(c(min(sft_df$model_fit), 1)) +
    -  # We can add more sensible labels for our axis
    -  xlab("Soft Threshold (power)") +
    -  ylab("Scale Free Topology Model Fit, signed R^2") +
    -  ggtitle("Scale independence") +
    -  # This adds some nicer aesthetics to our plot
    -  theme_classic()
    -

    +
    ggplot(sft_df, aes(x = Power, y = model_fit, label = Power)) +
    +  # Plot the points
    +  geom_point() +
    +  # We'll put the Power labels slightly above the data points
    +  geom_text(nudge_y = 0.1) +
    +  # We will plot what WGCNA recommends as an R^2 cutoff
    +  geom_hline(yintercept = 0.80, col = "red") +
    +  # Just in case our values are low, we want to make sure we can still see the 0.80 level
    +  ylim(c(min(sft_df$model_fit), 1.05)) +
    +  # We can add more sensible labels for our axis
    +  xlab("Soft Threshold (power)") +
    +  ylab("Scale Free Topology Model Fit, signed R^2") +
    +  ggtitle("Scale independence") +
    +  # This adds some nicer aesthetics to our plot
    +  theme_classic()
    +

    Using this plot we can decide on a power parameter. WGCNA’s authors recommend using a power that has a signed \(R^2\) above 0.80, otherwise they warn your results may be too noisy to be meaningful.

    -

    If you have multiple power values with signed \(R^2\) above 0.80, then picking the one at an inflection point, in other words where the \(R^2\) values seem to have reached their saturation (Zhang and Horvath 2005). You want to a power that gives you a big enough \(R^2\) but is not excessively large.

    -

    So using the plot above, going with a power soft-threshold of 16!

    +

    If you have multiple power values with signed \(R^2\) above 0.80, then pick the one at an inflection point, in other words where the \(R^2\) values seem to have reached their saturation (Zhang and Horvath 2005). You want a power that gives you a big enough \(R^2\) but is not excessively large.

    +

    So using the plot above, we will go with a power soft-threshold of 7!

    If you find that all of your \(R^2\) values are very low, this may be because there are too many genes with low expression values that are cluttering up the calculations. You can try returning to the gene filtering step and choosing a more stringent cutoff (you’ll then need to re-run the transformation and subsequent steps to remake this plot to see if that helped).
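    The choice of power is ultimately a judgment call made from the plot, but the reasoning can be sketched in code. The example below is one possible heuristic, not a rule from WGCNA: among powers whose signed \(R^2\) is above 0.80, take the first one after which the fit stops improving. The values are copied from the fit indices printed earlier, and the names `sft_example`, `passing`, and `chosen_power` are made up for this illustration.

```r
# Fit indices copied from the printed `pickSoftThreshold()` output
sft_example <- data.frame(
  Power = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 16, 18, 20),
  SFT.R.sq = c(
    0.0491, 0.853, 0.880, 0.891, 0.906, 0.914, 0.922, 0.919,
    0.918, 0.908, 0.905, 0.887, 0.866, 0.856, 0.857
  ),
  slope = c(
    42.50, -12.60, -5.41, -3.28, -2.39, -1.96, -1.72, -1.58,
    -1.48, -1.42, -1.34, -1.31, -1.32, -1.33, -1.33
  )
)

# Signed R^2, as calculated for the plot
sft_example$model_fit <- -sign(sft_example$slope) * sft_example$SFT.R.sq

# Keep only powers above the recommended 0.80 cutoff
passing <- sft_example[sft_example$model_fit > 0.80, ]

# Change in fit from each passing power to the next one
gain <- diff(passing$model_fit)

# Heuristic (an assumption): first power after which the fit no longer improves
plateau_index <- which(gain <= 0)[1]

# If the fit never plateaus, fall back to the largest passing power
if (is.na(plateau_index)) {
  plateau_index <- nrow(passing)
}

chosen_power <- passing$Power[plateau_index]
chosen_power
```

    For these values the heuristic lands on 7, which agrees with the choice made from the plot. It is worth checking any such automated pick against the plot itself.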

    4.7 Run WGCNA!

    -

    We will use the blockwiseModules() function to find gene co-expression modules in WGCNA, using 16 for the power argument like we determined above.

    +

    We will use the blockwiseModules() function to find gene co-expression modules in WGCNA, using 7 for the power argument like we determined above.

    This next step takes some time to run. The blockwise part of the blockwiseModules() function name refers to the fact that these calculations are done on chunks of your data at a time, which helps conserve computing resources.

    Here we are using the default maxBlockSize, 5000, but you may want to adjust the maxBlockSize argument depending on your computer’s memory. The authors of WGCNA recommend running the largest block your computer can handle, and they provide some approximations of how much memory (in GB) is needed to handle a given maxBlockSize:

    • If the reader has access to a large workstation with more than 4 GB of memory, the parameter maxBlockSize can be increased. A 16GB workstation should handle up to 20000 probes; a 32GB workstation should handle perhaps 30000. A 4GB standard desktop or a laptop may handle up to 8000-10000 probes, depending on operating system and other running programs.

    -

    (Langfelder and Horvath 2016)

    -
    bwnet <- blockwiseModules(normalized_counts,
    -  maxBlockSize = 5000, # What size chunks (how many genes) the calculations should be run in
    -  TOMType = "signed", # topological overlap matrix
    -  power = 16, # soft threshold for network construction
    -  numericLabels = TRUE, # Let's use numbers instead of colors for module labels
    -  randomSeed = 1234, # there's some randomness associated with this calculation
    -  # so we should set a seed
    -)
    -

    The TOMtype argument specifies what kind of topological overlap matrix (TOM) should be used to make gene modules. You can safely assume for most situations a signed network represents what you want – we want WGCNA to pay attention to directionality. However if you suspect you may benefit from an unsigned network, where positive/negative is ignored see this article to help you figure that out (Langfelder 2018).

    -

    There are a lot of other settings you can tweak – look at ?blockwiseModules help page as well as the WGCNA tutorial (Langfelder and Horvath 2016).

    +

    (Langfelder and Horvath 2016)
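    The quoted guidance can be turned into a rough lookup. The helper below, `suggest_max_block_size()`, is hypothetical and not part of WGCNA; it simply encodes the approximate figures from the quote, using the lower end of the 8000-10000 range for 4 GB machines and falling back to 5000 (the value used in this example) for anything smaller.

```r
# Hypothetical helper (not a WGCNA function): map available memory in GB
# to a rough `maxBlockSize`, based on the approximate figures quoted above
suggest_max_block_size <- function(memory_gb) {
  if (memory_gb >= 32) {
    30000
  } else if (memory_gb >= 16) {
    20000
  } else if (memory_gb >= 4) {
    8000 # Conservative end of the quoted 8000-10000 range
  } else {
    5000 # Assumption: fall back to the value used in this example
  }
}

suggest_max_block_size(16)
```

    These are only starting points; the usable block size also depends on the operating system and on what else is running.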

    +
    bwnet <- blockwiseModules(normalized_counts,
    +  maxBlockSize = 5000, # What size chunks (how many genes) the calculations should be run in
    +  TOMType = "signed", # topological overlap matrix
    +  power = 7, # soft threshold for network construction
    +  numericLabels = TRUE, # Let's use numbers instead of colors for module labels
    +  randomSeed = 1234, # there's some randomness associated with this calculation
    +  # so we should set a seed
    +)
    +

    The TOMType argument specifies what kind of topological overlap matrix (TOM) should be used to make gene modules. For most situations, you can safely assume that a signed network represents what you want – we want WGCNA to pay attention to directionality. However, if you suspect you may benefit from an unsigned network, where positive/negative is ignored, see this article to help you figure that out (Langfelder 2018).

    +

    There are a lot of other settings you can tweak – look at ?blockwiseModules help page as well as the WGCNA tutorial (Langfelder and Horvath 2016).

    4.8 Write main WGCNA results object to file

    We will save our whole results object to an RDS file in case we want to return to our original WGCNA results.

    -
    readr::write_rds(bwnet,
    -  file = file.path("results", "SRP133573_wgcna_results.RDS")
    -)
    +
    readr::write_rds(bwnet,
    +  file = file.path("results", "SRP140558_wgcna_results.RDS")
    +)

    4.9 Explore our WGCNA results

    The bwnet object has many parts, storing a lot of information. We can pull out the parts we are most interested in and may want to use for plotting.

    In bwnet we have a data frame of eigengene module data for each sample in the MEs slot. These represent the collapsed, combined, and normalized expression of the genes that make up each module.

    -
    module_eigengenes <- bwnet$MEs
    -
    -# Print out a preview
    -head(module_eigengenes)
    +
    module_eigengenes <- bwnet$MEs
    +
    +# Print out a preview
    +head(module_eigengenes)
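    For some intuition about what an eigengene is: a module eigengene is commonly described as the first principal component of the scaled expression values of the genes in a module. The sketch below shows that idea on random toy data using base R's `svd()`. It is an illustration under that assumption, not WGCNA's own implementation, so the sign and scaling may differ from the values in `bwnet$MEs`. All object names are made up.

```r
set.seed(1234)

# Toy sample x gene matrix standing in for one module's genes
toy_expr <- matrix(
  rnorm(6 * 4),
  nrow = 6,
  dimnames = list(paste0("sample_", 1:6), paste0("gene_", 1:4))
)

# Center and scale each gene (column) so no single gene dominates
scaled_expr <- scale(toy_expr)

# The first left singular vector gives one summary value per sample
svd_result <- svd(scaled_expr)
toy_eigengene <- svd_result$u[, 1]
names(toy_eigengene) <- rownames(toy_expr)

toy_eigengene
```

    The result has one value per sample, which is why eigengenes can be compared across sample groups in the same way as a single gene's expression.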

    4.10 Which modules have biggest differences across treatment groups?

    We can also see if our eigengenes relate to our metadata labels. First we double check that our samples are still in order.

    -
    all.equal(metadata$refinebio_accession_code, rownames(module_eigengenes))
    +
    all.equal(metadata$refinebio_accession_code, rownames(module_eigengenes))
    ## [1] TRUE
    -
    # Create the design matrix from the refinebio_treatment variable
    -des_mat <- model.matrix(~ metadata$refinebio_treatment)
    +
    # Create the design matrix from the `time_point` variable
    +des_mat <- model.matrix(~ metadata$time_point)

    Run a linear model on each module. limma expects one test per row, so we also need to transpose the data so that the eigengenes are rows.

    -
    # lmFit() needs a transposed version of the matrix
    -fit <- limma::lmFit(t(module_eigengenes), design = des_mat)
    -
    -# Apply empirical Bayes to smooth standard errors
    -fit <- limma::eBayes(fit)
    +
    # lmFit() needs a transposed version of the matrix
    +fit <- limma::lmFit(t(module_eigengenes), design = des_mat)
    +
    +# Apply empirical Bayes to smooth standard errors
    +fit <- limma::eBayes(fit)

    Apply multiple testing correction and obtain stats in a data frame.

    -
    # Apply multiple testing correction and obtain stats
    -stats_df <- limma::topTable(fit, number = ncol(module_eigengenes)) %>%
    -  tibble::rownames_to_column("module")
    +
    # Apply multiple testing correction and obtain stats
    +stats_df <- limma::topTable(fit, number = ncol(module_eigengenes)) %>%
    +  tibble::rownames_to_column("module")
    ## Removing intercept from test coefficients
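    To make the per-module testing idea concrete, here is a simplified base R sketch that fits an ordinary linear model to each module eigengene and then adjusts the p-values with the Benjamini-Hochberg method. It leaves out limma's empirical Bayes moderation, so it will not reproduce the limma statistics, and the toy data and object names are made up for illustration.

```r
set.seed(1234)

# Toy grouping variable: five samples per time point
toy_time_point <- factor(rep(c("acute illness", "recovering"), each = 5))

# Toy eigengenes: ME1 is pure noise, ME2 is shifted in the recovering group
toy_eigengenes <- data.frame(
  ME1 = rnorm(10),
  ME2 = rnorm(10) + ifelse(toy_time_point == "recovering", 3, 0)
)

# Fit one linear model per module and pull out the group effect p-value
p_values <- sapply(toy_eigengenes, function(eigengene) {
  fit <- lm(eigengene ~ toy_time_point)
  summary(fit)$coefficients[2, "Pr(>|t|)"]
})

# Collect the results and apply multiple testing correction
toy_stats <- data.frame(
  module = names(p_values),
  p_value = unname(p_values),
  adj_p_value = unname(p.adjust(p_values, method = "BH"))
)

# Sort with the most significant modules at the top
toy_stats <- toy_stats[order(toy_stats$p_value), ]
toy_stats
```

    With many modules and few samples, limma's moderated statistics are generally the more stable choice; this sketch only shows the shape of the calculation.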

    Let’s take a look at the results. They are sorted with the most significant results at the top.

    -
    head(stats_df)
    +
    head(stats_df)
    -

    Module 52 seems to be the most differentially expressed across refinebio_treatment groups. Now we can do some investigation into this module.

    +

    Module 19 seems to be the most differentially expressed across time_point groups. Now we can do some investigation into this module.

    -
    -

    4.11 Let’s make plot of module 52

    -

    As a sanity check, let’s use ggplot to see what module 52’s eigengene looks like between treatment groups.

    +
    +

    4.11 Let’s make a plot of module 19

    +

    As a sanity check, let’s use ggplot to see what module 19’s eigengene looks like between time point groups.

    First we need to set up the module eigengene for this module with the sample metadata labels we need.

    -
    module_52_df <- module_eigengenes %>%
    -  tibble::rownames_to_column("accession_code") %>%
    -  # Here we are performing an inner join with a subset of metadata
    -  dplyr::inner_join(metadata %>%
    -    dplyr::select(refinebio_accession_code, refinebio_treatment),
    -  by = c("accession_code" = "refinebio_accession_code")
    -  )
    +
    module_19_df <- module_eigengenes %>%
    +  tibble::rownames_to_column("accession_code") %>%
    +  # Here we are performing an inner join with a subset of metadata
    +  dplyr::inner_join(metadata %>%
    +    dplyr::select(refinebio_accession_code, time_point),
    +  by = c("accession_code" = "refinebio_accession_code")
    +  )

    Now we are ready for plotting.

    -
    ggplot(
    -  module_52_df,
    -  aes(
    -    x = refinebio_treatment,
    -    y = ME52,
    -    color = refinebio_treatment
    -  )
    -) +
    -  # a boxplot with outlier points hidden (they will be in the sina plot)
    -  geom_boxplot(width = 0.2, outlier.shape = NA) +
    -  # A sina plot to show all of the individual data points
    -  ggforce::geom_sina(maxwidth = 0.3) +
    -  theme_classic()
    -

    +
    ggplot(
    +  module_19_df,
    +  aes(
    +    x = time_point,
    +    y = ME19,
    +    color = time_point
    +  )
    +) +
    +  # a boxplot with outlier points hidden (they will be in the sina plot)
    +  geom_boxplot(width = 0.2, outlier.shape = NA) +
    +  # A sina plot to show all of the individual data points
    +  ggforce::geom_sina(maxwidth = 0.3) +
    +  theme_classic()
    +

    This makes sense! Looks like module 19 has elevated expression during the acute illness but not when recovering.

    -
    -

    4.12 What genes are a part of module 52?

    +
    +

    4.12 What genes are a part of module 19?

    If you want to know which of your genes make up a module, you can look at the $colors slot. This is a named vector which associates the genes with the module they are a part of. We can turn this into a data frame for handy use.

    -
    gene_module_key <- tibble::enframe(bwnet$colors, name = "gene", value = "module") %>%
    -  # Let's add the `ME` part so its more clear what these numbers are and it matches elsewhere
    -  dplyr::mutate(module = paste0("ME", module))
    -

    Now we can find what genes are a part of module 52.

    -
    gene_module_key %>%
    -  dplyr::filter(module == "ME52")
    +
    gene_module_key <- tibble::enframe(bwnet$colors, name = "gene", value = "module") %>%
    +  # Let's add the `ME` part so its more clear what these numbers are and it matches elsewhere
    +  dplyr::mutate(module = paste0("ME", module))
    +

    Now we can find what genes are a part of module 19.

    +
    gene_module_key %>%
    +  dplyr::filter(module == "ME19")

    Let’s save this gene to module key to a TSV file for future use.

    -
    readr::write_tsv(gene_module_key,
    -  file = file.path("results", "SRP133573_wgcna_gene_to_module.tsv")
    -)
    +
    readr::write_tsv(gene_module_key,
    +  file = file.path("results", "SRP140558_wgcna_gene_to_module.tsv")
    +)

    4.13 Make a custom heatmap function

    We will make a heatmap that summarizes our differentially expressed module. Because we will make a couple of these, it makes sense to make a custom function for making this heatmap.

    -
    make_module_heatmap <- function(module_name,
    -                                expression_mat = normalized_counts,
    -                                metadata_df = metadata,
    -                                gene_module_key_df = gene_module_key,
    -                                module_eigengenes_df = module_eigengenes) {
    -  # Create a summary heatmap of a given module.
    -  #
    -  # Args:
    -  # module_name: a character indicating what module should be plotted, e.g. "ME52"
    -  # expression_mat: The full gene expression matrix. Default is `normalized_counts`.
    -  # metadata_df: a data frame with refinebio_accession_code and refinebio_treatment
    -  #              as columns. Default is `metadata`.
    -  # gene_module_key: a data.frame indicating what genes are a part of what modules. Default is `gene_module_key`.
    -  # module_eigengenes: a sample x eigengene data.frame with samples as rownames. Default is `module_eigengenes`.
    -  #
    -  # Returns:
    -  # A heatmap of expression matrix for a module's genes, with a barplot of the
    -  # eigengene expression for that module.
    -
    -  # Set up the module eigengene with its refinebio_accession_code
    -  module_eigengene <- module_eigengenes_df %>%
    -    dplyr::select(module_name) %>%
    -    tibble::rownames_to_column("refinebio_accession_code")
    -
    -  # Set up column annotation from metadata
    -  col_annot_df <- metadata_df %>%
    -    # Only select the treatment and sample ID columns
    -    dplyr::select(refinebio_accession_code, refinebio_treatment) %>%
    -    # Add on the eigengene expression by joining with sample IDs
    -    dplyr::inner_join(module_eigengene, by = "refinebio_accession_code") %>%
    -    # Arrange by treatment
    -    dplyr::arrange(refinebio_treatment, refinebio_accession_code) %>%
    -    # Store sample
    -    tibble::column_to_rownames("refinebio_accession_code")
    -
    -  # Create the ComplexHeatmap column annotation object
    -  col_annot <- ComplexHeatmap::HeatmapAnnotation(
    -    # Supply treatment labels
    -    refinebio_treatment = col_annot_df$refinebio_treatment,
    -    # Add annotation barplot
    -    module_eigengene = ComplexHeatmap::anno_barplot(dplyr::select(col_annot_df, module_name)),
    -    # Pick colors for each experimental group in refinebio_treatment
    -    col = list(refinebio_treatment = c("post-adt" = "#f1a340", "pre-adt" = "#998ec3"))
    -  )
    -
    -  # Get a vector of the Ensembl gene IDs that correspond to this module
    -  module_genes <- gene_module_key_df %>%
    -    dplyr::filter(module == module_name) %>%
    -    dplyr::pull(gene)
    -
    -  # Set up the gene expression data frame
    -  mod_mat <- expression_mat %>%
    -    t() %>%
    -    as.data.frame() %>%
    -    # Only keep genes from this module
    -    dplyr::filter(rownames(.) %in% module_genes) %>%
    -    # Order the samples to match col_annot_df
    -    dplyr::select(rownames(col_annot_df)) %>%
    -    # Data needs to be a matrix
    -    as.matrix()
    -
    -  # Normalize the gene expression values
    -  mod_mat <- mod_mat %>%
    -    # Scale can work on matrices, but it does it by column so we will need to
    -    # transpose first
    -    t() %>%
    -    scale() %>%
    -    # And now we need to transpose back
    -    t()
    -
    -  # Create a color function based on standardized scale
    -  color_func <- circlize::colorRamp2(
    -    c(-2, 0, 2),
    -    c("#67a9cf", "#f7f7f7", "#ef8a62")
    -  )
    -
    -  # Plot on a heatmap
    -  heatmap <- ComplexHeatmap::Heatmap(mod_mat,
    -    name = module_name,
    -    # Supply color function
    -    col = color_func,
    -    # Supply column annotation
    -    bottom_annotation = col_annot,
    -    # We don't want to cluster samples
    -    cluster_columns = FALSE,
    -    # We don't need to show sample or gene labels
    -    show_row_names = FALSE,
    -    show_column_names = FALSE
    -  )
    -
    -  # Return heatmap
    -  return(heatmap)
    -}
    +
    make_module_heatmap <- function(module_name,
    +                                expression_mat = normalized_counts,
    +                                metadata_df = metadata,
    +                                gene_module_key_df = gene_module_key,
    +                                module_eigengenes_df = module_eigengenes) {
    +  # Create a summary heatmap of a given module.
    +  #
    +  # Args:
    +  # module_name: a character indicating what module should be plotted, e.g. "ME19"
    +  # expression_mat: The full gene expression matrix. Default is `normalized_counts`.
    +  # metadata_df: a data frame with refinebio_accession_code and time_point
    +  #              as columns. Default is `metadata`.
    +  # gene_module_key_df: a data.frame indicating what genes are a part of what modules. Default is `gene_module_key`.
    +  # module_eigengenes_df: a sample x eigengene data.frame with samples as row names. Default is `module_eigengenes`.
    +  #
    +  # Returns:
    +  # A heatmap of expression matrix for a module's genes, with a barplot of the
    +  # eigengene expression for that module.
    +
    +  # Set up the module eigengene with its refinebio_accession_code
    +  module_eigengene <- module_eigengenes_df %>%
    +    dplyr::select(all_of(module_name)) %>%
    +    tibble::rownames_to_column("refinebio_accession_code")
    +
    +  # Set up column annotation from metadata
    +  col_annot_df <- metadata_df %>%
    +    # Only select the sample ID, time point, and subject columns
    +    dplyr::select(refinebio_accession_code, time_point, refinebio_subject) %>%
    +    # Add on the eigengene expression by joining with sample IDs
    +    dplyr::inner_join(module_eigengene, by = "refinebio_accession_code") %>%
    +    # Arrange by time point, then by patient
    +    dplyr::arrange(time_point, refinebio_subject) %>%
    +    # Store sample
    +    tibble::column_to_rownames("refinebio_accession_code")
    +
    +  # Create the ComplexHeatmap column annotation object
    +  col_annot <- ComplexHeatmap::HeatmapAnnotation(
    +    # Supply time point labels
    +    time_point = col_annot_df$time_point,
    +    # Add annotation barplot
    +    module_eigengene = ComplexHeatmap::anno_barplot(dplyr::select(col_annot_df, module_name)),
    +    # Pick colors for each experimental group in time_point
    +    col = list(time_point = c("recovering" = "#f1a340", "acute illness" = "#998ec3"))
    +  )
    +
    +  # Get a vector of the Ensembl gene IDs that correspond to this module
    +  module_genes <- gene_module_key_df %>%
    +    dplyr::filter(module == module_name) %>%
    +    dplyr::pull(gene)
    +
    +  # Set up the gene expression data frame
    +  mod_mat <- expression_mat %>%
    +    t() %>%
    +    as.data.frame() %>%
    +    # Only keep genes from this module
    +    dplyr::filter(rownames(.) %in% module_genes) %>%
    +    # Order the samples to match col_annot_df
    +    dplyr::select(rownames(col_annot_df)) %>%
    +    # Data needs to be a matrix
    +    as.matrix()
    +
    +  # Normalize the gene expression values
    +  mod_mat <- mod_mat %>%
    +    # Scale can work on matrices, but it does it by column so we will need to
    +    # transpose first
    +    t() %>%
    +    scale() %>%
    +    # And now we need to transpose back
    +    t()
    +
    +  # Create a color function based on standardized scale
    +  color_func <- circlize::colorRamp2(
    +    c(-2, 0, 2),
    +    c("#67a9cf", "#f7f7f7", "#ef8a62")
    +  )
    +
    +  # Plot on a heatmap
    +  heatmap <- ComplexHeatmap::Heatmap(mod_mat,
    +    name = module_name,
    +    # Supply color function
    +    col = color_func,
    +    # Supply column annotation
    +    bottom_annotation = col_annot,
    +    # We don't want to cluster samples
    +    cluster_columns = FALSE,
    +    # We don't need to show sample or gene labels
    +    show_row_names = FALSE,
    +    show_column_names = FALSE
    +  )
    +
    +  # Return heatmap
    +  return(heatmap)
    +}
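The row-wise normalization step inside the function can be checked on its own. This is a minimal sketch on a made-up gene x sample matrix (the gene and sample names are hypothetical): `scale()` standardizes columns, so transposing before and after standardizes each gene across samples.

```r
# Hypothetical 2-gene x 3-sample matrix
toy_mat <- matrix(
  c(1, 2, 3,
    10, 20, 30),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("gene_a", "gene_b"), c("s1", "s2", "s3"))
)

# scale() works column-wise, so transpose, scale, and transpose back
scaled_mat <- t(scale(t(toy_mat)))

# After scaling, each gene (row) has mean 0 and standard deviation 1
row_means <- rowMeans(scaled_mat)
row_sds <- apply(scaled_mat, 1, sd)
```

Because every gene ends up on the same standardized scale, a single color function such as the `c(-2, 0, 2)` one used above can be applied across all genes in the module.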

    4.14 Make module heatmaps

    -

    Let’s try out the custom heatmap function with module 52 (our most differentially expressed module).

    -
    mod_52_heatmap <- make_module_heatmap(module_name = "ME52")
    +

    Let’s try out the custom heatmap function with module 19 (our most differentially expressed module).

    +
    mod_19_heatmap <- make_module_heatmap(module_name = "ME19")
    ## Note: Using an external vector in selections is ambiguous.
     ## ℹ Use `all_of(module_name)` instead of `module_name` to silence this message.
     ## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
     ## This message is displayed once per session.
    -
    # Print out the plot
    -mod_19_heatmap
    +
    # Print out the plot
    +mod_19_heatmap
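The note about external vectors comes from passing a character variable directly to `dplyr::select()`. A small sketch of why this is ambiguous, using a made-up data frame that happens to contain a column with the same name as the variable (all names here are hypothetical):

```r
# Hypothetical data frame with a column literally called `module_name`
toy_df <- data.frame(
  ME19 = c(0.2, -0.1),
  module_name = c("a", "b")
)

# External variable holding the column we actually want
module_name <- "ME19"

# Bare name: the column called `module_name` is selected
bare_cols <- names(dplyr::select(toy_df, module_name))

# all_of(): the value of the external variable is used instead
all_of_cols <- names(dplyr::select(toy_df, dplyr::all_of(module_name)))
```

Wrapping the variable in `all_of()` makes the intent explicit, which is what the message recommends.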

    From the barplot portion of our plot, we can see acute illness samples tend to have higher expression values for the module 19 eigengene. In the heatmap portion, we can see that the individual genes that make up module 19 have overall higher expression in the acute illness samples than in the recovering samples.

    We can save this plot to PNG.

    -
    png(file.path("results", "SRP140558_module_19_heatmap.png"))
    -mod_19_heatmap
    -dev.off()
    +
    png(file.path("results", "SRP140558_module_19_heatmap.png"))
    +mod_19_heatmap
    +dev.off()
    ## png 
     ##   2

    For comparison, let’s try out the custom heatmap function with a different, not differentially expressed module.

    -
    mod_10_heatmap <- make_module_heatmap(module_name = "ME10")
    -
    -# Print out the plot
    -mod_25_heatmap
    +
    mod_25_heatmap <- make_module_heatmap(module_name = "ME25")
    +
    +# Print out the plot
    +mod_25_heatmap

    In this non-significant module’s heatmap, there’s not a particularly strong pattern between acute illness and recovery samples, though we can still see that the genes in this module seem to be very correlated with each other (which is how we found them in the first place, so this makes sense!).

    Save this plot also.

    -
    png(file.path("results", "SRP140558_module_25_heatmap.png"))
    -mod_25_heatmap
    -dev.off()
    +
    png(file.path("results", "SRP140558_module_25_heatmap.png"))
    +mod_25_heatmap
    +dev.off()
    ## png 
     ##   2
    @@ -4409,18 +4349,18 @@

    5 Resources for further learning

    6 Session info

    At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.
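If you also want a copy of this record outside the notebook, one option is to write it to a text file. This is a minimal sketch that assumes base R's `sessionInfo()` is an acceptable stand-in for `sessioninfo::session_info()`; the file name and location are hypothetical.

```r
# Hypothetical output location; a temporary directory is used for this sketch
session_file <- file.path(tempdir(), "session_info.txt")

# Capture the printed session info and write it to the file
session_lines <- capture.output(utils::sessionInfo())
writeLines(session_lines, session_file)
```

In a real analysis you would likely point `session_file` at your results directory rather than a temporary folder.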

    -
    # Print session info
    -sessioninfo::session_info()
    -
    ## ─ Session info ───────────────────────────────────────────────────────────────
    +
    # Print session info
    +sessioninfo::session_info()
    +
    ## ─ Session info ─────────────────────────────────────────────────────
     ##  setting  value                       
     ##  version  R version 4.0.2 (2020-06-22)
     ##  os       Ubuntu 20.04 LTS            
    @@ -4430,9 +4370,9 @@ 

     ## collate en_US.UTF-8
     ## ctype en_US.UTF-8
     ## tz Etc/UTC
    -## date 2020-11-30
    +## date 2020-12-16
     ##
    -## ─ Packages ───────────────────────────────────────────────────────────────────
    +## ─ Packages ─────────────────────────────────────────────────────────
     ## package * version date lib source
     ## annotate 1.68.0 2020-10-27 [1] Bioconductor
     ## AnnotationDbi 1.52.0 2020-10-27 [1] Bioconductor
    @@ -4475,8 +4415,8 @@

     ## genefilter 1.72.0 2020-10-27 [1] Bioconductor
     ## geneplotter 1.68.0 2020-10-27 [1] Bioconductor
     ## generics 0.0.2 2018-11-29 [1] RSPM (R 4.0.0)
    -## GenomeInfoDb * 1.26.1 2020-11-20 [1] Bioconductor
    -## GenomeInfoDbData 1.2.4 2020-11-25 [1] Bioconductor
    +## GenomeInfoDb * 1.26.2 2020-12-08 [1] Bioconductor
    +## GenomeInfoDbData 1.2.4 2020-12-16 [1] Bioconductor
     ## GenomicRanges * 1.42.0 2020-10-27 [1] Bioconductor
     ## getopt 1.20.3 2019-03-22 [1] RSPM (R 4.0.0)
     ## GetoptLong 1.0.3 2020-10-01 [1] RSPM (R 4.0.2)
    @@ -4484,7 +4424,7 @@

     ## ggplot2 * 3.3.2 2020-06-19 [1] RSPM (R 4.0.1)
     ## GlobalOptions 0.1.2 2020-06-10 [1] RSPM (R 4.0.0)
     ## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.2)
    -## GO.db 3.12.1 2020-11-25 [1] Bioconductor
    +## GO.db 3.12.1 2020-12-16 [1] Bioconductor
     ## gridExtra 2.3 2017-09-09 [1] RSPM (R 4.0.0)
     ## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.0.0)
     ## Hmisc 4.4-1 2020-08-10 [1] RSPM (R 4.0.2)
    @@ -4494,7 +4434,7 @@

     ## htmlwidgets 1.5.2 2020-10-03 [1] RSPM (R 4.0.2)
     ## httr 1.4.2 2020-07-20 [1] RSPM (R 4.0.2)
     ## impute 1.64.0 2020-10-27 [1] Bioconductor
    -## IRanges * 2.24.0 2020-10-27 [1] Bioconductor
    +## IRanges * 2.24.1 2020-12-12 [1] Bioconductor
     ## iterators 1.0.12 2019-07-26 [1] RSPM (R 4.0.0)
     ## jpeg 0.1-8.1 2019-10-24 [1] RSPM (R 4.0.0)
     ## jsonlite 1.7.1 2020-09-07 [1] RSPM (R 4.0.2)
    @@ -4538,7 +4478,7 @@

     ## rpart 4.1-15 2019-04-12 [2] CRAN (R 4.0.2)
     ## RSQLite 2.2.1 2020-09-30 [1] RSPM (R 4.0.2)
     ## rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0)
    -## S4Vectors * 0.28.0 2020-10-27 [1] Bioconductor
    +## S4Vectors * 0.28.1 2020-12-09 [1] Bioconductor
     ## scales 1.1.1 2020-05-11 [1] RSPM (R 4.0.0)
     ## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.0)
     ## shape 1.4.5 2020-09-13 [1] RSPM (R 4.0.2)
    @@ -4560,67 +4500,8 @@

     ## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.0)
     ## zlibbioc 1.36.0 2020-10-27 [1] Bioconductor
     ##
    -## locale:
    -## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
    -## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
    -## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
    -## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
    -## [9] LC_ADDRESS=C LC_TELEPHONE=C
    -## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
    -##
    -## attached base packages:
    -## [1] parallel stats4 stats graphics grDevices utils datasets
    -## [8] methods base
    -##
    -## other attached packages:
    -## [1] ggplot2_3.3.2 WGCNA_1.69
    -## [3] fastcluster_1.1.25 dynamicTreeCut_1.63-1
    -## [5] magrittr_1.5 DESeq2_1.30.0
    -## [7] SummarizedExperiment_1.20.0 Biobase_2.50.0
    -## [9] MatrixGenerics_1.2.0 matrixStats_0.57.0
    -## [11] GenomicRanges_1.42.0 GenomeInfoDb_1.26.0
    -## [13] IRanges_2.24.0 S4Vectors_0.28.0
    -## [15] BiocGenerics_0.36.0 optparse_1.6.6
    -##
    -## loaded via a namespace (and not attached):
    -## [1] colorspace_1.4-1 rjson_0.2.20 ellipsis_0.3.1
    -## [4] circlize_0.4.10 htmlTable_2.1.0 XVector_0.30.0
    -## [7] GlobalOptions_0.1.2 base64enc_0.1-3 clue_0.3-57
    -## [10] rstudioapi_0.11 farver_2.0.3 getopt_1.20.3
    -## [13] bit64_4.0.5 AnnotationDbi_1.52.0 fansi_0.4.1
    -## [16] codetools_0.2-16 splines_4.0.2 R.methodsS3_1.8.1
    -## [19] doParallel_1.0.15 impute_1.64.0 geneplotter_1.68.0
    -## [22] knitr_1.30 polyclip_1.10-0 jsonlite_1.7.1
    -## [25] Formula_1.2-3 Cairo_1.5-12.2 annotate_1.68.0
    -## [28] cluster_2.1.0 GO.db_3.12.1 png_0.1-7
    -## [31] R.oo_1.24.0 ggforce_0.3.2 readr_1.4.0
    -## [34] compiler_4.0.2 httr_1.4.2 backports_1.1.10
    -## [37] assertthat_0.2.1 Matrix_1.2-18 limma_3.46.0
    -## [40] cli_2.1.0 tweenr_1.0.1 htmltools_0.5.0
    -## [43] tools_4.0.2 gtable_0.3.0 glue_1.4.2
    -## [46] GenomeInfoDbData_1.2.4 dplyr_1.0.2 Rcpp_1.0.5
    -## [49] styler_1.3.2 vctrs_0.3.4 preprocessCore_1.52.0
    -## [52] iterators_1.0.12 xfun_0.18 stringr_1.4.0
    -## [55] ps_1.4.0 lifecycle_0.2.0 XML_3.99-0.5
    -## [58] MASS_7.3-51.6 zlibbioc_1.36.0 scales_1.1.1
    -## [61] hms_0.5.3 rematch2_2.1.2 RColorBrewer_1.1-2
    -## [64] ComplexHeatmap_2.6.0 yaml_2.2.1 memoise_1.1.0
    -## [67] gridExtra_2.3 rpart_4.1-15 latticeExtra_0.6-29
    -## [70] stringi_1.5.3 RSQLite_2.2.1 genefilter_1.72.0
    -## [73] foreach_1.5.0 checkmate_2.0.0 BiocParallel_1.24.1
    -## [76] shape_1.4.5 rlang_0.4.8 pkgconfig_2.0.3
    -## [79] bitops_1.0-6 evaluate_0.14 lattice_0.20-41
    -## [82] purrr_0.3.4 htmlwidgets_1.5.2 labeling_0.3
    -## [85] bit_4.0.4 tidyselect_1.1.0 R6_2.4.1
    -## [88] magick_2.4.0 generics_0.0.2 Hmisc_4.4-1
    -## [91] DelayedArray_0.16.0 DBI_1.1.0 pillar_1.4.6
    -## [94] foreign_0.8-80 withr_2.3.0 survival_3.1-12
    -## [97] RCurl_1.98-1.2 nnet_7.3-14 tibble_3.0.4
    -## [100] crayon_1.3.4 rmarkdown_2.4 GetoptLong_1.0.3
    -## [103] jpeg_0.1-8.1 locfit_1.5-9.4 grid_4.0.2
    -## [106] data.table_1.13.0 blob_1.2.1 digest_0.6.25
    -## [109] xtable_1.8-4 R.cache_0.14.0 R.utils_2.10.1
    -## [112] munsell_0.5.0
    +## [1] /usr/local/lib/R/site-library
    +## [2] /usr/local/lib/R/library

    References

    diff --git a/04-advanced-topics/validate_differential_expression_adv_topics_00_author_de.Rmd b/04-advanced-topics/validate_differential_expression_adv_topics_00_author_de.Rmd index 70a9955a..49532f04 100644 --- a/04-advanced-topics/validate_differential_expression_adv_topics_00_author_de.Rmd +++ b/04-advanced-topics/validate_differential_expression_adv_topics_00_author_de.Rmd @@ -36,9 +36,9 @@ if (!("GEOquery" %in% installed.packages())) { Attach the `limma` library: -```{r} -# Magrittr pipe -`%>%` <- dplyr::`%>%` +```{r message=FALSE} +# We will need this so we can use the pipe: %>% +library(magrittr) # Attach library library(limma) @@ -64,6 +64,7 @@ if (!dir.exists(plots_dir)) { } # Define the file path to the data directory +# Replace with the path of the folder the files will be in data_dir <- "data" # Replace with path to data directory # Make a data directory if it isn't created yet diff --git a/04-advanced-topics/validate_differential_expression_adv_topics_01.Rmd b/04-advanced-topics/validate_differential_expression_adv_topics_01.Rmd index d726a289..5e477e8b 100644 --- a/04-advanced-topics/validate_differential_expression_adv_topics_01.Rmd +++ b/04-advanced-topics/validate_differential_expression_adv_topics_01.Rmd @@ -81,9 +81,9 @@ if (!("VennDiagram" %in% installed.packages())) { Attach the `limma` library: -```{r} -# Magrittr pipe -`%>%` <- dplyr::`%>%` +```{r message=FALSE} +# We will need this so we can use the pipe: %>% +library(magrittr) # Attach library library(limma) @@ -109,6 +109,7 @@ if (!dir.exists(plots_dir)) { } # Define the file path to the data directory +# Replace with the path of the folder the files will be in data_dir <- "data" # Replace with path to data directory ``` diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 9f950be9..606bf8bf 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -8,7 +8,9 @@ - [Setting up the docker container](#setting-up-the-docker-container) - [Docker image updates](#docker-image-updates) - [Download 
the datasets](#download-the-datasets) -- [Add a new analysis](#add-a-new-analysis) +- [Adding a new analysis](#adding-a-new-analysis) + - [Draft PR: Big picture reviews](#draft-pr-big-picture-reviews) + - [Refined PRs: Detailed reviews](#refined-prs-detailed-reviews) - [Setting up a new analysis file](#setting-up-a-new-analysis-file) - [How to use the template.Rmd](#how-to-use-the-templatermd) - [Adding datasets to the S3 bucket](#adding-datasets-to-the-s3-bucket) @@ -89,18 +91,35 @@ For development purposes, you can download all the datasets for the example note scripts/download-data.sh ``` -## Add a new analysis +## Adding a new analysis -Here are the summarized steps for adding a new analysis. -Click on the links to go to the detailed instructions for each step. +Our PR process for adding a new analysis involves two stages which are 2-3 (or more) PRs. +This splitting up an analysis into multiple PRs helps make the review process more manageable. +This process ensures that discussions around the big picture: conceptual decisions, what steps are included, which packages are used, are generally concluded before review moves on to the details and further polishing. +Note that all the following steps describe PRs to `staging` branch only (see more [about the branch set up](#pull-requests)). + +### Draft PR: Big picture reviews - On your new git branch, [set up the analysis file from the template](#setting-up-the-analysis-file). -- Add a [link to the html file to `_navbar.html`](#add-new-analyses-to-the-navbar) +- Get the basic steps for the analysis set up and create a draft PR for a big picture review (Not all descriptions need to be 100% word-smithed, but the general steps/outline should be reflected). +- Try to highlight things that encapsulate main concepts as ready for review using a `**REVIEW**` tag and/or use a `**DRAFT**` tag to indicate a section that hasn't really been worked on much yet. 
+- After the general outline of the analysis has been agreed upon through a reviewing process, incorporate the major feedback from the draft PR process before you split off new branches to file your refined PRs. +- Keep the original draft PR open for easy reference. + +### Refined PRs: Detailed reviews + +Break up the steps of the analysis into manageable review chunks on their own branches for detailed review (you may want to discuss what the chunks should be on the Draft PR). +- Delete any `**REVIEW/DRAFT**` tags leftover from the draft PR. +- Make sure each steps' explanations are fully realized for these PRs. +- Ensure that the notebook adheres to [the guidelines](#guidelines-for-analysis-notebooks). - [Cite sources and add them to the reference.bib file](#citing-sources-in-text) +- If the file has been [added to snakemake](#add-new-analyses-to-the-snakefile), in the Docker container, run [snakemake for rendering](#how-to-re-render-the-notebooks-locally) to make sure it runs. + +These steps should be done in the first refined PR, but don't need to be done again: +- Add a [link to the html file to `_navbar.html`](#add-new-analyses-to-the-navbar) - Add [data and metadata files to S3](#adding-datasets-to-the-s3-bucket) - Add not yet added packages needed for this analysis to the Dockerfile (make sure it successfully builds). - Add the [expected output html file to snakemake](#add-new-analyses-to-the-snakefile) -- In the Docker container, run [snakemake for rendering](#how-to-re-render-the-notebooks-locally) ### Setting up a new analysis file @@ -479,12 +498,15 @@ Hopefully the error message helps you track down the problem, but you can also c ### About the render-notebooks.R script The `render-notebooks.R` script adds a `bibliography:` specification in the `.Rmd` header so all citations are automatically rendered. +A file with other R code to include can also be specified, which should be used to set options for rendering, such as the output width. 
+No code that affects the computational behavior of the notebook should be included here, as it will be sourced in a hidden chunk and not visible to users. It also adds other components like CSS styling, a footer, and Google Analytics (these items are all hard-coded into the script). **Options:** - `--rmd`: provided by snakemake, the input `.Rmd` file to render. - `--bib_file`: File path for the `bibliography:` header option. -Default is the `references.bib` in the `components` folder. +- `--cite_style`: File path for a CSL file to control citation style +- `--include_file`: File path for code to be sourced at the start of the notebook but hidden from rendering. - `--html`: Default is to save the output `.html` file the same name as the input `.Rmd` file. This option allows you to specify an output file name. Default is used by snakemake. ### Add new analyses to the Snakefile diff --git a/Snakefile b/Snakefile index cb3705c3..cd92aa49 100644 --- a/Snakefile +++ b/Snakefile @@ -8,7 +8,9 @@ rule target: "02-microarray/dimension-reduction_microarray_01_pca.html", "02-microarray/dimension-reduction_microarray_02_umap.html", "02-microarray/gene-id-annotation_microarray_01_ensembl.html", - "02-microarray/pathway-analysis_microarray_02_ora.html", + "02-microarray/pathway-analysis_microarray_01_ora.html", + "02-microarray/pathway-analysis_microarray_02_gsea.html", + "02-microarray/pathway-analysis_microarray_03_gsva.html", "02-microarray/ortholog-mapping_microarray_01_ensembl.html", "03-rnaseq/00-intro-to-rnaseq.html", "03-rnaseq/clustering_rnaseq_01_heatmap.html", @@ -17,6 +19,9 @@ rule target: "03-rnaseq/dimension-reduction_rnaseq_02_umap.html", "03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.html", "03-rnaseq/ortholog-mapping_rnaseq_01_ensembl.html", + "03-rnaseq/pathway-analysis_rnaseq_01_ora.html", + "03-rnaseq/pathway-analysis_rnaseq_02_gsea.html", + "03-rnaseq/pathway-analysis_rnaseq_03_gsva.html", "04-advanced-topics/00-intro-to-advanced-topics.html", 
"04-advanced-topics/network-analysis_rnaseq_01_wgcna.html" @@ -31,5 +36,6 @@ rule render_citations: " --rmd {input.rmd}" " --bib_file components/references.bib" " --cite_style components/genetics.csl" + " --include_file components/include.R" " --html {output}" " --style" diff --git a/components/_navbar.html b/components/_navbar.html index a4cfcbdd..179c8365 100644 --- a/components/_navbar.html +++ b/components/_navbar.html @@ -7,7 +7,7 @@ - +