diff --git a/.github/workflows/docker-build-push.yml b/.github/workflows/docker-build-push.yml index b3832343..e334fa8f 100644 --- a/.github/workflows/docker-build-push.yml +++ b/.github/workflows/docker-build-push.yml @@ -91,3 +91,14 @@ jobs: git add -A git commit -m 'Render html and publish' || echo "No changes to commit" git push origin gh-pages || echo "No changes to push" + + # If we have a failure, Slack us + - name: Report failure to Slack + if: always() + uses: ravsamhq/notify-slack-action@v1.1 + with: + status: ${{ job.status }} + notify_when: 'failure' + env: + SLACK_WEBHOOK_URL: ${{ secrets.ACTION_MONITORING_SLACK }} + SLACK_MESSAGE: 'Build, Render, and Push failed' diff --git a/.github/workflows/docker-build.yml b/.github/workflows/docker-build.yml index aa065d43..b2a84002 100644 --- a/.github/workflows/docker-build.yml +++ b/.github/workflows/docker-build.yml @@ -42,3 +42,14 @@ jobs: tags: ccdl/refinebio-examples:latest cache-from: type=local,src=/tmp/.buildx-cache cache-to: type=local,dest=/tmp/.buildx-cache + + # If we have a failure, Slack us + - name: Report failure to Slack + if: always() + uses: ravsamhq/notify-slack-action@v1.1 + with: + status: ${{ job.status }} + notify_when: 'failure' + env: + SLACK_WEBHOOK_URL: ${{ secrets.ACTION_MONITORING_SLACK }} + SLACK_MESSAGE: 'Build Docker failed' diff --git a/.gitignore b/.gitignore index c9721416..66921de5 100644 --- a/.gitignore +++ b/.gitignore @@ -10,6 +10,7 @@ _site */plots/* */results/* */data/* +*/gene_sets/* # markdown spellcheck .spelling diff --git a/01-getting-started/getting-started.html b/01-getting-started/getting-started.html index 8e68ebbf..e72ead1b 100644 --- a/01-getting-started/getting-started.html +++ b/01-getting-started/getting-started.html @@ -1263,25 +1263,22 @@ }; - - + + + + - - + @@ -2865,15 +3680,20 @@ @@ -3004,7 +3833,7 @@
We use R Markdown throughout this tutorial. R Markdown documents are helpful for scientific code by allowing you to keep detailed notes, code, and output in one place.
When you execute code within the notebook, the results appear beneath the code. Try executing this chunk by clicking the Run button within the chunk, or by placing your cursor inside it and pressing Cmd+Shift+Enter (Ctrl+Shift+Enter on Windows and Linux).
-print("The output from the code in this chunk will print below!")
+
## [1] "The output from the code in this chunk will print below!"
R Markdown documents also have the added benefit of producing HTML file output that is nicely rendered and easy to read. Saving one of our R Markdowns (the files that end in .Rmd
) on your computer will create an HTML file containing the code and output, saved alongside it (this file will end in .nb.html
).
See this guide to using R Notebooks for more information about inserting and executing code chunks.
@@ -3045,23 +3874,28 @@Huber W., V. J. Carey, R. Gentleman, S. Anders, and M. Carlson et al., 2015 Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods 12: 115–121.
+Huber W., V. J. Carey, R. Gentleman, S. Anders, and M. Carlson et al., 2015 Orchestrating high-throughput genomic analysis with Bioconductor. Nature Methods 12: 115–121. https://doi.org/10.1038/nmeth.3252
R Core Team, 2019 R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
+R Core Team, 2019 R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org
RStudio Team, 2020 RStudio: Integrated development environment for r. RStudio, PBC., Boston, MA.
+RStudio Team, 2020 RStudio: Integrated development environment for R. RStudio, PBC., Boston, MA. http://www.rstudio.com/
Wickham H., M. Averick, J. Bryan, W. Chang, and L. D. McGowan et al., 2019 Welcome to the tidyverse. Journal of Open Source Software 4: 1686. https://doi.org/10.21105/joss.01686
Wickham H., J. Hester, and W. Chang, 2020 Devtools: Tools to make developing r packages easier.
+Wickham H., J. Hester, and W. Chang, 2020 devtools: Tools to make developing R packages easier. https://CRAN.R-project.org/package=devtools
.Rmd
Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!
If you have trouble running this chunk, see our introduction to using .Rmd
s for more resources and explanations.
# Create the data folder if it doesn't exist
-if (!dir.exists("data")) {
- dir.create("data")
-}
-
-# Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
-
-# Create the plots folder if it doesn't exist
-if (!dir.exists(plots_dir)) {
- dir.create(plots_dir)
-}
-
-# Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
-
-# Create the results folder if it doesn't exist
-if (!dir.exists(results_dir)) {
- dir.create(results_dir)
-}
+# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+ dir.create("data")
+}
+
+# Define the path to the directory where plots will be saved
+plots_dir <- "plots"
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+ dir.create(plots_dir)
+}
+
+# Define the path to the results directory
+results_dir <- "results"
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+ dir.create(results_dir)
+}
In the same place you put this .Rmd
file, you should now have three new empty folders called data
, plots
, and results
!
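The three `if (!dir.exists(...))` checks above can also be collapsed into a loop; a minimal sketch with the same behavior, using base R only:

```r
# A compact alternative sketch of the folder setup above.
# `showWarnings = FALSE` silences the warning when a folder already exists,
# and `recursive = TRUE` also creates any missing parent directories.
for (dir in c("data", "plots", "results")) {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
}
```

Either form is safe to re-run: folders that already exist are left untouched.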
In order for our example here to run without a hitch, we need these files to be in these locations, so we’ve constructed a test to check for them before we get started with the analysis. These chunks will declare your file paths and double-check that your files are in the right place.
First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.
-# Define the file path to the data directory
-data_dir <- file.path("data", "GSE24862") # Replace with accession number which will be the name of the folder the files will be in
-
-# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
-data_file <- file.path(data_dir, "GSE24862.tsv") # Replace with file path to your dataset
-
-# Declare the file path to the metadata file using the data directory saved as `data_dir`
-metadata_file <- file.path(data_dir, "metadata_GSE24862.tsv") # Replace with file path to your metadata
+# Define the file path to the data directory
+# Replace with the path of the folder the files will be in
+data_dir <- file.path("data", "GSE24862")
+
+# Declare the file path to the gene expression matrix file
+# inside directory saved as `data_dir`
+# Replace with the path to your dataset file
+data_file <- file.path(data_dir, "GSE24862.tsv")
+
+# Declare the file path to the metadata file
+# inside the directory saved as `data_dir`
+# Replace with the path to your metadata file
+metadata_file <- file.path(data_dir, "metadata_GSE24862.tsv")
Now that our file paths are declared, we can use the file.exists()
function to check that the files are where we specified above.
# Check if the gene expression matrix file is at the file path stored in `data_file`
-file.exists(data_file)
+# Check if the gene expression matrix file is at the path stored in `data_file`
+file.exists(data_file)
## [1] TRUE
-# Check if the metadata file is at the file path stored in `metadata_file`
-file.exists(metadata_file)
+# Check if the metadata file is at the file path stored in `metadata_file`
+file.exists(metadata_file)
## [1] TRUE
If the chunk above printed out FALSE
to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.
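If you would rather have the notebook stop immediately than print `FALSE`, you can wrap the check in a small helper; a sketch (the `check_input_file()` name is our own, not part of the tutorial):

```r
# Hypothetical helper (not from the tutorial): stop with an informative
# error when an expected input file is missing, instead of printing FALSE.
check_input_file <- function(path) {
  if (!file.exists(path)) {
    stop("Expected input file not found: ", path)
  }
  invisible(TRUE)
}

# Example usage: check_input_file(data_file)
```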
@@ -3071,25 +3902,26 @@See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.
-In this analysis, we will be using the R package pheatmap
for clustering and creating a heatmap (Slowikowski 2017).
if (!("pheatmap" %in% installed.packages())) {
- # Install pheatmap
- install.packages("pheatmap", update = FALSE)
-}
+In this analysis, we will be using the R package pheatmap
for clustering and creating a heatmap (Slowikowski 2017).
if (!("pheatmap" %in% installed.packages())) {
+ # Install pheatmap
+ install.packages("pheatmap", update = FALSE)
+}
Attach the pheatmap
library:
# Attach the `pheatmap` library
-library(pheatmap)
-
-# We will need this so we can use the pipe: %>%
-library(magrittr)
+
Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read in both TSV files and add them as data frames to your environment.
We stored our file paths as objects named metadata_file
and data_file
in this previous step.
# Read in metadata TSV file
-metadata <- readr::read_tsv(metadata_file)
-## Parsed with column specification:
+
+##
+## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_character(),
## refinebio_age = col_logical(),
@@ -3107,82 +3939,83 @@ 4.2 Import and set up data
## `contact_zip/postal_code` = col_double(),
## data_row_count = col_double(),
## taxid_ch1 = col_double()
-## )
-## See spec(...) for full column specifications.
-# Read in data TSV file
-df <- readr::read_tsv(data_file) %>%
- # Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later
- tibble::column_to_rownames("Gene")
-## Parsed with column specification:
+## )
+## ℹ Use `spec()` for the full column specifications.
+# Read in data TSV file
+df <- readr::read_tsv(data_file) %>%
+ # Here we are going to store the gene IDs as row names so that
+ # we have only numeric values to perform calculations on later
+ tibble::column_to_rownames("Gene")
+##
+## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_double(),
## Gene = col_character()
## )
-## See spec(...) for full column specifications.
+## ℹ Use `spec()` for the full column specifications.
Let’s take a look at the metadata object that we read into the R environment.
-head(metadata)
+
Now let’s ensure that the metadata and data are in the same sample order.
-# Make the data in the order of the metadata
-df <- df %>% dplyr::select(metadata$refinebio_accession_code)
-
-# Check if this is in the same order
-all.equal(colnames(df), metadata$refinebio_accession_code)
+# Make the data in the order of the metadata
+df <- df %>% dplyr::select(metadata$refinebio_accession_code)
+
+# Check if this is in the same order
+all.equal(colnames(df), metadata$refinebio_accession_code)
## [1] TRUE
Now we are going to use a combination of functions from base R and the pheatmap
package to look at how our samples and genes are clustering.
Although you may want to create a heatmap including all of the genes in the set, alternatively, the heatmap could be created using only genes of interest. For this example, we will sort genes by variance, but there are many alternative criterion by which you may want to sort your genes e.g. fold change, t-statistic, membership to a particular gene ontology, so on.
-# Calculate the variance for each gene
-variances <- apply(df, 1, var)
-
-# Determine the upper quartile variance cutoff value
-upper_var <- quantile(variances, 0.75)
-
-# Subset the data choosing only genes whose variances are in the upper quartile
-df_by_var <- data.frame(df) %>%
- dplyr::filter(variances > upper_var)
+Although you may want to create a heatmap including all of the genes in the dataset, this can produce a very large image that is hard to interpret. Alternatively, the heatmap could be created using only genes of interest. For this example, we will sort genes by variance and select genes in the upper quartile, but there are many alternative criteria by which you may want to sort your genes, e.g. fold change, t-statistic, membership in a particular gene ontology, and so on.
+# Calculate the variance for each gene
+variances <- apply(df, 1, var)
+
+# Determine the upper quartile variance cutoff value
+upper_var <- quantile(variances, 0.75)
+
+# Filter the data choosing only genes whose variances are in the upper quartile
+df_by_var <- data.frame(df) %>%
+ dplyr::filter(variances > upper_var)
To further customize the heatmap, see a vignette for a guide at this link (Slowikowski 2017).
-# Create and store the heatmap object
-heatmap <-
- pheatmap(
- df_by_var,
- cluster_rows = TRUE, # We want to cluster the heatmap by rows (genes in this case)
- cluster_cols = TRUE, # We also want to cluster the heatmap by columns (samples in this case),
- show_rownames = FALSE, # We don't want to show the rownames because there are too many genes for the labels to be clearly seen
- main = "Non-Annotated Heatmap",
- colorRampPalette(c(
- "deepskyblue",
- "black",
- "yellow"
- ))(25),
- scale = "row" # Scale values in the direction of genes (rows)
- )
+To further customize the heatmap, see a vignette for a guide at this link (Slowikowski 2017).
+# Create and store the heatmap object
+heatmap <- pheatmap(
+ df_by_var,
+ cluster_rows = TRUE, # Cluster the rows of the heatmap (genes in this case)
+ cluster_cols = TRUE, # Cluster the columns of the heatmap (samples)
+ show_rownames = FALSE, # There are too many genes to clearly show the labels
+ main = "Non-Annotated Heatmap",
+ colorRampPalette(c(
+ "deepskyblue",
+ "black",
+ "yellow"
+ ))(25),
+ scale = "row" # Scale values in the direction of genes (rows)
+)
We’ve created a heatmap but although our genes and samples are clustered, there is not much information that we can gather here because we did not provide the pheatmap()
function with annotation labels for our samples.
First let’s save our clustered heatmap.
You can easily switch this to save to a JPEG or TIFF by changing the function and file name within the function to the respective file suffix.
-# Open a PNG file
-png(file.path(
- plots_dir,
- "GSE24862_heatmap_non_annotated.png" # Replace file name with a relevant output plot name
-))
-
-# Print your heatmap
-heatmap
-
-# Close the PNG file:
-dev.off()
+# Open a PNG file
+png(file.path(
+ plots_dir,
+ "GSE24862_heatmap_non_annotated.png" # Replace with a relevant file name
+))
+
+# Print your heatmap
+heatmap
+
+# Close the PNG file:
+dev.off()
## png
## 2
Now, let’s add some annotation bars to our heatmap.
@@ -3190,44 +4023,47 @@From the accompanying paper, we know that three PLX4032-sensitive parental cell lines (M229, M238 and M249) and three derived PLX4032-resistant (r) sub-lines (M229_r5, M238_r1, and M249_r4) were treated or not treated with the RAF-selective inhibitor, PLX4032 (Nazarian et al. 2010). We are going to annotate our heatmap with the variables that hold the refinebio_cell_line
and refinebio_treatment
data. We are also going to create a new column variable from our existing metadata called cell_line_type
, that will distinguish whether the refinebio_cell_line
is parental or resistant – since this is also a key aspect of the experimental design. Note that this step is very specific to our metadata, you may find that you also need to tailor the metadata for your own needs.
# Let's prepare an annotation data frame for plotting
-annotation_df <- metadata %>%
- # We want to select the variables that we want for annotating the heatmap
- dplyr::select(
- refinebio_accession_code,
- refinebio_cell_line,
- refinebio_treatment
- ) %>%
- # Let's create a variable that specifically distinguishes whether the cell line is parental or resistant -- since this is a key aspect of the experimental design
- dplyr::mutate(
- cell_line_type =
- dplyr::case_when(
- stringr::str_detect(refinebio_cell_line, "_r") ~ "resistant",
- TRUE ~ "parental"
- )
- ) %>%
- # The `pheatmap()` function requires that the row names of our annotation object matches the column names of our dataset object
- tibble::column_to_rownames("refinebio_accession_code")
+From the accompanying paper, we know that three PLX4032-sensitive parental cell lines (M229, M238 and M249) and three derived PLX4032-resistant (r) sub-lines (M229_r5, M238_r1, and M249_r4) were treated or not treated with the RAF-selective inhibitor, PLX4032 (Nazarian et al. 2010). We are going to annotate our heatmap with the variables that hold the refinebio_cell_line
and refinebio_treatment
data. We are also going to create a new column variable from our existing metadata called cell_line_type
, that will distinguish whether the refinebio_cell_line
is parental or resistant – since this is also a key aspect of the experimental design. Note that this step is very specific to our metadata; you may find that you also need to tailor the metadata for your own needs.
# Let's prepare an annotation data frame for plotting
+annotation_df <- metadata %>%
+ # We want to select the variables that we want for annotating the heatmap
+ dplyr::select(
+ refinebio_accession_code,
+ refinebio_cell_line,
+ refinebio_treatment
+ ) %>%
+ # Let's create a variable that specifically distinguishes whether
+ # the cell line is parental or resistant.
+ # This is a key aspect of the experimental design
+ dplyr::mutate(
+ cell_line_type =
+ dplyr::case_when(
+ stringr::str_detect(refinebio_cell_line, "_r") ~ "resistant",
+ TRUE ~ "parental"
+ )
+ ) %>%
+ # The `pheatmap()` function requires that the row names of our
+ # annotation object matches the column names of our dataset object
+ tibble::column_to_rownames("refinebio_accession_code")
You can create an annotated heatmap by providing our annotation object to the annotation_col
argument of the pheatmap()
function.
# Create and store the annotated heatmap object
-heatmap_annotated <-
- pheatmap(
- df_by_var,
- cluster_rows = TRUE,
- cluster_cols = TRUE,
- show_rownames = FALSE,
- annotation_col = annotation_df,
- main = "Annotated Heatmap",
- colorRampPalette(c(
- "deepskyblue",
- "black",
- "yellow"
- ))(25),
- scale = "row" # Scale values in the direction of genes (rows)
- )
+# Create and store the annotated heatmap object
+heatmap_annotated <-
+ pheatmap(
+ df_by_var,
+ cluster_rows = TRUE,
+ cluster_cols = TRUE,
+ show_rownames = FALSE,
+ annotation_col = annotation_df,
+ main = "Annotated Heatmap",
+ colorRampPalette(c(
+ "deepskyblue",
+ "black",
+ "yellow"
+ ))(25),
+ scale = "row" # Scale values with respect to genes (rows)
+ )
Now that we have annotation bars on our heatmap, we have a better idea of the cell line and treatment groups that appear to cluster together. More specifically, we can see that the samples seem to cluster by their cell lines of origin, but not necessarily as much by whether or not they received the PLX4302
treatment.
Let’s save our annotated heatmap.
@@ -3235,17 +4071,17 @@You can easily switch this to save to a JPEG or TIFF by changing the function and file name within the function to the respective file suffix.
-# Open a PNG file
-png(file.path(
- plots_dir,
- "GSE24862_heatmap_annotated.png" # Replace file name with a relevant output plot name
-))
-
-# Print your heatmap
-heatmap_annotated
-
-# Close the PNG file:
-dev.off()
+# Open a PNG file
+png(file.path(
+ plots_dir,
+ "GSE24862_heatmap_annotated.png" # Replace with a relevant plot file name
+))
+
+# Print your heatmap
+heatmap_annotated
+
+# Close the PNG file:
+dev.off()
## png
## 2
pheatmap
package allow, see the ComplexHeatmap Complete Reference Manual (Gu et al. 2016)pheatmap
package allow, see the ComplexHeatmap Complete Reference Manual (Gu et al. 2016)At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.
-# Print session info
-sessioninfo::session_info()
-## ─ Session info ───────────────────────────────────────────────────────────────
+
+## ─ Session info ─────────────────────────────────────────────────────
## setting value
## version R version 4.0.2 (2020-06-22)
## os Ubuntu 20.04 LTS
@@ -3273,13 +4109,13 @@ 6 Print session info
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
-## date 2020-10-16
+## date 2020-12-14
##
-## ─ Packages ───────────────────────────────────────────────────────────────────
+## ─ Packages ─────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.0)
## backports 1.1.10 2020-09-15 [1] RSPM (R 4.0.2)
-## cli 2.0.2 2020-02-28 [1] RSPM (R 4.0.0)
+## cli 2.1.0 2020-10-12 [1] RSPM (R 4.0.2)
## colorspace 1.4-1 2019-03-18 [1] RSPM (R 4.0.0)
## crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.0)
## digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0)
@@ -3303,6 +4139,7 @@ 6 Print session info
## pheatmap * 1.0.12 2019-01-04 [1] RSPM (R 4.0.0)
## pillar 1.4.6 2020-07-10 [1] RSPM (R 4.0.2)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.0)
+## ps 1.4.0 2020-10-07 [1] RSPM (R 4.0.2)
## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.0)
## R.cache 0.14.0 2019-12-06 [1] RSPM (R 4.0.0)
## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.2)
@@ -3310,10 +4147,9 @@ 6 Print session info
## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.2)
## R6 2.4.1 2019-11-12 [1] RSPM (R 4.0.0)
## RColorBrewer 1.1-2 2014-12-07 [1] RSPM (R 4.0.0)
-## Rcpp 1.0.5 2020-07-06 [1] RSPM (R 4.0.2)
-## readr 1.3.1 2018-12-21 [1] RSPM (R 4.0.2)
+## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.2)
## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.0)
-## rlang 0.4.7 2020-07-09 [1] RSPM (R 4.0.2)
+## rlang 0.4.8 2020-10-08 [1] RSPM (R 4.0.2)
## rmarkdown 2.4 2020-09-30 [1] RSPM (R 4.0.2)
## rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0)
## scales 1.1.1 2020-05-11 [1] RSPM (R 4.0.0)
@@ -3321,7 +4157,7 @@ 6 Print session info
## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.2)
## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.0)
## styler 1.3.2 2020-02-23 [1] RSPM (R 4.0.0)
-## tibble 3.0.3 2020-07-10 [1] RSPM (R 4.0.2)
+## tibble 3.0.4 2020-10-12 [1] RSPM (R 4.0.2)
## tidyselect 1.1.0 2020-05-11 [1] RSPM (R 4.0.0)
## vctrs 0.3.4 2020-08-29 [1] RSPM (R 4.0.2)
## withr 2.3.0 2020-09-22 [1] RSPM (R 4.0.2)
@@ -3335,17 +4171,22 @@ 6 Print session info
References
-Gu Z., R. Eils, and M. Schlesner, 2016 Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics.
+Gu Z., R. Eils, and M. Schlesner, 2016 Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics. https://doi.org/10.1093/bioinformatics/btw313
-Nazarian R., H. Shi, Q. Wang, X. Kong, and R. C. Koya et al., 2010 Melanomas acquire resistance to b-raf(V600E) inhibition by rtk or n-ras upregulation. Nature 468. https://doi.org/10.1038/nature09626
+Nazarian R., H. Shi, Q. Wang, X. Kong, and R. C. Koya et al., 2010 Melanomas acquire resistance to B-RAF(V600E) inhibition by RTK or N-RAS upregulation. Nature 468. https://doi.org/10.1038/nature09626
-Slowikowski K., 2017 Make heatmaps in r with pheatmap
+Slowikowski K., 2017 Make heatmaps in R with pheatmap. https://slowkow.com/notes/pheatmap-tutorial/
.Rmd
Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!
If you have trouble running this chunk, see our introduction to using .Rmd
s for more resources and explanations.
# Create the data folder if it doesn't exist
-if (!dir.exists("data")) {
- dir.create("data")
-}
-
-# Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
-
-# Create the plots folder if it doesn't exist
-if (!dir.exists(plots_dir)) {
- dir.create(plots_dir)
-}
-
-# Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
-
-# Create the results folder if it doesn't exist
-if (!dir.exists(results_dir)) {
- dir.create(results_dir)
-}
+# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+ dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots"
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+ dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results"
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+ dir.create(results_dir)
+}
In the same place you put this .Rmd
file, you should now have three new empty folders called data
, plots
, and results
!
For this example analysis, we will use this CREB overexpression zebrafish. Tregnago et al. (2016) measured microarray gene expression of zebrafish samples overexpressing human CREB, as well as control samples. In this analysis, we will test differential expression between the control and CREB-overexpressing groups.
+For this example analysis, we will use this zebrafish gene expression dataset. Tregnago et al. (2016) used microarrays to measure gene expression of ten zebrafish samples, five overexpressing human CREB, as well as five control samples. In this analysis, we will test differential expression between the control and CREB-overexpressing groups.
data/
folderIn order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.
First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.
-# Define the file path to the data directory
-data_dir <- file.path("data", "GSE71270") # Replace with accession number which will be the name of the folder the files will be in
-
-# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
-data_file <- file.path(data_dir, "GSE71270.tsv") # Replace with file path to your dataset
-
-# Declare the file path to the metadata file using the data directory saved as `data_dir`
-metadata_file <- file.path(data_dir, "metadata_GSE71270.tsv") # Replace with file path to your metadata
+# Define the file path to the data directory
+# Replace with the path of the folder the files will be in
+data_dir <- file.path("data", "GSE71270")
+
+# Declare the file path to the gene expression matrix file
+# inside directory saved as `data_dir`
+# Replace with the path to your dataset file
+data_file <- file.path(data_dir, "GSE71270.tsv")
+
+# Declare the file path to the metadata file
+# inside the directory saved as `data_dir`
+# Replace with the path to your metadata file
+metadata_file <- file.path(data_dir, "metadata_GSE71270.tsv")
Now that our file paths are declared, we can use the file.exists()
function to check that the files are where we specified above.
# Check if the gene expression matrix file is at the file path stored in `data_file`
-file.exists(data_file)
+# Check if the gene expression matrix file is at the path stored in `data_file`
+file.exists(data_file)
## [1] TRUE
-# Check if the metadata file is at the file path stored in `metadata_file`
-file.exists(metadata_file)
+# Check if the metadata file is at the file path stored in `metadata_file`
+file.exists(metadata_file)
## [1] TRUE
If the chunk above printed out FALSE
to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.
@@ -3070,38 +3901,35 @@See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.
-In this analysis, we will be using limma
for differential expression (Ritchie et al. 2015). We will also use EnhancedVolcano
for plotting and apeglm
for some log fold change estimates in the results table (Zhu et al. 2018; Blighe et al. 2020).
if (!("limma" %in% installed.packages())) {
- # Install this package if it isn't installed yet
- BiocManager::install("limma", update = FALSE)
-}
-if (!("EnhancedVolcano" %in% installed.packages())) {
- # Install this package if it isn't installed yet
- BiocManager::install("EnhancedVolcano", update = FALSE)
-}
-if (!("apeglm" %in% installed.packages())) {
- # Install this package if it isn't installed yet
- BiocManager::install("apeglm", update = FALSE)
-}
+In this analysis, we will be using limma
for differential expression (Ritchie et al. 2015). We will also use EnhancedVolcano
for plotting (Blighe et al. 2020).
if (!("limma" %in% installed.packages())) {
+ # Install this package if it isn't installed yet
+ BiocManager::install("limma", update = FALSE)
+}
+if (!("EnhancedVolcano" %in% installed.packages())) {
+ # Install this package if it isn't installed yet
+ BiocManager::install("EnhancedVolcano", update = FALSE)
+}
Attach the packages we need for this analysis.
-# Attach the library
-library(limma)
-
-# We will need this so we can use the pipe: %>%
-library(magrittr)
-
-# We'll use this for plotting
-library(ggplot2)
+# Attach the library
+library(limma)
+
+# We will need this so we can use the pipe: %>%
+library(magrittr)
+
+# We'll use this for plotting
+library(ggplot2)
The jitter plot we make later on with geom_jitter()
involves some randomness. As is good practice when our analysis involves randomness, we will set the seed.
set.seed(12345)
+
Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read both TSV files and add them as data frames to your environment.
We stored our file paths as objects named metadata_file
and data_file
in this previous step.
# Read in metadata TSV file
-metadata <- readr::read_tsv(metadata_file)
-## Parsed with column specification:
+
+##
+## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_character(),
## refinebio_age = col_logical(),
@@ -3121,13 +3949,14 @@ 4.2 Import and set up data
## `contact_zip/postal_code` = col_double(),
## data_row_count = col_double(),
## taxid_ch1 = col_double()
-## )
-## See spec(...) for full column specifications.
-# Read in data TSV file
-df <- readr::read_tsv(data_file) %>%
- # Tuck away the Gene ID column as rownames
- tibble::column_to_rownames("Gene")
-## Parsed with column specification:
+## )
+## ℹ Use `spec()` for the full column specifications.
+# Read in data TSV file
+expression_df <- readr::read_tsv(data_file) %>%
+ # Tuck away the Gene ID column as row names
+ tibble::column_to_rownames("Gene")
+##
+## ── Column specification ──────────────────────────────────────────────
## cols(
## Gene = col_character(),
## GSM1831675 = col_double(),
@@ -3142,43 +3971,67 @@ 4.2 Import and set up data
## GSM1831684 = col_double()
## )
Let’s ensure that the metadata and data are in the same sample order.
-# Make the data in the order of the metadata
-df <- df %>%
- dplyr::select(metadata$geo_accession)
-
-# Check if this is in the same order
-all.equal(colnames(df), metadata$geo_accession)
-## [1] TRUE
+# Make the data in the order of the metadata
+expression_df <- expression_df %>%
+ dplyr::select(metadata$geo_accession)
+
+# Check if this is in the same order
+all.equal(colnames(expression_df), metadata$geo_accession)
+## [1] TRUE
limma
needs a numeric design matrix to signify which are CREB and control samples. Here we are using the treatments supplied in the metadata to create a design matrix where the “none” samples are assigned 0
and the “amputated” samples are assigned 1
. Note that the metadata variables that signify the treatment groups might be different across datasets and might not always be underneath the category.
The genotype/variation
column contains group information we will be using for differential expression. But the /
it contains in its column name makes it more annoying to access.
-Accessing variable that have names with special characters like /
, or spaces, require extra work-arounds to ignore R’s normal interpretations of these characters.
metadata <- metadata %>%
- dplyr::rename("genotype" = `genotype/variation`) # This step will not be the same (or might not be needed at all) with a different dataset
+limma
needs a numeric design matrix to signify which are CREB and control samples. Here we are using the treatments described in the metadata table in the genotype/variation
column to create a design matrix where the “control” samples are assigned 0
and the “overexpressing the human CREB” samples are assigned 1
. Note that the metadata columns that signify the treatment groups might be different across datasets, and will almost certainly have different contents.
While the genotype/variation
column contains the group information we will be using for differential expression, the /
it contains in its column name makes it more annoying to access.
+Accessing variables that have names with special characters like /
, or spaces, requires extra workarounds to bypass R’s normal interpretation of these characters. Here we will rename it to just genotype
to make our lives much easier later.
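For reference, backticks are the usual workaround for such names. A minimal sketch with a made-up data frame, not the real metadata:

```r
# A small data frame with a column name containing "/"
df_example <- data.frame(
  `genotype/variation` = c("control", "CREB"),
  check.names = FALSE # keep the "/" instead of converting it to "."
)

# Without backticks, R would parse the "/" as division;
# backticks make the whole name one identifier
df_example$`genotype/variation`
```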
We will also recode the contents of the column, as "overexpressing the human CREB"
is a bit of an unruly name. To do this, we will use the fct_recode()
function from the forcats
package, simplifying "overexpressing the human CREB"
to just CREB
. We will also use fct_relevel()
to make sure our control
samples appear first in the factor levels.
# These renaming steps will not be the same (or might not be needed at all)
+# with a different dataset
+metadata <- metadata %>%
+ # rename the column
+ dplyr::rename("genotype" = `genotype/variation`) %>%
+ # change the names and order of the genotypes (making the column a factor)
+ dplyr::mutate(
+ genotype = genotype %>%
+ # rename the "overexpressing..." genotype to "CREB"
+ forcats::fct_recode(CREB = "overexpressing the human CREB") %>%
+ # make "control" the first level of the factor
+ forcats::fct_relevel("control")
+ )
Now we will create a model matrix based on our newly renamed genotype
variable.
# Create the design matrix from the genotype information
-des_mat <- model.matrix(~ metadata$genotype)
+# Create the design matrix from the genotype information
+des_mat <- model.matrix(~genotype, data = metadata)
+
+# Look at the design matrix
+head(des_mat)
## (Intercept) genotypeCREB
+## 1 1 1
+## 2 1 0
+## 3 1 0
+## 4 1 1
+## 5 1 0
+## 6 1 0
+When we look at this design matrix, we see that there is now a genotypeCREB
column that defines the group for each sample: 0 for control samples and 1 for the CREB samples. (The model will also fit an intercept for all samples, so we can see that here as well.)
After applying our data to linear model, in this example we apply empirical Bayes smoothing and Benjamini-Hochberg multiple testing correction. The topTable()
function default is to use Benjamini-Hochberg but this can be changed to a different method using the adjust.method
argument (see the ?topTable
help page for more about the options).
# Apply linear model to data
-fit <- lmFit(df, design = des_mat)
-
-# Apply empirical Bayes to smooth standard errors
-fit <- eBayes(fit)
-
-# Apply multiple testing correction and obtain stats
-stats_df <- topTable(fit, number = nrow(df)) %>%
- tibble::rownames_to_column("Gene")
+We will use the lmFit()
function from the limma
package to test each gene for differential expression between the two groups using a linear model. After fitting our data to the linear model, in this example we apply empirical Bayes smoothing with the eBayes()
function.
Here’s a nifty article and example about what empirical Bayes smoothing is for (Robinson).
+# Apply linear model to data
+fit <- lmFit(expression_df, design = des_mat)
+
+# Apply empirical Bayes to smooth standard errors
+fit <- eBayes(fit)
Because we are testing many different genes at once, we also want to perform multiple testing correction, which we will do with the Benjamini-Hochberg method while making a table of results with topTable()
. The topTable()
function default is to use Benjamini-Hochberg but this can be changed to a different method using the adjust.method
argument (see the ?topTable
help page for more about the options).
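As a standalone illustration of what the Benjamini-Hochberg method does (separate from limma — topTable() handles this for us), base R’s p.adjust() implements the same correction on a made-up set of p values:

```r
# Hypothetical raw p values from five independent tests
raw_p <- c(0.001, 0.01, 0.02, 0.04, 0.2)

# Each p value is scaled by (number of tests) / (its rank),
# and the adjusted values are then made monotone
p.adjust(raw_p, method = "BH")
```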
# Apply multiple testing correction and obtain stats
+stats_df <- topTable(fit, number = nrow(expression_df)) %>%
+ tibble::rownames_to_column("Gene")
## Removing intercept from test coefficients
-Let’s take a peek at what our results table looks like.
-head(stats_df)
+Let’s take a peek at our results table.
+By default, results are ordered from the largest B
(the log odds value) to the smallest, which means your most differentially expressed genes should be toward the top.
To test whether these results make sense, we can plot one of the top genes. Let’s try extracting the data for ENSDARG00000104315
and setting up its own data frame for plotting purposes.
top_gene_df <- df %>%
- # Extract this gene from `df`
- dplyr::filter(rownames(.) == "ENSDARG00000104315") %>%
- # Transpose so the gene is a column
- t() %>%
- # Transpose made this a matrix, let's make it back into a data.frame like before
- data.frame() %>%
- # Store the sample ids as their own column instead of being row names
- tibble::rownames_to_column("refinebio_accession_code") %>%
- # Join on the selected columns from metadata
- dplyr::inner_join(dplyr::select(
- metadata,
- refinebio_accession_code,
- genotype
- ))
+top_gene_df <- expression_df %>%
+ # Extract this gene from `expression_df`
+ dplyr::filter(rownames(.) == "ENSDARG00000104315") %>%
+ # Transpose so the gene is a column
+ t() %>%
+ # Transpose made this a matrix, let's make it back into a data frame
+ data.frame() %>%
+ # Store the sample ids as their own column instead of as row names
+ tibble::rownames_to_column("refinebio_accession_code") %>%
+ # Join on the selected columns from metadata
+ dplyr::inner_join(dplyr::select(
+ metadata,
+ refinebio_accession_code,
+ genotype
+ ))
## Joining, by = "refinebio_accession_code"
Let’s take a sneak peek at what our top_gene_df
looks like.
top_gene_df
+
Now let’s plot the data for ENSDARG00000104315
using our top_gene_df
.
ggplot(top_gene_df, aes(x = genotype, y = ENSDARG00000104315, color = genotype)) +
- geom_jitter(width = 0.2, height = 0) + # We'll make this a jitter plot
- theme_classic() # This makes some aesthetic changes
-
+ggplot(top_gene_df, aes(x = genotype, y = ENSDARG00000104315, color = genotype)) +
+ geom_jitter(width = 0.2, height = 0) + # We'll make this a jitter plot
+ theme_classic() # This makes some aesthetic changes
These results make sense. The overexpressing CREB group samples have much higher expression values for ENSDARG00000104315 than the control samples do.
The results in stats_df
will be saved to our results/
directory.
readr::write_tsv(stats_df, file.path(
- results_dir,
- "GSE71270_limma_results.tsv" # Replace with a relevant output name
-))
+
We’ll use the EnhancedVolcano
package’s main function to plot our data (Zhu et al. 2018).
EnhancedVolcano::EnhancedVolcano(stats_df,
- lab = stats_df$Gene, # This has to be a vector with our labels we want for our genes
- x = "logFC", # This is the column name in `stats_df` that contains what we want on the x axis
- y = "adj.P.Val" # This is the column name in `stats_df` that contains what we want on the y axis
-)
-
+We’ll use the EnhancedVolcano
package’s main function to plot our data (Zhu et al. 2018).
EnhancedVolcano::EnhancedVolcano(stats_df,
+ lab = stats_df$Gene, # This has to be a vector with our labels we want for our genes
+ x = "logFC", # This is the column name in `stats_df` that contains what we want on the x axis
+ y = "adj.P.Val" # This is the column name in `stats_df` that contains what we want on the y axis
+)
## Registered S3 methods overwritten by 'ggalt':
+## method from
+## grid.draw.absoluteGrob ggplot2
+## grobHeight.absoluteGrob ggplot2
+## grobWidth.absoluteGrob ggplot2
+## grobX.absoluteGrob ggplot2
+## grobY.absoluteGrob ggplot2
+
In this plot, green points represent genes that meet the log2 fold change cutoff; by default that cutoff is an absolute value of 1.
But no genes meet the p value cutoff, which by default is 1e-05
. We used the adjusted p values for our plot above, so you may want to adjust this with the pCutoff
argument (Take a look at all the options for tailoring this plot using ?EnhancedVolcano
).
Let’s make the same plot again, but adjust the pCutoff
since we are using multiple-testing corrected p values, and this time we will assign the plot to our environment as volcano_plot
.
volcano_plot <- EnhancedVolcano::EnhancedVolcano(stats_df,
- lab = stats_df$Gene,
- x = "logFC",
- y = "adj.P.Val",
- pCutoff = 0.01 # Because we are using adjusted p values, we can loosen this a bit
-)
-
-# Print out our plot
-volcano_plot
-
+volcano_plot <- EnhancedVolcano::EnhancedVolcano(stats_df,
+ lab = stats_df$Gene,
+ x = "logFC",
+ y = "adj.P.Val",
+ pCutoff = 0.01 # Because we are using adjusted p values, we can loosen this a bit
+)
+
+# Print out our plot
+volcano_plot
Let’s save this plot to a PNG file.
-ggsave(
- plot = volcano_plot,
- file.path(plots_dir, "GSE71270_volcano_plot.png")
-) # Replace with a plot name relevant to your data
+ggsave(
+ plot = volcano_plot,
+ file.path(plots_dir, "GSE71270_volcano_plot.png")
+) # Replace with a plot name relevant to your data
## Saving 7 x 5 in image
EnhancedVolcano
vignette has more examples on how to tailor your volcano plot (Blighe et al. 2020).EnhancedVolcano
vignette has more examples on how to tailor your volcano plot (Blighe et al. 2020).At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.
-# Print session info
-sessioninfo::session_info()
-## ─ Session info ───────────────────────────────────────────────────────────────
+
+## ─ Session info ─────────────────────────────────────────────────────
## setting value
## version R version 4.0.2 (2020-06-22)
## os Ubuntu 20.04 LTS
@@ -3280,63 +4140,79 @@ 6 Session info
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
-## date 2020-10-16
+## date 2020-12-17
##
-## ─ Packages ───────────────────────────────────────────────────────────────────
-## package * version date lib source
-## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.0)
-## backports 1.1.10 2020-09-15 [1] RSPM (R 4.0.2)
-## cli 2.0.2 2020-02-28 [1] RSPM (R 4.0.0)
-## colorspace 1.4-1 2019-03-18 [1] RSPM (R 4.0.0)
-## crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.0)
-## digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0)
-## dplyr 1.0.2 2020-08-18 [1] RSPM (R 4.0.2)
-## ellipsis 0.3.1 2020-05-15 [1] RSPM (R 4.0.0)
-## EnhancedVolcano 1.6.0 2020-04-27 [1] Bioconductor
-## evaluate 0.14 2019-05-28 [1] RSPM (R 4.0.0)
-## fansi 0.4.1 2020-01-08 [1] RSPM (R 4.0.0)
-## farver 2.0.3 2020-01-16 [1] RSPM (R 4.0.0)
-## generics 0.0.2 2018-11-29 [1] RSPM (R 4.0.0)
-## getopt 1.20.3 2019-03-22 [1] RSPM (R 4.0.0)
-## ggplot2 * 3.3.2 2020-06-19 [1] RSPM (R 4.0.1)
-## ggrepel 0.8.2 2020-03-08 [1] RSPM (R 4.0.2)
-## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.2)
-## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.0.0)
-## hms 0.5.3 2020-01-08 [1] RSPM (R 4.0.0)
-## htmltools 0.5.0 2020-06-16 [1] RSPM (R 4.0.1)
-## jsonlite 1.7.1 2020-09-07 [1] RSPM (R 4.0.2)
-## knitr 1.30 2020-09-22 [1] RSPM (R 4.0.2)
-## labeling 0.3 2014-08-23 [1] RSPM (R 4.0.0)
-## lifecycle 0.2.0 2020-03-06 [1] RSPM (R 4.0.0)
-## limma * 3.44.3 2020-06-12 [1] Bioconductor
-## magrittr * 1.5 2014-11-22 [1] RSPM (R 4.0.0)
-## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.0.0)
-## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0)
-## pillar 1.4.6 2020-07-10 [1] RSPM (R 4.0.2)
-## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.0)
-## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.0)
-## R.cache 0.14.0 2019-12-06 [1] RSPM (R 4.0.0)
-## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.2)
-## R.oo 1.24.0 2020-08-26 [1] RSPM (R 4.0.2)
-## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.2)
-## R6 2.4.1 2019-11-12 [1] RSPM (R 4.0.0)
-## Rcpp 1.0.5 2020-07-06 [1] RSPM (R 4.0.2)
-## readr 1.3.1 2018-12-21 [1] RSPM (R 4.0.2)
-## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.0)
-## rlang 0.4.7 2020-07-09 [1] RSPM (R 4.0.2)
-## rmarkdown 2.4 2020-09-30 [1] RSPM (R 4.0.2)
-## rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0)
-## scales 1.1.1 2020-05-11 [1] RSPM (R 4.0.0)
-## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.0)
-## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.2)
-## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.0)
-## styler 1.3.2 2020-02-23 [1] RSPM (R 4.0.0)
-## tibble 3.0.3 2020-07-10 [1] RSPM (R 4.0.2)
-## tidyselect 1.1.0 2020-05-11 [1] RSPM (R 4.0.0)
-## vctrs 0.3.4 2020-08-29 [1] RSPM (R 4.0.2)
-## withr 2.3.0 2020-09-22 [1] RSPM (R 4.0.2)
-## xfun 0.18 2020-09-29 [1] RSPM (R 4.0.2)
-## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.0)
+## ─ Packages ─────────────────────────────────────────────────────────
+## package * version date lib source
+## ash 1.0-15 2015-09-01 [1] RSPM (R 4.0.0)
+## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.0)
+## backports 1.1.10 2020-09-15 [1] RSPM (R 4.0.2)
+## beeswarm 0.2.3 2016-04-25 [1] RSPM (R 4.0.0)
+## cli 2.1.0 2020-10-12 [1] RSPM (R 4.0.2)
+## colorspace 1.4-1 2019-03-18 [1] RSPM (R 4.0.0)
+## crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.0)
+## digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0)
+## dplyr 1.0.2 2020-08-18 [1] RSPM (R 4.0.2)
+## ellipsis 0.3.1 2020-05-15 [1] RSPM (R 4.0.0)
+## EnhancedVolcano 1.8.0 2020-10-27 [1] Bioconductor
+## evaluate 0.14 2019-05-28 [1] RSPM (R 4.0.0)
+## extrafont 0.17 2014-12-08 [1] RSPM (R 4.0.0)
+## extrafontdb 1.0 2012-06-11 [1] RSPM (R 4.0.0)
+## fansi 0.4.1 2020-01-08 [1] RSPM (R 4.0.0)
+## farver 2.0.3 2020-01-16 [1] RSPM (R 4.0.0)
+## forcats 0.5.0 2020-03-01 [1] RSPM (R 4.0.0)
+## generics 0.0.2 2018-11-29 [1] RSPM (R 4.0.0)
+## getopt 1.20.3 2019-03-22 [1] RSPM (R 4.0.0)
+## ggalt 0.4.0 2017-02-15 [1] RSPM (R 4.0.0)
+## ggbeeswarm 0.6.0 2017-08-07 [1] RSPM (R 4.0.0)
+## ggplot2 * 3.3.2 2020-06-19 [1] RSPM (R 4.0.1)
+## ggrastr 0.2.1 2020-09-14 [1] RSPM (R 4.0.2)
+## ggrepel 0.8.2 2020-03-08 [1] RSPM (R 4.0.2)
+## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.2)
+## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.0.0)
+## hms 0.5.3 2020-01-08 [1] RSPM (R 4.0.0)
+## htmltools 0.5.0 2020-06-16 [1] RSPM (R 4.0.1)
+## jsonlite 1.7.1 2020-09-07 [1] RSPM (R 4.0.2)
+## KernSmooth 2.23-17 2020-04-26 [2] CRAN (R 4.0.2)
+## knitr 1.30 2020-09-22 [1] RSPM (R 4.0.2)
+## labeling 0.3 2014-08-23 [1] RSPM (R 4.0.0)
+## lifecycle 0.2.0 2020-03-06 [1] RSPM (R 4.0.0)
+## limma * 3.46.0 2020-10-27 [1] Bioconductor
+## magrittr * 1.5 2014-11-22 [1] RSPM (R 4.0.0)
+## maps 3.3.0 2018-04-03 [1] RSPM (R 4.0.0)
+## MASS 7.3-51.6 2020-04-26 [2] CRAN (R 4.0.2)
+## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.0.0)
+## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0)
+## pillar 1.4.6 2020-07-10 [1] RSPM (R 4.0.2)
+## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.0)
+## proj4 1.0-10 2020-03-02 [1] RSPM (R 4.0.0)
+## ps 1.4.0 2020-10-07 [1] RSPM (R 4.0.2)
+## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.0)
+## R.cache 0.14.0 2019-12-06 [1] RSPM (R 4.0.0)
+## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.2)
+## R.oo 1.24.0 2020-08-26 [1] RSPM (R 4.0.2)
+## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.2)
+## R6 2.4.1 2019-11-12 [1] RSPM (R 4.0.0)
+## RColorBrewer 1.1-2 2014-12-07 [1] RSPM (R 4.0.0)
+## Rcpp 1.0.5 2020-07-06 [1] RSPM (R 4.0.2)
+## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.2)
+## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.0)
+## rlang 0.4.8 2020-10-08 [1] RSPM (R 4.0.2)
+## rmarkdown 2.4 2020-09-30 [1] RSPM (R 4.0.2)
+## rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0)
+## Rttf2pt1 1.3.8 2020-01-10 [1] RSPM (R 4.0.0)
+## scales 1.1.1 2020-05-11 [1] RSPM (R 4.0.0)
+## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.0)
+## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.2)
+## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.0)
+## styler 1.3.2 2020-02-23 [1] RSPM (R 4.0.0)
+## tibble 3.0.4 2020-10-12 [1] RSPM (R 4.0.2)
+## tidyselect 1.1.0 2020-05-11 [1] RSPM (R 4.0.0)
+## vctrs 0.3.4 2020-08-29 [1] RSPM (R 4.0.2)
+## vipor 0.4.5 2017-03-22 [1] RSPM (R 4.0.0)
+## withr 2.3.0 2020-09-22 [1] RSPM (R 4.0.2)
+## xfun 0.18 2020-09-29 [1] RSPM (R 4.0.2)
+## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.0)
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library
@@ -3345,19 +4221,22 @@ 6 Session info
References
-Blighe K., S. Rana, and M. Lewis, 2020 EnhancedVolcano: Publication-ready volcano plots with enhanced colouring and labeling.
+Blighe K., S. Rana, and M. Lewis, 2020 EnhancedVolcano: Publication-ready volcano plots with enhanced colouring and labeling. https://github.com/kevinblighe/EnhancedVolcano
-Gonzalez I., 2014 Statistical analysis of rna-seq data.
+Gonzalez I., 2014 Statistical analysis of RNA-Seq data. http://www.nathalievialaneix.eu/doc/pdf/tutorial-rnaseq.pdf
-Klaus B., and S. Reisenauer, 2018 An end to end workflow for differential gene expression using affymetrix microarrays.
+Klaus B., and S. Reisenauer, 2018 An end to end workflow for differential gene expression using Affymetrix microarrays. https://www.bioconductor.org/packages/devel/workflows/vignettes/maEndToEnd/inst/doc/MA-Workflow.html
Ritchie M. E., B. Phipson, D. Wu, Y. Hu, and C. W. Law et al., 2015 limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43: e47. https://doi.org/10.1093/nar/gkv007
+
+Robinson D., Understanding empirical Bayes estimation (using baseball statistics). http://varianceexplained.org/r/empirical_bayes_baseball/
+
-Tregnago C., E. Manara, M. Zampini, V. Bisio, and C. Borga et al., 2016 CREB engages C/EBPδ to initiate leukemogenesis. Leukemia 30: 1887–1896.
+Tregnago C., E. Manara, M. Zampini, V. Bisio, and C. Borga et al., 2016 CREB engages C/EBPδ to initiate leukemogenesis. Leukemia 30: 1887–1896. https://doi.org/10.1038/leu.2016.98
Zhu A., J. G. Ibrahim, and M. I. Love, 2018 Heavy-tailed prior distributions for sequence count data: Removing the noise and preserving large differences. Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
@@ -3365,6 +4244,11 @@ References
+
.Rmd
Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!
If you have trouble running this chunk, see our introduction to using .Rmd
s for more resources and explanations.
# Create the data folder if it doesn't exist
-if (!dir.exists("data")) {
- dir.create("data")
-}
-
-# Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
-
-# Create the plots folder if it doesn't exist
-if (!dir.exists(plots_dir)) {
- dir.create(plots_dir)
-}
-
-# Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
-
-# Create the results folder if it doesn't exist
-if (!dir.exists(results_dir)) {
- dir.create(results_dir)
-}
+# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+ dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots"
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+ dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results"
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+ dir.create(results_dir)
+}
In the same place you put this .Rmd
file, you should now have three new empty folders called data
, plots
, and results
!
For this example analysis, we will use these medulloblastoma samples. Robinson et al. (2012) measured microarray gene expression of 71 medulloblastoma tumor samples. In this analysis, we will test differential expression across the medulloblastoma subtypes.
+For this example analysis, we will use these medulloblastoma samples. Robinson et al. (2012) measured microarray gene expression of 71 medulloblastoma tumor samples. In this analysis, we will test differential expression across the medulloblastoma subtypes.
data/
folderIn order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.
First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.
-# Define the file path to the data directory
-data_dir <- file.path("data", "GSE37418") # Replace with accession number which will be the name of the folder the files will be in
-
-# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
-data_file <- file.path(data_dir, "GSE37418.tsv") # Replace with file path to your dataset
-
-# Declare the file path to the metadata file using the data directory saved as `data_dir`
-metadata_file <- file.path(data_dir, "metadata_GSE37418.tsv") # Replace with file path to your metadata
+# Define the file path to the data directory
+# Replace with the path of the folder the files will be in
+data_dir <- file.path("data", "GSE37418")
+
+# Declare the file path to the gene expression matrix file
+# inside directory saved as `data_dir`
+# Replace with the path to your dataset file
+data_file <- file.path(data_dir, "GSE37418.tsv")
+
+# Declare the file path to the metadata file
+# inside the directory saved as `data_dir`
+# Replace with the path to your metadata file
+metadata_file <- file.path(data_dir, "metadata_GSE37418.tsv")
Now that our file paths are declared, we can use the file.exists()
function to check that the files are where we specified above.
# Check if the gene expression matrix file is at the file path stored in `data_file`
-file.exists(data_file)
+# Check if the gene expression matrix file is at the path stored in `data_file`
+file.exists(data_file)
## [1] TRUE
-# Check if the metadata file is at the file path stored in `metadata_file`
-file.exists(metadata_file)
+# Check if the metadata file is at the file path stored in `metadata_file`
+file.exists(metadata_file)
## [1] TRUE
If the chunk above printed out FALSE
to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.
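In short, file.path() builds a path from its pieces with the correct separator, which is why we use it instead of pasting strings together. A minimal illustration:

```r
# file.path() joins path components with the platform's file separator
file.path("data", "GSE37418", "GSE37418.tsv")
# In R this produces "data/GSE37418/GSE37418.tsv"
```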
@@ -3070,30 +3901,31 @@See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.
-In this analysis, we will be using limma
for differential expression (Ritchie et al. 2015).
if (!("limma" %in% installed.packages())) {
- # Install this package if it isn't installed yet
- BiocManager::install("limma", update = FALSE)
-}
+In this analysis, we will be using limma
for differential expression (Ritchie et al. 2015).
if (!("limma" %in% installed.packages())) {
+ # Install this package if it isn't installed yet
+ BiocManager::install("limma", update = FALSE)
+}
Attach the packages we need for this analysis.
-# Attach the library
-library(limma)
-
-# We will need this so we can use the pipe: %>%
-library(magrittr)
-
-# We'll use this for plotting
-library(ggplot2)
+# Attach the library
+library(limma)
+
+# We will need this so we can use the pipe: %>%
+library(magrittr)
+
+# We'll use this for plotting
+library(ggplot2)
The jitter plot we make later on with geom_jitter()
involves some randomness. As is good practice when our analysis involves randomness, we will set the seed.
set.seed(12345)
+
Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read both TSV files and add them as data frames to your environment.
We stored our file paths as objects named metadata_file
and data_file
in this previous step.
# Read in metadata TSV file
-metadata <- readr::read_tsv(metadata_file)
-## Parsed with column specification:
+
+##
+## ── Column specification ──────────────────────────────────────────────────────────────────────────────────
## cols(
## .default = col_character(),
## refinebio_age = col_logical(),
@@ -3114,23 +3946,24 @@ 4.2 Import and set up data
## `contact_zip/postal_code` = col_double(),
## data_row_count = col_double(),
## taxid_ch1 = col_double()
-## )
-## See spec(...) for full column specifications.
-# Read in data TSV file
-df <- readr::read_tsv(data_file) %>%
- # Tuck away the gene ID column as rownames
- tibble::column_to_rownames("Gene")
-## Parsed with column specification:
+## )
+## ℹ Use `spec()` for the full column specifications.
+# Read in data TSV file
+expression_df <- readr::read_tsv(data_file) %>%
+ # Tuck away the gene ID column as row names
+ tibble::column_to_rownames("Gene")
+##
+## ── Column specification ──────────────────────────────────────────────────────────────────────────────────
## cols(
## .default = col_double(),
## Gene = col_character()
## )
-## See spec(...) for full column specifications.
+## ℹ Use `spec()` for the full column specifications.
We will be using the subgroup
variable labels in our metadata to test differential expression across. Let’s take a look at how many samples of each subgroup we have.
metadata %>% dplyr::count(subgroup)
+
Note that the U
and the SHH OUTLIER
samples are gone and only the four groups we are interested in are left.
But, we still need to filter these samples out from the expression data that’s stored in df
.
# Make the data in the order of the metadata
-df <- df %>%
- dplyr::select(filtered_metadata$geo_accession)
-
-# Check if this is in the same order
-all.equal(colnames(df), filtered_metadata$geo_accession)
+Note that the U
and the SHH OUTLIER
subgroups are gone and only the four groups we are interested in are left.
But we still need to filter these samples out from the expression data that’s stored in expression_df
.
# Make the data in the order of the metadata
+expression_df <- expression_df %>%
+ dplyr::select(filtered_metadata$geo_accession)
+
+# Check if this is in the same order
+all.equal(colnames(expression_df), filtered_metadata$geo_accession)
## [1] TRUE
limma
needs a numeric design matrix to signify which samples are of which subtype of medulloblastoma. Now we will create a model matrix based on our subgroup
variable. We are using a + 0
in the model which sets the intercept to 0 so the subgroup effects capture expression for that group, rather than difference from the first group. If you have a control group, you might want that to be the intercept.
# Create the design matrix
-des_mat <- model.matrix(~ filtered_metadata$subgroup + 0)
+limma
needs a numeric design matrix to signify which samples are of which subtype of medulloblastoma. Now we will create a model matrix based on our subgroup
variable. We are using a + 0
in the model, which sets the intercept to 0 so the subgroup effects capture expression for that group rather than the difference from the first group. If you have a control group, you might want to leave off the + 0
so the model includes an intercept representing the control group expression level, with the remaining coefficients representing changes relative to that expression level.
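To see the difference the + 0 makes, here is a toy design matrix comparison using a made-up factor, not the medulloblastoma data:

```r
# Toy factor standing in for the subgroup variable (not the real data)
group <- factor(c("G3", "G4", "SHH", "WNT"))

# With an intercept: the first level (G3) is absorbed into the intercept,
# and the other columns estimate differences from it
model.matrix(~group)

# With + 0: no intercept, so each column estimates that group's own mean
model.matrix(~ group + 0)
```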
Let’s take a look at the design matrix we created.
-# Print out the design matrix
-head(des_mat)
-## filtered_metadata$subgroupG3 filtered_metadata$subgroupG4
-## 1 0 1
-## 2 0 1
-## 3 0 0
-## 4 1 0
-## 5 0 1
-## 6 0 0
-## filtered_metadata$subgroupSHH filtered_metadata$subgroupWNT
-## 1 0 0
-## 2 0 0
-## 3 1 0
-## 4 0 0
-## 5 0 0
-## 6 1 0
-The design matrix column names are a bit messy, so we will neaten them up by dropping the filtered_metadata$subgroup
designation they all have.
# Make the column names less messy
-colnames(des_mat) <- stringr::str_remove(colnames(des_mat), "filtered_metadata\\$subgroup")
-Side note: If you are wondering why there are two \
above in "filtered_metadata\\$subgroup"
, that’s called an escape character. There’s a whole universe of things called regular expressions (regex) that can be super handy for string manipulations.
## subgroupG3 subgroupG4 subgroupSHH subgroupWNT
+## 1 0 1 0 0
+## 2 0 1 0 0
+## 3 0 0 1 0
+## 4 1 0 0 0
+## 5 0 1 0 0
+## 6 0 0 1 0
+The design matrix column names are a bit messy, so we will neaten them up by dropping the subgroup
designation they all have.
Now we are ready to start fitting our differential expression model to the data. To accommodate our design, which has more than two groups this time, we will need to do this in a couple of steps.
-First we need to fit our basic linear model to the data, then apply empirical Bayes smoothing.
-# Apply linear model to data
-fit <- lmFit(df, design = des_mat)
-
-# Apply empirical Bayes to smooth standard errors
-fit <- eBayes(fit)
-Now that we have our basic model fitting, we will want to make the contrasts among all our groups. Depending on your scientific questions, you will need to customize the next steps. Consulting the limma users guide for how to set up your model based on your hypothesis is a good idea.
+We will use the lmFit()
function from the limma
package to test each gene for differential expression between the two groups using a linear model. After fitting our data to the linear model, in this example we apply empirical Bayes smoothing using the eBayes()
function.
Here’s a nifty article and example about what empirical Bayes smoothing is for (Robinson).
+# Apply linear model to data
+fit <- lmFit(expression_df, design = des_mat)
+
+# Apply empirical Bayes to smooth standard errors
+fit <- eBayes(fit)
Now that we have our basic model fit, we will want to investigate the contrasts among all our groups. Depending on your scientific questions, you will need to customize the next steps. Consulting the limma users guide for how to set up your model based on your hypothesis is a good idea.
In this contrasts matrix, we are comparing each subtype to all the other subtypes.
We’re dividing by three in this expression so that each group is compared to the average of the other three groups (makeContrasts()
doesn’t allow you to use functions like mean()
; it wants a formula).
contrast_matrix <- makeContrasts(
- "G3vsOther" = G3 - (G4 + SHH + WNT) / 3,
- "G4vsOther" = G4 - (G3 + SHH + WNT) / 3,
- "SHHvsOther" = SHH - (G3 + G4 + WNT) / 3,
- "WNTvsOther" = WNT - (G3 + G4 + SHH) / 3,
- levels = des_mat
-)
-Side note: If you did have a control group you wanted to compare each group to, you could make each contrast to that control group, so the formulate would look like G3 = G3 - Control
for each one. We highly recommend consulting the limma users guide for figuring out what your makeContrasts()
and model.matrix()
setups should look like (Ritchie et al. 2015).
Now that we have the contrasts matrix set up, we can use it to re-fit the model and re-smooth it with eBayes()
.
# Fit the model according to the contrasts matrix
-contrasts_fit <- contrasts.fit(fit, contrast_matrix)
-
-# Re-smooth the Bayes
-contrasts_fit <- eBayes(contrasts_fit)
-Here’s a nifty article and example about what the empirical Bayes smoothing is for (Robinson).
+contrast_matrix <- makeContrasts(
+ "G3vsOther" = G3 - (G4 + SHH + WNT) / 3,
+ "G4vsOther" = G4 - (G3 + SHH + WNT) / 3,
+ "SHHvsOther" = SHH - (G3 + G4 + WNT) / 3,
+ "WNTvsOther" = WNT - (G3 + G4 + SHH) / 3,
+ levels = des_mat
+)
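Under the hood, a contrast like G3vsOther is just a vector of weights applied to the fitted group means. A small base R sketch with made-up group means shows what the G3 - (G4 + SHH + WNT) / 3 formula computes:

```r
# Made-up mean expression of one gene in each subgroup
group_means <- c(G3 = 10, G4 = 4, SHH = 5, WNT = 6)

# The weights encoded by "G3vsOther" = G3 - (G4 + SHH + WNT) / 3
g3_vs_other_weights <- c(G3 = 1, G4 = -1 / 3, SHH = -1 / 3, WNT = -1 / 3)

# The contrast is G3's mean minus the average of the other three groups
contrast_value <- sum(g3_vs_other_weights * group_means)
contrast_value
```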
Side note: If you did have a control group you wanted to compare each group to, you could make each contrast to that control group, so the formulae would look like G3 = G3 - Control
for each one. We highly recommend consulting the limma users guide for figuring out what your makeContrasts()
and model.matrix()
setups should look like (Ritchie et al. 2015).
Now that we have the contrasts matrix set up, we can use it to re-fit the model with contrasts.fit()
and re-smooth it with eBayes()
.
# Fit the model according to the contrasts matrix
+contrasts_fit <- contrasts.fit(fit, contrast_matrix)
+
+# Re-smooth the Bayes
+contrasts_fit <- eBayes(contrasts_fit)
Now let’s create the results table based on the contrasts fitted model.
-This step will provide the Benjamini-Hochberg multiple testing correction. The topTable()
function default is to use Benjamini-Hochberg but this can be changed to a different method using the adjust.method
argument (see the ?topTable
help page for more about the options).
# Apply multiple testing correction and obtain stats
-stats_df <- topTable(contrasts_fit, number = nrow(df)) %>%
- tibble::rownames_to_column("Gene")
+This step will also apply the Benjamini-Hochberg multiple testing correction. The topTable()
function default is to use Benjamini-Hochberg but this can be changed to a different method using the adjust.method
argument (see the ?topTable
help page for more about the options).
# Apply multiple testing correction and obtain stats
+stats_df <- topTable(contrasts_fit, number = nrow(expression_df)) %>%
+ tibble::rownames_to_column("Gene")
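If you're curious what the Benjamini-Hochberg adjustment does, the same procedure is available in base R as p.adjust(method = "BH"). A quick sketch with made-up p values:

```r
# Made-up p values for four hypothetical genes
p_values <- c(geneA = 0.01, geneB = 0.02, geneC = 0.03, geneD = 0.04)

# Benjamini-Hochberg: each sorted p value is scaled by
# (number of tests / rank), then the results are made monotone
bh_adjusted <- p.adjust(p_values, method = "BH")
bh_adjusted
```

Adjusted p values are always at least as large as the raw ones, which is why filtering on adj.P.Val is more conservative than filtering on P.Value.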
Let’s take a peek at our results table.
-head(stats_df)
+
For each gene, each group’s fold change in expression compared to the average of the other groups is reported.
@@ -3233,134 +4058,136 @@To test if these results make sense, we can make a plot of one of top genes. Let’s try extracting the data for ENSG00000128683
and set up its own data frame for plotting purposes. Based on the results in stats_df
, we should expect this gene to be much higher in the WNT
samples.
First we will need to set up the data for this gene and the subgroup labels into a data frame for plotting.
-top_gene_df <- df %>%
- # Extract this gene from `df`
- dplyr::filter(rownames(.) == "ENSG00000128683") %>%
- # Transpose so the gene is a column
- t() %>%
- # Transpose made this a matrix, let's make it back into a data.frame like before
- data.frame() %>%
- # Store the sample ids as their own column instead of being row names
- tibble::rownames_to_column("refinebio_accession_code") %>%
- # Join on the selected columns from metadata
- dplyr::inner_join(dplyr::select(
- metadata,
- refinebio_accession_code,
- subgroup
- ))
+top_gene_df <- expression_df %>%
+ # Extract this gene from `expression_df`
+ dplyr::filter(rownames(.) == "ENSG00000128683") %>%
+ # Transpose so the gene is a column
+ t() %>%
+ # Transpose made this a matrix, let's make it back into a data.frame like before
+ data.frame() %>%
+ # Store the sample ids as their own column instead of being row names
+ tibble::rownames_to_column("refinebio_accession_code") %>%
+ # Join on the selected columns from metadata
+ dplyr::inner_join(dplyr::select(
+ metadata,
+ refinebio_accession_code,
+ subgroup
+ ))
## Joining, by = "refinebio_accession_code"
Let’s take a sneak peek at our top_gene_df
.
head(top_gene_df)
+
Now let’s plot the data for ENSG00000128683
using our top_gene_df
. We should expect this gene to be expressed at much higher levels in the WNT
group samples.
ggplot(top_gene_df, aes(x = subgroup, y = ENSG00000128683, color = subgroup)) +
- geom_jitter(width = 0.2, height = 0) + # We'll make this a jitter plot
- theme_classic() # This makes some aesthetic changes
-
+ggplot(top_gene_df, aes(x = subgroup, y = ENSG00000128683, color = subgroup)) +
+ geom_jitter(width = 0.2, height = 0) + # We'll make this a jitter plot
+ theme_classic() # This makes some aesthetic changes
Yes! These results make sense. The WNT samples have much higher expression of ENSG00000128683 than the other samples.
The results in stats_df
will be saved to our results/
directory.
readr::write_tsv(stats_df, file.path(
- results_dir,
- "GSE37418_limma_results.tsv" # Replace with a relevant output name
-))
+
We’ll use the ggplot2
to make a set of volcano plots. But first, we need to set up our data for plotting. We will need the p values from the individual contrasts as well as the log fold changes.
We’ll use ggplot2
to make a set of volcano plots. But first, we need to set up our data for plotting. We will need the p values from the individual contrasts as well as the log fold changes.
We can obtain the contrast p values from the contrasts_fit
object and make it a longer format that the ggplot()
function will want for plotting.
# Let's extract the contrast p values for each and convert them to -log10()
-contrast_p_vals_df <- -log10(contrasts_fit$p.value) %>%
- # Make this into a data frame
- as.data.frame() %>%
- # Store genes as their own column
- tibble::rownames_to_column("Gene") %>%
- # Make this into long format
- tidyr::pivot_longer(dplyr::contains("vsOther"),
- names_to = "contrast",
- values_to = "neg_log10_p_val"
- )
+# Let's extract the contrast p values for each and transform them with -log10()
+contrast_p_vals_df <- -log10(contrasts_fit$p.value) %>%
+ # Make this into a data frame
+ as.data.frame() %>%
+ # Store genes as their own column
+ tibble::rownames_to_column("Gene") %>%
+ # Make this into long format
+ tidyr::pivot_longer(dplyr::contains("vsOther"),
+ names_to = "contrast",
+ values_to = "neg_log10_p_val"
+ )
Now let’s extract the log fold changes from stats_df
.
# Let's extract the fold changes from `stats_df`
-log_fc_df <- stats_df %>%
- # We only want to keep the `Gene` column as well
- dplyr::select("Gene", dplyr::contains("vsOther")) %>%
- # Make this a longer format
- tidyr::pivot_longer(dplyr::contains("vsOther"),
- names_to = "contrast",
- values_to = "logFoldChange"
- )
+# Let's extract the fold changes from `stats_df`
+log_fc_df <- stats_df %>%
+ # We only want to keep the `Gene` column as well
+ dplyr::select("Gene", dplyr::contains("vsOther")) %>%
+ # Make this a longer format
+ tidyr::pivot_longer(dplyr::contains("vsOther"),
+ names_to = "contrast",
+ values_to = "logFoldChange"
+ )
We can perform an inner_join()
of both these datasets using both their Gene
and contrast
columns.
plot_df <- log_fc_df %>%
- dplyr::inner_join(contrast_p_vals_df,
- by = c("Gene", "contrast"),
- # This argument will automatically tack this on the end of the column names
- # from the respective data frames - this way we can keep track of which columns are from which
- suffix = c("_log_fc", "_p_val")
- )
+plot_df <- log_fc_df %>%
+ dplyr::inner_join(contrast_p_vals_df,
+ by = c("Gene", "contrast"),
+ # This argument will add the given suffixes to the column names
+ # from the respective data frames, helping us keep track of which columns
+ # hold which types of values
+ suffix = c("_log_fc", "_p_val")
+ )
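If you prefer base R, merge() performs the same kind of inner join, including the suffix behavior; here is a sketch with two tiny made-up tables:

```r
# Tiny made-up stand-ins for the fold change and p value tables
fold_changes <- data.frame(
  Gene = c("g1", "g2"), contrast = "G3vsOther", value = c(2.5, -1.0)
)
p_vals <- data.frame(Gene = "g1", contrast = "G3vsOther", value = 3.2)

# Rows are kept only when both `Gene` and `contrast` match in both tables;
# `suffixes` disambiguates the shared `value` column
joined <- merge(fold_changes, p_vals,
  by = c("Gene", "contrast"),
  suffixes = c("_log_fc", "_p_val")
)
joined
```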
Let’s print out a preview of plot_df
.
# Print out what this looks like
-head(plot_df)
+
Let’s declare what we consider to be significant cutoffs for fold change and for -log10 p values. By saving these as their own variables, we only need to change the cutoffs in one place if we want to adjust them later.
-# This is equivalent to p value < 0.05
-p_val_cutoff <- 1.301
-
-# Absolute value cutoff for fold changes
-abs_fc_cutoff <- 5
+# Convert p value cutoff to negative log 10 scale
+p_val_cutoff <- -log10(0.05)
+
+# Absolute value cutoff for fold changes
+abs_fc_cutoff <- 5
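As a sanity check on this transformation: -log10 turns small p values into large positive numbers, so the cutoff we just computed matches the 1.301 value that was previously written out by hand.

```r
# -log10 turns small p values into large positive numbers
p_val_cutoff <- -log10(0.05)
p_val_cutoff # about 1.301

# A smaller p value maps to a bigger -log10 value
-log10(0.001) # about 3
```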
Now we can use these cutoffs to make a new variable that declares which genes we consider significant. We will use some logic with dplyr::case_when()
to do this.
plot_df <- plot_df %>%
- dplyr::mutate(
- signif_label = dplyr::case_when(
- abs(logFoldChange) > abs_fc_cutoff & neg_log10_p_val > p_val_cutoff ~ "p-val and FC",
- abs(logFoldChange) > abs_fc_cutoff ~ "FC",
- neg_log10_p_val > p_val_cutoff ~ "p-val",
- TRUE ~ "NS"
- )
- )
+plot_df <- plot_df %>%
+ dplyr::mutate(
+ signif_label = dplyr::case_when(
+ abs(logFoldChange) > abs_fc_cutoff & neg_log10_p_val > p_val_cutoff
+ ~ "p-val and FC",
+ abs(logFoldChange) > abs_fc_cutoff ~ "FC",
+ neg_log10_p_val > p_val_cutoff ~ "p-val",
+ TRUE ~ "NS"
+ )
+ )
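dplyr::case_when() evaluates its conditions in order and assigns the first label that matches. The same logic can be sketched in base R with nested ifelse() calls on some made-up values:

```r
# Made-up values for three hypothetical genes
lfc <- c(6, 2, 0.5) # log fold changes
nlp <- c(2, 2, 0.1) # -log10(p values)
abs_fc_cutoff <- 5
p_val_cutoff <- -log10(0.05)

# Conditions are checked in order; the first match wins
signif_label <- ifelse(
  abs(lfc) > abs_fc_cutoff & nlp > p_val_cutoff, "p-val and FC",
  ifelse(abs(lfc) > abs_fc_cutoff, "FC",
    ifelse(nlp > p_val_cutoff, "p-val", "NS")
  )
)
signif_label # "p-val and FC" "p-val" "NS"
```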
Now we’re ready to plot the volcanoes!
-volcanoes_plot <- ggplot(
- plot_df,
- aes(
- x = logFoldChange, # Fold change as x value
- y = neg_log10_p_val, # -log10(p value) for the contrasts
- color = signif_label
- ) # Color code by significance cutoffs variable we made
-) +
- # Making this a scatter plot with dots that are 30% opaque using the `alpha` argument
- geom_point(alpha = 0.3) +
- # Using our `p_val_cutoff` for our line here
- geom_hline(yintercept = p_val_cutoff, linetype = "dashed") +
- # Using our `abs_fc_cutoff` for our lines here
- geom_vline(xintercept = c(-abs_fc_cutoff, abs_fc_cutoff), linetype = "dashed") +
- # The default colors aren't great, we'll specify our own here
- scale_colour_manual(values = c("#67a9cf", "darkgray", "gray", "#a1d76a")) +
- # Let's be more specific about what this p value is in our y axis label
- ylab("Contrast -log10(p value)") +
- # This makes separate plots for each contrast!
- facet_wrap(~contrast) +
- # Just for making it prettier!
- theme_classic()
-
-# Print out the plot!
-volcanoes_plot
+volcanoes_plot <- ggplot(
+ plot_df,
+ aes(
+ x = logFoldChange, # Fold change as x value
+ y = neg_log10_p_val, # -log10(p value) for the contrasts
+ color = signif_label # Color code by significance cutoffs variable we made
+ )
+) +
+ # Make a scatter plot with points that are 30% opaque using `alpha`
+ geom_point(alpha = 0.3) +
+ # Draw our `p_val_cutoff` for line here
+ geom_hline(yintercept = p_val_cutoff, linetype = "dashed") +
+ # Using our `abs_fc_cutoff` for our lines here
+ geom_vline(xintercept = c(-abs_fc_cutoff, abs_fc_cutoff), linetype = "dashed") +
+ # The default colors aren't great, we'll specify our own here
+ scale_colour_manual(values = c("#67a9cf", "darkgray", "gray", "#a1d76a")) +
+ # Let's be more specific about what this p value is in our y axis label
+ ylab("Contrast -log10(p value)") +
+ # This makes separate plots for each contrast!
+ facet_wrap(~contrast) +
+ # Just for making it prettier!
+ theme_classic()
+
+# Print out the plot!
+volcanoes_plot
Here the green points might be of interest. We recommend ColorBrewer for finding different color sets if you don’t like the ones we used.
Let’s save these volcanoes to a PNG file.
-ggsave(
- plot = volcanoes_plot,
- file.path(plots_dir, "GSE37418_results_volcano_plots.png")
-)
+
## Saving 7 x 5 in image
ggplot2
cheatsheet has a summary of ggplot2 options that might give you some inspiration for tweaking the volcano plot.
At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this analysis.
-# Print session info
-sessioninfo::session_info()
-## ─ Session info ───────────────────────────────────────────────────────────────
+
+## ─ Session info ─────────────────────────────────────────────────────
## setting value
## version R version 4.0.2 (2020-06-22)
## os Ubuntu 20.04 LTS
@@ -3390,13 +4217,13 @@ 6 Session info
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
-## date 2020-10-16
+## date 2020-12-16
##
-## ─ Packages ───────────────────────────────────────────────────────────────────
+## ─ Packages ─────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.0)
## backports 1.1.10 2020-09-15 [1] RSPM (R 4.0.2)
-## cli 2.0.2 2020-02-28 [1] RSPM (R 4.0.0)
+## cli 2.1.0 2020-10-12 [1] RSPM (R 4.0.2)
## colorspace 1.4-1 2019-03-18 [1] RSPM (R 4.0.0)
## crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.0)
## digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0)
@@ -3416,22 +4243,22 @@ 6 Session info
## knitr 1.30 2020-09-22 [1] RSPM (R 4.0.2)
## labeling 0.3 2014-08-23 [1] RSPM (R 4.0.0)
## lifecycle 0.2.0 2020-03-06 [1] RSPM (R 4.0.0)
-## limma * 3.44.3 2020-06-12 [1] Bioconductor
+## limma * 3.46.0 2020-10-27 [1] Bioconductor
## magrittr * 1.5 2014-11-22 [1] RSPM (R 4.0.0)
## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.0.0)
## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0)
## pillar 1.4.6 2020-07-10 [1] RSPM (R 4.0.2)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.0)
+## ps 1.4.0 2020-10-07 [1] RSPM (R 4.0.2)
## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.0)
## R.cache 0.14.0 2019-12-06 [1] RSPM (R 4.0.0)
## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.2)
## R.oo 1.24.0 2020-08-26 [1] RSPM (R 4.0.2)
## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.2)
## R6 2.4.1 2019-11-12 [1] RSPM (R 4.0.0)
-## Rcpp 1.0.5 2020-07-06 [1] RSPM (R 4.0.2)
-## readr 1.3.1 2018-12-21 [1] RSPM (R 4.0.2)
+## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.2)
## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.0)
-## rlang 0.4.7 2020-07-09 [1] RSPM (R 4.0.2)
+## rlang 0.4.8 2020-10-08 [1] RSPM (R 4.0.2)
## rmarkdown 2.4 2020-09-30 [1] RSPM (R 4.0.2)
## rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0)
## scales 1.1.1 2020-05-11 [1] RSPM (R 4.0.0)
@@ -3439,7 +4266,7 @@ 6 Session info
## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.2)
## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.0)
## styler 1.3.2 2020-02-23 [1] RSPM (R 4.0.0)
-## tibble 3.0.3 2020-07-10 [1] RSPM (R 4.0.2)
+## tibble 3.0.4 2020-10-12 [1] RSPM (R 4.0.2)
## tidyr 1.1.2 2020-08-27 [1] RSPM (R 4.0.2)
## tidyselect 1.1.0 2020-05-11 [1] RSPM (R 4.0.0)
## vctrs 0.3.4 2020-08-29 [1] RSPM (R 4.0.2)
@@ -3451,23 +4278,28 @@ 6 Session info
## [2] /usr/local/lib/R/library
-Gonzalez I., 2014 Statistical analysis of rna-seq data.
+Gonzalez I., 2014 Statistical analysis of RNA-Seq data. http://www.nathalievialaneix.eu/doc/pdf/tutorial-rnaseq.pdf
-Klaus B., and S. Reisenauer, 2018 An end to end workflow for differential gene expression using affymetrix microarrays.
+Klaus B., and S. Reisenauer, 2018 An end to end workflow for differential gene expression using Affymetrix microarrays. https://www.bioconductor.org/packages/devel/workflows/vignettes/maEndToEnd/inst/doc/MA-Workflow.html
Ritchie M. E., B. Phipson, D. Wu, Y. Hu, and C. W. Law et al., 2015 limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43: e47. https://doi.org/10.1093/nar/gkv007
-Robinson G., M. Parker, T. A. Kranenburg, C. Lu, and X. Chen et al., 2012 Novel mutations target distinct subgroups of medulloblastoma. Nature 488: 43–48.
+Robinson G., M. Parker, T. A. Kranenburg, C. Lu, and X. Chen et al., 2012 Novel mutations target distinct subgroups of medulloblastoma. Nature 488: 43–48. https://doi.org/10.1038/nature11213
-Robinson D., Understanding empirical bayes estimation (using baseball statistics)
+Robinson D., Understanding empirical Bayes estimation (using baseball statistics). http://varianceexplained.org/r/empirical_bayes_baseball/
.Rmd
Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!
If you have trouble running this chunk, see our introduction to using .Rmd
s for more resources and explanations.
# Create the data folder if it doesn't exist
-if (!dir.exists("data")) {
- dir.create("data")
-}
-
-# Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
-
-# Create the plots folder if it doesn't exist
-if (!dir.exists(plots_dir)) {
- dir.create(plots_dir)
-}
-
-# Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
-
-# Create the results folder if it doesn't exist
-if (!dir.exists(results_dir)) {
- dir.create(results_dir)
-}
+# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+ dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots"
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+ dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results"
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+ dir.create(results_dir)
+}
In the same place you put this .Rmd
file, you should now have three new empty folders called data
, plots
, and results
!
In order for our example here to run without a hitch, we need these files to be in these locations, so we’ve constructed a test to check for them before we get started with the analysis. These chunks will declare your file paths and double-check that your files are in the right place.
First we will declare the file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset we are using for this analysis (see the next section for more on this), we will only have to change the file path here to get started.
-# Define the file path to the data directory
-data_dir <- file.path("data", "GSE37382") # Replace with accession number which will be the name of the folder the files will be in
-
-# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
-data_file <- file.path(data_dir, "GSE37382.tsv") # Replace with file path to your dataset
-
-# Declare the file path to the metadata file using the data directory saved as `data_dir`
-metadata_file <- file.path(data_dir, "metadata_GSE37382.tsv") # Replace with file path to your metadata
+# Define the file path to the data directory
+# Replace with the path of the folder the files will be in
+data_dir <- file.path("data", "GSE37382")
+
+# Declare the file path to the gene expression matrix file
+# inside directory saved as `data_dir`
+# Replace with the path to your dataset file
+data_file <- file.path(data_dir, "GSE37382.tsv")
+
+# Declare the file path to the metadata file
+# inside the directory saved as `data_dir`
+# Replace with the path to your metadata file
+metadata_file <- file.path(data_dir, "metadata_GSE37382.tsv")
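A quick note on file.path(): it joins path components with the separator appropriate for your operating system, which keeps code portable across platforms. A minimal illustration using the same components as above:

```r
# file.path() joins pieces with the platform's path separator
demo_path <- file.path("data", "GSE37382", "metadata_GSE37382.tsv")
demo_path

# Equivalent to pasting with .Platform$file.sep by hand
identical(
  demo_path,
  paste("data", "GSE37382", "metadata_GSE37382.tsv",
    sep = .Platform$file.sep
  )
)
```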
Now that our file paths are declared, we can use the file.exists()
function to check that the files are where we specified above.
# Check if the gene expression matrix file is at the file path stored in `data_file`
-file.exists(data_file)
+# Check if the gene expression matrix file is at the path stored in `data_file`
+file.exists(data_file)
## [1] TRUE
-# Check if the metadata file is at the file path stored in `metadata_file`
-file.exists(metadata_file)
+# Check if the metadata file is at the file path stored in `metadata_file`
+file.exists(metadata_file)
## [1] TRUE
If the chunk above printed out FALSE
to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.
@@ -3072,19 +3903,20 @@See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.
Attach the packages we need for this analysis:
-# Attach the library
-library(ggplot2)
-
-# We will need this so we can use the pipe: %>%
-library(magrittr)
+
Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read in both TSV files and add them as data frames to your environment.
We stored our file paths as objects named metadata_file
and data_file
in this previous step.
# Read in metadata TSV file
-metadata <- readr::read_tsv(metadata_file)
-## Parsed with column specification:
+
+##
+## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_character(),
## refinebio_age = col_double(),
@@ -3104,154 +3936,147 @@ 4.2 Import and set up data
## `contact_zip/postal_code` = col_double(),
## data_row_count = col_double(),
## taxid_ch1 = col_double()
-## )
-## See spec(...) for full column specifications.
-# Read in data TSV file
-df <- readr::read_tsv(data_file) %>%
- # Tuck away the gene ID column as rownames
- tibble::column_to_rownames("Gene")
-## Parsed with column specification:
+## )
+## ℹ Use `spec()` for the full column specifications.
+# Read in data TSV file
+df <- readr::read_tsv(data_file) %>%
+ # Tuck away the gene ID column as row names, leaving only numeric values
+ tibble::column_to_rownames("Gene")
+##
+## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_double(),
## Gene = col_character()
## )
-## See spec(...) for full column specifications.
+## ℹ Use `spec()` for the full column specifications.
Let’s ensure that the metadata and data are in the same sample order.
-# Make the data in the order of the metadata
-df <- df %>%
- dplyr::select(metadata$geo_accession)
-
-# Check if this is in the same order
-all.equal(colnames(df), metadata$geo_accession)
+# Make sure the columns (samples) are in the same order as the metadata
+df <- df %>%
+ dplyr::select(metadata$geo_accession)
+
+# Check if this is in the same order
+all.equal(colnames(df), metadata$geo_accession)
## [1] TRUE
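The dplyr::select() call above works because selecting columns by a character vector returns them in that vector's order. A toy base R sketch of the same idea:

```r
# A tiny made-up expression matrix with three "samples"
expr <- matrix(1:6, nrow = 2, dimnames = list(NULL, c("s1", "s2", "s3")))

# Pretend this order came from the metadata
sample_order <- c("s3", "s1", "s2")

# Indexing by a character vector returns the columns in that order
reordered <- expr[, sample_order]
colnames(reordered)
```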
Now we are going to use a combination of functions from base R and the ggplot2
package to perform and visualize the results of the Principal Component Analysis (PCA) dimension reduction technique on our medulloblastoma samples.
In this code chunk, we are going to perform Principal Component Analysis (PCA) on our data and create a data frame using the PCA scores and the variables from our metadata that we are going to use to annotate our plot later. We are using the base R prcomp()
function to perform Principal Component Analysis (PCA) here.
# Perform Principal Component Analysis (PCA) using the `prcomp()` function
-pca <- prcomp(
- t(df), # We have to transpose our data frame so we are obtaining PCA scores for samples instead of genes
- scale = TRUE # This tells R that we want the variables scaled to have unit variance
-)
-Let’s take a preview at the PCA results. We are using indexing to only print out the first 10 PCs: [, 1:10]
.
# We can access the results from our `pca` object using `$x`
-head(pca$x[, 1:10])
-## PC1 PC2 PC3 PC4 PC5 PC6
-## GSM917111 -18.1468442 -63.659225 1.136901 -5.242875 10.29114 -17.724403
-## GSM917250 -51.2350460 45.762776 -2.183100 7.948304 -21.26658 -24.668972
-## GSM917281 -42.4774592 2.216351 19.941918 -5.664602 -49.00085 -13.306383
-## GSM917062 -7.6116070 -25.887243 -65.257099 -6.226487 32.73786 -7.159006
-## GSM917288 -54.9540801 51.804918 42.332093 28.506307 54.17483 46.601027
-## GSM917230 -0.0325771 -11.070407 -6.555240 28.661922 -20.85879 17.266892
-## PC7 PC8 PC9 PC10
-## GSM917111 6.527599 -15.023332 7.640187 -16.155061
-## GSM917250 -8.472525 -1.592417 -6.242858 -5.730141
-## GSM917281 39.499111 3.513968 -18.804197 -3.578259
-## GSM917062 -10.549933 -47.861391 3.256229 -29.137628
-## GSM917288 24.893176 5.242399 67.374223 -8.003784
-## GSM917230 19.109068 33.763989 -17.114146 1.061423
-In total, we do have 285 principal component values, because we provided 285 sample’s data.
+In this code chunk, we are going to perform Principal Component Analysis (PCA) on our data and create a data frame using the PCA scores and the variables from our metadata that we will use to annotate our plot later. We are using the base R prcomp()
function here. The prcomp()
function calculates principal component scores for each row of a matrix, but our data is arranged with each sample in a column, so we will need to transpose the data frame first. In most cases, we will want to use the scale = TRUE
argument so that all of the expression measurements have the same variance. This prevents the PCA results from being dominated by a few highly variable genes.
# Perform Principal Component Analysis (PCA) using the `prcomp()` function
+pca <- prcomp(
+ t(df), # transpose our data frame to obtain PC scores for samples, not genes
+ scale = TRUE # we want the data scaled to have unit variance for each gene
+)
Let’s take a preview of the PCA results. Each row will be a sample, as in the transposed data matrix we used as input, and each column is one of the new principal component (PC) values. We are using indexing to only print out the first 5 PC columns: [, 1:5]
.
## PC1 PC2 PC3 PC4 PC5
+## GSM917111 -18.1468442 -63.659225 1.136901 -5.242875 10.29114
+## GSM917250 -51.2350460 45.762776 -2.183100 7.948304 -21.26658
+## GSM917281 -42.4774592 2.216351 19.941918 -5.664602 -49.00085
+## GSM917062 -7.6116070 -25.887243 -65.257099 -6.226487 32.73786
+## GSM917288 -54.9540801 51.804918 42.332093 28.506307 54.17483
+## GSM917230 -0.0325771 -11.070407 -6.555240 28.661922 -20.85879
+In total, we have 285 principal component values, because we provided 285 samples’ data (we will always have as many PCs as the smaller dimension of the input data matrix).
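We can check that claim about the number of PCs on a small random matrix; this toy uses 5 "samples" and 20 "genes", so we expect min(5, 20) = 5 principal components:

```r
# Toy data: 5 "samples" (rows, as in the transposed data) by 20 "genes"
set.seed(2020)
toy <- matrix(rnorm(5 * 20), nrow = 5, ncol = 20)
toy_pca <- prcomp(toy, scale = TRUE)

ncol(toy_pca$x) # 5 PCs: the smaller dimension of the input
```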
Before visualizing and interpreting the results, it can be useful to understand the proportion of variance explained by each principal component. The principal components are automatically ordered by the variance they explain, meaning PC1 would always be the principal component that explains the most variance in your dataset. If the largest variance component, PC1, explained 96% of the variance in your data and very clearly showed a difference between sample batches you would be very concerned about your dataset! On the other hand, if a separation of batches was apparent in a different principal component that explained a low proportion of variance and the first few PCs explained most of the variance and appeared to correspond to something like tissue type and treatment, you would be less concerned (CCDL 2020).
+Before visualizing and interpreting the results, it can be useful to understand the proportion of variance explained by each principal component. The principal components are automatically ordered by the variance they explain, meaning PC1 would always be the principal component that explains the most variance in your dataset. If the largest variance component, PC1, explained 96% of the variance in your data and very clearly showed a difference between sample batches you would be very concerned about your dataset! On the other hand, if a separation of batches was apparent in a different principal component that explained a low proportion of variance and the first few PCs explained most of the variance and appeared to correspond to something like tissue type and treatment, you would be less concerned (Childhood Cancer Data Lab 2020).
The summary()
function reports the proportion of variance explained by each principal component.
# Save the summary of the PCA results using the `summary()` function
-pca_summary <- summary(pca)
-By accessing the importance
element, which contains the proportion of variance explained by each principal component, with pca_summary$importance
, we can use indexing to only look at the first n
PCs.
# Now access the importance information for the first 10 PCs -- we can access this information `pca_summary$importance`
-pca_summary$importance[, 1:10]
-## PC1 PC2 PC3 PC4 PC5 PC6
-## Standard deviation 45.50502 33.87890 30.50038 27.51175 23.99828 22.88496
-## Proportion of Variance 0.09560 0.05299 0.04295 0.03494 0.02659 0.02418
-## Cumulative Proportion 0.09560 0.14858 0.19153 0.22647 0.25306 0.27724
-## PC7 PC8 PC9 PC10
-## Standard deviation 21.17880 18.73075 18.13267 17.95956
-## Proportion of Variance 0.02071 0.01620 0.01518 0.01489
-## Cumulative Proportion 0.29795 0.31414 0.32932 0.34421
-Now that we’ve seen the proportion of variance for the first ten PCs, let’s prepare and plot the PC scores for the first two principal components, the components responsible for the most explained proportion of variance in our dataset.
+The importance element of the summary object contains the proportion of variance explained by each principal component, along with other statistics. With pca_summary$importance, we can use indexing to only look at the first n PCs.
## PC1 PC2 PC3 PC4 PC5
+## Standard deviation 45.50502 33.87890 30.50038 27.51175 23.99828
+## Proportion of Variance 0.09560 0.05299 0.04295 0.03494 0.02659
+## Cumulative Proportion 0.09560 0.14858 0.19153 0.22647 0.25306
+Now that we’ve seen the proportion of variance for the first set of PCs, let’s prepare and plot the PC scores for the first two principal components, the components that explain the largest proportion of the expression variance in our dataset. (Note though, that in this case, they explain less than 15% of the total variance!)
In the next chunk, we are going to extract the first two principal components from our pca
object to prepare a data frame for plotting.
# Make the first two principal components into a data frame for plotting with `ggplot2`
-pca_df <- data.frame(pca$x[, 1:2]) %>%
- # Turn samples_ids stored as rownames into column
- tibble::rownames_to_column("refinebio_accession_code") %>%
- # Bring only the variables that we want from the metadata into this data frame -- here we are going to join by `refinebio_accession_code` values
- dplyr::inner_join(dplyr::select(metadata, refinebio_accession_code, histology, subgroup),
- by = "refinebio_accession_code"
- )
+# Make the first two PCs into a data frame for plotting with `ggplot2`
+pca_df <- data.frame(pca$x[, 1:2]) %>%
+ # Turn samples IDs stored as row names into a column
+ tibble::rownames_to_column("refinebio_accession_code") %>%
+ # Bring only the variables that we want from the metadata into this data frame
+ # here we are going to join by `refinebio_accession_code` values
+ dplyr::inner_join(
+ dplyr::select(metadata, refinebio_accession_code, histology, subgroup),
+ by = "refinebio_accession_code"
+ )
Now let’s plot the PC scores for the first two principal components since we know that they are responsible for the most explained proportion of variance in our dataset.
-Let’s also label the data points based on their genotype subgroup since medulloblastoma has been found to comprise of subgroups that each have molecularly distinct profiles (Northcott et al. 2012).
-# Make a scatterplot using `ggplot2` functionality
-pca_plot <- ggplot(
- pca_df,
- aes(
- x = PC1,
- y = PC2,
- color = subgroup # This will label points with different colors for each `subgroup`
- )
-) +
- geom_point() + # This tells R that we want a scatterplot
- theme_classic() # This tells R to return a classic-looking plot with no gridlines
-
-# Print out plot here
-pca_plot
+Now let’s plot the PC scores for the first two principal components.
+Let’s also label the data points based on their genotype subgroup since medulloblastoma has been found to comprise of subgroups that each have molecularly distinct profiles (Northcott et al. 2012).
+# Make a scatterplot using `ggplot2` functionality
+pca_plot <- ggplot(
+ pca_df,
+ aes(
+ x = PC1,
+ y = PC2,
+ color = subgroup # label points with different colors for each `subgroup`
+ )
+) +
+ geom_point() + # Plot individual points to make a scatterplot
+ theme_classic() # Format as a classic-looking plot with no gridlines
+
+# Print out the plot here
+pca_plot
Looks like Group 4 and SHH groups somewhat cluster with each other but Group 3 seems to be less distinct as there are some samples clustering with Group 4 as well.
+Looks like Group 4 and SHH groups cluster with each other somewhat, but Group 3 seems to be less distinct, as there are some samples clustering with Group 4 as well. Most of the differences that we see between the groups are along the first axis of variation, PC1.
We can add another label to our plot to get more information about our dataset. Let’s also label the data points based on the histological subtype that each sample belongs to.
-# Make a scatterplot with ggplot2
-pca_plot <- ggplot(
- pca_df,
- aes(
- x = PC1,
- y = PC2,
- color = subgroup, # This will label points with different colors for each `subgroup`
- shape = histology # This will label points with different colors for each `histology` group
- )
-) +
- geom_point() +
- theme_classic()
-
-# Print out plot here
-pca_plot
+# Make a scatterplot with ggplot2
+pca_plot <- ggplot(
+ pca_df,
+ aes(
+ x = PC1,
+ y = PC2,
+ color = subgroup, # Draw points with different colors for each `subgroup`
+ shape = histology # Use a different shape for each `histology` group
+ )
+) +
+ geom_point() +
+ theme_classic()
+
+# Print out the plot here
+pca_plot
Adding the histological subtype label to our plot made our plot more informative, but the diffuse Group 3 data doesn’t appear to be related to a histology subtype. We could test out other variables as annotation labels to get a further understanding of the cluster behavior of each subgroup.
+Adding the histological subtype label to our plot made our plot more informative, but the diffuse Group 3 data doesn’t appear to be related to a histology subtype. We could test out other variables as annotation labels to get a further understanding of the cluster behavior of each subgroup, or plot other PC values to see if they might also reveal some structure in the data.
Now that we have an annotated PCA plot, let’s save it!
You can easily switch this to save to a JPEG or TIFF by changing the file name within the ggsave()
function to the respective file suffix.
# Save plot using `ggsave()` function
-ggsave(file.path(
- plots_dir,
- "GSE37382_pca_scatterplot.png" # Replace with name relevant your plotted data
-),
-plot = pca_plot # Here we are giving the function the plot object that we want saved to file
-)
+# Save plot using `ggsave()` function
+ggsave(
+ file.path(
+ plots_dir,
+ "GSE37382_pca_scatterplot.png" # Replace with a good file name for your plot
+ ),
+ plot = pca_plot # The plot object that we want saved to file
+)
## Saving 7 x 5 in image
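The `## Saving 7 x 5 in image` message reflects `ggsave()`'s default dimensions; the function also accepts explicit sizes. A sketch, assuming the `plots_dir` and `pca_plot` objects created in the chunks above:

```r
# Save with explicit dimensions instead of relying on the 7 x 5 inch default
ggsave(
  file.path(plots_dir, "GSE37382_pca_scatterplot.png"),
  plot = pca_plot,
  width = 8,    # plot width
  height = 6,   # plot height
  units = "in"  # dimensions are in inches
)
```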
ggplot2 (Prabhakaran 2016)
At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.
-# Print session info
-sessioninfo::session_info()
-## ─ Session info ───────────────────────────────────────────────────────────────
+
+## ─ Session info ─────────────────────────────────────────────────────
## setting value
## version R version 4.0.2 (2020-06-22)
## os Ubuntu 20.04 LTS
@@ -3261,13 +4086,13 @@ 6 Session info
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
-## date 2020-10-16
+## date 2020-12-14
##
-## ─ Packages ───────────────────────────────────────────────────────────────────
+## ─ Packages ─────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.0)
## backports 1.1.10 2020-09-15 [1] RSPM (R 4.0.2)
-## cli 2.0.2 2020-02-28 [1] RSPM (R 4.0.0)
+## cli 2.1.0 2020-10-12 [1] RSPM (R 4.0.2)
## colorspace 1.4-1 2019-03-18 [1] RSPM (R 4.0.0)
## crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.0)
## digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0)
@@ -3291,16 +4116,16 @@ 6 Session info
## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0)
## pillar 1.4.6 2020-07-10 [1] RSPM (R 4.0.2)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.0)
+## ps 1.4.0 2020-10-07 [1] RSPM (R 4.0.2)
## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.0)
## R.cache 0.14.0 2019-12-06 [1] RSPM (R 4.0.0)
## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.2)
## R.oo 1.24.0 2020-08-26 [1] RSPM (R 4.0.2)
## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.2)
## R6 2.4.1 2019-11-12 [1] RSPM (R 4.0.0)
-## Rcpp 1.0.5 2020-07-06 [1] RSPM (R 4.0.2)
-## readr 1.3.1 2018-12-21 [1] RSPM (R 4.0.2)
+## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.2)
## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.0)
-## rlang 0.4.7 2020-07-09 [1] RSPM (R 4.0.2)
+## rlang 0.4.8 2020-10-08 [1] RSPM (R 4.0.2)
## rmarkdown 2.4 2020-09-30 [1] RSPM (R 4.0.2)
## rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0)
## scales 1.1.1 2020-05-11 [1] RSPM (R 4.0.0)
@@ -3308,7 +4133,7 @@ 6 Session info
## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.2)
## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.0)
## styler 1.3.2 2020-02-23 [1] RSPM (R 4.0.0)
-## tibble 3.0.3 2020-07-10 [1] RSPM (R 4.0.2)
+## tibble 3.0.4 2020-10-12 [1] RSPM (R 4.0.2)
## tidyselect 1.1.0 2020-05-11 [1] RSPM (R 4.0.0)
## vctrs 0.3.4 2020-08-29 [1] RSPM (R 4.0.2)
## withr 2.3.0 2020-09-22 [1] RSPM (R 4.0.2)
@@ -3322,10 +4147,10 @@ 6 Session info
References
-Brems M., 2017 A one-stop shop for principal component analysis
+Brems M., 2017 A one-stop shop for principal component analysis. https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c
-CCDL, 2020 OpenPBTA: Cluster validation.
+Childhood Cancer Data Lab, 2020 OpenPBTA: Cluster validation. https://github.com/AlexsLemonade/training-modules/blob/3dbc6f3f53c680ec6aa2f513851c1cd4635cc31c/machine-learning/02-openpbta_consensus_clustering.Rmd#L310
Nguyen L. H., and S. Holmes, 2019 Ten quick tips for effective dimensionality reduction. PLOS Computational Biology 15. https://doi.org/10.1371/journal.pcbi.1006907
@@ -3334,14 +4159,19 @@ References
Northcott P., D. Shih, J. Peacock, L. Garzia, and S. Morrissy et al., 2012 Subgroup specific structural variation across 1,000 medulloblastoma genomes. Nature 488. https://doi.org/10.1038/nature11327
-Powell V., and L. Lehe, Principal component analysis explained visually
+Powell V., and L. Lehe, Principal component analysis explained visually. https://setosa.io/ev/principal-component-analysis/
-Prabhakaran S., 2016 The complete ggplot2 tutorial.
+Prabhakaran S., 2016 The complete ggplot2 tutorial. http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html
.Rmd
Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!
If you have trouble running this chunk, see our introduction to using .Rmd
s for more resources and explanations.
# Create the data folder if it doesn't exist
-if (!dir.exists("data")) {
- dir.create("data")
-}
-
-# Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
-
-# Create the plots folder if it doesn't exist
-if (!dir.exists(plots_dir)) {
- dir.create(plots_dir)
-}
-
-# Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
-
-# Create the results folder if it doesn't exist
-if (!dir.exists(results_dir)) {
- dir.create(results_dir)
-}
+# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+ dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots"
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+ dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results"
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+ dir.create(results_dir)
+}
In the same place you put this .Rmd
file, you should now have three new empty folders called data
, plots
, and results
!
In order for our example here to run without a hitch, we need these files to be in these locations, so we’ve constructed a test to check for them before we get started with the analysis. These chunks will declare your file paths and double-check that your files are in the right place.
First we will declare the file paths to our data and metadata files, which should be in our data directory. This is handy because if we want to switch the dataset we are using for this analysis (see the next section for more on this), we will only have to change the file paths here to get started.
-# Define the file path to the data directory
-data_dir <- file.path("data", "GSE37382") # Replace with accession number which will be the name of the folder the files will be in
-
-# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
-data_file <- file.path(data_dir, "GSE37382.tsv") # Replace with file path to your dataset
-
-# Declare the file path to the metadata file using the data directory saved as `data_dir`
-metadata_file <- file.path(data_dir, "metadata_GSE37382.tsv") # Replace with file path to your metadata
+# Define the file path to the data directory
+# Replace with the path of the folder the files will be in
+data_dir <- file.path("data", "GSE37382")
+
+# Declare the file path to the gene expression matrix file
+# inside directory saved as `data_dir`
+# Replace with the path to your dataset file
+data_file <- file.path(data_dir, "GSE37382.tsv")
+
+# Declare the file path to the metadata file
+# inside the directory saved as `data_dir`
+# Replace with the path to your metadata file
+metadata_file <- file.path(data_dir, "metadata_GSE37382.tsv")
Now that our file paths are declared, we can use the file.exists()
function to check that the files are where we specified above.
# Check if the gene expression matrix file is at the file path stored in `data_file`
-file.exists(data_file)
+# Check if the gene expression matrix file is at the path stored in `data_file`
+file.exists(data_file)
## [1] TRUE
-# Check if the metadata file is at the file path stored in `metadata_file`
-file.exists(metadata_file)
+# Check if the metadata file is at the file path stored in `metadata_file`
+file.exists(metadata_file)
## [1] TRUE
If the chunk above printed out FALSE
for either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.
@@ -3072,31 +3903,32 @@See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.
-In this analysis, we will be using the R package umap
(Konopka 2020) for the production of UMAP dimension reduction values and the R package ggplot2
(Prabhakaran 2016) for plotting the UMAP values.
if (!("umap" %in% installed.packages())) {
- # Install umap package
- BiocManager::install("umap", update = FALSE)
-}
+In this analysis, we will be using the R package umap
(Konopka 2020) for the production of UMAP dimension reduction values and the R package ggplot2
(Prabhakaran 2016) for plotting the UMAP values.
if (!("umap" %in% installed.packages())) {
+ # Install umap package
+ BiocManager::install("umap", update = FALSE)
+}
Attach the packages we need for this analysis:
-# Attach the `umap` library
-library(umap)
-
-# Attach the `ggplot2` library
-library(ggplot2)
-
-# We will need this so we can use the pipe: %>%
-library(magrittr)
+# Attach the `umap` library
+library(umap)
+
+# Attach the `ggplot2` library
+library(ggplot2)
+
+# We will need this so we can use the pipe: %>%
+library(magrittr)
The UMAP algorithm utilizes random sampling, so we are going to set the seed to make our results reproducible.
-# Set the seed so our results are reproducible:
-set.seed(12345)
+
Data downloaded from refine.bio include a metadata tab-separated values (TSV) file and a data TSV file. This chunk of code will read both TSV files and add them as data frames to your environment.
We stored our file paths as objects named metadata_file
and data_file
in this previous step.
# Read in metadata TSV file
-metadata <- readr::read_tsv(metadata_file)
-## Parsed with column specification:
+
+##
+## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_character(),
## refinebio_age = col_double(),
@@ -3116,130 +3948,148 @@ 4.2 Import and set up data
## `contact_zip/postal_code` = col_double(),
## data_row_count = col_double(),
## taxid_ch1 = col_double()
-## )
-## See spec(...) for full column specifications.
-# Read in data TSV file
-df <- readr::read_tsv(data_file) %>%
- # Tuck away the gene ID column as rownames
- tibble::column_to_rownames("Gene")
-## Parsed with column specification:
+## )
+## ℹ Use `spec()` for the full column specifications.
+# Read in data TSV file
+df <- readr::read_tsv(data_file) %>%
+ # Tuck away the gene ID column as row names, leaving only numeric values
+ tibble::column_to_rownames("Gene")
+##
+## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_double(),
## Gene = col_character()
## )
-## See spec(...) for full column specifications.
+## ℹ Use `spec()` for the full column specifications.
Let’s ensure that the metadata and data are in the same sample order.
-# Make the data in the order of the metadata
-df <- df %>%
- dplyr::select(metadata$geo_accession)
-
-# Check if this is in the same order
-all.equal(colnames(df), metadata$geo_accession)
+# Make sure the columns (samples) are in the same order as the metadata
+df <- df %>%
+ dplyr::select(metadata$geo_accession)
+
+# Check if this is in the same order
+all.equal(colnames(df), metadata$geo_accession)
## [1] TRUE
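If `all.equal()` ever returns a mismatch message instead of `TRUE`, you can reorder the data columns by matching them against the metadata IDs. A minimal base-R sketch with toy objects (the names here are illustrative, not the notebook's real data):

```r
# Toy expression data frame with samples as columns, in scrambled order
df <- data.frame(S3 = 1:3, S1 = 4:6, S2 = 7:9)

# Toy metadata with the desired sample order
metadata <- data.frame(geo_accession = c("S1", "S2", "S3"))

# Reorder the data columns to match the metadata order
df <- df[, match(metadata$geo_accession, colnames(df))]

# Confirm that data and metadata are now in the same sample order
all.equal(colnames(df), metadata$geo_accession)
```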
+Now we are going to use a combination of functions from the umap
and the ggplot2
packages to perform and visualize the results of the Uniform Manifold Approximation (UMAP) dimension reduction technique on our medulloblastoma samples.
Now we are going to use a combination of functions from the umap
and the ggplot2
packages to perform and visualize the results of the Uniform Manifold Approximation and Projection (UMAP) dimension reduction technique on our medulloblastoma samples.
In this code chunk, we are going to perform Uniform Manifold Approximation (UMAP) on our data and create a data frame using the UMAP scores and the variables from our metadata that we are going to use to annotate our plot later.
-# Perform Uniform Manifold Approximation (UMAP) using the `umap::umap()` function
-umap_data <- umap::umap(
- t(df) # We have to transpose our data frame so we are obtaining UMAP scores for samples instead of genes
-)
-
-# Make into data frame for plotting with `ggplot2`
-umap_df <- data.frame(umap_data$layout) %>% # The umap values we need for plotting are stored in this `layout` element
- # Turn samples_ids stored as rownames into column
- tibble::rownames_to_column("refinebio_accession_code") %>%
- # Bring only the variables that we want from the metadata into this data frame; match by sample ids
- dplyr::inner_join(dplyr::select(metadata, refinebio_accession_code, histology, subgroup),
- by = "refinebio_accession_code"
- )
-Let’s take a preview at the data frame we created in the chunk above.
-head(umap_df)
+In this code chunk, we are going to perform Uniform Manifold Approximation and Projection (UMAP) on our data and create a data frame using the UMAP scores and the variables from our metadata that we are going to use to annotate our plot later. The umap()
function calculates scores for each row of a matrix, but our data is arranged with each sample in a column, so we will need to transpose the data frame first.
# Perform UMAP using the `umap::umap()` function
+umap_data <- umap::umap(
+ t(df) # transpose our data frame to obtain UMAP scores for samples, not genes
+)
+
+# Make into data frame for plotting with `ggplot2`
+# The UMAP values we need for plotting are stored in the `layout` element
+umap_df <- data.frame(umap_data$layout) %>%
+ # Turn sample IDs stored as row names into a column
+ tibble::rownames_to_column("refinebio_accession_code") %>%
+ # Add on the variables that we want from the metadata into this data frame;
+ # match by sample IDs
+ dplyr::inner_join(
+ dplyr::select(metadata, refinebio_accession_code, histology, subgroup),
+ by = "refinebio_accession_code"
+ )
Let’s take a look at the data frame we created in the chunk above.
+Now, let’s plot the UMAP scores.
-# Make a scatterplot using `ggplot2` functionality
-umap_plot <- ggplot(
- umap_df,
- aes(
- x = X1,
- y = X2
- )
-) +
- geom_point() + # This tells R that we want a scatterplot
- theme_classic() # This tells R to return a classic-looking plot with no gridlines
-
-# Print out plot here
-umap_plot
-
+Here we can see that UMAP took the data from thousands of genes, and reduced it to just two variables, X1
and X2
. Now, let’s plot those UMAP scores.
# Make a scatterplot using `ggplot2` functionality
+umap_plot <- ggplot(
+ umap_df,
+ aes(
+ x = X1,
+ y = X2
+ )
+) +
+ geom_point() + # Plot individual points to make a scatterplot
+ theme_classic() # Format as a classic-looking plot with no gridlines
+
+# Print out the plot here
+umap_plot
It’s hard to interpret our UMAP results without some metadata labels on our plot.
-Let’s label the data points based on their genotype subgroup since this is central to the subgroup specific based hypothesis in the original paper (Northcott et al. 2012).
-# Make a scatterplot with ggplot2
-umap_plot <- ggplot(
- umap_df,
- aes(
- x = X1,
- y = X2,
- color = subgroup # This will label points with different colors for each `subgroup`
- )
-) +
- geom_point() +
- theme_classic()
-
-# Print out plot here
-umap_plot
-
-It looks like Group 4 and SHH groups somewhat cluster with each other but Group 3 seems to also be clustering with Group 4.
+Let’s label the data points based on their genotype subgroup since this is central to the subgroup-specific hypothesis in the original paper (Northcott et al. 2012).
+# Make a scatterplot with ggplot2
+umap_plot <- ggplot(
+ umap_df,
+ aes(
+ x = X1,
+ y = X2,
+ color = subgroup # label points with different colors for each `subgroup`
+ )
+) +
+ geom_point() +
+ theme_classic()
+
+# Print out the plot here
+umap_plot
It looks like SHH clusters pretty distinctly, with Group 3 and Group 4 being more similar and grouping together (with some division).
We can add another label to our plot to potentially gain insight on the clustering behavior of our data. Let’s also label the data points based on the histological subtype that each sample belongs to.
-# Make a scatterplot with ggplot2
-umap_plot <- ggplot(
- umap_df,
- aes(
- x = X1,
- y = X2,
- color = subgroup, # This will label points with different colors for each `subgroup`
- shape = histology # This will label points with different colors for each `histology` group
- )
-) +
- geom_point() +
- theme_classic()
-
-# Print out plot here
-umap_plot
-
+# Make a scatterplot with ggplot2
+umap_plot <- ggplot(
+ umap_df,
+ aes(
+ x = X1,
+ y = X2,
+ color = subgroup, # Draw points with different colors for each `subgroup`
+ shape = histology # Use a different shape for each `histology` group
+ )
+) +
+ geom_point() +
+ theme_classic()
+
+# Print out plot here
+umap_plot
Our histological subtype groups don’t appear to be clustering in a discernible pattern. We could test out other variables as annotation labels to get a further understanding of the cluster behavior of each subgroup.
+In summary, a good rule of thumb to remember is: if the results of an analysis can be completely changed by changing its parameters, you should be very cautious about the conclusions you draw from it, and you should have a good rationale for the parameters you choose (adapted from Childhood Cancer Data Lab (2020) training materials).
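One way to see this parameter sensitivity for yourself is to rerun UMAP with a different neighborhood size and compare the layouts. The `umap` package's documented custom-settings mechanism (`umap.defaults` plus the `config` argument) supports this; below is a sketch on simulated data, with illustrative object names:

```r
library(umap)

# Simulate a small data matrix: 60 samples x 25 features
set.seed(12345)
mat <- matrix(rnorm(60 * 25), nrow = 60)

# Start from the package defaults and shrink the neighborhood size
custom_settings <- umap.defaults
custom_settings$n_neighbors <- 5

# Run UMAP with the default settings and with the custom settings
umap_default <- umap::umap(mat)
umap_small_nbrs <- umap::umap(mat, config = custom_settings)

# Plot both layouts side by side to compare the structure each suggests
par(mfrow = c(1, 2))
plot(umap_default$layout, main = "n_neighbors = 15 (default)")
plot(umap_small_nbrs$layout, main = "n_neighbors = 5")
```

Smaller `n_neighbors` values emphasize local structure and can fragment the plot into many small clumps, while larger values emphasize global structure.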
+You can easily switch this to save to a JPEG or TIFF by changing the file name within the ggsave()
function to the respective file suffix.
# Save plot using `ggsave()` function
-ggsave(file.path(
- plots_dir,
- "GSE37382_umap_scatterplot.png" # Replace with name relevant your plotted data
-),
-plot = umap_plot # Here we are giving the function the plot object that we want saved to file
-)
+# Save plot using `ggsave()` function
+ggsave(
+ file.path(
+ plots_dir,
+ "GSE37382_umap_plot.png" # Replace with a good file name for your plot
+ ),
+ plot = umap_plot # The plot object that we want saved to file
+)
## Saving 7 x 5 in image
At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.
-# Print session info
-sessioninfo::session_info()
-## ─ Session info ───────────────────────────────────────────────────────────────
+
+## ─ Session info ─────────────────────────────────────────────────────
## setting value
## version R version 4.0.2 (2020-06-22)
## os Ubuntu 20.04 LTS
@@ -3249,14 +4099,14 @@ 6 Session info
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
-## date 2020-10-16
+## date 2020-12-14
##
-## ─ Packages ───────────────────────────────────────────────────────────────────
+## ─ Packages ─────────────────────────────────────────────────────────
## package * version date lib source
## askpass 1.1 2019-01-13 [1] RSPM (R 4.0.0)
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.0)
## backports 1.1.10 2020-09-15 [1] RSPM (R 4.0.2)
-## cli 2.0.2 2020-02-28 [1] RSPM (R 4.0.0)
+## cli 2.1.0 2020-10-12 [1] RSPM (R 4.0.2)
## colorspace 1.4-1 2019-03-18 [1] RSPM (R 4.0.0)
## crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.0)
## digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0)
@@ -3284,6 +4134,7 @@ 6 Session info
## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0)
## pillar 1.4.6 2020-07-10 [1] RSPM (R 4.0.2)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.0)
+## ps 1.4.0 2020-10-07 [1] RSPM (R 4.0.2)
## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.0)
## R.cache 0.14.0 2019-12-06 [1] RSPM (R 4.0.0)
## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.2)
@@ -3291,10 +4142,10 @@ 6 Session info
## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.2)
## R6 2.4.1 2019-11-12 [1] RSPM (R 4.0.0)
## Rcpp 1.0.5 2020-07-06 [1] RSPM (R 4.0.2)
-## readr 1.3.1 2018-12-21 [1] RSPM (R 4.0.2)
+## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.2)
## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.0)
## reticulate 1.16 2020-05-27 [1] RSPM (R 4.0.2)
-## rlang 0.4.7 2020-07-09 [1] RSPM (R 4.0.2)
+## rlang 0.4.8 2020-10-08 [1] RSPM (R 4.0.2)
## rmarkdown 2.4 2020-09-30 [1] RSPM (R 4.0.2)
## RSpectra 0.16-0 2019-12-01 [1] RSPM (R 4.0.2)
## rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0)
@@ -3303,7 +4154,7 @@ 6 Session info
## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.2)
## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.0)
## styler 1.3.2 2020-02-23 [1] RSPM (R 4.0.0)
-## tibble 3.0.3 2020-07-10 [1] RSPM (R 4.0.2)
+## tibble 3.0.4 2020-10-12 [1] RSPM (R 4.0.2)
## tidyselect 1.1.0 2020-05-11 [1] RSPM (R 4.0.0)
## umap * 0.2.6.0 2020-06-16 [1] RSPM (R 4.0.2)
## vctrs 0.3.4 2020-08-29 [1] RSPM (R 4.0.2)
@@ -3317,27 +4168,35 @@ 6 Session info
References
+
+Childhood Cancer Data Lab, 2020 scRNA-seq dimension reduction. https://github.com/AlexsLemonade/training-modules/blob/3dbc6f3f53c680ec6aa2f513851c1cd4635cc31c/scRNA-seq/05-dimension_reduction_scRNA-seq.Rmd#L382
+
-Konopka T., 2020 Uniform manifold approximation and projection.
+Konopka T., 2020 Uniform manifold approximation and projection. https://cran.r-project.org/web/packages/umap/umap.pdf
-McInnes L., 2018 How umap works
+McInnes L., 2018 How UMAP works. https://umap-learn.readthedocs.io/en/latest/how_umap_works.html#
-McInnes L., J. Healy, and J. Melville, 2018 UMAP: Uniform manifold approximation and projection for dimension reduction
+McInnes L., J. Healy, and J. Melville, 2018 UMAP: Uniform manifold approximation and projection for dimension reduction. https://arxiv.org/abs/1802.03426
Northcott P., D. Shih, J. Peacock, L. Garzia, and S. Morrissy et al., 2012 Subgroup specific structural variation across 1,000 medulloblastoma genomes. Nature 488. https://doi.org/10.1038/nature11327
-Prabhakaran S., 2016 The complete ggplot2 tutorial.
+Prabhakaran S., 2016 The complete ggplot2 tutorial. http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html
-R CRAN Team, 2019 Uniform manifold approximation and projection in r.
+R CRAN Team, 2019 Uniform manifold approximation and projection in R. https://cran.r-project.org/web/packages/umap/vignettes/umap.html
+
The purpose of this notebook is to provide an example of mapping gene IDs for microarray data obtained from refine.bio using AnnotationDbi
packages (Carlson 2020a).
The purpose of this notebook is to provide an example of mapping gene IDs for microarray data obtained from refine.bio using AnnotationDbi
packages (Pagès et al. 2020).
.Rmd
Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!
If you have trouble running this chunk, see our introduction to using .Rmd
s for more resources and explanations.
# Create the data folder if it doesn't exist
-if (!dir.exists("data")) {
- dir.create("data")
-}
-
-# Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
-
-# Create the plots folder if it doesn't exist
-if (!dir.exists(plots_dir)) {
- dir.create(plots_dir)
-}
-
-# Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
-
-# Create the results folder if it doesn't exist
-if (!dir.exists(results_dir)) {
- dir.create(results_dir)
-}
+# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+ dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots"
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+ dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results"
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+ dir.create(results_dir)
+}
In the same place you put this .Rmd
file, you should now have three new empty folders called data
, plots
, and results
!
For this example analysis, we will use this mouse glioma stem cells dataset.
-This dataset has 15 microarray mouse glioma model samples. The samples were obtained from parental biological replicates and from resistant sub-line biological replicates that were transplanted into recipient mice.
+This dataset has 15 microarrays measuring gene expression in a transgenic mouse model of glioma. The authors compared cells from side populations and non-side populations in both tumor samples and normal neural stem cells.
data/
folderIn order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.
First we will declare the file paths to our data and metadata files, which should be in our data directory. This is handy because if we want to switch the dataset we are using for this analysis (see the next section for more on this), we will only have to change the file paths here to get started.
-# Define the file path to the data directory
-data_dir <- file.path("data", "GSE13490") # Replace with accession number which will be the name of the folder the files will be in
-
-# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
-data_file <- file.path(data_dir, "GSE13490.tsv") # Replace with file path to your dataset
-
-# Declare the file path to the metadata file using the data directory saved as `data_dir`
-metadata_file <- file.path(data_dir, "metadata_GSE13490.tsv") # Replace with file path to your metadata
+# Define the file path to the data directory
+# Replace with the path of the folder the files will be in
+data_dir <- file.path("data", "GSE13490")
+
+# Declare the file path to the gene expression matrix file
+# inside directory saved as `data_dir`
+# Replace with the path to your dataset file
+data_file <- file.path(data_dir, "GSE13490.tsv")
+
+# Declare the file path to the metadata file
+# inside the directory saved as `data_dir`
+# Replace with the path to your metadata file
+metadata_file <- file.path(data_dir, "metadata_GSE13490.tsv")
Now that our file paths are declared, we can use the file.exists()
function to check that the files are where we specified above.
# Check if the gene expression matrix file is at the file path stored in `data_file`
-file.exists(data_file)
+# Check if the gene expression matrix file is at the path stored in `data_file`
+file.exists(data_file)
## [1] TRUE
-# Check if the metadata file is at the file path stored in `metadata_file`
-file.exists(metadata_file)
+# Check if the metadata file is at the file path stored in `metadata_file`
+file.exists(metadata_file)
## [1] TRUE
If the chunk above printed out FALSE
for either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.
@@ -3062,73 +3893,39 @@If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/
directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/
and results/
directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.
refine.bio data comes with gene level data with Ensembl IDs. Although this example notebook uses Ensembl IDs from Mouse, (Mus musculus), to obtain gene symbols this notebook can be easily converted for use with different species or annotation types e.g. protein IDs, gene ontology, accession numbers.
-For different species, wherever the abbreviation org.Mm.eg.db
or Mm
is written, it must be replaced with the respective species abbreviation e.g. for Homo sapiens org.Hs.eg.db
or Hs
would be used. In the case of our RNA-seq gene identifier annotation example notebook, a Zebrafish (Danio rerio) dataset is used, meaning org.Dr.eg.db
or Dr
would also need to be used there. A full list of the annotation R packages from Bioconductor is at this link (R Bioconductor Team 2003).
refine.bio data comes with gene-level data identified by Ensembl IDs. Although this example notebook uses Ensembl IDs from Mouse (Mus musculus) to obtain gene symbols, this notebook can be easily converted for use with different species or annotation types, e.g. protein IDs, gene ontology, accession numbers.
+For different species, wherever the abbreviation org.Mm.eg.db
or Mm
is written, it must be replaced with the respective species abbreviation e.g. for Homo sapiens org.Hs.eg.db
or Hs
would be used. In the case of our RNA-seq gene identifier annotation example notebook, a Zebrafish (Danio rerio) dataset is used, meaning org.Dr.eg.db
or Dr
would also need to be used there. A full list of the annotation R packages from Bioconductor is at this link.
Ensembl IDs can be used to obtain various different annotations at the gene/transcript level. Let’s get ready to use the Ensembl IDs from our mouse dataset to obtain the associated gene symbols.
+refine.bio uses Ensembl IDs as the primary gene identifier in its data sets. While this is a consistent and useful identifier, a string of apparently random letters and numbers is not the most user-friendly or informative for interpretation. Luckily, we can use the Ensembl IDs that we have to obtain various annotations at the gene/transcript level. Let’s get ready to use the Ensembl IDs from our mouse dataset to obtain the associated gene symbols.
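As a quick preview of what this mapping looks like, `AnnotationDbi::mapIds()` can translate a vector of Ensembl IDs into gene symbols using the `org.Mm.eg.db` annotation package. This sketch assumes that package is installed; the two Ensembl IDs are illustrative examples only:

```r
# Attach the mouse annotation package (loads AnnotationDbi as a dependency)
library(org.Mm.eg.db)

# Map a couple of mouse Ensembl gene IDs to gene symbols
mapIds(
  org.Mm.eg.db,
  keys = c("ENSMUSG00000000001", "ENSMUSG00000000028"),
  keytype = "ENSEMBL", # the type of IDs we are supplying
  column = "SYMBOL",   # the annotation we want back
  multiVals = "first"  # if an ID maps to several symbols, keep the first
)
```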
See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.
-In this analysis, we will be using the org.Mm.eg.db
R package (Carlson 2019). Other species can be used.
# Install the mouse package
-if (!("org.Mm.eg.db" %in% installed.packages())) {
- # Install this package if it isn't installed yet
- BiocManager::install("org.Mm.eg.db", update = FALSE)
-}
-Attach the packages we need for this analysis.
-# Attach the library
-library(org.Mm.eg.db)
-## Loading required package: AnnotationDbi
-## Loading required package: stats4
-## Loading required package: BiocGenerics
-## Loading required package: parallel
-##
-## Attaching package: 'BiocGenerics'
-## The following objects are masked from 'package:parallel':
-##
-## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
-## clusterExport, clusterMap, parApply, parCapply, parLapply,
-## parLapplyLB, parRapply, parSapply, parSapplyLB
-## The following objects are masked from 'package:stats':
-##
-## IQR, mad, sd, var, xtabs
-## The following objects are masked from 'package:base':
-##
-## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
-## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
-## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
-## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
-## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
-## union, unique, unsplit, which, which.max, which.min
-## Loading required package: Biobase
-## Welcome to Bioconductor
-##
-## Vignettes contain introductory material; view with
-## 'browseVignettes()'. To cite Bioconductor, see
-## 'citation("Biobase")', and for packages 'citation("pkgname")'.
-## Loading required package: IRanges
-## Loading required package: S4Vectors
-##
-## Attaching package: 'S4Vectors'
-## The following object is masked from 'package:base':
-##
-## expand.grid
-##
-# We will need this so we can use the pipe: %>%
-library(magrittr)
+In this analysis, we will be using the org.Mm.eg.db
R package (Carlson 2019), which is part of the Bioconductor AnnotationDbi
framework (Pagès et al. 2020). Bioconductor compiles annotations from various sources, and these packages provide convenient methods to access and translate among those annotations. Analogous annotation packages are available for other species.
# Install the mouse annotation package
+if (!("org.Mm.eg.db" %in% installed.packages())) {
+ # Install this package if it isn't installed yet
+ BiocManager::install("org.Mm.eg.db", update = FALSE)
+}
Attach the packages we need for this analysis. Note that attaching org.Mm.eg.db
will automatically also attach AnnotationDbi
.
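If you are adapting this notebook to a different identifier type, the `keytypes()` and `columns()` functions from `AnnotationDbi` list the identifier types a given annotation package can translate between. A quick sketch, run after attaching the packages as above:

```r
# Identifier types that can be used as input keys
keytypes(org.Mm.eg.db)

# Identifier types that can be returned as mapped values
columns(org.Mm.eg.db)
```

The names these functions print are the valid values for the `keytype` and `column` arguments used later in this notebook.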
Data downloaded from refine.bio include a metadata tab-separated values (TSV) file and a data TSV file. This chunk of code will read both TSV files and add them as data frames to your environment.
We stored our file paths as objects named metadata_file
and data_file
in this previous step.
# Read in metadata TSV file
-metadata <- readr::read_tsv(metadata_file)
-## Parsed with column specification:
+
+##
+## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_character(),
## refinebio_age = col_logical(),
@@ -3148,13 +3945,14 @@ 4.2 Import and set up data
## channel_count = col_double(),
## data_row_count = col_double(),
## taxid_ch1 = col_double()
-## )
-## See spec(...) for full column specifications.
-# Read in data TSV file
-df <- readr::read_tsv(data_file) %>%
- # Tuck away the Gene ID column as rownames
- tibble::column_to_rownames("Gene")
-## Parsed with column specification:
+## )
+## ℹ Use `spec()` for the full column specifications.
+# Read in data TSV file
+expression_df <- readr::read_tsv(data_file) %>%
+ # Tuck away the Gene ID column as row names
+ tibble::column_to_rownames("Gene")
+##
+## ── Column specification ──────────────────────────────────────────────
## cols(
## Gene = col_character(),
## GSM340064 = col_double(),
@@ -3174,36 +3972,36 @@ 4.2 Import and set up data
## GSM340078 = col_double()
## )
Let’s ensure that the metadata and data are in the same sample order.
-# Make the data in the order of the metadata
-df <- df %>%
- dplyr::select(metadata$geo_accession)
-
-# Check if this is in the same order
-all.equal(colnames(df), metadata$geo_accession)
+# Make the data in the order of the metadata
+expression_df <- expression_df %>%
+ dplyr::select(metadata$geo_accession)
+
+# Check if this is in the same order
+all.equal(colnames(expression_df), metadata$geo_accession)
## [1] TRUE
-# Bring back the "Gene" column in preparation for mapping
-df <- df %>%
- tibble::rownames_to_column("Gene")
+
The main work of translating among annotations will be done with the AnnotationDbi
function mapIds()
. The mapIds()
function has a multiVals
argument which denotes what to do when there are multiple mapped values for a single gene identifier. The default behavior is to return just the first mapped value. It is good to keep in mind that various downstream analyses may benefit from varied strategies at this step. Use ?mapIds
to see more options or strategies.
In the next chunk, we will run the mapIds()
function and supply the multiVals
argument with the "list"
option in order to get a large list with all the mapped values found for each gene identifier.
# Map ensembl IDs to their associated gene symbols
-mapped_list <- mapIds(
- org.Mm.eg.db, # Replace with annotation package for the organism relevant to your data
- keys = df$Gene,
- column = "SYMBOL", # Replace with the type of gene identifiers you would like to map to
- keytype = "ENSEMBL", # Replace with the type of gene identifiers in your data
- multiVals = "list"
-)
+# Map ensembl IDs to their associated gene symbols
+mapped_list <- mapIds(
+ org.Mm.eg.db, # Replace with annotation package for your organism
+ keys = expression_df$Gene,
+ keytype = "ENSEMBL", # Replace with the gene identifiers used in your data
+ column = "SYMBOL", # The type of gene identifiers you would like to map to
+ multiVals = "list"
+)
## 'select()' returned 1:many mapping between keys and columns
Now, let’s take a look at our mapped_list
to see how the mapping went.
# Let's use the `head()` function to take a preview at our mapped list
-head(mapped_list)
+## $ENSMUSG00000000001
## [1] "Gnai3"
##
@@ -3222,42 +4020,44 @@ 4.4 Explore gene ID conversion
It looks like we have gene symbols that were successfully mapped to the Ensembl IDs we provided. However, the data is now in a list
object, making it a little more difficult to explore. We are going to turn our list object into a data frame object in the next chunk.
# Let's make our object a bit more manageable for exploration by turning it into a data frame
-mapped_df <- mapped_list %>%
- tibble::enframe(name = "Ensembl", value = "Symbol") %>%
- # enframe makes a `list` column, so we will convert that to simpler format with `unnest()
- # This will result in one row of our data frame per list item
- tidyr::unnest(cols = Symbol)
+# Let's make our list a bit more manageable by turning it into a data frame
+mapped_df <- mapped_list %>%
+ tibble::enframe(name = "Ensembl", value = "Symbol") %>%
+ # enframe() makes a `list` column; we will simplify it with unnest()
+ # This will result in a data frame with one row per list item
+ tidyr::unnest(cols = Symbol)
Now let’s take a peek at our data frame.
-head(mapped_df)
+
We can see that our data frame has a new column Symbol
. Let’s get a summary of the gene symbols returned in the Symbol
column of our mapped data frame.
# We can use the `summary()` function to get a better idea of the distribution of symbols in the `Symbol` column
-summary(mapped_df$Symbol)
-## Length Class Mode
-## 17977 character character
-There are 998 NAs in our data frame, which means that 998 out of the 17918 Ensembl IDs did not map to gene symbols. 998 out of 17918 is not too bad a rate, in our opinion, but note that different gene identifier types will have different mapping rates and that is to be expected. Regardless, it is always good to be aware of how many genes you are potentially “losing” if you rely on this new gene identifier you’ve mapped to for downstream analyses.
-However, if you have almost all NAs it is possible that the function was executed incorrectly or you may want to consider using a different gene identifier, if possible.
# Use the `summary()` function to show the distribution of Symbol values
+# We need to use `as.factor()` here to get the counts of unique values
+# `maxsum = 10` limits the summary to 10 distinct values
+summary(as.factor(mapped_df$Symbol), maxsum = 10)
## Cyp2c39 Pms2 0610005C13Rik 0610009B22Rik 0610009L18Rik
+## 2 2 1 1 1
+## 0610010F05Rik 0610012G03Rik 0610030E20Rik (Other) NA's
+## 1 1 1 17021 942
+There are 942 NA
s in the Symbol
column, which means that 942 out of the 17918 Ensembl IDs did not map to gene symbols. 942 out of 17918 is not too bad a rate, in our opinion, but note that different gene identifier types will have different mapping rates and that is to be expected. Regardless, it is always good to be aware of how many genes you are potentially “losing” if you rely on this new gene identifier you’ve mapped to for downstream analyses.
However, if you have almost all NA
s it is possible that the function was executed incorrectly or you may want to consider using a different gene identifier, if possible.
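If you would rather compute the number of unmapped identifiers directly than read it off the summary, a minimal sketch (assuming the `mapped_df` data frame created above):

```r
# Count the Ensembl IDs that did not map to a gene symbol
sum(is.na(mapped_df$Symbol))

# The same count expressed as a fraction of all mapped rows
mean(is.na(mapped_df$Symbol))
```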
Now let’s check to see if we have any genes that were mapped to multiple symbols.
-multi_mapped <- mapped_df %>%
- # Let's group by the Ensembl IDs in the `Ensembl` column
- dplyr::group_by(Ensembl) %>%
- # Create a new variable containing the number of symbols mapped to each ID
- dplyr::mutate(gene_symbol_count = dplyr::n()) %>%
- # Arrange by the genes with the highest number of symbols mapped
- dplyr::arrange(desc(gene_symbol_count)) %>%
- # Filter to include only the rows with multi mappings
- dplyr::filter(gene_symbol_count > 1)
-
-# Let's look at the first 6 rows of our `multi_mapped` object
-head(multi_mapped)
+multi_mapped <- mapped_df %>%
+ # Let's count the number of times each Ensembl ID appears in `Ensembl` column
+ dplyr::count(Ensembl, name = "gene_symbol_count") %>%
+ # Arrange by the genes with the highest number of symbols mapped
+ dplyr::arrange(desc(gene_symbol_count)) %>%
+ # Filter to include only the rows with multi mappings
+ dplyr::filter(gene_symbol_count > 1)
+
+# Let's look at the first 6 rows of our `multi_mapped` object
+head(multi_mapped)
Looks like we have some cases where 3 gene symbols mapped to a single Ensembl ID. We have a total of 130 out of 17984 Ensembl IDs with multiple mappings to gene symbols. If we are not too worried about the 130 IDs with multiple mappings, we can filter them out for the purpose of having 1:1 mappings for our downstream analysis.
In the next code chunk, we will rerun the mapIds()
function, this time supplying the "filter"
option to the multiVals
argument. This will remove all instances of multiple mappings and return a list of only the gene identifiers and symbols that had 1:1 mapping. Use ?mapIds
to see more options or strategies.
# Map ensembl IDs to their associated gene symbols
-filtered_mapped_df <- data.frame(
- "Symbol" = mapIds(
- org.Mm.eg.db, # Replace with annotation package for the organism relevant to your data
- keys = df$Gene,
- column = "SYMBOL", # Replace with the type of gene identifiers you would like to map to
- keytype = "ENSEMBL", # Replace with the type of gene identifiers in your data
- multiVals = "filter" # This will drop any `Gene`s that have multiple matches
- )
-) %>%
- # Make an `Ensembl` column to store the rownames
- tibble::rownames_to_column("Ensembl") %>%
- # Join the remaining data from `df` using the Ensembl IDs
- dplyr::inner_join(df, by = c("Ensembl" = "Gene"))
+# Map Ensembl IDs to their associated gene symbols
+filtered_mapped_df <- data.frame(
+ "Symbol" = mapIds(
+ org.Mm.eg.db, # Replace with annotation package for your organism
+ keys = expression_df$Gene,
+ keytype = "ENSEMBL", # Replace with the gene identifiers used in your data
+ column = "SYMBOL", # The type of gene identifiers you would like to map to
+ multiVals = "filter" # This will drop any genes that have multiple matches
+ )
+) %>%
+ # Make an `Ensembl` column to store the rownames
+ tibble::rownames_to_column("Ensembl") %>%
+ # Join the remaining data from `expression_df` using the Ensembl IDs
+ dplyr::inner_join(expression_df, by = c("Ensembl" = "Gene"))
## 'select()' returned 1:many mapping between keys and columns
Now, let’s write our filtered and mapped results to file!
# Write mapped and annotated data frame to output file
-readr::write_tsv(filtered_mapped_df, file.path(
- results_dir,
- "GSE13490_Gene_Symbols.tsv" # Replace with a relevant output file name
-))
+
At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.
-# Print session info
-sessioninfo::session_info()
-## ─ Session info ───────────────────────────────────────────────────────────────
+
+## ─ Session info ─────────────────────────────────────────────────────
## setting value
## version R version 4.0.2 (2020-06-22)
## os Ubuntu 20.04 LTS
@@ -3313,19 +4113,19 @@ 6 Session info
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
-## date 2020-10-16
+## date 2020-12-21
##
-## ─ Packages ───────────────────────────────────────────────────────────────────
+## ─ Packages ─────────────────────────────────────────────────────────
## package * version date lib source
-## AnnotationDbi * 1.50.3 2020-07-25 [1] Bioconductor
+## AnnotationDbi * 1.52.0 2020-10-27 [1] Bioconductor
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.0)
## backports 1.1.10 2020-09-15 [1] RSPM (R 4.0.2)
-## Biobase * 2.48.0 2020-04-27 [1] Bioconductor
-## BiocGenerics * 0.34.0 2020-04-27 [1] Bioconductor
+## Biobase * 2.50.0 2020-10-27 [1] Bioconductor
+## BiocGenerics * 0.36.0 2020-10-27 [1] Bioconductor
## bit 4.0.4 2020-08-04 [1] RSPM (R 4.0.2)
## bit64 4.0.5 2020-08-30 [1] RSPM (R 4.0.2)
## blob 1.2.1 2020-01-20 [1] RSPM (R 4.0.0)
-## cli 2.0.2 2020-02-28 [1] RSPM (R 4.0.0)
+## cli 2.1.0 2020-10-12 [1] RSPM (R 4.0.2)
## crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.0)
## DBI 1.1.0 2019-12-15 [1] RSPM (R 4.0.0)
## digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0)
@@ -3338,16 +4138,17 @@ 6 Session info
## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.2)
## hms 0.5.3 2020-01-08 [1] RSPM (R 4.0.0)
## htmltools 0.5.0 2020-06-16 [1] RSPM (R 4.0.1)
-## IRanges * 2.22.2 2020-05-21 [1] Bioconductor
+## IRanges * 2.24.1 2020-12-12 [1] Bioconductor
## jsonlite 1.7.1 2020-09-07 [1] RSPM (R 4.0.2)
## knitr 1.30 2020-09-22 [1] RSPM (R 4.0.2)
## lifecycle 0.2.0 2020-03-06 [1] RSPM (R 4.0.0)
## magrittr * 1.5 2014-11-22 [1] RSPM (R 4.0.0)
## memoise 1.1.0 2017-04-21 [1] RSPM (R 4.0.0)
## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0)
-## org.Mm.eg.db * 3.11.4 2020-10-06 [1] Bioconductor
+## org.Mm.eg.db * 3.12.0 2020-12-16 [1] Bioconductor
## pillar 1.4.6 2020-07-10 [1] RSPM (R 4.0.2)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.0)
+## ps 1.4.0 2020-10-07 [1] RSPM (R 4.0.2)
## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.0)
## R.cache 0.14.0 2019-12-06 [1] RSPM (R 4.0.0)
## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.2)
@@ -3355,18 +4156,18 @@ 6 Session info
## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.2)
## R6 2.4.1 2019-11-12 [1] RSPM (R 4.0.0)
## Rcpp 1.0.5 2020-07-06 [1] RSPM (R 4.0.2)
-## readr 1.3.1 2018-12-21 [1] RSPM (R 4.0.2)
+## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.2)
## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.0)
-## rlang 0.4.7 2020-07-09 [1] RSPM (R 4.0.2)
+## rlang 0.4.8 2020-10-08 [1] RSPM (R 4.0.2)
## rmarkdown 2.4 2020-09-30 [1] RSPM (R 4.0.2)
-## RSQLite 2.2.0 2020-01-07 [1] RSPM (R 4.0.2)
+## RSQLite 2.2.1 2020-09-30 [1] RSPM (R 4.0.2)
## rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0)
-## S4Vectors * 0.26.1 2020-05-16 [1] Bioconductor
+## S4Vectors * 0.28.1 2020-12-09 [1] Bioconductor
## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.0)
## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.2)
## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.0)
## styler 1.3.2 2020-02-23 [1] RSPM (R 4.0.0)
-## tibble 3.0.3 2020-07-10 [1] RSPM (R 4.0.2)
+## tibble 3.0.4 2020-10-12 [1] RSPM (R 4.0.2)
## tidyr 1.1.2 2020-08-27 [1] RSPM (R 4.0.2)
## tidyselect 1.1.0 2020-05-11 [1] RSPM (R 4.0.0)
## vctrs 0.3.4 2020-08-29 [1] RSPM (R 4.0.2)
@@ -3380,24 +4181,23 @@ 6 Session info
References
-
-Carlson M., 2019 Genome wide annotation for mouse
-
-
-Carlson M., 2020a AnnotationDbi
+
+Carlson M., 2019 Genome wide annotation for mouse. https://bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html
-Carlson M., 2020b AnnotationDbi: Introduction to bioconductor annotation packages
+Carlson M., 2020 AnnotationDbi: Introduction to bioconductor annotation packages. https://bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf
-
-CCDL, 2020 Obtaining annotation for ensembl ids - rna-seq.
-
-
-R Bioconductor Team, 2003 Packages found under annotationdata
+
+Pagès H., M. Carlson, S. Falcon, and N. Li, 2020 AnnotationDbi: Manipulation of SQLite-based annotations in Bioconductor. https://bioconductor.org/packages/release/bioc/html/AnnotationDbi.html
+
diff --git a/02-microarray/ortholog-mapping_microarray_01_ensembl.Rmd b/02-microarray/ortholog-mapping_microarray_01_ensembl.Rmd
index 4a1f8266..0d4b5a6c 100644
--- a/02-microarray/ortholog-mapping_microarray_01_ensembl.Rmd
+++ b/02-microarray/ortholog-mapping_microarray_01_ensembl.Rmd
@@ -1,7 +1,7 @@
---
title: "Ortholog Mapping - Microarray"
author: "CCDL for ALSF"
-date: "October 2020"
+date: "December 2020"
output:
html_notebook:
toc: true
@@ -43,7 +43,7 @@ if (!dir.exists("data")) {
}
# Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
+plots_dir <- "plots"
# Create the plots folder if it doesn't exist
if (!dir.exists(plots_dir)) {
@@ -51,7 +51,7 @@ if (!dir.exists(plots_dir)) {
}
# Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
+results_dir <- "results"
# Create the results folder if it doesn't exist
if (!dir.exists(results_dir)) {
@@ -81,7 +81,7 @@ You will get an email when it is ready.
## About the dataset we are using for this example
For this example analysis, we will use this [CREB overexpression zebrafish dataset](https://www.refine.bio/experiments/GSE71270/creb-overexpression-induces-leukemia-in-zebrafish-by-blocking-myeloid-differentiation-process).
-@Tregnago2016 measured microarray gene expression of ten zebrafish samples, five overexpressing human CREB, as well as five control samples.
+@Tregnago2016 used microarrays to measure gene expression of ten zebrafish samples, five overexpressing human CREB, as well as five control samples.
## Place the dataset in your new `data/` folder
@@ -122,19 +122,24 @@ This is handy to do because if we want to switch the dataset (see next section f
```{r}
# Define the file path to the data directory
-data_dir <- file.path("data", "GSE13490") # Replace with accession number which will be the name of the folder the files will be in
-
-# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
-data_file <- file.path(data_dir, "GSE13490.tsv") # Replace with file path to your dataset
-
-# Declare the file path to the metadata file using the data directory saved as `data_dir`
-metadata_file <- file.path(data_dir, "metadata_GSE13490.tsv") # Replace with file path to your metadata
+# Replace with the path of the folder the files will be in
+data_dir <- file.path("data", "GSE71270")
+
+# Declare the file path to the gene expression matrix file
+# inside directory saved as `data_dir`
+# Replace with the path to your dataset file
+data_file <- file.path(data_dir, "GSE71270.tsv")
+
+# Declare the file path to the metadata file
+# inside the directory saved as `data_dir`
+# Replace with the path to your metadata file
+metadata_file <- file.path(data_dir, "metadata_GSE71270.tsv")
```
Now that our file paths are declared, we can use the `file.exists()` function to check that the files are where we specified above.
```{r}
-# Check if the gene expression matrix file is at the file path stored in `data_file`
+# Check if the gene expression matrix file is at the path stored in `data_file`
file.exists(data_file)
# Check if the metadata file is at the file path stored in `metadata_file`
@@ -164,7 +169,7 @@ See our Getting Started page with [instructions for package installation](https:
Attach a package we need for this analysis.
-```{r}
+```{r message=FALSE}
# We will need this so we can use the pipe: %>%
library(magrittr)
```
@@ -178,36 +183,50 @@ The [HGNC Comparison of Orthology Predictions (HCOP)](https://www.genenames.org/
In general, an orthology prediction where most of the databases concur would be considered the most reliable, and we will use this to prioritize mapping in cases where there is more than one possible ortholog for a gene.
HCOP was originally designed to show orthology predictions between human and mouse, but has been expanded to include data from 18 genomes, including zebrafish, which we will use in this notebook [@hcop-help].
-First, we need to download the file from the server holding the HGNC data.
-Go to this [directory page of the HGNC Comparison of Orthology Predictions (HCOP) files](ftp://ftp.ebi.ac.uk/pub/databases/genenames/hcop/).
+We can download the human to zebrafish translation file we need for this example using the `download.file()` command.
+For this notebook, we want to download the file named `human_zebrafish_hcop_fifteen_column.txt.gz`.
-This is where the files that reflect the data provided via the [HGNC database](https://www.genenames.org/) are maintained.
-Ortholog species files with the '6 Column' output returns the raw assertions, Ensembl gene IDs and Entrez Gene IDs for human and one other species, while the '15 Column' output includes additional information such as the chromosomal location, accession numbers and the databases that support the assertions.
+First we'll declare a sensible file path for this.
-*Note:* If you are using Safari (or the above FTP server link does not open in a web browser), you may need to go to the [link for the HCOP search tool](https://www.genenames.org/tools/hcop/) and scroll down to "Bulk Downloads" to choose a file to download.
-Here, you can find the same files you would find at the server linked above.
+```{r}
+# Declare what we want the downloaded file to be called and its location
+zebrafish_hgnc_file <- file.path(
+ data_dir,
+ # The name the file will have locally
+ "human_zebrafish_hcop_fifteen_column.txt.gz"
+)
+```
-To download a file, click the file name.
-For this notebook, you will want to download the file named `human_zebrafish_hcop_fifteen_column.txt.gz`.
-If you are using a different dataset, you can replace `zebrafish` in `human_zebrafish_hcop_fifteen_column.txt.gz` with the name of the species you have data for, and click on that file to download.
+Using the file path we just declared, we can supply it to the `destfile` argument of `download.file()` so that the downloaded file is saved to that directory under that name.
-
+We are downloading this orthology predictions file from the [HGNC database](https://www.genenames.org/).
+If you are looking for a different species, see the [directory page of the HGNC Comparison of Orthology Predictions (HCOP) files](http://ftp.ebi.ac.uk/pub/databases/genenames/hcop/) and find the file name of the species you are looking for.
-Next, move the `human_zebrafish_hcop_fifteen_column.txt.gz` file into your `data/` folder.
+```{r}
+download.file(
+ paste0(
+ "http://ftp.ebi.ac.uk/pub/databases/genenames/hcop/",
+ # Replace with the file name for the species conversion you want
+ "human_zebrafish_hcop_fifteen_column.txt.gz"
+ ),
+ # The file will be saved to the name and location we defined earlier
+ destfile = zebrafish_hgnc_file
+)
+```
-*Note:* If you are using Safari, this file will automatically be decompressed, so the name of the file would instead be `human_zebrafish_hcop_fifteen_column.txt` (don't forget to change the file name in the chunk below if this is the case).
+If you are using a different dataset, in the last chunk you can replace `zebrafish` in `human_zebrafish_hcop_fifteen_column.txt.gz` with the name of the species you have data for (if you see it listed in the directory).
+Don't forget to change the destination file as well to reflect what you download!
-Now let's double check that the file is in the right place.
+Ortholog species files with the '6 Column' output return the raw assertions, Ensembl gene IDs and Entrez Gene IDs for human and one other species, while the '15 Column' output includes additional information such as the chromosomal location, accession numbers and the databases that support the assertions.
-```{r}
-# Define the file path to organism orthology file downloaded from the HGNC database
-zebrafish_hgnc_file <- file.path("data", "human_zebrafish_hcop_fifteen_column.txt.gz")
+Now let's double check that the zebrafish ortholog file is in the right place.
+```{r}
# Check if the organism orthology file is in the `data` directory
file.exists(zebrafish_hgnc_file)
```
-In the next chunk, we will read in the orthology file that was just downloaded.
+Now we can read in the orthology file that we downloaded.
```{r}
# Read in the data from HGNC
@@ -239,8 +258,8 @@ We stored our file path for the dataset in an object named `data_file` in [this
```{r}
# Read in data TSV file
zebrafish_genes <- readr::read_tsv(data_file) %>%
- # We only want the gene IDs so let's pull the `Gene` column
- dplyr::pull("Gene")
+ # We only want the gene IDs so let's keep only the `Gene` column
+ dplyr::select("Gene")
```
## Mapping human gene symbols to zebrafish Ensembl gene IDs
@@ -248,7 +267,7 @@ zebrafish_genes <- readr::read_tsv(data_file) %>%
refine.bio data uses Ensembl gene identifiers, which will be in the first column.
```{r}
-# Let's take a look at the first 6 items of `zebrafish_genes`
+# Let's take a look at the first 6 rows of `zebrafish_genes`
head(zebrafish_genes)
```
@@ -263,7 +282,7 @@ This column may assist with addressing some of the multi-mappings that we will t
```{r}
human_zebrafish_key <- zebrafish %>%
- # We'll want to subset zebrafish to only the columns we're interested in
+ # Reduce the zebrafish table to only the columns we're interested in
dplyr::select(zebrafish_ensembl, human_symbol, support)
# Since we ignored the additional columns in `zebrafish`, let's check to see if
@@ -272,57 +291,61 @@ any(duplicated(human_zebrafish_key))
```
We do have duplicates!
-We don't want to handle duplicate data, so let's remove those duplicates before moving forward.
+Let's remove those duplicates before moving forward, as they provide no extra information at this point.
```{r}
human_zebrafish_key <- human_zebrafish_key %>%
- # We need to use the `distinct()` function to remove duplicates resulted from
- # ignoring the additional columns in the `zebrafish` object
+ # Use the `distinct()` function to remove duplicates resulting from
+ # dropping the additional columns in the `zebrafish` data frame
dplyr::distinct()
```
Now let's join the mapped data from `human_zebrafish_key` with the gene data in `zebrafish_genes`.
+We are using a "left join" here so that we get at least one row per zebrafish gene, even if there is no matching human symbol in the mapping table.
```{r}
-# First, we need to convert our vector of zebrafish genes into a data frame
-human_zebrafish_mapped_df <- data.frame("Gene" = zebrafish_genes) %>%
+human_zebrafish_mapped_df <- zebrafish_genes %>%
# Now we can join the mapped data
dplyr::left_join(human_zebrafish_key, by = c("Gene" = "zebrafish_ensembl"))
```
Here's what the new data frame looks like:
-```{r}
-head(human_zebrafish_mapped_df, n = 25)
+```{r rownames.print = FALSE}
+head(human_zebrafish_mapped_df, n = 10)
```
+Looks like we have mapped symbols!
+
+So now we have all the zebrafish genes mapped to human, but there might be places where there are multiple zebrafish genes that are orthologous to the same human gene, or vice versa.
+
Let's get a summary of the human gene symbols returned in our mapped data frame, `human_zebrafish_mapped_df`.
```{r}
-# We can use this `count()` function after `group_by()`to get a count of how many
+# We can use the `count()` function to get a tally of how many
# `zebrafish_ensembl` IDs there are per `human_symbol`
human_zebrafish_mapped_df %>%
- dplyr::group_by(human_symbol) %>%
- dplyr::count() %>%
- # Sort by highest `n` which would be the human gene symbol with the most
+ # Remove the support column
+ dplyr::select(Gene, human_symbol) %>%
+ # Remove any remaining duplicates
+ dplyr::distinct() %>%
+ # Count the number of rows per human gene
+ dplyr::count(human_symbol) %>%
+ # Sort by highest `n` which will be the human gene symbol with the most
# mapped zebrafish Ensembl IDs
dplyr::arrange(desc(n))
```
-Looks like we have mapped symbols!
+There are certainly a good number of places where we mapped multiple zebrafish Ensembl IDs to the same human symbol!
+We'll look at this in a bit.
-Now, let's get an idea of how many zebrafish Ensembl IDs we have that were not mapped to human gene symbols.
-
-```{r}
-sum(is.na(human_zebrafish_mapped_df$human_symbol))
-```
-
-We have 463 NAs, which means we have 463 zebrafish Ensembl IDs that were not mapped to human gene symbols.
-This is okay because we do not expect everything to map across species.
+We can also see that there are 738 zebrafish Ensembl IDs that did not map to a human symbol.
+These are the ones with a value of NA.
+This is okay because we do not expect everything to map neatly across species.
## Take a look at some multi-mappings
-If a zebrafish Ensembl gene ID maps to multiple human symbols, the associated values will get duplicated.
+If a zebrafish Ensembl gene ID maps to multiple human symbols, the associated Ensembl ID values will get duplicated in our output data.
Let's look at the `ENSDARG00000069142` example below.
```{r}
@@ -330,7 +353,7 @@ human_zebrafish_mapped_df %>%
dplyr::filter(Gene == "ENSDARG00000069142")
```
-On the other hand, if you were to look at the original data associated to the zebrafish Ensembl IDs, when a human gene symbol maps to multiple zebrafish Ensembl IDs, the values will not get duplicated, but you will have multiple rows associated with that human symbol.
+On the other hand, if you were to look at the original data associated with the zebrafish Ensembl IDs, when a human gene symbol maps to multiple zebrafish Ensembl IDs, the Ensembl IDs will not get duplicated, but you will have multiple rows associated with that human symbol.
Let's look at the `MATR3` example below.
```{r}
@@ -338,6 +361,11 @@ human_zebrafish_mapped_df %>%
dplyr::filter(human_symbol == "MATR3")
```
+We can see that we have multiple zebrafish Ensembl IDs that mapped to the same gene.
+(Notice that we also still have some duplicate zebrafish Ensembl ID/human symbol pairs here because the `support` column was different in the original data set!
+This is why we removed that column before counting above.)
+
+
## Collapse zebrafish genes mapping to multiple human genes
Remember that if a zebrafish Ensembl gene ID maps to multiple human symbols, the values get duplicated.
@@ -349,9 +377,16 @@ In the next chunk, we show how we can collapse all the human gene symbols into o
collapsed_human_symbol_df <- human_zebrafish_mapped_df %>%
# Group by zebrafish Ensembl IDs
dplyr::group_by(Gene) %>%
- # Collapse the mapped values in `human_zebrafish_mapped_df` into one column named
- # `all_human_symbols` -- note that we will lose the `support` column in this summarizing step
- dplyr::summarize(all_human_symbols = paste(human_symbol, collapse = ";"))
+  # Collapse the mapped values in `human_zebrafish_mapped_df` to an
+ # `all_human_symbols` column, removing any duplicated human symbols
+ # note that we will lose the `support` column in this summarizing step
+ dplyr::summarize(
+ # combine unique symbols with semicolons between them
+ all_human_symbols = paste(
+ sort(unique(human_symbol)),
+ collapse = ";"
+ )
+ )
head(collapsed_human_symbol_df)
```
@@ -377,11 +412,11 @@ Since multiple zebrafish Ensembl gene IDs map to the same human symbol, we may w
*This is not at all straightforward!* (see [this paper](https://doi.org/10.1093/bioinformatics/btaa468) for just one example) [@Stamboulian2020].
Gene duplications along the zebrafish lineage may result in complicated relationships among genes, especially with regard to divisions of function.
-Simply combining values across zebrafish transcripts using an average may result in the loss of a lot of data and will likely not be representative of the zebrafish biology.
+Simply combining expression values across zebrafish transcripts that correspond to the same human gene using an average or other summary statistic may result in the loss of a lot of data and will likely not be representative of the zebrafish biology.
One thing we might do to make the problem somewhat simpler is to reduce the number of multi-mapped genes by requiring a certain level of support for each mapping from across the various databases included in `HCOP`.
This will not fully solve the problem (and may not even be desirable in some cases), but we present it here as an example of an approach one might take.
-Therefore, we will use the `support` column to decide which mappings to retain.
+To do this, we will use the `support` column to decide which mappings to retain.
Let's take a look at `support`.
```{r}
@@ -389,31 +424,32 @@ head(human_zebrafish_mapped_df$support)
```
Looks like we have a variety of databases for multiple mappings, but we do have some instances of only one database reported in support of the mapping.
-As we noted earlier, an orthology prediction where more than one of the databases concur would be considered reliable.
-Therefore, where we have multi-mapped zebrafish Ensembl gene IDs, we will take the mappings with more than one database to support the assertion.
+As we noted earlier, an orthology prediction where more than one of the databases concur would be considered more reliable.
+Therefore, where we have multi-mapped zebrafish Ensembl gene IDs, we will retain the mappings with more than one database to support the assertion.
Before we do, let's take a look at how many multi-mapped genes there are in the data frame.
```{r}
human_zebrafish_mapped_df %>%
- # Group by human gene symbols
- dplyr::group_by(human_symbol) %>%
+ # Remove the `support` column
+ dplyr::select(Gene, human_symbol) %>%
+ # Remove any remaining duplicates
+ dplyr::distinct() %>%
# Count the number of rows in the dataframe for each symbol
- dplyr::count() %>%
+ dplyr::count(human_symbol) %>%
# Filter out the symbols without multimapped genes
dplyr::filter(n > 1)
```
-Looks like we have 4,192 human gene symbols with multiple mappings.
+Looks like we have 4,169 human gene symbols with multiple mappings.
Now let's filter out the less reliable mappings.
```{r}
filtered_zebrafish_ensembl_df <- human_zebrafish_mapped_df %>%
- # Count the number of databases in the support column for each prediction
+ # Count the number of databases in the support column
+ # by using the number of commas that separate the databases
dplyr::mutate(n_databases = stringr::str_count(support, ",") + 1) %>%
- # Group by human gene symbols
- dplyr::group_by(human_symbol) %>%
- # Now filter for the rows with more than one database in support for each human gene symbol
+ # Now filter to the rows where more than one database supports the mapping
dplyr::filter(n_databases > 1)
head(filtered_zebrafish_ensembl_df)
@@ -423,14 +459,16 @@ Let's count how many multi-mapped genes we have now.
```{r}
filtered_zebrafish_ensembl_df %>%
- # Group by human gene symbols
- dplyr::group_by(human_symbol) %>%
+ # Remove the support column
+ dplyr::select(Gene, human_symbol) %>%
+ # Remove any remaining duplicates
+ dplyr::distinct() %>%
# Count the number of rows in the dataframe for each symbol
- dplyr::count() %>%
- # Filter out the symbols without multimapped genes
+ dplyr::count(human_symbol) %>%
+ # Filter to the symbols with multimapped genes
dplyr::filter(n > 1)
```
-Now we only have 1,803 multi-mapped genes, compared to the 4,192 that we began with.
+Now we only have 1,695 multi-mapped genes, compared to the 4,169 that we began with.
Although we haven't filtered down to zero multi-mapped genes, we have hopefully removed some of the less _reliable_ mappings.
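If your downstream analysis required a single zebrafish gene per human symbol, one further (and lossy) step you could take, which we do not apply in this example, is to keep only the mapping with the most supporting databases. A sketch, using the `n_databases` column we created above:

```{r}
# Keep the single best-supported zebrafish gene per human symbol
# Note: `with_ties = FALSE` breaks ties arbitrarily, so use with caution
best_supported_df <- filtered_zebrafish_ensembl_df %>%
  dplyr::group_by(human_symbol) %>%
  dplyr::slice_max(n_databases, n = 1, with_ties = FALSE) %>%
  dplyr::ungroup()

head(best_supported_df)
```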
### Write results to file
diff --git a/02-microarray/ortholog-mapping_microarray_01_ensembl.html b/02-microarray/ortholog-mapping_microarray_01_ensembl.html
index bc145f48..051773a1 100644
--- a/02-microarray/ortholog-mapping_microarray_01_ensembl.html
+++ b/02-microarray/ortholog-mapping_microarray_01_ensembl.html
@@ -1263,25 +1263,22 @@
};
-
-
+
+ code.sourceCode > span { display: inline-block; line-height: 1.25; }
+ code.sourceCode > span { color: inherit; text-decoration: inherit; }
+ code.sourceCode > span:empty { height: 1.2em; }
+ .sourceCode { overflow: visible; }
+ code.sourceCode { white-space: pre; position: relative; }
+ div.sourceCode { margin: 1em 0; }
+ pre.sourceCode { margin: 0; }
+ @media screen {
+ div.sourceCode { overflow: auto; }
+ }
+ @media print {
+ code.sourceCode { white-space: pre-wrap; }
+ code.sourceCode > span { text-indent: -5em; padding-left: 5em; }
+ }
+ pre.numberSource code
+ { counter-reset: source-line 0; }
+ pre.numberSource code > span
+ { position: relative; left: -4em; counter-increment: source-line; }
+ pre.numberSource code > span > a:first-child::before
+ { content: counter(source-line);
+ position: relative; left: -1em; text-align: right; vertical-align: baseline;
+ border: none; display: inline-block;
+ -webkit-touch-callout: none; -webkit-user-select: none;
+ -khtml-user-select: none; -moz-user-select: none;
+ -ms-user-select: none; user-select: none;
+ padding: 0 4px; width: 4em;
+ color: #aaaaaa;
+ }
+ pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; }
+ div.sourceCode
+ { }
+ @media screen {
+ code.sourceCode > span > a:first-child::before { text-decoration: underline; }
+ }
+ code span.al { color: #ff0000; } /* Alert */
+ code span.an { color: #008000; } /* Annotation */
+ code span.at { } /* Attribute */
+ code span.bu { } /* BuiltIn */
+ code span.cf { color: #0000ff; } /* ControlFlow */
+ code span.ch { color: #008080; } /* Char */
+ code span.cn { } /* Constant */
+ code span.co { color: #008000; } /* Comment */
+ code span.cv { color: #008000; } /* CommentVar */
+ code span.do { color: #008000; } /* Documentation */
+ code span.er { color: #ff0000; font-weight: bold; } /* Error */
+ code span.ex { } /* Extension */
+ code span.im { } /* Import */
+ code span.in { color: #008000; } /* Information */
+ code span.kw { color: #0000ff; } /* Keyword */
+ code span.op { } /* Operator */
+ code span.ot { color: #ff4000; } /* Other */
+ code span.pp { color: #ff4000; } /* Preprocessor */
+ code span.sc { color: #008080; } /* SpecialChar */
+ code span.ss { color: #008080; } /* SpecialString */
+ code span.st { color: #008080; } /* String */
+ code span.va { } /* Variable */
+ code span.vs { color: #008080; } /* VerbatimString */
+ code span.wa { color: #008000; font-weight: bold; } /* Warning */
+
+
+
+
-
-
+
@@ -2874,15 +3686,20 @@
@@ -2948,7 +3774,7 @@
Ortholog Mapping - Microarray
CCDL for ALSF
-October 2020
+December 2020
@@ -2969,26 +3795,26 @@ 2.1 Obtain the .Rmd
2.2 Set up your analysis folders
Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!
If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.
-# Create the data folder if it doesn't exist
-if (!dir.exists("data")) {
- dir.create("data")
-}
-
-# Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
-
-# Create the plots folder if it doesn't exist
-if (!dir.exists(plots_dir)) {
- dir.create(plots_dir)
-}
-
-# Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
-
-# Create the results folder if it doesn't exist
-if (!dir.exists(results_dir)) {
- dir.create(results_dir)
-}
+# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+ dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots"
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+ dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results"
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+ dir.create(results_dir)
+}
In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!
For this example analysis, we will use this CREB overexpression zebrafish dataset. Tregnago et al. (2016) measured microarray gene expression of ten zebrafish samples, five overexpressing human CREB, as well as five control samples.
+For this example analysis, we will use this CREB overexpression zebrafish dataset. Tregnago et al. (2016) used microarrays to measure gene expression of ten zebrafish samples: five overexpressing human CREB and five control samples.
data/ folder
In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.
First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy because if we want to switch the dataset we are using for this analysis (see the next section for more on this), we will only have to change the file paths here to get started.
-# Define the file path to the data directory
-data_dir <- file.path("data", "GSE13490") # Replace with accession number which will be the name of the folder the files will be in
-
-# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
-data_file <- file.path(data_dir, "GSE13490.tsv") # Replace with file path to your dataset
-
-# Declare the file path to the metadata file using the data directory saved as `data_dir`
-metadata_file <- file.path(data_dir, "metadata_GSE13490.tsv") # Replace with file path to your metadata
+# Define the file path to the data directory
+# Replace with the path of the folder the files will be in
+data_dir <- file.path("data", "GSE71270")
+
+# Declare the file path to the gene expression matrix file
+# inside directory saved as `data_dir`
+# Replace with the path to your dataset file
+data_file <- file.path(data_dir, "GSE71270.tsv")
+
+# Declare the file path to the metadata file
+# inside the directory saved as `data_dir`
+# Replace with the path to your metadata file
+metadata_file <- file.path(data_dir, "metadata_GSE71270.tsv")
Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.
# Check if the gene expression matrix file is at the file path stored in `data_file`
-file.exists(data_file)
+# Check if the gene expression matrix file is at the path stored in `data_file`
+file.exists(data_file)
## [1] TRUE
-# Check if the metadata file is at the file path stored in `metadata_file`
-file.exists(metadata_file)
+# Check if the metadata file is at the file path stored in `metadata_file`
+file.exists(metadata_file)
## [1] TRUE
If the chunk above printed out FALSE to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.
@@ -3069,31 +3900,43 @@See our Getting Started page with instructions for package installation for a list of the software you will need, as well as more tips and resources.
Attach a package we need for this analysis.
-# We will need this so we can use the pipe: %>%
-library(magrittr)
+
The HUGO Gene Nomenclature Committee (HGNC) assigns a unique and ideally meaningful name and symbol to every human gene. The HGNC database currently contains over 39,000 public records containing approved human gene nomenclature and associated gene information (Gray et al. 2015).
-The HGNC Comparison of Orthology Predictions (HCOP) is a search tool that combines orthology predictions for a specified human gene, or set of human genes from a variety of sources, including Ensembl Compara, HGNC, and NCBI Gene Orthology (Wright et al. 2005). In general, an orthology prediction where most of the databases concur would be considered the reliable, and we will use this to prioritize mapping in cases where there is more than one possible ortholog for a gene. HCOP was originally designed to show orthology predictions between human and mouse, but has been expanded to include data from 18 genomes, including zebrafish, which we will use in this notebook (HGNC team 2020).
-First, we need to download the file from the server holding the HGNC data. Go to this directory page of the HGNC Comparison of Orthology Predictions (HCOP) files.
-This is where the files that reflect the data provided via the HGNC database are maintained. Ortholog species files with the ‘6 Column’ output returns the raw assertions, Ensembl gene IDs and Entrez Gene IDs for human and one other species, while the ‘15 Column’ output includes additional information such as the chromosomal location, accession numbers and the databases that support the assertions.
-Note: If you are using Safari (or the above FTP server link does not open in a web browser), you may need to go to the link for the HCOP search tool and scroll down to “Bulk Downloads” to choose a file to download. Here, you can find the same files you would find at the server linked above.
-To download a file, click the file name. For this notebook, you will want to download the file named human_zebrafish_hcop_fifteen_column.txt.gz. If you are using a different dataset, you can replace zebrafish in human_zebrafish_hcop_fifteen_column.txt.gz with the name of the species you have data for, and click on that file to download.
Next, move the human_zebrafish_hcop_fifteen_column.txt.gz file into your data/ folder.
Note: If you are using Safari, this file will automatically be decompressed, so the name of the file would instead be human_zebrafish_hcop_fifteen_column.txt (don’t forget to change the file name in the chunk below if this is the case).
Now let’s double check that the file is in the right place.
-# Define the file path to organism orthology file downloaded from the HGNC database
-zebrafish_hgnc_file <- file.path("data", "human_zebrafish_hcop_fifteen_column.txt.gz")
-
-# Check if the organism orthology file file is in the `data` directory
-file.exists(zebrafish_hgnc_file)
+The HUGO Gene Nomenclature Committee (HGNC) assigns a unique and ideally meaningful name and symbol to every human gene. The HGNC database currently contains over 39,000 public records containing approved human gene nomenclature and associated gene information (Gray et al. 2015).
+The HGNC Comparison of Orthology Predictions (HCOP) is a search tool that combines orthology predictions for a specified human gene, or set of human genes, from a variety of sources, including Ensembl Compara, HGNC, and NCBI Gene Orthology (Wright et al. 2005). In general, an orthology prediction where most of the databases concur would be considered the most reliable, and we will use this to prioritize mapping in cases where there is more than one possible ortholog for a gene. HCOP was originally designed to show orthology predictions between human and mouse, but has been expanded to include data from 18 genomes, including zebrafish, which we will use in this notebook (HGNC Team 2020).
+We can download the human to zebrafish translation file we need for this example using the download.file() command. For this notebook, we want to download the file named human_zebrafish_hcop_fifteen_column.txt.gz.
First we’ll declare a sensible file path for this.
+# Declare what we want the downloaded file to be called and its location
+zebrafish_hgnc_file <- file.path(
+ data_dir,
+ # The name the file will have locally
+ "human_zebrafish_hcop_fifteen_column.txt.gz"
+)
Using the file path we just declared, we can use the destfile argument to download the file we need to this directory and use this file name.
We are downloading this orthology predictions file from the HGNC database. If you are looking for a different species, see the directory page of the HGNC Comparison of Orthology Predictions (HCOP) files and find the file name of the species you are looking for.
+download.file(
+ paste0(
+ "http://ftp.ebi.ac.uk/pub/databases/genenames/hcop/",
+ # Replace with the file name for the species conversion you want
+ "human_zebrafish_hcop_fifteen_column.txt.gz"
+ ),
+ # The file will be saved to the name and location we defined earlier
+ destfile = zebrafish_hgnc_file
+)
If you are using a different dataset, in the last chunk you can replace zebrafish in human_zebrafish_hcop_fifteen_column.txt.gz with the name of the species you have data for (if you see it listed in the directory). Don’t forget to change the destination file as well to reflect what you download!
Ortholog species files with the ‘6 Column’ output return the raw assertions, Ensembl gene IDs, and Entrez Gene IDs for human and one other species, while the ‘15 Column’ output includes additional information such as the chromosomal location, accession numbers, and the databases that support the assertions.
+Now let’s double check that the zebrafish ortholog file is in the right place.
+# Check if the organism orthology file is in the `data` directory
+file.exists(zebrafish_hgnc_file)
## [1] TRUE
-In the next chunk, we will read in the orthology file that was just downloaded.
-# Read in the data from HGNC
-zebrafish <- readr::read_tsv(zebrafish_hgnc_file)
-## Parsed with column specification:
+Now we can read in the orthology file that we downloaded.
+
+##
+## ── Column specification ──────────────────────────────────────────────
## cols(
## human_entrez_gene = col_character(),
## human_ensembl_gene = col_character(),
@@ -3112,230 +3955,247 @@ 4.2 Import data from HGNC
## support = col_character()
## )
Let’s take a look at what zebrafish looks like.
-zebrafish
+
We are going to manipulate some of the column names to make things easier when calling them downstream.
-zebrafish <- zebrafish %>%
- set_names(names(.) %>%
- # Removing extra word in some of the column names
- gsub("_gene$", "", .))
+
Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read the data TSV file and add it as an object to your environment.
We stored our file path for the dataset in an object named data_file in this previous step.
# Read in data TSV file
-zebrafish_genes <- readr::read_tsv(data_file) %>%
- # We only want the gene IDs so let's pull the `Gene` column
- dplyr::pull("Gene")
-## Parsed with column specification:
+# Read in data TSV file
+zebrafish_genes <- readr::read_tsv(data_file) %>%
+ # We only want the gene IDs so let's keep only the `Gene` column
+ dplyr::select("Gene")
+##
+## ── Column specification ──────────────────────────────────────────────
## cols(
## Gene = col_character(),
-## GSM340064 = col_double(),
-## GSM340065 = col_double(),
-## GSM340066 = col_double(),
-## GSM340067 = col_double(),
-## GSM340068 = col_double(),
-## GSM340069 = col_double(),
-## GSM340070 = col_double(),
-## GSM340071 = col_double(),
-## GSM340072 = col_double(),
-## GSM340073 = col_double(),
-## GSM340074 = col_double(),
-## GSM340075 = col_double(),
-## GSM340076 = col_double(),
-## GSM340077 = col_double(),
-## GSM340078 = col_double()
+## GSM1831675 = col_double(),
+## GSM1831676 = col_double(),
+## GSM1831677 = col_double(),
+## GSM1831678 = col_double(),
+## GSM1831679 = col_double(),
+## GSM1831680 = col_double(),
+## GSM1831681 = col_double(),
+## GSM1831682 = col_double(),
+## GSM1831683 = col_double(),
+## GSM1831684 = col_double()
## )
refine.bio data uses Ensembl gene identifiers, which will be in the first column.
-# Let's take a look at the first 6 items of `zebrafish_genes`
-head(zebrafish_genes)
-## [1] "ENSMUSG00000000001" "ENSMUSG00000000003" "ENSMUSG00000000028"
-## [4] "ENSMUSG00000000031" "ENSMUSG00000000037" "ENSMUSG00000000049"
+
+Ensembl gene identifiers have different species-specific prefixes. In zebrafish, Ensembl gene identifiers begin with ENSDARG (in human, ENSG, etc.).
Now let’s do the mapping!
We’re interested in the human_symbol, zebrafish_ensembl, and support columns specifically. The support column contains a list of associated databases that support each assertion. This column may assist with addressing some of the multi-mappings that we will talk about later.
human_zebrafish_key <- zebrafish %>%
- # We'll want to subset zebrafish to only the columns we're interested in
- dplyr::select(zebrafish_ensembl, human_symbol, support)
-
-# Since we ignored the additional columns in `zebrafish`, let's check to see if
-# we have any duplicates in our `human_zebrafish_key`
-any(duplicated(human_zebrafish_key))
+human_zebrafish_key <- zebrafish %>%
+ # Reduce the zebrafish table to only the columns we're interested in
+ dplyr::select(zebrafish_ensembl, human_symbol, support)
+
+# Since we ignored the additional columns in `zebrafish`, let's check to see if
+# we have any duplicates in our `human_zebrafish_key`
+any(duplicated(human_zebrafish_key))
## [1] TRUE
-We do have duplicates! We don’t want to handle duplicate data, so let’s remove those duplicates before moving forward.
-human_zebrafish_key <- human_zebrafish_key %>%
- # We need to use the `distinct()` function to remove duplicates resulted from
- # ignoring the additional columns in the `zebrafish` object
- dplyr::distinct()
-Now let’s join the mapped data from human_zebrafish_key with the gene data in zebrafish_genes.
# First, we need to convert our vector of zebrafish genes into a data frame
-human_zebrafish_mapped_df <- data.frame("Gene" = zebrafish_genes) %>%
- # Now we can join the mapped data
- dplyr::left_join(human_zebrafish_key, by = c("Gene" = "zebrafish_ensembl"))
+We do have duplicates! Let’s remove those duplicates before moving forward, as they provide no extra information at this point.
+human_zebrafish_key <- human_zebrafish_key %>%
+ # Use the `distinct()` function to remove duplicates resulting from
+ # dropping the additional columns in the `zebrafish` data frame
+ dplyr::distinct()
Now let’s join the mapped data from human_zebrafish_key with the gene data in zebrafish_genes. We are using a “left join” here so that we get at least one row per zebrafish gene, even if there is no matching human symbol in the mapping table.
human_zebrafish_mapped_df <- zebrafish_genes %>%
+ # Now we can join the mapped data
+ dplyr::left_join(human_zebrafish_key, by = c("Gene" = "zebrafish_ensembl"))
Here’s what the new data frame looks like:
-head(human_zebrafish_mapped_df, n = 25)
+
Looks like we have mapped symbols!
+So now we have all the zebrafish genes mapped to human, but there might be places where there are multiple zebrafish genes that are orthologous to the same human gene, or vice versa.
Let’s get a summary of the human gene symbols returned in our mapped data frame, human_zebrafish_mapped_df.
# We can use this `count()` function after `group_by()`to get a count of how many
-# `zebrafish_ensembl` IDs there are per `human_symbol`
-human_zebrafish_mapped_df %>%
- dplyr::group_by(human_symbol) %>%
- dplyr::count() %>%
- # Sort by highest `n` which would be the human gene symbol with the most
- # mapped zebrafish Ensembl IDs
- dplyr::arrange(desc(n))
+# We can use the `count()` function to get a tally of how many
+# `zebrafish_ensembl` IDs there are per `human_symbol`
+human_zebrafish_mapped_df %>%
+ # Remove the support column
+ dplyr::select(Gene, human_symbol) %>%
+ # Remove any remaining duplicates
+ dplyr::distinct() %>%
+ # Count the number of rows per human gene
+ dplyr::count(human_symbol) %>%
+ # Sort by highest `n` which will be the human gene symbol with the most
+ # mapped zebrafish Ensembl IDs
+ dplyr::arrange(desc(n))
Looks like we have mapped symbols!
-Now, let’s get an idea of how many zebrafish Ensembl IDs we have that were not mapped to human gene symbols.
-sum(is.na(human_zebrafish_mapped_df$human_symbol))
-## [1] 17918
-We have 463 NAs, which means we have 463 zebrafish Ensembl IDs that were not mapped to human gene symbols. This is okay because we do not expect everything to map across species.
+There are certainly a good number of places where we mapped multiple zebrafish Ensembl IDs to the same human symbol! We’ll look at this in a bit.
+We can also see that there are 738 zebrafish Ensembl IDs that did not map to a human symbol. These are the ones with a value of NA. This is okay because we do not expect everything to map neatly across species.
If a zebrafish Ensembl gene ID maps to multiple human symbols, the associated values will get duplicated. Let’s look at the ENSDARG00000069142 example below.
human_zebrafish_mapped_df %>%
- dplyr::filter(Gene == "ENSDARG00000069142")
+If a zebrafish Ensembl gene ID maps to multiple human symbols, the associated Ensembl ID values will get duplicated in our output data. Let’s look at the ENSDARG00000069142 example below.
On the other hand, if you were to look at the original data associated to the zebrafish Ensembl IDs, when a human gene symbol maps to multiple zebrafish Ensembl IDs, the values will not get duplicated, but you will have multiple rows associated with that human symbol. Let’s look at the MATR3 example below.
human_zebrafish_mapped_df %>%
- dplyr::filter(human_symbol == "MATR3")
+On the other hand, if you were to look at the original data associated with the zebrafish Ensembl IDs, when a human gene symbol maps to multiple zebrafish Ensembl IDs, the Ensembl IDs will not get duplicated, but you will have multiple rows associated with that human symbol. Let’s look at the MATR3 example below.
We can see that we have multiple zebrafish Ensembl IDs that mapped to the same gene. (Notice that we also still have some duplicate zebrafish Ensembl ID/human symbol pairs here because the support column was different in the original data set! This is why we removed that column before counting above.)
Remember that if a zebrafish Ensembl gene ID maps to multiple human symbols, the values get duplicated. We can collapse the multi-mapped values into a list for each Ensembl ID so as not to have duplicate values in our data frame.
In the next chunk, we show how we can collapse all the human gene symbols into one column where we store them all for each unique zebrafish Ensembl ID.
-collapsed_human_symbol_df <- human_zebrafish_mapped_df %>%
- # Group by zebrafish Ensembl IDs
- dplyr::group_by(Gene) %>%
- # Collapse the mapped values in `human_zebrafish_mapped_df` into one column named
- # `all_human_symbols` -- note that we will lose the `support` column in this summarizing step
- dplyr::summarize(all_human_symbols = paste(human_symbol, collapse = ";"))
+collapsed_human_symbol_df <- human_zebrafish_mapped_df %>%
+ # Group by zebrafish Ensembl IDs
+ dplyr::group_by(Gene) %>%
+  # Collapse the mapped values in `human_zebrafish_mapped_df` to an
+ # `all_human_symbols` column, removing any duplicated human symbols
+ # note that we will lose the `support` column in this summarizing step
+ dplyr::summarize(
+ # combine unique symbols with semicolons between them
+ all_human_symbols = paste(
+ sort(unique(human_symbol)),
+ collapse = ";"
+ )
+ )
## `summarise()` ungrouping output (override with `.groups` argument)
-head(collapsed_human_symbol_df)
+
Now let’s write our list of human gene symbols for each unique zebrafish Ensembl ID to file.
-readr::write_tsv(
- collapsed_human_symbol_df,
- file.path(
- results_dir,
- # Replace with a relevant output file name
- "GSE71270_zebrafish_ensembl_to_collapsed_human_gene_symbol.tsv"
- )
-)
+
Since multiple zebrafish Ensembl gene IDs map to the same human symbol, we may want to identify which one of these mappings represents the “true” ortholog, i.e. which zebrafish gene is most similar to the human gene we are interested in. This is not at all straightforward! (see this paper for just one example) (Stamboulian et al. 2020). Gene duplications along the zebrafish lineage may result in complicated relationships among genes, especially with regard to divisions of function.
-Simply combining values across zebrafish transcripts using an average may result in the loss of a lot of data and will likely not be representative of the zebrafish biology. One thing we might do to make the problem somewhat simpler is to reduce the number of multi-mapped genes by requiring a certain level of support for each mapping from across the various databases included in HCOP. This will not fully solve the problem (and may not even be desirable in some cases), but we present it here as an example of an approach one might take.
Therefore, we will use the support column to decide which mappings to retain. Let’s take a look at support.
head(human_zebrafish_mapped_df$support)
-## [1] NA NA NA NA NA NA
-Looks like we have a variety of databases for multiple mappings, but we do have some instances of only one database reported in support of the mapping. As we noted earlier, an orthology prediction where more than one of the databases concur would be considered reliable. Therefore, where we have multi-mapped zebrafish Ensembl gene IDs, we will take the mappings with more than one database to support the assertion.
+Since multiple zebrafish Ensembl gene IDs map to the same human symbol, we may want to identify which one of these mappings represents the “true” ortholog, i.e. which zebrafish gene is most similar to the human gene we are interested in. This is not at all straightforward! (see this paper for just one example) (Stamboulian et al. 2020). Gene duplications along the zebrafish lineage may result in complicated relationships among genes, especially with regard to divisions of function.
+Simply combining expression values across zebrafish transcripts that correspond to the same human gene using an average or other summary statistic may result in the loss of a lot of data and will likely not be representative of the zebrafish biology. One thing we might do to make the problem somewhat simpler is to reduce the number of multi-mapped genes by requiring a certain level of support for each mapping from across the various databases included in HCOP. This will not fully solve the problem (and may not even be desirable in some cases), but we present it here as an example of an approach one might take.
+To do this, we will use the support column to decide which mappings to retain. Let’s take a look at support.
+## [1] "Inparanoid,PhylomeDB,HomoloGene,Ensembl,OMA,NCBI,ZFIN,OrthoMCL,Panther,OrthoDB"
+## [2] "Inparanoid,EggNOG,HomoloGene,Treefam,Ensembl,OMA,NCBI,ZFIN,OrthoMCL,Panther,OrthoDB"
+## [3] "HomoloGene,Treefam,Ensembl,ZFIN,Panther,OrthoDB"
+## [4] "OrthoDB"
+## [5] "Inparanoid,HomoloGene,EggNOG,Treefam,NCBI,ZFIN,Panther"
+## [6] "OrthoMCL"
+Looks like we have a variety of databases for multiple mappings, but we do have some instances of only one database reported in support of the mapping. As we noted earlier, an orthology prediction where more than one of the databases concur would be considered more reliable. Therefore, where we have multi-mapped zebrafish Ensembl gene IDs, we will retain the mappings with more than one database to support the assertion.
Before we do, let’s take a look at how many multi-mapped genes there are in the data frame.
-human_zebrafish_mapped_df %>%
- # Group by human gene symbols
- dplyr::group_by(human_symbol) %>%
- # Count the number of rows in the dataframe for each symbol
- dplyr::count() %>%
- # Filter out the symbols without multimapped genes
- dplyr::filter(n > 1)
+human_zebrafish_mapped_df %>%
+ # Remove the `support` column
+ dplyr::select(Gene, human_symbol) %>%
+ # Remove any remaining duplicates
+ dplyr::distinct() %>%
+ # Count the number of rows in the dataframe for each symbol
+ dplyr::count(human_symbol) %>%
+ # Filter out the symbols without multimapped genes
+ dplyr::filter(n > 1)
-Looks like we have 4,192 human gene symbols with multiple mappings.
+Looks like we have 4,169 human gene symbols with multiple mappings.
Now let’s filter out the less reliable mappings.
-filtered_zebrafish_ensembl_df <- human_zebrafish_mapped_df %>%
- # Count the number of databases in the support column for each prediction
- dplyr::mutate(n_databases = stringr::str_count(support, ",") + 1) %>%
- # Group by human gene symbols
- dplyr::group_by(human_symbol) %>%
- # Now filter for the rows with more than one database in support for each human gene symbol
- dplyr::filter(n_databases > 1)
-
-head(filtered_zebrafish_ensembl_df)
+filtered_zebrafish_ensembl_df <- human_zebrafish_mapped_df %>%
+ # Count the number of databases in the support column
+ # by using the number of commas that separate the databases
+ dplyr::mutate(n_databases = stringr::str_count(support, ",") + 1) %>%
+ # Now filter to the rows where more than one database supports the mapping
+ dplyr::filter(n_databases > 1)
+
+head(filtered_zebrafish_ensembl_df)
Let’s count how many multi-mapped genes we have now.
-filtered_zebrafish_ensembl_df %>%
- # Group by human gene symbols
- dplyr::group_by(human_symbol) %>%
- # Count the number of rows in the dataframe for each symbol
- dplyr::count() %>%
- # Filter out the symbols without multimapped genes
- dplyr::filter(n > 1)
+filtered_zebrafish_ensembl_df %>%
+ # Remove the support column
+ dplyr::select(Gene, human_symbol) %>%
+ # Remove any remaining duplicates
+ dplyr::distinct() %>%
+ # Count the number of rows in the dataframe for each symbol
+ dplyr::count(human_symbol) %>%
+ # Filter to the symbols with multimapped genes
+ dplyr::filter(n > 1)
-Now we only have 1,803 multi-mapped genes, compared to the 4,192 that we began with. Although we haven’t filtered down to zero multi-mapped genes, we have hopefully removed some of the less reliable mappings.
+Now we only have 1,695 multi-mapped genes, compared to the 4,169 that we began with. Although we haven’t filtered down to zero multi-mapped genes, we have hopefully removed some of the less reliable mappings.
Now let’s write our filtered_zebrafish_ensembl_df
object, with the reliable zebrafish Ensembl IDs for each unique human gene symbol, to file.
readr::write_tsv(
- filtered_zebrafish_ensembl_df,
- file.path(
- results_dir,
- # Replace with a relevant output file name
- "GSE71270_filtered_zebrafish_ensembl_to_human_gene_symbol.tsv"
- )
-)
+
At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.
-# Print session info
-sessioninfo::session_info()
-## ─ Session info ───────────────────────────────────────────────────────────────
+
+## ─ Session info ─────────────────────────────────────────────────────
## setting value
## version R version 4.0.2 (2020-06-22)
## os Ubuntu 20.04 LTS
@@ -3345,13 +4205,13 @@ 6 Session info
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
-## date 2020-10-16
+## date 2020-12-21
##
-## ─ Packages ───────────────────────────────────────────────────────────────────
+## ─ Packages ─────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.0)
## backports 1.1.10 2020-09-15 [1] RSPM (R 4.0.2)
-## cli 2.0.2 2020-02-28 [1] RSPM (R 4.0.0)
+## cli 2.1.0 2020-10-12 [1] RSPM (R 4.0.2)
## crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.0)
## digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0)
## dplyr 1.0.2 2020-08-18 [1] RSPM (R 4.0.2)
@@ -3370,23 +4230,23 @@ 6 Session info
## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0)
## pillar 1.4.6 2020-07-10 [1] RSPM (R 4.0.2)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.0)
+## ps 1.4.0 2020-10-07 [1] RSPM (R 4.0.2)
## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.0)
## R.cache 0.14.0 2019-12-06 [1] RSPM (R 4.0.0)
## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.2)
## R.oo 1.24.0 2020-08-26 [1] RSPM (R 4.0.2)
## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.2)
## R6 2.4.1 2019-11-12 [1] RSPM (R 4.0.0)
-## Rcpp 1.0.5 2020-07-06 [1] RSPM (R 4.0.2)
-## readr 1.3.1 2018-12-21 [1] RSPM (R 4.0.2)
+## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.2)
## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.0)
-## rlang 0.4.7 2020-07-09 [1] RSPM (R 4.0.2)
+## rlang 0.4.8 2020-10-08 [1] RSPM (R 4.0.2)
## rmarkdown 2.4 2020-09-30 [1] RSPM (R 4.0.2)
## rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0)
## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.0)
## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.2)
## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.0)
## styler 1.3.2 2020-02-23 [1] RSPM (R 4.0.0)
-## tibble 3.0.3 2020-07-10 [1] RSPM (R 4.0.2)
+## tibble 3.0.4 2020-10-12 [1] RSPM (R 4.0.2)
## tidyselect 1.1.0 2020-05-11 [1] RSPM (R 4.0.0)
## vctrs 0.3.4 2020-08-29 [1] RSPM (R 4.0.2)
## withr 2.3.0 2020-09-22 [1] RSPM (R 4.0.2)
@@ -3400,23 +4260,28 @@ 6 Session info
References
-Gray K. A., B. Yates, R. L. Seal, M. W. Wright, and E. A. Bruford, 2015 Genenames.org: The hgnc resources in 2015. Nucleic Acids Res 43. https://doi.org/10.1038/nature11327
+Gray K. A., B. Yates, R. L. Seal, M. W. Wright, and E. A. Bruford, 2015 Genenames.org: The HGNC resources in 2015. Nucleic Acids Research 43. https://doi.org/10.1038/nature11327
-HGNC team, 2020 HCOP help
+HGNC Team, 2020 HCOP help. https://www.genenames.org/help/hcop/
Stamboulian M., R. F. Guerrero, M. W. Hahn, and P. Radivojac, 2020 The ortholog conjecture revisited: The value of orthologs and paralogs in function prediction. Bioinformatics 36: i219–i226. https://doi.org/10.1093/bioinformatics/btaa468
-Tregnago C., E. Manara, M. Zampini, V. Bisio, and C. Borga et al., 2016 CREB engages C/EBPδ to initiate leukemogenesis. Leukemia 30: 1887–1896.
+Tregnago C., E. Manara, M. Zampini, V. Bisio, and C. Borga et al., 2016 CREB engages C/EBPδ to initiate leukemogenesis. Leukemia 30: 1887–1896. https://doi.org/10.1038/leu.2016.98
-Wright M. W., T. A. Eyre, M. J. Lush, S. Povey, and E. A. Bruford, 2005 HCOP: The hgnc comparison of orthology predictions search tool. Mammalian Genome 16: 827–8. https://doi.org/10.1007/s00335-005-0103-2
+Wright M. W., T. A. Eyre, M. J. Lush, S. Povey, and E. A. Bruford, 2005 HCOP: The HGNC comparison of orthology predictions search tool. Mammalian Genome 16: 827–8. https://doi.org/10.1007/s00335-005-0103-2
-This example is one of pathway analysis module set, we recommend looking at the pathway analysis introduction to help you determine which pathway analysis method is best suited for your purposes.
-This particular example analysis shows how you can use over-representation analysis (ORA) to determine if a set of genes (e.g., those differentially expressed using some cutoff) shares more or fewer genes with gene sets/pathways than we would expect at random. This pathway analysis method does not require any particular sample size, since the only input from your dataset is a set of genes of interest (Yaari et al. 2013).
+This example is one of a set of pathway analysis modules; we recommend looking at the pathway analysis table below to help you determine which pathway analysis method is best suited for your purposes.
+This particular example analysis shows how you can use over-representation analysis (ORA) to determine if a set of genes (e.g., those differentially expressed using some cutoff) shares more or fewer genes with gene sets/pathways than we would expect by chance.
+ORA is a broadly applicable technique that may be good in scenarios where your dataset or scientific questions don’t fit the requirements of other pathway analysis methods. It also does not require any particular sample size, since the only input from your dataset is a set of genes of interest (Yaari et al. 2013).
+If you have differential expression results or something with a gene-level ranking and a two-group comparison, we recommend considering GSEA for your pathway analysis questions.
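To make the “more overlap than expected by chance” idea concrete, here is a minimal sketch of the hypergeometric test that underlies ORA. All of the counts below are hypothetical, not from this dataset:

```r
# Hypothetical counts: a background ("universe") of 13,000 genes, a pathway
# containing 70 of them, a gene list of 300 genes, and an observed overlap of
# 12 genes that are in both the list and the pathway
overlap <- 12
pathway_size <- 70
universe_size <- 13000
list_size <- 300

# One-sided hypergeometric test: the probability of an overlap of 12 or more
# if the 300 genes had been drawn from the universe at random
p_value <- phyper(
  overlap - 1, # with lower.tail = FALSE, phyper() gives P(X > q), so use 11
  pathway_size, # "successes" in the universe: genes in the pathway
  universe_size - pathway_size, # "failures": genes not in the pathway
  list_size, # number of draws: genes in our list
  lower.tail = FALSE
)

p_value
```

By chance we would expect roughly 300 * 70 / 13000 ≈ 1.6 overlapping genes, so an overlap of 12 gives a very small p-value; an ORA tool runs this kind of test for every pathway and then adjusts the p-values for multiple testing.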
⬇️ Jump to the analysis code ⬇️
+Pathway analysis refers to any one of many techniques that use predetermined sets of genes that are related or coordinated in their expression in some way (e.g., participate in the same molecular process, are regulated by the same transcription factor) to interpret a high-throughput experiment. In the context of refine.bio, we use these techniques to analyze and interpret genome-wide gene expression experiments. The rationale for performing pathway analysis is that looking at the pathway level may be more biologically meaningful than considering individual genes, especially if a large number of genes are differentially expressed between conditions of interest. In addition, many relatively small changes in the expression values of genes in the same pathway could lead to a phenotypic outcome, and these small changes may go undetected in differential gene expression analysis.
+We highly recommend taking a look at Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges from Khatri et al. (2012) for a more comprehensive overview. We have provided primary publications and documentation of the methods we will introduce below as well as some recommended reading in the Resources for further learning
section.
This table summarizes the pathway analyses examples in this module.
| Analysis | What is required for input | What output looks like | ✅ Pros | ⚠️ Cons |
|---|---|---|---|---|
| ORA (Over-representation Analysis) | A list of gene IDs (no stats needed) | A per-pathway hypergeometric test result | Simple; computationally inexpensive to calculate p-values | Requires arbitrary thresholds and ignores any statistics associated with a gene; assumes independence of genes and pathways |
| GSEA (Gene Set Enrichment Analysis) | A list of gene IDs with gene-level summary statistics | A per-pathway enrichment score | Includes all genes (no arbitrary threshold!); attempts to measure coordination of genes | Permutations can be expensive; does not account for pathway overlap; two-group comparisons not always appropriate/feasible |
| GSVA (Gene Set Variation Analysis) | A gene expression matrix (like what you get from refine.bio directly) | Pathway-level scores on a per-sample basis | Does not require two groups to compare upfront; normally distributed scores | Scores are not a good fit for gene sets that contain genes that go up AND down; the method doesn’t assign statistical significance itself; recommended sample size n > 10 |
For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.
.Rmd
fileTo run this example yourself, download the .Rmd
for this analysis by clicking this link.
Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd
file to where you would like this example and its files to be stored.
You can open this .Rmd
file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd
files.)
.Rmd
Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!
If you have trouble running this chunk, see our introduction to using .Rmd
s for more resources and explanations.
# Create the data folder if it doesn't exist
-if (!dir.exists("data")) {
- dir.create("data")
-}
-
-# Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
-
-# Create the plots folder if it doesn't exist
-if (!dir.exists(plots_dir)) {
- dir.create(plots_dir)
-}
-
-# Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
-
-# Create the results folder if it doesn't exist
-if (!dir.exists(results_dir)) {
- dir.create(results_dir)
-}
+# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+ dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots"
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+ dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results"
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+ dir.create(results_dir)
+}
In the same place you put this .Rmd
file, you should now have three new empty folders called data
, plots
, and results
!
In this example, we are using a differential expression results table that we obtained from an example limma
 analysis (Ritchie et al. 2015) of a zebrafish experiment overexpressing human CREB. The table contains Ensembl gene IDs, log fold-changes, and adjusted p-values (FDR in this case).
We have provided this file for you and the code in this notebook will read in the results that are stored online, but if you’d like to follow the steps for obtaining this results file yourself, we suggest going through that differential expression analysis example.
-For this example analysis, we will use this CREB overexpression zebrafish experiment (Tregnago et al. 2016). Tregnago et al. (2016) measured microarray gene expression of zebrafish samples overexpressing human CREB, as well as control samples.
+For this example analysis, we will use this CREB overexpression zebrafish experiment (Tregnago et al. 2016). Tregnago et al. (2016) used microarrays to measure gene expression of ten zebrafish samples: five overexpressing human CREB and five control samples.
If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/
directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/
and results/
directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.
The file we use here has several columns of differential expression summary statistics. If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend replacing the dge_results_file
 with a different file path to read in a similar table of genes with the information that you are interested in. If your gene table differs, many steps will need to be changed or deleted entirely depending on the contents of that file (particularly in the Determine our genes of interest list
section).
We suggest saving plots and results to plots/
and results/
directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.
See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.
-In this analysis, we will be using clusterProfiler
package to perform ORA and the msigdbr
package which contains gene sets from the Molecular Signatures Database (MSigDB) already in the tidy format required by clusterProfiler
(Dolgalev 2020; Subramanian et al. 2005).
We will also need the org.Dr.eg.db
package to perform gene identifier conversion and ggupset
to make an UpSet plot (Carlson 2019; Ahlmann-Eltze 2020).
if (!("clusterProfiler" %in% installed.packages())) {
- # Install this package if it isn't installed yet
- BiocManager::install("clusterProfiler", update = FALSE)
-}
-
-# This is required to make one of the plots that clusterProfiler will make
-if (!("ggupset" %in% installed.packages())) {
- # Install this package if it isn't installed yet
- BiocManager::install("ggupset", update = FALSE)
-}
-
-if (!("msigdbr" %in% installed.packages())) {
- # Install this package if it isn't installed yet
- BiocManager::install("msigdbr", update = FALSE)
-}
-
-if (!("org.Dr.eg.db" %in% installed.packages())) {
- # Install this package if it isn't installed yet
- BiocManager::install("org.Dr.eg.db", update = FALSE)
-}
+In this analysis, we will be using the clusterProfiler
package to perform ORA and the msigdbr
package which contains gene sets from the Molecular Signatures Database (MSigDB) already in the tidy format required by clusterProfiler
(Yu et al. 2012; Dolgalev 2020; Subramanian et al. 2005; Liberzon et al. 2011).
We will also need the org.Dr.eg.db
package to perform gene identifier conversion and ggupset
to make an UpSet plot (Carlson 2019; Ahlmann-Eltze 2020).
if (!("clusterProfiler" %in% installed.packages())) {
+ # Install this package if it isn't installed yet
+ BiocManager::install("clusterProfiler", update = FALSE)
+}
+
+# This is required to make one of the plots that clusterProfiler will make
+if (!("ggupset" %in% installed.packages())) {
+ # Install this package if it isn't installed yet
+ BiocManager::install("ggupset", update = FALSE)
+}
+
+if (!("msigdbr" %in% installed.packages())) {
+ # Install this package if it isn't installed yet
+ BiocManager::install("msigdbr", update = FALSE)
+}
+
+if (!("org.Dr.eg.db" %in% installed.packages())) {
+ # Install this package if it isn't installed yet
+ BiocManager::install("org.Dr.eg.db", update = FALSE)
+}
Attach the packages we need for this analysis.
-# Attach the library
-library(clusterProfiler)
-##
-## clusterProfiler v3.16.1 For help: https://guangchuangyu.github.io/software/clusterProfiler
-##
-## If you use clusterProfiler in published research, please cite:
-## Guangchuang Yu, Li-Gen Wang, Yanyan Han, Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.
-##
-## Attaching package: 'clusterProfiler'
-## The following object is masked from 'package:stats':
-##
-## filter
-# Package that contains MSigDB gene sets in tidy format
-library(msigdbr)
-
-# Danio rerio annotation package we'll use for gene identifier conversion
-library(org.Dr.eg.db)
-## Loading required package: AnnotationDbi
-## Loading required package: stats4
-## Loading required package: BiocGenerics
-## Loading required package: parallel
-##
-## Attaching package: 'BiocGenerics'
-## The following objects are masked from 'package:parallel':
-##
-## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
-## clusterExport, clusterMap, parApply, parCapply, parLapply,
-## parLapplyLB, parRapply, parSapply, parSapplyLB
-## The following objects are masked from 'package:stats':
-##
-## IQR, mad, sd, var, xtabs
-## The following objects are masked from 'package:base':
-##
-## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
-## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
-## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
-## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
-## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
-## union, unique, unsplit, which, which.max, which.min
-## Loading required package: Biobase
-## Welcome to Bioconductor
-##
-## Vignettes contain introductory material; view with
-## 'browseVignettes()'. To cite Bioconductor, see
-## 'citation("Biobase")', and for packages 'citation("pkgname")'.
-## Loading required package: IRanges
-## Loading required package: S4Vectors
-##
-## Attaching package: 'S4Vectors'
-## The following object is masked from 'package:clusterProfiler':
-##
-## rename
-## The following object is masked from 'package:base':
-##
-## expand.grid
-##
-## Attaching package: 'IRanges'
-## The following object is masked from 'package:clusterProfiler':
-##
-## slice
-##
-## Attaching package: 'AnnotationDbi'
-## The following object is masked from 'package:clusterProfiler':
-##
-## select
-##
-# We will need this so we can use the pipe: %>%
-library(magrittr)
+
We will read in the differential expression results we will download from online. These results are from a zebrafish microarray experiment we used for differential expression analysis for two groups using limma
(Ritchie et al. 2015). The table contains Ensembl gene IDs, log fold-changes for each group, and adjusted p-values (FDR in this case). We can identify differentially regulated genes by filtering these results and use this list as input to ORA.
For ORA, we only need a list of gene IDs as our input, so this example can work for any situation where you have a gene list and want to know more about what biological pathways it shares genes with.
+For this example, we will read in results from a differential expression analysis that we have already performed. Rather than reading from a local file, we will download the results table directly from a URL. These results are from a zebrafish microarray experiment we used for differential expression analysis for two groups using limma
(Ritchie et al. 2015). The table contains Ensembl gene IDs, log fold-changes for each group, and adjusted p-values (FDR in this case). We can identify differentially regulated genes by filtering these results and use this list as input to ORA.
Instead of using the URL below, you can use a file path to a TSV file with your desired gene list results. First we will assign the URL to its own variable called dge_url.
# Define the url to your differential expression results file
-dge_url <- "https://refinebio-examples.s3.us-east-2.amazonaws.com/02-microarray/results/GSE71270/GSE71270_limma_results.tsv"
-Read in the file that has differential expression results. Here we are using the URL we set up above, but this can be a local file path instead i.e. you can replace dge_url
in the code below with a path to file you have on your computer like: file.path("results", "GSE71270_limma_results.tsv")
.
# Read in the contents of your differential expression results file
-# `dge_url` can be replaced with a file path to a TSV file with your
-# desired gene list results
-dge_df <- readr::read_tsv(dge_url)
-## Parsed with column specification:
+# Define the url to your differential expression results file
+dge_url <- "https://refinebio-examples.s3.us-east-2.amazonaws.com/02-microarray/results/GSE71270/GSE71270_limma_results.tsv"
+We will also declare a file path where we want this file to be downloaded, and we can use the same file path later when reading the file into R.
+
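The collapsed chunk here presumably defines that file path; a minimal sketch of what it might look like, where the file name is assumed from the URL above and results_dir comes from the folder-setup chunk earlier in the notebook:

```r
# Assumed to have been defined by the folder-setup chunk earlier in the notebook
results_dir <- "results"

# Declare where the downloaded results file should live; the file name is
# assumed from the URL above, so adjust it if you use a different dataset
dge_results_file <- file.path(
  results_dir,
  "GSE71270_limma_results.tsv"
)
```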
+Using the URL (dge_url
) and file path (dge_results_file
) we can download the file and use the destfile
argument to specify where it should be saved.
+download.file(
+ dge_url,
+ # The file will be saved to this location and with this name
+ destfile = dge_results_file
+)
+Now let’s double check that the results file is in the right place.
+
+## [1] TRUE
+
Read in the file that has differential expression results.
+# Read in the contents of the differential expression results file
+dge_df <- readr::read_tsv(dge_results_file)
##
+## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────
## cols(
## Gene = col_character(),
## logFC = col_double(),
@@ -3144,336 +3983,404 @@ 4.2 Import data
## adj.P.Val = col_double(),
## B = col_double()
## )
-read_tsv()
can read TSV files online and doesn’t necessarily require you download the file first. Let’s take a look at what these contrast results from the differential expression analysis look like.
dge_df
+Note that read_tsv()
can also read TSV files directly from a URL and doesn’t necessarily require you download the file first. If you wanted to use that feature, you could replace the call above with readr::read_tsv(dge_url)
and skip the download steps.
Let’s take a look at what these results from the differential expression analysis look like.
+clusterProfiler
’s optionsLet’s take a look at what organisms the package supports.
-msigdbr_species()
+msigdbr
The Molecular Signatures Database (MSigDB) is a resource that contains annotated gene sets that can be used for pathway or gene set analyses (Subramanian et al. 2005; Liberzon et al. 2011). We can use the msigdbr
package to access these gene sets in a format compatible with the package we’ll use for analysis, clusterProfiler
(Yu et al. 2012; Dolgalev 2020).
The gene sets available directly from MSigDB are applicable to human studies. msigdbr
also supports commonly studied model organisms.
Let’s take a look at what organisms the package supports with msigdbr_species()
.
The data we’re interested in here comes from zebrafish samples, so we can obtain just the gene sets relevant to D. rerio with the species
argument to msigdbr()
.
dr_msigdb_df <- msigdbr(species = "Danio rerio")
-MSigDB contains 8 different gene set collections (Subramanian et al. 2005).
-H: hallmark gene sets
-C1: positional gene sets
-C2: curated gene sets
-C3: motif gene sets
-C4: computational gene sets
-C5: GO gene sets
-C6: oncogenic signatures
-C7: immunologic signatures
-In this example, we will use pathways that are gene sets considered to be “canonical representations of a biological process compiled by domain experts” and are a subset of C2: curated gene sets
(Subramanian et al. 2005).
Specifically, we will use the KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways (Kanehisa and Goto 2000).
-First, let’s take a look at what information is included in this data frame.
-head(dr_msigdb_df)
+The data we’re interested in here comes from zebrafish samples, so we can obtain only the gene sets relevant to D. rerio with the species
argument to msigdbr()
.
MSigDB contains 8 different gene set collections (Subramanian et al. 2005; Liberzon et al. 2011) that are distinguished by how they are derived (e.g., computationally mined, curated). In this example, we will use pathways that are gene sets considered to be “canonical representations of a biological process compiled by domain experts” and are a subset of C2: curated gene sets
(Subramanian et al. 2005; Liberzon et al. 2011).
Specifically, we will use the KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways (Kanehisa and Goto 2000).
+First, let’s take a look at what information is included in the data frame returned by msigdbr()
.
We will need to use gs_cat
and gs_subcat
columns to construct a filter step that will only keep curated gene sets and KEGG pathways.
# Filter the zebrafish data frame to the KEGG pathways that are included in the
-# curated gene sets
-dr_kegg_df <- dr_msigdb_df %>%
- dplyr::filter(
- gs_cat == "C2", # This is to filter only to the C2 curated gene sets
- gs_subcat == "CP:KEGG" # This is because we only want KEGG pathways
- )
-Note: We could have specified that we wanted the KEGG gene sets using the category
and subcategory
arguments of msigdbr()
, but we’re going for general steps! – use ?msigdbr
to see more information.
# Filter the zebrafish data frame to the KEGG pathways that are included in the
+# curated gene sets
+dr_kegg_df <- dr_msigdb_df %>%
+ dplyr::filter(
+ gs_cat == "C2", # This is to filter only to the C2 curated gene sets
+ gs_subcat == "CP:KEGG" # This is because we only want KEGG pathways
+ )
The function from the clusterProfiler
 package that we will use requires a data frame with two columns: one column containing the term identifier or name, and one column containing gene identifiers that match the gene list we want to check for enrichment.
Our data frame with KEGG terms contains Entrez IDs and gene symbols.
In our differential expression results data frame, dge_df
we have Ensembl gene identifiers. So we will need to convert our Ensembl IDs into either gene symbols or Entrez IDs.
We’re going to convert our identifiers in dge_df
to gene symbols because they are a bit more human readable, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers!
The annotation package org.Dr.eg.db
contains information for different identifiers (Carlson 2019). org.Dr.eg.db
is specific to Danio rerio – this is what the Dr
in the package name is referencing.
Take a look at our other gene identifier conversion examples for examples with different species and gene ID types: the microarray example and the RNA-seq example.
We can see what types of IDs are available to us in an annotation package with keytypes()
.
keytypes(org.Dr.eg.db)
-## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
-## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
-## [11] "GO" "GOALL" "IPI" "ONTOLOGY" "ONTOLOGYALL"
-## [16] "PATH" "PFAM" "PMID" "PROSITE" "REFSEQ"
+
+## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT"
+## [5] "ENSEMBLTRANS" "ENTREZID" "ENZYME" "EVIDENCE"
+## [9] "EVIDENCEALL" "GENENAME" "GO" "GOALL"
+## [13] "IPI" "ONTOLOGY" "ONTOLOGYALL" "PATH"
+## [17] "PFAM" "PMID" "PROSITE" "REFSEQ"
## [21] "SYMBOL" "UNIGENE" "UNIPROT" "ZFIN"
Even though we’ll use this package to convert from Ensembl gene IDs (ENSEMBL
) to gene symbols (SYMBOL
), we could just as easily use it to convert from an Ensembl transcript ID (ENSEMBLTRANS
) to Entrez IDs (ENTREZID
).
-The function we will use to map from Ensembl gene IDs to gene symbols is called mapIds()
.
-# This returns a named vector which we can convert to a data frame, where
-# the keys (Ensembl IDs) are the names
-symbols_vector <- mapIds(org.Dr.eg.db, # Specify the annotation package
- # The vector of gene identifiers we want to map
- keys = dge_df$Gene,
- # The type of gene identifier we want returned
- column = "SYMBOL",
- # What type of gene identifiers we're starting with
- keytype = "ENSEMBL",
- # In the case of 1:many mappings, return the
- # first one. This is default behavior!
- multiVals = "first"
-)
+The function we will use to map from Ensembl gene IDs to gene symbols is called mapIds()
and comes from the AnnotationDbi
package.
+# This returns a named vector which we can convert to a data frame, where
+# the keys (Ensembl IDs) are the names
+symbols_vector <- mapIds(
+ # Replace with annotation package for the organism relevant to your data
+ org.Dr.eg.db,
+ # The vector of gene identifiers we want to map
+ keys = dge_df$Gene,
+ # Replace with the type of gene identifiers in your data
+ keytype = "ENSEMBL",
+ # Replace with the type of gene identifiers you would like to map to
+ column = "SYMBOL",
+ # In the case of 1:many mappings, return the
+ # first one. This is default behavior!
+ multiVals = "first"
+)
## 'select()' returned 1:many mapping between keys and columns
This message is letting us know that sometimes Ensembl gene identifiers will map to multiple gene symbols. In this case, it’s also possible that a gene symbol will map to multiple Ensembl IDs. For more about how to explore this, take a look at our microarray gene ID conversion example.
Let’s create a two column data frame that shows the gene symbols and their Ensembl IDs side-by-side.
-# We would like a data frame we can join to the differential expression stats
-gene_key_df <- data.frame(
- ensembl_id = names(symbols_vector),
- gene_symbol = symbols_vector,
- stringsAsFactors = FALSE
-) %>%
- # If an Ensembl gene identifier doesn't map to a gene symbol, drop that
- # from the data frame
- dplyr::filter(!is.na(gene_symbol))
+# We would like a data frame we can join to the differential expression stats
+gene_key_df <- data.frame(
+ ensembl_id = names(symbols_vector),
+ gene_symbol = symbols_vector,
+ stringsAsFactors = FALSE
+) %>%
+ # If an Ensembl gene identifier doesn't map to a gene symbol, drop that
+ # from the data frame
+ dplyr::filter(!is.na(gene_symbol))
Let’s see a preview of gene_key_df
.
-head(gene_key_df)
+
Now we are ready to add the gene_key_df
to our data frame with the differential expression stats, dge_df
. Here we’re using a dplyr::left_join()
because we only want to retain the genes that have gene symbols; this will filter out anything in our dge_df
that does not have a gene symbol when we join on the Ensembl gene identifiers.
-dge_annot_df <- gene_key_df %>%
- # Using a left join removes the rows without gene symbols because those rows
- # have already been removed in `gene_symbols_df`
- dplyr::left_join(dge_df,
- # The name of the column that contains the Ensembl gene IDs
- # in the left data frame and right data frame
- by = c("ensembl_id" = "Gene")
- )
+dge_annot_df <- gene_key_df %>%
+ # Using a left join removes the rows without gene symbols because those rows
+ # have already been removed in `gene_key_df`
+ dplyr::left_join(dge_df,
+ # The name of the column that contains the Ensembl gene IDs
+ # in the left data frame and right data frame
+ by = c("ensembl_id" = "Gene")
+ )
Let’s take a look at what this data frame looks like.
-# Print out a preview
-head(dge_annot_df)
+
Over-representation testing using clusterProfiler
is based on a hypergeometric test (Guangchuang Yu).
\(p = 1 - \displaystyle\sum_{i = 0}^{k-1}\frac{ {M \choose i}{ {N-M} \choose {n-i} } } { {N \choose n} }\)
-Where N
is the number of genes in the background distribution, M
is the number of genes in a pathway, n
is the number of genes we are interested in (our differentially expressed genes), and k
is the number of genes that overlap between the pathway and our genes of interest.
So, we will need to provide clusterProfiler with two gene lists:
Over-representation testing using clusterProfiler
is based on a hypergeometric test (often referred to as Fisher’s exact test) (Yu 2020). For more background on hypergeometric tests, this handy tutorial explains more about how hypergeometric tests work (Puthier and van Helden 2015).
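To make the formula concrete, here is a small sketch (with made-up numbers, not values from this dataset) showing how the same tail probability can be computed with base R's `phyper()`:

```r
# Made-up example numbers for illustration only
N <- 10000 # genes in the background distribution
M <- 100 # genes in the pathway
n <- 500 # genes of interest (e.g., differentially expressed genes)
k <- 15 # genes of interest that are also in the pathway

# P(overlap >= k), i.e. the 1 - sum(...) expression in the formula above
p_value <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)
p_value
```

The `enricher()` function we use below performs this same style of test for each gene set, followed by multiple testing correction.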
We will need to provide clusterProfiler with two gene lists:
Our genes of interest (n)
The background set of genes (N). (All genes that originally had an opportunity to be measured.)
We will use our differential expression results to get a genes of interest list. Let’s use our adjusted p values as a cutoff.
-# Select genes that are below a cutoff
-genes_of_interest <- dge_annot_df %>%
- # Here we want the top differentially expressed genes and we will use downregulated genes
- dplyr::filter(adj.P.Val < 0.05, logFC < -1) %>%
- # We are extracting the gene symbols as a vector
- dplyr::pull(gene_symbol)
+This step is highly variable depending on what your gene list is, what information it contains, and what your goals are. You may want to delete this next chunk entirely if you supply an already-determined list of genes, or you may need to adjust the cutoffs and column names.
+# Select genes that are below a cutoff
+genes_of_interest <- dge_annot_df %>%
+ # Here we want the top differentially expressed genes and we will use
+ # downregulated genes
+ dplyr::filter(adj.P.Val < 0.05, logFC < -1) %>%
+ # We are extracting the gene symbols as a vector
+ dplyr::pull(gene_symbol)
There are a lot of ways we could make a genes of interest list, and using a p-value cutoff for differential expression analysis is just one way you can do this.
ORA generally requires you to make some sort of arbitrary decision to obtain your genes of interest list, and this is one of the approach’s weaknesses – to get to a gene list, we’ve removed all other context.
Because one gene_symbol
may map to multiple Ensembl IDs, we need to make sure we have no repeated gene symbols in this list.
# Reduce to only unique gene symbols
-genes_of_interest <- unique(as.character(genes_of_interest))
-
-# Let's print out some of these genes
-head(genes_of_interest)
+# Reduce to only unique gene symbols
+genes_of_interest <- unique(as.character(genes_of_interest))
+
+# Let's print out some of these genes
+head(genes_of_interest)
## [1] "si:ch1073-67j19.1" "ypel3" "pdia4"
## [4] "cst14a.2" "viml" "spink2.1"
Sometimes folks consider genes from the entire genome to comprise the background, but for our microarray data, we should consider all genes that were measured as our background set. In other words, if we are unable to detect a gene, it should not be in our background set.
We can obtain our detected genes list from our data frame, dge_annot_df
(which we haven’t done filtering on).
background_set <- unique(as.character(dge_annot_df$gene_symbol))
+
enricher()
function
Now that we have our background set, our genes of interest, and our pathway information, we’re ready to run ORA using the enricher()
function.
kegg_ora_results <- enricher(
- gene = genes_of_interest, # A vector of your genes of interest
- pvalueCutoff = 0.1, # Can choose a FDR cutoff
- pAdjustMethod = "BH", # What method for multiple testing correction should we use
- universe = background_set, # A vector containing your background set genes
- # The pathway information should be a data frame with a term name or
- # identifier and the gene identifiers
- TERM2GENE = dplyr::select(
- dr_kegg_df,
- gs_name,
- gene_symbol
- )
-)
+kegg_ora_results <- enricher(
+ gene = genes_of_interest, # A vector of your genes of interest
+ pvalueCutoff = 0.1, # Can choose a FDR cutoff
+ pAdjustMethod = "BH", # Method to be used for multiple testing correction
+ universe = background_set, # A vector containing your background set genes
+ # The pathway information should be a data frame with a term name or
+ # identifier and the gene identifiers
+ TERM2GENE = dplyr::select(
+ dr_kegg_df,
+ gs_name,
+ gene_symbol
+ )
+)
Note: using enrichKEGG()
is a shortcut for doing ORA using KEGG, but the approach we covered here can be used with any gene sets you’d like!
What is returned by enricher()
? You can run View(kegg_ora_results)
or click on the object in your Environment panel.
The information we’re most likely interested in is in the results
slot. Let’s convert this into a data frame that we can write to file.
kegg_result_df <- data.frame(kegg_ora_results@result)
+
Let’s print out a sneak peek of it here and take a look at how many sets fit our cutoff of 0.1 FDR.
kegg_result_df %>%
- dplyr::filter(p.adjust < 0.1)
+
Looks like there are four KEGG sets returned as significant at FDR of 0.1
.
We can use a dot plot to visualize our significant enrichment results. The enrichplot::dotplot()
function will only plot gene sets that are significant according to the multiple testing corrected p values (in the p.adjust
column) and the pvalueCutoff
you provided in the enricher()
step.
enrich_plot <- enrichplot::dotplot(kegg_ora_results)
-
-# Print out the plot here
-enrich_plot
-
+
+## wrong orderBy parameter; set to default `orderBy = "x"`
+
+
Use ?enrichplot::dotplot
to see the help page for more about how to use this function.
This plot is arguably more useful when we have a large number of significant pathways.
Let’s save it to a PNG.
-ggplot2::ggsave(file.path(plots_dir, "GSE71270_ora_enrich_plot.png"),
- plot = enrich_plot
-)
+
## Saving 7 x 5 in image
We can use an UpSet plot to visualize the overlap between the gene sets that were returned as significant.
-enrichplot::upsetplot(kegg_ora_results)
+
See that KEGG_ANTIGEN_PROCESSING_AND_PRESENTATION
and KEGG_LYSOSOME
have all their genes in common. Gene sets or pathways aren’t independent!
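One way to inspect that overlap directly is to split the `geneID` column of the `enricher()` results, which stores each set's member genes as a single "/"-separated string. This is a sketch assuming the `kegg_result_df` data frame created above, with gene set IDs as row names:

```r
# Split the "/"-separated gene lists for two significant gene sets
antigen_genes <- unlist(strsplit(
  kegg_result_df["KEGG_ANTIGEN_PROCESSING_AND_PRESENTATION", "geneID"],
  "/"
))
lysosome_genes <- unlist(strsplit(
  kegg_result_df["KEGG_LYSOSOME", "geneID"],
  "/"
))

# The genes the two sets share
intersect(antigen_genes, lysosome_genes)
```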
Let’s also save this to a PNG.
-ggplot2::ggsave(file.path(plots_dir, "GSE71270_ora_upset_plot.png"),
- plot = enrich_plot
-)
+
## Saving 7 x 5 in image
At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.
-# Print session info
-sessionInfo()
-## R version 4.0.2 (2020-06-22)
-## Platform: x86_64-pc-linux-gnu (64-bit)
-## Running under: Ubuntu 20.04 LTS
-##
-## Matrix products: default
-## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
-##
-## locale:
-## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
-## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
-## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
-## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
-## [9] LC_ADDRESS=C LC_TELEPHONE=C
-## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
-##
-## attached base packages:
-## [1] parallel stats4 stats graphics grDevices utils datasets
-## [8] methods base
+
+## ─ Session info ─────────────────────────────────────────────────────
+## setting value
+## version R version 4.0.2 (2020-06-22)
+## os Ubuntu 20.04 LTS
+## system x86_64, linux-gnu
+## ui X11
+## language (EN)
+## collate en_US.UTF-8
+## ctype en_US.UTF-8
+## tz Etc/UTC
+## date 2020-12-21
##
-## other attached packages:
-## [1] magrittr_1.5 org.Dr.eg.db_3.11.4 AnnotationDbi_1.50.3
-## [4] IRanges_2.22.2 S4Vectors_0.26.1 Biobase_2.48.0
-## [7] BiocGenerics_0.34.0 msigdbr_7.2.1 clusterProfiler_3.16.1
-## [10] optparse_1.6.6
+## ─ Packages ─────────────────────────────────────────────────────────
+## package * version date lib source
+## AnnotationDbi * 1.52.0 2020-10-27 [1] Bioconductor
+## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.0)
+## backports 1.1.10 2020-09-15 [1] RSPM (R 4.0.2)
+## Biobase * 2.50.0 2020-10-27 [1] Bioconductor
+## BiocGenerics * 0.36.0 2020-10-27 [1] Bioconductor
+## BiocManager 1.30.10 2019-11-16 [1] RSPM (R 4.0.0)
+## BiocParallel 1.24.1 2020-11-06 [1] Bioconductor
+## bit 4.0.4 2020-08-04 [1] RSPM (R 4.0.2)
+## bit64 4.0.5 2020-08-30 [1] RSPM (R 4.0.2)
+## blob 1.2.1 2020-01-20 [1] RSPM (R 4.0.0)
+## cli 2.1.0 2020-10-12 [1] RSPM (R 4.0.2)
+## clusterProfiler * 3.18.0 2020-10-27 [1] Bioconductor
+## colorspace 1.4-1 2019-03-18 [1] RSPM (R 4.0.0)
+## cowplot 1.1.0 2020-09-08 [1] RSPM (R 4.0.2)
+## crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.0)
+## data.table 1.13.0 2020-07-24 [1] RSPM (R 4.0.2)
+## DBI 1.1.0 2019-12-15 [1] RSPM (R 4.0.0)
+## digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0)
+## DO.db 2.9 2020-12-16 [1] Bioconductor
+## DOSE 3.16.0 2020-10-27 [1] Bioconductor
+## downloader 0.4 2015-07-09 [1] RSPM (R 4.0.0)
+## dplyr 1.0.2 2020-08-18 [1] RSPM (R 4.0.2)
+## ellipsis 0.3.1 2020-05-15 [1] RSPM (R 4.0.0)
+## enrichplot 1.10.1 2020-11-14 [1] Bioconductor
+## evaluate 0.14 2019-05-28 [1] RSPM (R 4.0.0)
+## fansi 0.4.1 2020-01-08 [1] RSPM (R 4.0.0)
+## farver 2.0.3 2020-01-16 [1] RSPM (R 4.0.0)
+## fastmatch 1.1-0 2017-01-28 [1] RSPM (R 4.0.0)
+## fgsea 1.16.0 2020-10-27 [1] Bioconductor
+## generics 0.0.2 2018-11-29 [1] RSPM (R 4.0.0)
+## getopt 1.20.3 2019-03-22 [1] RSPM (R 4.0.0)
+## ggforce 0.3.2 2020-06-23 [1] RSPM (R 4.0.2)
+## ggplot2 3.3.2 2020-06-19 [1] RSPM (R 4.0.1)
+## ggraph 2.0.3 2020-05-20 [1] RSPM (R 4.0.2)
+## ggrepel 0.8.2 2020-03-08 [1] RSPM (R 4.0.2)
+## ggupset 0.3.0 2020-05-05 [1] RSPM (R 4.0.0)
+## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.2)
+## GO.db 3.12.1 2020-12-16 [1] Bioconductor
+## GOSemSim 2.16.1 2020-10-29 [1] Bioconductor
+## graphlayouts 0.7.0 2020-04-25 [1] RSPM (R 4.0.2)
+## gridExtra 2.3 2017-09-09 [1] RSPM (R 4.0.0)
+## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.0.0)
+## hms 0.5.3 2020-01-08 [1] RSPM (R 4.0.0)
+## htmltools 0.5.0 2020-06-16 [1] RSPM (R 4.0.1)
+## igraph 1.2.6 2020-10-06 [1] RSPM (R 4.0.2)
+## IRanges * 2.24.1 2020-12-12 [1] Bioconductor
+## jsonlite 1.7.1 2020-09-07 [1] RSPM (R 4.0.2)
+## knitr 1.30 2020-09-22 [1] RSPM (R 4.0.2)
+## labeling 0.3 2014-08-23 [1] RSPM (R 4.0.0)
+## lattice 0.20-41 2020-04-02 [2] CRAN (R 4.0.2)
+## lifecycle 0.2.0 2020-03-06 [1] RSPM (R 4.0.0)
+## magrittr * 1.5 2014-11-22 [1] RSPM (R 4.0.0)
+## MASS 7.3-51.6 2020-04-26 [2] CRAN (R 4.0.2)
+## Matrix 1.2-18 2019-11-27 [2] CRAN (R 4.0.2)
+## memoise 1.1.0 2017-04-21 [1] RSPM (R 4.0.0)
+## msigdbr * 7.2.1 2020-10-02 [1] RSPM (R 4.0.2)
+## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.0.0)
+## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0)
+## org.Dr.eg.db * 3.12.0 2020-12-16 [1] Bioconductor
+## pillar 1.4.6 2020-07-10 [1] RSPM (R 4.0.2)
+## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.0)
+## plyr 1.8.6 2020-03-03 [1] RSPM (R 4.0.2)
+## polyclip 1.10-0 2019-03-14 [1] RSPM (R 4.0.0)
+## ps 1.4.0 2020-10-07 [1] RSPM (R 4.0.2)
+## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.0)
+## qvalue 2.22.0 2020-10-27 [1] Bioconductor
+## R.cache 0.14.0 2019-12-06 [1] RSPM (R 4.0.0)
+## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.2)
+## R.oo 1.24.0 2020-08-26 [1] RSPM (R 4.0.2)
+## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.2)
+## R6 2.4.1 2019-11-12 [1] RSPM (R 4.0.0)
+## RColorBrewer 1.1-2 2014-12-07 [1] RSPM (R 4.0.0)
+## Rcpp 1.0.5 2020-07-06 [1] RSPM (R 4.0.2)
+## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.2)
+## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.0)
+## reshape2 1.4.4 2020-04-09 [1] RSPM (R 4.0.2)
+## rlang 0.4.8 2020-10-08 [1] RSPM (R 4.0.2)
+## rmarkdown 2.4 2020-09-30 [1] RSPM (R 4.0.2)
+## RSQLite 2.2.1 2020-09-30 [1] RSPM (R 4.0.2)
+## rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0)
+## rvcheck 0.1.8 2020-03-01 [1] RSPM (R 4.0.0)
+## S4Vectors * 0.28.1 2020-12-09 [1] Bioconductor
+## scales 1.1.1 2020-05-11 [1] RSPM (R 4.0.0)
+## scatterpie 0.1.5 2020-09-09 [1] RSPM (R 4.0.2)
+## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.0)
+## shadowtext 0.0.7 2019-11-06 [1] RSPM (R 4.0.0)
+## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.2)
+## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.0)
+## styler 1.3.2 2020-02-23 [1] RSPM (R 4.0.0)
+## tibble 3.0.4 2020-10-12 [1] RSPM (R 4.0.2)
+## tidygraph 1.2.0 2020-05-12 [1] RSPM (R 4.0.2)
+## tidyr 1.1.2 2020-08-27 [1] RSPM (R 4.0.2)
+## tidyselect 1.1.0 2020-05-11 [1] RSPM (R 4.0.0)
+## tweenr 1.0.1 2018-12-14 [1] RSPM (R 4.0.2)
+## vctrs 0.3.4 2020-08-29 [1] RSPM (R 4.0.2)
+## viridis 0.5.1 2018-03-29 [1] RSPM (R 4.0.0)
+## viridisLite 0.3.0 2018-02-01 [1] RSPM (R 4.0.0)
+## withr 2.3.0 2020-09-22 [1] RSPM (R 4.0.2)
+## xfun 0.18 2020-09-29 [1] RSPM (R 4.0.2)
+## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.0)
##
-## loaded via a namespace (and not attached):
-## [1] enrichplot_1.8.1 bit64_4.0.5 progress_1.2.2
-## [4] httr_1.4.2 RColorBrewer_1.1-2 R.cache_0.14.0
-## [7] tools_4.0.2 backports_1.1.10 R6_2.4.1
-## [10] DBI_1.1.0 colorspace_1.4-1 prettyunits_1.1.1
-## [13] tidyselect_1.1.0 gridExtra_2.3 curl_4.3
-## [16] bit_4.0.4 compiler_4.0.2 cli_2.0.2
-## [19] scatterpie_0.1.5 xml2_1.3.2 labeling_0.3
-## [22] triebeard_0.3.0 scales_1.1.1 readr_1.3.1
-## [25] ggridges_0.5.2 stringr_1.4.0 digest_0.6.25
-## [28] ggupset_0.3.0 rmarkdown_2.4 DOSE_3.14.0
-## [31] R.utils_2.10.1 pkgconfig_2.0.3 htmltools_0.5.0
-## [34] styler_1.3.2 rlang_0.4.7 rstudioapi_0.11
-## [37] RSQLite_2.2.0 gridGraphics_0.5-0 generics_0.0.2
-## [40] farver_2.0.3 jsonlite_1.7.1 BiocParallel_1.22.0
-## [43] GOSemSim_2.14.2 dplyr_1.0.2 R.oo_1.24.0
-## [46] ggplotify_0.0.5 GO.db_3.11.4 Matrix_1.2-18
-## [49] Rcpp_1.0.5 munsell_0.5.0 fansi_0.4.1
-## [52] viridis_0.5.1 lifecycle_0.2.0 R.methodsS3_1.8.1
-## [55] stringi_1.5.3 yaml_2.2.1 ggraph_2.0.3
-## [58] MASS_7.3-51.6 plyr_1.8.6 qvalue_2.20.0
-## [61] grid_4.0.2 blob_1.2.1 ggrepel_0.8.2
-## [64] DO.db_2.9 crayon_1.3.4 lattice_0.20-41
-## [67] cowplot_1.1.0 graphlayouts_0.7.0 splines_4.0.2
-## [70] hms_0.5.3 knitr_1.30 pillar_1.4.6
-## [73] fgsea_1.14.0 igraph_1.2.5 reshape2_1.4.4
-## [76] fastmatch_1.1-0 glue_1.4.2 evaluate_0.14
-## [79] downloader_0.4 BiocManager_1.30.10 data.table_1.13.0
-## [82] urltools_1.7.3 vctrs_0.3.4 tweenr_1.0.1
-## [85] gtable_0.3.0 getopt_1.20.3 purrr_0.3.4
-## [88] polyclip_1.10-0 tidyr_1.1.2 rematch2_2.1.2
-## [91] assertthat_0.2.1 ggplot2_3.3.2 xfun_0.18
-## [94] ggforce_0.3.2 europepmc_0.4 tidygraph_1.2.0
-## [97] viridisLite_0.3.0 tibble_3.0.3 rvcheck_0.1.8
-## [100] memoise_1.1.0 ellipsis_0.3.1
+## [1] /usr/local/lib/R/site-library
+## [2] /usr/local/lib/R/library
Ahlmann-Eltze C., 2020 Ggupset: Combination matrix axis for ’ggplot2’ to create ’upset’ plots.
+Ahlmann-Eltze C., 2020 ggupset: Combination matrix axis for ’ggplot2’ to create ’upset’ plots. https://github.com/const-ae/ggupset
Carlson M., 2019 Genome wide annotation for zebrafish
+Carlson M., 2019 Genome wide annotation for zebrafish. https://bioconductor.org/packages/release/data/annotation/html/org.Dr.eg.db.html
Dolgalev I., 2020 Msigdbr: MSigDB gene sets for multiple organisms in a tidy data format.
-Guangchuang Yu, ClusterProfiler: Universal enrichment tool for functional and comparative study
+Dolgalev I., 2020 msigdbr: MSigDB gene sets for multiple organisms in a tidy data format. https://cran.r-project.org/web/packages/msigdbr/index.html
Kanehisa M., and S. Goto, 2000 KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28: 27–30.
+Kanehisa M., and S. Goto, 2000 KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research 28: 27–30. https://doi.org/10.1093/nar/28.1.27
Khatri P., M. Sirota, and A. J. Butte, 2012 Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput. Biol. 8: e1002375.
+Khatri P., M. Sirota, and A. J. Butte, 2012 Ten years of pathway analysis: Current approaches and outstanding challenges. PLOS Computational Biology 8: e1002375. https://doi.org/10.1371/journal.pcbi.1002375
+Liberzon A., A. Subramanian, R. Pinchback, H. Thorvaldsdóttir, and P. Tamayo et al., 2011 Molecular signatures database (MSigDB) 3.0. Bioinformatics 27: 1739–1740. https://doi.org/10.1093/bioinformatics/btr260
+Puthier D., and J. van Helden, 2015 Statistics for Bioinformatics - Practicals - Gene enrichment statistics. https://dputhier.github.io/ASG/practicals/go_statistics_td/go_statistics_td_2015.html
Ritchie M. E., B. Phipson, D. Wu, Y. Hu, and C. W. Law et al., 2015 limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43: e47. https://doi.org/10.1093/nar/gkv007
Subramanian A., P. Tamayo, V. K. Mootha, S. Mukherjee, and B. L. Ebert et al., 2005 Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102: 15545–15550.
+Subramanian A., P. Tamayo, V. K. Mootha, S. Mukherjee, and B. L. Ebert et al., 2005 Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 102: 15545–15550. https://doi.org/10.1073/pnas.0506580102
Tregnago C., E. Manara, M. Zampini, V. Bisio, and C. Borga et al., 2016 CREB engages C/EBPδ to initiate leukemogenesis. Leukemia 30: 1887–1896.
+Tregnago C., E. Manara, M. Zampini, V. Bisio, and C. Borga et al., 2016 CREB engages C/EBPδ to initiate leukemogenesis. Leukemia 30: 1887–1896. https://doi.org/10.1038/leu.2016.98
Yaari G., C. R. Bolen, J. Thakar, and S. H. Kleinstein, 2013 Quantitative set analysis for gene expression: a method to quantify gene set differential expression including gene-gene correlations. Nucleic Acids Res. 41: e170.
+Yaari G., C. R. Bolen, J. Thakar, and S. H. Kleinstein, 2013 Quantitative set analysis for gene expression: A method to quantify gene set differential expression including gene-gene correlations. Nucleic Acids Research 41: e170. https://doi.org/10.1093/nar/gkt660
Yu G., L.-G. Wang, Y. Han, and Q.-Y. He, 2012 ClusterProfiler: An r package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 16: 284–287. https://doi.org/10.1089/omi.2011.0118
+Yu G., L.-G. Wang, Y. Han, and Q.-Y. He, 2012 clusterProfiler: An R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 16: 284–287. https://doi.org/10.1089/omi.2011.0118
+Yu G., 2020 clusterProfiler: Universal enrichment tool for functional and comparative study. http://yulab-smu.top/clusterProfiler-book/index.html
This example is one of a set of pathway analysis modules; we recommend looking at the pathway analysis table below to help you determine which pathway analysis method is best suited for your purposes.
+This particular example analysis shows how you can use Gene Set Enrichment Analysis (GSEA) to detect situations where genes in a predefined gene set or pathway change in a coordinated way between two conditions (Subramanian et al. 2005). Changes at the pathway-level may be statistically significant, and contribute to phenotypic differences, even if the changes in the expression level of individual genes are small.
+⬇️ Jump to the analysis code ⬇️
+Pathway analysis refers to any one of many techniques that uses predetermined sets of genes that are related or coordinated in their expression in some way (e.g., participate in the same molecular process, are regulated by the same transcription factor) to interpret a high-throughput experiment. In the context of refine.bio, we use these techniques to analyze and interpret genome-wide gene expression experiments. The rationale for performing pathway analysis is that looking at the pathway-level may be more biologically meaningful than considering individual genes, especially if a large number of genes are differentially expressed between conditions of interest. In addition, many relatively small changes in the expression values of genes in the same pathway could lead to a phenotypic outcome and these small changes may go undetected in differential gene expression analysis.
+We highly recommend taking a look at Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges from Khatri et al. (2012) for a more comprehensive overview. We have provided primary publications and documentation of the methods we will introduce below as well as some recommended reading in the Resources for further learning
section.
This table summarizes the pathway analyses examples in this module.
| Analysis | What is required for input | What output looks like | ✅ Pros | ⚠️ Cons |
|---|---|---|---|---|
| ORA (Over-representation Analysis) | A list of gene IDs (no stats needed) | A per-pathway hypergeometric test result | Simple; computationally inexpensive to calculate p-values | Requires arbitrary thresholds and ignores any statistics associated with a gene; assumes independence of genes and pathways |
| GSEA (Gene Set Enrichment Analysis) | A list of gene IDs with gene-level summary statistics | A per-pathway enrichment score | Includes all genes (no arbitrary threshold!); attempts to measure coordination of genes | Permutations can be expensive; does not account for pathway overlap; two-group comparisons not always appropriate/feasible |
| GSVA (Gene Set Variation Analysis) | A gene expression matrix (like what you get from refine.bio directly) | Pathway-level scores on a per-sample basis | Does not require two groups to compare upfront; normally distributed scores | Scores are not a good fit for gene sets that contain genes that go up AND down; method doesn’t assign statistical significance itself; recommended sample size n > 10 |
For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.
+.Rmd
file
To run this example yourself, download the .Rmd
for this analysis by clicking this link.
Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd
file to where you would like this example and its files to be stored.
You can open this .Rmd
file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd
files.)
Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!
+If you have trouble running this chunk, see our introduction to using .Rmd
s for more resources and explanations.
# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+ dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots"
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+ dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results"
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+ dir.create(results_dir)
+}
In the same place you put this .Rmd
file, you should now have three new empty folders called data
, plots
, and results
!
In this example, we are using a differential expression results table that we obtained from an example analysis of a zebrafish experiment overexpressing human CREB, using limma (Ritchie et al. 2015). The table contains summary statistics including Ensembl gene IDs, t-statistic values, and adjusted p-values (FDR in this case).
We have provided this file for you and the code in this notebook will read in the results that are stored online, but if you’d like to follow the steps for obtaining this results file yourself, we suggest going through that differential expression analysis example.
+For this example analysis, we will use this CREB overexpression zebrafish experiment (Tregnago et al. 2016). Tregnago et al. (2016) used microarrays to measure gene expression in ten zebrafish samples: five overexpressing human CREB and five control samples.
+Your new analysis folder should contain:
+The .Rmd you downloaded
+data (currently empty)
+plots (currently empty)
+results (currently empty)
Your example analysis folder should contain your .Rmd and three empty folders (which won’t be empty for long!).
If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.
+If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/
directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/
and results/
directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.
See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.
+In this analysis, we will be using the clusterProfiler package to perform GSEA and the msigdbr package, which contains gene sets from the Molecular Signatures Database (MSigDB) already in the tidy format required by clusterProfiler (Yu et al. 2012; Dolgalev 2020; Subramanian et al. 2005; Liberzon et al. 2011).
We will also need the org.Dr.eg.db
package to perform gene identifier conversion (Carlson 2019).
if (!("clusterProfiler" %in% installed.packages())) {
+ # Install this package if it isn't installed yet
+ BiocManager::install("clusterProfiler", update = FALSE)
+}
+
+if (!("msigdbr" %in% installed.packages())) {
+ # Install this package if it isn't installed yet
+ BiocManager::install("msigdbr", update = FALSE)
+}
+
+if (!("org.Dr.eg.db" %in% installed.packages())) {
+ # Install this package if it isn't installed yet
+ BiocManager::install("org.Dr.eg.db", update = FALSE)
+}
Attach the packages we need for this analysis.
+
+We will read in the differential expression results that we will download from online. These results are from a zebrafish microarray experiment we used for a two-group differential expression analysis using limma (Ritchie et al. 2015). The table contains summary statistics including Ensembl gene IDs, t-statistic values, and adjusted p-values (FDR in this case).
Instead of using the URL below, you can use a file path to a TSV file with your desired gene list results. First we will assign the URL to its own variable, called dge_url.
# Define the url to your differential expression results file
+dge_url <- "https://refinebio-examples.s3.us-east-2.amazonaws.com/02-microarray/results/GSE71270/GSE71270_limma_results.tsv"
We will also declare a file path where we want this file to be downloaded, and we can use the same file path later for reading the file into R.
+ +Using the URL (dge_url
) and file path (dge_results_file
) we can download the file and use the destfile
argument to specify where it should be saved.
download.file(
+ dge_url,
+ # The file will be saved to this location and with this name
+ destfile = dge_results_file
+)
Now let’s double check that the results file is in the right place.
+ +## [1] TRUE
+Read in the file that has differential expression results.
+# Read in the contents of the differential expression results file
+dge_df <- readr::read_tsv(dge_results_file)
##
+## ── Column specification ──────────────────────────────────────────────────────────────────────────────────
+## cols(
+## Gene = col_character(),
+## logFC = col_double(),
+## AveExpr = col_double(),
+## t = col_double(),
+## P.Value = col_double(),
+## adj.P.Val = col_double(),
+## B = col_double()
+## )
+Note that read_tsv()
can also read TSV files directly from a URL and doesn’t necessarily require you to download the file first. If you wanted to use that feature, you could replace the call above with readr::read_tsv(dge_url)
and skip the download steps.
Let’s take a look at what these results from the differential expression analysis look like.
+ +msigdbr
The Molecular Signatures Database (MSigDB) is a resource that contains annotated gene sets that can be used for pathway or gene set analyses (Subramanian et al. 2005; Liberzon et al. 2011). We can use the msigdbr
package to access these gene sets in a format compatible with the package we’ll use for analysis, clusterProfiler
(Yu et al. 2012; Dolgalev 2020).
The gene sets available directly from MSigDB are applicable to human studies. msigdbr
also supports commonly studied model organisms.
Let’s take a look at what organisms the package supports with msigdbr_species()
.
MSigDB contains 8 different gene set collections (Subramanian et al. 2005; Liberzon et al. 2011) that are distinguished by how they are derived (e.g., computationally mined, curated).
In this example, we will use a collection called Hallmark gene sets for GSEA (Liberzon et al. 2015). Here's an excerpt of the collection description from MSigDB:
"Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression. The hallmarks reduce noise and redundancy and provide a better delineated biological space for GSEA."
Notably, there are only 50 gene sets included in this collection. The fewer gene sets we test, the lower our multiple hypothesis testing burden.
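To make the testing-burden point concrete, here's a quick base R illustration with a made-up p-value (Bonferroni is used here only because it is easy to reason about; the GSEA step later in this notebook uses Benjamini-Hochberg): the same nominal p-value survives correction across 50 tests but not across several thousand.

```r
# Hypothetical nominal p-value, purely for illustration
p <- 0.0005

# Adjusted across 50 tests (the size of the Hallmark collection):
p.adjust(rep(p, 50), method = "bonferroni")[1] # 0.025 -> still significant

# Adjusted across 5000 tests (a large gene set collection):
p.adjust(rep(p, 5000), method = "bonferroni")[1] # capped at 1 -> not significant
```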
The data we're interested in here comes from zebrafish samples, so we can obtain only the Hallmark gene sets relevant to D. rerio by specifying category = "H" and species = "Danio rerio", respectively, to the msigdbr() function.
dr_hallmark_df <- msigdbr(
  species = "Danio rerio", # Replace with species name relevant to your data
  category = "H"
)
If you run the chunk above without specifying a category
to the msigdbr()
function, it will return all of the MSigDB gene sets for zebrafish.
Let’s preview what’s in dr_hallmark_df
.
Looks like we have a data frame of gene sets with associated gene symbols and Entrez IDs.
In our differential expression results data frame, dge_df, we have Ensembl gene identifiers, so we will need to convert our Ensembl IDs into either gene symbols or Entrez IDs for GSEA.
We're going to convert our identifiers in dge_df to Entrez IDs, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers!
The annotation package org.Dr.eg.db
contains information for different identifiers (Carlson 2019). org.Dr.eg.db
is specific to Danio rerio – this is what the Dr
in the package name is referencing.
We can see what types of IDs are available to us in an annotation package with keytypes()
.
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT"
## [5] "ENSEMBLTRANS" "ENTREZID" "ENZYME" "EVIDENCE"
## [9] "EVIDENCEALL" "GENENAME" "GO" "GOALL"
## [13] "IPI" "ONTOLOGY" "ONTOLOGYALL" "PATH"
## [17] "PFAM" "PMID" "PROSITE" "REFSEQ"
## [21] "SYMBOL" "UNIGENE" "UNIPROT" "ZFIN"
Even though we'll use this package to convert from Ensembl gene IDs (ENSEMBL) to Entrez IDs (ENTREZID), we could just as easily use it to convert from an Ensembl transcript ID (ENSEMBLTRANS) to gene symbols (SYMBOL).
Take a look at our other gene identifier conversion examples for different species and gene ID types: the microarray example and the RNA-seq example.
The function we will use to map from Ensembl gene IDs to Entrez gene IDs is called mapIds() and comes from the AnnotationDbi package.
Let's create a data frame that shows the mapped Entrez IDs along with the differential expression stats for the respective Ensembl IDs.
# First let's create a mapped data frame we can join to the differential
# expression stats
dge_mapped_df <- data.frame(
  entrez_id = mapIds(
    # Replace with annotation package for the organism relevant to your data
    org.Dr.eg.db,
    keys = dge_df$Gene,
    # Replace with the type of gene identifiers in your data
    keytype = "ENSEMBL",
    # Replace with the type of gene identifiers you would like to map to
    column = "ENTREZID",
    # This will keep only the first mapped value for each Ensembl ID
    multiVals = "first"
  )
) %>%
  # If an Ensembl gene identifier doesn't map to an Entrez gene identifier,
  # drop it from the data frame
  dplyr::filter(!is.na(entrez_id)) %>%
  # Make an `Ensembl` column to store the rownames
  tibble::rownames_to_column("Ensembl") %>%
  # Now let's join the rest of the expression data
  dplyr::inner_join(dge_df, by = c("Ensembl" = "Gene"))
## 'select()' returned 1:many mapping between keys and columns
This 1:many mapping between keys and columns message means that some Ensembl gene identifiers map to multiple Entrez IDs. It's also possible that an Entrez ID maps to multiple Ensembl IDs. For the purpose of performing GSEA later in this notebook, we keep only the first mapped ID. For more about how to explore this, take a look at our microarray gene ID conversion example.
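As a toy base R sketch of what multiVals = "first" does (the IDs below are made up and are not from the real annotation package):

```r
# Toy 1:many mapping with made-up IDs: one Ensembl key maps to two Entrez IDs
mapping <- list(
  ENSDARG0001 = c("111", "222"), # 1:many -- only "111" will be kept
  ENSDARG0002 = "333"
)

# `multiVals = "first"` behaves like taking the first element for each key
first_only <- vapply(mapping, function(ids) ids[[1]], character(1))
first_only # named vector: "111" for ENSDARG0001, "333" for ENSDARG0002
```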
Let's see a preview of dge_mapped_df.
The goal of GSEA is to detect situations where many genes in a gene set change in a coordinated way, even when individual changes are small in magnitude (Subramanian et al. 2005).
GSEA calculates a pathway-level metric, called an enrichment score (sometimes abbreviated as ES), by ranking genes by a gene-level statistic. This score reflects whether a gene set or pathway is overrepresented at the top or bottom of the gene rankings (Yu 2020; Subramanian et al. 2005). Specifically, genes are ranked from most positive to most negative based on their statistic, and a running sum is calculated by starting with the most highly ranked genes and increasing the score when a gene is in the pathway and decreasing the score when it is not. In this example, the enrichment score for a pathway is the running sum's maximum deviation from zero. GSEA also assesses the statistical significance of the scores for each pathway through permutation testing. As a result, each input pathway will have a p-value associated with it that is then corrected for multiple hypothesis testing (Yu 2020; Subramanian et al. 2005).
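The running-sum idea can be sketched in a few lines of base R. This is a deliberately simplified, unweighted version with made-up gene names (real GSEA weights each step by the gene-level statistic):

```r
# Genes already sorted from most positive to most negative statistic (made up)
ranked_genes <- c("g1", "g2", "g3", "g4", "g5", "g6")
gene_set <- c("g1", "g2", "g5")

in_set <- ranked_genes %in% gene_set
step_up <- 1 / sum(in_set) # increment when we hit a gene in the set
step_down <- 1 / sum(!in_set) # decrement when we hit a gene not in the set

# Walk down the ranking, accumulating the running sum
running_sum <- cumsum(ifelse(in_set, step_up, -step_down))

# The enrichment score is the maximum deviation from zero
es <- running_sum[which.max(abs(running_sum))]
es # 2/3 here: the set members cluster near the top of the ranking
```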
The implementation of GSEA we use in this example requires a gene list ordered by some statistic (here we'll use the t-statistic calculated as part of differential gene expression analysis) and input gene sets (the Hallmark collection). When you use previously computed gene-level statistics with GSEA, it is called GSEA pre-ranked.
The GSEA() function takes a pre-ranked and sorted named vector of statistics, where the names in the vector are gene identifiers. It requires unique gene identifiers to produce the most accurate results, so we will need to resolve any duplicates found in our dataset. (The GSEA() function will throw a warning if we do not do this ahead of time.)
Let's check to see if we have any Entrez IDs that mapped to multiple Ensembl IDs in our data frame of differential expression results.
## [1] TRUE
Looks like we do have duplicated Entrez IDs. Let's find out which ones.
dup_entrez_ids <- dge_mapped_df %>%
  dplyr::filter(duplicated(entrez_id)) %>%
  dplyr::pull(entrez_id)

dup_entrez_ids
## [1] "336702" "57924"
Now let's take a look at the rows associated with the duplicated Entrez IDs.
We can see that the associated values vary for each row.
As we mentioned earlier, we will want to remove duplicated gene identifiers in preparation for the GSEA() step. Let's keep the Entrez IDs associated with the higher absolute value of the t-statistic. GSEA ranks genes on the basis of a gene-level statistic, and the enrichment score reflects the degree to which genes in a gene set are overrepresented at the top or bottom of the rankings (Yu 2020; Subramanian et al. 2005).
Retaining the instance of the Entrez ID with the higher absolute value of a gene-level statistic means that we retain the value that is likely to be more highly or lowly ranked or, put another way, the value less likely to be towards the middle of the ranked gene list. We should keep this decision in mind when interpreting our results. For example, if all the duplicate identifiers happened to be in a particular gene set, we may get an overly optimistic view of how perturbed that gene set is because we preferentially selected instances of the identifier with a higher absolute value of the statistic used for ranking.
We are removing values for only two genes here, so this is unlikely to have a considerable impact on our results.
filtered_dge_mapped_df <- dge_mapped_df %>%
  # Sort so that the highest absolute values of the t-statistic are at the top
  dplyr::arrange(dplyr::desc(abs(t))) %>%
  # Filter out the duplicated rows using `dplyr::distinct()` -- this will keep
  # the first row with the duplicated value, thus keeping the row with the
  # highest absolute value of the t-statistic
  dplyr::distinct(entrez_id, .keep_all = TRUE)
Let's check to see that we removed the duplicate Entrez IDs and kept the rows with the higher absolute value of the t-statistic.
Looks like we successfully got rid of the duplicate gene identifiers and kept the observations with the higher absolute value of the t-statistic!
In the next chunk, we will create a named vector ranked based on the gene-level t-statistic values.
# Let's create a named vector ranked based on the t-statistic values
t_vector <- filtered_dge_mapped_df$t
names(t_vector) <- filtered_dge_mapped_df$entrez_id

# We need to sort the t-statistic values in descending order here
t_vector <- sort(t_vector, decreasing = TRUE)
Let's preview our pre-ranked named vector.
##   555053   140633   407728   368924   335916   323329
## 20.22172 14.48634 13.88657 12.45258 11.24450 10.92140
GSEA() function
Genes were ranked from most positive to most negative, weighted according to their gene-level statistic, in the previous section. In this section, we will implement GSEA to calculate the enrichment score for each gene set using our pre-ranked gene list.
The GSEA algorithm utilizes random sampling, so we are going to set the seed to make our results reproducible.
We can use the GSEA() function to perform GSEA with any generic set of gene sets, but there are several functions for using specific, commonly used gene sets (e.g., gseKEGG()).
gsea_results <- GSEA(
  geneList = t_vector, # Ordered ranked gene list
  minGSSize = 25, # Minimum gene set size
  maxGSSize = 500, # Maximum gene set size
  pvalueCutoff = 0.05, # p-value cutoff
  eps = 0, # Boundary for calculating the p-value
  seed = TRUE, # Set seed to make results reproducible
  pAdjustMethod = "BH", # Benjamini-Hochberg correction
  TERM2GENE = dplyr::select(
    dr_hallmark_df,
    gs_name,
    entrez_gene
  )
)
## preparing geneSet collections...
## GSEA analysis...
## leading edge analysis...
## done...
Significance is assessed by permuting the gene labels of the pre-ranked gene list and recomputing the enrichment scores of the gene set for the permuted data, which generates a null distribution for the enrichment score. The pAdjustMethod argument to GSEA() above specifies what method to use for adjusting the p-values to account for multiple hypothesis testing; the pvalueCutoff argument tells the function to only return pathways with adjusted p-values less than that threshold in the results slot.
Let's take a look at the table in the result slot of gsea_results.
Looks like we have gene sets returned as significant at an FDR (false discovery rate) of 0.05. If we did not have results that met the pvalueCutoff condition, this table would be empty.
The NES column contains the normalized enrichment score for each pathway, which normalizes for gene set size.
Let's convert the contents of result into a data frame that we can use for further analysis and write to a file later.
We can visualize GSEA results for individual pathways or gene sets using enrichplot::gseaplot(). Let's take a look at 2 different pathways – one with a highly positive NES and one with a highly negative NES – to get more insight into how enrichment scores are calculated.
Let's look at the 3 gene sets with the most positive NES.
gsea_result_df %>%
  # This returns the 3 rows with the largest NES values
  dplyr::slice_max(n = 3, order_by = NES)
The gene set HALLMARK_TNFA_SIGNALING_VIA_NFKB has the most positive NES.
most_positive_nes_plot <- enrichplot::gseaplot(
  gsea_results,
  geneSetID = "HALLMARK_TNFA_SIGNALING_VIA_NFKB",
  title = "HALLMARK_TNFA_SIGNALING_VIA_NFKB",
  color.line = "#0d76ff"
)

most_positive_nes_plot
Notice how the genes that are in the gene set, indicated by the black bars, tend to be on the left side of the graph, indicating that they have positive gene-level scores. The red dashed line indicates the enrichment score, which is the maximum deviation from zero. As mentioned earlier, an enrichment score is calculated by starting with the most highly ranked genes (according to the gene-level t-statistic values) and increasing the score when a gene is in the pathway and decreasing the score when a gene is not in the pathway.
The plots returned by enrichplot::gseaplot() are ggplots, so we can use ggplot2::ggsave() to save them to file.
Let's save to PNG.
ggplot2::ggsave(file.path(plots_dir, "GSE71270_gsea_enrich_positive_plot.png"),
  plot = most_positive_nes_plot
)
## Saving 7 x 5 in image
Let's look for the 3 gene sets with the most negative NES.
gsea_result_df %>%
  # Return the 3 rows with the smallest (most negative) NES values
  dplyr::slice_min(n = 3, order_by = NES)
The gene set HALLMARK_E2F_TARGETS has the most negative NES.
most_negative_nes_plot <- enrichplot::gseaplot(
  gsea_results,
  geneSetID = "HALLMARK_E2F_TARGETS",
  title = "HALLMARK_E2F_TARGETS",
  color.line = "#0d76ff"
)

most_negative_nes_plot
This gene set shows the opposite pattern – genes in the pathway tend to be on the right side of the graph. Again, the red dashed line indicates the maximum deviation from zero, in other words, the enrichment score. A negative enrichment score is returned when many genes in the gene set are near the bottom of the ranked list.
Let's save this plot to PNG as well.
ggplot2::ggsave(file.path(plots_dir, "GSE71270_gsea_enrich_negative_plot.png"),
  plot = most_negative_nes_plot
)
## Saving 7 x 5 in image
At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording the versions of software and packages you used to run it.
## ─ Session info ─────────────────────────────────────────────────────
## setting value
## version R version 4.0.2 (2020-06-22)
## os Ubuntu 20.04 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
## date 2020-12-21
##
## ─ Packages ─────────────────────────────────────────────────────────
## package * version date lib source
## AnnotationDbi * 1.52.0 2020-10-27 [1] Bioconductor
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.0)
## backports 1.1.10 2020-09-15 [1] RSPM (R 4.0.2)
## Biobase * 2.50.0 2020-10-27 [1] Bioconductor
## BiocGenerics * 0.36.0 2020-10-27 [1] Bioconductor
## BiocManager 1.30.10 2019-11-16 [1] RSPM (R 4.0.0)
## BiocParallel 1.24.1 2020-11-06 [1] Bioconductor
## bit 4.0.4 2020-08-04 [1] RSPM (R 4.0.2)
## bit64 4.0.5 2020-08-30 [1] RSPM (R 4.0.2)
## blob 1.2.1 2020-01-20 [1] RSPM (R 4.0.0)
## cli 2.1.0 2020-10-12 [1] RSPM (R 4.0.2)
## clusterProfiler * 3.18.0 2020-10-27 [1] Bioconductor
## colorspace 1.4-1 2019-03-18 [1] RSPM (R 4.0.0)
## cowplot 1.1.0 2020-09-08 [1] RSPM (R 4.0.2)
## crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.0)
## data.table 1.13.0 2020-07-24 [1] RSPM (R 4.0.2)
## DBI 1.1.0 2019-12-15 [1] RSPM (R 4.0.0)
## digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0)
## DO.db 2.9 2020-12-01 [1] Bioconductor
## DOSE 3.16.0 2020-10-27 [1] Bioconductor
## downloader 0.4 2015-07-09 [1] RSPM (R 4.0.0)
## dplyr 1.0.2 2020-08-18 [1] RSPM (R 4.0.2)
## ellipsis 0.3.1 2020-05-15 [1] RSPM (R 4.0.0)
## enrichplot 1.10.1 2020-11-14 [1] Bioconductor
## evaluate 0.14 2019-05-28 [1] RSPM (R 4.0.0)
## fansi 0.4.1 2020-01-08 [1] RSPM (R 4.0.0)
## farver 2.0.3 2020-01-16 [1] RSPM (R 4.0.0)
## fastmatch 1.1-0 2017-01-28 [1] RSPM (R 4.0.0)
## fgsea 1.16.0 2020-10-27 [1] Bioconductor
## generics 0.0.2 2018-11-29 [1] RSPM (R 4.0.0)
## getopt 1.20.3 2019-03-22 [1] RSPM (R 4.0.0)
## ggforce 0.3.2 2020-06-23 [1] RSPM (R 4.0.2)
## ggplot2 3.3.2 2020-06-19 [1] RSPM (R 4.0.1)
## ggraph 2.0.3 2020-05-20 [1] RSPM (R 4.0.2)
## ggrepel 0.8.2 2020-03-08 [1] RSPM (R 4.0.2)
## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.2)
## GO.db 3.12.1 2020-12-01 [1] Bioconductor
## GOSemSim 2.16.1 2020-10-29 [1] Bioconductor
## graphlayouts 0.7.0 2020-04-25 [1] RSPM (R 4.0.2)
## gridExtra 2.3 2017-09-09 [1] RSPM (R 4.0.0)
## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.0.0)
## hms 0.5.3 2020-01-08 [1] RSPM (R 4.0.0)
## htmltools 0.5.0 2020-06-16 [1] RSPM (R 4.0.1)
## igraph 1.2.6 2020-10-06 [1] RSPM (R 4.0.2)
## IRanges * 2.24.0 2020-10-27 [1] Bioconductor
## jsonlite 1.7.1 2020-09-07 [1] RSPM (R 4.0.2)
## knitr 1.30 2020-09-22 [1] RSPM (R 4.0.2)
## labeling 0.3 2014-08-23 [1] RSPM (R 4.0.0)
## lattice 0.20-41 2020-04-02 [2] CRAN (R 4.0.2)
## lifecycle 0.2.0 2020-03-06 [1] RSPM (R 4.0.0)
## magrittr * 1.5 2014-11-22 [1] RSPM (R 4.0.0)
## MASS 7.3-51.6 2020-04-26 [2] CRAN (R 4.0.2)
## Matrix 1.2-18 2019-11-27 [2] CRAN (R 4.0.2)
## memoise 1.1.0 2017-04-21 [1] RSPM (R 4.0.0)
## msigdbr * 7.2.1 2020-10-02 [1] RSPM (R 4.0.2)
## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.0.0)
## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0)
## org.Dr.eg.db * 3.12.0 2020-12-01 [1] Bioconductor
## pillar 1.4.6 2020-07-10 [1] RSPM (R 4.0.2)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.0)
## plyr 1.8.6 2020-03-03 [1] RSPM (R 4.0.2)
## polyclip 1.10-0 2019-03-14 [1] RSPM (R 4.0.0)
## ps 1.4.0 2020-10-07 [1] RSPM (R 4.0.2)
## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.0)
## qvalue 2.22.0 2020-10-27 [1] Bioconductor
## R.cache 0.14.0 2019-12-06 [1] RSPM (R 4.0.0)
## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.2)
## R.oo 1.24.0 2020-08-26 [1] RSPM (R 4.0.2)
## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.2)
## R6 2.4.1 2019-11-12 [1] RSPM (R 4.0.0)
## RColorBrewer 1.1-2 2014-12-07 [1] RSPM (R 4.0.0)
## Rcpp 1.0.5 2020-07-06 [1] RSPM (R 4.0.2)
## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.2)
## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.0)
## reshape2 1.4.4 2020-04-09 [1] RSPM (R 4.0.2)
## rlang 0.4.8 2020-10-08 [1] RSPM (R 4.0.2)
## rmarkdown 2.4 2020-09-30 [1] RSPM (R 4.0.2)
## RSQLite 2.2.1 2020-09-30 [1] RSPM (R 4.0.2)
## rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0)
## rvcheck 0.1.8 2020-03-01 [1] RSPM (R 4.0.0)
## S4Vectors * 0.28.0 2020-10-27 [1] Bioconductor
## scales 1.1.1 2020-05-11 [1] RSPM (R 4.0.0)
## scatterpie 0.1.5 2020-09-09 [1] RSPM (R 4.0.2)
## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.0)
## shadowtext 0.0.7 2019-11-06 [1] RSPM (R 4.0.0)
## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.2)
## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.0)
## styler 1.3.2 2020-02-23 [1] RSPM (R 4.0.0)
## tibble 3.0.4 2020-10-12 [1] RSPM (R 4.0.2)
## tidygraph 1.2.0 2020-05-12 [1] RSPM (R 4.0.2)
## tidyr 1.1.2 2020-08-27 [1] RSPM (R 4.0.2)
## tidyselect 1.1.0 2020-05-11 [1] RSPM (R 4.0.0)
## tweenr 1.0.1 2018-12-14 [1] RSPM (R 4.0.2)
## vctrs 0.3.4 2020-08-29 [1] RSPM (R 4.0.2)
## viridis 0.5.1 2018-03-29 [1] RSPM (R 4.0.0)
## viridisLite 0.3.0 2018-02-01 [1] RSPM (R 4.0.0)
## withr 2.3.0 2020-09-22 [1] RSPM (R 4.0.2)
## xfun 0.18 2020-09-29 [1] RSPM (R 4.0.2)
## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.0)
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library
Carlson M., 2019 Genome wide annotation for zebrafish. https://bioconductor.org/packages/release/data/annotation/html/org.Dr.eg.db.html
Dolgalev I., 2020 msigdbr: MSigDB gene sets for multiple organisms in a tidy data format. https://cran.r-project.org/web/packages/msigdbr/index.html
Khatri P., M. Sirota, and A. J. Butte, 2012 Ten years of pathway analysis: Current approaches and outstanding challenges. PLOS Computational Biology 8: e1002375. https://doi.org/10.1371/journal.pcbi.1002375
Liberzon A., C. Birger, H. Thorvaldsdóttir, M. Ghandi, and J. P. Mesirov et al., 2015 The molecular signatures database hallmark gene set collection. Cell Systems 1. https://doi.org/10.1016/j.cels.2015.12.004
Liberzon A., A. Subramanian, R. Pinchback, H. Thorvaldsdóttir, and P. Tamayo et al., 2011 Molecular signatures database (MSigDB) 3.0. Bioinformatics 27: 1739–1740. https://doi.org/10.1093/bioinformatics/btr260
Ritchie M. E., B. Phipson, D. Wu, Y. Hu, and C. W. Law et al., 2015 limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43: e47. https://doi.org/10.1093/nar/gkv007
Subramanian A., P. Tamayo, V. K. Mootha, S. Mukherjee, and B. L. Ebert et al., 2005 Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 102: 15545–15550. https://doi.org/10.1073/pnas.0506580102
Tregnago C., E. Manara, M. Zampini, V. Bisio, and C. Borga et al., 2016 CREB engages C/EBPδ to initiate leukemogenesis. Leukemia 30: 1887–1896. https://doi.org/10.1038/leu.2016.98
UC San Diego and Broad Institute Team, GSEA: Gene set enrichment analysis. https://www.gsea-msigdb.org/gsea/index.jsp
Yu G., L.-G. Wang, Y. Han, and Q.-Y. He, 2012 clusterProfiler: An R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 16: 284–287. https://doi.org/10.1089/omi.2011.0118
Yu G., 2020 clusterProfiler: Universal enrichment tool for functional and comparative study. http://yulab-smu.top/clusterProfiler-book/index.html
This example is part of our pathway analysis module set; we recommend looking at the pathway analysis table below to help you determine which pathway analysis method is best suited for your purposes.
In this example we will cover a method called Gene Set Variation Analysis (GSVA) to calculate gene set or pathway scores on a per-sample basis (Hänzelmann et al. 2013). You can use the GSVA scores for other downstream analyses; in this analysis, we will test the GSVA scores for differential expression.
⬇️ Jump to the analysis code ⬇️
Pathway analysis refers to any one of many techniques that use predetermined sets of genes that are related or coordinated in their expression in some way (e.g., participate in the same molecular process, are regulated by the same transcription factor) to interpret a high-throughput experiment. In the context of refine.bio, we use these techniques to analyze and interpret genome-wide gene expression experiments. The rationale for performing pathway analysis is that looking at the pathway level may be more biologically meaningful than considering individual genes, especially if a large number of genes are differentially expressed between conditions of interest. In addition, many relatively small changes in the expression values of genes in the same pathway could lead to a phenotypic outcome, and these small changes may go undetected in differential gene expression analysis.
We highly recommend taking a look at Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges from Khatri et al. (2012) for a more comprehensive overview. We have provided primary publications and documentation of the methods we introduce below, as well as some recommended reading, in the Resources for further learning section.
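To see why small coordinated changes can evade per-gene testing, here is a toy simulation in base R (all numbers are made up for illustration): each of 20 "pathway" genes shifts by a small amount between two groups; most per-gene t-tests miss the shift, while averaging expression across the pathway pools the small effects into a clearer signal.

```r
set.seed(2020)
n_genes <- 20
n_samples <- 10
shift <- 0.3 # small per-gene effect size

# Simulated expression: rows are genes, columns are samples
group_a <- matrix(rnorm(n_genes * n_samples), nrow = n_genes)
group_b <- matrix(rnorm(n_genes * n_samples, mean = shift), nrow = n_genes)

# Per-gene t-tests: with this effect size, most genes are non-significant
per_gene_p <- vapply(
  seq_len(n_genes),
  function(i) t.test(group_a[i, ], group_b[i, ])$p.value,
  numeric(1)
)
sum(per_gene_p < 0.05)

# Averaging across the pathway pools the small shifts into a clearer signal
pathway_p <- t.test(colMeans(group_a), colMeans(group_b))$p.value
pathway_p
```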
This table summarizes the pathway analyses examples in this module.
| Analysis | What is required for input | What output looks like | ✅ Pros | ⚠️ Cons |
|---|---|---|---|---|
| ORA (Over-representation Analysis) | A list of gene IDs (no stats needed) | A per-pathway hypergeometric test result | Simple; computationally inexpensive to calculate p-values | Requires arbitrary thresholds and ignores any statistics associated with a gene; assumes independence of genes and pathways |
| GSEA (Gene Set Enrichment Analysis) | A list of gene IDs with gene-level summary statistics | A per-pathway enrichment score | Includes all genes (no arbitrary threshold!); attempts to measure coordination of genes | Permutations can be expensive; does not account for pathway overlap; two-group comparisons not always appropriate/feasible |
| GSVA (Gene Set Variation Analysis) | A gene expression matrix (like what you get from refine.bio directly) | Pathway-level scores on a per-sample basis | Does not require two groups to compare upfront; normally distributed scores | Scores are not a good fit for gene sets that contain genes that go up AND down; method doesn't assign statistical significance itself; recommended sample size n > 10 |
For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.
.Rmd file
To run this example yourself, download the .Rmd for this analysis by clicking this link.
Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd file to where you would like this example and its files to be stored.
You can open this .Rmd file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd files.)
Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!
If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.
# Create the data folder if it doesn't exist
if (!dir.exists("data")) {
  dir.create("data")
}

# Define the file path to the plots directory
plots_dir <- "plots"

# Create the plots folder if it doesn't exist
if (!dir.exists(plots_dir)) {
  dir.create(plots_dir)
}

# Define the file path to the results directory
results_dir <- "results"

# Create the results folder if it doesn't exist
if (!dir.exists(results_dir)) {
  dir.create(results_dir)
}

# Define the file path to the gene_sets directory
gene_sets_dir <- "gene_sets"

# Create the gene_sets folder if it doesn't exist
if (!dir.exists(gene_sets_dir)) {
  dir.create(gene_sets_dir)
}
In the same place you put this .Rmd file, you should now have four new empty folders called data, plots, results, and gene_sets!
For general information about downloading data for these examples, see our 'Getting Started' section.
Go to this dataset's page on refine.bio.
Click the "Download Now" button on the right side of the screen.
Fill out the pop-up window with your email and accept our Terms and Conditions:
It may take a few minutes for the dataset to process. You will get an email when it is ready.
For this example analysis, we will use this medulloblastoma dataset (Northcott et al. 2012). The data that we downloaded from refine.bio for this analysis has 285 microarray samples obtained from patients with medulloblastoma. Medulloblastoma is the most common childhood brain cancer and is often categorized by subgroups. We will use these subgroup labels from our metadata to perform differential expression with our GSVA scores.
data/ folder
refine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip. Double clicking should unzip this for you and create a folder of the same name.
For more details on the contents of this folder, see these docs on refine.bio.
The GSE37382 folder has the data and metadata TSV files you will need for this example analysis. Experiment accession IDs usually look something like GSE1235 or SRP12345.
Copy and paste the GSE37382 folder into your newly created data/ folder.
Your new analysis folder should contain:
- The example analysis .Rmd you downloaded
- A folder called data/, which contains the GSE37382 folder with the data and metadata TSV files
- A folder for plots (currently empty)
- A folder for results (currently empty)
- A folder for gene_sets (currently empty)
Your example analysis folder should now look something like this (except with the respective experiment accession ID and analysis notebook name you are using):
In order for our example here to run without a hitch, we need these files to be in these locations, so we've constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double-check that your files are in the right place.
First we will declare the file paths to our data and metadata files, which should be in our data directory. This is handy because if we want to switch the dataset we are using for this analysis (see the next section for more on this), we will only have to change the file paths here to get started.
# Define the file path to the data directory
# Replace with the path of the folder the files will be in
data_dir <- file.path("data", "GSE37382")

# Declare the file path to the gene expression matrix file
# inside the directory saved as `data_dir`
# Replace with the path to your dataset file
data_file <- file.path(data_dir, "GSE37382.tsv")

# Declare the file path to the metadata file
# inside the directory saved as `data_dir`
# Replace with the path to your metadata file
metadata_file <- file.path(data_dir, "metadata_GSE37382.tsv")
Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.
# Check if the gene expression matrix file is at the path stored in `data_file`
file.exists(data_file)
## [1] TRUE
# Check if the metadata file is at the file path stored in `metadata_file`
file.exists(metadata_file)
## [1] TRUE
If the chunk above printed out FALSE for either of those tests, you won't be able to run this analysis as is until those files are in the appropriate place.
+If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/
directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/
and results/
directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.
See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.
In this analysis, we will be using the GSVA package to perform GSVA and the qusage package to read in the GMT file containing the gene set data (Hänzelmann et al. 2013; Yaari et al. 2013).
We will also need the org.Hs.eg.db package to perform gene identifier conversion (Carlson 2020).
We'll also be performing a differential expression test on our GSVA scores, and for that we will use limma (Ritchie et al. 2015). We'll then make a sina plot of the scores of our most significant pathway using a ggplot2 companion package, ggforce.
if (!("GSVA" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("GSVA", update = FALSE)
}

if (!("qusage" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("qusage", update = FALSE)
}

if (!("org.Hs.eg.db" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("org.Hs.eg.db", update = FALSE)
}

if (!("limma" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("limma", update = FALSE)
}

if (!("ggforce" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  install.packages("ggforce")
}
Attach the packages we need for this analysis.
# Attach the `qusage` library
library(qusage)

# Attach the `GSVA` library
library(GSVA)

# Human annotation package we'll use for gene identifier conversion
library(org.Hs.eg.db)

# Attach the ggplot2 package for plotting
library(ggplot2)

# We will need this so we can use the pipe: %>%
library(magrittr)
Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read both TSV files and add them as data frames to your environment.
We stored our file paths as objects named metadata_file
and data_file
in this previous step.
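The chunk that reads the files isn't shown in this excerpt; a minimal sketch, assuming the metadata_file and data_file path objects mentioned above exist, would use readr (whose read_tsv() emits the column-specification messages shown below):

```r
# Read in the metadata TSV file; `metadata_file` is assumed to hold the path
# declared earlier in this notebook
metadata <- readr::read_tsv(metadata_file)

# Read in the gene expression TSV file
expression_df <- readr::read_tsv(data_file)
```
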
##
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────
## cols(
## .default = col_character(),
## refinebio_age = col_double(),
## refinebio_cell_line = col_logical(),
## refinebio_compound = col_logical(),
## refinebio_disease = col_logical(),
## refinebio_disease_stage = col_logical(),
## refinebio_genetic_information = col_logical(),
## refinebio_processed = col_logical(),
## refinebio_race = col_logical(),
## refinebio_source_archive_url = col_logical(),
## refinebio_specimen_part = col_logical(),
## refinebio_subject = col_logical(),
## refinebio_time = col_logical(),
## refinebio_treatment = col_logical(),
## channel_count = col_double(),
## `contact_zip/postal_code` = col_double(),
## data_row_count = col_double(),
## taxid_ch1 = col_double()
## )
## ℹ Use `spec()` for the full column specifications.

##
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────
## cols(
## .default = col_double(),
## Gene = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
+Let’s ensure that the metadata and data are in the same sample order.
+# Make the data in the order of the metadata
+expression_df <- expression_df %>%
+ dplyr::select(c(Gene, metadata$geo_accession))
+
+# Check if this is in the same order
+all.equal(colnames(expression_df)[-1], metadata$geo_accession)
## [1] TRUE
The Molecular Signatures Database (MSigDB) is a resource that contains annotated gene sets that can be used for pathway or gene set analyses (Subramanian et al. 2005; Liberzon et al. 2011). MSigDB contains 8 different gene set collections (Subramanian et al. 2005; Liberzon et al. 2011) that are distinguished by how they are derived (e.g., computationally mined, curated).
In this example, we will use a collection called Hallmark gene sets for GSVA (Liberzon et al. 2015). Here's an excerpt of the collection description from MSigDB:
Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression.
The function that we will use to run GSVA requires the gene sets to be in a list. We are going to download a GMT file that contains the Hallmark gene sets (Liberzon et al. 2015) from MSigDB (Subramanian et al. 2005; Liberzon et al. 2011) to the gene_sets
directory.
Note that when downloading GMT files from MSigDB, only Homo sapiens gene sets are supported. If you’d like to use MSigDB gene sets with GSVA for a commonly studied model organism, check out our RNA-seq GSVA example.
hallmarks_gmt_url <- "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.2/h.all.v7.2.entrez.gmt"
We will also declare a file path to where we want this file to be downloaded to and we can use the same file path later for reading the file into R.
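The chunk that declares this path isn't shown in this excerpt; a minimal sketch (the gene_sets directory name is an assumption, and the file name mirrors the URL above):

```r
# Directory to hold gene set files; assumed to have been created during setup
gene_sets_dir <- "gene_sets"

# Declare the path the GMT file will be downloaded to; we can reuse this path
# later when reading the file into R
hallmarks_gmt_file <- file.path(gene_sets_dir, "h.all.v7.2.entrez.gmt")
```
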
Using the URL (hallmarks_gmt_url
) and file path (hallmark_gmt_file
) we can download the file and use the destfile
argument to specify where it should be saved.
download.file(
  hallmarks_gmt_url,
  # The file will be saved to this location and with this name
  destfile = hallmarks_gmt_file
)
Now let’s double check that the file that contains the gene sets is in the right place.
file.exists(hallmarks_gmt_file)
## [1] TRUE
+Now we’re ready to read the file into R with qusage::read.gmt()
.
# QuSAGE is another pathway analysis method, the `qusage` package has a function
# for reading GMT files and turning them into a list that we can use with GSVA
hallmarks_list <- qusage::read.gmt(hallmarks_gmt_file)
What does this hallmarks_list
look like?
## $HALLMARK_TNFA_SIGNALING_VIA_NFKB
## [1] "3726" "2920" "467" "4792" "7128" "5743" "2919"
## [8] "8870" "9308" "6364" "2921" "23764" "4791" "7127"
## [15] "1839" "1316" "330" "5329" "7538" "3383" "3725"
## [22] "1960" "3553" "597" "23645" "80149" "6648" "4929"
## [29] "3552" "5971" "7185" "7832" "1843" "1326" "2114"
## [36] "2152" "6385" "1958" "3569" "7124" "23135" "4790"
## [43] "3976" "5806" "8061" "3164" "182" "6351" "2643"
## [50] "6347" "1827" "1844" "10938" "9592" "5966" "8837"
## [57] "8767" "4794" "8013" "22822" "51278" "8744" "2669"
## [64] "1647" "3627" "10769" "8553" "1959" "9021" "11182"
## [71] "5734" "1847" "5055" "4783" "5054" "10221" "25976"
## [78] "5970" "329" "6372" "9516" "7130" "960" "3624"
## [85] "5328" "4609" "3604" "6446" "10318" "10135" "2355"
## [92] "10957" "3398" "969" "3575" "1942" "7262" "5209"
## [99] "6352" "79693" "3460" "8878" "10950" "4616" "8942"
## [106] "50486" "694" "4170" "7422" "5606" "1026" "3491"
## [113] "10010" "3433" "3606" "7280" "3659" "2353" "4973"
## [120] "388" "374" "4814" "65986" "8613" "9314" "6373"
## [127] "6303" "1435" "1880" "56937" "5791" "7097" "57007"
## [134] "7071" "4082" "3914" "1051" "9322" "2150" "687"
## [141] "3949" "7050" "127544" "55332" "2683" "11080" "1437"
## [148] "5142" "8303" "5341" "6776" "23258" "595" "23586"
## [155] "8877" "941" "25816" "57018" "2526" "9034" "80176"
## [162] "8848" "9334" "150094" "23529" "4780" "2354" "5187"
## [169] "10725" "490" "3593" "3572" "9120" "19" "3280"
## [176] "604" "8660" "6515" "1052" "51561" "4088" "6890"
## [183] "9242" "64135" "3601" "79155" "602" "24145" "24147"
## [190] "1906" "10209" "650" "1846" "10611" "23308" "9945"
## [197] "10365" "3371" "5271" "4084"
##
## $HALLMARK_HYPOXIA
## [1] "5230" "5163" "2632" "5211" "226" "2026" "5236"
## [8] "10397" "3099" "230" "2821" "4601" "6513" "5033"
## [15] "133" "8974" "2023" "5214" "205" "26355" "5209"
## [22] "7422" "665" "7167" "30001" "55818" "901" "3939"
## [29] "2997" "2597" "8553" "51129" "3725" "5054" "4015"
## [36] "2645" "8497" "23764" "54541" "6515" "3486" "4783"
## [43] "2353" "3516" "3098" "10370" "3669" "2584" "26118"
## [50] "5837" "6781" "23036" "694" "123" "1466" "7436"
## [57] "23210" "2131" "2152" "5165" "55139" "7360" "229"
## [64] "8614" "54206" "2027" "10957" "3162" "5228" "26330"
## [71] "9435" "55076" "63827" "467" "857" "272" "2719"
## [78] "3340" "8660" "8819" "2548" "6385" "8987" "8870"
## [85] "5313" "3484" "5329" "112464" "8839" "9215" "25819"
## [92] "6275" "58528" "7538" "1956" "1907" "3423" "1026"
## [99] "6095" "1843" "4282" "5507" "10570" "11015" "1837"
## [106] "136" "9957" "284119" "2908" "1316" "2239" "3491"
## [113] "7128" "771" "3073" "633" "23645" "55276" "5292"
## [120] "25824" "55577" "1027" "680" "8277" "4493" "538"
## [127] "4502" "9672" "25976" "5317" "302" "5224" "1649"
## [134] "5578" "2542" "7852" "1944" "1356" "8609" "1490"
## [141] "9469" "7163" "56925" "124872" "10891" "596" "2651"
## [148] "3036" "54800" "949" "6576" "6383" "839" "7428"
## [155] "2309" "5155" "126792" "6518" "8406" "1942" "2745"
## [162] "57007" "5066" "7045" "1634" "6478" "51316" "2203"
## [169] "8459" "5260" "4627" "1028" "9380" "5105" "3623"
## [176] "3309" "8509" "23327" "7162" "7511" "3569" "6533"
## [183] "4214" "3948" "9590" "26136" "3798" "3906" "1289"
## [190] "2817" "3069" "10994" "1463" "7052" "2113" "3219"
## [197] "8991" "2355" "6820" "7043"
Looks like we have a list of gene sets with associated Entrez IDs.
In our gene expression data frame, expression_df,
we have Ensembl gene identifiers, so we will need to convert our Ensembl IDs into Entrez IDs for GSVA.
We’re going to convert our identifiers in expression_df
to Entrez IDs, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers!
The annotation package org.Hs.eg.db
contains information for different identifiers (Carlson 2020). org.Hs.eg.db
is specific to Homo sapiens – this is what the Hs
in the package name is referencing.
We can see what types of IDs are available to us in an annotation package with keytypes()
.
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT"
## [5] "ENSEMBLTRANS" "ENTREZID" "ENZYME" "EVIDENCE"
## [9] "EVIDENCEALL" "GENENAME" "GO" "GOALL"
## [13] "IPI" "MAP" "OMIM" "ONTOLOGY"
## [17] "ONTOLOGYALL" "PATH" "PFAM" "PMID"
## [21] "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
## [25] "UNIGENE" "UNIPROT"
Even though we'll use this package to convert from Ensembl gene IDs (ENSEMBL
) to Entrez IDs (ENTREZID
), we could just as easily use it to convert from an Ensembl transcript ID (ENSEMBLTRANS
) to gene symbols (SYMBOL
).
Take a look at our other gene identifier conversion examples for examples with different species and gene ID types: the microarray example and the RNA-seq example.
The function we will use to map from Ensembl gene IDs to Entrez gene IDs is called mapIds()
and comes from the AnnotationDbi
package.
Let’s create a data frame that shows the mapped Entrez IDs along with the gene expression values for the respective Ensembl IDs.
# First let's create a mapped data frame we can join to the gene expression
# values
mapped_df <- data.frame(
  "entrez_id" = mapIds(
    # Replace with annotation package for the organism relevant to your data
    org.Hs.eg.db,
    keys = expression_df$Gene,
    # Replace with the type of gene identifiers in your data
    keytype = "ENSEMBL",
    # Replace with the type of gene identifiers you would like to map to
    column = "ENTREZID",
    # This will keep only the first mapped value for each Ensembl ID
    multiVals = "first"
  )
) %>%
  # If an Ensembl gene identifier doesn't map to an Entrez gene identifier, drop
  # that from the data frame
  dplyr::filter(!is.na(entrez_id)) %>%
  # Make an `Ensembl` column to store the row names
  tibble::rownames_to_column("Ensembl") %>%
  # Now let's join the rest of the expression data
  dplyr::inner_join(expression_df, by = c("Ensembl" = "Gene"))
## 'select()' returned 1:many mapping between keys and columns
This 1:many mapping between keys and columns
message means that some Ensembl gene identifiers map to multiple Entrez IDs. In this case, it's also possible that an Entrez ID will map to multiple Ensembl IDs. For the purpose of performing GSVA later in this notebook, we keep only the first mapped IDs. For more about how to explore this, take a look at our microarray gene ID conversion example.
Let’s see a preview of mapped_df
.
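The preview chunk isn't shown in this excerpt; it would simply be a head() call on the notebook's mapped_df object:

```r
# Preview the first rows of the mapped data frame
head(mapped_df)
```
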
We will want to keep in mind that GSVA requires the data to be in a matrix with the gene identifiers as row names. In order to successfully turn our data frame into a matrix, we need to ensure that we do not have any duplicate gene identifiers.
Let's check to see if we have any Entrez IDs that mapped to multiple Ensembl IDs.
any(duplicated(mapped_df$entrez_id))
## [1] TRUE
Looks like we do have duplicated Entrez IDs. Let's find out which ones.
dup_entrez_ids <- mapped_df %>%
  dplyr::filter(duplicated(entrez_id)) %>%
  dplyr::pull(entrez_id)

dup_entrez_ids
## [1] "6013" "3117"
As we mentioned earlier, we do not want any duplicate gene identifiers in our data frame when we convert it into a matrix in preparation for the GSVA steps later. GSVA is executed on a per-sample basis, so for each duplicated Entrez gene identifier we will keep only the maximum expression value found in each sample (column).
Let's take a look at the rows associated with the duplicated Entrez IDs and see how this will play out.
As an example of this strategy, for GSM917111's data in the first column, 0.2294387 is larger than 0.1104345, so moving forward, Entrez gene 6013 will have the 0.2294387 value and 0.1104345 will be dropped from the dataset. This is just one method of handling duplicate gene identifiers. See the Gene Set Enrichment Analysis (GSEA) User Guide for more information on other commonly used strategies, such as taking the median expression value.
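As a minimal base-R illustration of this max-per-sample rule (using the toy values quoted above; the notebook itself does this with tidyverse functions in the next chunk):

```r
# Two rows sharing the duplicated Entrez ID "6013" for sample GSM917111
toy_df <- data.frame(
  entrez_id = c("6013", "6013"),
  GSM917111 = c(0.2294387, 0.1104345)
)

# Keep the per-sample maximum for each gene identifier; only the
# 0.2294387 row survives
aggregate(GSM917111 ~ entrez_id, data = toy_df, FUN = max)
```
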
Now, let’s implement the maximum value method for all samples and Entrez IDs using tidyverse functions.
max_dup_df <- mapped_df %>%
  # We won't be using Ensembl IDs moving forward, so we will drop this column
  dplyr::select(-Ensembl) %>%
  # Filter to include only the rows associated with the duplicate Entrez gene
  # identifiers
  dplyr::filter(entrez_id %in% dup_entrez_ids) %>%
  # Group by Entrez IDs
  dplyr::group_by(entrez_id) %>%
  # Get the maximum expression value using all values associated with each
  # duplicate Entrez ID for each column or sample in this case
  dplyr::summarize_all(max)

max_dup_df
We can see GSM917111
now has the 0.2294387
value for Entrez ID 6013
as expected. Looks like we successfully got rid of the duplicate Entrez gene identifiers!
Now let’s combine our newly de-duplicated data with the rest of the mapped data!
filtered_mapped_df <- mapped_df %>%
  # We won't be using Ensembl IDs moving forward, so we will drop this column
  dplyr::select(-Ensembl) %>%
  # First let's get the data associated with the Entrez gene identifiers that
  # aren't duplicated
  dplyr::filter(!entrez_id %in% dup_entrez_ids) %>%
  # Now let's bind the rows of the maximum expression data we stored in
  # `max_dup_df`
  dplyr::bind_rows(max_dup_df)
As mentioned earlier, we need a matrix for GSVA. Let’s now convert our data frame into a matrix and prepare our object for GSVA.
filtered_mapped_matrix <- filtered_mapped_df %>%
  # We need to store our gene identifiers as row names
  tibble::column_to_rownames("entrez_id") %>%
  # Now we can convert our object into a matrix
  as.matrix()
Note that if we had duplicate gene identifiers here, we would not be able to set them as row names.
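To see why the de-duplication step mattered, here's a quick base-R sketch showing that duplicate row names are rejected on a data frame:

```r
df <- data.frame(value = c(1, 2))

# Attempting to assign a duplicated row name raises an error, which try() captures
res <- try(rownames(df) <- c("6013", "6013"), silent = TRUE)
inherits(res, "try-error") # TRUE
```
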
GSVA fits a model and ranks genes based on their expression level relative to the sample distribution (Hänzelmann et al. 2013). The pathway-level score calculated is a way of asking how genes within a gene set vary as compared to genes that are outside of that gene set (Malhotra 2018).
The idea here is that we will get pathway-level scores for each sample that indicate if genes in a pathway vary concordantly in one direction (over-expressed or under-expressed relative to the overall population) (Hänzelmann et al. 2013). This means that GSVA scores will depend on the samples included in the dataset when you run GSVA; if you added more samples and ran GSVA again, you would expect the scores to change (Hänzelmann et al. 2013).
The output is a gene set by sample matrix of GSVA scores.
+Let’s perform GSVA using the gsva()
function. See ?gsva
for more options.
gsva_results <- gsva(
  filtered_mapped_matrix,
  hallmarks_list,
  method = "gsva",
  # Appropriate for our log2-transformed microarray data
  kcdf = "Gaussian",
  # Minimum gene set size
  min.sz = 15,
  # Maximum gene set size
  max.sz = 500,
  # Compute Gaussian-distributed scores
  mx.diff = TRUE,
  # Don't print out the progress bar
  verbose = FALSE
)
Let’s explore what the output of gsva()
looks like.
## GSM917111 GSM917250
## HALLMARK_TNFA_SIGNALING_VIA_NFKB -0.2784726 -0.29221444
## HALLMARK_HYPOXIA -0.1907117 -0.13033725
## HALLMARK_CHOLESTEROL_HOMEOSTASIS -0.2307863 -0.22997233
## HALLMARK_MITOTIC_SPINDLE -0.2134439 0.09773602
## HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.3061668 0.27041084
## HALLMARK_TGF_BETA_SIGNALING -0.2285640 0.08510027
## GSM917281 GSM917062
## HALLMARK_TNFA_SIGNALING_VIA_NFKB -0.30693127 -0.2953894
## HALLMARK_HYPOXIA -0.24058274 -0.2658532
## HALLMARK_CHOLESTEROL_HOMEOSTASIS -0.25341066 -0.2214914
## HALLMARK_MITOTIC_SPINDLE -0.13886393 -0.2020978
## HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.06319446 -0.2363895
## HALLMARK_TGF_BETA_SIGNALING -0.14161796 -0.2284998
## GSM917288 GSM917230
## HALLMARK_TNFA_SIGNALING_VIA_NFKB -0.22966329 -0.20914620
## HALLMARK_HYPOXIA 0.06741065 -0.02691280
## HALLMARK_CHOLESTEROL_HOMEOSTASIS -0.08702648 -0.03084332
## HALLMARK_MITOTIC_SPINDLE -0.17902098 0.05763884
## HALLMARK_WNT_BETA_CATENIN_SIGNALING 0.21274606 0.08273239
## HALLMARK_TGF_BETA_SIGNALING 0.01208862 -0.13097578
## GSM917152 GSM917242
## HALLMARK_TNFA_SIGNALING_VIA_NFKB 0.33276903 0.001857506
## HALLMARK_HYPOXIA 0.18446386 -0.118269791
## HALLMARK_CHOLESTEROL_HOMEOSTASIS 0.05273271 0.104042284
## HALLMARK_MITOTIC_SPINDLE 0.14226250 -0.052122165
## HALLMARK_WNT_BETA_CATENIN_SIGNALING 0.37981263 -0.037661623
## HALLMARK_TGF_BETA_SIGNALING 0.15915374 0.300603909
## GSM917226 GSM917290
## HALLMARK_TNFA_SIGNALING_VIA_NFKB -0.1329156 -0.385841741
## HALLMARK_HYPOXIA -0.2641157 -0.145480093
## HALLMARK_CHOLESTEROL_HOMEOSTASIS -0.2136088 -0.267519873
## HALLMARK_MITOTIC_SPINDLE -0.3753805 -0.001471942
## HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.3570903 -0.006265662
## HALLMARK_TGF_BETA_SIGNALING -0.1973818 -0.130123427
If we want to identify the most differentially expressed pathways across subgroups, we can use functionality in the limma
package to test the GSVA scores.
This is one approach for working with GSVA scores; the mx.diff = TRUE
argument that we supplied to the gsva()
function in the previous section means the GSVA output scores should be normally distributed, which can be helpful if you want to perform downstream analyses with approaches that make assumptions of normality (Hänzelmann et al. 2020).
limma
needs a numeric design matrix to indicate which subtype of medulloblastoma a sample originates from. Now we will create a model matrix based on our subgroup
variable. We are using a + 0
in the model which sets the intercept to 0 so the subgroup effects capture expression for that group, rather than difference from the first group. If you have a control group, you might want to leave off the + 0
so that the model includes an intercept representing the control group's expression level, with the remaining coefficients representing changes relative to that level.
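The chunk that builds the design matrix isn't shown in this excerpt; a minimal sketch with toy metadata (the notebook would use its own metadata data frame) is:

```r
# Toy stand-in for the notebook's `metadata` data frame
toy_metadata <- data.frame(subgroup = c("Group 4", "Group 4", "Group 3", "SHH"))

# `+ 0` drops the intercept so each subgroup gets its own coefficient
toy_des_mat <- model.matrix(~ subgroup + 0, data = toy_metadata)
colnames(toy_des_mat)
# "subgroupGroup 3" "subgroupGroup 4" "subgroupSHH"
```
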
Let’s take a look at the design matrix we created.
head(des_mat)
## subgroupGroup 3 subgroupGroup 4 subgroupSHH
## 1 0 1 0
## 2 0 1 0
## 3 0 1 0
## 4 1 0 0
## 5 0 1 0
## 6 0 1 0
The design matrix column names are a bit messy, so we will neaten them up by dropping the subgroup
designation they all have and any spaces in names.
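The renaming chunk isn't shown in this excerpt; one base-R way to do it (the notebook may use stringr instead) is:

```r
# Example column names as produced by model.matrix() on the `subgroup` variable
nms <- c("subgroupGroup 3", "subgroupGroup 4", "subgroupSHH")

# Drop the "subgroup" prefix and any spaces
nms <- gsub("subgroup", "", nms, fixed = TRUE)
nms <- gsub(" ", "", nms, fixed = TRUE)
nms
# "Group3" "Group4" "SHH"
```

In the notebook, the cleaned names would be assigned back with something like `colnames(des_mat) <- nms`, yielding the Group3, Group4, and SHH names used in the contrasts below.
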
Run the linear model on each pathway (each row of gsva_results
) with lmFit()
and apply empirical Bayes smoothing with eBayes()
.
# Apply linear model to data
fit <- limma::lmFit(gsva_results, design = des_mat)

# Apply empirical Bayes to smooth standard errors
fit <- limma::eBayes(fit)
Now that we have our basic model fitting, we will want to make the contrasts among all our groups. Depending on your scientific questions, you will need to customize the next steps. Consulting the limma users guide for how to set up your model based on your hypothesis is a good idea.
In this contrasts matrix, we are comparing each subgroup to the average of the other subgroups.
We're dividing by two in this expression so that each group is compared to the average of the other two groups (makeContrasts()
doesn’t allow you to use functions like mean()
; it wants a formula).
contrast_matrix <- makeContrasts(
  "G3vsOther" = Group3 - (Group4 + SHH) / 2,
  "G4vsOther" = Group4 - (Group3 + SHH) / 2,
  "SHHvsOther" = SHH - (Group3 + Group4) / 2,
  levels = des_mat
)
Side note: If you did have a control group you wanted to compare each group to, you could make each contrast to that control group, so the formulae would look like Group3 = Group3 - Control
for each one. We highly recommend consulting the limma users guide for figuring out what your makeContrasts()
and model.matrix()
setups should look like (Ritchie et al. 2015).
Now that we have the contrasts matrix set up, we can use it to re-fit the model and re-smooth it with eBayes()
.
# Fit the model according to the contrasts matrix
contrasts_fit <- contrasts.fit(fit, contrast_matrix)

# Re-smooth the Bayes
contrasts_fit <- eBayes(contrasts_fit)
Here’s a nifty article and example about what the empirical Bayes smoothing is for (Robinson).
+Now let’s create the results table based on the contrasts fitted model.
+This step will provide the Benjamini-Hochberg multiple testing correction. The topTable()
function default is to use Benjamini-Hochberg but this can be changed to a different method using the adjust.method
argument (see the ?topTable
help page for more about the options).
# Apply multiple testing correction and obtain stats
stats_df <- topTable(contrasts_fit, number = nrow(expression_df)) %>%
  tibble::rownames_to_column("Gene")
Let’s take a peek at our results table.
For each pathway, each group's fold change in expression compared to the average of the other groups is reported.
By default, results are ordered from largest F
value to the smallest, which means your most differentially expressed pathways across all groups should be toward the top.
This means HALLMARK_UNFOLDED_PROTEIN_RESPONSE
appears to be the pathway that contains the most significant distribution of scores across subgroups.
Let’s make a plot for our most significant pathway, HALLMARK_UNFOLDED_PROTEIN_RESPONSE
.
First we need to get our GSVA scores for this pathway into a long data frame, an appropriate format for ggplot2
.
Let’s look at the current format of gsva_results
.
## GSM917111 GSM917250
## HALLMARK_TNFA_SIGNALING_VIA_NFKB -0.2784726 -0.29221444
## HALLMARK_HYPOXIA -0.1907117 -0.13033725
## HALLMARK_CHOLESTEROL_HOMEOSTASIS -0.2307863 -0.22997233
## HALLMARK_MITOTIC_SPINDLE -0.2134439 0.09773602
## HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.3061668 0.27041084
## HALLMARK_TGF_BETA_SIGNALING -0.2285640 0.08510027
## GSM917281 GSM917062
## HALLMARK_TNFA_SIGNALING_VIA_NFKB -0.30693127 -0.2953894
## HALLMARK_HYPOXIA -0.24058274 -0.2658532
## HALLMARK_CHOLESTEROL_HOMEOSTASIS -0.25341066 -0.2214914
## HALLMARK_MITOTIC_SPINDLE -0.13886393 -0.2020978
## HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.06319446 -0.2363895
## HALLMARK_TGF_BETA_SIGNALING -0.14161796 -0.2284998
## GSM917288 GSM917230
## HALLMARK_TNFA_SIGNALING_VIA_NFKB -0.22966329 -0.20914620
## HALLMARK_HYPOXIA 0.06741065 -0.02691280
## HALLMARK_CHOLESTEROL_HOMEOSTASIS -0.08702648 -0.03084332
## HALLMARK_MITOTIC_SPINDLE -0.17902098 0.05763884
## HALLMARK_WNT_BETA_CATENIN_SIGNALING 0.21274606 0.08273239
## HALLMARK_TGF_BETA_SIGNALING 0.01208862 -0.13097578
## GSM917152 GSM917242
## HALLMARK_TNFA_SIGNALING_VIA_NFKB 0.33276903 0.001857506
## HALLMARK_HYPOXIA 0.18446386 -0.118269791
## HALLMARK_CHOLESTEROL_HOMEOSTASIS 0.05273271 0.104042284
## HALLMARK_MITOTIC_SPINDLE 0.14226250 -0.052122165
## HALLMARK_WNT_BETA_CATENIN_SIGNALING 0.37981263 -0.037661623
## HALLMARK_TGF_BETA_SIGNALING 0.15915374 0.300603909
## GSM917226 GSM917290
## HALLMARK_TNFA_SIGNALING_VIA_NFKB -0.1329156 -0.385841741
## HALLMARK_HYPOXIA -0.2641157 -0.145480093
## HALLMARK_CHOLESTEROL_HOMEOSTASIS -0.2136088 -0.267519873
## HALLMARK_MITOTIC_SPINDLE -0.3753805 -0.001471942
## HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.3570903 -0.006265662
## HALLMARK_TGF_BETA_SIGNALING -0.1973818 -0.130123427
We can see that they are in a wide format with the GSVA scores for each sample spread across a row associated with each pathway.
+Now let’s convert these results into a data frame and into a long format, using the tidyr::pivot_longer()
function.
annotated_results_df <- gsva_results %>%
  # Make this into a data frame
  data.frame() %>%
  # Gene set names are row names
  tibble::rownames_to_column("pathway") %>%
  # Get into long format using the `tidyr::pivot_longer()` function
  tidyr::pivot_longer(
    cols = -pathway,
    names_to = "sample",
    values_to = "gsva_score"
  )

# Preview the annotated results object
head(annotated_results_df)
Now let’s filter to include only the scores associated with our most significant pathway, HALLMARK_UNFOLDED_PROTEIN_RESPONSE
, and join the relevant group labels from the metadata for plotting.
top_pathway_annotated_results_df <- annotated_results_df %>%
  # Filter for only scores associated with our most significant pathway
  dplyr::filter(pathway == "HALLMARK_UNFOLDED_PROTEIN_RESPONSE") %>%
  # Join the column with the group labels that we would like to plot
  dplyr::left_join(metadata %>% dplyr::select(
    # Select the variables relevant to your data
    refinebio_accession_code,
    subgroup
  ),
  # Tell the join what columns are equivalent and should be used as a key
  by = c("sample" = "refinebio_accession_code")
  )

# Preview the filtered annotated results object
head(top_pathway_annotated_results_df)
Now let’s make a sina plot so we can look at the differences between the subgroup
groups using our GSVA scores.
# Sina plot comparing GSVA scores for `HALLMARK_UNFOLDED_PROTEIN_RESPONSE`
# across the `subgroup` groups in our dataset
sina_plot <-
  ggplot(
    top_pathway_annotated_results_df, # Supply our annotated data frame
    aes(
      x = subgroup, # Replace with a grouping variable relevant to your data
      y = gsva_score, # Column we previously created to store the GSVA scores
      color = subgroup # Let's make the groups different colors too
    )
  ) +
  # Add a boxplot that will have summary stats
  geom_boxplot(outlier.shape = NA) +
  # Make a sina plot that shows individual values
  ggforce::geom_sina() +
  # Rename the y-axis label
  labs(y = "HALLMARK_UNFOLDED_PROTEIN_RESPONSE_score") +
  # Adjust the plot background for better visualization
  theme_bw()

# Display plot
sina_plot
Looks like the Group 4
samples have lower GSVA scores for HALLMARK_UNFOLDED_PROTEIN_RESPONSE
as compared to the SHH
and Group 3
scores.
Let’s save this plot to PNG.
ggsave(
  file.path(
    plots_dir,
    "GSE37382_gsva_HALLMARK_UNFOLDED_PROTEIN_RESPONSE_sina_plot.png"
  ),
  plot = sina_plot
)
## Saving 7 x 5 in image
At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.
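The chunk itself isn't shown in this excerpt; the "─ Session info" formatting in the output below suggests the sessioninfo package, so the call was likely:

```r
# Print session info
sessioninfo::session_info()
```
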
## ─ Session info ─────────────────────────────────────────────────────
## setting value
## version R version 4.0.2 (2020-06-22)
## os Ubuntu 20.04 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
## date 2020-12-21
##
## ─ Packages ─────────────────────────────────────────────────────────
## package * version date lib source
## annotate 1.68.0 2020-10-27 [1] Bioconductor
## AnnotationDbi * 1.52.0 2020-10-27 [1] Bioconductor
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.0)
## backports 1.1.10 2020-09-15 [1] RSPM (R 4.0.2)
## Biobase * 2.50.0 2020-10-27 [1] Bioconductor
## BiocGenerics * 0.36.0 2020-10-27 [1] Bioconductor
## BiocParallel 1.24.1 2020-11-06 [1] Bioconductor
## bit 4.0.4 2020-08-04 [1] RSPM (R 4.0.2)
## bit64 4.0.5 2020-08-30 [1] RSPM (R 4.0.2)
## bitops 1.0-6 2013-08-17 [1] RSPM (R 4.0.0)
## blob 1.2.1 2020-01-20 [1] RSPM (R 4.0.0)
## cli 2.1.0 2020-10-12 [1] RSPM (R 4.0.2)
## coda 0.19-4 2020-09-30 [1] RSPM (R 4.0.2)
## colorspace 1.4-1 2019-03-18 [1] RSPM (R 4.0.0)
## crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.0)
## DBI 1.1.0 2019-12-15 [1] RSPM (R 4.0.0)
## DelayedArray 0.16.0 2020-10-27 [1] Bioconductor
## digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0)
## dplyr 1.0.2 2020-08-18 [1] RSPM (R 4.0.2)
## ellipsis 0.3.1 2020-05-15 [1] RSPM (R 4.0.0)
## emmeans 1.5.1 2020-09-18 [1] RSPM (R 4.0.2)
## estimability 1.3 2018-02-11 [1] RSPM (R 4.0.0)
## evaluate 0.14 2019-05-28 [1] RSPM (R 4.0.0)
## fansi 0.4.1 2020-01-08 [1] RSPM (R 4.0.0)
## farver 2.0.3 2020-01-16 [1] RSPM (R 4.0.0)
## fastmatch 1.1-0 2017-01-28 [1] RSPM (R 4.0.0)
## fftw 1.0-6 2020-02-24 [1] RSPM (R 4.0.2)
## generics 0.0.2 2018-11-29 [1] RSPM (R 4.0.0)
## GenomeInfoDb 1.26.1 2020-11-20 [1] Bioconductor
## GenomeInfoDbData 1.2.4 2020-12-01 [1] Bioconductor
## GenomicRanges 1.42.0 2020-10-27 [1] Bioconductor
## getopt 1.20.3 2019-03-22 [1] RSPM (R 4.0.0)
## ggforce 0.3.2 2020-06-23 [1] RSPM (R 4.0.2)
## ggplot2 * 3.3.2 2020-06-19 [1] RSPM (R 4.0.1)
## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.2)
## graph 1.68.0 2020-10-27 [1] Bioconductor
## GSEABase 1.52.1 2020-12-11 [1] Bioconductor
## GSVA * 1.38.0 2020-10-27 [1] Bioconductor
## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.0.0)
## hms 0.5.3 2020-01-08 [1] RSPM (R 4.0.0)
## htmltools 0.5.0 2020-06-16 [1] RSPM (R 4.0.1)
## httr 1.4.2 2020-07-20 [1] RSPM (R 4.0.2)
## IRanges * 2.24.0 2020-10-27 [1] Bioconductor
## jsonlite 1.7.1 2020-09-07 [1] RSPM (R 4.0.2)
## knitr 1.30 2020-09-22 [1] RSPM (R 4.0.2)
## labeling 0.3 2014-08-23 [1] RSPM (R 4.0.0)
## lattice 0.20-41 2020-04-02 [2] CRAN (R 4.0.2)
## lifecycle 0.2.0 2020-03-06 [1] RSPM (R 4.0.0)
## limma * 3.46.0 2020-10-27 [1] Bioconductor
## magrittr * 1.5 2014-11-22 [1] RSPM (R 4.0.0)
## MASS 7.3-51.6 2020-04-26 [2] CRAN (R 4.0.2)
## Matrix 1.2-18 2019-11-27 [2] CRAN (R 4.0.2)
## MatrixGenerics 1.2.0 2020-10-27 [1] Bioconductor
## matrixStats 0.57.0 2020-09-25 [1] RSPM (R 4.0.2)
## memoise 1.1.0 2017-04-21 [1] RSPM (R 4.0.0)
## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.0.0)
## mvtnorm 1.1-1 2020-06-09 [1] RSPM (R 4.0.0)
## nlme 3.1-148 2020-05-24 [2] CRAN (R 4.0.2)
## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0)
## org.Hs.eg.db * 3.12.0 2020-12-01 [1] Bioconductor
## pillar 1.4.6 2020-07-10 [1] RSPM (R 4.0.2)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.0)
## polyclip 1.10-0 2019-03-14 [1] RSPM (R 4.0.0)
## ps 1.4.0 2020-10-07 [1] RSPM (R 4.0.2)
## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.0)
## qusage * 2.24.0 2020-10-27 [1] Bioconductor
## R.cache 0.14.0 2019-12-06 [1] RSPM (R 4.0.0)
## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.2)
## R.oo 1.24.0 2020-08-26 [1] RSPM (R 4.0.2)
## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.2)
## R6 2.4.1 2019-11-12 [1] RSPM (R 4.0.0)
## Rcpp 1.0.5 2020-07-06 [1] RSPM (R 4.0.2)
## RCurl 1.98-1.2 2020-04-18 [1] RSPM (R 4.0.0)
## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.2)
## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.0)
## rlang 0.4.8 2020-10-08 [1] RSPM (R 4.0.2)
## rmarkdown 2.4 2020-09-30 [1] RSPM (R 4.0.2)
## RSQLite 2.2.1 2020-09-30 [1] RSPM (R 4.0.2)
## rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0)
## S4Vectors * 0.28.0 2020-10-27 [1] Bioconductor
## scales 1.1.1 2020-05-11 [1] RSPM (R 4.0.0)
## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.0)
## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.2)
## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.0)
## styler 1.3.2 2020-02-23 [1] RSPM (R 4.0.0)
## SummarizedExperiment 1.20.0 2020-10-27 [1] Bioconductor
## tibble 3.0.4 2020-10-12 [1] RSPM (R 4.0.2)
## tidyr 1.1.2 2020-08-27 [1] RSPM (R 4.0.2)
## tidyselect 1.1.0 2020-05-11 [1] RSPM (R 4.0.0)
## tweenr 1.0.1 2018-12-14 [1] RSPM (R 4.0.2)
## vctrs 0.3.4 2020-08-29 [1] RSPM (R 4.0.2)
## withr 2.3.0 2020-09-22 [1] RSPM (R 4.0.2)
## xfun 0.18 2020-09-29 [1] RSPM (R 4.0.2)
## XML 3.99-0.5 2020-07-23 [1] RSPM (R 4.0.2)
## xtable 1.8-4 2019-04-21 [1] RSPM (R 4.0.0)
## XVector 0.30.0 2020-10-27 [1] Bioconductor
## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.0)
## zlibbioc 1.36.0 2020-10-27 [1] Bioconductor
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library
Carlson M., 2020 org.Hs.eg.db: Genome wide annotation for human. http://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html
Hänzelmann S., R. Castelo, and J. Guinney, 2013 GSVA: gene set variation analysis for microarray and RNA-Seq data. BMC Bioinformatics 14: 7. https://doi.org/10.1186/1471-2105-14-7
Hänzelmann S., R. Castelo, and J. Guinney, 2020 GSVA: The gene set variation analysis package for microarray and RNA-seq data. https://www.bioconductor.org/packages/release/bioc/vignettes/GSVA/inst/doc/GSVA.pdf
Khatri P., M. Sirota, and A. J. Butte, 2012 Ten years of pathway analysis: Current approaches and outstanding challenges. PLOS Computational Biology 8: e1002375. https://doi.org/10.1371/journal.pcbi.1002375
Liberzon A., C. Birger, H. Thorvaldsdóttir, M. Ghandi, and J. P. Mesirov et al., 2015 The molecular signatures database hallmark gene set collection. Cell Systems 1. https://doi.org/10.1016/j.cels.2015.12.004
Liberzon A., A. Subramanian, R. Pinchback, H. Thorvaldsdóttir, and P. Tamayo et al., 2011 Molecular signatures database (MSigDB) 3.0. Bioinformatics 27: 1739–1740. https://doi.org/10.1093/bioinformatics/btr260
Malhotra S., 2018 Decoding gene set variation analysis. https://towardsdatascience.com/decoding-gene-set-variation-analysis-8193a0cfda3
Northcott P., D. Shih, J. Peacock, L. Garzia, and S. Morrissy et al., 2012 Subgroup specific structural variation across 1,000 medulloblastoma genomes. Nature 488. https://doi.org/10.1038/nature11327
Ritchie M. E., B. Phipson, D. Wu, Y. Hu, and C. W. Law et al., 2015 limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43: e47. https://doi.org/10.1093/nar/gkv007
Robinson D., Understanding empirical Bayes estimation (using baseball statistics). http://varianceexplained.org/r/empirical_bayes_baseball/
Subramanian A., P. Tamayo, V. K. Mootha, S. Mukherjee, and B. L. Ebert et al., 2005 Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 102: 15545–15550. https://doi.org/10.1073/pnas.0506580102
Yaari G., C. R. Bolen, J. Thakar, and S. H. Kleinstein, 2013 Quantitative set analysis for gene expression: A method to quantify gene set differential expression including gene-gene correlations. Nucleic Acids Research 41: e170. https://doi.org/10.1093/nar/gkt660
+.Rmd
Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!
If you have trouble running this chunk, see our introduction to using .Rmd
s for more resources and explanations.
# Create the data folder if it doesn't exist
-if (!dir.exists("data")) {
- dir.create("data")
-}
-
-# Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
-
-# Create the plots folder if it doesn't exist
-if (!dir.exists(plots_dir)) {
- dir.create(plots_dir)
-}
-
-# Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
-
-# Create the results folder if it doesn't exist
-if (!dir.exists(results_dir)) {
- dir.create(results_dir)
-}
+# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+ dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots"
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+ dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results"
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+ dir.create(results_dir)
+}
In the same place you put this .Rmd
file, you should now have three new empty folders called data
, plots
, and results
!
data/
folder
Refine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip
. Double clicking should unzip this for you and create a folder of the same name.
refine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip
. Double clicking should unzip this for you and create a folder of the same name.
For more details on the contents of this folder see these docs on refine.bio.
The <experiment_accession_id>
folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235
or SRP12345
.
In order for our example here to run without a hitch, we need these files to be in these locations, so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double-check that your files are in the right place.
First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset we are using for this analysis (see the next section for more on this), we will only have to change the file path here to get started.
-# Define the file path to the data directory
-data_dir <- file.path("data", "SRP070849") # Replace with accession number which will be the name of the folder the files will be in
-
-# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
-data_file <- file.path(data_dir, "SRP070849.tsv") # Replace with file path to your dataset
-
-# Declare the file path to the metadata file using the data directory saved as `data_dir`
-metadata_file <- file.path(data_dir, "metadata_SRP070849.tsv") # Replace with file path to your metadata
+# Define the file path to the data directory
+# Replace with the path of the folder the files will be in
+data_dir <- file.path("data", "SRP070849")
+
+# Declare the file path to the gene expression matrix file
+# inside directory saved as `data_dir`
+# Replace with the path to your dataset file
+data_file <- file.path(data_dir, "SRP070849.tsv")
+
+# Declare the file path to the metadata file
+# inside the directory saved as `data_dir`
+# Replace with the path to your metadata file
+metadata_file <- file.path(data_dir, "metadata_SRP070849.tsv")
Now that our file paths are declared, we can use the file.exists()
function to check that the files are where we specified above.
# Check if the gene expression matrix file is at the file path stored in `data_file`
-file.exists(data_file)
+# Check if the gene expression matrix file is at the path stored in `data_file`
+file.exists(data_file)
## [1] TRUE
-# Check if the metadata file is at the file path stored in `metadata_file`
-file.exists(metadata_file)
+# Check if the metadata file is at the file path stored in `metadata_file`
+file.exists(metadata_file)
## [1] TRUE
If the chunk above printed out FALSE
to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.
@@ -3072,87 +3903,37 @@ See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.
-In this analysis, we will be using the R package DESeq2
(Love et al. 2014) for normalization and the R package pheatmap
(Slowikowski 2017) for clustering and creating a heatmap.
if (!("pheatmap" %in% installed.packages())) {
- # Install pheatmap
- install.packages("pheatmap", update = FALSE)
-}
-
-if (!("DESeq2" %in% installed.packages())) {
- # Install DESeq2
- BiocManager::install("DESeq2", update = FALSE)
-}
+In this analysis, we will be using the R package DESeq2
(Love et al. 2014) for normalization and the R package pheatmap
(Slowikowski 2017) for clustering and creating a heatmap.
if (!("pheatmap" %in% installed.packages())) {
+ # Install pheatmap
+ install.packages("pheatmap", update = FALSE)
+}
+
+if (!("DESeq2" %in% installed.packages())) {
+ # Install DESeq2
+ BiocManager::install("DESeq2", update = FALSE)
+}
Attach the pheatmap
and DESeq2
libraries:
# Attach the `pheatmap` library
-library(pheatmap)
-
-# Attach the `DESeq2` library
-library(DESeq2)
-## Loading required package: S4Vectors
-## Loading required package: stats4
-## Loading required package: BiocGenerics
-## Loading required package: parallel
-##
-## Attaching package: 'BiocGenerics'
-## The following objects are masked from 'package:parallel':
-##
-## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
-## clusterExport, clusterMap, parApply, parCapply, parLapply,
-## parLapplyLB, parRapply, parSapply, parSapplyLB
-## The following objects are masked from 'package:stats':
-##
-## IQR, mad, sd, var, xtabs
-## The following objects are masked from 'package:base':
-##
-## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
-## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
-## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
-## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
-## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
-## union, unique, unsplit, which, which.max, which.min
-##
-## Attaching package: 'S4Vectors'
-## The following object is masked from 'package:base':
-##
-## expand.grid
-## Loading required package: IRanges
-## Loading required package: GenomicRanges
-## Loading required package: GenomeInfoDb
-## Loading required package: SummarizedExperiment
-## Loading required package: Biobase
-## Welcome to Bioconductor
-##
-## Vignettes contain introductory material; view with
-## 'browseVignettes()'. To cite Bioconductor, see
-## 'citation("Biobase")', and for packages 'citation("pkgname")'.
-## Loading required package: DelayedArray
-## Loading required package: matrixStats
-##
-## Attaching package: 'matrixStats'
-## The following objects are masked from 'package:Biobase':
-##
-## anyMissing, rowMedians
-##
-## Attaching package: 'DelayedArray'
-## The following objects are masked from 'package:matrixStats':
-##
-## colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
-## The following objects are masked from 'package:base':
-##
-## aperm, apply, rowsum
-# We will need this so we can use the pipe: %>%
-library(magrittr)
-
-# Set the seed so our results are reproducible:
-set.seed(12345)
+
Data downloaded from refine.bio include a metadata tab-separated values (TSV) file and a data TSV file. This chunk of code will read in both TSV files and add them as data frames to your environment.
We stored our file paths as objects named metadata_file
and data_file
in this previous step.
# Read in metadata TSV file
-metadata <- readr::read_tsv(metadata_file)
-## Parsed with column specification:
+
+##
+## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_logical(),
## refinebio_accession_code = col_character(),
@@ -3164,113 +3945,115 @@ 4.2 Import and set up data
## refinebio_subject = col_character(),
## refinebio_title = col_character(),
## refinebio_treatment = col_character()
-## )
-## See spec(...) for full column specifications.
-# Read in data TSV file
-df <- readr::read_tsv(data_file) %>%
- # Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later
- tibble::column_to_rownames("Gene")
-## Parsed with column specification:
+## )
+## ℹ Use `spec()` for the full column specifications.
+# Read in data TSV file
+expression_df <- readr::read_tsv(data_file) %>%
+ # Here we are going to store the gene IDs as row names so that
+ # we can have only numeric values to perform calculations on later
+ tibble::column_to_rownames("Gene")
+##
+## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_double(),
## Gene = col_character()
## )
-## See spec(...) for full column specifications.
+## ℹ Use `spec()` for the full column specifications.
Let’s take a look at the metadata object that we read into the R environment.
-head(metadata)
+
Now let’s ensure that the metadata and data are in the same sample order.
-# Make the data in the order of the metadata
-df <- df %>% dplyr::select(metadata$refinebio_accession_code)
-
-# Check if this is in the same order
-all.equal(colnames(df), metadata$refinebio_accession_code)
+# Make the data in the order of the metadata
+expression_df <- expression_df %>%
+ dplyr::select(metadata$refinebio_accession_code)
+
+# Check if this is in the same order
+all.equal(colnames(expression_df), metadata$refinebio_accession_code)
## [1] TRUE
Now we are going to use a combination of functions from the DESeq2
and pheatmap
packages to look at how our samples and genes are clustering.
We want to filter out the genes that have not been expressed or that have low expression counts since these genes are likely to add noise rather than useful signal to our analysis. We are going to do some pre-filtering to keep only genes with 10 or more reads total. Note that rows represent gene data and the columns represent sample data in our dataset.
+# Define a minimum counts cutoff and filter the data to include
+# only rows (genes) that have total counts above the cutoff
+filtered_expression_df <- expression_df %>%
+ dplyr::filter(rowSums(.) >= 10)
We also need our counts to be rounded before we can use them with the DESeqDataSetFromMatrix()
function.
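As a hedged, base-R sketch of that rounding step (the toy counts and all names below are made up for illustration; real values would come from the expression TSV):

```r
# Toy fractional count estimates; in the real analysis these come from
# the refine.bio expression TSV. All names here are illustrative.
toy_counts <- data.frame(
  sample_1 = c(10.3, 0.8, 250.6),
  sample_2 = c(9.7, 1.2, 248.4),
  row.names = c("gene_a", "gene_b", "gene_c")
)

# round() converts the fractional estimates into the whole numbers
# that `DESeqDataSetFromMatrix()` expects as counts
rounded_counts <- round(toy_counts)
```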
We will be using the DESeq2
package for normalizing and transforming our data, which requires us to format our data into a DESeqDataSet
object. We turn the data frame (or matrix) into a DESeqDataSet
object and specify which variable labels our experimental groups using the design
argument (Love et al. 2014). In this chunk of code, we will not provide a specific model to the design
argument because we are not performing a differential expression analysis.
# The `DESeqDataSetFromMatrix()` function needs the values to be converted to integers
-df <- df %>%
- # Mutate numeric variables to be integers
- dplyr::mutate_if(is.numeric, round)
-
-# Create a `DESeqDataSet` object
-dds <- DESeqDataSetFromMatrix(
- countData = df, # This is the data.frame with the counts values for all replicates in our dataset
- colData = metadata, # This is the data.frame with the annotation data for the replicates in the counts data.frame
- design = ~1 # Here we are not specifying a model -- Replace with an appropriate design variable for your analysis
-)
+We will be using the DESeq2
package for normalizing and transforming our data, which requires us to format our data into a DESeqDataSet
object. We turn the data frame (or matrix) into a DESeqDataSet
object and specify which variable labels our experimental groups using the design
argument (Love et al. 2014). In this chunk of code, we will not provide a specific model to the design
argument because we are not performing a differential expression analysis.
# Create a `DESeqDataSet` object
+dds <- DESeqDataSetFromMatrix(
+ countData = filtered_expression_df, # the counts values for all samples
+ colData = metadata, # annotation data for the samples
+ design = ~1 # Here we are not specifying a model
+ # Replace with an appropriate design variable for your analysis
+)
## converting counts to integer mode
We want to filter out the genes that have not been expressed or that have low expression counts because we want to remove any possible noise from our data before we normalize the data and create our heatmap. We are going to do some pre-filtering to keep only genes with 10 or more reads total. Note that rows represent gene data and the columns represent sample data in our dataset.
-# Define a minimum counts cutoff and filter `DESeqDataSet` object to include
-# only rows that have counts above the cutoff
-genes_to_keep <- rowSums(counts(dds)) >= 10
-dds <- dds[genes_to_keep, ]
-We are going to use the rlog()
function from the DESeq2
package to normalize and transform the data. For more information about these transformation methods, see here.
# Normalize the data in the `DESeqDataSet` object using the `rlog()` function from the `DESeq2` R package
-dds_norm <- rlog(dds)
+
Although you may want to create a heatmap including all of the genes in the set, alternatively, the heatmap could be created using only genes of interest. For this example, we will sort genes by variance, but there are many alternative criteria by which you may want to sort your genes, e.g. fold change, t-statistic, membership in a particular gene ontology, and so on.
-# Calculate the variance for each gene
-variances <- apply(assay(dds_norm), 1, var)
-
-# Determine the upper quartile variance cutoff value
-upper_var <- quantile(variances, 0.75)
-
-# Subset the data choosing only genes whose variances are in the upper quartile
-df_by_var <- data.frame(assay(dds_norm)) %>%
- dplyr::filter(variances > upper_var)
+Although you may want to create a heatmap including all of the genes in the dataset, this can produce a very large image that is hard to interpret. Alternatively, the heatmap could be created using only genes of interest. For this example, we will sort genes by variance and select genes in the upper quartile, but there are many alternative criteria by which you may want to sort your genes, e.g. fold change, t-statistic, membership in a particular gene ontology, and so on.
+# Calculate the variance for each gene
+variances <- apply(assay(dds_norm), 1, var)
+
+# Determine the upper quartile variance cutoff value
+upper_var <- quantile(variances, 0.75)
+
+# Filter the data choosing only genes whose variances are in the upper quartile
+df_by_var <- data.frame(assay(dds_norm)) %>%
+ dplyr::filter(variances > upper_var)
To further customize the heatmap, see a vignette for a guide at this link (Slowikowski 2017).
-# Create and store the heatmap object
-pheatmap <-
- pheatmap(
- df_by_var,
- cluster_rows = TRUE, # We want to cluster the heatmap by rows (genes in this case)
- cluster_cols = TRUE, # We also want to cluster the heatmap by columns (samples in this case),
- show_rownames = FALSE, # We don't want to show the rownames because there are too many genes for the labels to be clearly seen
- main = "Non-Annotated Heatmap",
- colorRampPalette(c(
- "deepskyblue",
- "black",
- "yellow"
- ))(25
- ),
- scale = "row" # Scale values in the direction of genes (rows)
- )
-
+To further customize the heatmap, see a vignette for a guide at this link (Slowikowski 2017).
+# Create and store the heatmap object
+heatmap <- pheatmap(
+ df_by_var,
+ cluster_rows = TRUE, # Cluster the rows of the heatmap (genes in this case)
+ cluster_cols = TRUE, # Cluster the columns of the heatmap (samples),
+ show_rownames = FALSE, # There are too many genes to clearly show the labels
+ main = "Non-Annotated Heatmap",
+ colorRampPalette(c(
+ "deepskyblue",
+ "black",
+ "yellow"
+ ))(25
+ ),
+ scale = "row" # Scale values in the direction of genes (rows)
+)
We’ve created a heatmap, but although our genes and samples are clustered, there is not much information we can gather here because we did not provide the pheatmap()
function with annotation labels for our samples.
First let’s save our clustered heatmap.
You can easily switch this to save to a JPEG or TIFF by changing the function and file name within the function to the respective file suffix.
-# Open a PNG file
-png(file.path(
- plots_dir,
- "SRP070849_heatmap_non_annotated.png" # Replace file name with a relevant output plot name
-))
-
-# Print your heatmap
-pheatmap
-
-# Close the PNG file:
-dev.off()
+# Open a PNG file
+png(file.path(
+ plots_dir,
+ "SRP070849_heatmap_non_annotated.png" # Replace with a relevant file name
+))
+
+# Print your heatmap
+heatmap
+
+# Close the PNG file:
+dev.off()
## png
## 2
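The device swap described above uses base R’s grDevices functions (png(), jpeg(), tiff()), which all follow the same open–print–close pattern. A sketch with an illustrative output path, assuming your R build supports the TIFF device:

```r
# Open a TIFF device instead of a PNG one; everything printed before
# dev.off() is written to the file. The output path is illustrative.
out_file <- file.path(tempdir(), "example_heatmap.tiff")

tiff(out_file)
plot(1:10) # stand-in for printing your stored heatmap object
dev.off()  # close the device so the file is written
```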
Now, let’s add some annotation bars to our heatmap.
@@ -3278,64 +4061,68 @@ From the accompanying paper, we know that the mice with IDH2
mutant AML were treated with vehicle or AG-221 (the first small molecule in-vivo inhibitor of IDH2 to enter clinical trials) and the mice with TET2
mutant AML were treated with vehicle or 5-Azacytidine (a hypomethylating agent) (Shih et al. 2017). We are going to manipulate the metadata and add variables with the information for each sample, from the experimental design briefly described above, that we would like to use to annotate the heatmap.
# Let's prepare the annotation data.frame for the uncollapsed `DESeqDataSet` object which will be used to create the technical replicates heatmap
-annotation_df <- metadata %>%
- # Create a variable to store the cancer type information
- dplyr::mutate(
- mutation = dplyr::case_when(
- startsWith(refinebio_title, "TET2") ~ "TET2",
- startsWith(refinebio_title, "IDH2") ~ "IDH2",
- startsWith(refinebio_title, "WT") ~ "WT",
- TRUE ~ "unknown" # If none of the above criteria are true, we mark the `mutation` variable as "unknown"
- )
- ) %>%
- # We want to select the variables that we want for annotating the technical replicates heatmap
- dplyr::select(
- refinebio_accession_code,
- mutation,
- refinebio_treatment
- ) %>%
- # The `pheatmap()` function requires that the row names of our annotation object matches the column names of our `DESeaDataSet` object
- tibble::column_to_rownames("refinebio_accession_code")
+From the accompanying paper, we know that the mice with IDH2
mutant AML were treated with vehicle or AG-221 (the first small molecule in-vivo inhibitor of IDH2 to enter clinical trials) and the mice with TET2
mutant AML were treated with vehicle or 5-Azacytidine (a hypomethylating agent) (Shih et al. 2017). We are going to manipulate the metadata and add variables with the information for each sample, from the experimental design briefly described above, that we would like to use to annotate the heatmap.
# Let's prepare the annotation for the uncollapsed `DESeqDataSet` object
+# which will be used to annotate the heatmap
+annotation_df <- metadata %>%
+ # Create a variable to store the cancer type information
+ dplyr::mutate(
+ mutation = dplyr::case_when(
+ startsWith(refinebio_title, "TET2") ~ "TET2",
+ startsWith(refinebio_title, "IDH2") ~ "IDH2",
+ startsWith(refinebio_title, "WT") ~ "WT",
+ # If none of the above criteria are satisfied,
+ # we mark the `mutation` variable as "unknown"
+ TRUE ~ "unknown"
+ )
+ ) %>%
+ # select only the columns we need for annotation
+ dplyr::select(
+ refinebio_accession_code,
+ mutation,
+ refinebio_treatment
+ ) %>%
+ # The `pheatmap()` function requires that the row names of our annotation
+ # data frame match the column names of our `DESeqDataSet` object
+ tibble::column_to_rownames("refinebio_accession_code")
You can create an annotated heatmap by providing our annotation object to the annotation_col
argument of the pheatmap()
function.
# Create and store the annotated heatmap object
-pheatmap_annotated <-
- pheatmap(
- df_by_var,
- cluster_rows = TRUE,
- cluster_cols = TRUE,
- show_rownames = FALSE,
- annotation_col = annotation_df,
- main = "Annotated Heatmap",
- colorRampPalette(c(
- "deepskyblue",
- "black",
- "yellow"
- ))(25
- ),
- scale = "row" # Scale values in the direction of genes (rows)
- )
-
+# Create and store the annotated heatmap object
+heatmap_annotated <-
+ pheatmap(
+ df_by_var,
+ cluster_rows = TRUE,
+ cluster_cols = TRUE,
+ show_rownames = FALSE,
+ annotation_col = annotation_df,
+ main = "Annotated Heatmap",
+ colorRampPalette(c(
+ "deepskyblue",
+ "black",
+ "yellow"
+ ))(25
+ ),
+ scale = "row" # Scale values in the direction of genes (rows)
+ )
Now that we have annotation bars on our heatmap, we have a better idea of the sample variable groups that appear to cluster together.
Let’s save our annotated heatmap.
You can easily switch this to save to a JPEG or TIFF by changing the function and file name within the function to the respective file suffix.
-# Open a PNG file
-png(file.path(
- plots_dir,
- "SRP070849_heatmap_annotated.png" # Replace file name with a relevant output plot name
-))
-
-# Print your heatmap
-pheatmap_annotated
-
-# Close the PNG file:
-dev.off()
+You can switch this to save to a JPEG or TIFF by changing the function and file name within the function to the respective file suffix.
+# Open a PNG file
+png(file.path(
+ plots_dir,
+ "SRP070849_heatmap_annotated.png" # Replace with a relevant file name
+))
+
+# Print your heatmap
+heatmap_annotated
+
+# Close the PNG file:
+dev.off()
## png
## 2
pheatmap
package allow, see the ComplexHeatmap Complete Reference Manual (Gu et al. 2016)pheatmap
package allow, see the ComplexHeatmap Complete Reference Manual (Gu et al. 2016)At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.
-# Print session info
-sessioninfo::session_info()
-## ─ Session info ───────────────────────────────────────────────────────────────
+
+## ─ Session info ─────────────────────────────────────────────────────
## setting value
## version R version 4.0.2 (2020-06-22)
## os Ubuntu 20.04 LTS
@@ -3363,46 +4150,47 @@ 6 Session info
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
-## date 2020-10-16
+## date 2020-12-18
##
-## ─ Packages ───────────────────────────────────────────────────────────────────
+## ─ Packages ─────────────────────────────────────────────────────────
## package * version date lib source
-## annotate 1.66.0 2020-04-27 [1] Bioconductor
-## AnnotationDbi 1.50.3 2020-07-25 [1] Bioconductor
+## annotate 1.68.0 2020-10-27 [1] Bioconductor
+## AnnotationDbi 1.52.0 2020-10-27 [1] Bioconductor
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.0)
## backports 1.1.10 2020-09-15 [1] RSPM (R 4.0.2)
-## Biobase * 2.48.0 2020-04-27 [1] Bioconductor
-## BiocGenerics * 0.34.0 2020-04-27 [1] Bioconductor
-## BiocParallel 1.22.0 2020-04-27 [1] Bioconductor
+## Biobase * 2.50.0 2020-10-27 [1] Bioconductor
+## BiocGenerics * 0.36.0 2020-10-27 [1] Bioconductor
+## BiocParallel 1.24.1 2020-11-06 [1] Bioconductor
## bit 4.0.4 2020-08-04 [1] RSPM (R 4.0.2)
## bit64 4.0.5 2020-08-30 [1] RSPM (R 4.0.2)
## bitops 1.0-6 2013-08-17 [1] RSPM (R 4.0.0)
## blob 1.2.1 2020-01-20 [1] RSPM (R 4.0.0)
-## cli 2.0.2 2020-02-28 [1] RSPM (R 4.0.0)
+## cli 2.1.0 2020-10-12 [1] RSPM (R 4.0.2)
## colorspace 1.4-1 2019-03-18 [1] RSPM (R 4.0.0)
## crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.0)
## DBI 1.1.0 2019-12-15 [1] RSPM (R 4.0.0)
-## DelayedArray * 0.14.1 2020-07-14 [1] Bioconductor
-## DESeq2 * 1.28.1 2020-05-12 [1] Bioconductor
+## DelayedArray 0.16.0 2020-10-27 [1] Bioconductor
+## DESeq2 * 1.30.0 2020-10-27 [1] Bioconductor
## digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0)
## dplyr 1.0.2 2020-08-18 [1] RSPM (R 4.0.2)
## ellipsis 0.3.1 2020-05-15 [1] RSPM (R 4.0.0)
## evaluate 0.14 2019-05-28 [1] RSPM (R 4.0.0)
## fansi 0.4.1 2020-01-08 [1] RSPM (R 4.0.0)
## farver 2.0.3 2020-01-16 [1] RSPM (R 4.0.0)
-## genefilter 1.70.0 2020-04-27 [1] Bioconductor
-## geneplotter 1.66.0 2020-04-27 [1] Bioconductor
+## genefilter 1.72.0 2020-10-27 [1] Bioconductor
+## geneplotter 1.68.0 2020-10-27 [1] Bioconductor
## generics 0.0.2 2018-11-29 [1] RSPM (R 4.0.0)
-## GenomeInfoDb * 1.24.2 2020-06-15 [1] Bioconductor
-## GenomeInfoDbData 1.2.3 2020-10-06 [1] Bioconductor
-## GenomicRanges * 1.40.0 2020-04-27 [1] Bioconductor
+## GenomeInfoDb * 1.26.2 2020-12-08 [1] Bioconductor
+## GenomeInfoDbData 1.2.4 2020-12-16 [1] Bioconductor
+## GenomicRanges * 1.42.0 2020-10-27 [1] Bioconductor
## getopt 1.20.3 2019-03-22 [1] RSPM (R 4.0.0)
## ggplot2 3.3.2 2020-06-19 [1] RSPM (R 4.0.1)
## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.2)
## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.0.0)
## hms 0.5.3 2020-01-08 [1] RSPM (R 4.0.0)
## htmltools 0.5.0 2020-06-16 [1] RSPM (R 4.0.1)
-## IRanges * 2.22.2 2020-05-21 [1] Bioconductor
+## httr 1.4.2 2020-07-20 [1] RSPM (R 4.0.2)
+## IRanges * 2.24.1 2020-12-12 [1] Bioconductor
## jsonlite 1.7.1 2020-09-07 [1] RSPM (R 4.0.2)
## knitr 1.30 2020-09-22 [1] RSPM (R 4.0.2)
## lattice 0.20-41 2020-04-02 [2] CRAN (R 4.0.2)
@@ -3410,6 +4198,7 @@ 6 Session info
## locfit 1.5-9.4 2020-03-25 [1] RSPM (R 4.0.0)
## magrittr * 1.5 2014-11-22 [1] RSPM (R 4.0.0)
## Matrix 1.2-18 2019-11-27 [2] CRAN (R 4.0.2)
+## MatrixGenerics * 1.2.0 2020-10-27 [1] Bioconductor
## matrixStats * 0.57.0 2020-09-25 [1] RSPM (R 4.0.2)
## memoise 1.1.0 2017-04-21 [1] RSPM (R 4.0.0)
## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.0.0)
@@ -3417,6 +4206,7 @@ 6 Session info
## pheatmap * 1.0.12 2019-01-04 [1] RSPM (R 4.0.0)
## pillar 1.4.6 2020-07-10 [1] RSPM (R 4.0.2)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.0)
+## ps 1.4.0 2020-10-07 [1] RSPM (R 4.0.2)
## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.0)
## R.cache 0.14.0 2019-12-06 [1] RSPM (R 4.0.0)
## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.2)
@@ -3426,30 +4216,30 @@ 6 Session info
## RColorBrewer 1.1-2 2014-12-07 [1] RSPM (R 4.0.0)
## Rcpp 1.0.5 2020-07-06 [1] RSPM (R 4.0.2)
## RCurl 1.98-1.2 2020-04-18 [1] RSPM (R 4.0.0)
-## readr 1.3.1 2018-12-21 [1] RSPM (R 4.0.2)
+## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.2)
## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.0)
-## rlang 0.4.7 2020-07-09 [1] RSPM (R 4.0.2)
+## rlang 0.4.8 2020-10-08 [1] RSPM (R 4.0.2)
## rmarkdown 2.4 2020-09-30 [1] RSPM (R 4.0.2)
-## RSQLite 2.2.0 2020-01-07 [1] RSPM (R 4.0.2)
+## RSQLite 2.2.1 2020-09-30 [1] RSPM (R 4.0.2)
## rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0)
-## S4Vectors * 0.26.1 2020-05-16 [1] Bioconductor
+## S4Vectors * 0.28.1 2020-12-09 [1] Bioconductor
## scales 1.1.1 2020-05-11 [1] RSPM (R 4.0.0)
## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.0)
## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.2)
## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.0)
## styler 1.3.2 2020-02-23 [1] RSPM (R 4.0.0)
-## SummarizedExperiment * 1.18.2 2020-07-09 [1] Bioconductor
+## SummarizedExperiment * 1.20.0 2020-10-27 [1] Bioconductor
## survival 3.1-12 2020-04-10 [2] CRAN (R 4.0.2)
-## tibble 3.0.3 2020-07-10 [1] RSPM (R 4.0.2)
+## tibble 3.0.4 2020-10-12 [1] RSPM (R 4.0.2)
## tidyselect 1.1.0 2020-05-11 [1] RSPM (R 4.0.0)
## vctrs 0.3.4 2020-08-29 [1] RSPM (R 4.0.2)
## withr 2.3.0 2020-09-22 [1] RSPM (R 4.0.2)
## xfun 0.18 2020-09-29 [1] RSPM (R 4.0.2)
## XML 3.99-0.5 2020-07-23 [1] RSPM (R 4.0.2)
## xtable 1.8-4 2019-04-21 [1] RSPM (R 4.0.0)
-## XVector 0.28.0 2020-04-27 [1] Bioconductor
+## XVector 0.30.0 2020-10-27 [1] Bioconductor
## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.0)
-## zlibbioc 1.34.0 2020-04-27 [1] Bioconductor
+## zlibbioc 1.36.0 2020-10-27 [1] Bioconductor
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library
@@ -3458,20 +4248,25 @@ 6 Session info
References
-Gu Z., R. Eils, and M. Schlesner, 2016 Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics.
+Gu Z., R. Eils, and M. Schlesner, 2016 Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics. https://doi.org/10.1093/bioinformatics/btw313
-Love M. I., W. Huber, and S. Anders, 2014 Moderated estimation of fold change and dispersion for rna-seq data with deseq2. Genome Biology 15. https://doi.org/10.1186/s13059-014-0550-8
+Love M. I., W. Huber, and S. Anders, 2014 Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biology 15. https://doi.org/10.1186/s13059-014-0550-8
-Shih A. H., C. Meydan, K. Shank, F. E. Garrett-Bakelman, and P. S. Ward et al., 2017 Combination targeted therapy to disrupt aberrant oncogenic signaling and reverse epigenetic dysfunction in idh2- and tet2-mutant acute myeloid leukemia. Cancer Discovery 7. https://doi.org/10.1158/2159-8290.CD-16-1049
+Shih A. H., C. Meydan, K. Shank, F. E. Garrett-Bakelman, and P. S. Ward et al., 2017 Combination targeted therapy to disrupt aberrant oncogenic signaling and reverse epigenetic dysfunction in IDH2- and TET2-mutant acute myeloid leukemia. Cancer Discovery 7. https://doi.org/10.1158/2159-8290.CD-16-1049
-Slowikowski K., 2017 Make heatmaps in r with pheatmap
+Slowikowski K., 2017 Make heatmaps in R with pheatmap. https://slowkow.com/notes/pheatmap-tutorial/
This notebook takes RNA-seq data and metadata from refine.bio and identifies differentially expressed genes between experimental groups.
+This notebook takes RNA-seq expression data and metadata from refine.bio and identifies differentially expressed genes between two experimental groups.
+Differential expression analysis identifies genes with significantly varying expression among experimental groups by comparing the variation among samples within a group to the variation between groups. The simplest version of this analysis is comparing two groups where one of those groups is a control group.
+Our refine.bio RNA-seq examples use DESeq2 for these analyses because it handles RNA-seq data well and has great documentation.
+Read more about DESeq2 and why we like it on our Getting Started page.
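The core idea above can be sketched with a toy calculation (the numbers here are invented for illustration and are not from this dataset): a gene looks differentially expressed when the difference between group means is large relative to the spread within each group.

```r
# Hypothetical expression values for one gene in two groups
control <- c(5.1, 4.9, 5.3)
treated <- c(7.8, 8.1, 7.6)

# A simple two-group comparison: a small p value means the between-group
# difference is large relative to the within-group variation
t.test(treated, control)$p.value
```

DESeq2 does something more sophisticated than a t-test (it models counts with a negative binomial distribution and shares information across genes), but comparing between-group differences to within-group variation is the same underlying idea.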
.Rmd
Good file organization is helpful for keeping your data analysis project on track! We have written some code that will automatically create a folder structure for you. Run this next chunk to set up your folders!
If you have trouble running this chunk, see our introduction to using .Rmd
s for more resources and explanations.
# Create the data folder if it doesn't exist
-if (!dir.exists("data")) {
- dir.create("data")
-}
-
-# Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
-
-# Create the plots folder if it doesn't exist
-if (!dir.exists(plots_dir)) {
- dir.create(plots_dir)
-}
-
-# Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
-
-# Create the results folder if it doesn't exist
-if (!dir.exists(results_dir)) {
- dir.create(results_dir)
-}
+# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+ dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots"
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+ dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results"
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+ dir.create(results_dir)
+}
In the same place you put this .Rmd
file, you should now have three new empty folders called data
, plots
, and results
!
For general information about downloading data for these examples, see our ‘Getting Started’ section.
-Go to this dataset’s page on refine.bio.
+Go to this dataset’s page on refine.bio.
Click the “Download Now” button on the right side of this screen.
Fill out the pop-up window with your email address and agreement to our Terms and Conditions:
@@ -3007,8 +3836,7 @@ For this example analysis, we will use this acute myeloid leukemia (AML) dataset (Micol et al. 2017)
-Micol et al. (2017) performed RNA-seq on primary peripheral blood and bone marrow samples from AML patients with and without ASXL1/2 mutations.
+For this example analysis, we are using RNA-seq data from an acute lymphoblastic leukemia (ALL) mouse lymphoid cell model (Kampen et al. 2019). All of the mouse lymphoid cell samples in this experiment carry a human RPL10 gene: three with the reference (wild-type) RPL10 gene and three with the R98S mutation. We will perform our differential expression analysis using these knock-in and wild-type designations.
data/
folder
For more details on the contents of this folder see these docs on refine.bio.
The <experiment_accession_id>
folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235
or SRP12345
.
Copy and paste the SRP078441
folder into your newly created data/
folder.
Copy and paste the SRP123625
folder into your newly created data/
folder.
SRP078441
folder which contains:
+SRP123625
folder which contains:
In order for our example here to run without a hitch, we need these files to be in these locations, so we’ve constructed a check to run before we get started with the analysis. These chunks will declare your file paths and double-check that your files are in the right place.
First we will declare the file paths to our data and metadata files, which should be in our data directory. This is handy because if we want to switch the dataset we are using for this analysis (see the next section for more on this), we will only have to change the file paths here to get started.
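As a quick aside, file.path() pastes folder and file names together with the correct separator, which is why we use it instead of typing paths by hand. A toy example:

```r
# file.path() builds a platform-appropriate path from its pieces
file.path("data", "SRP123625", "SRP123625.tsv")
## [1] "data/SRP123625/SRP123625.tsv"
```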
-# Define the file path to the data directory
-data_dir <- file.path("data", "SRP078441") # Replace with accession number which will be the name of the folder the files will be in
-
-# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
-data_file <- file.path(data_dir, "SRP078441.tsv") # Replace with file path to your dataset
-
-# Declare the file path to the metadata file using the data directory saved as `data_dir`
-metadata_file <- file.path(data_dir, "metadata_SRP078441.tsv") # Replace with file path to your metadata
+# Define the file path to the data directory
+# Replace with the path of the folder the files will be in
+data_dir <- file.path("data", "SRP123625")
+
+# Declare the file path to the gene expression matrix file
+# inside directory saved as `data_dir`
+# Replace with the path to your dataset file
+data_file <- file.path(data_dir, "SRP123625.tsv")
+
+# Declare the file path to the metadata file
+# inside the directory saved as `data_dir`
+# Replace with the path to your metadata file
+metadata_file <- file.path(data_dir, "metadata_SRP123625.tsv")
Now that our file paths are declared, we can use the file.exists()
function to check that the files are where we specified above.
# Check if the gene expression matrix file is at the file path stored in `data_file`
-file.exists(data_file)
+# Check if the gene expression matrix file is at the path stored in `data_file`
+file.exists(data_file)
## [1] TRUE
-# Check if the metadata file is at the file path stored in `metadata_file`
-file.exists(metadata_file)
+# Check if the metadata file is at the file path stored in `metadata_file`
+file.exists(metadata_file)
## [1] TRUE
If the chunk above printed out FALSE
for either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.
@@ -3072,89 +3905,39 @@ See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.
-In this analysis, we will be using DESeq2
(Love et al. 2014) for the differential expression testing. We will also use EnhancedVolcano
for plotting and apeglm
for some log fold change estimates in the results table (Zhu et al. 2018; Blighe et al. 2020)
if (!("DESeq2" %in% installed.packages())) {
- # Install this package if it isn't installed yet
- BiocManager::install("DESeq2", update = FALSE)
-}
-if (!("EnhancedVolcano" %in% installed.packages())) {
- # Install this package if it isn't installed yet
- BiocManager::install("EnhancedVolcano", update = FALSE)
-}
-if (!("apeglm" %in% installed.packages())) {
- # Install this package if it isn't installed yet
- BiocManager::install("apeglm", update = FALSE)
-}
+In this analysis, we will be using DESeq2
(Love et al. 2014) for the differential expression testing. We will also use EnhancedVolcano
(Blighe et al. 2020) for plotting and apeglm
(Zhu et al. 2018) for some log fold change estimates in the results table.
if (!("DESeq2" %in% installed.packages())) {
+ # Install this package if it isn't installed yet
+ BiocManager::install("DESeq2", update = FALSE)
+}
+if (!("EnhancedVolcano" %in% installed.packages())) {
+ # Install this package if it isn't installed yet
+ BiocManager::install("EnhancedVolcano", update = FALSE)
+}
+if (!("apeglm" %in% installed.packages())) {
+ # Install this package if it isn't installed yet
+ BiocManager::install("apeglm", update = FALSE)
+}
Attach the libraries we need for this analysis:
-# Attach the DESeq2 library
-library(DESeq2)
-## Loading required package: S4Vectors
-## Loading required package: stats4
-## Loading required package: BiocGenerics
-## Loading required package: parallel
-##
-## Attaching package: 'BiocGenerics'
-## The following objects are masked from 'package:parallel':
-##
-## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
-## clusterExport, clusterMap, parApply, parCapply, parLapply,
-## parLapplyLB, parRapply, parSapply, parSapplyLB
-## The following objects are masked from 'package:stats':
-##
-## IQR, mad, sd, var, xtabs
-## The following objects are masked from 'package:base':
-##
-## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
-## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
-## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
-## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
-## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
-## union, unique, unsplit, which, which.max, which.min
-##
-## Attaching package: 'S4Vectors'
-## The following object is masked from 'package:base':
-##
-## expand.grid
-## Loading required package: IRanges
-## Loading required package: GenomicRanges
-## Loading required package: GenomeInfoDb
-## Loading required package: SummarizedExperiment
-## Loading required package: Biobase
-## Welcome to Bioconductor
-##
-## Vignettes contain introductory material; view with
-## 'browseVignettes()'. To cite Bioconductor, see
-## 'citation("Biobase")', and for packages 'citation("pkgname")'.
-## Loading required package: DelayedArray
-## Loading required package: matrixStats
-##
-## Attaching package: 'matrixStats'
-## The following objects are masked from 'package:Biobase':
-##
-## anyMissing, rowMedians
-##
-## Attaching package: 'DelayedArray'
-## The following objects are masked from 'package:matrixStats':
-##
-## colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
-## The following objects are masked from 'package:base':
-##
-## aperm, apply, rowsum
-# Attach the ggplot2 library for plotting
-library(ggplot2)
-
-# We will need this so we can use the pipe: %>%
-library(magrittr)
+# Attach the DESeq2 library
+library(DESeq2)
+
+# Attach the ggplot2 library for plotting
+library(ggplot2)
+
+# We will need this so we can use the pipe: %>%
+library(magrittr)
The jitter plot we make later on with the DESeq2::plotCounts()
function involves some randomness. As is good practice when our analysis involves randomness, we will set the seed.
set.seed(12345)
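To see what setting the seed buys us, here is a toy demonstration (not part of the analysis): re-setting the same seed makes random draws repeat exactly.

```r
# Setting the same seed before each draw makes the "random" numbers identical
set.seed(12345)
first_draw <- rnorm(3)
set.seed(12345)
second_draw <- rnorm(3)
identical(first_draw, second_draw)
## [1] TRUE
```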
+
Data downloaded from refine.bio include a metadata tab-separated values (TSV) file and a data TSV file. This chunk of code will read both TSV files and add them as data frames to your environment.
We stored our file paths as objects named metadata_file
and data_file
in this previous step.
# Read in metadata TSV file
-metadata <- readr::read_tsv(metadata_file)
-## Parsed with column specification:
+
+##
+## ── Column specification ──────────────────────────────────────────────────────────────────────────────────
## cols(
## .default = col_logical(),
## refinebio_accession_code = col_character(),
@@ -3165,212 +3948,207 @@ 4.2 Import data and metadata
## refinebio_specimen_part = col_character(),
## refinebio_subject = col_character(),
## refinebio_title = col_character()
-## )
-## See spec(...) for full column specifications.
-# Read in data TSV file
-df <- readr::read_tsv(data_file) %>%
- tibble::column_to_rownames("Gene")
-## Parsed with column specification:
+## )
+## ℹ Use `spec()` for the full column specifications.
+# Read in data TSV file
+expression_df <- readr::read_tsv(data_file) %>%
+ tibble::column_to_rownames("Gene")
+##
+## ── Column specification ──────────────────────────────────────────────────────────────────────────────────
## cols(
## Gene = col_character(),
-## SRR3895734 = col_double(),
-## SRR3895735 = col_double(),
-## SRR3895736 = col_double(),
-## SRR3895737 = col_double(),
-## SRR3895738 = col_double(),
-## SRR3895739 = col_double(),
-## SRR3895740 = col_double(),
-## SRR3895741 = col_double(),
-## SRR3895742 = col_double(),
-## SRR3895743 = col_double(),
-## SRR3895744 = col_double(),
-## SRR3895745 = col_double(),
-## SRR3895746 = col_double(),
-## SRR3895747 = col_double(),
-## SRR3895748 = col_double(),
-## SRR3895749 = col_double()
+## SRR6255584 = col_double(),
+## SRR6255585 = col_double(),
+## SRR6255586 = col_double(),
+## SRR6255587 = col_double(),
+## SRR6255588 = col_double(),
+## SRR6255589 = col_double()
## )
Let’s ensure that the metadata and data are in the same sample order.
-# Make the data in the order of the metadata
-df <- df %>%
- dplyr::select(metadata$refinebio_accession_code)
-
-# Check if this is in the same order
-all.equal(colnames(df), metadata$refinebio_accession_code)
+# Make the data in the order of the metadata
+expression_df <- expression_df %>%
+ dplyr::select(metadata$refinebio_accession_code)
+
+# Check if this is in the same order
+all.equal(colnames(expression_df), metadata$refinebio_accession_code)
## [1] TRUE
The information we need to make the comparison is in the refinebio_title
column of the metadata data.frame.
-head(metadata$refinebio_title)
-## [1] "CBF54-BM-ASXLwt" "49060-2010-BM-ASXLwt" "CBF234-BM-ASXLwt"
-## [4] "CBF124-BM-ASXLwt" "41267-BM-ASXL2" "45565-BM-ASXL1"
+
+## [1] "R98S11_mRNA_Suppl" "R98S13_mRNA_Suppl" "R98S35_mRNA_Suppl"
+## [4] "WT28_mRNA_Suppl" "WT29_mRNA_Suppl" "WT36_mRNA_Suppl"
-This dataset includes data from patients with and without ASXL gene mutations. The authors of this data have ASXL mutation status along with other information is stored all in one string (this is not very convenient for us). We need to extract the mutation status information into its own column to make it easier to use.
-metadata <- metadata %>%
- # The last bit of the title, separated by "-" contains the mutation
- # information that we want to extract
- dplyr::mutate(asxl_mutation_status = stringr::word(refinebio_title,
- -1,
- sep = "-"
- )) %>%
- # Now let's summarized the ASXL1 mutation status from this variable
- dplyr::mutate(asxl_mutation_status = dplyr::case_when(
- grepl("ASXL1|ASXL2", asxl_mutation_status) ~ "asxl_mutation",
- grepl("ASXLwt", asxl_mutation_status) ~ "no_mutation"
- ))
-Let’s take a look at metadata_df
to see if this worked.
# looking at the first 6 rows of the metadata_df and only at the columns that
-# contain the title and the mutation status we extracted from the title
-head(dplyr::select(metadata, refinebio_title, asxl_mutation_status))
+This dataset includes data from mouse lymphoid cells with human RPL10, with and without an R98S
mutation. The mutation status is stored along with other information in a single string (this is not very convenient for us). We need to extract the mutation status information into its own column to make it easier to use.
metadata <- metadata %>%
+ # Let's get the RPL10 mutation status from this variable
+ dplyr::mutate(mutation_status = dplyr::case_when(
+ stringr::str_detect(refinebio_title, "R98S") ~ "R98S",
+ stringr::str_detect(refinebio_title, "WT") ~ "reference"
+ ))
Let’s take a look at metadata
to see if this worked by looking at the refinebio_title
and mutation_status
columns.
# Let's take a look at the original metadata column's info
+# and our new `mutation_status` column
+dplyr::select(metadata, refinebio_title, mutation_status)
Before we set up our model in the next step, we want to check that our modeling variable is set correctly. We want our “control” to be set as the first level in the variable we provide as our experimental variable. Here we will use the str()
function to print out a preview of the structure of our variable.
# Print out a preview of `asxl_mutation_status`
-str(metadata$asxl_mutation_status)
-## chr [1:16] "no_mutation" "no_mutation" "no_mutation" "no_mutation" ...
-Currently, asxl_mutation_status
is a character. To make sure it is set how we want for the DESeq
object and subsequent testing, let’s mutate it to a factor so we can explicitly set the levels.
# Make asxl_mutation_status a factor and set the levels appropriately
-metadata <- metadata %>%
- dplyr::mutate(
- # Here we will set up the factor aspect of our new variable.
- asxl_mutation_status = factor(asxl_mutation_status, levels = c("no_mutation", "asxl_mutation"))
- )
+
+## chr [1:6] "R98S" "R98S" "R98S" "reference" "reference" ...
+Currently, mutation_status
is stored as a character, which is not necessarily what we want. To make sure it is set how we want for the DESeq
object and subsequent testing, let’s change it to a factor so we can explicitly set the levels.
In the levels
argument, we will list reference
first since that is our control group.
# Make mutation_status a factor and set the levels appropriately
+metadata <- metadata %>%
+ dplyr::mutate(
+ # Here we define the values our factor variable can have and their order.
+ mutation_status = factor(mutation_status, levels = c("reference", "R98S"))
+ )
Note if you don’t specify levels
, the factor()
function will set levels in alphabetical order – which sometimes means your control group will not be listed first!
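Here is a toy demonstration of that gotcha (values invented for illustration):

```r
# Without explicit levels, factor() sorts alphabetically,
# so "R98S" would end up before "reference"
status <- c("reference", "R98S", "reference")
levels(factor(status))
## [1] "R98S"      "reference"

# Supplying `levels` puts the control group first
levels(factor(status, levels = c("reference", "R98S")))
## [1] "reference" "R98S"
```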
Let’s double check if the levels are what we want using the levels()
function.
levels(metadata$mutation_status)
-## [1] "no_mutation" "asxl_mutation"
-Yes! no_mutation
is the first level as we want it to be. We’re all set and ready to move on to making our DESeq2Dataset
object.
We will be using the DESeq2
package for differential expression testing, which requires us to format our data into a DESeqDataSet
object. First we need to prep our gene expression data frame so it’s in the format that is compatible with the DESeqDataSetFromMatrix()
function in the next step.
# We are making our data frame into a matrix and rounding the numbers
-gene_matrix <- round(as.matrix(df))
-Now we need to create DESeqDataSet
from our expression dataset. We use the asxl_mutation_status
variable we created in the design formula because that will allow us to model the presence/absence of ASXL1/2 mutation.
ddset <- DESeqDataSetFromMatrix(
- countData = gene_matrix,
- colData = metadata,
- design = ~asxl_mutation_status
-)
-## converting counts to integer mode
+
+## [1] "reference" "R98S"
+Yes! reference
is the first level as we want it to be. We’re all set and ready to move on to making our DESeqDataSet
object.
We want to filter out genes that are not expressed or that have low expression counts, since these do not carry enough information to yield reliable differential expression results. Removing these genes also saves on memory usage during the tests. We are going to do some pre-filtering to keep only genes with 10 or more reads in total across the samples.
-# Define a minimum counts cutoff and filter `DESeqDataSet` object to include
-# only rows that have counts above the cutoff
-genes_to_keep <- rowSums(counts(ddset)) >= 10
-ddset <- ddset[genes_to_keep, ]
+# Define a minimum counts cutoff and filter the data to include
+# only rows (genes) that have total counts above the cutoff
+filtered_expression_df <- expression_df %>%
+ dplyr::filter(rowSums(.) >= 10)
If you have a bigger dataset, you will probably want to make this cutoff larger.
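To make the cutoff concrete, here is a toy example with made-up counts showing what rowSums() does:

```r
# Hypothetical counts for three genes across two samples
toy_counts <- data.frame(
  sample1 = c(0, 5, 100),
  sample2 = c(1, 6, 90),
  row.names = c("geneA", "geneB", "geneC")
)

# rowSums() totals each gene across samples; geneA (total = 1) is dropped
toy_counts[rowSums(toy_counts) >= 10, ]
```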
+We will be using the DESeq2
package for differential expression testing, which requires us to format our data into a DESeqDataSet
object. First we need to prep our gene expression data frame so that all of the count values are integers, making it compatible with the DESeqDataSetFromMatrix()
function in the next step.
+# We need the count data to be a matrix of integers
+gene_matrix <- round(as.matrix(filtered_expression_df))
Now we need to create a DESeqDataSet
from our expression dataset. We use the mutation_status
variable we created in the design formula because that will allow us to model the presence/absence of the R98S mutation.
ddset <- DESeqDataSetFromMatrix(
+ # Here we supply non-normalized count data
+ countData = gene_matrix,
+ # Supply the `colData` with our metadata data frame
+ colData = metadata,
+ # Supply our experimental variable to `design`
+ design = ~mutation_status
+)
## converting counts to integer mode
We’ll use the wrapper function DESeq()
to do our differential expression analysis. In our DESeq2
object we designated our asxl_mutation_status
variable as the model
argument. Because of this, the DESeq
function will use groups defined by asxl_mutation_status
to test for differential expression.
deseq_object <- DESeq(ddset)
+We’ll use the wrapper function DESeq()
to do our differential expression analysis. In our DESeq2
object we designated our mutation_status
variable in the design
formula. Because of this, the DESeq
function will use groups defined by mutation_status
to test for differential expression.
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
-## -- replacing outliers and refitting for 745 genes
-## -- DESeq argument 'minReplicatesForReplace' = 7
-## -- original counts are preserved in counts(dds)
-## estimating dispersions
-## fitting model and testing
Let’s extract the results table from the DESeq
object.
deseq_results <- results(deseq_object)
-Here we will use lfcShrink()
function to obtain shrunken log fold change estimates based on negative binomial distribution. This will add the estimates to your results table. Using lfcShrink()
can help decrease noise and preserve large differences between groups (it requires that apeglm
package be installed).
deseq_results <- lfcShrink(deseq_object, # This is the original DESeq2 object with DESeq() already having been ran
- coef = 2, # This is based on what log fold change coefficient was used in DESeq(), the default is 2.
- res = deseq_results # This needs to be the DESeq2 results table
-)
+
+Here we will use the lfcShrink()
function to obtain shrunken log fold change estimates based on a negative binomial distribution. This will add the estimates to your results table. Using lfcShrink()
can help decrease noise and preserve large differences between groups (it requires that the apeglm
package be installed) (Zhu et al. 2018).
deseq_results <- lfcShrink(
+ deseq_object, # The original DESeq2 object after running DESeq()
+ coef = 2, # The log fold change coefficient used in DESeq(); the default is 2.
+ res = deseq_results # The original DESeq2 results table
+)
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
## Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
## sequence count data: removing the noise and preserving large differences.
## Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
-Now let’s take a peek at what our results table looks like.
-head(deseq_results)
-## log2 fold change (MAP): asxl mutation status asxl mutation vs no mutation
-## Wald test p-value: asxl mutation status asxl mutation vs no mutation
+Now let’s take a peek at what our new results table looks like.
+
+## log2 fold change (MAP): mutation status R98S vs reference
+## Wald test p-value: mutation status R98S vs reference
## DataFrame with 6 rows and 5 columns
-## baseMean log2FoldChange lfcSE pvalue padj
-## <numeric> <numeric> <numeric> <numeric> <numeric>
-## ENSG00000000003 52.852059 9.64776e-07 0.00144269 0.6604581 0.998525
-## ENSG00000000005 0.260056 2.03308e-07 0.00144269 0.5791283 NA
-## ENSG00000000419 406.161355 -1.38160e-06 0.00144268 0.7076357 0.998525
-## ENSG00000000457 564.784021 -2.36497e-06 0.00144268 0.3719972 0.998525
-## ENSG00000000460 401.130684 3.87280e-06 0.00144269 0.1627644 0.998525
-## ENSG00000000938 1500.527448 3.07435e-06 0.00144269 0.0354596 0.720601
-Note it is not filtered or sorted, so we will use tidyverse to do this before saving our results to a file. Sort and filter the results.
-# this is of class DESeqResults -- we want a data frame
-deseq_df <- deseq_results %>%
- # make into data.frame
- as.data.frame() %>%
- # the gene names are rownames -- let's make this it's own column for easy
- # display
- tibble::rownames_to_column("Gene") %>%
- dplyr::mutate(threshold = padj < 0.05) %>%
- # let's sort by statistic -- the highest values should be what is up in the
- # ASXL mutated samples
- dplyr::arrange(dplyr::desc(log2FoldChange))
-Let’s print out what the top results are.
-head(deseq_df)
+## baseMean log2FoldChange lfcSE pvalue
+## <numeric> <numeric> <numeric> <numeric>
+## ENSMUSG00000000001 9579.0571 -0.4349384 0.160640 2.59595e-03
+## ENSMUSG00000000028 1199.7333 0.0647514 0.134708 6.04429e-01
+## ENSMUSG00000000056 1287.5086 0.3243824 0.272978 1.02032e-01
+## ENSMUSG00000000058 20.1703 5.0170059 1.515508 6.85780e-05
+## ENSMUSG00000000078 4939.6277 -0.9574237 0.234363 4.75060e-06
+## ENSMUSG00000000085 1150.9626 0.0929495 0.126941 4.32755e-01
+## padj
+## <numeric>
+## ENSMUSG00000000001 0.019791734
+## ENSMUSG00000000028 0.808664075
+## ENSMUSG00000000056 0.283225795
+## ENSMUSG00000000058 0.001074535
+## ENSMUSG00000000078 0.000113951
+## ENSMUSG00000000085 0.682936007
+Note that the results table is not filtered or sorted, so we will use tidyverse functions to do this before saving our results to a file.
+# this is of class DESeqResults -- we want a data frame
+deseq_df <- deseq_results %>%
+ # make into data.frame
+ as.data.frame() %>%
+ # the gene names are row names -- let's make them a column for easy display
+ tibble::rownames_to_column("Gene") %>%
+ # add a column for significance threshold results
+ dplyr::mutate(threshold = padj < 0.05) %>%
+ # sort by statistic -- the highest values will be genes with
+ # higher expression in RPL10 mutated samples
+ dplyr::arrange(dplyr::desc(log2FoldChange))
Let’s print out the top results.
+To double check what a differentially expressed gene looks like, we can plot one with the DESeq2::plotCounts()
function.
plotCounts(ddset, gene = "ENSMUSG00000000058", intgroup = "mutation_status")
-
-The mutation
group samples have higher expression of this gene than the control group, which helps assure us that the results are showing us what we are looking for.
The R98S
mutated samples have higher expression of this gene than the control group, which helps assure us that the results are showing us what we are looking for.
Write the results table to file.
-readr::write_tsv(
- deseq_df,
- file.path(
- results_dir,
- "SRP078441_differential_expression_results.tsv" # Replace with a relevant output file name
- )
-)
+
We’ll use the EnhancedVolcano
package’s main function to plot our data (Zhu et al. 2018). Here we are plotting the log2FoldChange
(which was estimated by lfcShrink
step) on the x axis and padj
on the y axis. The padj
variable are the p values corrected with Benjamini-Hochberg
(the default from the results()
step).
EnhancedVolcano::EnhancedVolcano(
- deseq_df,
- lab = deseq_df$Gene, # A vector that contains our gene names
- x = "log2FoldChange", # The variable in `deseq_df` you want to be plotted on the x axis
- y = "padj" # The variable in `deseq_df` you want to be plotted on the y axis
-)
-
-Here the red point is the gene that meets both the default p value and log2 fold change cutoff (which are 10e-6 and 1 respectively).
-We used the adjusted p values for our plot above, so you may want to loosen this cutoff with the pCutoff
argument (Take a look at all the options for tailoring this plot using ?EnhancedVolcano
).
Let’s make the same plot again, but adjust the pCutoff
since we are using multiple-testing corrected p values and this time we will assign the plot to our environment as volcano_plot
.
# We'll assign this as `volcano_plot` this time
-volcano_plot <- EnhancedVolcano::EnhancedVolcano(
- deseq_df,
- lab = deseq_df$Gene,
- x = "log2FoldChange",
- y = "padj",
- pCutoff = 0.01 # Loosen the cutoff since we supplied corrected p-values
-)
-
-# Print out plot here
-volcano_plot
-
-This looks pretty good. Let’s save it to a PNG.
-ggsave(
- plot = volcano_plot,
- file.path(plots_dir, "SRP078441_volcano_plot.png")
-) # Replace with a plot name relevant to your data
+We’ll use the EnhancedVolcano
package’s main function to plot our data (Blighe et al. 2020).
Here we are plotting the log2FoldChange
(which was estimated in the lfcShrink
step) on the x axis and padj
on the y axis. The padj
variable contains the p values corrected with Benjamini-Hochberg
(the default from the results()
step).
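If you are curious about what the Benjamini-Hochberg correction does to raw p values, base R's p.adjust() shows it on a toy vector (values invented for illustration):

```r
# BH adjustment inflates each p value based on its rank, controlling the
# false discovery rate across the five hypothetical tests
raw_p <- c(0.001, 0.01, 0.02, 0.04, 0.20)
p.adjust(raw_p, method = "BH")
```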
Because we are using adjusted p values we can feel safe in making our pCutoff
argument 0.01
(default is 1e-05
).
+Take a look at all the options for tailoring this plot using ?EnhancedVolcano
.
We will save the plot to our environment as volcano_plot
to make it easier to save the figure separately later.
# We'll assign this as `volcano_plot`
+volcano_plot <- EnhancedVolcano::EnhancedVolcano(
+ deseq_df,
+ lab = deseq_df$Gene,
+ x = "log2FoldChange",
+ y = "padj",
+ pCutoff = 0.01 # Loosen the cutoff since we supplied corrected p-values
+)
## Registered S3 methods overwritten by 'ggalt':
+## method from
+## grid.draw.absoluteGrob ggplot2
+## grobHeight.absoluteGrob ggplot2
+## grobWidth.absoluteGrob ggplot2
+## grobX.absoluteGrob ggplot2
+## grobY.absoluteGrob ggplot2
+
+
+This looks pretty good! Let’s save it to a PNG.
+ggsave(
+ plot = volcano_plot,
+ file.path(plots_dir, "SRP123625_volcano_plot.png")
+) # Replace with a plot name relevant to your data
## Saving 7 x 5 in image
Heatmaps are also a pretty common way to show differential expression results. You can take your results from this example and make a heatmap following our heatmap module.
DESeq2
vignetteEnhancedVolcano
vignette has more examples on how to tailor your volcano plot (Blighe et al. 2020).EnhancedVolcano
vignette has more examples on how to tailor your volcano plot (Blighe et al. 2020).At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.
-# Print session info
-sessioninfo::session_info()
-## ─ Session info ───────────────────────────────────────────────────────────────
+
+## ─ Session info ─────────────────────────────────────────────────────
## setting value
## version R version 4.0.2 (2020-06-22)
## os Ubuntu 20.04 LTS
@@ -3399,62 +4177,73 @@ 6 Session info
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
-## date 2020-10-16
+## date 2020-12-16
##
-## ─ Packages ───────────────────────────────────────────────────────────────────
+## ─ Packages ─────────────────────────────────────────────────────────
## package * version date lib source
-## annotate 1.66.0 2020-04-27 [1] Bioconductor
-## AnnotationDbi 1.50.3 2020-07-25 [1] Bioconductor
-## apeglm 1.10.0 2020-04-27 [1] Bioconductor
+## annotate 1.68.0 2020-10-27 [1] Bioconductor
+## AnnotationDbi 1.52.0 2020-10-27 [1] Bioconductor
+## apeglm 1.12.0 2020-10-27 [1] Bioconductor
+## ash 1.0-15 2015-09-01 [1] RSPM (R 4.0.0)
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.0)
## backports 1.1.10 2020-09-15 [1] RSPM (R 4.0.2)
## bbmle 1.0.23.1 2020-02-03 [1] RSPM (R 4.0.0)
## bdsmatrix 1.3-4 2020-01-13 [1] RSPM (R 4.0.0)
-## Biobase * 2.48.0 2020-04-27 [1] Bioconductor
-## BiocGenerics * 0.34.0 2020-04-27 [1] Bioconductor
-## BiocParallel 1.22.0 2020-04-27 [1] Bioconductor
+## beeswarm 0.2.3 2016-04-25 [1] RSPM (R 4.0.0)
+## Biobase * 2.50.0 2020-10-27 [1] Bioconductor
+## BiocGenerics * 0.36.0 2020-10-27 [1] Bioconductor
+## BiocParallel 1.24.1 2020-11-06 [1] Bioconductor
## bit 4.0.4 2020-08-04 [1] RSPM (R 4.0.2)
## bit64 4.0.5 2020-08-30 [1] RSPM (R 4.0.2)
## bitops 1.0-6 2013-08-17 [1] RSPM (R 4.0.0)
## blob 1.2.1 2020-01-20 [1] RSPM (R 4.0.0)
-## cli 2.0.2 2020-02-28 [1] RSPM (R 4.0.0)
+## cli 2.1.0 2020-10-12 [1] RSPM (R 4.0.2)
## coda 0.19-4 2020-09-30 [1] RSPM (R 4.0.2)
## colorspace 1.4-1 2019-03-18 [1] RSPM (R 4.0.0)
## crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.0)
## DBI 1.1.0 2019-12-15 [1] RSPM (R 4.0.0)
-## DelayedArray * 0.14.1 2020-07-14 [1] Bioconductor
-## DESeq2 * 1.28.1 2020-05-12 [1] Bioconductor
+## DelayedArray 0.16.0 2020-10-27 [1] Bioconductor
+## DESeq2 * 1.30.0 2020-10-27 [1] Bioconductor
## digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0)
## dplyr 1.0.2 2020-08-18 [1] RSPM (R 4.0.2)
## ellipsis 0.3.1 2020-05-15 [1] RSPM (R 4.0.0)
## emdbook 1.3.12 2020-02-19 [1] RSPM (R 4.0.0)
-## EnhancedVolcano 1.6.0 2020-04-27 [1] Bioconductor
+## EnhancedVolcano 1.8.0 2020-10-27 [1] Bioconductor
## evaluate 0.14 2019-05-28 [1] RSPM (R 4.0.0)
+## extrafont 0.17 2014-12-08 [1] RSPM (R 4.0.0)
+## extrafontdb 1.0 2012-06-11 [1] RSPM (R 4.0.0)
## fansi 0.4.1 2020-01-08 [1] RSPM (R 4.0.0)
## farver 2.0.3 2020-01-16 [1] RSPM (R 4.0.0)
-## genefilter 1.70.0 2020-04-27 [1] Bioconductor
-## geneplotter 1.66.0 2020-04-27 [1] Bioconductor
+## genefilter 1.72.0 2020-10-27 [1] Bioconductor
+## geneplotter 1.68.0 2020-10-27 [1] Bioconductor
## generics 0.0.2 2018-11-29 [1] RSPM (R 4.0.0)
-## GenomeInfoDb * 1.24.2 2020-06-15 [1] Bioconductor
-## GenomeInfoDbData 1.2.3 2020-10-06 [1] Bioconductor
-## GenomicRanges * 1.40.0 2020-04-27 [1] Bioconductor
+## GenomeInfoDb * 1.26.2 2020-12-08 [1] Bioconductor
+## GenomeInfoDbData 1.2.4 2020-12-16 [1] Bioconductor
+## GenomicRanges * 1.42.0 2020-10-27 [1] Bioconductor
## getopt 1.20.3 2019-03-22 [1] RSPM (R 4.0.0)
+## ggalt 0.4.0 2017-02-15 [1] RSPM (R 4.0.0)
+## ggbeeswarm 0.6.0 2017-08-07 [1] RSPM (R 4.0.0)
## ggplot2 * 3.3.2 2020-06-19 [1] RSPM (R 4.0.1)
+## ggrastr 0.2.1 2020-09-14 [1] RSPM (R 4.0.2)
## ggrepel 0.8.2 2020-03-08 [1] RSPM (R 4.0.2)
## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.2)
## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.0.0)
## hms 0.5.3 2020-01-08 [1] RSPM (R 4.0.0)
## htmltools 0.5.0 2020-06-16 [1] RSPM (R 4.0.1)
-## IRanges * 2.22.2 2020-05-21 [1] Bioconductor
+## httr 1.4.2 2020-07-20 [1] RSPM (R 4.0.2)
+## IRanges * 2.24.1 2020-12-12 [1] Bioconductor
## jsonlite 1.7.1 2020-09-07 [1] RSPM (R 4.0.2)
+## KernSmooth 2.23-17 2020-04-26 [2] CRAN (R 4.0.2)
## knitr 1.30 2020-09-22 [1] RSPM (R 4.0.2)
## labeling 0.3 2014-08-23 [1] RSPM (R 4.0.0)
## lattice 0.20-41 2020-04-02 [2] CRAN (R 4.0.2)
## lifecycle 0.2.0 2020-03-06 [1] RSPM (R 4.0.0)
## locfit 1.5-9.4 2020-03-25 [1] RSPM (R 4.0.0)
## magrittr * 1.5 2014-11-22 [1] RSPM (R 4.0.0)
+## maps 3.3.0 2018-04-03 [1] RSPM (R 4.0.0)
## MASS 7.3-51.6 2020-04-26 [2] CRAN (R 4.0.2)
## Matrix 1.2-18 2019-11-27 [2] CRAN (R 4.0.2)
+## MatrixGenerics * 1.2.0 2020-10-27 [1] Bioconductor
## matrixStats * 0.57.0 2020-09-25 [1] RSPM (R 4.0.2)
## memoise 1.1.0 2017-04-21 [1] RSPM (R 4.0.0)
## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.0.0)
@@ -3464,6 +4253,8 @@ 6 Session info
## pillar 1.4.6 2020-07-10 [1] RSPM (R 4.0.2)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.0)
## plyr 1.8.6 2020-03-03 [1] RSPM (R 4.0.2)
+## proj4 1.0-10 2020-03-02 [1] RSPM (R 4.0.0)
+## ps 1.4.0 2020-10-07 [1] RSPM (R 4.0.2)
## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.0)
## R.cache 0.14.0 2019-12-06 [1] RSPM (R 4.0.0)
## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.2)
@@ -3473,30 +4264,32 @@ 6 Session info
## RColorBrewer 1.1-2 2014-12-07 [1] RSPM (R 4.0.0)
## Rcpp 1.0.5 2020-07-06 [1] RSPM (R 4.0.2)
## RCurl 1.98-1.2 2020-04-18 [1] RSPM (R 4.0.0)
-## readr 1.3.1 2018-12-21 [1] RSPM (R 4.0.2)
+## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.2)
## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.0)
-## rlang 0.4.7 2020-07-09 [1] RSPM (R 4.0.2)
+## rlang 0.4.8 2020-10-08 [1] RSPM (R 4.0.2)
## rmarkdown 2.4 2020-09-30 [1] RSPM (R 4.0.2)
-## RSQLite 2.2.0 2020-01-07 [1] RSPM (R 4.0.2)
+## RSQLite 2.2.1 2020-09-30 [1] RSPM (R 4.0.2)
## rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0)
-## S4Vectors * 0.26.1 2020-05-16 [1] Bioconductor
+## Rttf2pt1 1.3.8 2020-01-10 [1] RSPM (R 4.0.0)
+## S4Vectors * 0.28.1 2020-12-09 [1] Bioconductor
## scales 1.1.1 2020-05-11 [1] RSPM (R 4.0.0)
## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.0)
## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.2)
## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.0)
## styler 1.3.2 2020-02-23 [1] RSPM (R 4.0.0)
-## SummarizedExperiment * 1.18.2 2020-07-09 [1] Bioconductor
+## SummarizedExperiment * 1.20.0 2020-10-27 [1] Bioconductor
## survival 3.1-12 2020-04-10 [2] CRAN (R 4.0.2)
-## tibble 3.0.3 2020-07-10 [1] RSPM (R 4.0.2)
+## tibble 3.0.4 2020-10-12 [1] RSPM (R 4.0.2)
## tidyselect 1.1.0 2020-05-11 [1] RSPM (R 4.0.0)
## vctrs 0.3.4 2020-08-29 [1] RSPM (R 4.0.2)
+## vipor 0.4.5 2017-03-22 [1] RSPM (R 4.0.0)
## withr 2.3.0 2020-09-22 [1] RSPM (R 4.0.2)
## xfun 0.18 2020-09-29 [1] RSPM (R 4.0.2)
## XML 3.99-0.5 2020-07-23 [1] RSPM (R 4.0.2)
## xtable 1.8-4 2019-04-21 [1] RSPM (R 4.0.0)
-## XVector 0.28.0 2020-04-27 [1] Bioconductor
+## XVector 0.30.0 2020-10-27 [1] Bioconductor
## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.0)
-## zlibbioc 1.34.0 2020-04-27 [1] Bioconductor
+## zlibbioc 1.36.0 2020-10-27 [1] Bioconductor
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library
@@ -3505,13 +4298,13 @@ 6 Session info
References
-Blighe K., S. Rana, and M. Lewis, 2020 EnhancedVolcano: Publication-ready volcano plots with enhanced colouring and labeling.
+Blighe K., S. Rana, and M. Lewis, 2020 EnhancedVolcano: Publication-ready volcano plots with enhanced colouring and labeling. https://github.com/kevinblighe/EnhancedVolcano
-
-Love M. I., W. Huber, and S. Anders, 2014 Moderated estimation of fold change and dispersion for rna-seq data with deseq2. Genome Biology 15. https://doi.org/10.1186/s13059-014-0550-8
+
+Kampen K. R., L. Fancello, T. Girardi, G. Rinaldi, and M. Planque et al., 2019 Translatome analysis reveals altered serine and glycine metabolism in T-cell acute lymphoblastic leukemia cells. Nature Communications 10. https://doi.org/10.1038/s41467-019-10508-2
-
-Micol J. B., A. Pastore, D. Inoue, N. Duployez, and E. Kim et al., 2017 ASXL2 is essential for haematopoiesis and acts as a haploinsufficient tumour suppressor in leukemia. Nat Commun 8: 15429.
+
+Love M. I., W. Huber, and S. Anders, 2014 Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biology 15. https://doi.org/10.1186/s13059-014-0550-8
Zhu A., J. G. Ibrahim, and M. I. Love, 2018 Heavy-tailed prior distributions for sequence count data: Removing the noise and preserving large differences. Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
@@ -3519,6 +4312,11 @@ References
+
diff --git a/03-rnaseq/dimension-reduction_rnaseq_01_pca.Rmd b/03-rnaseq/dimension-reduction_rnaseq_01_pca.Rmd
index e9fe5c7a..3be5192e 100644
--- a/03-rnaseq/dimension-reduction_rnaseq_01_pca.Rmd
+++ b/03-rnaseq/dimension-reduction_rnaseq_01_pca.Rmd
@@ -38,16 +38,13 @@ Run this next chunk to set up your folders!
If you have trouble running this chunk, see our [introduction to using `.Rmd`s](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) for more resources and explanations.
```{r}
-# Define the file path to the data directory
-data_dir <- file.path("data", "SRP133573") # Replace with path to desired data directory
-
# Create the data folder if it doesn't exist
-if (!dir.exists(data_dir)) {
- dir.create(data_dir)
+if (!dir.exists("data")) {
+ dir.create("data")
}
# Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
+plots_dir <- "plots"
# Create the plots folder if it doesn't exist
if (!dir.exists(plots_dir)) {
@@ -55,7 +52,7 @@ if (!dir.exists(plots_dir)) {
}
# Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
+results_dir <- "results"
# Create the results folder if it doesn't exist
if (!dir.exists(results_dir)) {
@@ -73,17 +70,15 @@ Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/SRP
Click the "Download Now" button on the right side of this screen.
-
-
-Fill out the pop up window with your email and our Terms and Conditions:
-
-
+
+Fill out the pop-up window with your email and our Terms and Conditions:
+
We are going to use non-quantile normalized data for this analysis.
To get this data, you will need to check the box that says "Skip quantile normalization for RNA-seq samples".
Note that this option will only be available for RNA-seq datasets.
-
+
It may take a few minutes for the dataset to process.
You will get an email when it is ready.
@@ -99,11 +94,11 @@ Patients underwent androgen deprivation therapy (ADT) and RNA-seq samples includ
## Place the dataset in your new `data/` folder
-Refine.bio will send you a download button in the email when it is ready.
+refine.bio will send you a download button in the email when it is ready.
Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in `.zip`.
Double clicking should unzip this for you and create a folder of the same name.
-
+
For more details on the contents of this folder see [these docs on refine.bio](http://docs.refine.bio/en/latest/main_text.html#rna-seq-sample-compendium-download-folder).
@@ -125,7 +120,7 @@ Your new analysis folder should contain:
- A folder for `results` (currently empty)
Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):
-
+
In order for our example here to run without a hitch, we need these files to be in these locations, so we've constructed a test to check before we get started with the analysis.
These chunks will declare your file paths and double check that your files are in the right place.
@@ -135,19 +130,24 @@ This is handy to do because if we want to switch the dataset (see next section f
```{r}
# Define the file path to the data directory
-data_dir <- file.path("data", "SRP133573") # Replace with accession number which will be the name of the folder the files will be in
-
-# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
-data_file <- file.path(data_dir, "SRP133573.tsv") # Replace with file path to your dataset
-
-# Declare the file path to the metadata file using the data directory saved as `data_dir`
-metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv") # Replace with file path to your metadata
+# Replace with the path of the folder the files will be in
+data_dir <- file.path("data", "SRP133573")
+
+# Declare the file path to the gene expression matrix file
+# inside directory saved as `data_dir`
+# Replace with the path to your dataset file
+data_file <- file.path(data_dir, "SRP133573.tsv")
+
+# Declare the file path to the metadata file
+# inside the directory saved as `data_dir`
+# Replace with the path to your metadata file
+metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv")
```
Now that our file paths are declared, we can use the `file.exists()` function to check that the files are where we specified above.
```{r}
-# Check if the gene expression matrix file is at the file path stored in `data_file`
+# Check if the gene expression matrix file is at the path stored in `data_file`
file.exists(data_file)
# Check if the metadata file is at the file path stored in `metadata_file`
@@ -186,7 +186,7 @@ if (!("DESeq2" %in% installed.packages())) {
Attach the `DESeq2` and `ggplot2` libraries:
-```{r}
+```{r message=FALSE}
# Attach the `DESeq2` library
library(DESeq2)
@@ -212,105 +212,115 @@ We stored our file paths as objects named `metadata_file` and `data_file` in [th
metadata <- readr::read_tsv(metadata_file)
# Read in data TSV file
-df <- readr::read_tsv(data_file) %>%
- # Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later
+expression_df <- readr::read_tsv(data_file) %>%
+ # Tuck away the gene ID column as row names, leaving only numeric values
tibble::column_to_rownames("Gene")
```
Let's ensure that the metadata and data are in the same sample order.
```{r}
-# Make the data in the order of the metadata
-df <- df %>% dplyr::select(metadata$refinebio_accession_code)
+# Make sure the columns (samples) are in the same order as the metadata
+expression_df <- expression_df %>%
+ dplyr::select(metadata$refinebio_accession_code)
# Check if this is in the same order
-all.equal(colnames(df), metadata$refinebio_accession_code)
+all.equal(colnames(expression_df), metadata$refinebio_accession_code)
```
-
Now we are going to use a combination of functions from the `DESeq2` and `ggplot2` packages to perform and visualize the results of the Principal Component Analysis (PCA) dimension reduction technique on our pre-ADT and post-ADT samples.
-### Prepare data for `DESeq2`
-
-We need to make sure all of the values in our data are converted to integers as required by a `DESeq2` function we will use later.
-
-```{r}
-# The `DESeqDataSetFromMatrix()` function needs the values to be converted to integers
-df <- df %>%
- # Mutate numeric variables to be integers
- dplyr::mutate_if(is.numeric, round)
-```
-
### Prepare metadata for `DESeq2`
-We need to make sure all of the metadata column variables, that we would like to use to annotate our plot, are converted into factors.
+We need to make sure all of the metadata column variables that we would like to use to annotate our plot are converted into factors.
```{r}
-# We need to also format the variables from the metadata, that we will be using for annotation of the PCA plot, into factors
+# Convert the columns we will be using for annotation into factors
metadata <- metadata %>%
dplyr::mutate(
- refinebio_treatment = as.factor(refinebio_treatment),
+ refinebio_treatment = factor(
+ refinebio_treatment,
+ # specify the possible levels in the order we want them to appear
+ levels = c("pre-adt", "post-adt")
+ ),
refinebio_disease = as.factor(refinebio_disease)
)
```
+## Define a minimum counts cutoff
+
+We want to filter out the genes that have not been expressed or that have low expression counts since these genes are likely to add noise rather than useful signal to our analysis.
+We are going to do some pre-filtering to keep only genes with 10 or more reads total.
+This threshold might vary depending on the number of samples and expression patterns across your data set.
+Note that rows represent gene data and the columns represent sample data in our dataset.
+
+```{r}
+# Define a minimum counts cutoff and filter the data to include
+# only rows (genes) that have total counts above the cutoff
+filtered_expression_df <- expression_df %>%
+ dplyr::filter(rowSums(.) >= 10)
+```
+
+We also need our counts to be rounded before we can use them with the `DESeqDataSetFromMatrix()` function.
+
+```{r}
+# The `DESeqDataSetFromMatrix()` function needs the values to be integers
+filtered_expression_df <- round(filtered_expression_df)
+```
+
## Create a DESeqDataset
We will be using the `DESeq2` package for [normalizing and transforming our data](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#deseq2-transformation-methods), which requires us to format our data into a `DESeqDataSet` object.
-We turn the data frame (or matrix) into a [`DESeqDataSet` object](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#02_About_DESeq2). ) and specify which variable labels our experimental groups using the [`design` argument](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#multi-factor-designs) [@Love2014].
+We turn the data frame (or matrix) into a [`DESeqDataSet` object](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#02_About_DESeq2) and specify which variable labels our experimental groups using the [`design` argument](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#multi-factor-designs) [@Love2014].
In this chunk of code, we will not provide a specific model to the `design` argument because we are not performing a differential expression analysis.
```{r}
# Create a `DESeqDataSet` object
dds <- DESeqDataSetFromMatrix(
- countData = df, # This is the data frame with the counts values for all replicates in our dataset
- colData = metadata, # This is the data frame with the annotation data for the replicates in the counts data frame
- design = ~1 # Here we are not specifying a model -- Replace with an appropriate design variable for your analysis
+ countData = filtered_expression_df, # the counts values for all samples in our dataset
+ colData = metadata, # annotation data for the samples in the counts data frame
+ design = ~1 # Here we are not specifying a model
+ # Replace with an appropriate design variable for your analysis
)
```
-## Define a minimum counts cutoff
-
-We want to filter out the genes that have not been expressed or that have low expression counts because we want to remove any possible noise from our data before we normalize the data and create our plot.
-We are going to do some pre-filtering to keep only genes with 10 or more reads total.
-Note that rows represent gene data and the columns represent sample data in our dataset.
-
-```{r}
-# Define a minimum counts cutoff and filter `DESeqDataSet` object to include
-# only rows that have counts above the cutoff
-genes_to_keep <- rowSums(counts(dds)) >= 10
-dds <- dds[genes_to_keep, ]
-```
-
## Perform DESeq2 normalization and transformation
We are going to use the `vst()` function from the `DESeq2` package to normalize and transform the data.
For more information about these transformation methods, [see here](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#deseq2-transformation-methods).
```{r}
-# Normalize and transform the data in the `DESeqDataSet` object using the `vst()` function from the `DESEq2` R package
+# Normalize and transform the data in the `DESeqDataSet` object
+# using the `vst()` function from the `DESeq2` R package
dds_norm <- vst(dds)
```
## Create PCA plot using DESeq2
-In this code chunk, the variable `refinebio_treatment` is given to the `plotPCA()` function as part of the goal of the experiment is to analyze the sample transcriptional responses to androgen deprivation therapy (ADT).
+DESeq2 has built-in functions to calculate and plot PCA values, which we will use here.
+The `plotPCA()` function allows us to specify our group of interest with the `intgroup` argument, which will be used to color the points in our plot.
+In this code chunk, we are using `refinebio_treatment` as the grouping variable,
+since part of the goal of the experiment was to analyze the sample transcriptional responses to androgen deprivation therapy (ADT).
```{r}
-plotPCA(dds_norm,
+plotPCA(
+ dds_norm,
intgroup = "refinebio_treatment"
)
```
-In this chunk, we are going to add another variable to our plot for annotation.
+In the next chunk, we are going to add another variable to our plot for annotation.
Now we'll plot the PCA using both `refinebio_treatment` and `refinebio_disease` variables for labels since they are central to the androgen deprivation therapy (ADT) based hypothesis in the [original paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6210624/) [@Sharma2018].
```{r}
-plotPCA(dds_norm,
- intgroup = c("refinebio_treatment", "refinebio_disease") # Note that we are able to add another variable to the intgroup argument here by providing a vector of the variable names with `c()` function
+plotPCA(
+ dds_norm,
+ intgroup = c("refinebio_treatment", "refinebio_disease")
+ # We are able to add another variable to the intgroup argument
+ # by providing a vector of the variable names with `c()` function
)
```
@@ -322,7 +332,7 @@ First let's use `plotPCA()` to receive and store the PCA values for plotting.
```{r}
# We first have to save the results of the `plotPCA()` function for use with `ggplot2`
-pcaData <-
+pca_results <-
plotPCA(
dds_norm,
intgroup = c("refinebio_treatment", "refinebio_disease"),
@@ -330,19 +340,26 @@ pcaData <-
)
```
-Now let's plot our `pcaData` using `ggplot2` functionality.
+Now let's plot our `pca_results` using `ggplot2` functionality.
```{r}
-# Plot using `ggplot()` function
+# Plot using `ggplot()` function and save to an object
annotated_pca_plot <- ggplot(
- pcaData,
+ pca_results,
aes(
x = PC1,
y = PC2,
- color = refinebio_treatment, # This will label points with different colors for each `refinebio_disease` group
- shape = refinebio_disease # This will label points with different shapes for each `refinebio_disease` group
+ # plot points with different colors for each `refinebio_treatment` group
+ color = refinebio_treatment,
+ # plot points with different shapes for each `refinebio_disease` group
+ shape = refinebio_disease
)
-)
+) +
+ # Make a scatter plot
+ geom_point()
+
+# display annotated plot
+annotated_pca_plot
```
## Save annotated PCA plot as a PNG
@@ -351,8 +368,10 @@ You can easily switch this to save to a JPEG or TIFF by changing the file name w
```{r}
# Save plot using `ggsave()` function
-ggsave(file.path(plots_dir, "SRP133573_pca_plot.png"), # Replace with name relevant your plotted data
- plot = annotated_pca_plot # Here we are giving the function the plot object that we want saved to file
+ggsave(
+ file.path(plots_dir, "SRP133573_pca_plot.png"),
+ # Replace with a file name relevant to your plotted data
+ plot = annotated_pca_plot # the plot object that we want saved to file
)
```
diff --git a/03-rnaseq/dimension-reduction_rnaseq_01_pca.html b/03-rnaseq/dimension-reduction_rnaseq_01_pca.html
index ccc687a0..8067961b 100644
--- a/03-rnaseq/dimension-reduction_rnaseq_01_pca.html
+++ b/03-rnaseq/dimension-reduction_rnaseq_01_pca.html
@@ -1263,25 +1263,22 @@
};
-
-
+
+ code.sourceCode > span { display: inline-block; line-height: 1.25; }
+ code.sourceCode > span { color: inherit; text-decoration: inherit; }
+ code.sourceCode > span:empty { height: 1.2em; }
+ .sourceCode { overflow: visible; }
+ code.sourceCode { white-space: pre; position: relative; }
+ div.sourceCode { margin: 1em 0; }
+ pre.sourceCode { margin: 0; }
+ @media screen {
+ div.sourceCode { overflow: auto; }
+ }
+ @media print {
+ code.sourceCode { white-space: pre-wrap; }
+ code.sourceCode > span { text-indent: -5em; padding-left: 5em; }
+ }
+ pre.numberSource code
+ { counter-reset: source-line 0; }
+ pre.numberSource code > span
+ { position: relative; left: -4em; counter-increment: source-line; }
+ pre.numberSource code > span > a:first-child::before
+ { content: counter(source-line);
+ position: relative; left: -1em; text-align: right; vertical-align: baseline;
+ border: none; display: inline-block;
+ -webkit-touch-callout: none; -webkit-user-select: none;
+ -khtml-user-select: none; -moz-user-select: none;
+ -ms-user-select: none; user-select: none;
+ padding: 0 4px; width: 4em;
+ color: #aaaaaa;
+ }
+ pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; }
+ div.sourceCode
+ { }
+ @media screen {
+ code.sourceCode > span > a:first-child::before { text-decoration: underline; }
+ }
+ code span.al { color: #ff0000; } /* Alert */
+ code span.an { color: #008000; } /* Annotation */
+ code span.at { } /* Attribute */
+ code span.bu { } /* BuiltIn */
+ code span.cf { color: #0000ff; } /* ControlFlow */
+ code span.ch { color: #008080; } /* Char */
+ code span.cn { } /* Constant */
+ code span.co { color: #008000; } /* Comment */
+ code span.cv { color: #008000; } /* CommentVar */
+ code span.do { color: #008000; } /* Documentation */
+ code span.er { color: #ff0000; font-weight: bold; } /* Error */
+ code span.ex { } /* Extension */
+ code span.im { } /* Import */
+ code span.in { color: #008000; } /* Information */
+ code span.kw { color: #0000ff; } /* Keyword */
+ code span.op { } /* Operator */
+ code span.ot { color: #ff4000; } /* Other */
+ code span.pp { color: #ff4000; } /* Preprocessor */
+ code span.sc { color: #008080; } /* SpecialChar */
+ code span.ss { color: #008080; } /* SpecialString */
+ code span.st { color: #008080; } /* String */
+ code span.va { } /* Variable */
+ code span.vs { color: #008080; } /* VerbatimString */
+ code span.wa { color: #008000; font-weight: bold; } /* Warning */
+
+
+
+
-
-
+
@@ -2874,15 +3686,20 @@
@@ -2971,29 +3797,26 @@ 2.1 Obtain the .Rmd
2.2 Set up your analysis folders
Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!
If you have trouble running this chunk, see our introduction to using .Rmd
s for more resources and explanations.
-# Define the file path to the data directory
-data_dir <- file.path("data", "SRP133573") # Replace with path to desired data directory
-
-# Create the data folder if it doesn't exist
-if (!dir.exists(data_dir)) {
- dir.create(data_dir)
-}
-
-# Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
-
-# Create the plots folder if it doesn't exist
-if (!dir.exists(plots_dir)) {
- dir.create(plots_dir)
-}
-
-# Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
-
-# Create the results folder if it doesn't exist
-if (!dir.exists(results_dir)) {
- dir.create(results_dir)
-}
+# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+ dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots"
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+ dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results"
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+ dir.create(results_dir)
+}
In the same place you put this .Rmd
file, you should now have three new empty folders called data
, plots
, and results
!
@@ -3001,10 +3824,8 @@ 2.3 Obtain the dataset from refin
For general information about downloading data for these examples, see our ‘Getting Started’ section.
Go to this dataset’s page on refine.bio.
Click the “Download Now” button on the right side of this screen.
-
-Fill out the pop up window with your email and our Terms and Conditions:
-
-We are going to use non-quantile normalized data for this analysis. To get this data, you will need to check the box that says “Skip quantile normalization for RNA-seq samples”. Note that this option will only be available for RNA-seq datasets.
+ Fill out the pop-up window with your email and our Terms and Conditions:
+ We are going to use non-quantile normalized data for this analysis. To get this data, you will need to check the box that says “Skip quantile normalization for RNA-seq samples”. Note that this option will only be available for RNA-seq datasets.
It may take a few minutes for the dataset to process. You will get an email when it is ready.
@@ -3016,8 +3837,8 @@ 2.4 About the dataset we are usin
data/
folderRefine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip
. Double clicking should unzip this for you and create a folder of the same name.
refine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip
. Double clicking should unzip this for you and create a folder of the same name.
For more details on the contents of this folder see these docs on refine.bio.
The <experiment_accession_id>
folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235
or SRP12345
.
Copy and paste the SRP133573
folder into your newly created data/
folder.
results
(currently empty)In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.
First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.
-# Define the file path to the data directory
-data_dir <- file.path("data", "SRP133573") # Replace with accession number which will be the name of the folder the files will be in
-
-# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
-data_file <- file.path(data_dir, "SRP133573.tsv") # Replace with file path to your dataset
-
-# Declare the file path to the metadata file using the data directory saved as `data_dir`
-metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv") # Replace with file path to your metadata
+# Define the file path to the data directory
+# Replace with the path of the folder the files will be in
+data_dir <- file.path("data", "SRP133573")
+
+# Declare the file path to the gene expression matrix file
+# inside directory saved as `data_dir`
+# Replace with the path to your dataset file
+data_file <- file.path(data_dir, "SRP133573.tsv")
+
+# Declare the file path to the metadata file
+# inside the directory saved as `data_dir`
+# Replace with the path to your metadata file
+metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv")
Now that our file paths are declared, we can use the file.exists()
function to check that the files are where we specified above.
# Check if the gene expression matrix file is at the file path stored in `data_file`
-file.exists(data_file)
+# Check if the gene expression matrix file is at the path stored in `data_file`
+file.exists(data_file)
## [1] TRUE
-# Check if the metadata file is at the file path stored in `metadata_file`
-file.exists(metadata_file)
+# Check if the metadata file is at the file path stored in `metadata_file`
+file.exists(metadata_file)
## [1] TRUE
If the chunk above printed out FALSE
to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.
@@ -3077,82 +3903,32 @@See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.
-In this analysis, we will be using the R package DESeq2
(Love et al. 2014) for normalization and production of PCA values and the R package ggplot2
(Prabhakaran 2016) for plotting the PCA values.
if (!("DESeq2" %in% installed.packages())) {
- # Install DESeq2
- BiocManager::install("DESeq2", update = FALSE)
-}
+In this analysis, we will be using the R package DESeq2
(Love et al. 2014) for normalization and production of PCA values and the R package ggplot2
(Prabhakaran 2016) for plotting the PCA values.
if (!("DESeq2" %in% installed.packages())) {
+ # Install DESeq2
+ BiocManager::install("DESeq2", update = FALSE)
+}
Attach the DESeq2
and ggplot2
libraries:
# Attach the `DESeq2` library
-library(DESeq2)
-## Loading required package: S4Vectors
-## Loading required package: stats4
-## Loading required package: BiocGenerics
-## Loading required package: parallel
-##
-## Attaching package: 'BiocGenerics'
-## The following objects are masked from 'package:parallel':
-##
-## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
-## clusterExport, clusterMap, parApply, parCapply, parLapply,
-## parLapplyLB, parRapply, parSapply, parSapplyLB
-## The following objects are masked from 'package:stats':
-##
-## IQR, mad, sd, var, xtabs
-## The following objects are masked from 'package:base':
-##
-## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
-## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
-## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
-## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
-## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
-## union, unique, unsplit, which, which.max, which.min
-##
-## Attaching package: 'S4Vectors'
-## The following object is masked from 'package:base':
-##
-## expand.grid
-## Loading required package: IRanges
-## Loading required package: GenomicRanges
-## Loading required package: GenomeInfoDb
-## Loading required package: SummarizedExperiment
-## Loading required package: Biobase
-## Welcome to Bioconductor
-##
-## Vignettes contain introductory material; view with
-## 'browseVignettes()'. To cite Bioconductor, see
-## 'citation("Biobase")', and for packages 'citation("pkgname")'.
-## Loading required package: DelayedArray
-## Loading required package: matrixStats
-##
-## Attaching package: 'matrixStats'
-## The following objects are masked from 'package:Biobase':
-##
-## anyMissing, rowMedians
-##
-## Attaching package: 'DelayedArray'
-## The following objects are masked from 'package:matrixStats':
-##
-## colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
-## The following objects are masked from 'package:base':
-##
-## aperm, apply, rowsum
-# Attach the `ggplot2` library for plotting
-library(ggplot2)
-
-# We will need this so we can use the pipe: %>%
-library(magrittr)
-
-# Set the seed so our results are reproducible:
-set.seed(12345)
+library(DESeq2)
+
+# Attach the `ggplot2` library for plotting
+library(ggplot2)
+
+# We will need this so we can use the pipe: %>%
+library(magrittr)
+
+# Set the seed so our results are reproducible:
+set.seed(12345)
Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read both TSV files and add them as data frames to your environment.
We stored our file paths as objects named metadata_file
and data_file
in this previous step.
# Read in metadata TSV file
-metadata <- readr::read_tsv(metadata_file)
-## Parsed with column specification:
+metadata <- readr::read_tsv(metadata_file)
+##
+## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_character(),
## refinebio_age = col_logical(),
@@ -3165,113 +3941,129 @@ 4.2 Import and set up data
## refinebio_source_archive_url = col_logical(),
## refinebio_specimen_part = col_logical(),
## refinebio_time = col_logical()
-## )
-## See spec(...) for full column specifications.
-# Read in data TSV file
-df <- readr::read_tsv(data_file) %>%
- # Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later
- tibble::column_to_rownames("Gene")
-## Parsed with column specification:
+## )
+## ℹ Use `spec()` for the full column specifications.
+# Read in data TSV file
+expression_df <- readr::read_tsv(data_file) %>%
+ # Tuck away the gene ID column as row names, leaving only numeric values
+ tibble::column_to_rownames("Gene")
+##
+## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_double(),
## Gene = col_character()
## )
-## See spec(...) for full column specifications.
+## ℹ Use `spec()` for the full column specifications.
Let’s ensure that the metadata and data are in the same sample order.
-# Make the data in the order of the metadata
-df <- df %>% dplyr::select(metadata$refinebio_accession_code)
-
-# Check if this is in the same order
-all.equal(colnames(df), metadata$refinebio_accession_code)
+# Make sure the columns (samples) are in the same order as the metadata
+expression_df <- expression_df %>%
+ dplyr::select(metadata$refinebio_accession_code)
+
+# Check if this is in the same order
+all.equal(colnames(expression_df), metadata$refinebio_accession_code)
## [1] TRUE
-
Now we are going to use a combination of functions from the DESeq2
and ggplot2
packages to perform and visualize the results of the Principal Component Analysis (PCA) dimension reduction technique on our pre-ADT and post-ADT samples.
DESeq2
We need to make sure all of the values in our data are converted to integers as required by a DESeq2
function we will use later.
# The `DESeqDataSetFromMatrix()` function needs the values to be converted to integers
-df <- df %>%
- # Mutate numeric variables to be integers
- dplyr::mutate_if(is.numeric, round)
-DESEq2
We need to make sure all of the metadata column variables that we would like to use to annotate our plot are converted into factors.
-# We need to also format the variables from the metadata, that we will be using for annotation of the PCA plot, into factors
-metadata <- metadata %>%
- dplyr::mutate(
- refinebio_treatment = as.factor(refinebio_treatment),
- refinebio_disease = as.factor(refinebio_disease)
- )
+DESeq2
We need to make sure all of the metadata column variables that we would like to use to annotate our plot are converted into factors.
+# convert the columns we will be using for annotation into factors
+metadata <- metadata %>%
+ dplyr::mutate(
+ refinebio_treatment = factor(
+ refinebio_treatment,
+ # specify the possible levels in the order we want them to appear
+ levels = c("pre-adt", "post-adt")
+ ),
+ refinebio_disease = as.factor(refinebio_disease)
+ )
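To verify the releveling, you can print the factor levels. This is a quick illustrative check, not a chunk from the original tutorial:

```r
# Confirm the factor levels are in the order we specified above
levels(metadata$refinebio_treatment)
```

If the releveling worked, `"pre-adt"` will be listed before `"post-adt"`.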
We want to filter out the genes that have not been expressed or that have low expression counts since these genes are likely to add noise rather than useful signal to our analysis. We are going to do some pre-filtering to keep only genes with 10 or more reads total. This threshold might vary depending on the number of samples and expression patterns across your data set. Note that rows represent gene data and the columns represent sample data in our dataset.
+# Define a minimum counts cutoff and filter the data to include
+# only rows (genes) that have total counts above the cutoff
+filtered_expression_df <- expression_df %>%
+ dplyr::filter(rowSums(.) >= 10)
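As another quick, illustrative check (not part of the original analysis), you can count how many genes the low-count cutoff removed:

```r
# Number of genes dropped by the low-count filter
nrow(expression_df) - nrow(filtered_expression_df)
```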
We also need our counts to be rounded before we can use them with the DESeqDataSetFromMatrix()
function.
We will be using the DESeq2
package for normalizing and transforming our data, which requires us to format our data into a DESeqDataSet
object. We turn the data frame (or matrix) into a DESeqDataSet
object and specify which variable labels our experimental groups using the design
argument (Love et al. 2014). In this chunk of code, we will not provide a specific model to the design
argument because we are not performing a differential expression analysis.
# Create a `DESeqDataSet` object
-dds <- DESeqDataSetFromMatrix(
- countData = df, # This is the data frame with the counts values for all replicates in our dataset
- colData = metadata, # This is the data frame with the annotation data for the replicates in the counts data frame
- design = ~1 # Here we are not specifying a model -- Replace with an appropriate design variable for your analysis
-)
+We will be using the DESeq2
package for normalizing and transforming our data, which requires us to format our data into a DESeqDataSet
object. We turn the data frame (or matrix) into a DESeqDataSet
object and specify which variable labels our experimental groups using the design
argument (Love et al. 2014). In this chunk of code, we will not provide a specific model to the design
argument because we are not performing a differential expression analysis.
# Create a `DESeqDataSet` object
+dds <- DESeqDataSetFromMatrix(
+ countData = filtered_expression_df, # the counts values for all samples in our dataset
+ colData = metadata, # annotation data for the samples in the counts data frame
+ design = ~1 # Here we are not specifying a model
+ # Replace with an appropriate design variable for your analysis
+)
## converting counts to integer mode
We want to filter out the genes that have not been expressed or that have low expression counts because we want to remove any possible noise from our data before we normalize the data and create our plot. We are going to do some pre-filtering to keep only genes with 10 or more reads total. Note that rows represent gene data and the columns represent sample data in our dataset.
-# Define a minimum counts cutoff and filter `DESeqDataSet` object to include
-# only rows that have counts above the cutoff
-genes_to_keep <- rowSums(counts(dds)) >= 10
-dds <- dds[genes_to_keep, ]
-We are going to use the vst()
function from the DESeq2
package to normalize and transform the data. For more information about these transformation methods, see here.
# Normalize and transform the data in the `DESeqDataSet` object using the `vst()` function from the `DESeq2` R package
-dds_norm <- vst(dds)
+dds_norm <- vst(dds)
In this code chunk, the variable refinebio_treatment
is given to the plotPCA()
function because part of the goal of the experiment is to analyze the sample transcriptional responses to androgen deprivation therapy (ADT).
plotPCA(dds_norm,
- intgroup = "refinebio_treatment"
-)
-
-In this chunk, we are going to add another variable to our plot for annotation.
-Now we’ll plot the PCA using both refinebio_treatment
and refinebio_disease
variables for labels since they are central to the androgen deprivation therapy (ADT) based hypothesis in the original paper (Sharma et al. 2018).
plotPCA(dds_norm,
- intgroup = c("refinebio_treatment", "refinebio_disease") # Note that we are able to add another variable to the intgroup argument here by providing a vector of the variable names with `c()` function
-)
+DESeq2 has built-in functions to calculate and plot PCA values, which we will use here. The plotPCA()
function allows us to specify our group of interest with the intgroup
argument, which will be used to color the points in our plot. In this code chunk, we are using refinebio_treatment
as the grouping variable, since part of the goal of the experiment was to analyze the sample transcriptional responses to androgen deprivation therapy (ADT).
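The corresponding call colors the points by treatment group, using the `dds_norm` object from the transformation step above:

```r
# Plot PCA values, grouping samples by treatment
plotPCA(dds_norm,
  intgroup = "refinebio_treatment"
)
```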
In the next chunk, we are going to add another variable to our plot for annotation.
+Now we’ll plot the PCA using both refinebio_treatment
and refinebio_disease
variables for labels since they are central to the androgen deprivation therapy (ADT) based hypothesis in the original paper (Sharma et al. 2018).
plotPCA(
+ dds_norm,
+ intgroup = c("refinebio_treatment", "refinebio_disease")
+ # We are able to add another variable to the intgroup argument
+ # by providing a vector of the variable names with `c()` function
+)
In the plot above, it is hard to distinguish the different refinebio_treatment
values which contain the data on whether or not samples have been treated with ADT versus the refinebio_disease
values which refer to the method by which the samples were obtained from patients (i.e. biopsy).
Let’s use the ggplot2
package functionality to customize our plot further and make the annotation labels better distinguishable.
First let’s use plotPCA()
to receive and store the PCA values for plotting.
# We first have to save the results of the `plotPCA()` function for use with `ggplot2`
-pcaData <-
- plotPCA(
- dds_norm,
- intgroup = c("refinebio_treatment", "refinebio_disease"),
- returnData = TRUE # This argument tells R to return the PCA values
- )
-Now let’s plot our pcaData
using ggplot2
functionality.
# Plot using `ggplot()` function
-annotated_pca_plot <- ggplot(
- pcaData,
- aes(
- x = PC1,
- y = PC2,
- color = refinebio_treatment, # This will label points with different colors for each `refinebio_disease` group
- shape = refinebio_disease # This will label points with different shapes for each `refinebio_disease` group
- )
-)
+# We first have to save the results of the `plotPCA()` function for use with `ggplot2`
+pca_results <-
+ plotPCA(
+ dds_norm,
+ intgroup = c("refinebio_treatment", "refinebio_disease"),
+ returnData = TRUE # This argument tells R to return the PCA values
+ )
Now let’s plot our pca_results
using ggplot2
functionality.
# Plot using `ggplot()` function and save to an object
+annotated_pca_plot <- ggplot(
+ pca_results,
+ aes(
+ x = PC1,
+ y = PC2,
+ # plot points with different colors for each `refinebio_treatment` group
+ color = refinebio_treatment,
+ # plot points with different shapes for each `refinebio_disease` group
+ shape = refinebio_disease
+ )
+) +
+ # Make a scatter plot
+ geom_point()
+
+# display annotated plot
+annotated_pca_plot
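Because the plot is a ggplot object, further customization just chains on more layers. For instance (an illustrative tweak, not from the original tutorial), you could give the legends friendlier titles:

```r
# Relabel the legends; all other aesthetics stay the same
annotated_pca_plot +
  labs(color = "Treatment", shape = "Disease")
```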
You can easily switch this to save to a JPEG or TIFF by changing the file name within the ggsave()
function to the respective file suffix.
# Save plot using `ggsave()` function
-ggsave(file.path(plots_dir, "SRP133573_pca_plot.png"), # Replace with name relevant your plotted data
- plot = annotated_pca_plot # Here we are giving the function the plot object that we want saved to file
-)
+# Save plot using `ggsave()` function
+ggsave(
+ file.path(plots_dir, "SRP133573_pca_plot.png"),
+ # Replace with a file name relevant to your plotted data
+ plot = annotated_pca_plot # the plot object that we want saved to file
+)
## Saving 7 x 5 in image
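For example, to save a TIFF instead, only the file suffix needs to change, since `ggsave()` infers the output format from the file extension (the file name here is illustrative):

```r
# Save the same plot as a TIFF by changing the file suffix
ggsave(
  file.path(plots_dir, "SRP133573_pca_plot.tiff"),
  plot = annotated_pca_plot
)
```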
At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.
-# Print session info
-sessioninfo::session_info()
-## ─ Session info ───────────────────────────────────────────────────────────────
+# Print session info
+sessioninfo::session_info()
+## ─ Session info ─────────────────────────────────────────────────────
## setting value
## version R version 4.0.2 (2020-06-22)
## os Ubuntu 20.04 LTS
@@ -3299,46 +4091,47 @@ 6 Print session info
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
-## date 2020-10-16
+## date 2020-12-18
##
-## ─ Packages ───────────────────────────────────────────────────────────────────
+## ─ Packages ─────────────────────────────────────────────────────────
## package * version date lib source
-## annotate 1.66.0 2020-04-27 [1] Bioconductor
-## AnnotationDbi 1.50.3 2020-07-25 [1] Bioconductor
+## annotate 1.68.0 2020-10-27 [1] Bioconductor
+## AnnotationDbi 1.52.0 2020-10-27 [1] Bioconductor
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.0)
## backports 1.1.10 2020-09-15 [1] RSPM (R 4.0.2)
-## Biobase * 2.48.0 2020-04-27 [1] Bioconductor
-## BiocGenerics * 0.34.0 2020-04-27 [1] Bioconductor
-## BiocParallel 1.22.0 2020-04-27 [1] Bioconductor
+## Biobase * 2.50.0 2020-10-27 [1] Bioconductor
+## BiocGenerics * 0.36.0 2020-10-27 [1] Bioconductor
+## BiocParallel 1.24.1 2020-11-06 [1] Bioconductor
## bit 4.0.4 2020-08-04 [1] RSPM (R 4.0.2)
## bit64 4.0.5 2020-08-30 [1] RSPM (R 4.0.2)
## bitops 1.0-6 2013-08-17 [1] RSPM (R 4.0.0)
## blob 1.2.1 2020-01-20 [1] RSPM (R 4.0.0)
-## cli 2.0.2 2020-02-28 [1] RSPM (R 4.0.0)
+## cli 2.1.0 2020-10-12 [1] RSPM (R 4.0.2)
## colorspace 1.4-1 2019-03-18 [1] RSPM (R 4.0.0)
## crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.0)
## DBI 1.1.0 2019-12-15 [1] RSPM (R 4.0.0)
-## DelayedArray * 0.14.1 2020-07-14 [1] Bioconductor
-## DESeq2 * 1.28.1 2020-05-12 [1] Bioconductor
+## DelayedArray 0.16.0 2020-10-27 [1] Bioconductor
+## DESeq2 * 1.30.0 2020-10-27 [1] Bioconductor
## digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0)
## dplyr 1.0.2 2020-08-18 [1] RSPM (R 4.0.2)
## ellipsis 0.3.1 2020-05-15 [1] RSPM (R 4.0.0)
## evaluate 0.14 2019-05-28 [1] RSPM (R 4.0.0)
## fansi 0.4.1 2020-01-08 [1] RSPM (R 4.0.0)
## farver 2.0.3 2020-01-16 [1] RSPM (R 4.0.0)
-## genefilter 1.70.0 2020-04-27 [1] Bioconductor
-## geneplotter 1.66.0 2020-04-27 [1] Bioconductor
+## genefilter 1.72.0 2020-10-27 [1] Bioconductor
+## geneplotter 1.68.0 2020-10-27 [1] Bioconductor
## generics 0.0.2 2018-11-29 [1] RSPM (R 4.0.0)
-## GenomeInfoDb * 1.24.2 2020-06-15 [1] Bioconductor
-## GenomeInfoDbData 1.2.3 2020-10-06 [1] Bioconductor
-## GenomicRanges * 1.40.0 2020-04-27 [1] Bioconductor
+## GenomeInfoDb * 1.26.2 2020-12-08 [1] Bioconductor
+## GenomeInfoDbData 1.2.4 2020-12-16 [1] Bioconductor
+## GenomicRanges * 1.42.0 2020-10-27 [1] Bioconductor
## getopt 1.20.3 2019-03-22 [1] RSPM (R 4.0.0)
## ggplot2 * 3.3.2 2020-06-19 [1] RSPM (R 4.0.1)
## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.2)
## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.0.0)
## hms 0.5.3 2020-01-08 [1] RSPM (R 4.0.0)
## htmltools 0.5.0 2020-06-16 [1] RSPM (R 4.0.1)
-## IRanges * 2.22.2 2020-05-21 [1] Bioconductor
+## httr 1.4.2 2020-07-20 [1] RSPM (R 4.0.2)
+## IRanges * 2.24.1 2020-12-12 [1] Bioconductor
## knitr 1.30 2020-09-22 [1] RSPM (R 4.0.2)
## labeling 0.3 2014-08-23 [1] RSPM (R 4.0.0)
## lattice 0.20-41 2020-04-02 [2] CRAN (R 4.0.2)
@@ -3346,12 +4139,14 @@ 6 Print session info
## locfit 1.5-9.4 2020-03-25 [1] RSPM (R 4.0.0)
## magrittr * 1.5 2014-11-22 [1] RSPM (R 4.0.0)
## Matrix 1.2-18 2019-11-27 [2] CRAN (R 4.0.2)
+## MatrixGenerics * 1.2.0 2020-10-27 [1] Bioconductor
## matrixStats * 0.57.0 2020-09-25 [1] RSPM (R 4.0.2)
## memoise 1.1.0 2017-04-21 [1] RSPM (R 4.0.0)
## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.0.0)
## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0)
## pillar 1.4.6 2020-07-10 [1] RSPM (R 4.0.2)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.0)
+## ps 1.4.0 2020-10-07 [1] RSPM (R 4.0.2)
## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.0)
## R.cache 0.14.0 2019-12-06 [1] RSPM (R 4.0.0)
## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.2)
@@ -3361,48 +4156,48 @@ 6 Print session info
## RColorBrewer 1.1-2 2014-12-07 [1] RSPM (R 4.0.0)
## Rcpp 1.0.5 2020-07-06 [1] RSPM (R 4.0.2)
## RCurl 1.98-1.2 2020-04-18 [1] RSPM (R 4.0.0)
-## readr 1.3.1 2018-12-21 [1] RSPM (R 4.0.2)
+## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.2)
## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.0)
-## rlang 0.4.7 2020-07-09 [1] RSPM (R 4.0.2)
+## rlang 0.4.8 2020-10-08 [1] RSPM (R 4.0.2)
## rmarkdown 2.4 2020-09-30 [1] RSPM (R 4.0.2)
-## RSQLite 2.2.0 2020-01-07 [1] RSPM (R 4.0.2)
+## RSQLite 2.2.1 2020-09-30 [1] RSPM (R 4.0.2)
## rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0)
-## S4Vectors * 0.26.1 2020-05-16 [1] Bioconductor
+## S4Vectors * 0.28.1 2020-12-09 [1] Bioconductor
## scales 1.1.1 2020-05-11 [1] RSPM (R 4.0.0)
## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.0)
## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.2)
## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.0)
## styler 1.3.2 2020-02-23 [1] RSPM (R 4.0.0)
-## SummarizedExperiment * 1.18.2 2020-07-09 [1] Bioconductor
+## SummarizedExperiment * 1.20.0 2020-10-27 [1] Bioconductor
## survival 3.1-12 2020-04-10 [2] CRAN (R 4.0.2)
-## tibble 3.0.3 2020-07-10 [1] RSPM (R 4.0.2)
+## tibble 3.0.4 2020-10-12 [1] RSPM (R 4.0.2)
## tidyselect 1.1.0 2020-05-11 [1] RSPM (R 4.0.0)
## vctrs 0.3.4 2020-08-29 [1] RSPM (R 4.0.2)
## withr 2.3.0 2020-09-22 [1] RSPM (R 4.0.2)
## xfun 0.18 2020-09-29 [1] RSPM (R 4.0.2)
## XML 3.99-0.5 2020-07-23 [1] RSPM (R 4.0.2)
## xtable 1.8-4 2019-04-21 [1] RSPM (R 4.0.0)
-## XVector 0.28.0 2020-04-27 [1] Bioconductor
+## XVector 0.30.0 2020-10-27 [1] Bioconductor
## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.0)
-## zlibbioc 1.34.0 2020-04-27 [1] Bioconductor
+## zlibbioc 1.36.0 2020-10-27 [1] Bioconductor
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library
-Freytag S., 2019 Workshop: Dimension reduction with r
+Freytag S., 2019 Workshop: Dimension reduction with R. https://rpubs.com/Saskia/520216
-Love M. I., W. Huber, and S. Anders, 2014 Moderated estimation of fold change and dispersion for rna-seq data with deseq2. Genome Biology 15. https://doi.org/10.1186/s13059-014-0550-8
+Love M. I., W. Huber, and S. Anders, 2014 Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biology 15. https://doi.org/10.1186/s13059-014-0550-8
Nguyen L. H., and S. Holmes, 2019 Ten quick tips for effective dimensionality reduction. PLOS Computational Biology 15. https://doi.org/10.1371/journal.pcbi.1006907
-Powell V., and L. Lehe, Principal component analysis explained visually
+Powell V., and L. Lehe, Principal component analysis explained visually. https://setosa.io/ev/principal-component-analysis/
-Prabhakaran S., 2016 The complete ggplot2 tutorial.
+Prabhakaran S., 2016 The complete ggplot2 tutorial. http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html
Sharma N. V., K. L. Pellegrini, V. Ouellet, F. O. Giuste, and S. Ramalingam et al., 2018 Identification of the transcription factor relationships associated with androgen deprivation therapy response and metastatic progression in prostate cancer. Cancers 10. https://doi.org/10.3390/cancers10100379
@@ -3410,6 +4205,11 @@ 6 Print session info
+
.Rmd
Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!
If you have trouble running this chunk, see our introduction to using .Rmd
s for more resources and explanations.
# Create the data folder if it doesn't exist
-if (!dir.exists("data")) {
- dir.create("data")
-}
-
-# Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
-
-# Create the plots folder if it doesn't exist
-if (!dir.exists(plots_dir)) {
- dir.create(plots_dir)
-}
-
-# Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
-
-# Create the results folder if it doesn't exist
-if (!dir.exists(results_dir)) {
- dir.create(results_dir)
-}
+# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+ dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots"
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+ dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results"
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+ dir.create(results_dir)
+}
In the same place you put this .Rmd
file, you should now have three new empty folders called data
, plots
, and results
!
Click the “Download Now” button on the right side of this screen.
Fill out the pop up window with your email and our Terms and Conditions:
+We are going to use non-quantile normalized data for this analysis. To get this data, you will need to check the box that says “Skip quantile normalization for RNA-seq samples”. Note that this option will only be available for RNA-seq datasets.
It may take a few minutes for the dataset to process. You will get an email when it is ready.
@@ -3008,13 +3834,12 @@For this example analysis, we will use this prostate cancer dataset.
-The data that we downloaded from refine.bio for this analysis has 175 RNA-seq samples obtained from 20 patients with prostate cancer. Patients underwent androgen deprivation therapy (ADT) and RNA-seq samples include pre-ADT biopsies and post-ADT prostatectomy specimens.
data/
folderRefine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip
. Double clicking should unzip this for you and create a folder of the same name.
refine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip
. Double clicking should unzip this for you and create a folder of the same name.
For more details on the contents of this folder see these docs on refine.bio.
The <experiment_accession_id>
folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235
or SRP12345
.
Copy and paste the SRP133573
folder into your newly created data/
folder.
In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.
First we will declare the file paths to our data and metadata files, which should be in our data directory. This is handy because if we want to switch the dataset we are using for this analysis (see the next section for more on this), we will only have to change the file paths here to get started.
-# Define the file path to the data directory
-data_dir <- file.path("data", "SRP133573") # Replace with accession number which will be the name of the folder the files will be in
-
-# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
-data_file <- file.path(data_dir, "SRP133573.tsv") # Replace with file path to your dataset
-
-# Declare the file path to the metadata file using the data directory saved as `data_dir`
-metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv") # Replace with file path to your metadata
+# Define the file path to the data directory
+# Replace with the path of the folder the files will be in
+data_dir <- file.path("data", "SRP133573")
+
+# Declare the file path to the gene expression matrix file
+# inside directory saved as `data_dir`
+# Replace with the path to your dataset file
+data_file <- file.path(data_dir, "SRP133573.tsv")
+
+# Declare the file path to the metadata file
+# inside the directory saved as `data_dir`
+# Replace with the path to your metadata file
+metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv")
Now that our file paths are declared, we can use the file.exists()
function to check that the files are where we specified above.
# Check if the gene expression matrix file is at the file path stored in `data_file`
-file.exists(data_file)
+# Check if the gene expression matrix file is at the path stored in `data_file`
+file.exists(data_file)
## [1] TRUE
-# Check if the metadata file is at the file path stored in `metadata_file`
-file.exists(metadata_file)
+# Check if the metadata file is at the file path stored in `metadata_file`
+file.exists(metadata_file)
## [1] TRUE
If the chunk above printed out FALSE
to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.
@@ -3073,90 +3903,40 @@See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.
-In this analysis, we will be using the R package DESeq2
(Love et al. 2014) for normalization, the R package umap
(Konopka 2020) for the production of UMAP dimension reduction values, and the R package ggplot2
(Prabhakaran 2016) for plotting the UMAP values.
if (!("DESeq2" %in% installed.packages())) {
- # Install DESeq2
- BiocManager::install("DESeq2", update = FALSE)
-}
-
-if (!("umap" %in% installed.packages())) {
- # Install umap package
- BiocManager::install("umap", update = FALSE)
-}
+In this analysis, we will be using the R package DESeq2
(Love et al. 2014) for normalization, the R package umap
(Konopka 2020) for the production of UMAP dimension reduction values, and the R package ggplot2
(Prabhakaran 2016) for plotting the UMAP values.
if (!("DESeq2" %in% installed.packages())) {
+ # Install DESeq2
+ BiocManager::install("DESeq2", update = FALSE)
+}
+
+if (!("umap" %in% installed.packages())) {
+ # Install umap package
+ BiocManager::install("umap", update = FALSE)
+}
Attach the packages we need for this analysis:
-# Attach the `DESeq2` library
-library(DESeq2)
-## Loading required package: S4Vectors
-## Loading required package: stats4
-## Loading required package: BiocGenerics
-## Loading required package: parallel
-##
-## Attaching package: 'BiocGenerics'
-## The following objects are masked from 'package:parallel':
-##
-## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
-## clusterExport, clusterMap, parApply, parCapply, parLapply,
-## parLapplyLB, parRapply, parSapply, parSapplyLB
-## The following objects are masked from 'package:stats':
-##
-## IQR, mad, sd, var, xtabs
-## The following objects are masked from 'package:base':
-##
-## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
-## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
-## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
-## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
-## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
-## union, unique, unsplit, which, which.max, which.min
-##
-## Attaching package: 'S4Vectors'
-## The following object is masked from 'package:base':
-##
-## expand.grid
-## Loading required package: IRanges
-## Loading required package: GenomicRanges
-## Loading required package: GenomeInfoDb
-## Loading required package: SummarizedExperiment
-## Loading required package: Biobase
-## Welcome to Bioconductor
-##
-## Vignettes contain introductory material; view with
-## 'browseVignettes()'. To cite Bioconductor, see
-## 'citation("Biobase")', and for packages 'citation("pkgname")'.
-## Loading required package: DelayedArray
-## Loading required package: matrixStats
-##
-## Attaching package: 'matrixStats'
-## The following objects are masked from 'package:Biobase':
-##
-## anyMissing, rowMedians
-##
-## Attaching package: 'DelayedArray'
-## The following objects are masked from 'package:matrixStats':
-##
-## colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
-## The following objects are masked from 'package:base':
-##
-## aperm, apply, rowsum
-# Attach the `umap` library
-library(umap)
-
-# Attach the `ggplot2` library for plotting
-library(ggplot2)
-
-# We will need this so we can use the pipe: %>%
-library(magrittr)
-
-# Set the seed so our results are reproducible:
-set.seed(12345)
+# Attach the `DESeq2` library
+library(DESeq2)
+
+# Attach the `umap` library
+library(umap)
+
+# Attach the `ggplot2` library for plotting
+library(ggplot2)
+
+# We will need this so we can use the pipe: %>%
+library(magrittr)
+
+# Set the seed so our results are reproducible:
+set.seed(12345)
Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read both TSV files and add them as data frames to your environment.
We stored our file paths as objects named metadata_file
and data_file
in this previous step.
# Read in metadata TSV file
-metadata <- readr::read_tsv(metadata_file)
-## Parsed with column specification:
+metadata <- readr::read_tsv(metadata_file)
+##
+## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_character(),
## refinebio_age = col_logical(),
@@ -3169,142 +3949,157 @@ 4.2 Import and set up data
## refinebio_source_archive_url = col_logical(),
## refinebio_specimen_part = col_logical(),
## refinebio_time = col_logical()
-## )
-## See spec(...) for full column specifications.
-# Read in data TSV file
-df <- readr::read_tsv(data_file) %>%
- # Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later
- tibble::column_to_rownames("Gene")
-## Parsed with column specification:
+## )
+## ℹ Use `spec()` for the full column specifications.
+# Read in data TSV file
+expression_df <- readr::read_tsv(data_file) %>%
+ # Tuck away the gene ID column as row names, leaving only numeric values
+ tibble::column_to_rownames("Gene")
+##
+## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_double(),
## Gene = col_character()
## )
-## See spec(...) for full column specifications.
+## ℹ Use `spec()` for the full column specifications.
Let’s ensure that the metadata and data are in the same sample order.
-# Make the data in the order of the metadata
-df <- df %>%
- dplyr::select(metadata$refinebio_accession_code)
-
-# Check if this is in the same order
-all.equal(colnames(df), metadata$refinebio_accession_code)
+# Make the data in the order of the metadata
+expression_df <- expression_df %>%
+ dplyr::select(metadata$refinebio_accession_code)
+
+# Check if this is in the same order
+all.equal(colnames(expression_df), metadata$refinebio_accession_code)
## [1] TRUE
-
Now we are going to use a combination of functions from the DESeq2
, umap
, and ggplot2
packages to perform and visualize the results of the Uniform Manifold Approximation and Projection (UMAP) dimension reduction technique on our pre-ADT and post-ADT samples.
DESeq2
We need to make sure all of the values in our data are converted to integers as required by a DESeq2
function we will use later.
# The `DESeqDataSetFromMatrix()` function needs the values to be converted to integers
-df <- df %>%
- # Mutate numeric variables to be integers
- dplyr::mutate_if(is.numeric, round)
-DESEq2
DESeq2
We need to make sure all of the metadata column variables that we would like to use to annotate our plot are converted into factors.
-# We need to also format the variables from the metadata, that we will be using for annotation of the UMAP plot, into factors
-metadata <- metadata %>%
- dplyr::select( # The metadata has many variables, we want to select only those that we will need for plotting later
- refinebio_accession_code,
- refinebio_treatment,
- refinebio_disease
- ) %>%
- dplyr::mutate( # Now let's convert the annotation variables into factors
- refinebio_treatment = as.factor(refinebio_treatment),
- refinebio_disease = as.factor(refinebio_disease)
- )
+# convert the columns we will be using for annotation into factors
+metadata <- metadata %>%
+ dplyr::select( # select only the columns that we will need for plotting
+ refinebio_accession_code,
+ refinebio_treatment,
+ refinebio_disease
+ ) %>%
+ dplyr::mutate( # Now let's convert the annotation variables into factors
+ refinebio_treatment = factor(
+ refinebio_treatment,
+ # specify the possible levels in the order we want them to appear
+ levels = c("pre-adt", "post-adt")
+ ),
+ refinebio_disease = as.factor(refinebio_disease)
+ )
We want to filter out the genes that have not been expressed or that have low expression counts since these genes are likely to add noise rather than useful signal to our analysis. We are going to do some pre-filtering to keep only genes with 10 or more reads total. Note that rows represent gene data and the columns represent sample data in our dataset.
+# Define a minimum counts cutoff and filter the data to include
+# only rows (genes) that have total counts above the cutoff
+filtered_expression_df <- expression_df %>%
+ dplyr::filter(rowSums(.) >= 10)
We also need our counts to be rounded before we can use them with the DESeqDataSetFromMatrix()
function.
We will be using the DESeq2
package for normalizing and transforming our data, which requires us to format our data into a DESeqDataSet
object. We turn the data frame (or matrix) into a DESeqDataSet
object and specify which variable labels our experimental groups using the design
argument (Love et al. 2014). In this chunk of code, we will not provide a specific model to the design
argument because we are not performing a differential expression analysis.
# Create a `DESeqDataSet` object
-dds <- DESeqDataSetFromMatrix(
- countData = df, # This is the data frame with the counts values for all replicates in our dataset
- colData = metadata, # This is the data frame with the annotation data for the replicates in the counts data frame
- design = ~1 # Here we are not specifying a model -- Replace with an appropriate design variable for your analysis
-)
+We will be using the DESeq2
package for normalizing and transforming our data, which requires us to format our data into a DESeqDataSet
object. We turn the data frame (or matrix) into a DESeqDataSet
object and specify which variable labels our experimental groups using the design
argument (Love et al. 2014). In this chunk of code, we will not provide a specific model to the design
argument because we are not performing a differential expression analysis.
# Create a `DESeqDataSet` object
+dds <- DESeqDataSetFromMatrix(
+ countData = filtered_expression_df, # the counts values for all samples in our dataset
+ colData = metadata, # annotation data for the samples in the counts data frame
+ design = ~1 # Here we are not specifying a model
+ # Replace with an appropriate design variable for your analysis
+)
## converting counts to integer mode
We want to filter out the genes that have not been expressed or that have low expression counts because we want to remove any possible noise from our data before we normalize the data and create our plot. We are going to do some pre-filtering to keep only genes with 10 or more reads total. Note that rows represent gene data and the columns represent sample data in our dataset.
-# Define a minimum counts cutoff and filter `DESeqDataSet` object to include
-# only rows that have counts above the cutoff
-genes_to_keep <- rowSums(counts(dds)) >= 10
-dds <- dds[genes_to_keep, ]
-We are going to use the vst()
function from the DESeq2
package to normalize and transform the data. For more information about these transformation methods, see here.
# Normalize and transform the data in the `DESeqDataSet` object using the `vst()` function from the `DESeq2` R package
-dds_norm <- vst(dds)
+dds_norm <- vst(dds)
Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique proposed by McInnes et al. (2018) (See associated paper). While PCA assumes that the variation we care about has a particular distribution (normal, broadly speaking), UMAP allows more complicated distributions that it learns from the data. The advantage of this feature is that UMAP can do a better job separating clusters, especially when some of those clusters may be more similar to each other than others (CCDL 2020).
-In this code chunk, we are going to extract the normalized counts data from the DESeqDataSet
object and perform UMAP on the normalized data using umap()
from the umap
package. We are using the default parameters when we run the umap::umap()
function. Here’s some guidance about choosing parameters when executing umap::umap()
(R CRAN Team 2019). You can also run the following in the RStudio console to get more information on the function and its default parameters: ?umap::umap
or ?umap::umap.defaults
.
# First we are going to retrieve the normalized data from the `DESeqDataSet` object using the `assay()` function
-normalized_counts <- assay(dds_norm) %>%
- t() # We need to transpose this data in preparation for the `umap()` function
-
-# Now let's tell R to perform UMAP on the normalized data
-umap_results <- umap::umap(normalized_counts)
+Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique proposed by McInnes et al. (2018) (See associated paper). While PCA assumes that the variation we care about has a particular distribution (normal, broadly speaking), UMAP allows more complicated distributions that it learns from the data. The advantage of this feature is that UMAP can do a better job separating clusters, especially when some of those clusters may be more similar to each other than others (Childhood Cancer Data Lab 2020).
+In this code chunk, we are going to extract the normalized counts data from the DESeqDataSet
object and perform UMAP on the normalized data using umap()
from the umap
package. We are using the default parameters when we run the umap::umap()
function. Here’s some guidance about choosing parameters when executing umap::umap()
(R CRAN Team 2019). You can also run the following in the RStudio console to get more information on the function and its default parameters: ?umap::umap
or ?umap::umap.defaults
.
# First we are going to retrieve the normalized data
+# from the `DESeqDataSet` object using the `assay()` function
+normalized_counts <- assay(dds_norm) %>%
+ t() # We need to transpose this data so each row is a sample
+
+# Now perform UMAP on the normalized data
+umap_results <- umap::umap(normalized_counts)
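We ran umap::umap() with its defaults above. If you want to experiment with the parameters discussed earlier, the umap package accepts a configuration object; here is a minimal sketch (the n_neighbors and min_dist values below are illustrative, not recommendations):

```r
# Start from the package defaults and tweak a couple of parameters
custom_config <- umap::umap.defaults
custom_config$n_neighbors <- 10 # smaller neighborhoods emphasize local structure
custom_config$min_dist <- 0.5 # larger values spread the points out more

# Re-run UMAP on the same normalized, transposed counts with the custom settings
umap_custom_results <- umap::umap(normalized_counts, config = custom_config)
```

Because UMAP results can shift substantially with these settings, it is worth comparing a few configurations before drawing conclusions.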
Now that we have the results from UMAP, we need to extract the counts data from the umap_results
object and merge the variables from the metadata that we will use for annotating our plot.
# We need to tell R to create a data frame with the umap values and annotation data in preparation for plotting with `ggplot2`
-umap_plot_df <- data.frame(umap_results$layout) %>%
- tibble::rownames_to_column("refinebio_accession_code") %>% # Let's get the rownames into a column so we can merge the annotation data
- dplyr::inner_join(metadata, by = "refinebio_accession_code") # Let's merge the annotation data using the `refinebio_accession_code` column
+# Make into data frame for plotting with `ggplot2`
+# The UMAP values we need for plotting are stored in the `layout` element
+umap_plot_df <- data.frame(umap_results$layout) %>%
+ # Turn sample IDs stored as row names into a column
+ tibble::rownames_to_column("refinebio_accession_code") %>%
+ # Add the metadata into this data frame; match by sample IDs
+ dplyr::inner_join(metadata, by = "refinebio_accession_code")
Let’s take a look at the data frame we created in the chunk above.
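One quick way to preview it is with head(); a minimal sketch:

```r
# Preview the first few rows of the combined UMAP + annotation data frame
head(umap_plot_df)
```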
+Here we can see that UMAP took the data from thousands of genes and reduced it to just two variables, X1
and X2
.
Now we can use the ggplot()
function to plot our normalized UMAP scores.
# Plot using `ggplot()` function
-ggplot(
- umap_plot_df,
- aes(
- x = X1,
- y = X2
- )
-) +
- geom_point() # This tells R that we want a scatterplot
-
+# Plot using `ggplot()` function
+ggplot(
+ umap_plot_df,
+ aes(
+ x = X1,
+ y = X2
+ )
+) +
+ geom_point() # Plot individual points to make a scatterplot
Let’s try adding a variable to our plot for annotation.
In this code chunk, the variable refinebio_treatment
is given to the ggplot()
function so we can label by androgen deprivation therapy (ADT) status.
# Plot using `ggplot()` function
-ggplot(
- umap_plot_df,
- aes(
- x = X1,
- y = X2,
- color = refinebio_treatment # This will label points with different colors for each `refinebio_treatment` group
- )
-) +
- geom_point() # This tells R that we want a scatterplot
-
+# Plot using `ggplot()` function
+ggplot(
+ umap_plot_df,
+ aes(
+ x = X1,
+ y = X2,
+    color = refinebio_treatment # label points with different colors for each `refinebio_treatment` group
+ )
+) +
+ geom_point() # This tells R that we want a scatterplot
In the next code chunk, we are going to add another variable to our plot for annotation.
-We’ll plot using both refinebio_treatment
and refinebio_disease
variables for labels since they are central to the androgen deprivation therapy (ADT) based hypothesis in the original paper (Sharma et al. 2018).
# Plot using `ggplot()` function
-final_annotated_umap_plot <- ggplot(
- umap_plot_df,
- aes(
- x = X1,
- y = X2,
- color = refinebio_treatment, # This will label points with different colors for each `refinebio_treatment` group
- shape = refinebio_disease # This will label points with different shapes for each `refinebio_disease` group
- )
-) +
- geom_point() # This tells R that we want a scatterplot
-
-# Display the plot that we saved above
-final_annotated_umap_plot
-
+We’ll plot using both refinebio_treatment
and refinebio_disease
variables for labels since they are central to the androgen deprivation therapy (ADT) based hypothesis in the original paper (Sharma et al. 2018).
# Plot using `ggplot()` function and save to an object
+final_annotated_umap_plot <- ggplot(
+ umap_plot_df,
+ aes(
+ x = X1,
+ y = X2,
+ # plot points with different colors for each `refinebio_treatment` group
+ color = refinebio_treatment,
+ # plot points with different shapes for each `refinebio_disease` group
+ shape = refinebio_disease
+ )
+) +
+ geom_point() # make a scatterplot
+
+# Display the plot that we saved above
+final_annotated_umap_plot
Although it does appear that the majority of the pre-ADT and post-ADT samples cluster together, there are still questions remaining as we look at the outliers.
In summary, a good rule of thumb to remember is: if the results of an analysis can be completely changed by changing its parameters, you should be very cautious when it comes to the conclusions you draw from it as well as having good rationale for the parameters you choose (adapted from CCDL (2020) training materials).
+In summary, a good rule of thumb to remember is: if the results of an analysis can be completely changed by changing its parameters, you should be very cautious when it comes to the conclusions you draw from it as well as having good rationale for the parameters you choose (adapted from Childhood Cancer Data Lab (2020) training materials).
You can easily switch this to save to a JPEG or TIFF by changing the file name within the ggsave()
function to the respective file suffix.
# Save plot using `ggsave()` function
-ggsave(file.path(plots_dir, "SRP133573_umap_plot.png"), # Replace with name relevant your plotted data
- plot = final_annotated_umap_plot
-)
+# Save plot using `ggsave()` function
+ggsave(
+ file.path(
+ plots_dir,
+ "SRP133573_umap_plot.png" # Replace with a good file name for your plot
+ ),
+ plot = final_annotated_umap_plot
+)
## Saving 7 x 5 in image
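As a sketch of the suffix switch described above, saving a TIFF only requires changing the file extension (the file name here is illustrative):

```r
# Same plot, saved as a TIFF instead of a PNG
ggsave(
  file.path(plots_dir, "SRP133573_umap_plot.tiff"),
  plot = final_annotated_umap_plot
)
```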
At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.
-# Print session info
-sessioninfo::session_info()
-## ─ Session info ───────────────────────────────────────────────────────────────
+# Print session info
+sessioninfo::session_info()
+## ─ Session info ─────────────────────────────────────────────────────
## setting value
## version R version 4.0.2 (2020-06-22)
## os Ubuntu 20.04 LTS
@@ -3351,47 +4150,48 @@ 6 Print session info
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
-## date 2020-10-16
+## date 2020-12-18
##
-## ─ Packages ───────────────────────────────────────────────────────────────────
+## ─ Packages ─────────────────────────────────────────────────────────
## package * version date lib source
-## annotate 1.66.0 2020-04-27 [1] Bioconductor
-## AnnotationDbi 1.50.3 2020-07-25 [1] Bioconductor
+## annotate 1.68.0 2020-10-27 [1] Bioconductor
+## AnnotationDbi 1.52.0 2020-10-27 [1] Bioconductor
## askpass 1.1 2019-01-13 [1] RSPM (R 4.0.0)
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.0)
## backports 1.1.10 2020-09-15 [1] RSPM (R 4.0.2)
-## Biobase * 2.48.0 2020-04-27 [1] Bioconductor
-## BiocGenerics * 0.34.0 2020-04-27 [1] Bioconductor
-## BiocParallel 1.22.0 2020-04-27 [1] Bioconductor
+## Biobase * 2.50.0 2020-10-27 [1] Bioconductor
+## BiocGenerics * 0.36.0 2020-10-27 [1] Bioconductor
+## BiocParallel 1.24.1 2020-11-06 [1] Bioconductor
## bit 4.0.4 2020-08-04 [1] RSPM (R 4.0.2)
## bit64 4.0.5 2020-08-30 [1] RSPM (R 4.0.2)
## bitops 1.0-6 2013-08-17 [1] RSPM (R 4.0.0)
## blob 1.2.1 2020-01-20 [1] RSPM (R 4.0.0)
-## cli 2.0.2 2020-02-28 [1] RSPM (R 4.0.0)
+## cli 2.1.0 2020-10-12 [1] RSPM (R 4.0.2)
## colorspace 1.4-1 2019-03-18 [1] RSPM (R 4.0.0)
## crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.0)
## DBI 1.1.0 2019-12-15 [1] RSPM (R 4.0.0)
-## DelayedArray * 0.14.1 2020-07-14 [1] Bioconductor
-## DESeq2 * 1.28.1 2020-05-12 [1] Bioconductor
+## DelayedArray 0.16.0 2020-10-27 [1] Bioconductor
+## DESeq2 * 1.30.0 2020-10-27 [1] Bioconductor
## digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0)
## dplyr 1.0.2 2020-08-18 [1] RSPM (R 4.0.2)
## ellipsis 0.3.1 2020-05-15 [1] RSPM (R 4.0.0)
## evaluate 0.14 2019-05-28 [1] RSPM (R 4.0.0)
## fansi 0.4.1 2020-01-08 [1] RSPM (R 4.0.0)
## farver 2.0.3 2020-01-16 [1] RSPM (R 4.0.0)
-## genefilter 1.70.0 2020-04-27 [1] Bioconductor
-## geneplotter 1.66.0 2020-04-27 [1] Bioconductor
+## genefilter 1.72.0 2020-10-27 [1] Bioconductor
+## geneplotter 1.68.0 2020-10-27 [1] Bioconductor
## generics 0.0.2 2018-11-29 [1] RSPM (R 4.0.0)
-## GenomeInfoDb * 1.24.2 2020-06-15 [1] Bioconductor
-## GenomeInfoDbData 1.2.3 2020-10-06 [1] Bioconductor
-## GenomicRanges * 1.40.0 2020-04-27 [1] Bioconductor
+## GenomeInfoDb * 1.26.2 2020-12-08 [1] Bioconductor
+## GenomeInfoDbData 1.2.4 2020-12-16 [1] Bioconductor
+## GenomicRanges * 1.42.0 2020-10-27 [1] Bioconductor
## getopt 1.20.3 2019-03-22 [1] RSPM (R 4.0.0)
## ggplot2 * 3.3.2 2020-06-19 [1] RSPM (R 4.0.1)
## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.2)
## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.0.0)
## hms 0.5.3 2020-01-08 [1] RSPM (R 4.0.0)
## htmltools 0.5.0 2020-06-16 [1] RSPM (R 4.0.1)
-## IRanges * 2.22.2 2020-05-21 [1] Bioconductor
+## httr 1.4.2 2020-07-20 [1] RSPM (R 4.0.2)
+## IRanges * 2.24.1 2020-12-12 [1] Bioconductor
## jsonlite 1.7.1 2020-09-07 [1] RSPM (R 4.0.2)
## knitr 1.30 2020-09-22 [1] RSPM (R 4.0.2)
## labeling 0.3 2014-08-23 [1] RSPM (R 4.0.0)
@@ -3400,6 +4200,7 @@ 6 Print session info
## locfit 1.5-9.4 2020-03-25 [1] RSPM (R 4.0.0)
## magrittr * 1.5 2014-11-22 [1] RSPM (R 4.0.0)
## Matrix 1.2-18 2019-11-27 [2] CRAN (R 4.0.2)
+## MatrixGenerics * 1.2.0 2020-10-27 [1] Bioconductor
## matrixStats * 0.57.0 2020-09-25 [1] RSPM (R 4.0.2)
## memoise 1.1.0 2017-04-21 [1] RSPM (R 4.0.0)
## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.0.0)
@@ -3407,6 +4208,7 @@ 6 Print session info
## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0)
## pillar 1.4.6 2020-07-10 [1] RSPM (R 4.0.2)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.0)
+## ps 1.4.0 2020-10-07 [1] RSPM (R 4.0.2)
## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.0)
## R.cache 0.14.0 2019-12-06 [1] RSPM (R 4.0.0)
## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.2)
@@ -3416,23 +4218,23 @@ 6 Print session info
## RColorBrewer 1.1-2 2014-12-07 [1] RSPM (R 4.0.0)
## Rcpp 1.0.5 2020-07-06 [1] RSPM (R 4.0.2)
## RCurl 1.98-1.2 2020-04-18 [1] RSPM (R 4.0.0)
-## readr 1.3.1 2018-12-21 [1] RSPM (R 4.0.2)
+## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.2)
## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.0)
## reticulate 1.16 2020-05-27 [1] RSPM (R 4.0.2)
-## rlang 0.4.7 2020-07-09 [1] RSPM (R 4.0.2)
+## rlang 0.4.8 2020-10-08 [1] RSPM (R 4.0.2)
## rmarkdown 2.4 2020-09-30 [1] RSPM (R 4.0.2)
## RSpectra 0.16-0 2019-12-01 [1] RSPM (R 4.0.2)
-## RSQLite 2.2.0 2020-01-07 [1] RSPM (R 4.0.2)
+## RSQLite 2.2.1 2020-09-30 [1] RSPM (R 4.0.2)
## rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0)
-## S4Vectors * 0.26.1 2020-05-16 [1] Bioconductor
+## S4Vectors * 0.28.1 2020-12-09 [1] Bioconductor
## scales 1.1.1 2020-05-11 [1] RSPM (R 4.0.0)
## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.0)
## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.2)
## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.0)
## styler 1.3.2 2020-02-23 [1] RSPM (R 4.0.0)
-## SummarizedExperiment * 1.18.2 2020-07-09 [1] Bioconductor
+## SummarizedExperiment * 1.20.0 2020-10-27 [1] Bioconductor
## survival 3.1-12 2020-04-10 [2] CRAN (R 4.0.2)
-## tibble 3.0.3 2020-07-10 [1] RSPM (R 4.0.2)
+## tibble 3.0.4 2020-10-12 [1] RSPM (R 4.0.2)
## tidyselect 1.1.0 2020-05-11 [1] RSPM (R 4.0.0)
## umap * 0.2.6.0 2020-06-16 [1] RSPM (R 4.0.2)
## vctrs 0.3.4 2020-08-29 [1] RSPM (R 4.0.2)
@@ -3440,9 +4242,9 @@ 6 Print session info
## xfun 0.18 2020-09-29 [1] RSPM (R 4.0.2)
## XML 3.99-0.5 2020-07-23 [1] RSPM (R 4.0.2)
## xtable 1.8-4 2019-04-21 [1] RSPM (R 4.0.0)
-## XVector 0.28.0 2020-04-27 [1] Bioconductor
+## XVector 0.30.0 2020-10-27 [1] Bioconductor
## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.0)
-## zlibbioc 1.34.0 2020-04-27 [1] Bioconductor
+## zlibbioc 1.36.0 2020-10-27 [1] Bioconductor
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library
@@ -3451,28 +4253,28 @@ 6 Print session info
References
-CCDL, 2020 ScRNA-seq dimension reduction.
+Childhood Cancer Data Lab, 2020 scRNA-seq dimension reduction. https://github.com/AlexsLemonade/training-modules/blob/3dbc6f3f53c680ec6aa2f513851c1cd4635cc31c/scRNA-seq/05-dimension_reduction_scRNA-seq.Rmd#L382
-Freytag S., 2019 Workshop: Dimension reduction with r
+Freytag S., 2019 Workshop: Dimension reduction with R. https://rpubs.com/Saskia/520216
-Konopka T., 2020 Uniform manifold approximation and projection.
+Konopka T., 2020 Uniform manifold approximation and projection. https://cran.r-project.org/web/packages/umap/umap.pdf
-Love M. I., W. Huber, and S. Anders, 2014 Moderated estimation of fold change and dispersion for rna-seq data with deseq2. Genome Biology 15. https://doi.org/10.1186/s13059-014-0550-8
+Love M. I., W. Huber, and S. Anders, 2014 Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biology 15. https://doi.org/10.1186/s13059-014-0550-8
-McInnes L., J. Healy, and J. Melville, 2018 UMAP: Uniform manifold approximation and projection for dimension reduction
+McInnes L., J. Healy, and J. Melville, 2018 UMAP: Uniform manifold approximation and projection for dimension reduction. https://arxiv.org/abs/1802.03426
Nguyen L. H., and S. Holmes, 2019 Ten quick tips for effective dimensionality reduction. PLOS Computational Biology 15. https://doi.org/10.1371/journal.pcbi.1006907
-Prabhakaran S., 2016 The complete ggplot2 tutorial.
+Prabhakaran S., 2016 The complete ggplot2 tutorial. http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html
-R CRAN Team, 2019 Uniform manifold approximation and projection in r.
+R CRAN Team, 2019 Uniform manifold approximation and projection in R. https://cran.r-project.org/web/packages/umap/vignettes/umap.html
Sharma N. V., K. L. Pellegrini, V. Ouellet, F. O. Giuste, and S. Ramalingam et al., 2018 Identification of the transcription factor relationships associated with androgen deprivation therapy response and metastatic progression in prostate cancer. Cancers 10. https://doi.org/10.3390/cancers10100379
@@ -3480,6 +4282,11 @@ References
+
The purpose of this notebook is to provide an example of mapping gene IDs for RNA-seq data obtained from refine.bio using AnnotationDbi
packages (Carlson 2020a).
The purpose of this notebook is to provide an example of mapping gene IDs for RNA-seq data obtained from refine.bio using AnnotationDbi
packages (Pagès et al. 2020).
.Rmd
Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!
If you have trouble running this chunk, see our introduction to using .Rmd
s for more resources and explanations.
# Create the data folder if it doesn't exist
-if (!dir.exists("data")) {
- dir.create("data")
-}
-
-# Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
-
-# Create the plots folder if it doesn't exist
-if (!dir.exists(plots_dir)) {
- dir.create(plots_dir)
-}
-
-# Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
-
-# Create the results folder if it doesn't exist
-if (!dir.exists(results_dir)) {
- dir.create(results_dir)
-}
+# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+ dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots"
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+ dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results"
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+ dir.create(results_dir)
+}
In the same place you put this .Rmd
file, you should now have three new empty folders called data
, plots
, and results
!
In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.
First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.
-# Define the file path to the data directory
-data_dir <- file.path("data", "SRP040561") # Replace with accession number which will be the name of the folder the files will be in
-
-# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
-data_file <- file.path(data_dir, "SRP040561.tsv") # Replace with file path to your dataset
-
-# Declare the file path to the metadata file using the data directory saved as `data_dir`
-metadata_file <- file.path(data_dir, "metadata_SRP040561.tsv") # Replace with file path to your metadata
+# Define the file path to the data directory
+# Replace with the path of the folder the files will be in
+data_dir <- file.path("data", "SRP040561")
+
+# Declare the file path to the gene expression matrix file
+# inside directory saved as `data_dir`
+# Replace with the path to your dataset file
+data_file <- file.path(data_dir, "SRP040561.tsv")
+
+# Declare the file path to the metadata file
+# inside the directory saved as `data_dir`
+# Replace with the path to your metadata file
+metadata_file <- file.path(data_dir, "metadata_SRP040561.tsv")
Now that our file paths are declared, we can use the file.exists()
function to check that the files are where we specified above.
# Check if the gene expression matrix file is at the file path stored in `data_file`
-file.exists(data_file)
+# Check if the gene expression matrix file is at the path stored in `data_file`
+file.exists(data_file)
## [1] TRUE
-# Check if the metadata file is at the file path stored in `metadata_file`
-file.exists(metadata_file)
+# Check if the metadata file is at the file path stored in `metadata_file`
+file.exists(metadata_file)
## [1] TRUE
If the chunk above printed out FALSE
to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
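If you prefer the notebook to stop immediately rather than fail later with a more confusing error, you could add a strict check like this (an optional sketch, not part of the original analysis):

```r
# Halt right away if either input file is missing
stopifnot(file.exists(data_file), file.exists(metadata_file))
```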
If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.
@@ -3063,73 +3894,39 @@If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/
directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/
and results/
directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.
refine.bio data comes with gene level data with Ensembl IDs. Although this example script uses Ensembl IDs from Zebrafish, (Danio rerio), to obtain Entrez IDs this script can be easily converted for use with different species or annotation types e.g. protein IDs, gene ontology, accession numbers.
-For different species, wherever the abbreviation org.Dr.eg.db
or Dr
is written, it must be replaced with the respective species abbreviation e.g. for Homo sapiens org.Hs.eg.db
or Hs
would be used. In the case of our microarray gene identifier annotation example notebook, a Mouse (Mus musculus) dataset is used, meaning org.Mm.eg.db
or Mm
would also need to be used there. A full list of the annotation R packages from Bioconductor is at this link (R Bioconductor Team 2003).
refine.bio data comes with gene level data identified by Ensembl IDs. Although this example notebook uses Ensembl IDs from Zebrafish (Danio rerio) to obtain Entrez IDs, this script can be easily converted for use with different species or annotation types, e.g. protein IDs, gene ontology terms, or accession numbers.
+For different species, wherever the abbreviation org.Dr.eg.db
or Dr
is written, it must be replaced with the respective species abbreviation e.g. for Homo sapiens org.Hs.eg.db
or Hs
would be used. In the case of our microarray gene identifier annotation example notebook, a Mouse (Mus musculus) dataset is used, meaning org.Mm.eg.db
or Mm
would also need to be used there. A full list of the annotation R packages from Bioconductor is at this link.
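To sketch what that substitution looks like for a human dataset, only the package name changes (this mirrors the zebrafish installation chunk used later in this notebook):

```r
# Install and attach the human annotation package instead of the zebrafish one
if (!("org.Hs.eg.db" %in% installed.packages())) {
  BiocManager::install("org.Hs.eg.db", update = FALSE)
}
library(org.Hs.eg.db)
```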
Ensembl IDs can be used to obtain various different annotations at the gene/transcript level. Let’s get ready to use the Ensembl IDs from our zebrafish dataset to obtain the associated Entrez IDs.
+refine.bio uses Ensembl IDs as the primary gene identifier in its data sets. While this is a consistent and useful identifier, a string of apparently random letters and numbers is not the most user-friendly or informative for interpretation. Luckily, we can use the Ensembl IDs that we have to obtain a variety of annotations at the gene/transcript level. Let’s get ready to use the Ensembl IDs from our zebrafish dataset to obtain the associated Entrez IDs.
See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.
-In this analysis, we will be using the org.Dr.eg.db
R package (Carlson 2019).
# Install the Zebrafish package
-if (!("org.Dr.eg.db" %in% installed.packages())) {
- # Install this package if it isn't installed yet
- BiocManager::install("org.Dr.eg.db", update = FALSE)
-}
-Attach the packages we need for this analysis.
-# Attach the library
-library(org.Dr.eg.db)
-## Loading required package: AnnotationDbi
-## Loading required package: stats4
-## Loading required package: BiocGenerics
-## Loading required package: parallel
-##
-## Attaching package: 'BiocGenerics'
-## The following objects are masked from 'package:parallel':
-##
-## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
-## clusterExport, clusterMap, parApply, parCapply, parLapply,
-## parLapplyLB, parRapply, parSapply, parSapplyLB
-## The following objects are masked from 'package:stats':
-##
-## IQR, mad, sd, var, xtabs
-## The following objects are masked from 'package:base':
-##
-## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
-## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
-## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
-## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
-## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
-## union, unique, unsplit, which, which.max, which.min
-## Loading required package: Biobase
-## Welcome to Bioconductor
-##
-## Vignettes contain introductory material; view with
-## 'browseVignettes()'. To cite Bioconductor, see
-## 'citation("Biobase")', and for packages 'citation("pkgname")'.
-## Loading required package: IRanges
-## Loading required package: S4Vectors
-##
-## Attaching package: 'S4Vectors'
-## The following object is masked from 'package:base':
-##
-## expand.grid
-##
-# We will need this so we can use the pipe: %>%
-library(magrittr)
+In this analysis, we will be using the org.Dr.eg.db
R package (Carlson 2019), which is part of the Bioconductor AnnotationDbi
framework (Pagès et al. 2020). Bioconductor compiles annotations from various sources, and these packages provide convenient methods to access and translate among those annotations. Annotation packages for other species are available and can be swapped in.
# Install the Zebrafish package
+if (!("org.Dr.eg.db" %in% installed.packages())) {
+ # Install this package if it isn't installed yet
+ BiocManager::install("org.Dr.eg.db", update = FALSE)
+}
Attach the packages we need for this analysis. Note that attaching org.Dr.eg.db
will automatically also attach AnnotationDbi
.
Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read the both TSV files and add them as data frames to your environment.
We stored our file paths as objects named metadata_file
and data_file
in this previous step.
# Read in metadata TSV file
-metadata <- readr::read_tsv(metadata_file)
-## Parsed with column specification:
+metadata <- readr::read_tsv(metadata_file)
+##
+## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_logical(),
## refinebio_accession_code = col_character(),
@@ -3138,49 +3935,50 @@ 4.2 Import and set up data
## refinebio_platform = col_character(),
## refinebio_source_database = col_character(),
## refinebio_title = col_character()
-## )
-## See spec(...) for full column specifications.
-# Read in data TSV file
-df <- readr::read_tsv(data_file) %>%
- # Tuck away the Gene ID column as rownames
- tibble::column_to_rownames("Gene")
-## Parsed with column specification:
+## )
+## ℹ Use `spec()` for the full column specifications.
+# Read in data TSV file
+expression_df <- readr::read_tsv(data_file) %>%
+ # Tuck away the Gene ID column as row names
+ tibble::column_to_rownames("Gene")
+##
+## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_double(),
## Gene = col_character()
## )
-## See spec(...) for full column specifications.
+## ℹ Use `spec()` for the full column specifications.
Let’s ensure that the metadata and data are in the same sample order.
-# Make the data in the order of the metadata
-df <- df %>%
- dplyr::select(metadata$refinebio_accession_code)
-
-# Check if this is in the same order
-all.equal(colnames(df), metadata$refinebio_accession_code)
+# Make the data in the order of the metadata
+expression_df <- expression_df %>%
+ dplyr::select(metadata$refinebio_accession_code)
+
+# Check if this is in the same order
+all.equal(colnames(expression_df), metadata$refinebio_accession_code)
## [1] TRUE
-# Bring back the "Gene" column in preparation for mapping
-df <- df %>%
- tibble::rownames_to_column("Gene")
+
The mapIds()
function has a multiVals
argument which denotes what to do when there are multiple mapped values for a single gene identifier. The default behavior is to return just the first mapped value. It is good to keep in mind that various downstream analyses may benefit from varied strategies at this step. Use ?mapIds
to see more options or strategies.
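As one illustration, here is a sketch (not part of the original workflow) of two of those alternative strategies, assuming the same org.Dr.eg.db annotation package and expression_df object used in this notebook:
# "first" (the default) keeps only the first Entrez ID found for each Ensembl ID
first_mapped <- mapIds(
  org.Dr.eg.db,
  keys = expression_df$Gene,
  keytype = "ENSEMBL",
  column = "ENTREZID",
  multiVals = "first"
)

# "filter" instead drops any Ensembl ID that maps to more than one Entrez ID
filter_mapped <- mapIds(
  org.Dr.eg.db,
  keys = expression_df$Gene,
  keytype = "ENSEMBL",
  column = "ENTREZID",
  multiVals = "filter"
)
Which strategy is appropriate depends on whether your downstream analysis can tolerate dropped genes or arbitrary tie-breaking.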
In the next chunk, we will run the mapIds()
function and supply the multiVals
argument with the "list"
option in order to get a large list with all the mapped values found for each gene identifier.
# Map Ensembl IDs to their associated Entrez IDs
-mapped_list <- mapIds(
- org.Dr.eg.db, # Replace with annotation package for the organism relevant to your data
- keys = df$Gene,
- column = "ENTREZID", # Replace with the type of gene identifiers you would like to map to
- keytype = "ENSEMBL", # Replace with the type of gene identifiers in your data
- multiVals = "list"
-)
+# Map Ensembl IDs to their associated Entrez IDs
+mapped_list <- mapIds(
+ org.Dr.eg.db, # Replace with annotation package for your organism
+ keys = expression_df$Gene,
+ keytype = "ENSEMBL", # Replace with the type of gene identifiers in your data
+ column = "ENTREZID", # The type of gene identifiers you would like to map to
+ multiVals = "list"
+)
## 'select()' returned 1:many mapping between keys and columns
Now, let’s take a look at our mapped object to see how the mapping went.
-# Let's use the `head()` function to take a preview at our mapped list
-head(mapped_list)
+
## $ENSDARG00000000001
## [1] "368418"
##
@@ -3199,58 +3997,38 @@ 4.4 Explore gene ID conversion
It looks like we have Entrez IDs that were successfully mapped to the Ensembl IDs we provided. However, the data is now in a list
object, making it a little more difficult to explore. We are going to turn our list object into a data frame object in the next chunk.
# Let's make our object a bit more manageable for exploration by turning it into a data frame
-mapped_df <- mapped_list %>%
- tibble::enframe(name = "Ensembl", value = "Entrez") %>%
- # enframe makes a `list` column, so we will convert that to simpler format with `unnest()
- # This will result in one row of our data frame per list item
- tidyr::unnest(cols = Entrez)
+# Let's make our list a bit more manageable by turning it into a data frame
+mapped_df <- mapped_list %>%
+ tibble::enframe(name = "Ensembl", value = "Entrez") %>%
+ # enframe() makes a `list` column; we will simplify it with unnest()
+ # This will result in one row of our data frame per list item
+ tidyr::unnest(cols = Entrez)
Now let’s take a peek at our data frame.
-head(mapped_df)
+
We can see that our data frame has a new column Entrez
. Let’s get a summary of the values returned in the Entrez
column of our mapped data frame.
# We can use the `summary()` function to get a better idea of the distribution of values in the `Entrez` column
-summary(as.factor(mapped_df$Entrez)) # We need to use the `as.factor()` function here in order to get the count of each unique value returned
-## 100126027 100150038 100331412 794549 554097 794406 100536331 100148591
-## 28 28 28 28 11 7 6 5
-## 100329290 100334559 100333446 393541 562179 795394 100000956 100002012
-## 5 5 3 3 3 3 2 2
-## 100002312 100002647 100002917 100004335 100005482 100006972 100007523 100007836
-## 2 2 2 2 2 2 2 2
-## 100034531 100037365 100124612 100126123 100144555 100150193 100150195 100151764
-## 2 2 2 2 2 2 2 2
-## 100170660 100170822 100191016 100318255 100330526 100330579 100334190 100334824
-## 2 2 2 2 2 2 2 2
-## 100422734 100535828 100536566 100537891 101883724 101885111 101885541 101887034
-## 2 2 2 2 2 2 2 2
-## 101887190 103908643 103908654 103909414 108182725 108182861 108190101 108190662
-## 2 2 2 2 2 2 2 2
-## 108191494 108191539 110437858 110437953 110438456 110438861 110438889 110438898
-## 2 2 2 2 2 2 2 2
-## 110438962 110438964 110439267 110439798 110440092 30459 323832 323951
-## 2 2 2 2 2 2 2 2
-## 327020 336702 368925 378480 393135 393320 393518 405787
-## 2 2 2 2 2 2 2 2
-## 415097 445235 449796 541321 550342 550507 553461 553501
-## 2 2 2 2 2 2 2 2
-## 559771 561678 563398 563587 564790 566030 566190 568680
-## 2 2 2 2 2 2 2 2
-## 569452 570815 (Other) NA's
-## 2 2 22821 9102
-There are 9102 NAs in our data frame, which means that 9102 out of the 31882 Ensembl IDs did not map to Entrez IDs. This means if you are depending on Entrez
IDs for your downstream analyses, you may not be able to use the 9102 unmapped genes.
# Use the `summary()` function to show the distribution of Entrez values
+# We need to use `as.factor()` here to get the count of unique values
+# `maxsum = 10` limits the summary to 10 distinct values
+summary(as.factor(mapped_df$Entrez), maxsum = 10)
## 100126027 100150038 100331412 794549 554097 794406 100536331
+## 28 28 28 28 11 7 6
+## 100148591 (Other) NA's
+## 5 23089 9026
+There are 9026 NA
s in our data frame, which means that 9026 of the 31882 Ensembl IDs did not map to Entrez IDs. If you are depending on Entrez
IDs for your downstream analyses, you may not be able to use the 9026 unmapped genes.
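If that is the case for your analysis, one option (a sketch, not a step in this tutorial) is to simply drop the unmapped rows before continuing:
# Keep only the genes that successfully mapped to an Entrez ID
entrez_only_df <- mapped_df %>%
  dplyr::filter(!is.na(Entrez))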
Now let’s check to see how many genes we have that were mapped to multiple IDs.
-multi_mapped_df <- mapped_df %>%
- # Let's count the number of times each Ensembl ID appears in `Ensembl` column
- dplyr::count(Ensembl, name = "entrez_id_count") %>%
- # Arrange by the genes with the highest number of Entrez IDs mapped
- dplyr::arrange(desc(entrez_id_count))
-
-# Let's look at the first 6 rows of our `multi_mapped` object
-head(multi_mapped_df)
+multi_mapped <- mapped_df %>%
+ # Let's count the number of times each Ensembl ID appears in `Ensembl` column
+ dplyr::count(Ensembl, name = "entrez_id_count") %>%
+ # Arrange by the genes with the highest number of Entrez IDs mapped
+ dplyr::arrange(desc(entrez_id_count))
+
+# Let's look at the first 6 rows of our `multi_mapped` object
+head(multi_mapped)
Now let’s write our mapped results and data to file!
@@ -3314,26 +4095,26 @@ # Write mapped and annotated data frame to output file
-readr::write_tsv(final_mapped_df, file.path(
- results_dir,
- "SRP040561_Entrez_IDs.tsv" # Replace with a relevant output file name
-))
+
At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.
-# Print session info
-sessioninfo::session_info()
-## ─ Session info ───────────────────────────────────────────────────────────────
+
+## ─ Session info ─────────────────────────────────────────────────────
## setting value
## version R version 4.0.2 (2020-06-22)
## os Ubuntu 20.04 LTS
@@ -3343,19 +4124,19 @@ 6 Session info
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
-## date 2020-10-16
+## date 2020-12-21
##
-## ─ Packages ───────────────────────────────────────────────────────────────────
+## ─ Packages ─────────────────────────────────────────────────────────
## package * version date lib source
-## AnnotationDbi * 1.50.3 2020-07-25 [1] Bioconductor
+## AnnotationDbi * 1.52.0 2020-10-27 [1] Bioconductor
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.0)
## backports 1.1.10 2020-09-15 [1] RSPM (R 4.0.2)
-## Biobase * 2.48.0 2020-04-27 [1] Bioconductor
-## BiocGenerics * 0.34.0 2020-04-27 [1] Bioconductor
+## Biobase * 2.50.0 2020-10-27 [1] Bioconductor
+## BiocGenerics * 0.36.0 2020-10-27 [1] Bioconductor
## bit 4.0.4 2020-08-04 [1] RSPM (R 4.0.2)
## bit64 4.0.5 2020-08-30 [1] RSPM (R 4.0.2)
## blob 1.2.1 2020-01-20 [1] RSPM (R 4.0.0)
-## cli 2.0.2 2020-02-28 [1] RSPM (R 4.0.0)
+## cli 2.1.0 2020-10-12 [1] RSPM (R 4.0.2)
## crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.0)
## DBI 1.1.0 2019-12-15 [1] RSPM (R 4.0.0)
## digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0)
@@ -3368,16 +4149,17 @@ 6 Session info
## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.2)
## hms 0.5.3 2020-01-08 [1] RSPM (R 4.0.0)
## htmltools 0.5.0 2020-06-16 [1] RSPM (R 4.0.1)
-## IRanges * 2.22.2 2020-05-21 [1] Bioconductor
+## IRanges * 2.24.1 2020-12-12 [1] Bioconductor
## jsonlite 1.7.1 2020-09-07 [1] RSPM (R 4.0.2)
## knitr 1.30 2020-09-22 [1] RSPM (R 4.0.2)
## lifecycle 0.2.0 2020-03-06 [1] RSPM (R 4.0.0)
## magrittr * 1.5 2014-11-22 [1] RSPM (R 4.0.0)
## memoise 1.1.0 2017-04-21 [1] RSPM (R 4.0.0)
## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0)
-## org.Dr.eg.db * 3.11.4 2020-10-06 [1] Bioconductor
+## org.Dr.eg.db * 3.12.0 2020-12-16 [1] Bioconductor
## pillar 1.4.6 2020-07-10 [1] RSPM (R 4.0.2)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.0)
+## ps 1.4.0 2020-10-07 [1] RSPM (R 4.0.2)
## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.0)
## R.cache 0.14.0 2019-12-06 [1] RSPM (R 4.0.0)
## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.2)
@@ -3385,18 +4167,18 @@ 6 Session info
## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.2)
## R6 2.4.1 2019-11-12 [1] RSPM (R 4.0.0)
## Rcpp 1.0.5 2020-07-06 [1] RSPM (R 4.0.2)
-## readr 1.3.1 2018-12-21 [1] RSPM (R 4.0.2)
+## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.2)
## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.0)
-## rlang 0.4.7 2020-07-09 [1] RSPM (R 4.0.2)
+## rlang 0.4.8 2020-10-08 [1] RSPM (R 4.0.2)
## rmarkdown 2.4 2020-09-30 [1] RSPM (R 4.0.2)
-## RSQLite 2.2.0 2020-01-07 [1] RSPM (R 4.0.2)
+## RSQLite 2.2.1 2020-09-30 [1] RSPM (R 4.0.2)
## rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0)
-## S4Vectors * 0.26.1 2020-05-16 [1] Bioconductor
+## S4Vectors * 0.28.1 2020-12-09 [1] Bioconductor
## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.0)
## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.2)
## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.0)
## styler 1.3.2 2020-02-23 [1] RSPM (R 4.0.0)
-## tibble 3.0.3 2020-07-10 [1] RSPM (R 4.0.2)
+## tibble 3.0.4 2020-10-12 [1] RSPM (R 4.0.2)
## tidyr 1.1.2 2020-08-27 [1] RSPM (R 4.0.2)
## tidyselect 1.1.0 2020-05-11 [1] RSPM (R 4.0.0)
## vctrs 0.3.4 2020-08-29 [1] RSPM (R 4.0.2)
@@ -3411,23 +4193,22 @@ 6 Session info
References
-Carlson M., 2019 Genome wide annotation for zebrafish
-
-
-Carlson M., 2020a AnnotationDbi
+Carlson M., 2019 Genome wide annotation for zebrafish. https://bioconductor.org/packages/release/data/annotation/html/org.Dr.eg.db.html
-Carlson M., 2020b AnnotationDbi: Introduction to bioconductor annotation packages
+Carlson M., 2020 AnnotationDbi: Introduction to bioconductor annotation packages. https://bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf
-
-CCDL, 2020 Obtaining annotation for ensembl ids - microarray.
-
-
-R Bioconductor Team, 2003 Packages found under annotationdata
+
+Pagès H., M. Carlson, S. Falcon, and N. Li, 2020 AnnotationDbi: Manipulation of SQLite-based annotations in Bioconductor. https://bioconductor.org/packages/release/bioc/html/AnnotationDbi.html
+
.Rmd
Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!
If you have trouble running this chunk, see our introduction to using .Rmd
s for more resources and explanations.
# Create the data folder if it doesn't exist
-if (!dir.exists("data")) {
- dir.create("data")
-}
-
-# Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
-
-# Create the plots folder if it doesn't exist
-if (!dir.exists(plots_dir)) {
- dir.create(plots_dir)
-}
-
-# Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
-
-# Create the results folder if it doesn't exist
-if (!dir.exists(results_dir)) {
- dir.create(results_dir)
-}
+# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+ dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots"
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+ dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results"
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+ dir.create(results_dir)
+}
In the same place you put this .Rmd
file, you should now have three new empty folders called data
, plots
, and results
!
For this example analysis, we will use this acute myeloid leukemia sample dataset.
-The data that we downloaded from refine.bio for this analysis has 19 samples (obtained from 19 acute myeloid leukemia (AML) mouse models), containing RNA-sequencing results for types of AML under controlled treatment conditions. More specifically, IDH2-mutant AML mouse models were treated with either vehicle or AG-221 (the first small molecule in-vivo inhibitor of IDH2 to enter clinical trials). While, the TET2-mutant AML mouse models were treated with vehicle or 5-Azacytidine (Decitabine, hypomethylating agent).
+The data that we downloaded from refine.bio for this analysis has 19 samples (obtained from 19 acute myeloid leukemia (AML) mouse models), containing RNA-sequencing results for two types of AML under controlled treatment conditions. More specifically, IDH2-mutant AML mouse models were treated with either vehicle or AG-221 (the first small molecule in-vivo inhibitor of IDH2 to enter clinical trials). The TET2-mutant AML mouse models were treated with vehicle or 5-Azacytidine (Decitabine, a hypomethylating agent).
data/
folderIn order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.
First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy because if we want to switch the dataset we are using for this analysis (see the next section for more on this), we will only have to change the file paths here to get started.
-# Define the file path to the data directory
-data_dir <- file.path("data", "SRP070849") # Replace with accession number which will be the name of the folder the files will be in
-
-# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
-data_file <- file.path(data_dir, "SRP070849.tsv") # Replace with file path to your dataset
-
-# Declare the file path to the metadata file using the data directory saved as `data_dir`
-metadata_file <- file.path(data_dir, "metadata_SRP070849.tsv") # Replace with file path to your metadata
+# Define the file path to the data directory
+# Replace with the path of the folder the files will be in
+data_dir <- file.path("data", "SRP070849")
+
+# Declare the file path to the gene expression matrix file
+# inside directory saved as `data_dir`
+# Replace with the path to your dataset file
+data_file <- file.path(data_dir, "SRP070849.tsv")
+
+# Declare the file path to the metadata file
+# inside the directory saved as `data_dir`
+# Replace with the path to your metadata file
+metadata_file <- file.path(data_dir, "metadata_SRP070849.tsv")
Now that our file paths are declared, we can use the file.exists()
function to check that the files are where we specified above.
# Check if the gene expression matrix file is at the file path stored in `data_file`
-file.exists(data_file)
+# Check if the gene expression matrix file is at the path stored in `data_file`
+file.exists(data_file)
## [1] TRUE
-# Check if the metadata file is at the file path stored in `metadata_file`
-file.exists(metadata_file)
+# Check if the metadata file is at the file path stored in `metadata_file`
+file.exists(metadata_file)
## [1] TRUE
If the chunk above printed out FALSE
to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.
@@ -3074,31 +3905,43 @@See our Getting Started page with instructions for package installation for a list of the software you will need, as well as more tips and resources.
Attach a package we need for this analysis.
-# We will need this so we can use the pipe: %>%
-library(magrittr)
+
The HUGO Gene Nomenclature Committee (HGNC) assigns a unique and ideally meaningful name and symbol to every human gene. The HGNC database currently contains over 39,000 public records containing approved human gene nomenclature and associated gene information (Gray et al. 2015).
-The HGNC Comparison of Orthology Predictions (HCOP) is a search tool that combines orthology predictions for a specified human gene, or set of human genes from a variety of sources, including Ensembl Compara, HGNC, and NCBI Gene Orthology (Wright et al. 2005). In general, an orthology prediction where most of the databases concur would be considered the reliable, and we will use this to prioritize mapping in cases where there is more than one possible ortholog for a gene. HCOP was originally designed to show orthology predictions between human and mouse, but has been expanded to include data from 18 genomes (HGNC team 2020).
-First, we need to download the file from the server holding the HGNC data. Go to this directory page of the HGNC Comparison of Orthology Predictions (HCOP) files.
-This is where the files that reflect the data provided via the HGNC database are maintained. Ortholog species files with the ‘6 Column’ output returns the raw assertions, Ensembl gene IDs and Entrez Gene IDs for human and one other species, while the ‘15 Column’ output includes additional information such as the chromosomal location, accession numbers and the databases that support the assertions.
-Note: If you are using Safari (or the above FTP server link does not open in a web browser), you may need to go to the link for the HCOP search tool and scroll down to “Bulk Downloads” to choose a file to download. Here, you can find the same files you would find at the server linked above.
-To download a file, click the file name. For this notebook, you will want to download the file named human_mouse_hcop_fifteen_column.txt.gz
. If you are using a different dataset, you can replace mouse
in human_mouse_hcop_fifteen_column.txt.gz
with the name of the species you have data for, and click on that file to download.
Next, move the human_mouse_hcop_fifteen_column.txt.gz
file into your data/
folder.
Note: If you are using Safari, this file will automatically be decompressed, so the name of the file would instead be human_mouse_hcop_fifteen_column.txt
(don’t forget to change the file name in the chunk below if this is the case).
Now let’s double check that the file is in the right place.
-# Define the file path to organism orthology file downloaded from the HGNC database
-mouse_hgnc_file <- file.path("data", "human_mouse_hcop_fifteen_column.txt.gz")
-
-# Check if the organism orthology file file is in the `data` directory
-file.exists(mouse_hgnc_file)
+The HGNC Comparison of Orthology Predictions (HCOP) is a search tool that combines orthology predictions for a specified human gene, or set of human genes, from a variety of sources, including Ensembl Compara, HGNC, and NCBI Gene Orthology (Wright et al. 2005). In general, an orthology prediction where most of the databases concur would be considered reliable, and we will use this to prioritize mapping in cases where there is more than one possible ortholog for a gene. HCOP was originally designed to show orthology predictions between human and mouse, but has been expanded to include data from 18 genomes, including mouse, which we will use in this notebook (HGNC Team 2020).
+We can download the human-mouse orthology file we need for this example using the download.file()
command. For this notebook, we want to download the file named human_mouse_hcop_fifteen_column.txt.gz
.
First we’ll declare a sensible file path for this.
+# Declare what we want the downloaded file to be called and its location
+mouse_hgnc_file <- file.path(
+ data_dir,
+ # The name the file will have locally
+ "human_mouse_hcop_fifteen_column.txt.gz"
+)
Using the file path we just declared, we can use the destfile
argument to download the file we need to this location with this file name.
We are downloading this orthology predictions file from the HGNC database. If you are looking for a different species, see the directory page of the HGNC Comparison of Orthology Predictions (HCOP) files and find the file name of the species you are looking for.
+download.file(
+ paste0(
+ "http://ftp.ebi.ac.uk/pub/databases/genenames/hcop/",
+ # Replace with the file name for the species conversion you want
+ "human_mouse_hcop_fifteen_column.txt.gz"
+ ),
+ # The file will be saved to the name and location we defined earlier
+ destfile = mouse_hgnc_file
+)
If you are using a different dataset, in the last chunk you can replace mouse
in human_mouse_hcop_fifteen_column.txt.gz
with the name of the species you have data for (if you see it listed in the directory).
+Ortholog species files with the ‘6 Column’ output return the raw assertions, Ensembl gene IDs, and Entrez Gene IDs for human and one other species, while the ‘15 Column’ output includes additional information such as the chromosomal location, accession numbers, and the databases that support the assertions.
+Now let’s double check that the mouse ortholog file is in the right place.
+# Check if the organism orthology file is in the `data` directory
+file.exists(mouse_hgnc_file)
## [1] TRUE
-In the next chunk, we will read in the orthology file that was just downloaded.
-# Read in the data from HGNC
-mouse <- readr::read_tsv(mouse_hgnc_file)
-## Parsed with column specification:
+Now we can read in the orthology file that we downloaded.
+
+##
+## ── Column specification ──────────────────────────────────────────────
## cols(
## human_entrez_gene = col_character(),
## human_ensembl_gene = col_character(),
@@ -3117,104 +3960,107 @@ 4.2 Import data from HGNC
## support = col_character()
## )
Let’s take a look at what mouse
looks like.
-mouse
+
We are going to manipulate some of the column names to make things easier when calling them downstream.
-mouse <- mouse %>%
- set_names(names(.) %>%
- # Removing extra word in some of the column names
- gsub("_gene$", "", .))
+
Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read the data TSV file and add it as an object to your environment.
We stored our data file path as an object named data_file
in this previous step.
# Read in data TSV file
-mouse_genes <- readr::read_tsv(data_file) %>%
- # We only want the gene IDs so let's pull the `Gene` column
- dplyr::pull("Gene")
-## Parsed with column specification:
+# Read in data TSV file
+mouse_genes <- readr::read_tsv(data_file) %>%
+ # We only want the gene IDs so let's keep only the `Gene` column
+ dplyr::select("Gene")
+##
+## ── Column specification ──────────────────────────────────────────────
## cols(
## .default = col_double(),
## Gene = col_character()
-## )
-## See spec(...) for full column specifications.
+## )
+## ℹ Use `spec()` for the full column specifications.
refine.bio data uses Ensembl gene identifiers, which will be in the first column.
-# Let's take a look at the first 6 items of `mouse_genes`
-head(mouse_genes)
-## [1] "ENSMUSG00000000001" "ENSMUSG00000000003" "ENSMUSG00000000028"
-## [4] "ENSMUSG00000000031" "ENSMUSG00000000037" "ENSMUSG00000000049"
+
+Ensembl gene identifiers have different species-specific prefixes. In mouse, Ensembl gene identifiers begin with ENSMUSG
(in human, ENSG
, etc.).
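As a quick sanity check (a sketch assuming the mouse_genes data frame read in above), we can confirm that every identifier in our data carries the mouse prefix:
# TRUE only if every gene ID in our data starts with the mouse prefix
all(startsWith(mouse_genes$Gene, "ENSMUSG"))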
Now let’s do the mapping!
We’re interested in the human_ensembl
, mouse_ensembl
, and support
columns specifically. The support
column contains a list of associated databases that support each assertion. This column may assist with addressing some of the multi-mappings that we will talk about later.
human_mouse_key <- mouse %>%
- # We'll want to subset mouse to only the columns we're interested in
- dplyr::select(mouse_ensembl, human_ensembl, support)
-
-# Since we ignored the additional columns in `mouse`, let's check to see if we
-# have any duplicates in our `human_mouse_key`
-any(duplicated(human_mouse_key))
+human_mouse_key <- mouse %>%
+ # We'll want to subset mouse to only the columns we're interested in
+ dplyr::select(mouse_ensembl, human_ensembl, support)
+
+# Since we ignored the additional columns in `mouse`, let's check to see if we
+# have any duplicates in our `human_mouse_key`
+any(duplicated(human_mouse_key))
## [1] TRUE
We do have duplicates! We don’t want to handle duplicate data, so let’s remove those duplicates before moving forward.
-human_mouse_key <- human_mouse_key %>%
- # We need to use the `distinct()` function to remove duplicates resulted from
- # ignoring the additional columns in the `mouse` object
- dplyr::distinct()
+human_mouse_key <- human_mouse_key %>%
+ # Use the `distinct()` function to remove duplicates resulting from
+ # dropping the additional columns in the `mouse` data frame
+ dplyr::distinct()
Now let’s join the mapped data from human_mouse_key
with the gene data in mouse_genes
.
# First, we need to convert our vector of mouse genes into a data frame
-human_mouse_mapped_df <- data.frame("Gene" = mouse_genes) %>%
- # Now we can join the mapped data
- dplyr::left_join(human_mouse_key, by = c("Gene" = "mouse_ensembl"))
+# First, we need to convert our vector of mouse genes into a data frame
+human_mouse_mapped_df <- mouse_genes %>%
+ # Now we can join the mapped data
+ dplyr::left_join(human_mouse_key, by = c("Gene" = "mouse_ensembl"))
Here’s what the new data frame looks like:
-head(human_mouse_mapped_df, n = 25)
+
Looks like we have mapped IDs!
+So now we have all the mouse genes mapped to human, but there might be places where there are multiple mouse genes that are orthologous to the same human gene, or vice versa.
Let’s get a summary of the Ensembl IDs returned in the human_ensembl
column of our mapped data frame.
# We can use this `count()` function after `group_by()`to get a count of how many
-# `mouse_ensembl` IDs there are per `human_ensembl` IDs
-human_mouse_mapped_df %>%
- dplyr::group_by(human_ensembl) %>%
- dplyr::count() %>%
- # Sort by highest `n` which would be the human Ensembl ID with the most mapped
- # mouse Ensembl IDs
- dplyr::arrange(desc(n))
+# We can use the `count()` function to get a tally of how many
+# `mouse_ensembl` IDs there are per `human_ensembl` ID
+human_mouse_mapped_df %>%
+ # Count the number of rows per human gene
+ dplyr::count(human_ensembl) %>%
+ # Sort by highest `n` which will be the human Ensembl ID with the most mapped
+ # mouse Ensembl IDs
+ dplyr::arrange(desc(n))
Looks like we have mapped IDs!
-Now, let’s get an idea of how many mouse Ensembl IDs we have that were not mapped to human Ensembl IDs.
-sum(is.na(human_mouse_mapped_df$human_ensembl))
-## [1] 18801
-We have 18,801 NAs, which means we have 18,801 mouse Ensembl IDs out of the 64,816 in human_mouse_mapped_df
, that were not mapped to human Ensembl IDs. This is okay because we do not expect everything to map across species.
There are certainly a good number of places where we mapped multiple mouse Ensembl IDs to the same human symbol! We’ll look at this in a bit.
+We can also see that there are 19,126 mouse Ensembl IDs that did not map to a human Ensembl ID. These are the rows with a value of NA. This seems like a lot, but most of these are likely to be non-protein-coding genes that do not map easily across species.
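If you would like to calculate that count yourself, here is a one-line sketch using the mapped data frame from above:
# Count the mouse genes with no human ortholog (NA in `human_ensembl`)
sum(is.na(human_mouse_mapped_df$human_ensembl))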
If a mouse Ensembl gene ID maps to multiple human Ensembl IDs, the associated values will get duplicated. Let’s look at the ENSMUSG00000000290
example below.
human_mouse_mapped_df %>%
- dplyr::filter(Gene == "ENSMUSG00000000290")
+
On the other hand, if you were to look at the original data associated to the mouse Ensembl IDs, when a human Ensembl ID maps to multiple mouse Ensembl IDs, the values will not get duplicated, but you will have multiple rows associated with that human Ensembl ID. Let’s look at the ENSG00000001617
example below.
human_mouse_mapped_df %>%
- dplyr::filter(human_ensembl == "ENSG00000001617")
+
Remember that if a mouse Ensembl gene ID maps to multiple human symbols, the values get duplicated. We can collapse the multi-mapped values into a single entry for each Ensembl ID so that we do not have duplicate values in our data frame.
In the next chunk, we show how we can collapse all the human Ensembl IDs into one column where we store them all for each unique mouse Ensembl ID.
-collapsed_human_ensembl_df <- human_mouse_mapped_df %>%
- # Group by mouse Ensembl IDs
- dplyr::group_by(Gene) %>%
- # Collapse the mapped values in `human_mouse_mapped_df` into one column named
- # `all_human_ensembls` -- note that we will lose the `support` column in this summarizing step
- dplyr::summarize(all_human_ensembls = paste(human_ensembl, collapse = ";"))
+collapsed_human_ensembl_df <- human_mouse_mapped_df %>%
+ # Group by mouse Ensembl IDs
+ dplyr::group_by(Gene) %>%
+ # Collapse the mapped values in `human_mouse_mapped_df` to a
+ # `all_human_ensembl` column, removing any duplicated human symbols
+ # note that we will lose the `support` column in this summarizing step
+ dplyr::summarize(
+ # combine unique symbols with semicolons between them
+ all_human_ensembl = paste(
+ sort(unique(human_ensembl)),
+ collapse = ";"
+ )
+ )
## `summarise()` ungrouping output (override with `.groups` argument)
-head(collapsed_human_ensembl_df)
+
Now let’s write our results to file.
-readr::write_tsv(
- collapsed_human_ensembl_df,
- file.path(
- results_dir,
- # Replace with a relevant output file name
- "SRP070849_mouse_ensembl_to_collapsed_human_gene_symbol.tsv"
- )
-)
+
Since multiple mouse Ensembl gene IDs map to the same human Ensembl gene ID, we may want to identify which one of these mappings represents the “true” ortholog, i.e. which mouse gene is most similar to the human gene we are interested in. This is not at all straightforward! (see this paper for just one example) (Stamboulian et al. 2020). Gene duplications along the mouse lineage may result in complicated relationships among genes, especially with regard to divisions of function.
Simply combining values across mouse transcripts using an average may result in the loss of a lot of data and will likely not be representative of the mouse biology. One thing we might do to make the problem somewhat simpler is to reduce the number of multi-mapped genes by requiring a certain level of support for each mapping from across the various databases included in HCOP
. This will not fully solve the problem (and may not even be desirable in some cases), but we present it here as an example of an approach one might take.
Therefore, we will use the support
column to decide which mappings to retain. Let’s take a look at support
.
head(human_mouse_mapped_df$support)
-## [1] "OrthoMCL,OrthoDB"
-## [2] "OrthoDB"
-## [3] "Inparanoid,PhylomeDB,NCBI,HomoloGene,HGNC,Treefam,OrthoMCL,OMA,Panther,Ensembl,OrthoDB"
-## [4] NA
-## [5] "Inparanoid,PhylomeDB,HomoloGene,EggNOG,NCBI,HGNC,Treefam,OMA,Panther,Ensembl,OrthoDB"
+
+## [1] "OrthoDB,OrthoMCL"
+## [2] "OrthoDB,OrthoMCL"
+## [3] "HomoloGene,Inparanoid,PhylomeDB,Ensembl,Treefam,OMA,Panther,HGNC,NCBI,OrthoDB,OrthoMCL"
+## [4] NA
+## [5] "HomoloGene,Inparanoid,PhylomeDB,Ensembl,EggNOG,Treefam,OMA,Panther,HGNC,NCBI,OrthoDB,OrthoMCL"
## [6] "NCBI"
Looks like we have a variety of databases for multiple mappings, but we do have some instances of only one database reported in support of the mapping. As we noted earlier, an orthology prediction where more than one of the databases concur would be considered reliable. Therefore, where we have multi-mapped mouse Ensembl gene IDs, we will take the mappings with more than one database to support the assertion.
Before we do, let’s take a look at how many multi-mapped genes there are in the data frame.
-human_mouse_mapped_df %>%
- # Group by human Ensembl IDs
- dplyr::group_by(human_ensembl) %>%
- # Count the number of rows in the data frame for each Ensembl ID
- dplyr::count() %>%
- # Filter out the symbols without multimapped genes
- dplyr::filter(n > 1)
+human_mouse_mapped_df %>%
+ # Count the number of rows in the data frame for each Ensembl ID
+ dplyr::count(human_ensembl) %>%
+ # Filter out the symbols without multimapped genes
+ dplyr::filter(n > 1)
-Looks like we have 6,608 human gene Ensembl IDs with multiple mappings.
+Looks like we have 6,971 human gene Ensembl IDs with multiple mappings.
Now let’s filter out the less reliable mappings.
-filtered_mouse_ensembl_df <- human_mouse_mapped_df %>%
- # Count the number of databases in the support column for each prediction
- dplyr::mutate(n_databases = stringr::str_count(support, ",") + 1) %>%
- # Group by human Ensembl IDs
- dplyr::group_by(human_ensembl) %>%
- # Now filter for the rows with more than one database in support for each
- # human Ensembl ID
- dplyr::filter(n_databases > 1)
-
-head(filtered_mouse_ensembl_df)
+filtered_mouse_ensembl_df <- human_mouse_mapped_df %>%
+ # Count the number of databases in the support column
+ # by using the number of commas that separate the databases
+ dplyr::mutate(n_databases = stringr::str_count(support, ",") + 1) %>%
+ # Now filter to the rows where more than one database supports the mapping
+ dplyr::filter(n_databases > 1)
+
+head(filtered_mouse_ensembl_df)
Let’s count how many multi-mapped genes we have now.
-filtered_mouse_ensembl_df %>%
- # Group by human Ensembl IDs
- dplyr::group_by(human_ensembl) %>%
- # Count the number of rows in the data frame for each Ensembl ID
- dplyr::count() %>%
- # Filter out the symbols without multimapped genes
- dplyr::filter(n > 1)
+filtered_mouse_ensembl_df %>%
+ # Group by human Ensembl IDs
+ dplyr::group_by(human_ensembl) %>%
+ # Count the number of rows in the data frame for each Ensembl ID
+ dplyr::count() %>%
+ # Filter out the symbols without multimapped genes
+ dplyr::filter(n > 1)
-Now we only have 2,532 multi-mapped genes, compared to the 6,608 that we began with. Although we haven’t filtered down to zero multi-mapped genes, we have hopefully removed some of the less reliable mappings.
+Now we only have 2,702 multi-mapped genes, compared to the 6,971 that we began with. Although we haven’t filtered down to zero multi-mapped genes, we have hopefully removed some of the less reliable mappings.
4.7.1 Write results to file
Now let’s write our filtered_mouse_ensembl_df
object, with the reliable mouse Ensembl IDs for each unique human gene Ensembl ID, to file.
-readr::write_tsv(
- filtered_mouse_ensembl_df,
- file.path(
- results_dir,
- # Replace with a relevant output file name
- "SRP070849_filtered_mouse_ensembl_to_human_gene_symbol.tsv"
- )
-)
+
At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.
-# Print session info
-sessioninfo::session_info()
-## ─ Session info ───────────────────────────────────────────────────────────────
+
+## ─ Session info ─────────────────────────────────────────────────────
## setting value
## version R version 4.0.2 (2020-06-22)
## os Ubuntu 20.04 LTS
@@ -3343,13 +4192,13 @@ 6 Session info
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
-## date 2020-10-16
+## date 2020-12-21
##
-## ─ Packages ───────────────────────────────────────────────────────────────────
+## ─ Packages ─────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.0)
## backports 1.1.10 2020-09-15 [1] RSPM (R 4.0.2)
-## cli 2.0.2 2020-02-28 [1] RSPM (R 4.0.0)
+## cli 2.1.0 2020-10-12 [1] RSPM (R 4.0.2)
## crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.0)
## digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0)
## dplyr 1.0.2 2020-08-18 [1] RSPM (R 4.0.2)
@@ -3368,23 +4217,23 @@ 6 Session info
## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0)
## pillar 1.4.6 2020-07-10 [1] RSPM (R 4.0.2)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.0)
+## ps 1.4.0 2020-10-07 [1] RSPM (R 4.0.2)
## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.0)
## R.cache 0.14.0 2019-12-06 [1] RSPM (R 4.0.0)
## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.2)
## R.oo 1.24.0 2020-08-26 [1] RSPM (R 4.0.2)
## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.2)
## R6 2.4.1 2019-11-12 [1] RSPM (R 4.0.0)
-## Rcpp 1.0.5 2020-07-06 [1] RSPM (R 4.0.2)
-## readr 1.3.1 2018-12-21 [1] RSPM (R 4.0.2)
+## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.2)
## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.0)
-## rlang 0.4.7 2020-07-09 [1] RSPM (R 4.0.2)
+## rlang 0.4.8 2020-10-08 [1] RSPM (R 4.0.2)
## rmarkdown 2.4 2020-09-30 [1] RSPM (R 4.0.2)
## rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0)
## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.0)
## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.2)
## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.0)
## styler 1.3.2 2020-02-23 [1] RSPM (R 4.0.0)
-## tibble 3.0.3 2020-07-10 [1] RSPM (R 4.0.2)
+## tibble 3.0.4 2020-10-12 [1] RSPM (R 4.0.2)
## tidyselect 1.1.0 2020-05-11 [1] RSPM (R 4.0.0)
## vctrs 0.3.4 2020-08-29 [1] RSPM (R 4.0.2)
## withr 2.3.0 2020-09-22 [1] RSPM (R 4.0.2)
@@ -3398,20 +4247,25 @@ 6 Session info
References
-Gray K. A., B. Yates, R. L. Seal, M. W. Wright, and E. A. Bruford, 2015 Genenames.org: The hgnc resources in 2015. Nucleic Acids Res 43. https://doi.org/10.1038/nature11327
+Gray K. A., B. Yates, R. L. Seal, M. W. Wright, and E. A. Bruford, 2015 Genenames.org: The HGNC resources in 2015. Nucleic Acids Research 43. https://doi.org/10.1038/nature11327
-HGNC team, 2020 HCOP help
+HGNC Team, 2020 HCOP help. https://www.genenames.org/help/hcop/
Stamboulian M., R. F. Guerrero, M. W. Hahn, and P. Radivojac, 2020 The ortholog conjecture revisited: The value of orthologs and paralogs in function prediction. Bioinformatics 36: i219–i226. https://doi.org/10.1093/bioinformatics/btaa468
-Wright M. W., T. A. Eyre, M. J. Lush, S. Povey, and E. A. Bruford, 2005 HCOP: The hgnc comparison of orthology predictions search tool. Mammalian Genome 16: 827–8. https://doi.org/10.1007/s00335-005-0103-2
+Wright M. W., T. A. Eyre, M. J. Lush, S. Povey, and E. A. Bruford, 2005 HCOP: The HGNC comparison of orthology predictions search tool. Mammalian Genome 16: 827–8. https://doi.org/10.1007/s00335-005-0103-2
This example is one of a set of pathway analysis modules; we recommend looking at the pathway analysis table below to help you determine which pathway analysis method is best suited for your purposes.
+This particular example analysis shows how you can use over-representation analysis (ORA) to determine if a set of genes shares more or fewer genes with gene sets/pathways than we would expect by chance.
+ORA is a broadly applicable technique that may be good in scenarios where your dataset or scientific questions don’t fit the requirements of other pathway analysis methods. It also does not require any particular sample size, since the only input from your dataset is a set of genes of interest (Yaari et al. 2013).
+If you have differential expression results or something with a gene-level ranking and a two-group comparison, we recommend considering GSEA for your pathway analysis questions. ⬇️ Jump to the analysis code ⬇️
+Pathway analysis refers to any one of many techniques that use predetermined sets of genes that are related or coordinated in their expression in some way (e.g., participate in the same molecular process, are regulated by the same transcription factor) to interpret a high-throughput experiment. In the context of refine.bio, we use these techniques to analyze and interpret genome-wide gene expression experiments. The rationale for performing pathway analysis is that looking at the pathway level may be more biologically meaningful than considering individual genes, especially if a large number of genes are differentially expressed between conditions of interest. In addition, many relatively small changes in the expression values of genes in the same pathway could lead to a phenotypic outcome, and these small changes may go undetected in differential gene expression analysis.
+We highly recommend taking a look at Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges from Khatri et al. (2012) for a more comprehensive overview. We have provided primary publications and documentation of the methods we will introduce below, as well as some recommended reading in the Resources for further learning
section.
This table summarizes the pathway analyses examples in this module.
+| Analysis | What is required for input | What output looks like | ✅ Pros | ⚠️ Cons |
+|---|---|---|---|---|
+| ORA (Over-representation Analysis) | A list of gene IDs (no stats needed) | A per-pathway hypergeometric test result | Simple; inexpensive computationally to calculate p-values | Requires arbitrary thresholds and ignores any statistics associated with a gene; assumes independence of genes and pathways |
+| GSEA (Gene Set Enrichment Analysis) | A list of gene IDs with gene-level summary statistics | A per-pathway enrichment score | Includes all genes (no arbitrary threshold!); attempts to measure coordination of genes | Permutations can be expensive; does not account for pathway overlap; two-group comparisons not always appropriate/feasible |
+| GSVA (Gene Set Variation Analysis) | A gene expression matrix (like what you get from refine.bio directly) | Pathway-level scores on a per-sample basis | Does not require two groups to compare upfront; normally distributed scores | Scores are not a good fit for gene sets that contain genes that go up AND down; method doesn’t assign statistical significance itself; recommended sample size n > 10 |
For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.
+.Rmd
fileTo run this example yourself, download the .Rmd
for this analysis by clicking this link.
Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd
file to where you would like this example and its files to be stored.
You can open this .Rmd
file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd
files.)
Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!
+If you have trouble running this chunk, see our introduction to using .Rmd
s for more resources and explanations.
# Define the file path to the data directory
+data_dir <- "data"
+
+# Create the data folder if it doesn't exist
+if (!dir.exists(data_dir)) {
+  dir.create(data_dir)
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots"
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+ dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results"
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+ dir.create(results_dir)
+}
In the same place you put this .Rmd
file, you should now have three new empty folders called data
, plots
, and results
!
For this example analysis, we will use this acute viral bronchiolitis dataset. The data that we downloaded from refine.bio for this analysis has 62 paired peripheral blood mononuclear cell RNA-seq samples obtained from 31 patients. Samples were collected at two time points: during their first, acute bronchiolitis visit (abbreviated “AV”) and during their recovery, post-convalescence visit (abbreviated “CV”).
+We used this dataset to identify modules of co-expressed genes in an example analysis using WGCNA
(Langfelder and Horvath 2008).
We have provided this file for you and the code in this notebook will read in the results that are stored online. If you’d like to follow the steps for creating this results file from the refine.bio data, we suggest going through that WGCNA example.
+Module 19 was the most differentially expressed between the dataset’s two time points (during illness and recovering from illness).
+The heatmap below summarizes the expression of the genes that make up module 19.
+ +Each row is a gene that is a member of module 19, and the composite expression of these genes, called an eigengene, is shown in the barplot below. This plot demonstrates how these genes, together as a module, are differentially expressed between the two time points.
+Your new analysis folder should contain:
+.Rmd
you downloadeddata
(currently empty)plots
(currently empty)results
(currently empty)Your example analysis folder should contain your .Rmd
and three empty folders (which won’t be empty for long!).
If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.
+The file we use here has two columns from our WGCNA module results: the ID of each gene and the module it is part of. If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend replacing the gene_module_url
 with a different file path to read in a similar table of genes with the information that you are interested in. If your gene table differs, many steps will need to be changed or deleted entirely depending on the contents of that file (particularly in the Determine our genes of interest list
section).
We suggest saving plots and results to plots/
and results/
directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.
clusterProfiler
- RNA-seqSee our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.
+In this analysis, we will be using the clusterProfiler
package to perform ORA and the msigdbr
package which contains gene sets from the Molecular Signatures Database (MSigDB) already in the tidy format required by clusterProfiler
(Yu et al. 2012; Dolgalev 2020; Subramanian et al. 2005; Liberzon et al. 2011).
We will also need the org.Hs.eg.db
package (Carlson 2020) to perform gene identifier conversion and ggupset
to make an UpSet plot (Ahlmann-Eltze 2020).
if (!("clusterProfiler" %in% installed.packages())) {
+ # Install this package if it isn't installed yet
+ BiocManager::install("clusterProfiler", update = FALSE)
+}
+
+# This is required to make one of the plots that clusterProfiler will make
+if (!("ggupset" %in% installed.packages())) {
+ # Install this package if it isn't installed yet
+ BiocManager::install("ggupset", update = FALSE)
+}
+
+if (!("msigdbr" %in% installed.packages())) {
+ # Install this package if it isn't installed yet
+ BiocManager::install("msigdbr", update = FALSE)
+}
+
+if (!("org.Hs.eg.db" %in% installed.packages())) {
+ # Install this package if it isn't installed yet
+ BiocManager::install("org.Hs.eg.db", update = FALSE)
+}
Attach the packages we need for this analysis.
+For ORA, we only need a list of genes of interest and a background gene set as our input, so this example can work for any situation where you have a gene list and want to know more about which biological pathways are significantly represented.
+For this example, we will read in results from a co-expression network analysis that we have already performed. Rather than reading from a local file, we will download the results table directly from a URL. These results are from an acute bronchiolitis experiment we used for an example WGCNA analysis (Langfelder and Horvath 2008).
+The table contains two columns, one with Ensembl gene IDs, and the other with the name of the module they are a part of. We will perform ORA on one of the modules we identified in the WGCNA analysis, while the rest of the genes will be used as “background genes”.
+Instead of using the URL below, you can use a file path to a TSV file with your desired gene list. First we will assign the URL to its own variable called gene_module_url
.
# Define the url to your gene list file
+gene_module_url <- "https://refinebio-examples.s3.us-east-2.amazonaws.com/04-advanced-topics/results/SRP140558_wgcna_gene_to_module.tsv"
We will also declare a file path where we want this file to be downloaded, and we can use the same file path later for reading the file into R.
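The chunk that defines gene_module_file is not shown here; a minimal sketch (assuming the download is saved to the results directory and reuses the file name from the end of the URL) could look like this:

```r
# These two values are defined earlier in this notebook; they are
# repeated here so the sketch runs on its own
results_dir <- "results"
gene_module_url <- "https://refinebio-examples.s3.us-east-2.amazonaws.com/04-advanced-topics/results/SRP140558_wgcna_gene_to_module.tsv"

# Build the destination path from the directory and the URL's file name
gene_module_file <- file.path(results_dir, basename(gene_module_url))
```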
+ +Using the URL (gene_module_url
) and file path (gene_module_file
) we can download the file and use the destfile
argument to specify where it should be saved.
download.file(
+ gene_module_url,
+ # The file will be saved to this location and with this name
+ destfile = gene_module_file
+)
Now let’s double check that the results file is in the right place.
+ +## [1] TRUE
+Read in the file that has WGCNA gene and module results.
+# Read in the contents of the WGCNA gene modules file
+gene_module_df <- readr::read_tsv(gene_module_file)
##
+## ── Column specification ──────────────────────────────────────────────
+## cols(
+## gene = col_character(),
+## module = col_character()
+## )
+Note that read_tsv()
can also read TSV files directly from a URL and doesn’t necessarily require you download the file first. If you wanted to use that feature, you could replace the call above with readr::read_tsv(gene_module_url)
and skip the download steps.
Let’s take a look at this gene module table.
+ +msigdbr
The Molecular Signatures Database (MSigDB) is a resource that contains annotated gene sets that can be used for pathway or gene set analyses (Subramanian et al. 2005; Liberzon et al. 2011). We can use the msigdbr
package to access these gene sets in a format compatible with the package we’ll use for analysis, clusterProfiler
(Yu et al. 2012; Dolgalev 2020).
The gene sets available directly from MSigDB are applicable to human studies. msigdbr
also supports commonly studied model organisms.
Let’s take a look at what organisms the package supports with msigdbr_species()
.
The data we’re interested in here comes from human samples, so we can obtain only the gene sets relevant to H. sapiens with the species
argument to msigdbr()
.
MSigDB contains 8 different gene set collections (Subramanian et al. 2005; Liberzon et al. 2011) that are distinguished by how they are derived (e.g., computationally mined, curated). In this example, we will use pathways that are gene sets considered to be “canonical representations of a biological process compiled by domain experts” and are a subset of C2: curated gene sets
(Subramanian et al. 2005; Liberzon et al. 2011).
Specifically, we will use the KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways (Kanehisa and Goto 2000).
+First, let’s take a look at what information is included in this data frame.
+ +We will need to use gs_cat
and gs_subcat
columns to construct a filter step that will only keep curated gene sets and KEGG pathways.
# Filter the human data frame to the KEGG pathways that are included in the
+# curated gene sets
+hs_kegg_df <- hs_msigdb_df %>%
+ dplyr::filter(
+ gs_cat == "C2", # This is to filter only to the C2 curated gene sets
+ gs_subcat == "CP:KEGG" # This is because we only want KEGG pathways
+ )
The enricher()
 function we will use requires a data frame with two columns, where one column contains the term identifier or name, and the other contains gene identifiers that match the gene lists we want to check for enrichment.
Our data frame with KEGG terms contains Entrez IDs and gene symbols.
+In our gene module data frame, gene_module_df
, we have Ensembl gene identifiers, so we will need to convert our Ensembl IDs into either gene symbols or Entrez IDs.
We’re going to convert our identifiers in gene_module_df
to Entrez IDs, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers!
The annotation package org.Hs.eg.db
contains information for different identifiers (Carlson 2020). org.Hs.eg.db
is specific to Homo sapiens – this is what the Hs
in the package name is referencing.
Take a look at our other gene identifier conversion examples for examples with different species and gene ID types: the microarray example and the RNA-seq example.
+We can see what types of IDs are available to us in an annotation package with keytypes()
.
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT"
+## [5] "ENSEMBLTRANS" "ENTREZID" "ENZYME" "EVIDENCE"
+## [9] "EVIDENCEALL" "GENENAME" "GO" "GOALL"
+## [13] "IPI" "MAP" "OMIM" "ONTOLOGY"
+## [17] "ONTOLOGYALL" "PATH" "PFAM" "PMID"
+## [21] "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
+## [25] "UNIGENE" "UNIPROT"
+Even though we’ll use this package to convert from Ensembl gene IDs (ENSEMBL
) to Entrez IDs (ENTREZID
), we could just as easily use it to convert from an Ensembl transcript ID (ENSEMBLTRANS
) to gene symbols (SYMBOL
).
The function we will use to map from Ensembl gene IDs to Entrez IDs is called mapIds()
and comes from the AnnotationDbi
package.
# This returns a named vector which we can convert to a data frame, where
+# the keys (Ensembl IDs) are the names
+entrez_vector <- mapIds(
+ # Replace with annotation package for the organism relevant to your data
+ org.Hs.eg.db,
+ # The vector of gene identifiers we want to map
+ keys = gene_module_df$gene,
+ # Replace with the type of gene identifiers in your data
+ keytype = "ENSEMBL",
+ # Replace with the type of gene identifiers you would like to map to
+ column = "ENTREZID",
+ # In the case of 1:many mappings, return the
+ # first one. This is default behavior!
+ multiVals = "first"
+)
## 'select()' returned 1:many mapping between keys and columns
+This message is letting us know that sometimes Ensembl gene identifiers will map to multiple Entrez IDs. In this case, it’s also possible that an Entrez ID will map to multiple Ensembl IDs. For more about how to explore this, take a look at our RNA-seq gene ID conversion example.
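To see what the default multiVals = "first" behavior means in practice, here is a toy base R illustration (hypothetical IDs, not the annotation package itself):

```r
# A toy 1:many lookup table (hypothetical IDs, for illustration only)
mapping <- data.frame(
  ensembl = c("ENSG_A", "ENSG_A", "ENSG_B"),
  entrez = c("111", "222", "333"),
  stringsAsFactors = FALSE
)

keys <- c("ENSG_A", "ENSG_B", "ENSG_C")

# match() returns the index of the first occurrence of each key, which
# mimics multiVals = "first": the "222" mapping is silently dropped and
# a key with no match comes back as NA
first_hit <- mapping$entrez[match(keys, mapping$ensembl)]
names(first_hit) <- keys
first_hit
```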
+Let’s create a two column data frame that shows the Entrez IDs and their Ensembl IDs side-by-side.
+# We would like a data frame we can join to the gene module labels
+gene_key_df <- data.frame(
+ ensembl_id = names(entrez_vector),
+ entrez_id = entrez_vector,
+ stringsAsFactors = FALSE
+) %>%
+  # If an Ensembl gene identifier doesn't map to an Entrez ID, drop that
+ # from the data frame
+ dplyr::filter(!is.na(entrez_id))
Let’s see a preview of entrez_id
.
Now we are ready to add the gene_key_df
to our data frame with the module labels, gene_module_df
. Here we’re using a dplyr::left_join()
because we only want to retain the genes that have Entrez IDs and this will filter out anything in our gene_module_df
that does not have an Entrez ID when we join using the Ensembl gene identifiers.
module_annot_df <- gene_key_df %>%
+  # Using a left join removes the rows without Entrez IDs because those rows
+ # have already been removed in `gene_key_df`
+ dplyr::left_join(gene_module_df,
+ # The name of the column that contains the Ensembl gene IDs
+ # in the left data frame and right data frame
+ by = c("ensembl_id" = "gene")
+ )
Let’s take a look at what this data frame looks like.
+ +Over-representation testing using clusterProfiler
is based on a hypergeometric test (often referred to as Fisher’s exact test) (Yu 2020). For more background on hypergeometric tests, this handy tutorial explains more about how hypergeometric tests work (Puthier and van Helden 2015).
We will need to provide clusterProfiler
 with two gene lists: our genes of interest and a background set of genes.
This step is highly variable depending on the format of your gene table, what information it contains and what your goals are. You may want to delete this next chunk entirely if you supply an already determined list of genes OR you may need to introduce cutoffs and filters that we don’t need here, given the nature of our data.
+Here, we will focus on one module, module 19, to identify pathways associated with it. We previously identified this module as differentially expressed between our dataset’s two time points (during acute illness and during recovery). See the previous section for more background on the structure and content of the data table we are using.
+module_19_genes <- module_annot_df %>%
+ dplyr::filter(module == "ME19") %>%
+ dplyr::pull("entrez_id")
Because one entrez_id
may map to multiple Ensembl IDs, we need to make sure we have no repeated Entrez IDs in this list.
# Reduce to only unique Entrez IDs
+genes_of_interest <- unique(as.character(module_19_genes))
+
+# Let's print out some of these genes
+head(genes_of_interest)
## [1] "5704" "578" "23471" "5255" "4171" "8898"
+Sometimes folks consider genes from the entire genome to comprise the background, but for this RNA-seq example, we will consider all detectable genes as our background set. The dataset that these genes were selected from already had unreliably detected, low count genes removed. Because of this, we can obtain our detected genes list from our data frame, module_annot_df
(which we have not done any further filtering on in this notebook).
enricher()
functionNow that we have our background set, our genes of interest, and our pathway information, we’re ready to run ORA using the enricher()
function.
kegg_ora_results <- enricher(
+ gene = genes_of_interest, # A vector of your genes of interest
+ pvalueCutoff = 0.1, # Can choose a FDR cutoff
+ pAdjustMethod = "BH", # Method to be used for multiple testing correction
+ universe = background_set, # A vector containing your background set genes
+ # The pathway information should be a data frame with a term name or
+ # identifier and the gene identifiers
+ TERM2GENE = dplyr::select(
+ hs_kegg_df,
+ gs_name,
+ human_entrez_gene
+ )
+)
Note: using enrichKEGG()
is a shortcut for doing ORA using KEGG, but the approach we covered here can be used with any gene sets you’d like!
The information we’re most likely interested in is in the results
slot. Let’s convert this into a data frame that we can write to file.
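The conversion itself is not shown here; since enricher() returns an S4 object whose table lives in the result slot, a sketch could look like this (with a toy stand-in object so the snippet runs on its own):

```r
library(methods) # for setClass() and new()

# Toy stand-in for the `kegg_ora_results` object returned by enricher()
setClass("toyEnrichResult", representation(result = "data.frame"))
kegg_ora_results <- new(
  "toyEnrichResult",
  result = data.frame(ID = "KEGG_CELL_CYCLE", p.adjust = 0.01)
)

# The sketch itself: coerce the `result` slot to a plain data frame
kegg_result_df <- data.frame(kegg_ora_results@result)
```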
Let’s print out a sneak peek of the results here and take a look at how many gene sets we have using an FDR cutoff of 0.1
.
Looks like there are four KEGG sets returned as significant at FDR of 0.1
.
We can use a dot plot to visualize our significant enrichment results. The enrichplot::dotplot()
function will only plot gene sets that are significant according to the multiple testing corrected p values (in the p.adjust
column) and the pvalueCutoff
you provided in the enricher()
step.
## wrong orderBy parameter; set to default `orderBy = "x"`
+
+
+Use ?enrichplot::dotplot
to see the help page for more about how to use this function.
This plot is arguably more useful when we have a large number of significant pathways.
+Let’s save it to a PNG.
+ggplot2::ggsave(file.path(plots_dir, "SRP140558_ora_enrich_plot_module_19.png"),
+ plot = enrich_plot
+)
## Saving 7 x 5 in image
+We can use an UpSet plot to visualize the overlap between the gene sets that were returned as significant.
+ + +See that KEGG_CELL_CYCLE
and KEGG_OOCYTE_MEIOSIS
have genes in common, as do KEGG_CELL_CYCLE
and KEGG_DNA_REPLICATION
. Gene sets or pathways aren’t independent! Based on the context of your samples, you may be able to narrow down which ones make sense. In this instance, we are dealing with PBMCs, so the oocyte meiosis pathway is not relevant to the biology of the samples at hand, and all of the identified genes in that pathway are also part of the cell cycle pathway.
Let’s also save this to a PNG.
+ggplot2::ggsave(file.path(plots_dir, "SRP140558_ora_upset_plot_module_19.png"),
+ plot = upset_plot
+)
## Saving 7 x 5 in image
+At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.
+ +## ─ Session info ─────────────────────────────────────────────────────
+## setting value
+## version R version 4.0.2 (2020-06-22)
+## os Ubuntu 20.04 LTS
+## system x86_64, linux-gnu
+## ui X11
+## language (EN)
+## collate en_US.UTF-8
+## ctype en_US.UTF-8
+## tz Etc/UTC
+## date 2020-12-21
+##
+## ─ Packages ─────────────────────────────────────────────────────────
+## package * version date lib source
+## AnnotationDbi * 1.52.0 2020-10-27 [1] Bioconductor
+## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.0)
+## backports 1.1.10 2020-09-15 [1] RSPM (R 4.0.2)
+## Biobase * 2.50.0 2020-10-27 [1] Bioconductor
+## BiocGenerics * 0.36.0 2020-10-27 [1] Bioconductor
+## BiocManager 1.30.10 2019-11-16 [1] RSPM (R 4.0.0)
+## BiocParallel 1.24.1 2020-11-06 [1] Bioconductor
+## bit 4.0.4 2020-08-04 [1] RSPM (R 4.0.2)
+## bit64 4.0.5 2020-08-30 [1] RSPM (R 4.0.2)
+## blob 1.2.1 2020-01-20 [1] RSPM (R 4.0.0)
+## cli 2.1.0 2020-10-12 [1] RSPM (R 4.0.2)
+## clusterProfiler * 3.18.0 2020-10-27 [1] Bioconductor
+## colorspace 1.4-1 2019-03-18 [1] RSPM (R 4.0.0)
+## cowplot 1.1.0 2020-09-08 [1] RSPM (R 4.0.2)
+## crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.0)
+## data.table 1.13.0 2020-07-24 [1] RSPM (R 4.0.2)
+## DBI 1.1.0 2019-12-15 [1] RSPM (R 4.0.0)
+## digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0)
+## DO.db 2.9 2020-12-16 [1] Bioconductor
+## DOSE 3.16.0 2020-10-27 [1] Bioconductor
+## downloader 0.4 2015-07-09 [1] RSPM (R 4.0.0)
+## dplyr 1.0.2 2020-08-18 [1] RSPM (R 4.0.2)
+## ellipsis 0.3.1 2020-05-15 [1] RSPM (R 4.0.0)
+## enrichplot 1.10.1 2020-11-14 [1] Bioconductor
+## evaluate 0.14 2019-05-28 [1] RSPM (R 4.0.0)
+## fansi 0.4.1 2020-01-08 [1] RSPM (R 4.0.0)
+## farver 2.0.3 2020-01-16 [1] RSPM (R 4.0.0)
+## fastmatch 1.1-0 2017-01-28 [1] RSPM (R 4.0.0)
+## fgsea 1.16.0 2020-10-27 [1] Bioconductor
+## generics 0.0.2 2018-11-29 [1] RSPM (R 4.0.0)
+## getopt 1.20.3 2019-03-22 [1] RSPM (R 4.0.0)
+## ggforce 0.3.2 2020-06-23 [1] RSPM (R 4.0.2)
+## ggplot2 3.3.2 2020-06-19 [1] RSPM (R 4.0.1)
+## ggraph 2.0.3 2020-05-20 [1] RSPM (R 4.0.2)
+## ggrepel 0.8.2 2020-03-08 [1] RSPM (R 4.0.2)
+## ggupset 0.3.0 2020-05-05 [1] RSPM (R 4.0.0)
+## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.2)
+## GO.db 3.12.1 2020-12-16 [1] Bioconductor
+## GOSemSim 2.16.1 2020-10-29 [1] Bioconductor
+## graphlayouts 0.7.0 2020-04-25 [1] RSPM (R 4.0.2)
+## gridExtra 2.3 2017-09-09 [1] RSPM (R 4.0.0)
+## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.0.0)
+## hms 0.5.3 2020-01-08 [1] RSPM (R 4.0.0)
+## htmltools 0.5.0 2020-06-16 [1] RSPM (R 4.0.1)
+## igraph 1.2.6 2020-10-06 [1] RSPM (R 4.0.2)
+## IRanges * 2.24.1 2020-12-12 [1] Bioconductor
+## jsonlite 1.7.1 2020-09-07 [1] RSPM (R 4.0.2)
+## knitr 1.30 2020-09-22 [1] RSPM (R 4.0.2)
+## labeling 0.3 2014-08-23 [1] RSPM (R 4.0.0)
+## lattice 0.20-41 2020-04-02 [2] CRAN (R 4.0.2)
+## lifecycle 0.2.0 2020-03-06 [1] RSPM (R 4.0.0)
+## magrittr * 1.5 2014-11-22 [1] RSPM (R 4.0.0)
+## MASS 7.3-51.6 2020-04-26 [2] CRAN (R 4.0.2)
+## Matrix 1.2-18 2019-11-27 [2] CRAN (R 4.0.2)
+## memoise 1.1.0 2017-04-21 [1] RSPM (R 4.0.0)
+## msigdbr * 7.2.1 2020-10-02 [1] RSPM (R 4.0.2)
+## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.0.0)
+## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0)
+## org.Hs.eg.db * 3.12.0 2020-12-16 [1] Bioconductor
+## pillar 1.4.6 2020-07-10 [1] RSPM (R 4.0.2)
+## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.0)
+## plyr 1.8.6 2020-03-03 [1] RSPM (R 4.0.2)
+## polyclip 1.10-0 2019-03-14 [1] RSPM (R 4.0.0)
+## ps 1.4.0 2020-10-07 [1] RSPM (R 4.0.2)
+## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.0)
+## qvalue 2.22.0 2020-10-27 [1] Bioconductor
+## R.cache 0.14.0 2019-12-06 [1] RSPM (R 4.0.0)
+## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.2)
+## R.oo 1.24.0 2020-08-26 [1] RSPM (R 4.0.2)
+## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.2)
+## R6 2.4.1 2019-11-12 [1] RSPM (R 4.0.0)
+## RColorBrewer 1.1-2 2014-12-07 [1] RSPM (R 4.0.0)
+## Rcpp 1.0.5 2020-07-06 [1] RSPM (R 4.0.2)
+## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.2)
+## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.0)
+## reshape2 1.4.4 2020-04-09 [1] RSPM (R 4.0.2)
+## rlang 0.4.8 2020-10-08 [1] RSPM (R 4.0.2)
+## rmarkdown 2.4 2020-09-30 [1] RSPM (R 4.0.2)
+## RSQLite 2.2.1 2020-09-30 [1] RSPM (R 4.0.2)
+## rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0)
+## rvcheck 0.1.8 2020-03-01 [1] RSPM (R 4.0.0)
+## S4Vectors * 0.28.1 2020-12-09 [1] Bioconductor
+## scales 1.1.1 2020-05-11 [1] RSPM (R 4.0.0)
+## scatterpie 0.1.5 2020-09-09 [1] RSPM (R 4.0.2)
+## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.0)
+## shadowtext 0.0.7 2019-11-06 [1] RSPM (R 4.0.0)
+## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.2)
+## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.0)
+## styler 1.3.2 2020-02-23 [1] RSPM (R 4.0.0)
+## tibble 3.0.4 2020-10-12 [1] RSPM (R 4.0.2)
+## tidygraph 1.2.0 2020-05-12 [1] RSPM (R 4.0.2)
+## tidyr 1.1.2 2020-08-27 [1] RSPM (R 4.0.2)
+## tidyselect 1.1.0 2020-05-11 [1] RSPM (R 4.0.0)
+## tweenr 1.0.1 2018-12-14 [1] RSPM (R 4.0.2)
+## vctrs 0.3.4 2020-08-29 [1] RSPM (R 4.0.2)
+## viridis 0.5.1 2018-03-29 [1] RSPM (R 4.0.0)
+## viridisLite 0.3.0 2018-02-01 [1] RSPM (R 4.0.0)
+## withr 2.3.0 2020-09-22 [1] RSPM (R 4.0.2)
+## xfun 0.18 2020-09-29 [1] RSPM (R 4.0.2)
+## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.0)
+##
+## [1] /usr/local/lib/R/site-library
+## [2] /usr/local/lib/R/library
+Ahlmann-Eltze C., 2020 ggupset: Combination matrix axis for ’ggplot2’ to create ’upset’ plots. https://github.com/const-ae/ggupset
+Carlson M., 2020 org.Hs.eg.db: Genome wide annotation for human. http://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html
+Dolgalev I., 2020 msigdbr: MSigDB gene sets for multiple organisms in a tidy data format. https://cran.r-project.org/web/packages/msigdbr/index.html
+Kanehisa M., and S. Goto, 2000 KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research 28: 27–30. https://doi.org/10.1093/nar/28.1.27
+Khatri P., M. Sirota, and A. J. Butte, 2012 Ten years of pathway analysis: Current approaches and outstanding challenges. PLOS Computational Biology 8: e1002375. https://doi.org/10.1371/journal.pcbi.1002375
+Langfelder P., and S. Horvath, 2008 WGCNA: An r package for weighted correlation network analysis. BMC Bioinformatics 9. https://doi.org/10.1186/1471-2105-9-559
+Liberzon A., A. Subramanian, R. Pinchback, H. Thorvaldsdóttir, and P. Tamayo et al., 2011 Molecular signatures database (MSigDB) 3.0. Bioinformatics 27: 1739–1740. https://doi.org/10.1093/bioinformatics/btr260
+Puthier D., and J. van Helden, 2015 Statistics for Bioinformatics - Practicals - Gene enrichment statistics. https://dputhier.github.io/ASG/practicals/go_statistics_td/go_statistics_td_2015.html
+Subramanian A., P. Tamayo, V. K. Mootha, S. Mukherjee, and B. L. Ebert et al., 2005 Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 102: 15545–15550. https://doi.org/10.1073/pnas.0506580102
+Yaari G., C. R. Bolen, J. Thakar, and S. H. Kleinstein, 2013 Quantitative set analysis for gene expression: A method to quantify gene set differential expression including gene-gene correlations. Nucleic Acids Research 41: e170. https://doi.org/10.1093/nar/gkt660
+Yu G., L.-G. Wang, Y. Han, and Q.-Y. He, 2012 clusterProfiler: An R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 16: 284–287. https://doi.org/10.1089/omi.2011.0118
+Yu G., 2020 clusterProfiler: Universal enrichment tool for functional and comparative study. http://yulab-smu.top/clusterProfiler-book/index.html
+This example is one of a set of pathway analysis modules; we recommend looking at the pathway analysis table below to help you determine which pathway analysis method is best suited for your purposes.
+This particular example analysis shows how you can use Gene Set Enrichment Analysis (GSEA) to detect situations where genes in a predefined gene set or pathway change in a coordinated way between two conditions (Subramanian et al. 2005). Changes at the pathway-level may be statistically significant, and contribute to phenotypic differences, even if the changes in the expression level of individual genes are small.
+⬇️ Jump to the analysis code ⬇️
+Pathway analysis refers to any one of many techniques that uses predetermined sets of genes that are related or coordinated in their expression in some way (e.g., participate in the same molecular process, are regulated by the same transcription factor) to interpret a high-throughput experiment. In the context of refine.bio, we use these techniques to analyze and interpret genome-wide gene expression experiments. The rationale for performing pathway analysis is that looking at the pathway-level may be more biologically meaningful than considering individual genes, especially if a large number of genes are differentially expressed between conditions of interest. In addition, many relatively small changes in the expression values of genes in the same pathway could lead to a phenotypic outcome and these small changes may go undetected in differential gene expression analysis.
+We highly recommend taking a look at Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges from Khatri et al. (2012) for a more comprehensive overview. We have provided primary publications and documentation of the methods we will introduce below as well as some recommended reading in the Resources for further learning
section.
This table summarizes the pathway analysis examples in this module.
| Analysis | What is required for input | What output looks like | ✅ Pros | ⚠️ Cons |
|---|---|---|---|---|
| ORA (Over-representation Analysis) | A list of gene IDs (no stats needed) | A per-pathway hypergeometric test result | Simple; computationally inexpensive to calculate p-values | Requires arbitrary thresholds and ignores any statistics associated with a gene; assumes independence of genes and pathways |
| GSEA (Gene Set Enrichment Analysis) | A list of gene IDs with gene-level summary statistics | A per-pathway enrichment score | Includes all genes (no arbitrary threshold!); attempts to measure coordination of genes | Permutations can be expensive; does not account for pathway overlap; two-group comparisons not always appropriate/feasible |
| GSVA (Gene Set Variation Analysis) | A gene expression matrix (like what you get from refine.bio directly) | Pathway-level scores on a per-sample basis | Does not require two groups to compare upfront; normally distributed scores | Scores are not a good fit for gene sets that contain genes that go up AND down; method doesn’t assign statistical significance itself; recommended sample size n > 10 |
For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.
+.Rmd
fileTo run this example yourself, download the .Rmd
for this analysis by clicking this link.
Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd
file to where you would like this example and its files to be stored.
You can open this .Rmd
file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd
files.)
Good file organization is helpful for keeping your data analysis project on track! We have written some code that will automatically create a folder structure for you. Run this next chunk to set up your folders!
+If you have trouble running this chunk, see our introduction to using .Rmd
s for more resources and explanations.
# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+ dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots" # Can replace with path to desired output plots directory
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+ dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results" # Can replace with path to desired output results directory
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+ dir.create(results_dir)
+}
In the same place you put this .Rmd
file, you should now have three new empty folders called data
, plots
, and results
!
In this example, we are using the differential expression results table we obtained from an example analysis of an acute lymphoblastic leukemia (ALL) mouse lymphoid cell model using the DESeq2
package (Love et al. 2014). The table contains summary statistics including Ensembl gene IDs, log2 fold change values, and adjusted p-values (FDR in this case).
We have provided this file for you and the code in this notebook will read in the results that are stored online, but if you’d like to follow the steps for obtaining this results file yourself, we suggest going through that differential expression analysis example.
+For this example analysis, we are using RNA-seq data from an acute lymphoblastic leukemia (ALL) mouse lymphoid cell model (Kampen et al. 2019). All of the mouse lymphoid cells carry human RPL10, but three of the mice have a knock-in R98S-mutated RPL10 and three have the human reference RPL10. Differential expression was performed using these mutated and reference RPL10 designations.
+Your new analysis folder should contain:
+.Rmd
you downloadeddata
(currently empty)plots
(currently empty)results
(currently empty)Your example analysis folder should contain your .Rmd
and three empty folders (which won’t be empty for long!).
If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.
+If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/
directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/
and results/
directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.
See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.
+In this analysis, we will be using the clusterProfiler
package to perform GSEA and the msigdbr
package, which contains gene sets from the Molecular Signatures Database (MSigDB) already in the tidy format required by clusterProfiler
(Yu et al. 2012; Dolgalev 2020; Subramanian et al. 2005; Liberzon et al. 2011).
We will also need the org.Mm.eg.db
package to perform gene identifier conversion (Carlson 2019).
if (!("clusterProfiler" %in% installed.packages())) {
+ # Install this package if it isn't installed yet
+ BiocManager::install("clusterProfiler", update = FALSE)
+}
+
+if (!("msigdbr" %in% installed.packages())) {
+ # Install this package if it isn't installed yet
+ BiocManager::install("msigdbr", update = FALSE)
+}
+
+if (!("org.Mm.eg.db" %in% installed.packages())) {
+ # Install this package if it isn't installed yet
+ BiocManager::install("org.Mm.eg.db", update = FALSE)
+}
Attach the packages we need for this analysis.
+We will read in the differential expression results, which we will download from the web. These results are from an acute lymphoblastic leukemia (ALL) mouse lymphoid cell model that we used for differential expression analysis with DESeq2
(Love et al. 2014). The table contains summary statistics including Ensembl gene IDs, log2 fold change values, and adjusted p-values (FDR in this case). We can identify differentially regulated genes by filtering these results and use this list as input to GSEA.
Instead of using the URL below, you can use a file path to a TSV file with your desired gene list results. First, we will assign the URL to its own variable, called dge_url
.
# Define the url to your differential expression results file
+dge_url <- "https://refinebio-examples.s3.us-east-2.amazonaws.com/03-rnaseq/results/SRP123625/SRP123625_differential_expression_results.tsv"
We will also declare a file path where we want this file to be downloaded; we can use the same file path later for reading the file into R.
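The chunk that declares this path is not echoed above; a minimal sketch (the file name here is an assumption that mirrors the end of dge_url):

```r
# Declare the path where the downloaded results will be saved; the file
# name is assumed to mirror the end of the URL defined above
dge_results_file <- file.path(
  "data", # Replace with the path to your desired output directory
  "SRP123625_differential_expression_results.tsv"
)
```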
+ +Using the URL (dge_url
) and file path (dge_results_file
) we can download the file and use the destfile
argument to specify where it should be saved.
download.file(
+ dge_url,
+ # The file will be saved to this location and with this name
+ destfile = dge_results_file
+)
Now let’s double check that the results file is in the right place.
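The chunk that performs this check is not echoed above; it is presumably a one-liner like:

```r
# Returns TRUE if the downloaded file exists at the expected path
file.exists(dge_results_file)
```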
+ +## [1] TRUE
+Read in the file that has differential expression results.
+# Read in the contents of the differential expression results file
+dge_df <- readr::read_tsv(dge_results_file)
##
+## ── Column specification ──────────────────────────────────────────────
+## cols(
+## Gene = col_character(),
+## baseMean = col_double(),
+## log2FoldChange = col_double(),
+## lfcSE = col_double(),
+## pvalue = col_double(),
+## padj = col_double(),
+## threshold = col_logical()
+## )
+Note that read_tsv()
can also read TSV files directly from a URL and doesn’t necessarily require you to download the file first. If you wanted to use that feature, you could replace the call above with readr::read_tsv(dge_url)
and skip the download steps.
Let’s take a look at what the contrast results from the differential expression analysis look like.
+ +msigdbr
The Molecular Signatures Database (MSigDB) is a resource that contains annotated gene sets that can be used for pathway or gene set analyses (Subramanian et al. 2005; Liberzon et al. 2011). We can use the msigdbr
package to access these gene sets in a format compatible with the package we’ll use for analysis, clusterProfiler
(Yu et al. 2012; Dolgalev 2020).
The gene sets available directly from MSigDB are applicable to human studies. msigdbr
also supports commonly studied model organisms.
Let’s take a look at what organisms the package supports with msigdbr_species()
.
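A sketch of that call:

```r
# List the species supported by the msigdbr package
msigdbr::msigdbr_species()
```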
MSigDB contains 8 different gene set collections (Subramanian et al. 2005; Liberzon et al. 2011) that are distinguished by how they are derived (e.g., computationally mined, curated).
+In this example, we will use a collection called Hallmark gene sets for GSEA (Liberzon et al. 2015). Here’s an excerpt of the collection description from MSigDB:
+++Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression. The hallmarks reduce noise and redundancy and provide a better delineated biological space for GSEA.
+
Notably, there are only 50 gene sets included in this collection. The fewer gene sets we test, the lower our multiple hypothesis testing burden.
+The data we’re interested in here comes from mouse samples, so we can obtain only the Hallmarks gene sets relevant to M. musculus by specifying category = "H"
and species = "Mus musculus"
, respectively, to the msigdbr()
function.
mm_hallmark_sets <- msigdbr(
+ species = "Mus musculus", # Replace with species name relevant to your data
+ category = "H"
+)
If you run the chunk above without specifying a category
to the msigdbr()
function, it will return all of the MSigDB gene sets for mouse. See ?msigdbr
for more options.
Let’s preview what’s in mm_hallmark_sets
.
Looks like we have a data frame of gene sets with associated gene symbols and Entrez IDs.
+In our differential expression results data frame, dge_df
we have Ensembl gene identifiers. So we will need to convert our Ensembl IDs into either gene symbols or Entrez IDs for GSEA.
We’re going to convert our identifiers in dge_df
to gene symbols, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers!
The annotation package org.Mm.eg.db
contains information for different identifiers (Carlson 2019). org.Mm.eg.db
is specific to Mus musculus – this is what the Mm
in the package name is referencing.
We can see what types of IDs are available to us in an annotation package with keytypes()
.
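The chunk that produces the listing below is presumably:

```r
# List the identifier types available in the mouse annotation package
keytypes(org.Mm.eg.db)
```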
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT"
+## [5] "ENSEMBLTRANS" "ENTREZID" "ENZYME" "EVIDENCE"
+## [9] "EVIDENCEALL" "GENENAME" "GO" "GOALL"
+## [13] "IPI" "MGI" "ONTOLOGY" "ONTOLOGYALL"
+## [17] "PATH" "PFAM" "PMID" "PROSITE"
+## [21] "REFSEQ" "SYMBOL" "UNIGENE" "UNIPROT"
+Even though we’ll use this package to convert from Ensembl gene IDs (ENSEMBL
) to gene symbols (SYMBOL
), we could just as easily use it to convert to and from any of these keytypes()
listed above.
The function we will use to map from Ensembl gene IDs to gene symbols is called mapIds()
and comes from the AnnotationDbi
package.
Let’s create a data frame that shows the mapped gene symbols along with the differential expression stats for the respective Ensembl IDs.
+# First let's create a mapped data frame we can join to the differential
+# expression stats
+dge_mapped_df <- data.frame(
+ gene_symbol = mapIds(
+ # Replace with annotation package for the organism relevant to your data
+ org.Mm.eg.db,
+ keys = dge_df$Gene,
+ # Replace with the type of gene identifiers in your data
+ keytype = "ENSEMBL",
+ # Replace with the type of gene identifiers you would like to map to
+ column = "SYMBOL",
+ # This will keep only the first mapped value for each Ensembl ID
+ multiVals = "first"
+ )
+) %>%
+ # If an Ensembl gene identifier doesn't map to a gene symbol, drop that
+ # from the data frame
+ dplyr::filter(!is.na(gene_symbol)) %>%
+ # Make an `Ensembl` column to store the rownames
+ tibble::rownames_to_column("Ensembl") %>%
+ # Now let's join the rest of the expression data
+ dplyr::inner_join(dge_df, by = c("Ensembl" = "Gene"))
## 'select()' returned 1:many mapping between keys and columns
+This 1:many mapping between keys and columns
message means that some Ensembl gene identifiers map to multiple gene symbols. In this case, it’s also possible that a gene symbol will map to multiple Ensembl IDs. For the purpose of performing GSEA later in this notebook, we keep only the first mapped IDs. Take a look at our other gene identifier conversion examples, which cover different species and gene ID types: the microarray example and the RNA-seq example.
Let’s see a preview of dge_mapped_df
.
The goal of GSEA is to detect situations where many genes in a gene set change in a coordinated way, even when individual changes are small in magnitude (Subramanian et al. 2005).
+GSEA calculates a pathway-level metric, called an enrichment score (sometimes abbreviated as ES), by ranking genes by a gene-level statistic. This score reflects whether or not a gene set or pathway is overrepresented at the top or bottom of the gene rankings (Yu 2020; Subramanian et al. 2005). Specifically, genes are ranked from most positive to most negative based on their statistic and a running sum is calculated by starting with the most highly ranked genes and increasing the score when a gene is in the pathway and decreasing the score when a gene is not. In this example, the enrichment score for a pathway is the running sum’s maximum deviation from zero. GSEA also assesses statistical significance of the scores for each pathway through permutation testing. As a result, each input pathway will have a p-value associated with it that is then corrected for multiple hypothesis testing (Yu 2020; Subramanian et al. 2005).
+The implementation of GSEA we use in this example requires a gene list ordered by some statistic (here we’ll use log2 fold changes calculated as part of differential gene expression analysis) and input gene sets (Hallmark collection). When you use previously computed gene-level statistics with GSEA, it is called GSEA pre-ranked.
+The GSEA()
function takes a pre-ranked and sorted named vector of statistics, where the names in the vector are gene identifiers. It requires unique gene identifiers to produce the most accurate results, so we will need to resolve any duplicates found in our dataset. (The GSEA()
function will throw a warning if we do not do this ahead of time.)
Let’s check to see if we have any gene symbols that mapped to multiple Ensembl IDs.
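The chunk that performs this check is not shown; presumably something like:

```r
# TRUE if any gene symbol appears more than once in the mapped data frame
any(duplicated(dge_mapped_df$gene_symbol))
```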
+ +## [1] TRUE
+Looks like we do have duplicated gene symbols. Let’s find out which ones.
+dup_gene_symbols <- dge_mapped_df %>%
+ dplyr::filter(duplicated(gene_symbol)) %>%
+ dplyr::pull(gene_symbol)
Now let’s take a look at the rows associated with the duplicated gene symbols.
+dge_mapped_df %>%
+ dplyr::filter(gene_symbol %in% dup_gene_symbols) %>%
+ dplyr::arrange(gene_symbol)
We can see that the associated values vary for each row.
+As we mentioned earlier, we will want to remove duplicated gene identifiers in preparation for the GSEA()
step. Let’s keep the gene symbols associated with the higher absolute value of the log2 fold change. GSEA relies on ranking genes on the basis of a gene-level statistic, and the enrichment score that is calculated reflects the degree to which genes in a gene set are overrepresented in the top or bottom of the rankings (Yu 2020; Subramanian et al. 2005).
+Retaining the instance of the gene symbol with the higher absolute value of a gene-level statistic means that we will retain the value that is likely to be more highly- or lowly-ranked or, put another way, the values less likely to be towards the middle of the ranked gene list. We should keep this decision in mind when interpreting our results. For example, if all the duplicate identifiers happened to be in a particular gene set, we may get an overly optimistic view of how perturbed that gene set is because we preferentially selected instances of the identifier that have a higher absolute value of the statistic used for ranking.
+We are removing values for 33 out of thousands of genes here, so it is unlikely to have a considerable impact on our results.
+In the next chunk, we are going to filter out the duplicated rows using the dplyr::distinct()
function. This keeps the first row for each duplicated value; because we sort by the absolute value of the log2 fold change first, that is the row with the highest absolute value of the log2 fold change.
filtered_dge_mapped_df <- dge_mapped_df %>%
+ # Sort so that the highest absolute values of the log2 fold change are at the
+ # top
+ dplyr::arrange(dplyr::desc(abs(log2FoldChange))) %>%
+ # Filter out the duplicated rows using `dplyr::distinct()`
+ dplyr::distinct(gene_symbol, .keep_all = TRUE)
Note that the log2 fold change estimates we use here have been subject to shrinkage to account for genes with low counts or highly variable counts. See the DESeq2
package vignette for more information on how DESeq2
handles the log2 fold change values with the lfcShrink()
function.
Let’s check to see that we removed the duplicate gene symbols and kept the rows with the higher absolute value of the log2 fold change.
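The check itself is not echoed above; presumably:

```r
# Should now be FALSE: no duplicated gene symbols remain
any(duplicated(filtered_dge_mapped_df$gene_symbol))
```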
+ +## [1] FALSE
+Looks like we were able to successfully get rid of the duplicate gene identifiers and keep the observations with the higher absolute value of the log2 fold change!
+In the next chunk, we will create a named vector ranked based on the gene-level log2 fold change values.
+# Let's create a named vector ranked based on the log2 fold change values
+lfc_vector <- filtered_dge_mapped_df$log2FoldChange
+names(lfc_vector) <- filtered_dge_mapped_df$gene_symbol
+
+# We need to sort the log2 fold change values in descending order here
+lfc_vector <- sort(lfc_vector, decreasing = TRUE)
Let’s preview our pre-ranked named vector.
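A sketch of the preview chunk (its output is shown below):

```r
# Look at the first entries of the ranked, named vector
head(lfc_vector)
```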
+ +## Lpgat1 Lgals7 Gm973 Bbs7 Clnk Zfp575
+## 13.34941 12.64196 12.51824 12.19278 11.52481 10.20900
+GSEA()
function
Genes were ranked from most positive to most negative, weighted according to their gene-level statistic, in the previous section. In this section, we will implement GSEA to calculate the enrichment score for each gene set using our pre-ranked gene list.
+The GSEA algorithm utilizes random sampling so we are going to set the seed to make our results reproducible.
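The seed-setting chunk is not echoed above; it would be a single call along these lines (the specific seed value below is an arbitrary placeholder; any fixed value works):

```r
# Set the seed so permutation-based results are reproducible;
# the value 2020 is an arbitrary placeholder
set.seed(2020)
```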
+ +We can use the GSEA()
function to perform GSEA with any generic set of gene sets, but there are several functions for using specific, commonly used gene sets (e.g., gseKEGG()
).
Significance is assessed by permuting the gene labels of the pre-ranked gene list and recomputing the enrichment scores of the gene set for the permuted data, which generates a null distribution for the enrichment score. The pAdjustMethod
argument to GSEA()
above specifies what method to use for adjusting the p-values to account for multiple hypothesis testing; the pvalueCutoff
argument tells the function to only return pathways with adjusted p-values less than that threshold in the results
slot.
gsea_results <- GSEA(
+ geneList = lfc_vector, # Ordered ranked gene list
+ minGSSize = 25, # Minimum gene set size
+ maxGSSize = 500, # Maximum gene set size
+ pvalueCutoff = 0.05, # p-value cutoff
+ eps = 0, # Boundary for calculating the p value
+ seed = TRUE, # Set seed to make results reproducible
+ pAdjustMethod = "BH", # Benjamini-Hochberg correction
+ TERM2GENE = dplyr::select(
+ mm_hallmark_sets,
+ gs_name,
+ gene_symbol
+ )
+)
## preparing geneSet collections...
+## GSEA analysis...
+## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.19% of the list).
+## The order of those tied genes will be arbitrary, which may produce unexpected results.
+## leading edge analysis...
+## done...
+The warning message above tells us that a few genes have the same log2 fold change value and are therefore ranked equally. fgsea
, the method that underlies GSEA()
, will arbitrarily choose which comes first in the ranked list (Ballereau et al. 2018). This percentage of 0.19
is small, so we are not concerned that it will significantly affect our results. If the percentage were much larger, on the other hand, we would be concerned about the log2 fold change results.
Let’s take a look at the table in the result
slot of gsea_results
.
# We can access the results from our `gsea_results` object using `@result`
+head(gsea_results@result)
Looks like we have gene sets returned as significant at an FDR (false discovery rate) of 0.05
. If we did not have results that met the pvalueCutoff
condition, this table would be empty. If we wanted all results returned, we would need to set pvalueCutoff = 1
.
The NES
column contains the normalized enrichment score, which normalizes for the gene set size, for that pathway.
Let’s convert the contents of result
into a data frame that we can use for further analysis and write to a file later.
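That conversion chunk is not echoed above; presumably along these lines (the name gsea_result_df matches its use later in this notebook):

```r
# Convert the result slot of the GSEA output into a data frame
gsea_result_df <- data.frame(gsea_results@result)
```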
We can visualize GSEA results for individual pathways or gene sets using enrichplot::gseaplot()
. Let’s take a look at 2 different pathways – one with a highly positive NES and one with a highly negative NES – to get more insight into how ES are calculated.
Let’s look at the 3 gene sets with the most positive NES.
+gsea_result_df %>%
+ # This returns the 3 rows with the largest NES values
+ dplyr::slice_max(NES, n = 3)
The gene set HALLMARK_MYC_TARGETS_V2
has the most positive NES score.
most_positive_nes_plot <- enrichplot::gseaplot(
+ gsea_results,
+ geneSetID = "HALLMARK_MYC_TARGETS_V2",
+ title = "HALLMARK_MYC_TARGETS_V2",
+ color.line = "#0d76ff"
+)
+most_positive_nes_plot
+Notice how the genes that are in the gene set, indicated by the black bars, tend to be on the left side of the graph, indicating that they have positive gene-level scores. The red dashed line indicates the enrichment score, which is the maximum deviation from zero. As mentioned earlier, an enrichment score is calculated by starting with the most highly ranked genes (according to the gene-level log2 fold change values) and increasing the score when a gene is in the pathway and decreasing the score when a gene is not in the pathway.
+The plots returned by enrichplot::gseaplot
are ggplots, so we can use ggplot2::ggsave()
to save them to file.
Let’s save to PNG.
+ggplot2::ggsave(file.path(plots_dir, "SRP123625_gsea_enrich_positive_plot.png"),
+ plot = most_positive_nes_plot
+)
## Saving 7 x 5 in image
+Let’s look for the 3 gene sets with the most negative NES.
+gsea_result_df %>%
+ # Return the 3 rows with the smallest (most negative) NES values
+ dplyr::slice_min(NES, n = 3)
The gene set HALLMARK_HYPOXIA
has the most negative NES.
most_negative_nes_plot <- enrichplot::gseaplot(
+ gsea_results,
+ geneSetID = "HALLMARK_HYPOXIA",
+ title = "HALLMARK_HYPOXIA",
+ color.line = "#0d76ff"
+)
+most_negative_nes_plot
This gene set shows the opposite pattern – genes in the pathway tend to be on the right side of the graph. Again, the red dashed line here indicates the maximum deviation from zero, in other words, the enrichment score. A negative enrichment score will be returned when many genes are near the bottom of the ranked list.
+Let’s save this plot to PNG as well.
+ggplot2::ggsave(file.path(plots_dir, "SRP123625_gsea_enrich_negative_plot.png"),
+ plot = most_negative_nes_plot
+)
## Saving 7 x 5 in image
+At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.
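The chunk is presumably the usual one-liner (the output below is in the format produced by the sessioninfo package):

```r
# Record platform and package versions for reproducibility
sessioninfo::session_info()
```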
+ +## ─ Session info ─────────────────────────────────────────────────────
+## setting value
+## version R version 4.0.2 (2020-06-22)
+## os Ubuntu 20.04 LTS
+## system x86_64, linux-gnu
+## ui X11
+## language (EN)
+## collate en_US.UTF-8
+## ctype en_US.UTF-8
+## tz Etc/UTC
+## date 2020-12-21
+##
+## ─ Packages ─────────────────────────────────────────────────────────
+## package * version date lib source
+## AnnotationDbi * 1.52.0 2020-10-27 [1] Bioconductor
+## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.0)
+## backports 1.1.10 2020-09-15 [1] RSPM (R 4.0.2)
+## Biobase * 2.50.0 2020-10-27 [1] Bioconductor
+## BiocGenerics * 0.36.0 2020-10-27 [1] Bioconductor
+## BiocManager 1.30.10 2019-11-16 [1] RSPM (R 4.0.0)
+## BiocParallel 1.24.1 2020-11-06 [1] Bioconductor
+## bit 4.0.4 2020-08-04 [1] RSPM (R 4.0.2)
+## bit64 4.0.5 2020-08-30 [1] RSPM (R 4.0.2)
+## blob 1.2.1 2020-01-20 [1] RSPM (R 4.0.0)
+## cli 2.1.0 2020-10-12 [1] RSPM (R 4.0.2)
+## clusterProfiler * 3.18.0 2020-10-27 [1] Bioconductor
+## colorspace 1.4-1 2019-03-18 [1] RSPM (R 4.0.0)
+## cowplot 1.1.0 2020-09-08 [1] RSPM (R 4.0.2)
+## crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.0)
+## data.table 1.13.0 2020-07-24 [1] RSPM (R 4.0.2)
+## DBI 1.1.0 2019-12-15 [1] RSPM (R 4.0.0)
+## digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0)
+## DO.db 2.9 2020-12-16 [1] Bioconductor
+## DOSE 3.16.0 2020-10-27 [1] Bioconductor
+## downloader 0.4 2015-07-09 [1] RSPM (R 4.0.0)
+## dplyr 1.0.2 2020-08-18 [1] RSPM (R 4.0.2)
+## ellipsis 0.3.1 2020-05-15 [1] RSPM (R 4.0.0)
+## enrichplot 1.10.1 2020-11-14 [1] Bioconductor
+## evaluate 0.14 2019-05-28 [1] RSPM (R 4.0.0)
+## fansi 0.4.1 2020-01-08 [1] RSPM (R 4.0.0)
+## farver 2.0.3 2020-01-16 [1] RSPM (R 4.0.0)
+## fastmatch 1.1-0 2017-01-28 [1] RSPM (R 4.0.0)
+## fgsea 1.16.0 2020-10-27 [1] Bioconductor
+## generics 0.0.2 2018-11-29 [1] RSPM (R 4.0.0)
+## getopt 1.20.3 2019-03-22 [1] RSPM (R 4.0.0)
+## ggforce 0.3.2 2020-06-23 [1] RSPM (R 4.0.2)
+## ggplot2 3.3.2 2020-06-19 [1] RSPM (R 4.0.1)
+## ggraph 2.0.3 2020-05-20 [1] RSPM (R 4.0.2)
+## ggrepel 0.8.2 2020-03-08 [1] RSPM (R 4.0.2)
+## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.2)
+## GO.db 3.12.1 2020-12-16 [1] Bioconductor
+## GOSemSim 2.16.1 2020-10-29 [1] Bioconductor
+## graphlayouts 0.7.0 2020-04-25 [1] RSPM (R 4.0.2)
+## gridExtra 2.3 2017-09-09 [1] RSPM (R 4.0.0)
+## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.0.0)
+## hms 0.5.3 2020-01-08 [1] RSPM (R 4.0.0)
+## htmltools 0.5.0 2020-06-16 [1] RSPM (R 4.0.1)
+## igraph 1.2.6 2020-10-06 [1] RSPM (R 4.0.2)
+## IRanges * 2.24.1 2020-12-12 [1] Bioconductor
+## jsonlite 1.7.1 2020-09-07 [1] RSPM (R 4.0.2)
+## knitr 1.30 2020-09-22 [1] RSPM (R 4.0.2)
+## labeling 0.3 2014-08-23 [1] RSPM (R 4.0.0)
+## lattice 0.20-41 2020-04-02 [2] CRAN (R 4.0.2)
+## lifecycle 0.2.0 2020-03-06 [1] RSPM (R 4.0.0)
+## magrittr * 1.5 2014-11-22 [1] RSPM (R 4.0.0)
+## MASS 7.3-51.6 2020-04-26 [2] CRAN (R 4.0.2)
+## Matrix 1.2-18 2019-11-27 [2] CRAN (R 4.0.2)
+## memoise 1.1.0 2017-04-21 [1] RSPM (R 4.0.0)
+## msigdbr * 7.2.1 2020-10-02 [1] RSPM (R 4.0.2)
+## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.0.0)
+## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0)
+## org.Mm.eg.db * 3.12.0 2020-12-16 [1] Bioconductor
+## pillar 1.4.6 2020-07-10 [1] RSPM (R 4.0.2)
+## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.0)
+## plyr 1.8.6 2020-03-03 [1] RSPM (R 4.0.2)
+## polyclip 1.10-0 2019-03-14 [1] RSPM (R 4.0.0)
+## ps 1.4.0 2020-10-07 [1] RSPM (R 4.0.2)
+## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.0)
+## qvalue 2.22.0 2020-10-27 [1] Bioconductor
+## R.cache 0.14.0 2019-12-06 [1] RSPM (R 4.0.0)
+## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.2)
+## R.oo 1.24.0 2020-08-26 [1] RSPM (R 4.0.2)
+## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.2)
+## R6 2.4.1 2019-11-12 [1] RSPM (R 4.0.0)
+## RColorBrewer 1.1-2 2014-12-07 [1] RSPM (R 4.0.0)
+## Rcpp 1.0.5 2020-07-06 [1] RSPM (R 4.0.2)
+## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.2)
+## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.0)
+## reshape2 1.4.4 2020-04-09 [1] RSPM (R 4.0.2)
+## rlang 0.4.8 2020-10-08 [1] RSPM (R 4.0.2)
+## rmarkdown 2.4 2020-09-30 [1] RSPM (R 4.0.2)
+## RSQLite 2.2.1 2020-09-30 [1] RSPM (R 4.0.2)
+## rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0)
+## rvcheck 0.1.8 2020-03-01 [1] RSPM (R 4.0.0)
+## S4Vectors * 0.28.1 2020-12-09 [1] Bioconductor
+## scales 1.1.1 2020-05-11 [1] RSPM (R 4.0.0)
+## scatterpie 0.1.5 2020-09-09 [1] RSPM (R 4.0.2)
+## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.0)
+## shadowtext 0.0.7 2019-11-06 [1] RSPM (R 4.0.0)
+## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.2)
+## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.0)
+## styler 1.3.2 2020-02-23 [1] RSPM (R 4.0.0)
+## tibble 3.0.4 2020-10-12 [1] RSPM (R 4.0.2)
+## tidygraph 1.2.0 2020-05-12 [1] RSPM (R 4.0.2)
+## tidyr 1.1.2 2020-08-27 [1] RSPM (R 4.0.2)
+## tidyselect 1.1.0 2020-05-11 [1] RSPM (R 4.0.0)
+## tweenr 1.0.1 2018-12-14 [1] RSPM (R 4.0.2)
+## vctrs 0.3.4 2020-08-29 [1] RSPM (R 4.0.2)
+## viridis 0.5.1 2018-03-29 [1] RSPM (R 4.0.0)
+## viridisLite 0.3.0 2018-02-01 [1] RSPM (R 4.0.0)
+## withr 2.3.0 2020-09-22 [1] RSPM (R 4.0.2)
+## xfun 0.18 2020-09-29 [1] RSPM (R 4.0.2)
+## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.0)
+##
+## [1] /usr/local/lib/R/site-library
+## [2] /usr/local/lib/R/library
+Ballereau S., M. Dunning, A. Edwards, O. Rueda, and A. Sawle, 2018 RNA-seq analysis in R: Gene set testing for RNA-seq. https://bioinformatics-core-shared-training.github.io/cruk-summer-school-2018/RNASeq2018/html/06_Gene_set_testing.nb.html
+Carlson M., 2019 Genome wide annotation for mouse. https://bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html
+Dolgalev I., 2020 msigdbr: MSigDB gene sets for multiple organisms in a tidy data format. https://cran.r-project.org/web/packages/msigdbr/index.html
+Kampen K. R., L. Fancello, T. Girardi, G. Rinaldi, and M. Planque et al., 2019 Translatome analysis reveals altered serine and glycine metabolism in t-cell acute lymphoblastic leukemia cells. Nature Communications 10. https://doi.org/10.1038/s41467-019-10508-2
+Khatri P., M. Sirota, and A. J. Butte, 2012 Ten years of pathway analysis: Current approaches and outstanding challenges. PLOS Computational Biology 8: e1002375. https://doi.org/10.1371/journal.pcbi.1002375
+Liberzon A., C. Birger, H. Thorvaldsdóttir, M. Ghandi, and J. P. Mesirov et al., 2015 The molecular signatures database hallmark gene set collection. Cell Systems 1. https://doi.org/10.1016/j.cels.2015.12.004
+Liberzon A., A. Subramanian, R. Pinchback, H. Thorvaldsdóttir, and P. Tamayo et al., 2011 Molecular signatures database (MSigDB) 3.0. Bioinformatics 27: 1739–1740. https://doi.org/10.1093/bioinformatics/btr260
+Love M. I., W. Huber, and S. Anders, 2014 Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biology 15. https://doi.org/10.1186/s13059-014-0550-8
+Subramanian A., P. Tamayo, V. K. Mootha, S. Mukherjee, and B. L. Ebert et al., 2005 Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 102: 15545–15550. https://doi.org/10.1073/pnas.0506580102
+UC San Diego and Broad Institute Team, GSEA: Gene set enrichment analysis. https://www.gsea-msigdb.org/gsea/index.jsp
+Yu G., L.-G. Wang, Y. Han, and Q.-Y. He, 2012 clusterProfiler: An R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 16: 284–287. https://doi.org/10.1089/omi.2011.0118
+Yu G., 2020 clusterProfiler: Universal enrichment tool for functional and comparative study. http://yulab-smu.top/clusterProfiler-book/index.html
+This example is one of a set of pathway analysis modules. We recommend looking at the pathway analysis table below to help you determine which pathway analysis method is best suited for your purposes.
+In this example we will cover a method called Gene Set Variation Analysis (GSVA) to calculate gene set or pathway scores on a per-sample basis (Hänzelmann et al. 2013). GSVA transforms a gene by sample gene expression matrix into a gene set by sample pathway enrichment matrix (Hänzelmann et al. 2013). We’ll make a heatmap of the enrichment matrix, but you can use the GSVA scores for a number of other downstream analyses such as differential expression analysis.
+⬇️ Jump to the analysis code ⬇️
+Pathway analysis refers to any one of many techniques that uses predetermined sets of genes that are related or coordinated in their expression in some way (e.g., participate in the same molecular process, are regulated by the same transcription factor) to interpret a high-throughput experiment. In the context of refine.bio, we use these techniques to analyze and interpret genome-wide gene expression experiments. The rationale for performing pathway analysis is that looking at the pathway-level may be more biologically meaningful than considering individual genes, especially if a large number of genes are differentially expressed between conditions of interest. In addition, many relatively small changes in the expression values of genes in the same pathway could lead to a phenotypic outcome and these small changes may go undetected in differential gene expression analysis.
+We highly recommend taking a look at Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges from Khatri et al. (2012) for a more comprehensive overview. We have provided primary publications and documentation of the methods we will introduce below as well as some recommended reading in the Resources for further learning
section.
This table summarizes the pathway analyses examples in this module.
| Analysis | What is required for input | What output looks like | ✅ Pros | ⚠️ Cons |
|---|---|---|---|---|
| ORA (Over-representation Analysis) | A list of gene IDs (no stats needed) | A per-pathway hypergeometric test result | Simple; computationally inexpensive to calculate p-values | Requires arbitrary thresholds and ignores any statistics associated with a gene; assumes independence of genes and pathways |
| GSEA (Gene Set Enrichment Analysis) | A list of gene IDs with gene-level summary statistics | A per-pathway enrichment score | Includes all genes (no arbitrary threshold!); attempts to measure coordination of genes | Permutations can be expensive; does not account for pathway overlap; two-group comparisons not always appropriate/feasible |
| GSVA (Gene Set Variation Analysis) | A gene expression matrix (like what you get from refine.bio directly) | Pathway-level scores on a per-sample basis | Does not require two groups to compare upfront; normally distributed scores | Scores are not a good fit for gene sets that contain genes that go up AND down; method doesn’t assign statistical significance itself; recommended sample size n > 10 |
For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.
+.Rmd
fileTo run this example yourself, download the .Rmd
for this analysis by clicking this link.
Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd
file to where you would like this example and its files to be stored.
You can open this .Rmd
file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd
files.)
Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!
+If you have trouble running this chunk, see our introduction to using .Rmd
s for more resources and explanations.
# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+ dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots"
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+ dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results"
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+ dir.create(results_dir)
+}
In the same place you put this .Rmd
file, you should now have three new empty folders called data
, plots
, and results
!
For general information about downloading data for these examples, see our ‘Getting Started’ section.
+Go to this dataset’s page on refine.bio.
+Click the “Download Now” button on the right side of this screen.
+ +Fill out the pop up window with your email and our Terms and Conditions:
+ +It may take a few minutes for the dataset to process. You will get an email when it is ready.
+For this example analysis, we will use this acute viral bronchiolitis dataset. The data that we downloaded from refine.bio for this analysis has 62 paired peripheral blood mononuclear cell RNA-seq samples obtained from 31 patients. Samples were collected at two time points: during their first, acute bronchiolitis visit (abbreviated “AV”) and during recovery, at their post-convalescence visit (abbreviated “CV”).
+data/
folderrefine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip
. Double clicking should unzip this for you and create a folder of the same name.
For more details on the contents of this folder see these docs on refine.bio.
+The SRP140558
folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235
or SRP12345
.
Copy and paste the SRP140558
folder into your newly created data/
folder.
Your new analysis folder should contain:
+.Rmd
you downloadedSRP140558
folder which contains:
+plots
(currently empty)results
(currently empty)Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):
+ +In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.
+First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.
+# Define the file path to the data directory
+# Replace with the path of the folder the files will be in
+data_dir <- file.path("data", "SRP140558")
+
+# Declare the file path to the gene expression matrix file
+# inside directory saved as `data_dir`
+# Replace with the path to your dataset file
+data_file <- file.path(data_dir, "SRP140558.tsv")
+
+# Declare the file path to the metadata file
+# inside the directory saved as `data_dir`
+# Replace with the path to your metadata file
+metadata_file <- file.path(data_dir, "metadata_SRP140558.tsv")
Now that our file paths are declared, we can use the file.exists()
function to check that the files are where we specified above.
# Check if the gene expression matrix file is at the path stored in `data_file`
+file.exists(data_file)
## [1] TRUE
+# Check if the metadata file is at the file path stored in `metadata_file`
+file.exists(metadata_file)
## [1] TRUE
+If the chunk above printed out FALSE
to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
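If you would rather have the notebook stop immediately when a file is missing (instead of printing `FALSE` and continuing), you could add a stricter check. This is an optional sketch, not part of the original notebook, and the named-message form of `stopifnot()` requires R 3.5 or later:

```r
# Optional: halt with an informative error if either file is missing
stopifnot(
  "Gene expression file not found at the `data_file` path" = file.exists(data_file),
  "Metadata file not found at the `metadata_file` path" = file.exists(metadata_file)
)
```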
+If the concept of a “file path” is unfamiliar to you, we recommend taking a look at our section about file paths.
+If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/
directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/
and results/
directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.
See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.
+We will be using DESeq2
to normalize and transform our RNA-seq data before running GSVA, so we will need to install that (Love et al. 2014).
In this analysis, we will be using the GSVA
package to perform GSVA and the qusage
package to read in the GMT file containing the gene set data (Hänzelmann et al. 2013; Yaari et al. 2013).
We will also need the org.Hs.eg.db
package to perform gene identifier conversion (Carlson 2020).
We’ll create a heatmap from our pathway analysis results using pheatmap
(Slowikowski 2017).
if (!("DESeq2" %in% installed.packages())) {
+ # Install this package if it isn't installed yet
+ BiocManager::install("DESeq2", update = FALSE)
+}
+
+if (!("GSVA" %in% installed.packages())) {
+ # Install this package if it isn't installed yet
+ BiocManager::install("GSVA", update = FALSE)
+}
+
+if (!("qusage" %in% installed.packages())) {
+ # Install this package if it isn't installed yet
+ BiocManager::install("qusage", update = FALSE)
+}
+
+if (!("org.Hs.eg.db" %in% installed.packages())) {
+ # Install this package if it isn't installed yet
+ BiocManager::install("org.Hs.eg.db", update = FALSE)
+}
+
+if (!("pheatmap" %in% installed.packages())) {
+ # Install pheatmap
+ install.packages("pheatmap", update = FALSE)
+}
Attach the packages we need for this analysis.
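The attach chunk itself is collapsed in this view. Judging from the loading messages that follow (DESeq2’s Bioconductor dependencies, then limma, which qusage brings in), it presumably begins along these lines:

```r
# Attach the DESeq2 library
library(DESeq2)

# Attach the qusage library; it loads limma, which it uses to read GMT files
library(qusage)

# We will need this so we can use the pipe: %>%
library(magrittr)
```

The later lines of the same chunk, attaching `GSVA` and `org.Hs.eg.db`, are shown below.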
+ +## Loading required package: S4Vectors
+## Loading required package: stats4
+## Loading required package: BiocGenerics
+## Loading required package: parallel
+##
+## Attaching package: 'BiocGenerics'
+## The following objects are masked from 'package:parallel':
+##
+## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
+## clusterExport, clusterMap, parApply, parCapply, parLapply,
+## parLapplyLB, parRapply, parSapply, parSapplyLB
+## The following objects are masked from 'package:stats':
+##
+## IQR, mad, sd, var, xtabs
+## The following objects are masked from 'package:base':
+##
+## anyDuplicated, append, as.data.frame, basename, cbind,
+## colnames, dirname, do.call, duplicated, eval, evalq,
+## Filter, Find, get, grep, grepl, intersect, is.unsorted,
+## lapply, Map, mapply, match, mget, order, paste, pmax,
+## pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce,
+## rownames, sapply, setdiff, sort, table, tapply, union,
+## unique, unsplit, which.max, which.min
+##
+## Attaching package: 'S4Vectors'
+## The following object is masked from 'package:base':
+##
+## expand.grid
+## Loading required package: IRanges
+## Loading required package: GenomicRanges
+## Loading required package: GenomeInfoDb
+## Loading required package: SummarizedExperiment
+## Loading required package: MatrixGenerics
+## Loading required package: matrixStats
+##
+## Attaching package: 'MatrixGenerics'
+## The following objects are masked from 'package:matrixStats':
+##
+## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet,
+## colCollapse, colCounts, colCummaxs, colCummins,
+## colCumprods, colCumsums, colDiffs, colIQRDiffs, colIQRs,
+## colLogSumExps, colMadDiffs, colMads, colMaxs, colMeans2,
+## colMedians, colMins, colOrderStats, colProds,
+## colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
+## colSums2, colTabulates, colVarDiffs, colVars,
+## colWeightedMads, colWeightedMeans, colWeightedMedians,
+## colWeightedSds, colWeightedVars, rowAlls, rowAnyNAs,
+## rowAnys, rowAvgsPerColSet, rowCollapse, rowCounts,
+## rowCummaxs, rowCummins, rowCumprods, rowCumsums, rowDiffs,
+## rowIQRDiffs, rowIQRs, rowLogSumExps, rowMadDiffs, rowMads,
+## rowMaxs, rowMeans2, rowMedians, rowMins, rowOrderStats,
+## rowProds, rowQuantiles, rowRanges, rowRanks, rowSdDiffs,
+## rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
+## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
+## rowWeightedSds, rowWeightedVars
+## Loading required package: Biobase
+## Welcome to Bioconductor
+##
+## Vignettes contain introductory material; view with
+## 'browseVignettes()'. To cite Bioconductor, see
+## 'citation("Biobase")', and for packages
+## 'citation("pkgname")'.
+##
+## Attaching package: 'Biobase'
+## The following object is masked from 'package:MatrixGenerics':
+##
+## rowMedians
+## The following objects are masked from 'package:matrixStats':
+##
+## anyMissing, rowMedians
+
+## Loading required package: limma
+##
+## Attaching package: 'limma'
+## The following object is masked from 'package:DESeq2':
+##
+## plotMA
+## The following object is masked from 'package:BiocGenerics':
+##
+## plotMA
+# Attach the `GSVA` library
+library(GSVA)
+
+# Human annotation package we'll use for gene identifier conversion
+library(org.Hs.eg.db)
## Loading required package: AnnotationDbi
+##
+
+Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read both TSV files and add them as data frames to your environment.
+We stored our file paths as objects named metadata_file
and data_file
in this previous step.
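The read-in chunk for the metadata is collapsed here; given the objects defined above, it is presumably a readr call like this sketch:

```r
# Read in metadata TSV file (path stored earlier in `metadata_file`)
metadata <- readr::read_tsv(metadata_file)
```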
##
+## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────
+## cols(
+## .default = col_logical(),
+## refinebio_accession_code = col_character(),
+## experiment_accession = col_character(),
+## refinebio_organism = col_character(),
+## refinebio_platform = col_character(),
+## refinebio_source_database = col_character(),
+## refinebio_subject = col_character(),
+## refinebio_title = col_character()
+## )
+## ℹ Use `spec()` for the full column specifications.
+# Read in data TSV file
+expression_df <- readr::read_tsv(data_file) %>%
+ # Here we are going to store the gene IDs as row names so that we can have a numeric matrix to perform calculations on later
+ tibble::column_to_rownames("Gene")
##
+## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────
+## cols(
+## .default = col_double(),
+## Gene = col_character()
+## )
+## ℹ Use `spec()` for the full column specifications.
+Let’s ensure that the metadata and data are in the same sample order.
+# Make the data in the order of the metadata
+expression_df <- expression_df %>%
+ dplyr::select(metadata$refinebio_accession_code)
+
+# Check if this is in the same order
+all.equal(colnames(expression_df), metadata$refinebio_accession_code)
## [1] TRUE
+DESeq2
There are two things we need to do to prep our expression data for DESeq2.
+First, we need to make sure all of the values in our data are converted to integers as required by a DESeq2
function we will use later.
+Then, we need to filter out the genes that have not been expressed or that have low expression counts, since we cannot be as confident that those genes were reliably measured. We are going to do some pre-filtering to keep only genes with 50 or more reads in total across the samples.
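The chunk implementing these two steps is collapsed in this view; a minimal base-R sketch of what it does (assuming `expression_df` holds the refine.bio counts read in earlier) might look like:

```r
# Round expression values to integers, as required by DESeq2's count input
expression_df <- round(expression_df)

# Keep only genes with 50 or more reads in total across the samples
expression_df <- expression_df[rowSums(expression_df) >= 50, ]
```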
+ +We will be using the DESeq2
package for normalizing and transforming our data, which requires us to format our data into a DESeqDataSet
object. We turn the data frame (or matrix) into a DESeqDataSet
object and specify which variable labels our experimental groups using the design
argument (Love et al. 2014). In this chunk of code, we will not provide a specific model to the design
argument because we are not performing a differential expression analysis.
# Create a `DESeqDataSet` object
+dds <- DESeqDataSetFromMatrix(
+ countData = expression_df, # Our prepped data frame with counts
+ colData = metadata, # Data frame with annotation for our samples
+ design = ~1 # Here we are not specifying a model
+)
## converting counts to integer mode
+We often suggest normalizing and transforming your data for various applications, including GSVA. We are going to use the vst()
function from the DESeq2
package to normalize and transform the data. For more information about these transformation methods, see here.
# Normalize and transform the data in the `DESeqDataSet` object using the `vst()`
+# function from the `DESEq2` R package
+dds_norm <- vst(dds)
At this point, if your data set has any outlier samples, you should look into removing them as they can affect your results. For this example data set, we will skip this step (there are no obvious outliers) and proceed.
+But now we are ready to format our dataset for input into gsva::gsva()
. We need to extract the normalized counts to a matrix and make it into a data frame so we can use it with tidyverse functions later.
# Retrieve the normalized data from the `DESeqDataSet`
+vst_df <- assay(dds_norm) %>%
+ as.data.frame() %>% # Make into a data frame
+ tibble::rownames_to_column("ensembl_id") # Make Gene IDs into their own column
The Molecular Signatures Database (MSigDB) is a resource that contains annotated gene sets that can be used for pathway or gene set analyses (Subramanian et al. 2005; Liberzon et al. 2011). MSigDB contains 8 different gene set collections (Subramanian et al. 2005; Liberzon et al. 2011) that are distinguished by how they are derived (e.g., computationally mined, curated).
+In this example, we will use a collection called Hallmark gene sets for GSVA (Liberzon et al. 2015). Here’s an excerpt of the collection description from MSigDB:
+++Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression.
+
Here we are obtaining the pathway information from the main function of the msigdbr
package (Dolgalev 2020). Because we are using human data in this example, we supply the formal organism name to the species
argument. We will want only the hallmark pathways, so we use the category = "H"
argument.
hallmark_gene_sets <- msigdbr::msigdbr(
+ species = "Homo sapiens", # Can change this to what species you need
+ category = "H" # Only hallmark gene sets
+)
Let’s take a look at the format of hallmark_gene_sets
.
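The preview chunk is collapsed here; a call like the following shows the first few rows of the gene set data frame:

```r
# Preview the first few gene-to-gene-set rows
head(hallmark_gene_sets)
```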
We can see this object is in a tabular format; each row corresponds to a gene and gene set pair. A row exists if that gene (entrez_gene
, gene_symbol
) belongs to a gene set (gs_name
).
The function that we will use to run GSVA wants the gene sets to be in a list, where each entry in the list is a vector of genes that comprise the pathway the element is named for. In the next step, we’ll demonstrate how to go from this data frame format to a list.
+For this example we will use Entrez IDs (but note that there are gene symbols we could use just as easily). The info we need is in two columns: entrez_gene
contains the gene ID and gs_name
contains the name of the pathway that the gene is a part of.
To make this into the list format we need, we can use the split()
function. We want a list where each element of the list is a vector that contains the Entrez gene IDs that are in a particular pathway set.
hallmarks_list <- split(
+ hallmark_gene_sets$entrez_gene, # The genes we want split into pathways
+ hallmark_gene_sets$gs_name # The pathways made as the higher levels of the list
+)
What does this hallmarks_list
look like?
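The preview chunk is collapsed here; the two gene sets printed below can be shown with something like:

```r
# Preview the first two pathways in the list
head(hallmarks_list, n = 2)
```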
## $HALLMARK_ADIPOGENESIS
+## [1] 19 11194 10449 33 34 35 47 50 51
+## [10] 112 149685 9370 79602 56894 9131 204 217 226
+## [19] 284 51129 334 348 369 10124 64225 483 539
+## [28] 11176 593 23786 604 718 847 284119 8436 901
+## [37] 977 9936 948 1031 400916 1147 1149 134147 51727
+## [46] 1306 1282 51805 84274 57017 1337 1349 1351 1376
+## [55] 1384 1431 1537 1580 1629 1652 1666 8694 1717
+## [64] 51635 25979 1737 1738 4189 29103 128338 1891 1892
+## [73] 84173 79071 5168 2053 2101 23344 2109 2167 2184
+## [82] 8322 9908 1647 2632 27069 57678 137964 2820 10243
+## [91] 2878 2879 80273 3033 26275 26353 3417 3419 3421
+## [100] 3459 10989 3679 80760 6453 84522 3910 3952 3977
+## [109] 3991 10162 4023 4056 8491 56922 4191 4199 11343
+## [118] 4259 84895 56246 29088 54996 23788 4638 64859 4698
+## [127] 4706 4713 4722 28512 4836 4958 5004 27250 10400
+## [136] 5195 5209 5211 5236 23187 5264 415116 123 5447
+## [145] 5468 5495 84919 10935 10113 55037 5733 5860 83871
+## [154] 7905 92840 56729 54884 8780 55177 26994 6239 10313
+## [163] 25813 949 6342 6390 6391 6573 6510 6576 1468
+## [172] 376497 8884 130814 6623 6647 10580 65124 8404 58472
+## [181] 8082 6776 2040 8802 6817 6888 10010 7086 10140
+## [190] 7263 7316 29979 83549 7351 29796 10975 7384 27089
+## [199] 7423 7532
+##
+## $HALLMARK_ALLOGRAFT_REJECTION
+## [1] 16 6059 10006 43 92 207 322 567 586
+## [10] 8915 602 672 717 822 9607 6356 6357 6363
+## [19] 6347 6367 6351 6352 6354 894 896 1230 729230
+## [28] 1234 912 914 919 940 915 916 917 920
+## [37] 958 959 961 924 972 973 941 942 925
+## [46] 926 10225 1029 5199 56253 1435 1445 1520 10563
+## [55] 4283 2833 1615 8560 8444 1956 8661 8664 8669
+## [64] 8672 1984 1991 2000 2069 2113 2147 2149 355
+## [73] 356 2213 2268 2316 2533 2589 2634 2650 11146
+## [82] 8477 3001 3002 3059 9734 3091 3105 3108 3109
+## [91] 3111 3112 3117 3122 3133 3135 3383 23308 3455
+## [100] 3458 3459 3460 10261 3551 3586 3589 3592 3593
+## [109] 3594 3596 3600 3603 3606 8807 3553 3558 9466
+## [118] 3559 3560 3561 3565 3566 3569 3574 3578 3624
+## [127] 3625 3662 3665 3394 3683 3689 3702 3717 3824
+## [136] 3848 3932 3937 3976 4050 4065 9450 4067 6885
+## [145] 11184 4153 4318 11222 4528 4689 4690 9437 114548
+## [154] 4830 4843 4869 5196 5551 5579 5582 5699 5777
+## [163] 5788 5917 8767 6170 6123 6133 6223 6189 6203
+## [172] 27240 8651 9655 6688 5552 7903 23166 6772 6775
+## [181] 6890 6891 6892 7040 7042 7070 7076 7096 7097
+## [190] 7098 10333 7124 7163 7186 50852 7321 7334 7453
+## [199] 7454 7535
+Looks like we have a list of gene sets with associated Entrez IDs.
+In our gene expression data frame, expression_df
we have Ensembl gene identifiers. So we will need to convert our Ensembl IDs into Entrez IDs for GSVA.
We’re going to convert our identifiers in expression_df
to Entrez IDs, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers!
The annotation package org.Hs.eg.db
contains information for different identifiers (Carlson 2020). org.Hs.eg.db
is specific to Homo sapiens – this is what the Hs
in the package name is referencing.
We can see what types of IDs are available to us in an annotation package with keytypes()
.
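The chunk that produced the output below is collapsed; it is presumably just a single call:

```r
# List the available identifier types in the annotation package
keytypes(org.Hs.eg.db)
```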
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT"
+## [5] "ENSEMBLTRANS" "ENTREZID" "ENZYME" "EVIDENCE"
+## [9] "EVIDENCEALL" "GENENAME" "GO" "GOALL"
+## [13] "IPI" "MAP" "OMIM" "ONTOLOGY"
+## [17] "ONTOLOGYALL" "PATH" "PFAM" "PMID"
+## [21] "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
+## [25] "UNIGENE" "UNIPROT"
+We’ll use this package to convert from Ensembl gene IDs (ENSEMBL
) to Entrez IDs (ENTREZID
) – since these are the IDs we used in our hallmarks_list
in the previous step. But, we could just as easily use it to convert to gene symbols (SYMBOL
) if we had built hallmarks_list
using gene symbols.
The function we will use to map from Ensembl gene IDs to Entrez gene IDs is called mapIds()
and comes from the AnnotationDbi
package.
Let’s create a data frame that shows the mapped Entrez IDs along with the gene expression values for the respective Ensembl IDs.
+# First let's create a mapped data frame we can join to the gene expression values
+mapped_df <- data.frame(
+ "entrez_id" = mapIds(
+ # Replace with annotation package for the organism relevant to your data
+ org.Hs.eg.db,
+ keys = vst_df$ensembl_id,
+ # Replace with the type of gene identifiers in your data
+ keytype = "ENSEMBL",
+ # Replace with the type of gene identifiers you would like to map to
+ column = "ENTREZID",
+ # This will keep only the first mapped value for each Ensembl ID
+ multiVals = "first"
+ )
+) %>%
+ # If an Ensembl gene identifier doesn't map to a Entrez gene identifier,
+ # drop that from the data frame
+ dplyr::filter(!is.na(entrez_id)) %>%
+ # Make an `Ensembl` column to store the row names
+ tibble::rownames_to_column("Ensembl") %>%
+ # Now let's join the rest of the expression data
+ dplyr::inner_join(vst_df, by = c("Ensembl" = "ensembl_id"))
## 'select()' returned 1:many mapping between keys and columns
+This 1:many mapping between keys and columns
message means that some Ensembl gene identifiers map to multiple Entrez IDs. In this case, it’s also possible that an Entrez ID will map to multiple Ensembl IDs. For the purpose of performing GSVA later in this notebook, we keep only the first mapped IDs.
For more info on gene ID conversion, take a look at our other examples: the microarray example and the RNA-seq example.
+Let’s see a preview of mapped_df
.
We will want to keep in mind that GSVA requires that data is in a matrix with the gene identifiers as row names. In order to successfully turn our data frame into a matrix, we will need to ensure that we do not have any duplicate gene identifiers.
+Let’s count up how many Entrez IDs mapped to multiple Ensembl IDs.
+ +## [1] 68
+Looks like we have 68 duplicated Entrez IDs.
+As we mentioned earlier, we will not want any duplicate gene identifiers in our data frame when we convert it into a matrix in preparation for running GSVA.
+For RNA-seq processing in refine.bio, transcripts were quantified (Ensembl transcript IDs) and aggregated to the gene level (Ensembl gene IDs). For a single Entrez ID that maps to multiple Ensembl gene IDs, we will use the values associated with the Ensembl gene ID that seems to be most highly expressed. Specifically, we’re going to retain the Ensembl gene ID with the maximum mean expression value. We expect that this approach may be a better reflection of the reads that were quantified than taking the mean or median of the values for multiple Ensembl gene IDs would be.
+Our example doesn’t contain too many duplicates; ultimately we are only losing 68 rows of data. If you find yourself using a dataset that has a large proportion of duplicates, we’d recommend exercising some caution and exploring how well the values for multiple gene IDs are correlated, as well as the identity of those genes.
+First, we need to calculate the gene means, but we’ll need to move our non-numeric variables (the gene ID columns) out of the way for that calculation.
+# First let's determine the gene means
+gene_means <- rowMeans(mapped_df %>% dplyr::select(-Ensembl, -entrez_id))
+
+# Let's add this as a column in our `mapped_df`.
+mapped_df <- mapped_df %>%
+ # Add gene_means as a column called gene_means
+ dplyr::mutate(gene_means) %>%
+ # Reorder the columns so `gene_means` column is upfront
+ dplyr::select(Ensembl, entrez_id, gene_means, dplyr::everything())
Now we can filter out the duplicate gene identifiers using the gene mean values. First, we’ll use dplyr::arrange()
by gene_means
such that the rows will be in order of highest gene mean to lowest gene mean. For the duplicate values of entrez_id
, the row with the lower index will be the one that’s kept by dplyr::distinct()
. In practice, this means that we’ll keep the instance of the Entrez ID with the highest gene mean value as intended.
filtered_mapped_df <- mapped_df %>%
+ # Sort so that the highest mean expression values are at the top
+ dplyr::arrange(dplyr::desc(gene_means)) %>%
+ # Filter out the duplicated rows using `dplyr::distinct()`
+ dplyr::distinct(entrez_id, .keep_all = TRUE)
Let’s do our check again to see if we still have duplicates.
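As above, the chunk behind the output below isn't rendered in this extract; a sketch of the repeated check on the filtered data frame:

```r
# Re-run the duplicate check on the filtered data frame
sum(duplicated(filtered_mapped_df$entrez_id))
```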
+ +## [1] 0
+We now have 0
duplicates which is what we want. All set!
Now we should prep this data so GSVA can use it.
+filtered_mapped_matrix <- filtered_mapped_df %>%
+ # GSVA can't use the Ensembl IDs, so we should drop this column as well as the means
+ dplyr::select(-Ensembl, -gene_means) %>%
+ # We need to store our gene identifiers as row names
+ tibble::column_to_rownames("entrez_id") %>%
+ # Now we can convert our object into a matrix
+ as.matrix()
Note that if we had duplicate gene identifiers here, we would not be able to set them as row names.
+GSVA fits a model and ranks genes based on their expression level relative to the sample distribution (Hänzelmann et al. 2013). The pathway-level score calculated is a way of asking how genes within a gene set vary as compared to genes that are outside of that gene set (Malhotra 2018).
+The idea here is that we will get pathway-level scores for each sample that indicate if genes in a pathway vary concordantly in one direction (over-expressed or under-expressed relative to the overall population) (Hänzelmann et al. 2013). This means that GSVA scores will depend on the samples included in the dataset when you run GSVA; if you added more samples and ran GSVA again, you would expect the scores to change (Hänzelmann et al. 2013).
+The output is a gene set by sample matrix of GSVA scores.
+Let’s perform GSVA using the gsva()
function. See ?gsva
for more options.
gsva_results <- gsva(
+ filtered_mapped_matrix,
+ hallmarks_list,
+ method = "gsva",
+ # Appropriate for our vst transformed data
+ kcdf = "Gaussian",
+ # Minimum gene set size
+ min.sz = 15,
+ # Maximum gene set size
+ max.sz = 500,
+ # Compute Gaussian-distributed scores
+ mx.diff = TRUE,
+ # Don't print out the progress bar
+ verbose = FALSE
+)
Note that the gsva()
function documentation says we can use kcdf = "Gaussian"
if we have expression values that are continuous such as log-CPMs, log-RPKMs or log-TPMs, but we would use kcdf = "Poisson"
on integer counts. Our vst()
transformed data is on a log2-like scale, so Gaussian
works for us.
Let’s explore what the output of gsva()
looks like.
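The chunk that prints the preview below isn't rendered in this extract; a minimal sketch:

```r
# Preview the first rows of the gene set by sample matrix of GSVA scores
head(gsva_results)
```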
## SRR7011789 SRR7011790 SRR7011791
+## HALLMARK_ADIPOGENESIS -0.22774528 -0.36395241 0.22999820
+## HALLMARK_ALLOGRAFT_REJECTION 0.22660346 -0.25407049 -0.04169663
+## HALLMARK_ANDROGEN_RESPONSE 0.08568006 -0.13709858 0.32159028
+## HALLMARK_ANGIOGENESIS -0.33111804 -0.25529152 0.60728563
+## HALLMARK_APICAL_JUNCTION -0.11027645 -0.16642244 0.23723265
+## HALLMARK_APICAL_SURFACE 0.01112321 -0.01699534 0.07994730
+## SRR7011792 SRR7011793 SRR7011794
+## HALLMARK_ADIPOGENESIS -0.19727825 -0.2313671 -0.20810271
+## HALLMARK_ALLOGRAFT_REJECTION -0.01823989 -0.1466423 -0.27374622
+## HALLMARK_ANDROGEN_RESPONSE -0.11634752 0.0743458 -0.09111262
+## HALLMARK_ANGIOGENESIS -0.28334284 0.4498812 -0.17887517
+## HALLMARK_APICAL_JUNCTION 0.09654556 -0.2177673 -0.13366769
+## HALLMARK_APICAL_SURFACE 0.25766960 -0.1668598 -0.15017936
+## SRR7011795 SRR7011796 SRR7011797
+## HALLMARK_ADIPOGENESIS -0.00891876 -0.13059319 -0.072872699
+## HALLMARK_ALLOGRAFT_REJECTION -0.14610335 -0.19305512 -0.191220843
+## HALLMARK_ANDROGEN_RESPONSE 0.19100704 0.02244988 0.061162604
+## HALLMARK_ANGIOGENESIS -0.27122034 -0.10532059 0.238517354
+## HALLMARK_APICAL_JUNCTION -0.06955051 -0.07915702 0.005245755
+## HALLMARK_APICAL_SURFACE 0.11007532 -0.08255951 -0.144939542
+## SRR7011798
+## HALLMARK_ADIPOGENESIS -0.09425027
+## HALLMARK_ALLOGRAFT_REJECTION -0.19280670
+## HALLMARK_ANDROGEN_RESPONSE -0.14700865
+## HALLMARK_ANGIOGENESIS 0.22604015
+## HALLMARK_APICAL_JUNCTION -0.24149406
+## HALLMARK_APICAL_SURFACE -0.15649985
+Let’s write all of our GSVA results to file.
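The chunk that writes the results isn't rendered here; a sketch of one way to do it. The output file name is hypothetical, and `results_dir` is assumed to have been declared earlier in the notebook:

```r
gsva_results %>%
  as.data.frame() %>%
  # Move the pathway names out of the row names and into a column
  tibble::rownames_to_column("pathway") %>%
  # `results_dir` and the file name below are assumptions for this sketch
  readr::write_tsv(file.path(results_dir, "SRP140558_gsva_results.tsv"))
```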
+ +Let’s make a heatmap for our pathways!
+We will want our heatmap to include some information about the sample labels, but unfortunately some of the metadata for this dataset are not set up into separate, neat columns.
+The most salient information for these samples is combined into one column, refinebio_title
. Let’s preview what this column looks like.
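The preview chunk isn't rendered in this extract; a minimal sketch that would produce the output below:

```r
# Preview the first few sample titles
head(metadata$refinebio_title)
```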
## [1] "AVB_006_AV_PBMC" "AVB_006_CV_PBMC" "AVB_007_AV_PBMC"
+## [4] "AVB_007_CV_PBMC" "AVB_012_AV_PBMC" "AVB_012_CV_PBMC"
+If we used these labels as is, it wouldn’t be very informative!
+Looking at the author’s descriptions, PBMCs were collected at two time points: during the patients’ first, acute bronchiolitis visit (abbreviated “AV”) and during their recovery, post-convalescence visit (abbreviated “CV”).
+We can create a new variable, time_point
, that states this info more clearly. This new time_point
variable will have two labels: acute illness
and recovering
based on the AV
or CV
coding located in the refinebio_title
string variable.
annot_df <- metadata %>%
+ # We need the sample IDs and the main column that contains the metadata info
+ dplyr::select(
+ refinebio_accession_code,
+ refinebio_title
+ ) %>%
+ # Create our `time_point` variable based on `refinebio_title`
+ dplyr::mutate(
+ time_point = dplyr::case_when(
+ # Create our new variable based whether the refinebio_title column
+ # contains _AV_ or _CV_
+ stringr::str_detect(refinebio_title, "_AV_") ~ "acute illness",
+ stringr::str_detect(refinebio_title, "_CV_") ~ "recovering"
+ )
+ ) %>%
+ # We don't need the older version of the variable anymore
+ dplyr::select(-refinebio_title)
These time point samples are paired, so you could also add the refinebio_subject
to the labels. For simplicity, we’ve left them off for now.
The pheatmap::pheatmap()
will want the annotation data frame to have matching row names to the data we supply it (which is our gsva_results
).
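The chunk that sets these row names isn't rendered in this extract; a sketch, assuming the `annot_df` created above still has its `refinebio_accession_code` column:

```r
annot_df <- annot_df %>%
  # Make the sample accession codes the row names so they match the
  # column names of the GSVA score matrix
  tibble::column_to_rownames("refinebio_accession_code")
```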
Great! We’re all set. We can see that the GSVA results are in a wide format, with the GSVA scores for each sample spread across a row associated with each pathway.
+pathway_heatmap <- pheatmap::pheatmap(gsva_results,
+ annotation_col = annot_df, # Add metadata labels!
+ show_colnames = FALSE, # Don't show sample labels
+ fontsize_row = 6 # Shrink the pathway labels a tad
+)
+
+# Print out heatmap here
+pathway_heatmap
Here we’ve used clustering and can see that samples somewhat cluster by time_point
.
We can also see that some pathways that share biology seem to cluster together (e.g. HALLMARK_INTERFERON_ALPHA_RESPONSE
and HALLMARK_INTERFERON_GAMMA_RESPONSE
). Pathways may cluster together, or have similar GSVA scores, because the genes in those pathways overlap.
Taking this example, we can look at how many genes are in common for HALLMARK_INTERFERON_ALPHA_RESPONSE
and HALLMARK_INTERFERON_GAMMA_RESPONSE
.
length(intersect(
+ hallmarks_list$HALLMARK_INTERFERON_ALPHA_RESPONSE,
+ hallmarks_list$HALLMARK_INTERFERON_GAMMA_RESPONSE
+))
## [1] 73
+These 73
shared genes, out of HALLMARK_INTERFERON_ALPHA_RESPONSE
’s 97 genes and HALLMARK_INTERFERON_GAMMA_RESPONSE
’s 200 genes, are probably why those two pathways cluster together.
The pathways share genes and are not independent!
+Now, let’s save this plot to PNG.
+# Replace file name with a relevant output plot name
+heatmap_png_file <- file.path(plots_dir, "SRP140558_heatmap.png")
+
+# Open a PNG file - width and height arguments control the size of the output
+png(heatmap_png_file, width = 1000, height = 800)
+
+# Print your heatmap
+pathway_heatmap
+
+# Close the PNG file:
+dev.off()
## png
+## 2
+To customize your heatmap beyond what the pheatmap
package allows, see the ComplexHeatmap Complete Reference Manual (Gu et al. 2016).
At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.
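The chunk that prints the session info below isn't rendered in this extract; a sketch, assuming the sessioninfo package is installed:

```r
# Print session info for reproducibility
sessioninfo::session_info()
```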
+ +## ─ Session info ─────────────────────────────────────────────────────
+## setting value
+## version R version 4.0.2 (2020-06-22)
+## os Ubuntu 20.04 LTS
+## system x86_64, linux-gnu
+## ui X11
+## language (EN)
+## collate en_US.UTF-8
+## ctype en_US.UTF-8
+## tz Etc/UTC
+## date 2020-12-21
+##
+## ─ Packages ─────────────────────────────────────────────────────────
+## package * version date lib source
+## annotate 1.68.0 2020-10-27 [1] Bioconductor
+## AnnotationDbi * 1.52.0 2020-10-27 [1] Bioconductor
+## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.0)
+## backports 1.1.10 2020-09-15 [1] RSPM (R 4.0.2)
+## Biobase * 2.50.0 2020-10-27 [1] Bioconductor
+## BiocGenerics * 0.36.0 2020-10-27 [1] Bioconductor
+## BiocParallel 1.24.1 2020-11-06 [1] Bioconductor
+## bit 4.0.4 2020-08-04 [1] RSPM (R 4.0.2)
+## bit64 4.0.5 2020-08-30 [1] RSPM (R 4.0.2)
+## bitops 1.0-6 2013-08-17 [1] RSPM (R 4.0.0)
+## blob 1.2.1 2020-01-20 [1] RSPM (R 4.0.0)
+## cli 2.1.0 2020-10-12 [1] RSPM (R 4.0.2)
+## coda 0.19-4 2020-09-30 [1] RSPM (R 4.0.2)
+## colorspace 1.4-1 2019-03-18 [1] RSPM (R 4.0.0)
+## crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.0)
+## DBI 1.1.0 2019-12-15 [1] RSPM (R 4.0.0)
+## DelayedArray 0.16.0 2020-10-27 [1] Bioconductor
+## DESeq2 * 1.30.0 2020-10-27 [1] Bioconductor
+## digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0)
+## dplyr 1.0.2 2020-08-18 [1] RSPM (R 4.0.2)
+## ellipsis 0.3.1 2020-05-15 [1] RSPM (R 4.0.0)
+## emmeans 1.5.1 2020-09-18 [1] RSPM (R 4.0.2)
+## estimability 1.3 2018-02-11 [1] RSPM (R 4.0.0)
+## evaluate 0.14 2019-05-28 [1] RSPM (R 4.0.0)
+## fansi 0.4.1 2020-01-08 [1] RSPM (R 4.0.0)
+## farver 2.0.3 2020-01-16 [1] RSPM (R 4.0.0)
+## fastmatch 1.1-0 2017-01-28 [1] RSPM (R 4.0.0)
+## fftw 1.0-6 2020-02-24 [1] RSPM (R 4.0.2)
+## genefilter 1.72.0 2020-10-27 [1] Bioconductor
+## geneplotter 1.68.0 2020-10-27 [1] Bioconductor
+## generics 0.0.2 2018-11-29 [1] RSPM (R 4.0.0)
+## GenomeInfoDb * 1.26.1 2020-11-20 [1] Bioconductor
+## GenomeInfoDbData 1.2.4 2020-12-01 [1] Bioconductor
+## GenomicRanges * 1.42.0 2020-10-27 [1] Bioconductor
+## getopt 1.20.3 2019-03-22 [1] RSPM (R 4.0.0)
+## ggplot2 3.3.2 2020-06-19 [1] RSPM (R 4.0.1)
+## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.2)
+## graph 1.68.0 2020-10-27 [1] Bioconductor
+## GSEABase 1.52.1 2020-12-11 [1] Bioconductor
+## GSVA * 1.38.0 2020-10-27 [1] Bioconductor
+## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.0.0)
+## hms 0.5.3 2020-01-08 [1] RSPM (R 4.0.0)
+## htmltools 0.5.0 2020-06-16 [1] RSPM (R 4.0.1)
+## httr 1.4.2 2020-07-20 [1] RSPM (R 4.0.2)
+## IRanges * 2.24.0 2020-10-27 [1] Bioconductor
+## jsonlite 1.7.1 2020-09-07 [1] RSPM (R 4.0.2)
+## knitr 1.30 2020-09-22 [1] RSPM (R 4.0.2)
+## lattice 0.20-41 2020-04-02 [2] CRAN (R 4.0.2)
+## lifecycle 0.2.0 2020-03-06 [1] RSPM (R 4.0.0)
+## limma * 3.46.0 2020-10-27 [1] Bioconductor
+## locfit 1.5-9.4 2020-03-25 [1] RSPM (R 4.0.0)
+## magrittr * 1.5 2014-11-22 [1] RSPM (R 4.0.0)
+## Matrix 1.2-18 2019-11-27 [2] CRAN (R 4.0.2)
+## MatrixGenerics * 1.2.0 2020-10-27 [1] Bioconductor
+## matrixStats * 0.57.0 2020-09-25 [1] RSPM (R 4.0.2)
+## memoise 1.1.0 2017-04-21 [1] RSPM (R 4.0.0)
+## msigdbr 7.2.1 2020-10-02 [1] RSPM (R 4.0.2)
+## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.0.0)
+## mvtnorm 1.1-1 2020-06-09 [1] RSPM (R 4.0.0)
+## nlme 3.1-148 2020-05-24 [2] CRAN (R 4.0.2)
+## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0)
+## org.Hs.eg.db * 3.12.0 2020-12-01 [1] Bioconductor
+## pheatmap 1.0.12 2019-01-04 [1] RSPM (R 4.0.0)
+## pillar 1.4.6 2020-07-10 [1] RSPM (R 4.0.2)
+## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.0)
+## ps 1.4.0 2020-10-07 [1] RSPM (R 4.0.2)
+## purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.0)
+## qusage * 2.24.0 2020-10-27 [1] Bioconductor
+## R.cache 0.14.0 2019-12-06 [1] RSPM (R 4.0.0)
+## R.methodsS3 1.8.1 2020-08-26 [1] RSPM (R 4.0.2)
+## R.oo 1.24.0 2020-08-26 [1] RSPM (R 4.0.2)
+## R.utils 2.10.1 2020-08-26 [1] RSPM (R 4.0.2)
+## R6 2.4.1 2019-11-12 [1] RSPM (R 4.0.0)
+## RColorBrewer 1.1-2 2014-12-07 [1] RSPM (R 4.0.0)
+## Rcpp 1.0.5 2020-07-06 [1] RSPM (R 4.0.2)
+## RCurl 1.98-1.2 2020-04-18 [1] RSPM (R 4.0.0)
+## readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.2)
+## rematch2 2.1.2 2020-05-01 [1] RSPM (R 4.0.0)
+## rlang 0.4.8 2020-10-08 [1] RSPM (R 4.0.2)
+## rmarkdown 2.4 2020-09-30 [1] RSPM (R 4.0.2)
+## RSQLite 2.2.1 2020-09-30 [1] RSPM (R 4.0.2)
+## rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0)
+## S4Vectors * 0.28.0 2020-10-27 [1] Bioconductor
+## scales 1.1.1 2020-05-11 [1] RSPM (R 4.0.0)
+## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.0)
+## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.2)
+## stringr 1.4.0 2019-02-10 [1] RSPM (R 4.0.0)
+## styler 1.3.2 2020-02-23 [1] RSPM (R 4.0.0)
+## SummarizedExperiment * 1.20.0 2020-10-27 [1] Bioconductor
+## survival 3.1-12 2020-04-10 [2] CRAN (R 4.0.2)
+## tibble 3.0.4 2020-10-12 [1] RSPM (R 4.0.2)
+## tidyselect 1.1.0 2020-05-11 [1] RSPM (R 4.0.0)
+## vctrs 0.3.4 2020-08-29 [1] RSPM (R 4.0.2)
+## withr 2.3.0 2020-09-22 [1] RSPM (R 4.0.2)
+## xfun 0.18 2020-09-29 [1] RSPM (R 4.0.2)
+## XML 3.99-0.5 2020-07-23 [1] RSPM (R 4.0.2)
+## xtable 1.8-4 2019-04-21 [1] RSPM (R 4.0.0)
+## XVector 0.30.0 2020-10-27 [1] Bioconductor
+## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.0)
+## zlibbioc 1.36.0 2020-10-27 [1] Bioconductor
+##
+## [1] /usr/local/lib/R/site-library
+## [2] /usr/local/lib/R/library
+Broad Institute Team, 2019 Gene set enrichment analysis (gsea) user guide. https://www.gsea-msigdb.org/gsea/doc/GSEAUserGuideFrame.html
+Carlson M., 2020 org.Hs.eg.db: Genome wide annotation for human. http://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html
+Dolgalev I., 2020 msigdbr: MSigDB gene sets for multiple organisms in a tidy data format. https://cran.r-project.org/web/packages/msigdbr/index.html
+Gu Z., R. Eils, and M. Schlesner, 2016 Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics. https://doi.org/10.1093/bioinformatics/btw313
+Hänzelmann S., R. Castelo, and J. Guinney, 2013 GSVA: Gene set variation analysis for microarray and RNA-Seq data. BMC Bioinformatics 14. https://doi.org/10.1186/1471-2105-14-7
+Hänzelmann S., R. Castelo, and J. Guinney, 2013 GSVA. https://github.com/rcastelo/GSVA/blob/master/man/gsva.Rd
+Khatri P., M. Sirota, and A. J. Butte, 2012 Ten years of pathway analysis: Current approaches and outstanding challenges. PLOS Computational Biology 8: e1002375. https://doi.org/10.1371/journal.pcbi.1002375
+Liberzon A., C. Birger, H. Thorvaldsdóttir, M. Ghandi, and J. P. Mesirov et al., 2015 The molecular signatures database hallmark gene set collection. Cell Systems 1. https://doi.org/10.1016/j.cels.2015.12.004
+Liberzon A., A. Subramanian, R. Pinchback, H. Thorvaldsdóttir, and P. Tamayo et al., 2011 Molecular signatures database (MSigDB) 3.0. Bioinformatics 27: 1739–1740. https://doi.org/10.1093/bioinformatics/btr260
+Love M. I., W. Huber, and S. Anders, 2014 Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biology 15. https://doi.org/10.1186/s13059-014-0550-8
+Malhotra S., 2018 Decoding gene set variation analysis. https://towardsdatascience.com/decoding-gene-set-variation-analysis-8193a0cfda3
+Slowikowski K., 2017 Make heatmaps in R with pheatmap. https://slowkow.com/notes/pheatmap-tutorial/
+Subramanian A., P. Tamayo, V. K. Mootha, S. Mukherjee, and B. L. Ebert et al., 2005 Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 102: 15545–15550. https://doi.org/10.1073/pnas.0506580102
+Yaari G., C. R. Bolen, J. Thakar, and S. H. Kleinstein, 2013 Quantitative set analysis for gene expression: A method to quantify gene set differential expression including gene-gene correlations. Nucleic Acids Research 41: e170. https://doi.org/10.1093/nar/gkt660
+🚧 Advanced topics are coming soon: Under construction! 🚧
+ diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd index 6bee663b..bca779e0 100644 --- a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd +++ b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd @@ -53,7 +53,7 @@ if (!dir.exists("data")) { } # Define the file path to the plots directory -plots_dir <- "plots" # Can replace with path to desired output plots directory +plots_dir <- "plots" # Create the plots folder if it doesn't exist if (!dir.exists(plots_dir)) { @@ -61,7 +61,7 @@ if (!dir.exists(plots_dir)) { } # Define the file path to the results directory -results_dir <- "results" # Can replace with path to desired output results directory +results_dir <- "results" # Create the results folder if it doesn't exist if (!dir.exists(results_dir)) { @@ -75,7 +75,7 @@ In the same place you put this `.Rmd` file, you should now have three new empty For general information about downloading data for these examples, see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-the-data). -Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/SRP133573/identification-of-transcription-factor-relationships-associated-with-androgen-deprivation-therapy-response-and-metastatic-progression-in-prostate-cancer). +Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/SRP140558). Click the "Download Now" button on the right side of this screen. @@ -96,9 +96,9 @@ You will get an email when it is ready. ## About the dataset we are using for this example -For this example analysis, we will use this [prostate cancer dataset](https://www.refine.bio/experiments/SRP133573). -The data that we downloaded from refine.bio for this analysis has 175 RNA-seq samples obtained from 20 patients with prostate cancer. 
-Patients underwent androgen deprivation therapy (ADT) and RNA-seq samples include pre-ADT biopsies and post-ADT prostatectomy specimens. +For this example analysis, we will use this [acute viral bronchiolitis dataset](https://www.refine.bio/experiments/SRP140558). +The data that we downloaded from refine.bio for this analysis has 62 paired peripheral blood mononuclear cell RNA-seq samples obtained from 31 patients. +Samples were collected at two time points: during their first, acute bronchiolitis visit (abbreviated "AV") and their recovery, their post-convalescence visit (abbreviated "CV"). ## Place the dataset in your new `data/` folder @@ -113,7 +113,7 @@ For more details on the contents of this folder see [these docs on refine.bio](h The `In this example, we use weighted gene co-expression network analysis (WGCNA) to identify co-expressed gene modules (Langfelder and Horvath 2008). WGCNA uses a series of correlations to identify sets of genes that are expressed together in your data set. This is a fairly intuitive approach to gene network analysis which can aid in interpretation of microarray & RNAseq data.
-As output, WGCNA gives groups of co-expressed genes as well as an eigengene x sample matrix (where the values for each eigengene represent the summarized expression for a group of co-expressed genes) (Langfelder and Horvath 2007). This eigengene x sample data can, in many instances, be used as you would the original gene expression values. In this example, we use eigengene x sample data to identify differentially expressed modules between our treatment and control group
+In this example, we use weighted gene co-expression network analysis (WGCNA) to identify co-expressed gene modules (Langfelder and Horvath 2008). WGCNA uses a series of correlations to identify sets of genes that are expressed together in your data set. This is a fairly intuitive approach to gene network analysis which can aid in interpretation of microarray & RNA-seq data.
+As output, WGCNA gives groups of co-expressed genes as well as an eigengene x sample matrix (where the values for each eigengene represent the summarized expression for a group of co-expressed genes) (Langfelder and Horvath 2007). This eigengene x sample data can, in many instances, be used as you would the original gene expression values. In this example, we use eigengene x sample data to identify differentially expressed modules between our treatment and control group
This method does require some computing power, but can still be run locally (on your own computer) for most refine.bio datasets. As with many clustering and network methods, there are some parameters that may need tweaking.
For general information about downloading data for these examples, see our ‘Getting Started’ section.
-Go to this dataset’s page on refine.bio.
+Go to this dataset’s page on refine.bio.
Click the “Download Now” button on the right side of this screen.
Fill out the pop up window with your email and our Terms and Conditions:
@@ -3826,7 +3835,7 @@For this example analysis, we will use this prostate cancer dataset. The data that we downloaded from refine.bio for this analysis has 175 RNA-seq samples obtained from 20 patients with prostate cancer. Patients underwent androgen deprivation therapy (ADT) and RNA-seq samples include pre-ADT biopsies and post-ADT prostatectomy specimens.
+For this example analysis, we will use this acute viral bronchiolitis dataset. The data that we downloaded from refine.bio for this analysis has 62 paired peripheral blood mononuclear cell RNA-seq samples obtained from 31 patients. Samples were collected at two time points: during their first, acute bronchiolitis visit (abbreviated “AV”) and their recovery, their post-convalescence visit (abbreviated “CV”).
data/
folderFor more details on the contents of this folder see these docs on refine.bio.
The <experiment_accession_id>
folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235
or SRP12345
.
Copy and paste the SRP133573
folder into your newly created data/
folder.
Copy and paste the SRP140558
folder into your newly created data/
folder.
SRP133573
folder which contains:
+SRP140558
folder which contains:
In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.
First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.
# Define the file path to the data directory
-data_dir <- file.path("data", "SRP133573") # Replace with accession number which will be the name of the folder the files will be in
-
-# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
-data_file <- file.path(data_dir, "SRP133573.tsv") # Replace with file path to your dataset
-
-# Declare the file path to the metadata file using the data directory saved as `data_dir`
-metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv") # Replace with file path to your metadata
Now that our file paths are declared, we can use the file.exists()
function to check that the files are where we specified above.
# Check if the gene expression matrix file is at the file path stored in `data_file`
+# Check if the gene expression matrix file is at the path stored in `data_file`
file.exists(data_file)
## [1] TRUE
# Check if the metadata file is at the file path stored in `metadata_file`
@@ -3899,9 +3913,9 @@ 4 Identifying co-expression gene
4.1 Install libraries
See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.
-We will be using DESeq2
to normalize and transform our RNA-seq data before running WGCNA, so we will need to install that (Love et al. 2014).
-Of course, we will need the WGCNA
package (Langfelder and Horvath 2008). But WGCNA
also requires a package called impute
that it sometimes has trouble installing so we recommend installing that first (Hastie et al. 2020).
-For plotting purposes will be creating a sina
plot and heatmaps which we will need a ggplot2
companion package for, called ggforce
as well as the ComplexHeatmap
package (Gu 2020).
+We will be using DESeq2
to normalize and transform our RNA-seq data before running WGCNA, so we will need to install that (Love et al. 2014).
+Of course, we will need the WGCNA
package (Langfelder and Horvath 2008). But WGCNA
also requires a package called impute
that it sometimes has trouble installing so we recommend installing that first (Hastie et al. 2020).
+For plotting purposes, we will be creating a sina
plot and heatmaps, for which we will need a ggplot2
companion package called ggforce
, as well as the ComplexHeatmap
package (Gu 2020).
if (!("DESeq2" %in% installed.packages())) {
# Install this package if it isn't installed yet
BiocManager::install("DESeq2", update = FALSE)
@@ -3928,480 +3942,406 @@ 4.1 Install libraries
}
Attach some of the packages we need for this analysis.
-## Loading required package: S4Vectors
-## Loading required package: stats4
-## Loading required package: BiocGenerics
-## Loading required package: parallel
-##
-## Attaching package: 'BiocGenerics'
-## The following objects are masked from 'package:parallel':
-##
-## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
-## clusterExport, clusterMap, parApply, parCapply, parLapply,
-## parLapplyLB, parRapply, parSapply, parSapplyLB
-## The following objects are masked from 'package:stats':
-##
-## IQR, mad, sd, var, xtabs
-## The following objects are masked from 'package:base':
-##
-## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
-## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
-## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
-## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
-## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
-## union, unique, unsplit, which.max, which.min
-##
-## Attaching package: 'S4Vectors'
-## The following object is masked from 'package:base':
-##
-## expand.grid
-## Loading required package: IRanges
-## Loading required package: GenomicRanges
-## Loading required package: GenomeInfoDb
-## Loading required package: SummarizedExperiment
-## Loading required package: MatrixGenerics
-## Loading required package: matrixStats
-##
-## Attaching package: 'MatrixGenerics'
-## The following objects are masked from 'package:matrixStats':
-##
-## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
-## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
-## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
-## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
-## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
-## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
-## colWeightedMeans, colWeightedMedians, colWeightedSds,
-## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
-## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
-## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
-## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
-## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
-## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
-## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
-## rowWeightedSds, rowWeightedVars
-## Loading required package: Biobase
-## Welcome to Bioconductor
-##
-## Vignettes contain introductory material; view with
-## 'browseVignettes()'. To cite Bioconductor, see
-## 'citation("Biobase")', and for packages 'citation("pkgname")'.
-##
-## Attaching package: 'Biobase'
-## The following object is masked from 'package:MatrixGenerics':
-##
-## rowMedians
-## The following objects are masked from 'package:matrixStats':
-##
-## anyMissing, rowMedians
-# We will need this so we can use the pipe: %>%
-library(magrittr)
-
-# We'll need this for finding gene modules
-library(WGCNA)
-## Loading required package: dynamicTreeCut
-## Loading required package: fastcluster
-##
-## Attaching package: 'fastcluster'
-## The following object is masked from 'package:stats':
-##
-## hclust
-##
-##
-## Attaching package: 'WGCNA'
-## The following object is masked from 'package:IRanges':
-##
-## cor
-## The following object is masked from 'package:S4Vectors':
-##
-## cor
-## The following object is masked from 'package:stats':
-##
-## cor
-
+library(DESeq2)
+
+# We will need this so we can use the pipe: %>%
+library(magrittr)
+
+# We'll need this for finding gene modules
+library(WGCNA)
+
+# We'll be making some plots
+library(ggplot2)
4.2 Import and set up data
Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read the both TSV files and add them as data frames to your environment.
We stored our file paths as objects named metadata_file
and data_file
in this previous step.
-
+
##
-## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────
+## ── Column specification ──────────────────────────────────────────────────────────────────────────────────
## cols(
-## .default = col_character(),
-## refinebio_age = col_logical(),
-## refinebio_cell_line = col_logical(),
-## refinebio_compound = col_logical(),
-## refinebio_disease_stage = col_logical(),
-## refinebio_genetic_information = col_logical(),
-## refinebio_processed = col_logical(),
-## refinebio_sex = col_logical(),
-## refinebio_source_archive_url = col_logical(),
-## refinebio_specimen_part = col_logical(),
-## refinebio_time = col_logical()
+## .default = col_logical(),
+## refinebio_accession_code = col_character(),
+## experiment_accession = col_character(),
+## refinebio_organism = col_character(),
+## refinebio_platform = col_character(),
+## refinebio_source_database = col_character(),
+## refinebio_subject = col_character(),
+## refinebio_title = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
-# Read in data TSV file
-df <- readr::read_tsv(data_file) %>%
- # Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later
- tibble::column_to_rownames("Gene")
+# Read in data TSV file
+df <- readr::read_tsv(data_file) %>%
+ # Here we are going to store the gene IDs as row names so that we can have a numeric matrix to perform calculations on later
+ tibble::column_to_rownames("Gene")
##
-## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────
+## ── Column specification ──────────────────────────────────────────────────────────────────────────────────
## cols(
## .default = col_double(),
## Gene = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
Let’s ensure that the metadata and data are in the same sample order.
-# Make the data in the order of the metadata
-df <- df %>%
- dplyr::select(metadata$refinebio_accession_code)
-
-# Check if this is in the same order
-all.equal(colnames(df), metadata$refinebio_accession_code)
+# Make the data in the order of the metadata
+df <- df %>%
+ dplyr::select(metadata$refinebio_accession_code)
+
+# Check if this is in the same order
+all.equal(colnames(df), metadata$refinebio_accession_code)
## [1] TRUE
4.2.1 Prepare data for DESeq2
-There are two things we neeed to do to prep our expression data for DESeq2.
+There are two things we need to do to prep our expression data for DESeq2.
First, we need to make sure all of the values in our data are converted to integers as required by a DESeq2
function we will use later.
Then, we need to filter out the genes that have not been expressed or that have low expression counts. This is recommended by the WGCNA docs for RNA-seq data. Removing low-count genes can also help improve your WGCNA results. We are going to do some pre-filtering to keep only genes with 50 or more reads in total across the samples.
-# The next DESeq2 functions need the values to be converted to integers
-df <- round(df) %>%
- # The next steps require a data frame and round() returns a matrix
- as.data.frame() %>%
- # Only keep rows that have total counts above the cutoff
- dplyr::filter(rowSums(.) >= 50)
-Another thing we need to do is make sure our main experimental group label is set up. In this case refinebio_treatment
has two groups: pre-adt
and post-adt
. To keep these two treatments in logical (rather than alphabetical) order, we will convert this to a factor with pre-adt
as the first level.
-metadata <- metadata %>%
- dplyr::mutate(refinebio_treatment = factor(refinebio_treatment,
- levels = c("pre-adt", "post-adt")
- ))
-Let’s double check that our factor set up is right.
-
-## [1] "pre-adt" "post-adt"
+# The next DESeq2 functions need the values to be converted to integers
+df <- round(df) %>%
+ # The next steps require a data frame and round() returns a matrix
+ as.data.frame() %>%
+ # Only keep rows that have total counts above the cutoff
+ dplyr::filter(rowSums(.) >= 50)
+Another thing we need to do is set up our main experimental group variable. Unfortunately, the metadata for this dataset are not set up in separate, neat columns, but we can accomplish that ourselves.
+For this study, PBMCs were collected at two time points: during the patients’ first, acute bronchiolitis visit (abbreviated “AV”) and their recovery visit (called post-convalescence and abbreviated “CV”).
+For handier use of this information, we can create a new variable, time_point
, that states this info more clearly. This new time_point
variable will have two labels: acute illness
and recovering
based on the AV
or CV
coding located in the refinebio_title
string variable.
+metadata <- metadata %>%
+ dplyr::mutate(
+ time_point = dplyr::case_when(
+ # Create our new variable based on refinebio_title containing AV/CV
+ stringr::str_detect(refinebio_title, "_AV_") ~ "acute illness",
+ stringr::str_detect(refinebio_title, "_CV_") ~ "recovering"
+ ),
+ # It's easier for future items if this is already set up as a factor
+ time_point = as.factor(time_point)
+ )
+Let’s double check that our factor setup is right. We want acute illness
to be the first level since it was the first time point collected.
+
+## [1] "acute illness" "recovering"
+Great! We’re all set.
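The check itself is collapsed in the rendered output above; a minimal sketch of what it presumably looks like:

```r
# Print out the levels of our new factor -- "acute illness" should come first
levels(metadata$time_point)
```

If the levels had come out in the wrong order, `relevel(metadata$time_point, ref = "acute illness")` would reorder them.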
4.3 Create a DESeqDataset
-We will be using the DESeq2
package for normalizing and transforming our data, which requires us to format our data into a DESeqDataSet
object. We turn the data frame (or matrix) into a DESeqDataSet
object and specify which variable labels our experimental groups using the design
argument (Love et al. 2014). In this chunk of code, we will not provide a specific model to the design
argument because we are not performing a differential expression analysis.
-# Create a `DESeqDataSet` object
-dds <- DESeqDataSetFromMatrix(
- countData = df, # Our prepped data frame with counts
- colData = metadata, # Data frame with annotation for our samples
- design = ~1 # Here we are not specifying a model
-)
+We will be using the DESeq2
package for normalizing and transforming our data, which requires us to format our data into a DESeqDataSet
object. We turn the data frame (or matrix) into a DESeqDataSet
object and specify which variable labels our experimental groups using the design
argument (Love et al. 2014). In this chunk of code, we will not provide a specific model to the design
argument because we are not performing a differential expression analysis.
+# Create a `DESeqDataSet` object
+dds <- DESeqDataSetFromMatrix(
+ countData = df, # Our prepped data frame with counts
+ colData = metadata, # Data frame with annotation for our samples
+ design = ~1 # Here we are not specifying a model
+)
## converting counts to integer mode
4.4 Perform DESeq2 normalization and transformation
We often suggest normalizing and transforming your data for various applications, and in this instance WGCNA’s authors suggest using a variance stabilizing transformation before running WGCNA.
We are going to use the vst()
function from the DESeq2
package to normalize and transform the data. For more information about these transformation methods, see here.
-# Normalize and transform the data in the `DESeqDataSet` object using the `vst()`
-# function from the `DESEq2` R package
-dds_norm <- vst(dds)
-At this point, if your data has any outliers, you should look into removing them as they can affect your WGCNA results. WGCNA’s tutorial has an example of exploring your data for outliers you can reference.
+# Normalize and transform the data in the `DESeqDataSet` object using the `vst()`
+# function from the `DESEq2` R package
+dds_norm <- vst(dds)
+At this point, if your data set has any outlier samples, you should look into removing them as they can affect your WGCNA results.
+WGCNA’s tutorial has an example of exploring your data for outliers you can reference.
+For this example data set, we will skip this step (there are no obvious outliers) and proceed.
4.5 Format normalized data for WGCNA
Extract the normalized counts to a matrix and transpose it so we can pass it to WGCNA.
-
+
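The extraction code is collapsed above; a sketch of the likely chunk, using the `assay()` accessor from SummarizedExperiment (the `normalized_counts` name matches what the later chunks use):

```r
# Retrieve the normalized data from the `DESeqDataSet` and transpose it,
# since WGCNA expects samples as rows and genes as columns
normalized_counts <- assay(dds_norm) %>%
  t()
```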
4.6 Determine parameters for WGCNA
To identify which genes are in the same modules, WGCNA first creates a weighted network to define which genes are near each other. The measure of “adjacency” it uses is based on the correlation matrix, but requires the definition of a threshold value, which in turn depends on a “power” parameter that defines the exponent used when transforming the correlation values. The choice of power parameter will affect the number of modules identified, and the WGCNA package provides the pickSoftThreshold()
function to help identify good choices for this parameter.
-sft <- pickSoftThreshold(normalized_counts,
- dataIsExpr = TRUE,
- corFnc = cor,
- networkType = "signed"
-)
-## Warning: executing %dopar% sequentially: no parallel backend registered
+sft <- pickSoftThreshold(normalized_counts,
+ dataIsExpr = TRUE,
+ corFnc = cor,
+ networkType = "signed"
+)
+## Warning: executing %dopar% sequentially: no parallel backend
+## registered
## Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
-## 1 1 0.58200 12.200 0.957 13500.0 13600.0 15500
-## 2 2 0.44500 5.130 0.972 7630.0 7650.0 9910
-## 3 3 0.26300 2.570 0.985 4480.0 4450.0 6680
-## 4 4 0.06480 0.914 0.985 2730.0 2680.0 4720
-## 5 5 0.00662 -0.236 0.964 1720.0 1660.0 3450
-## 6 6 0.15900 -1.010 0.965 1120.0 1060.0 2580
-## 7 7 0.36500 -1.470 0.971 746.0 689.0 1980
-## 8 8 0.50000 -1.730 0.972 509.0 459.0 1550
-## 9 9 0.59700 -1.910 0.972 356.0 313.0 1220
-## 10 10 0.67000 -2.060 0.973 253.0 217.0 982
-## 11 12 0.74000 -2.260 0.970 135.0 110.0 651
-## 12 14 0.79400 -2.320 0.978 76.9 58.6 447
-## 13 16 0.82000 -2.350 0.981 45.9 32.7 315
-## 14 18 0.83800 -2.360 0.985 28.6 18.9 227
-## 15 20 0.84500 -2.350 0.987 18.5 11.2 167
+## 1 1 0.0491 42.50 0.947 13400.0 13400.00 13600
+## 2 2 0.8530 -12.60 0.871 7230.0 7080.00 8430
+## 3 3 0.8800 -5.41 0.856 4120.0 3900.00 5840
+## 4 4 0.8910 -3.28 0.864 2470.0 2230.00 4340
+## 5 5 0.9060 -2.39 0.882 1560.0 1310.00 3380
+## 6 6 0.9140 -1.96 0.895 1030.0 798.00 2740
+## 7 7 0.9220 -1.72 0.908 706.0 496.00 2280
+## 8 8 0.9190 -1.58 0.910 504.0 314.00 1940
+## 9 9 0.9180 -1.48 0.917 371.0 203.00 1680
+## 10 10 0.9080 -1.42 0.915 282.0 134.00 1470
+## 11 12 0.9050 -1.34 0.927 174.0 60.40 1170
+## 12 14 0.8870 -1.31 0.927 116.0 28.60 964
+## 13 16 0.8660 -1.32 0.918 81.7 14.00 810
+## 14 18 0.8560 -1.33 0.921 59.7 7.13 692
+## 15 20 0.8570 -1.33 0.929 45.0 3.71 599
This sft
object has a lot of information; we will want to plot some of it to figure out what our power
soft-threshold should be. We have to first calculate a measure of the model fit, the signed \(R^2\), and make that a new variable.
-
+
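The hidden chunk computes the signed \(R^2\) from `sft$fitIndices`; this is a sketch consistent with the plot code below (the `sft_df` and `model_fit` names are what the ggplot call expects):

```r
sft_df <- data.frame(sft$fitIndices) %>%
  # The signed R^2 combines the model fit R^2 with the sign of the slope
  dplyr::mutate(model_fit = -sign(slope) * SFT.R.sq)
```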
Now, let’s plot the model fitting by the power
soft threshold so we can decide on a soft-threshold for power.
-ggplot(sft_df, aes(x = Power, y = model_fit, label = Power)) +
- # Plot the points
- geom_point() +
- # We'll put the Power labels slightly above the data points
- geom_text(nudge_y = 0.1) +
- # We will plot what WGCNA recommends as an R^2 cutoff
- geom_hline(yintercept = 0.80, col = "red") +
- # Just in case our values are low, we want to make sure we can still see the 0.80 level
- ylim(c(min(sft_df$model_fit), 1)) +
- # We can add more sensible labels for our axis
- xlab("Soft Threshold (power)") +
- ylab("Scale Free Topology Model Fit, signed R^2") +
- ggtitle("Scale independence") +
- # This adds some nicer aesthetics to our plot
- theme_classic()
-
+ggplot(sft_df, aes(x = Power, y = model_fit, label = Power)) +
+ # Plot the points
+ geom_point() +
+ # We'll put the Power labels slightly above the data points
+ geom_text(nudge_y = 0.1) +
+ # We will plot what WGCNA recommends as an R^2 cutoff
+ geom_hline(yintercept = 0.80, col = "red") +
+ # Just in case our values are low, we want to make sure we can still see the 0.80 level
+ ylim(c(min(sft_df$model_fit), 1.05)) +
+ # We can add more sensible labels for our axis
+ xlab("Soft Threshold (power)") +
+ ylab("Scale Free Topology Model Fit, signed R^2") +
+ ggtitle("Scale independence") +
+ # This adds some nicer aesthetics to our plot
+ theme_classic()
+
Using this plot we can decide on a power parameter. WGCNA’s authors recommend using a power
that has a signed \(R^2\) above 0.80
, otherwise they warn your results may be too noisy to be meaningful.
-If you have multiple power values with signed \(R^2\) above 0.80
, then picking the one at an inflection point, in other words where the \(R^2\) values seem to have reached their saturation (Zhang and Horvath 2005). You want to a power
that gives you a big enough \(R^2\) but is not excessively large.
-So using the plot above, going with a power soft-threshold of 16
!
+If you have multiple power values with signed \(R^2\) above 0.80
, then pick the one at an inflection point, in other words where the \(R^2\) values seem to have reached their saturation (Zhang and Horvath 2005). You want a power
that gives you a big enough \(R^2\) but is not excessively large.
+So, using the plot above, we are going with a power soft-threshold of 7
!
If you find that all of your \(R^2\) values are very low, this may be because too many genes with low expression values are cluttering up the calculations. You can try returning to the gene filtering step and choosing a more stringent cutoff (you’ll then need to re-run the transformation and subsequent steps to remake this plot to see if that helped).
4.7 Run WGCNA!
-We will use the blockwiseModules()
function to find gene co-expression modules in WGCNA, using 16
for the power
argument like we determined above.
+We will use the blockwiseModules()
function to find gene co-expression modules in WGCNA, using 7
for the power
argument like we determined above.
This next step takes some time to run. The blockwise
part of the blockwiseModules()
function name refers to the fact that these calculations are done on chunks of your data at a time, to help conserve computing resources.
Here we are using the default maxBlockSize
, 5000, but you may want to adjust the maxBlockSize
argument depending on your computer’s memory. The authors of WGCNA recommend running the largest block your computer can handle, and they provide some approximations as to what maxBlockSize
a laptop or workstation with a given amount of memory should be able to handle:
• If the reader has access to a large workstation with more than 4 GB of memory, the parameter maxBlockSize can be increased. A 16GB workstation should handle up to 20000 probes; a 32GB workstation should handle perhaps 30000. A 4GB standard desktop or a laptop may handle up to 8000-10000 probes, depending on operating system and other running programs.
-(Langfelder and Horvath 2016)
-bwnet <- blockwiseModules(normalized_counts,
- maxBlockSize = 5000, # What size chunks (how many genes) the calculations should be run in
- TOMType = "signed", # topological overlap matrix
- power = 16, # soft threshold for network construction
- numericLabels = TRUE, # Let's use numbers instead of colors for module labels
- randomSeed = 1234, # there's some randomness associated with this calculation
- # so we should set a seed
-)
-The TOMtype
argument specifies what kind of topological overlap matrix (TOM) should be used to make gene modules. You can safely assume for most situations a signed
network represents what you want – we want WGCNA to pay attention to directionality. However if you suspect you may benefit from an unsigned
network, where positive/negative is ignored see this article to help you figure that out (Langfelder 2018).
-There are a lot of other settings you can tweak – look at ?blockwiseModules
help page as well as the WGCNA tutorial (Langfelder and Horvath 2016).
+(Langfelder and Horvath 2016)
+bwnet <- blockwiseModules(normalized_counts,
+ maxBlockSize = 5000, # What size chunks (how many genes) the calculations should be run in
+ TOMType = "signed", # topological overlap matrix
+ power = 7, # soft threshold for network construction
+ numericLabels = TRUE, # Let's use numbers instead of colors for module labels
+ randomSeed = 1234, # there's some randomness associated with this calculation
+ # so we should set a seed
+)
+The TOMType
argument specifies what kind of topological overlap matrix (TOM) should be used to make gene modules. You can safely assume for most situations a signed
network represents what you want – we want WGCNA to pay attention to directionality. However, if you suspect you may benefit from an unsigned
network, where positive/negative is ignored, see this article to help you figure that out (Langfelder 2018).
+There are a lot of other settings you can tweak – look at ?blockwiseModules
help page as well as the WGCNA tutorial (Langfelder and Horvath 2016).
4.8 Write main WGCNA results object to file
We will save our whole results object to an RDS file in case we want to return to our original WGCNA results.
-
+
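The collapsed chunk writes `bwnet` out with a readr call along these lines (the file path here is only a placeholder):

```r
readr::write_rds(bwnet,
  file = file.path("results", "wgcna_results.RDS") # hypothetical output path
)
```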
4.9 Explore our WGCNA results
The bwnet
object has many parts, storing a lot of information. We can pull out the parts we are most interested in and may want to use for plotting.
In bwnet
we have a data frame of eigengene module data for each sample in the MEs
slot. These represent the collapsed, combined, and normalized expression of the genes that make up each module.
-
+
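The extraction itself is collapsed above; it presumably amounts to pulling out the `MEs` slot:

```r
# Pull out the list element that stores the module eigengenes
module_eigengenes <- bwnet$MEs

# Print out a preview
head(module_eigengenes)
```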
4.10 Which modules have biggest differences across treatment groups?
We can also see if our eigengenes relate to our metadata labels. First we double check that our samples are still in order.
-
+
## [1] TRUE
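The collapsed check that printed TRUE above is likely a one-liner comparing the eigengene row names to the metadata accession codes:

```r
# The eigengene row names should match the metadata accession codes
all.equal(metadata$refinebio_accession_code, rownames(module_eigengenes))
```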
-# Create the design matrix from the refinebio_treatment variable
-des_mat <- model.matrix(~ metadata$refinebio_treatment)
+# Create the design matrix from the `time_point` variable
+des_mat <- model.matrix(~ metadata$time_point)
Run a linear model on each module. limma wants our tests to be per row, so we also need to transpose so that the eigengenes are rows.
-# lmFit() needs a transposed version of the matrix
-fit <- limma::lmFit(t(module_eigengenes), design = des_mat)
-
-# Apply empirical Bayes to smooth standard errors
-fit <- limma::eBayes(fit)
+# lmFit() needs a transposed version of the matrix
+fit <- limma::lmFit(t(module_eigengenes), design = des_mat)
+
+# Apply empirical Bayes to smooth standard errors
+fit <- limma::eBayes(fit)
Apply multiple testing correction and obtain stats in a data frame.
-# Apply multiple testing correction and obtain stats
-stats_df <- limma::topTable(fit, number = ncol(module_eigengenes)) %>%
- tibble::rownames_to_column("module")
+# Apply multiple testing correction and obtain stats
+stats_df <- limma::topTable(fit, number = ncol(module_eigengenes)) %>%
+ tibble::rownames_to_column("module")
## Removing intercept from test coefficients
Let’s take a look at the results. They are sorted with the most significant results at the top.
-
+
-Module 52 seems to be the most differentially expressed across refinebio_treatment
groups. Now we can do some investigation into this module.
+Module 19 seems to be the most differentially expressed across time_point
groups. Now we can do some investigation into this module.
-
-4.11 Let’s make plot of module 52
-As a sanity check, let’s use ggplot
to see what module 52’s eigengene looks like between treatment groups.
+
+4.11 Let’s make plot of module 19
+As a sanity check, let’s use ggplot
to see what module 18’s eigengene looks like between treatment groups.
First we need to set up the module eigengene for this module with the sample metadata labels we need.
-module_52_df <- module_eigengenes %>%
- tibble::rownames_to_column("accession_code") %>%
- # Here we are performing an inner join with a subset of metadata
- dplyr::inner_join(metadata %>%
- dplyr::select(refinebio_accession_code, refinebio_treatment),
- by = c("accession_code" = "refinebio_accession_code")
- )
+module_19_df <- module_eigengenes %>%
+ tibble::rownames_to_column("accession_code") %>%
+ # Here we are performing an inner join with a subset of metadata
+ dplyr::inner_join(metadata %>%
+ dplyr::select(refinebio_accession_code, time_point),
+ by = c("accession_code" = "refinebio_accession_code")
+ )
Now we are ready for plotting.
-ggplot(
- module_52_df,
- aes(
- x = refinebio_treatment,
- y = ME52,
- color = refinebio_treatment
- )
-) +
- # a boxplot with outlier points hidden (they will be in the sina plot)
- geom_boxplot(width = 0.2, outlier.shape = NA) +
- # A sina plot to show all of the individual data points
- ggforce::geom_sina(maxwidth = 0.3) +
- theme_classic()
-
+ggplot(
+ module_19_df,
+ aes(
+ x = time_point,
+ y = ME19,
+ color = time_point
+ )
+) +
+ # a boxplot with outlier points hidden (they will be in the sina plot)
+ geom_boxplot(width = 0.2, outlier.shape = NA) +
+ # A sina plot to show all of the individual data points
+ ggforce::geom_sina(maxwidth = 0.3) +
+ theme_classic()
+
This makes sense! Looks like module 19 has elevated expression during the acute illness but not when recovering.
-
-4.12 What genes are a part of module 52?
+
+4.12 What genes are a part of module 19?
If you want to know which of your genes make up a module, you can look at the $colors
slot. This is a named list which associates the genes with the module they are a part of. We can turn this into a data frame for handy use.
-gene_module_key <- tibble::enframe(bwnet$colors, name = "gene", value = "module") %>%
- # Let's add the `ME` part so its more clear what these numbers are and it matches elsewhere
- dplyr::mutate(module = paste0("ME", module))
-Now we can find what genes are a part of module 52.
-
+gene_module_key <- tibble::enframe(bwnet$colors, name = "gene", value = "module") %>%
+ # Let's add the `ME` part so it's clearer what these numbers are and it matches elsewhere
+ dplyr::mutate(module = paste0("ME", module))
+Now we can find what genes are a part of module 19.
+
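The collapsed chunk presumably filters the key we just made:

```r
gene_module_key %>%
  dplyr::filter(module == "ME19")
```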
Let’s save this gene-to-module key to a TSV file for future use.
-
+
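A sketch of the collapsed write-out step (the file path is only a placeholder):

```r
readr::write_tsv(
  gene_module_key,
  file.path("results", "wgcna_gene_to_module.tsv") # hypothetical output path
)
```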
4.13 Make a custom heatmap function
We will make a heatmap that summarizes our differentially expressed module. Because we will make a couple of these, it makes sense to make a custom function for making this heatmap.
-make_module_heatmap <- function(module_name,
- expression_mat = normalized_counts,
- metadata_df = metadata,
- gene_module_key_df = gene_module_key,
- module_eigengenes_df = module_eigengenes) {
- # Create a summary heatmap of a given module.
- #
- # Args:
- # module_name: a character indicating what module should be plotted, e.g. "ME52"
- # expression_mat: The full gene expression matrix. Default is `normalized_counts`.
- # metadata_df: a data frame with refinebio_accession_code and refinebio_treatment
- # as columns. Default is `metadata`.
- # gene_module_key: a data.frame indicating what genes are a part of what modules. Default is `gene_module_key`.
- # module_eigengenes: a sample x eigengene data.frame with samples as rownames. Default is `module_eigengenes`.
- #
- # Returns:
- # A heatmap of expression matrix for a module's genes, with a barplot of the
- # eigengene expression for that module.
-
- # Set up the module eigengene with its refinebio_accession_code
- module_eigengene <- module_eigengenes_df %>%
- dplyr::select(module_name) %>%
- tibble::rownames_to_column("refinebio_accession_code")
-
- # Set up column annotation from metadata
- col_annot_df <- metadata_df %>%
- # Only select the treatment and sample ID columns
- dplyr::select(refinebio_accession_code, refinebio_treatment) %>%
- # Add on the eigengene expression by joining with sample IDs
- dplyr::inner_join(module_eigengene, by = "refinebio_accession_code") %>%
- # Arrange by treatment
- dplyr::arrange(refinebio_treatment, refinebio_accession_code) %>%
- # Store sample
- tibble::column_to_rownames("refinebio_accession_code")
-
- # Create the ComplexHeatmap column annotation object
- col_annot <- ComplexHeatmap::HeatmapAnnotation(
- # Supply treatment labels
- refinebio_treatment = col_annot_df$refinebio_treatment,
- # Add annotation barplot
- module_eigengene = ComplexHeatmap::anno_barplot(dplyr::select(col_annot_df, module_name)),
- # Pick colors for each experimental group in refinebio_treatment
- col = list(refinebio_treatment = c("post-adt" = "#f1a340", "pre-adt" = "#998ec3"))
- )
-
- # Get a vector of the Ensembl gene IDs that correspond to this module
- module_genes <- gene_module_key_df %>%
- dplyr::filter(module == module_name) %>%
- dplyr::pull(gene)
-
- # Set up the gene expression data frame
- mod_mat <- expression_mat %>%
- t() %>%
- as.data.frame() %>%
- # Only keep genes from this module
- dplyr::filter(rownames(.) %in% module_genes) %>%
- # Order the samples to match col_annot_df
- dplyr::select(rownames(col_annot_df)) %>%
- # Data needs to be a matrix
- as.matrix()
-
- # Normalize the gene expression values
- mod_mat <- mod_mat %>%
- # Scale can work on matrices, but it does it by column so we will need to
- # transpose first
- t() %>%
- scale() %>%
- # And now we need to transpose back
- t()
-
- # Create a color function based on standardized scale
- color_func <- circlize::colorRamp2(
- c(-2, 0, 2),
- c("#67a9cf", "#f7f7f7", "#ef8a62")
- )
-
- # Plot on a heatmap
- heatmap <- ComplexHeatmap::Heatmap(mod_mat,
- name = module_name,
- # Supply color function
- col = color_func,
- # Supply column annotation
- bottom_annotation = col_annot,
- # We don't want to cluster samples
- cluster_columns = FALSE,
- # We don't need to show sample or gene labels
- show_row_names = FALSE,
- show_column_names = FALSE
- )
-
- # Return heatmap
- return(heatmap)
-}
+make_module_heatmap <- function(module_name,
+ expression_mat = normalized_counts,
+ metadata_df = metadata,
+ gene_module_key_df = gene_module_key,
+ module_eigengenes_df = module_eigengenes) {
+ # Create a summary heatmap of a given module.
+ #
+ # Args:
+ # module_name: a character indicating what module should be plotted, e.g. "ME19"
+ # expression_mat: The full gene expression matrix. Default is `normalized_counts`.
+ # metadata_df: a data frame with refinebio_accession_code and time_point
+ # as columns. Default is `metadata`.
+ # gene_module_key: a data.frame indicating what genes are a part of what modules. Default is `gene_module_key`.
+ # module_eigengenes: a sample x eigengene data.frame with samples as row names. Default is `module_eigengenes`.
+ #
+ # Returns:
+ # A heatmap of expression matrix for a module's genes, with a barplot of the
+ # eigengene expression for that module.
+
+ # Set up the module eigengene with its refinebio_accession_code
+ module_eigengene <- module_eigengenes_df %>%
+ dplyr::select(all_of(module_name)) %>%
+ tibble::rownames_to_column("refinebio_accession_code")
+
+ # Set up column annotation from metadata
+ col_annot_df <- metadata_df %>%
+ # Only select the treatment and sample ID columns
+ dplyr::select(refinebio_accession_code, time_point, refinebio_subject) %>%
+ # Add on the eigengene expression by joining with sample IDs
+ dplyr::inner_join(module_eigengene, by = "refinebio_accession_code") %>%
+ # Arrange by patient and time point
+ dplyr::arrange(time_point, refinebio_subject) %>%
+ # Store sample
+ tibble::column_to_rownames("refinebio_accession_code")
+
+ # Create the ComplexHeatmap column annotation object
+ col_annot <- ComplexHeatmap::HeatmapAnnotation(
+ # Supply treatment labels
+ time_point = col_annot_df$time_point,
+ # Add annotation barplot
+ module_eigengene = ComplexHeatmap::anno_barplot(dplyr::select(col_annot_df, module_name)),
+ # Pick colors for each experimental group in time_point
+ col = list(time_point = c("recovering" = "#f1a340", "acute illness" = "#998ec3"))
+ )
+
+ # Get a vector of the Ensembl gene IDs that correspond to this module
+ module_genes <- gene_module_key_df %>%
+ dplyr::filter(module == module_name) %>%
+ dplyr::pull(gene)
+
+ # Set up the gene expression data frame
+ mod_mat <- expression_mat %>%
+ t() %>%
+ as.data.frame() %>%
+ # Only keep genes from this module
+ dplyr::filter(rownames(.) %in% module_genes) %>%
+ # Order the samples to match col_annot_df
+ dplyr::select(rownames(col_annot_df)) %>%
+ # Data needs to be a matrix
+ as.matrix()
+
+ # Normalize the gene expression values
+ mod_mat <- mod_mat %>%
+ # Scale can work on matrices, but it does it by column so we will need to
+ # transpose first
+ t() %>%
+ scale() %>%
+ # And now we need to transpose back
+ t()
+
+ # Create a color function based on standardized scale
+ color_func <- circlize::colorRamp2(
+ c(-2, 0, 2),
+ c("#67a9cf", "#f7f7f7", "#ef8a62")
+ )
+
+ # Plot on a heatmap
+ heatmap <- ComplexHeatmap::Heatmap(mod_mat,
+ name = module_name,
+ # Supply color function
+ col = color_func,
+ # Supply column annotation
+ bottom_annotation = col_annot,
+ # We don't want to cluster samples
+ cluster_columns = FALSE,
+ # We don't need to show sample or gene labels
+ show_row_names = FALSE,
+ show_column_names = FALSE
+ )
+
+ # Return heatmap
+ return(heatmap)
+}
4.14 Make module heatmaps
-Let’s try out the custom heatmap function with module 52 (our most differentially expressed module).
-
+Let’s try out the custom heatmap function with module 19 (our most differentially expressed module).
+
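The collapsed call (which triggers the tidyselect note below) is presumably along these lines; the `mod_19_heatmap` object name is an assumption:

```r
mod_19_heatmap <- make_module_heatmap(module_name = "ME19")

# Print out the plot
mod_19_heatmap
```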
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(module_name)` instead of `module_name` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
-
+
From the barplot portion of our plot, we can see acute illness
samples tend to have higher expression values for the module 19 eigengene. In the heatmap portion, we can see how the individual genes that make up module 19 are overall higher than in the recovering
samples.
We can save this plot to PNG.
-
+
## png
## 2
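The save step is collapsed above; its `png / 2` output is what `dev.off()` returns when closing the device. A sketch, assuming the heatmap was stored as `mod_19_heatmap` (both that name and the file path are assumptions):

```r
# Open a PNG graphics device, explicitly draw the heatmap object, then
# close the device to write the file
png(file.path("results", "module_19_heatmap.png")) # hypothetical output path
ComplexHeatmap::draw(mod_19_heatmap)
dev.off()
```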
For comparison, let’s try out the custom heatmap function with a different, not differentially expressed module.
-
+
In this non-significant module’s heatmap, there’s not a particularly strong pattern between acute illness and recovering samples, though we can still see that the genes in this module are highly correlated with each other (which is how we found them in the first place, so this makes sense!).
Save this plot also.
-
+
## png
## 2
@@ -4409,18 +4349,18 @@ 4.14 Make module heatmaps
5 Resources for further learning
-- WGCNA FAQ page (Langfelder and Horvath 2016).
-- WGCNA tutorial (Langfelder and Horvath 2016).
-- WGCNA paper (Langfelder and Horvath 2008).
-- ComplexHeatmap’s tutorial guide for more info on how to tweak the heatmaps (Gu 2020).
+- WGCNA FAQ page (Langfelder and Horvath 2016).
+- WGCNA tutorial (Langfelder and Horvath 2016).
+- WGCNA paper (Langfelder and Horvath 2008).
+- ComplexHeatmap’s tutorial guide for more info on how to tweak the heatmaps (Gu 2020).
6 Session info
At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.
-
-## ─ Session info ───────────────────────────────────────────────────────────────
+
+## ─ Session info ─────────────────────────────────────────────────────
## setting value
## version R version 4.0.2 (2020-06-22)
## os Ubuntu 20.04 LTS
@@ -4430,9 +4370,9 @@ 6 Session info
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
-## date 2020-11-30
+## date 2020-12-16
##
-## ─ Packages ───────────────────────────────────────────────────────────────────
+## ─ Packages ─────────────────────────────────────────────────────────
## package * version date lib source
## annotate 1.68.0 2020-10-27 [1] Bioconductor
## AnnotationDbi 1.52.0 2020-10-27 [1] Bioconductor
@@ -4475,8 +4415,8 @@ 6 Session info
## genefilter 1.72.0 2020-10-27 [1] Bioconductor
## geneplotter 1.68.0 2020-10-27 [1] Bioconductor
## generics 0.0.2 2018-11-29 [1] RSPM (R 4.0.0)
-## GenomeInfoDb * 1.26.1 2020-11-20 [1] Bioconductor
-## GenomeInfoDbData 1.2.4 2020-11-25 [1] Bioconductor
+## GenomeInfoDb * 1.26.2 2020-12-08 [1] Bioconductor
+## GenomeInfoDbData 1.2.4 2020-12-16 [1] Bioconductor
## GenomicRanges * 1.42.0 2020-10-27 [1] Bioconductor
## getopt 1.20.3 2019-03-22 [1] RSPM (R 4.0.0)
## GetoptLong 1.0.3 2020-10-01 [1] RSPM (R 4.0.2)
@@ -4484,7 +4424,7 @@ 6 Session info
## ggplot2 * 3.3.2 2020-06-19 [1] RSPM (R 4.0.1)
## GlobalOptions 0.1.2 2020-06-10 [1] RSPM (R 4.0.0)
## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.2)
-## GO.db 3.12.1 2020-11-25 [1] Bioconductor
+## GO.db 3.12.1 2020-12-16 [1] Bioconductor
## gridExtra 2.3 2017-09-09 [1] RSPM (R 4.0.0)
## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.0.0)
## Hmisc 4.4-1 2020-08-10 [1] RSPM (R 4.0.2)
@@ -4494,7 +4434,7 @@ 6 Session info
## htmlwidgets 1.5.2 2020-10-03 [1] RSPM (R 4.0.2)
## httr 1.4.2 2020-07-20 [1] RSPM (R 4.0.2)
## impute 1.64.0 2020-10-27 [1] Bioconductor
-## IRanges * 2.24.0 2020-10-27 [1] Bioconductor
+## IRanges * 2.24.1 2020-12-12 [1] Bioconductor
## iterators 1.0.12 2019-07-26 [1] RSPM (R 4.0.0)
## jpeg 0.1-8.1 2019-10-24 [1] RSPM (R 4.0.0)
## jsonlite 1.7.1 2020-09-07 [1] RSPM (R 4.0.2)
@@ -4538,7 +4478,7 @@ 6 Session info
## rpart 4.1-15 2019-04-12 [2] CRAN (R 4.0.2)
## RSQLite 2.2.1 2020-09-30 [1] RSPM (R 4.0.2)
## rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0)
-## S4Vectors * 0.28.0 2020-10-27 [1] Bioconductor
+## S4Vectors * 0.28.1 2020-12-09 [1] Bioconductor
## scales 1.1.1 2020-05-11 [1] RSPM (R 4.0.0)
## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.0)
## shape 1.4.5 2020-09-13 [1] RSPM (R 4.0.2)
@@ -4560,67 +4500,8 @@ 6 Session info
## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.0)
## zlibbioc 1.36.0 2020-10-27 [1] Bioconductor
##
-## locale:
-## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
-## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
-## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
-## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
-## [9] LC_ADDRESS=C LC_TELEPHONE=C
-## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
-##
-## attached base packages:
-## [1] parallel stats4 stats graphics grDevices utils datasets
-## [8] methods base
-##
-## other attached packages:
-## [1] ggplot2_3.3.2 WGCNA_1.69
-## [3] fastcluster_1.1.25 dynamicTreeCut_1.63-1
-## [5] magrittr_1.5 DESeq2_1.30.0
-## [7] SummarizedExperiment_1.20.0 Biobase_2.50.0
-## [9] MatrixGenerics_1.2.0 matrixStats_0.57.0
-## [11] GenomicRanges_1.42.0 GenomeInfoDb_1.26.0
-## [13] IRanges_2.24.0 S4Vectors_0.28.0
-## [15] BiocGenerics_0.36.0 optparse_1.6.6
-##
-## loaded via a namespace (and not attached):
-## [1] colorspace_1.4-1 rjson_0.2.20 ellipsis_0.3.1
-## [4] circlize_0.4.10 htmlTable_2.1.0 XVector_0.30.0
-## [7] GlobalOptions_0.1.2 base64enc_0.1-3 clue_0.3-57
-## [10] rstudioapi_0.11 farver_2.0.3 getopt_1.20.3
-## [13] bit64_4.0.5 AnnotationDbi_1.52.0 fansi_0.4.1
-## [16] codetools_0.2-16 splines_4.0.2 R.methodsS3_1.8.1
-## [19] doParallel_1.0.15 impute_1.64.0 geneplotter_1.68.0
-## [22] knitr_1.30 polyclip_1.10-0 jsonlite_1.7.1
-## [25] Formula_1.2-3 Cairo_1.5-12.2 annotate_1.68.0
-## [28] cluster_2.1.0 GO.db_3.12.1 png_0.1-7
-## [31] R.oo_1.24.0 ggforce_0.3.2 readr_1.4.0
-## [34] compiler_4.0.2 httr_1.4.2 backports_1.1.10
-## [37] assertthat_0.2.1 Matrix_1.2-18 limma_3.46.0
-## [40] cli_2.1.0 tweenr_1.0.1 htmltools_0.5.0
-## [43] tools_4.0.2 gtable_0.3.0 glue_1.4.2
-## [46] GenomeInfoDbData_1.2.4 dplyr_1.0.2 Rcpp_1.0.5
-## [49] styler_1.3.2 vctrs_0.3.4 preprocessCore_1.52.0
-## [52] iterators_1.0.12 xfun_0.18 stringr_1.4.0
-## [55] ps_1.4.0 lifecycle_0.2.0 XML_3.99-0.5
-## [58] MASS_7.3-51.6 zlibbioc_1.36.0 scales_1.1.1
-## [61] hms_0.5.3 rematch2_2.1.2 RColorBrewer_1.1-2
-## [64] ComplexHeatmap_2.6.0 yaml_2.2.1 memoise_1.1.0
-## [67] gridExtra_2.3 rpart_4.1-15 latticeExtra_0.6-29
-## [70] stringi_1.5.3 RSQLite_2.2.1 genefilter_1.72.0
-## [73] foreach_1.5.0 checkmate_2.0.0 BiocParallel_1.24.1
-## [76] shape_1.4.5 rlang_0.4.8 pkgconfig_2.0.3
-## [79] bitops_1.0-6 evaluate_0.14 lattice_0.20-41
-## [82] purrr_0.3.4 htmlwidgets_1.5.2 labeling_0.3
-## [85] bit_4.0.4 tidyselect_1.1.0 R6_2.4.1
-## [88] magick_2.4.0 generics_0.0.2 Hmisc_4.4-1
-## [91] DelayedArray_0.16.0 DBI_1.1.0 pillar_1.4.6
-## [94] foreign_0.8-80 withr_2.3.0 survival_3.1-12
-## [97] RCurl_1.98-1.2 nnet_7.3-14 tibble_3.0.4
-## [100] crayon_1.3.4 rmarkdown_2.4 GetoptLong_1.0.3
-## [103] jpeg_0.1-8.1 locfit_1.5-9.4 grid_4.0.2
-## [106] data.table_1.13.0 blob_1.2.1 digest_0.6.25
-## [109] xtable_1.8-4 R.cache_0.14.0 R.utils_2.10.1
-## [112] munsell_0.5.0
+## [1] /usr/local/lib/R/site-library
+## [2] /usr/local/lib/R/library
References
diff --git a/04-advanced-topics/validate_differential_expression_adv_topics_00_author_de.Rmd b/04-advanced-topics/validate_differential_expression_adv_topics_00_author_de.Rmd
index 70a9955a..49532f04 100644
--- a/04-advanced-topics/validate_differential_expression_adv_topics_00_author_de.Rmd
+++ b/04-advanced-topics/validate_differential_expression_adv_topics_00_author_de.Rmd
@@ -36,9 +36,9 @@ if (!("GEOquery" %in% installed.packages())) {
Attach the `limma` library:
-```{r}
-# Magrittr pipe
-`%>%` <- dplyr::`%>%`
+```{r message=FALSE}
+# We will need this so we can use the pipe: %>%
+library(magrittr)
# Attach library
library(limma)
@@ -64,6 +64,7 @@ if (!dir.exists(plots_dir)) {
}
# Define the file path to the data directory
+# Replace with the path of the folder the files will be in
data_dir <- "data" # Replace with path to data directory
# Make a data directory if it isn't created yet
diff --git a/04-advanced-topics/validate_differential_expression_adv_topics_01.Rmd b/04-advanced-topics/validate_differential_expression_adv_topics_01.Rmd
index d726a289..5e477e8b 100644
--- a/04-advanced-topics/validate_differential_expression_adv_topics_01.Rmd
+++ b/04-advanced-topics/validate_differential_expression_adv_topics_01.Rmd
@@ -81,9 +81,9 @@ if (!("VennDiagram" %in% installed.packages())) {
Attach the `limma` library:
-```{r}
-# Magrittr pipe
-`%>%` <- dplyr::`%>%`
+```{r message=FALSE}
+# We will need this so we can use the pipe: %>%
+library(magrittr)
# Attach library
library(limma)
@@ -109,6 +109,7 @@ if (!dir.exists(plots_dir)) {
}
# Define the file path to the data directory
+# Replace with the path of the folder the files will be in
data_dir <- "data" # Replace with path to data directory
```
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 9f950be9..606bf8bf 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -8,7 +8,9 @@
- [Setting up the docker container](#setting-up-the-docker-container)
- [Docker image updates](#docker-image-updates)
- [Download the datasets](#download-the-datasets)
-- [Add a new analysis](#add-a-new-analysis)
+- [Adding a new analysis](#adding-a-new-analysis)
+ - [Draft PR: Big picture reviews](#draft-pr-big-picture-reviews)
+ - [Refined PRs: Detailed reviews](#refined-prs-detailed-reviews)
- [Setting up a new analysis file](#setting-up-a-new-analysis-file)
- [How to use the template.Rmd](#how-to-use-the-templatermd)
- [Adding datasets to the S3 bucket](#adding-datasets-to-the-s3-bucket)
@@ -89,18 +91,35 @@ For development purposes, you can download all the datasets for the example note
scripts/download-data.sh
```
-## Add a new analysis
+## Adding a new analysis
-Here are the summarized steps for adding a new analysis.
-Click on the links to go to the detailed instructions for each step.
+Our PR process for adding a new analysis involves two stages, spread across 2-3 (or more) PRs.
+Splitting an analysis into multiple PRs helps make the review process more manageable.
+It ensures that discussions around the big picture (conceptual decisions, which steps are included, which packages are used) are generally concluded before review moves on to the details and further polishing.
+Note that all of the following steps describe PRs to the `staging` branch only (see more [about the branch setup](#pull-requests)).
+
+### Draft PR: Big picture reviews
- On your new git branch, [set up the analysis file from the template](#setting-up-the-analysis-file).
-- Add a [link to the html file to `_navbar.html`](#add-new-analyses-to-the-navbar)
+- Get the basic steps of the analysis set up and create a draft PR for a big-picture review (not every description needs to be 100% word-smithed, but the general steps/outline should be reflected).
+- Mark sections that encapsulate the main concepts as ready for review with a `**REVIEW**` tag, and/or use a `**DRAFT**` tag to indicate a section that hasn't been worked on much yet.
+- After the general outline of the analysis has been agreed upon in review, incorporate the major feedback from the draft PR before splitting off new branches to file your refined PRs.
+- Keep the original draft PR open for easy reference.
+
+### Refined PRs: Detailed reviews
+
+Break up the steps of the analysis into manageable review chunks on their own branches for detailed review (you may want to discuss what the chunks should be on the Draft PR).
+- Delete any `**REVIEW/DRAFT**` tags leftover from the draft PR.
+- Make sure each step's explanations are fully realized for these PRs.
+- Ensure that the notebook adheres to [the guidelines](#guidelines-for-analysis-notebooks).
- [Cite sources and add them to the reference.bib file](#citing-sources-in-text)
+- If the file has been [added to snakemake](#add-new-analyses-to-the-snakefile), run [snakemake for rendering](#how-to-re-render-the-notebooks-locally) in the Docker container to make sure it renders.
+
+These steps should be done in the first refined PR, but don't need to be done again:
+- Add a [link to the html file to `_navbar.html`](#add-new-analyses-to-the-navbar)
- Add [data and metadata files to S3](#adding-datasets-to-the-s3-bucket)
- Add not yet added packages needed for this analysis to the Dockerfile (make sure it successfully builds).
- Add the [expected output html file to snakemake](#add-new-analyses-to-the-snakefile)
-- In the Docker container, run [snakemake for rendering](#how-to-re-render-the-notebooks-locally)
### Setting up a new analysis file
@@ -479,12 +498,15 @@ Hopefully the error message helps you track down the problem, but you can also c
### About the render-notebooks.R script
The `render-notebooks.R` script adds a `bibliography:` specification in the `.Rmd` header so all citations are automatically rendered.
+An additional file of R code to include can also be specified; it should only be used to set options for rendering, such as the output width.
+No code that affects the computational behavior of the notebook should be included here, as it will be sourced in a hidden chunk and not visible to users.
It also adds other components like CSS styling, a footer, and Google Analytics (these items are all hard-coded into the script).
**Options:**
- `--rmd`: provided by snakemake, the input `.Rmd` file to render.
- `--bib_file`: File path for the `bibliography:` header option.
-Default is the `references.bib` in the `components` folder.
+- `--cite_style`: File path for a CSL file to control the citation style.
+- `--include_file`: File path for code to be sourced at the start of the notebook but hidden from rendering.
- `--html`: Allows you to specify an output file name; snakemake passes this option explicitly. The default is to save the output `.html` file with the same name as the input `.Rmd` file.
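Putting these options together, a manual invocation of the script (mirroring the arguments the `Snakefile` passes; the notebook path here is only an example) might look like:

```sh
# Render one notebook with citations, a citation style, and the include file
Rscript scripts/render-notebooks.R \
  --rmd 03-rnaseq/pathway-analysis_rnaseq_01_ora.Rmd \
  --bib_file components/references.bib \
  --cite_style components/genetics.csl \
  --include_file components/include.R \
  --html 03-rnaseq/pathway-analysis_rnaseq_01_ora.html \
  --style
```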
### Add new analyses to the Snakefile
diff --git a/Snakefile b/Snakefile
index cb3705c3..cd92aa49 100644
--- a/Snakefile
+++ b/Snakefile
@@ -8,7 +8,9 @@ rule target:
"02-microarray/dimension-reduction_microarray_01_pca.html",
"02-microarray/dimension-reduction_microarray_02_umap.html",
"02-microarray/gene-id-annotation_microarray_01_ensembl.html",
- "02-microarray/pathway-analysis_microarray_02_ora.html",
+ "02-microarray/pathway-analysis_microarray_01_ora.html",
+ "02-microarray/pathway-analysis_microarray_02_gsea.html",
+ "02-microarray/pathway-analysis_microarray_03_gsva.html",
"02-microarray/ortholog-mapping_microarray_01_ensembl.html",
"03-rnaseq/00-intro-to-rnaseq.html",
"03-rnaseq/clustering_rnaseq_01_heatmap.html",
@@ -17,6 +19,9 @@ rule target:
"03-rnaseq/dimension-reduction_rnaseq_02_umap.html",
"03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.html",
"03-rnaseq/ortholog-mapping_rnaseq_01_ensembl.html",
+ "03-rnaseq/pathway-analysis_rnaseq_01_ora.html",
+ "03-rnaseq/pathway-analysis_rnaseq_02_gsea.html",
+ "03-rnaseq/pathway-analysis_rnaseq_03_gsva.html",
"04-advanced-topics/00-intro-to-advanced-topics.html",
"04-advanced-topics/network-analysis_rnaseq_01_wgcna.html"
@@ -31,5 +36,6 @@ rule render_citations:
" --rmd {input.rmd}"
" --bib_file components/references.bib"
" --cite_style components/genetics.csl"
+ " --include_file components/include.R"
" --html {output}"
" --style"
diff --git a/components/_navbar.html b/components/_navbar.html
index a4cfcbdd..179c8365 100644
--- a/components/_navbar.html
+++ b/components/_navbar.html
@@ -7,7 +7,7 @@
-
+
@@ -32,7 +32,9 @@
Differential Expression - Several groups
Dimension Reduction - PCA
Dimension Reduction - UMAP
- Pathway Analysis - ORA
+ Pathway Analysis - ORA
+ Pathway Analysis - GSEA
+ Pathway Analysis - GSVA
Ensembl Gene ID Annotation
Ortholog Mapping
@@ -51,6 +53,9 @@
Dimension Reduction - UMAP
Ensembl Gene ID Annotation
Ortholog Mapping
+ Pathway Analysis - ORA
+ Pathway Analysis - GSEA
+ Pathway Analysis - GSVA
diff --git a/components/dictionary.txt b/components/dictionary.txt
index 1c16ce0d..ae1df0b0 100644
--- a/components/dictionary.txt
+++ b/components/dictionary.txt
@@ -3,20 +3,22 @@ ADT
al
AML
AnaLysis
+arounds
ASXL
Azacytidine
-arounds
Baumer
-Benjamini
BeadArray
+Benjamini
bioconductor
bioinformatics
Brainarray
Brainarray’s
Brems
+bronchiolitis
CCDL
CCDL's
cDNA
+centric
cheatsheet
cheatsheets
ChIP
@@ -26,12 +28,14 @@ ColorBrewer
Compara
ComplexHeatmap
ComplexHeatmap's
+concordantly
CREB
Crouser
cytosolic
Danio
dataset
dataset's
+de
decitabine
DESeq
DESeq2
@@ -43,34 +47,37 @@ displaystyle
DocToc
dorsoventral
duplications
+e.g.
ECM
edgeR
edgeR's
eigengene
-e.g.
ENSDARG
Ensembl
ENSG
Entrez
et
FACS
+FPKM
frac
functionalize
-FPKM
+GC
GeneChips
Genenames
generalizable
ggplot
GitHub
+glioblastoma
glioma
-GC
GSE
GSEA
+GSVA
HCOP
hexamer
HGNC
histological
Hochberg
+Hs
hypomethylating
IDH
Illumina
@@ -82,46 +89,58 @@ JPEG
KEGG
limma
logFC
+lymphoblastic
maxBlockSize
medulloblastoma
microarray’s
molecularly
-musculus
+mononuclear
MSigDB
+musculus
myeloid
nb
NES
Northcott
+oocyte
ortholog
+orthologous
orthologs
orthology
overexpressing
overexpression
+overrepresented
+PBMCs
pheatmap
PLOS
PLX
PNAS
PNG
polymerase
+pre
prostatectomy
PRPS
-pre
-`.Rmd`
-QuSAGE
-qusage
+PRS
QN
+quartile
+qusage
+QuSAGE
README
+recode
rerio
ribosomes
RMA
+Rmd
RPKMs
+RPL
RStudio
+sina
ssGSEA
StatQuest
-SuperSeries
str
+Subramanian
subtype
subtypes
+SuperSeries
tidyverse
TPM
TPMs
@@ -139,6 +158,7 @@ UpSet
vivo
WGCNA
WGCNA's
+wild-type
WNT
Yaari
zebrafish
diff --git a/components/include.R b/components/include.R
new file mode 100644
index 00000000..6b9c2537
--- /dev/null
+++ b/components/include.R
@@ -0,0 +1,2 @@
+# Set width for text output
+options(width = 70)
diff --git a/components/references.bib b/components/references.bib
index fe6b4488..9458777a 100644
--- a/components/references.bib
+++ b/components/references.bib
@@ -70,7 +70,7 @@ @article{Brems2017
url = {https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c}
}
-@website{Carlson2019,
+@website{Carlson2019-mouse,
title = {Genome wide annotation for Mouse},
author = {Marc Carlson},
year = {2019},
@@ -92,14 +92,6 @@ @manual{Carlson2020-human
url = {http://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html}
}
-
-@website{Carlson2020-package,
- title = {{AnnotationDbi}},
- author = {Marc Carlson},
- year = {2020},
- url = {https://www.bioconductor.org/packages/release/bioc/html/AnnotationDbi.html}
-}
-
@website{Carlson2020-vignette,
title = {{AnnotationDbi}: Introduction To Bioconductor Annotation Packages},
author = {Marc Carlson},
@@ -124,6 +116,7 @@ @manual{CCDL2020-cluster-validation
@website{clusterProfiler-book,
title = {{clusterProfiler}: universal enrichment tool for functional and comparative study},
author = {Guangchuang Yu},
+ year = {2020},
url = {http://yulab-smu.top/clusterProfiler-book/index.html}
}
@@ -158,6 +151,15 @@ @website{dge-workshop-deseq2
url = {https://hbctraining.github.io/DGE_workshop/lessons/04_DGE_DESeq2_analysis.html}
}
+@manual{Dolgalev2020,
+ title = {{msigdbr}: {MSigDB} Gene Sets for Multiple Organisms in a Tidy Data Format},
+ author = {Igor Dolgalev},
+ year = {2020},
+ note = {https://igordot.github.io/msigdbr,
+https://github.com/igordot/msigdbr},
+ url = {https://cran.r-project.org/web/packages/msigdbr/index.html}
+}
+
@website{Freytag2019,
title = {Workshop: Dimension reduction with {R}},
author = {Saskia Freytag},
@@ -179,6 +181,13 @@ @manual{gene-id-annotation-rna-seq
url = {https://github.com/AlexsLemonade/refinebio-examples/blob/master/03-rnaseq/gene-id-convert_rnaseq_01_ensembl.html}
}
+@website{gene-set-testing-rnaseq,
+ author = {Stephane Ballereau and Mark Dunning and Abbi Edwards and Oscar Rueda and Ashley Sawle},
+ title = {{RNA-seq} analysis in {R}: Gene Set Testing for {RNA-seq}},
+ year = {2018},
+ url = {https://bioinformatics-core-shared-training.github.io/cruk-summer-school-2018/RNASeq2018/html/06_Gene_set_testing.nb.html}
+}
+
@manual{Gonzalez2014,
title = {Statistical analysis of {RNA-Seq} data},
author = {Ignacio Gonzalez},
@@ -209,6 +218,20 @@ @article{Gray2015
url = {https://pubmed.ncbi.nlm.nih.gov/25361968/}
}
+@website{GSEA-broad-institute,
+ title = {{GSEA}: Gene Set Enrichment Analysis},
+ author = {{UC San Diego and Broad Institute Team}},
+ url = {https://www.gsea-msigdb.org/gsea/index.jsp}
+}
+
+@website{GSEA-user-guide,
+ title = {Gene Set Enrichment Analysis (GSEA) User Guide},
+ author = {{Broad Institute Team}},
+ month = {November},
+ year = {2019},
+ url = {https://www.gsea-msigdb.org/gsea/doc/GSEAUserGuideFrame.html}
+}
+
@article{Gu2016,
title = {Complex heatmaps reveal patterns and correlations in multidimensional genomic data},
author = {Zuguang Gu and Roland Eils and Matthias Schlesner},
@@ -232,6 +255,19 @@ @website{Hadfield2016
year = {2016},
url = {https://bitesizebio.com/13542/what-everyone-should-know-about-rna-seq/}
}
+
+@article{Meng2019,
+ author = {Hailong Meng and Gur Yaari and Christopher R. Bolen and Stefan Avey and Steven H. Kleinstein},
+ title = {Gene set meta-analysis with Quantitative Set Analysis for Gene Expression (QuSAGE)},
+ journal = {PLoS Comput Biol},
+ year = {2019},
+ volume = {15},
+ number = {4},
+ pages = {e1006899},
+ month = {April},
+ doi = {10.1371/journal.pcbi.1006899},
+ url = {https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006899}
+}
@article{Hansen2010,
author = {Hansen, K. D. and Brenner, S. E. and Dudoit, S. },
@@ -246,6 +282,44 @@ @article{Hansen2010
url = {https://pubmed.ncbi.nlm.nih.gov/20395217/}
}
+
+@article{Hanzelmann2013,
+ doi = {10.1186/1471-2105-14-7},
+ url = {https://doi.org/10.1186/1471-2105-14-7},
+ year = {2013},
+ publisher = {Springer Science and Business Media {LLC}},
+ volume = {14},
+ number = {1},
+ pages = {7},
+ author = {Sonja H{\"a}nzelmann and Robert Castelo and Justin Guinney},
+ title = {{GSVA}: gene set variation analysis for microarray and {RNA}-Seq data},
+ journal = {{BMC} Bioinformatics}
+}
+
+@website{Hanzelmann-github,
+ title = {GSVA},
+ author = {Sonja H{\"a}nzelmann and Robert Castelo and Justin Guinney},
+ year = {2013},
+ url = {https://github.com/rcastelo/GSVA/blob/master/man/gsva.Rd}
+}
+
+@manual{Hanzelmann-gsva-vignette,
+ title = {GSVA: The Gene Set Variation Analysis package for microarray and RNA-seq data},
+ author = {Sonja H{\"a}nzelmann and Robert Castelo and Justin Guinney},
+ year = {2020},
+ url = {https://www.bioconductor.org/packages/release/bioc/vignettes/GSVA/inst/doc/GSVA.pdf},
+}
+
@manual{Hastie2020,
title = {impute: Imputation for microarray data},
author = {Trevor Hastie and Robert Tibshirani and Balasubramanian Narasimhan and Gilbert Chu},
@@ -274,21 +348,25 @@ @article{Huber2015
url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4509590/}
}
-@manual{Igor2020,
- title = {msigdbr: MSigDB Gene Sets for Multiple Organisms in a Tidy Data Format},
- author = {Igor Dolgalev},
- year = {2020},
- note = {https://igordot.github.io/msigdbr,
-https://github.com/igordot/msigdbr},
- url = {https://cran.r-project.org/web/packages/msigdbr/index.html}
-}
-
@website{intro-microarray-slides,
title = {Introduction to gene expression microarray data analysis},
author = {Hao Wu},
url = {http://web1.sph.emory.edu/users/hwu30/teaching/bioc/GE1.pdf}
}
+@article{Kampen2019,
+ doi = {10.1038/s41467-019-10508-2},
+ year = {2019},
+ month = jun,
+ publisher = {Springer Science and Business Media {LLC}},
+ volume = {10},
+ number = {1},
+ author = {Kim R. Kampen and Laura Fancello and Tiziana Girardi and Gianmarco Rinaldi and M{\'{e}}lanie Planque and Sergey O. Sulima and Fabricio Loayza-Puch and Benno Verbelen and Stijn Vereecke and Jelle Verbeeck and Joyce Op de Beeck and Jonathan Royaert and Pieter Vermeersch and David Cassiman and Jan Cools and Reuven Agami and Mark Fiers and Sarah-Maria Fendt and Kim De Keersmaecker},
+ title = {Translatome analysis reveals altered serine and glycine metabolism in T-cell acute lymphoblastic leukemia cells},
+ journal = {Nature Communications},
+ url = {https://doi.org/10.1038/s41467-019-10508-2},
+}
+
@article{Kanehisa2000,
author = {Kanehisa, M. and Goto, S. },
title = {{KEGG}: {Kyoto} encyclopedia of genes and genomes},
@@ -403,6 +481,31 @@ @website{LCSciences2014
url = {https://www.lcsciences.com/news/microarray-or-rna-sequencing/}
}
+@article{Liberzon2011,
+ doi = {10.1093/bioinformatics/btr260},
+ url = {https://doi.org/10.1093/bioinformatics/btr260},
+ year = {2011},
+ month = may,
+ publisher = {Oxford University Press ({OUP})},
+ volume = {27},
+ number = {12},
+ pages = {1739--1740},
+ author = {Arthur Liberzon and Aravind Subramanian and Reid Pinchback and Helga Thorvaldsdóttir and Pablo Tamayo and Jill P Mesirov},
+ title = {Molecular signatures database ({MSigDB}) 3.0},
+ journal = {Bioinformatics}
+}
+
+@article{Liberzon2015,
+ title = {The Molecular Signatures Database Hallmark Gene Set Collection},
+ author = {Arthur Liberzon and Chet Birger and Helga Thorvaldsdóttir and Mahmoud Ghandi and Jill P. Mesirov and Pablo Tamayo},
+ year = {2015},
+ journal = {Cell Systems},
+ volume = {1},
+ number = {6},
+ doi = {10.1016/j.cels.2015.12.004},
+ url = {https://dx.doi.org/10.1016%2Fj.cels.2015.12.004}
+}
+
@article{Love2014,
title = {Moderated estimation of fold change and dispersion for {RNA-Seq} data with {DESeq2}},
author = {Michael I. Love and Wolfgang Huber and Simon Anders},
@@ -450,6 +553,13 @@ @website{Love2020
url = {https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html}
}
+@website{Malhotra2018,
+ title = {Decoding Gene Set Variation Analysis},
+ author = {Saksham Malhotra},
+ year = {2018},
+ url = {https://towardsdatascience.com/decoding-gene-set-variation-analysis-8193a0cfda3}
+}
+
@article{Mantione2014,
author = {Mantione, K. J. and Kream, R. M. and Kuzelova, H. and Ptacek, R. and Raboch, J. and Samuel, J. M. and Stefano, G. B. },
title = {Comparing bioinformatic gene expression profiling methods: microarray and {RNA-Seq}},
@@ -529,6 +639,13 @@ @article{Northcott2012
url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3683624}
}
+@manual{Pages2020-package,
+ title = {{AnnotationDbi}: Manipulation of {SQLite}-based annotations in {Bioconductor}},
+ author = {Hervé Pagès and Marc Carlson and Seth Falcon and Nianhua Li},
+ year = {2020},
+ url = {https://bioconductor.org/packages/release/bioc/html/AnnotationDbi.html}
+}
+
@website{pca-visually-explained,
title = {Principal Component Analysis Explained Visually},
author = {Victor Powell and Lewis Lehe},
@@ -576,6 +693,14 @@ @manual{Prabhakaran2016
url = {http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html}
}
+@manual{Puthier2015,
+ title = {Statistics for {B}ioinformatics - {P}racticals - {G}ene enrichment statistics},
+ author = {Denis Puthier and Jacques {van Helden}},
+ year = {2015},
+ month = {Nov},
+ url = {https://dputhier.github.io/ASG/practicals/go_statistics_td/go_statistics_td_2015.html},
+ }
+
@manual{R-base,
title = {{R}: A Language and Environment for Statistical Computing},
author = {{R Core Team}},
diff --git a/docker/Dockerfile b/docker/Dockerfile
index d4bd6146..18dfb4c1 100644
--- a/docker/Dockerfile
+++ b/docker/Dockerfile
@@ -58,12 +58,15 @@ RUN Rscript -e "options(warn = 2); BiocManager::install( \
'DESeq2', \
'EnhancedVolcano', \
'ggupset', \
+ 'GSVA', \
'impute', \
'limma', \
'marray', \
'msigdbr', \
'org.Mm.eg.db', \
- 'org.Dr.eg.db'), \
+ 'org.Dr.eg.db', \
+ 'org.Hs.eg.db', \
+ 'qusage'), \
update = FALSE)"
# Installs packages needed for plottings
diff --git a/scripts/render-notebooks.R b/scripts/render-notebooks.R
index 8de069aa..488eeb60 100644
--- a/scripts/render-notebooks.R
+++ b/scripts/render-notebooks.R
@@ -45,6 +45,11 @@ option_list <- list(
opt_str = c("-s", "--style"), action = "store_true",
default = FALSE,
help = "Style input file before processing"
+ ),
+ make_option(
+ opt_str = c("-i", "--include_file"), type = "character",
+ default = NULL,
+ help = "File with R code to include for rendering"
)
)
@@ -66,7 +71,10 @@ if (!file.exists(opt$rmd)) {
if (!file.exists(opt$bib_file)) {
stop("File specified for --bib_file option is not at the specified file path.")
} else {
- header_line <- paste("bibliography:", opt$bib_file)
+ header_line <- paste0(
+ "bibliography: ", normalizePath(opt$bib_file), "\n",
+ "link-citations: TRUE"
+ )
}
# Check for a citation style
if (!is.null(opt$cite_style)){
@@ -77,6 +85,22 @@ if (!is.null(opt$cite_style)){
}
}
+# Check for an R code inclusion file and create a chunk to source it
+if (!is.null(opt$include_file)){
+ if (!file.exists(opt$include_file)) {
+ stop("File specified for --include_file option is not at the specified file path.")
+ } else {
+ # create a hidden chunk to source the include file
+ include_chunk <- paste0(
+ '```{r, include = FALSE}\n',
+ 'source("', normalizePath(opt$include_file), '")\n',
+ '```'
+ )
+
+ }
+}
+
+
# If no output html filename specification, make one from the original filename
if (is.null(opt$html)) {
output_file <- stringr::str_replace(normalizePath(opt$rmd), "\\.Rmd$", ".html")
@@ -91,7 +115,7 @@ if (opt$style) {
}
# Specify the temp file
-tmp_file <- stringr::str_replace(opt$rmd, "\\.Rmd$", "-tmp-header-changed.Rmd")
+tmp_file <- stringr::str_replace(opt$rmd, "\\.Rmd$", "-tmp-torender.Rmd")
# Read in as lines
lines <- readr::read_lines(opt$rmd)
@@ -99,15 +123,22 @@ lines <- readr::read_lines(opt$rmd)
# Find which lines are the beginning and end of the header chunk
header_range <- which(lines == "---")
-# Stop if no chunk found
-if (length(header_range) == 0) {
- stop("Not finding the `---` which is at the beginning of the header.")
+# Stop if no header found
+if (length(header_range) < 2) {
+  stop("Could not find the `---` lines at the beginning and end of the header.")
+}
+
+
+# Add the include chunk after the header
+if (!is.null(opt$include_file)){
+ lines <- append(lines, include_chunk, header_range[2])
}
# Add the bibliography specification line at the beginning of the chunk
new_lines <- append(lines, header_line, header_range[1])
-# Write to an tmp file
+
+# Write to a tmp file
readr::write_lines(new_lines, tmp_file)
# Declare path to google analytics bit
@@ -116,7 +147,7 @@ google_analytics_file <- normalizePath(file.path("components", "google-analytics
# Declare path to footer
footer_file <- normalizePath(file.path("components", "footer.html"))
-# Render the header added notebook
+# Render the modified notebook
rmarkdown::render(tmp_file,
output_format = rmarkdown::html_document(
toc = TRUE, toc_depth = 2,
@@ -131,5 +162,5 @@ rmarkdown::render(tmp_file,
output_file = output_file
)
-# Remove the temporary header change .Rmd tmp file
+# Remove the modified .Rmd tmp file
file.remove(tmp_file)
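For reference, the hidden chunk that `--include_file` generates is spliced in just after the YAML header; with `components/include.R` it would look like the following (the absolute path comes from `normalizePath()`, so the exact path will vary by machine):

```{r, include = FALSE}
source("/full/path/to/components/include.R")
```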
diff --git a/template/template_example.Rmd b/template/template_example.Rmd
index 8241a86d..1057d234 100644
--- a/template/template_example.Rmd
+++ b/template/template_example.Rmd
@@ -43,7 +43,7 @@ if (!dir.exists("data")) {
}
# Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
+plots_dir <- "plots"
# Create the plots folder if it doesn't exist
if (!dir.exists(plots_dir)) {
@@ -51,7 +51,7 @@ if (!dir.exists(plots_dir)) {
}
# Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
+results_dir <- "results"
# Create the results folder if it doesn't exist
if (!dir.exists(results_dir)) {
@@ -75,7 +75,6 @@ Fill out the pop up window with your email and our Terms and Conditions:
-
{{DELETE THIS LINE}}
We are going to use non-quantile normalized data for this analysis.
To get this data, you will need to check the box that says "Skip quantile normalization for RNA-seq samples".
@@ -132,19 +131,24 @@ This is handy to do because if we want to switch the dataset (see next section f
```{r}
# Define the file path to the data directory
-data_dir <- file.path("data", {{experiment_accession}}) # Replace with accession number which will be the name of the folder the files will be in
-
-# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
-data_file <- file.path(data_dir, {{DATA_ACCESSION FILENAME}}) # Replace with file path to your dataset
-
-# Declare the file path to the metadata file using the data directory saved as `data_dir`
-metadata_file <- file.path(data_dir, {{METADATA_ACCESSION FILENAME}}) # Replace with file path to your metadata
+# Replace with the path of the folder the files will be in
+data_dir <- file.path("data", {{experiment_accession}})
+
+# Declare the file path to the gene expression matrix file
+# inside directory saved as `data_dir`
+# Replace with the path to your dataset file
+data_file <- file.path(data_dir, {{DATA_ACCESSION FILENAME}})
+
+# Declare the file path to the metadata file
+# inside the directory saved as `data_dir`
+# Replace with the path to your metadata file
+metadata_file <- file.path(data_dir, {{METADATA_ACCESSION FILENAME}})
```
Now that our file paths are declared, we can use the `file.exists()` function to check that the files are where we specified above.
```{r}
-# Check if the gene expression matrix file is at the file path stored in `data_file`
+# Check if the gene expression matrix file is at the path stored in `data_file`
file.exists(data_file)
# Check if the metadata file is at the file path stored in `metadata_file`
@@ -183,7 +187,7 @@ if (!({{PACKAGE}} %in% installed.packages())) {
Attach the packages we need for this analysis.
-```{r}
+```{r message=FALSE}
# Attach the library
library({{PACKAGE}})
@@ -206,7 +210,7 @@ metadata <- readr::read_tsv(metadata_file)
# Read in data TSV file
df <- readr::read_tsv(data_file) %>%
- # Tuck away the gene ID column as rownames
+ # Tuck away the gene ID column as row names
tibble::column_to_rownames("Gene")
```