From 4aa3c50caed20973c23470d6b458ef109fcca494 Mon Sep 17 00:00:00 2001 From: Candace Savonen Date: Tue, 24 Nov 2020 14:21:54 -0500 Subject: [PATCH 01/12] switch wording and dataset in general --- .../network-analysis_rnaseq_01_wgcna.Rmd | 122 +++++++++--------- 1 file changed, 64 insertions(+), 58 deletions(-) diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd index bee42c53..7158a208 100644 --- a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd +++ b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd @@ -75,7 +75,7 @@ In the same place you put this `.Rmd` file, you should now have three new empty For general information about downloading data for these examples, see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-the-data). -Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/SRP133573/identification-of-transcription-factor-relationships-associated-with-androgen-deprivation-therapy-response-and-metastatic-progression-in-prostate-cancer). +Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/SRP140558). Click the "Download Now" button on the right side of this screen. @@ -96,9 +96,9 @@ You will get an email when it is ready. ## About the dataset we are using for this example -For this example analysis, we will use this [prostate cancer dataset](https://www.refine.bio/experiments/SRP133573). -The data that we downloaded from refine.bio for this analysis has 175 RNA-seq samples obtained from 20 patients with prostate cancer. -Patients underwent androgen deprivation therapy (ADT) and RNA-seq samples include pre-ADT biopsies and post-ADT prostatectomy specimens. +For this example analysis, we will use this [acute viral bronchiolitis dataset](https://www.refine.bio/experiments/SRP140558). +The data that we downloaded from refine.bio for this analysis has 62 peripheral blood mononuclear cell RNA-seq samples obtained from 31 patients. +Samples were collected at two time points: during their first, acute bronchiolitis visit (abbreviated "AV") and their recovery, their post-convalescence visit (abbreviated "CV"). ## Place the dataset in your new `data/` folder @@ -113,7 +113,7 @@ For more details on the contents of this folder see [these docs on refine.bio](h The `` folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like `GSE1235` or `SRP12345`. -Copy and paste the `SRP133573` folder into your newly created `data/` folder. +Copy and paste the `SRP140558` folder into your newly created `data/` folder. ## Check out our file structure! @@ -121,7 +121,7 @@ Your new analysis folder should contain: - The example analysis `.Rmd` you downloaded - A folder called "data" which contains: - - The `SRP133573` folder which contains: + - The `SRP140558` folder which contains: - The gene expression - The metadata TSV - A folder for `plots` (currently empty) @@ -139,13 +139,13 @@ This is handy to do because if we want to switch the dataset (see next section f ```{r} # Define the file path to the data directory -data_dir <- file.path("data", "SRP133573") # Replace with accession number which will be the name of the folder the files will be in +data_dir <- file.path("data", "SRP140558") # Replace with accession number which will be the name of the folder the files will be in # Declare the file path to the gene expression matrix file using the data directory saved as `data_dir` -data_file <- file.path(data_dir, "SRP133573.tsv") # Replace with file path to your dataset +data_file <- file.path(data_dir, "SRP140558.tsv") # Replace with file path to your dataset # Declare the file path to the metadata file using the data directory saved as `data_dir` -metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv") # Replace with file path to your metadata +metadata_file <- file.path(data_dir, "metadata_SRP140558.tsv") # Replace with file path to your metadata ``` Now that our file paths are declared, we can use the `file.exists()` function to check that the files are where we specified above. @@ -291,21 +291,29 @@ df <- round(df) %>% dplyr::filter(rowSums(.) >= 50) ``` -Another thing we need to do is make sure our main experimental group label is set up. -In this case `refinebio_treatment` has two groups: `pre-adt` and `post-adt`. -To keep these two treatments in logical (rather than alphabetical) order, we will convert this to a factor with `pre-adt` as the first level. +Another thing we need to do is set up our main experimental group variable. +Unfortunately the metadata for this dataset are not separated into separate, neat columns, but we can accomplish that ourselves. + +For this study PBMCs were collected at two time points: during the patients' first, acute bronchiolitis visit (abbreviated "AV") and their recovery visit, (called post-convalescence and abbreviated "CV"). + +For handier use of this information, wee can create a new variable that states this info more clearly. +We'll create a `time_point` has two labels: `acute illness` and `recovering` based on the `AV` or `CV` coding located in the `refinebio_title` string variable. ```{r} metadata <- metadata %>% - dplyr::mutate(refinebio_treatment = factor(refinebio_treatment, - levels = c("pre-adt", "post-adt") - )) + dplyr::mutate(time_point = dplyr::case_when( + # Create our new variable based on refinebio_title containing AV/CV + str_detect(refinebio_title, "_AV_") ~ "acute illness", + str_detect(refinebio_title, "_CV_") ~ "recovering" + ), + # It's easier for future items if this is already set up as a factor + time_point = as.factor(time_point)) ``` Let's double check that our factor set up is right. ```{r} -levels(metadata$refinebio_treatment) +levels(metadata$time_point) ``` ## Create a DESeqDataset @@ -399,14 +407,14 @@ WGCNA's authors recommend using a `power` that has an signed $R^2$ above `0.80`, If you have multiple power values with signed $R^2$ above `0.80`, then picking the one at an inflection point, in other words where the $R^2$ values seem to have reached their saturation [@Zhang2005]. You want to a `power` that gives you a big enough $R^2$ but is not excessively large. -So using the plot above, going with a power soft-threshold of `16`! +So using the plot above, going with a power soft-threshold of `2`! If you find you have all very low $R^2$ values this may be because there are too many genes with low expression values that are cluttering up the calculations. You can try returning to [gene filtering step](#define-a-minimum-counts-cutoff) and choosing a more stringent cutoff (you'll then need to re-run the transformation and subsequent steps to remake this plot to see if that helped). ## Run WGCNA! -We will use the `blockwiseModules()` function to find gene co-expression modules in WGCNA, using `16` for the `power` argument like we determined above. +We will use the `blockwiseModules()` function to find gene co-expression modules in WGCNA, using `2` for the `power` argument like we determined above. This next step takes some time to run. The `blockwise` part of the `blockwiseModules()` function name refers to that these calculations will be done on chunks of your data at a time to help with conserving computing resources. @@ -425,7 +433,7 @@ operating system and other running programs. bwnet <- blockwiseModules(normalized_counts, maxBlockSize = 5000, # What size chunks (how many genes) the calculations should be run in TOMType = "signed", # topological overlap matrix - power = 16, # soft threshold for network construction + power = 2, # soft threshold for network construction numericLabels = TRUE, # Let's use numbers instead of colors for module labels randomSeed = 1234, # there's some randomness associated with this calculation # so we should set a seed @@ -444,7 +452,7 @@ We will save our whole results object to an RDS file in case we want to return t ```{r} readr::write_rds(bwnet, - file = file.path("results", "SRP133573_wgcna_results.RDS") + file = file.path("results", "SRP140558_wgcna_results.RDS") ) ``` @@ -473,8 +481,8 @@ all.equal(metadata$refinebio_accession_code, rownames(module_eigengenes)) ``` ```{r} -# Create the design matrix from the refinebio_treatment variable -des_mat <- model.matrix(~ metadata$refinebio_treatment) +# Create the design matrix from the time_point variable +des_mat <- model.matrix(~ metadata$time_point) ``` Run linear model on each module. @@ -503,21 +511,21 @@ They are sorted with the most significant results at the top. head(stats_df) ``` -Module 52 seems to be the most differentially expressed across `refinebio_treatment` groups. +Module 18 seems to be the most differentially expressed across `time_point` groups. Now we can do some investigation into this module. -## Let's make plot of module 52 +## Let's make plot of module 18 -As a sanity check, let's use `ggplot` to see what module 52's eigengene looks like between treatment groups. +As a sanity check, let's use `ggplot` to see what module 18's eigengene looks like between treatment groups. First we need to set up the module eigengene for this module with the sample metadata labels we need. ```{r} -module_52_df <- module_eigengenes %>% +module_18_df <- module_eigengenes %>% tibble::rownames_to_column("accession_code") %>% # Here we are performing an inner join with a subset of metadata dplyr::inner_join(metadata %>% - dplyr::select(refinebio_accession_code, refinebio_treatment), + dplyr::select(refinebio_accession_code, time_point), by = c("accession_code" = "refinebio_accession_code") ) ``` @@ -526,11 +534,11 @@ Now we are ready for plotting. ```{r} ggplot( - module_52_df, + module_18_df, aes( - x = refinebio_treatment, - y = ME52, - color = refinebio_treatment + x = time_point, + y = ME18, + color = time_point ) ) + ggforce::geom_sina() + @@ -538,9 +546,9 @@ ggplot( ``` This makes sense! -Looks like module 52 has elevated expression post treatment. +Looks like module 18 has elevated expression during the acute illness but not when recovering. -## What genes are a part of module 52? +## What genes are a part of module 18? If you want to know which of your genes make up a modules, you can look at the `$colors` slot. This is a named list which associates the genes with the module they are a part of. @@ -552,18 +560,18 @@ gene_module_key <- tibble::enframe(bwnet$colors, name = "gene", value = "module" dplyr::mutate(module = paste0("ME", module)) ``` -Now we can find what genes are a part of module 52. +Now we can find what genes are a part of module 18. ```{r} gene_module_key %>% - dplyr::filter(module == "ME52") + dplyr::filter(module == "ME18") ``` Let's save this gene to module key to a TSV file for future use. ```{r} readr::write_tsv(gene_module_key, - file = file.path("results", "SRP133573_wgcna_gene_to_module.tsv") + file = file.path("results", "SRP140558_wgcna_gene_to_module.tsv") ) ``` @@ -581,9 +589,9 @@ make_module_heatmap <- function(module_name, # Create a summary heatmap of a given module. # # Args: - # module_name: a character indicating what module should be plotted, e.g. "ME52" + # module_name: a character indicating what module should be plotted, e.g. "ME18" # expression_mat: The full gene expression matrix. Default is `normalized_counts`. - # metadata_df: a data frame with refinebio_accession_code and refinebio_treatment + # metadata_df: a data frame with refinebio_accession_code and time_point # as columns. Default is `metadata`. # gene_module_key: a data.frame indicating what genes are a part of what modules. Default is `gene_module_key`. # module_eigengenes: a sample x eigengene data.frame with samples as rownames. Default is `module_eigengenes`. @@ -600,22 +608,22 @@ make_module_heatmap <- function(module_name, # Set up column annotation from metadata col_annot_df <- metadata_df %>% # Only select the treatment and sample ID columns - dplyr::select(refinebio_accession_code, refinebio_treatment) %>% + dplyr::select(refinebio_accession_code, time_point, refinebio_subject) %>% # Add on the eigengene expression by joining with sample IDs dplyr::inner_join(module_eigengene, by = "refinebio_accession_code") %>% - # Arrange by treatment - dplyr::arrange(refinebio_treatment, refinebio_accession_code) %>% + # Arrange by patient and time point + dplyr::arrange(time_point, refinebio_subject) %>% # Store sample tibble::column_to_rownames("refinebio_accession_code") # Create the ComplexHeatmap column annotation object col_annot <- ComplexHeatmap::HeatmapAnnotation( # Supply treatment labels - refinebio_treatment = col_annot_df$refinebio_treatment, + time_point = col_annot_df$time_point, # Add annotation barplot module_eigengene = ComplexHeatmap::anno_barplot(dplyr::select(col_annot_df, module_name)), - # Pick colors for each experimental group in refinebio_treatment - col = list(refinebio_treatment = c("post-adt" = "#f1a340", "pre-adt" = "#998ec3")) + # Pick colors for each experimental group in time_point + col = list(time_point = c("recovering" = "#f1a340", "acute illness" = "#998ec3")) ) # Get a vector of the Ensembl gene IDs that correspond to this module @@ -670,44 +678,42 @@ make_module_heatmap <- function(module_name, ## Make module heatmaps -Let's try out the custom heatmap function with module 52 (our most differentially expressed module). +Let's try out the custom heatmap function with module 18 (our most differentially expressed module). ```{r} -mod_52_heatmap <- make_module_heatmap(module_name = "ME52") +mod_18_heatmap <- make_module_heatmap(module_name = "ME18") # Print out the plot -mod_52_heatmap +mod_18_heatmap ``` -From the barplot portion of our plot, we can see `post-adt` samples have higher values for this eigengene for module 52. -In the heatmap portion, we can see how the individual genes that make up module 52 have more extreme values (very high or very low) in the `post-adt` samples. +From the barplot portion of our plot, we can see `recovering` samples have higher values for this eigengene for module 18. +In the heatmap portion, we can see how the individual genes that make up module 18 are overall higher than in the `recovering` samples. We can save this plot to PDF. ```{r} -pdf(file.path("results", "SRP133573_module_52_heatmap.pdf")) -mod_52_heatmap +pdf(file.path("results", "SRP140558_module_18_heatmap.pdf")) +mod_18_heatmap dev.off() ``` For comparison, let's try out the custom heatmap function with a different, _not_ differentially expressed module. ```{r} -mod_10_heatmap <- make_module_heatmap(module_name = "ME10") +mod_25_heatmap <- make_module_heatmap(module_name = "ME25") # Print out the plot -mod_10_heatmap +mod_25_heatmap ``` -In this non-significant module's heatmap, there's not a particularly strong pattern between pre and post ADT samples. -In general the expression of genes in module 10 does not vary much between groups, staying near the overall mean. -There are a few samples and some genes that show higher expression, but it is not surprising this does not results in a significant overall difference between the groups. +In this non-significant module's heatmap, there's not a particularly strong pattern between acute illness and recovery samples. Save this plot also. ```{r} -pdf(file.path("results", "SRP133573_module_10_heatmap.pdf")) -mod_10_heatmap +pdf(file.path("results", "SRP140558_module_25_heatmap.pdf")) +mod_25_heatmap dev.off() ``` From 44a76151e3ed9278853c15d9f677dfe0152557f6 Mon Sep 17 00:00:00 2001 From: Candace Savonen Date: Tue, 24 Nov 2020 14:32:45 -0500 Subject: [PATCH 02/12] Few more wording edits --- .../network-analysis_rnaseq_01_wgcna.Rmd | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd index 7158a208..134610a2 100644 --- a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd +++ b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd @@ -97,7 +97,7 @@ You will get an email when it is ready. ## About the dataset we are using for this example For this example analysis, we will use this [acute viral bronchiolitis dataset](https://www.refine.bio/experiments/SRP140558). -The data that we downloaded from refine.bio for this analysis has 62 peripheral blood mononuclear cell RNA-seq samples obtained from 31 patients. +The data that we downloaded from refine.bio for this analysis has 62 paired peripheral blood mononuclear cell RNA-seq samples obtained from 31 patients. Samples were collected at two time points: during their first, acute bronchiolitis visit (abbreviated "AV") and their recovery, their post-convalescence visit (abbreviated "CV"). ## Place the dataset in your new `data/` folder @@ -292,12 +292,12 @@ df <- round(df) %>% ``` Another thing we need to do is set up our main experimental group variable. -Unfortunately the metadata for this dataset are not separated into separate, neat columns, but we can accomplish that ourselves. +Unfortunately the metadata for this dataset are not set up into separate, neat columns, but we can accomplish that ourselves. -For this study PBMCs were collected at two time points: during the patients' first, acute bronchiolitis visit (abbreviated "AV") and their recovery visit, (called post-convalescence and abbreviated "CV"). +For this study, PBMCs were collected at two time points: during the patients' first, acute bronchiolitis visit (abbreviated "AV") and their recovery visit, (called post-convalescence and abbreviated "CV"). -For handier use of this information, wee can create a new variable that states this info more clearly. -We'll create a `time_point` has two labels: `acute illness` and `recovering` based on the `AV` or `CV` coding located in the `refinebio_title` string variable. +For handier use of this information, we can create a new variable, `time_point`, that states this info more clearly. +This new `time_point` variable will have two labels: `acute illness` and `recovering` based on the `AV` or `CV` coding located in the `refinebio_title` string variable. ```{r} metadata <- metadata %>% @@ -311,11 +311,14 @@ metadata <- metadata %>% ``` Let's double check that our factor set up is right. +We want `acute illness` to be the first level since it was the first time point collected. ```{r} levels(metadata$time_point) ``` +Great! We're all set. + ## Create a DESeqDataset We will be using the `DESeq2` package for [normalizing and transforming our data](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#deseq2-transformation-methods), which requires us to format our data into a `DESeqDataSet` object. @@ -481,7 +484,7 @@ all.equal(metadata$refinebio_accession_code, rownames(module_eigengenes)) ``` ```{r} -# Create the design matrix from the time_point variable +# Create the design matrix from the `time_point` variable des_mat <- model.matrix(~ metadata$time_point) ``` From e5d637e52dd1e62457df9a4d9d698cae7c7acadc Mon Sep 17 00:00:00 2001 From: Candace Savonen Date: Tue, 24 Nov 2020 14:39:03 -0500 Subject: [PATCH 03/12] Update dictionary; fix spelling errors --- .../network-analysis_rnaseq_01_wgcna.Rmd | 20 ++++++++++--------- components/dictionary.txt | 4 ++++ 2 files changed, 15 insertions(+), 9 deletions(-) diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd index 134610a2..ce0a8d16 100644 --- a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd +++ b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd @@ -13,7 +13,7 @@ output: In this example, we use weighted gene co-expression network analysis (WGCNA) to identify co-expressed gene modules [@Langfelder2008]. WGCNA uses a series of correlations to identify sets of genes that are expressed together in your data set. -This is a fairly intuitive approach to gene network analysis which can aid in interpretation of microarray & RNAseq data. +This is a fairly intuitive approach to gene network analysis which can aid in interpretation of microarray & rna-seq data. As output, WGCNA gives groups of co-expressed genes as well as an eigengene x sample matrix (where the values for each eigengene represent the summarized expression for a group of co-expressed genes) [@Langfelder2007]. This eigengene x sample data can, in many instances, be used as you would the original gene expression values. @@ -273,7 +273,7 @@ all.equal(colnames(df), metadata$refinebio_accession_code) ### Prepare data for `DESeq2` -There are two things we neeed to do to prep our expression data for DESeq2. +There are two things we need to do to prep our expression data for DESeq2. First, we need to make sure all of the values in our data are converted to integers as required by a `DESeq2` function we will use later. @@ -301,13 +301,15 @@ This new `time_point` variable will have two labels: `acute illness` and `recove ```{r} metadata <- metadata %>% - dplyr::mutate(time_point = dplyr::case_when( - # Create our new variable based on refinebio_title containing AV/CV - str_detect(refinebio_title, "_AV_") ~ "acute illness", - str_detect(refinebio_title, "_CV_") ~ "recovering" - ), - # It's easier for future items if this is already set up as a factor - time_point = as.factor(time_point)) + dplyr::mutate( + time_point = dplyr::case_when( + # Create our new variable based on refinebio_title containing AV/CV + str_detect(refinebio_title, "_AV_") ~ "acute illness", + str_detect(refinebio_title, "_CV_") ~ "recovering" + ), + # It's easier for future items if this is already set up as a factor + time_point = as.factor(time_point) + ) ``` Let's double check that our factor set up is right. diff --git a/components/dictionary.txt b/components/dictionary.txt index 2290522d..ae78912d 100644 --- a/components/dictionary.txt +++ b/components/dictionary.txt @@ -14,6 +14,7 @@ bioinformatics Brainarray Brainarray’s Brems +bronchiolitis CCDL CCDL's cDNA @@ -26,6 +27,7 @@ Cmd ColorBrewer Compara ComplexHeatmap +ComplexHeatmap's CREB Crouser cytosolic @@ -89,6 +91,7 @@ maxBlockSize medulloblastoma microarray’s molecularly +mononuclear musculus MSigDB myeloid @@ -100,6 +103,7 @@ orthologs orthology overexpressing overexpression +PBMCs permutated permutating pheatmap From a7907bb4e750c806400cd00df9e5cf026b637ed3 Mon Sep 17 00:00:00 2001 From: Candace Savonen Date: Tue, 24 Nov 2020 14:45:53 -0500 Subject: [PATCH 04/12] Re-render! --- .../network-analysis_rnaseq_01_wgcna.Rmd | 4 +- .../network-analysis_rnaseq_01_wgcna.html | 459 +++++++++--------- 2 files changed, 236 insertions(+), 227 deletions(-) diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd index ce0a8d16..1a264859 100644 --- a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd +++ b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd @@ -304,8 +304,8 @@ metadata <- metadata %>% dplyr::mutate( time_point = dplyr::case_when( # Create our new variable based on refinebio_title containing AV/CV - str_detect(refinebio_title, "_AV_") ~ "acute illness", - str_detect(refinebio_title, "_CV_") ~ "recovering" + stringr::str_detect(refinebio_title, "_AV_") ~ "acute illness", + stringr::str_detect(refinebio_title, "_CV_") ~ "recovering" ), # It's easier for future items if this is already set up as a factor time_point = as.factor(time_point) diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html index b294fb8d..4cbc159d 100644 --- a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html +++ b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html @@ -3772,7 +3772,7 @@

November 2020

1 Purpose of this analysis

-

In this example, we use weighted gene co-expression network analysis (WGCNA) to identify co-expressed gene modules (Langfelder and Horvath 2008). WGCNA uses a series of correlations to identify sets of genes that are expressed together in your data set. This is a fairly intuitive approach to gene network analysis which can aid in interpretation of microarray & RNAseq data.

+

In this example, we use weighted gene co-expression network analysis (WGCNA) to identify co-expressed gene modules (Langfelder and Horvath 2008). WGCNA uses a series of correlations to identify sets of genes that are expressed together in your data set. This is a fairly intuitive approach to gene network analysis which can aid in interpretation of microarray & rna-seq data.

As output, WGCNA gives groups of co-expressed genes as well as an eigengene x sample matrix (where the values for each eigengene represent the summarized expression for a group of co-expressed genes) (Langfelder and Horvath 2007). This eigengene x sample data can, in many instances, be used as you would the original gene expression values. In this example, we use eigengene x sample data to identify differentially expressed modules between our treatment and control group

This method does require some computing power, but can still be run locally (on your own computer) for most refine.bio datasets. As with many clustering and network methods, there are some parameters that may need tweaking.

⬇️ Jump to the analysis code ⬇️

@@ -3815,7 +3815,7 @@

2.2 Set up your analysis folders<

2.3 Obtain the dataset from refine.bio

For general information about downloading data for these examples, see our ‘Getting Started’ section.

-

Go to this dataset’s page on refine.bio.

+

Go to this dataset’s page on refine.bio.

Click the “Download Now” button on the right side of this screen.

Fill out the pop up window with your email and our Terms and Conditions:

@@ -3826,7 +3826,7 @@

2.3 Obtain the dataset from refin

2.4 About the dataset we are using for this example

-

For this example analysis, we will use this prostate cancer dataset. The data that we downloaded from refine.bio for this analysis has 175 RNA-seq samples obtained from 20 patients with prostate cancer. Patients underwent androgen deprivation therapy (ADT) and RNA-seq samples include pre-ADT biopsies and post-ADT prostatectomy specimens.

+

For this example analysis, we will use this acute viral bronchiolitis dataset. The data that we downloaded from refine.bio for this analysis has 62 paired peripheral blood mononuclear cell RNA-seq samples obtained from 31 patients. Samples were collected at two time points: during their first, acute bronchiolitis visit (abbreviated “AV”) and their recovery, their post-convalescence visit (abbreviated “CV”).

2.5 Place the dataset in your new data/ folder

@@ -3834,7 +3834,7 @@

2.5 Place the dataset in your new

For more details on the contents of this folder see these docs on refine.bio.

The <experiment_accession_id> folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235 or SRP12345.

-

Copy and paste the SRP133573 folder into your newly created data/ folder.

+

Copy and paste the SRP140558 folder into your newly created data/ folder.

2.6 Check out our file structure!

@@ -3844,7 +3844,7 @@

2.6 Check out our file structure!
  • A folder called “data” which contains:
      -
    • The SRP133573 folder which contains: +
    • The SRP140558 folder which contains:
      • The gene expression
      • @@ -3860,13 +3860,13 @@

        2.6 Check out our file structure!

        In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

        First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.

        # Define the file path to the data directory
        -data_dir <- file.path("data", "SRP133573") # Replace with accession number which will be the name of the folder the files will be in
        +data_dir <- file.path("data", "SRP140558") # Replace with accession number which will be the name of the folder the files will be in
         
         # Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
        -data_file <- file.path(data_dir, "SRP133573.tsv") # Replace with file path to your dataset
        +data_file <- file.path(data_dir, "SRP140558.tsv") # Replace with file path to your dataset
         
         # Declare the file path to the metadata file using the data directory saved as `data_dir`
        -metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv") # Replace with file path to your metadata
        +metadata_file <- file.path(data_dir, "metadata_SRP140558.tsv") # Replace with file path to your metadata

  • Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

    # Check if the gene expression matrix file is at the file path stored in `data_file`
     file.exists(data_file)
    @@ -4029,19 +4029,16 @@

    4.2 Import and set up data

    # Read in metadata TSV file
     metadata <- readr::read_tsv(metadata_file)
    ## 
    -## ── Column specification ───────────────────────────────────────────────────────────────────────────────
    +## ── Column specification ─────────────────────────────────────────────────────────────────────────
     ## cols(
    -##   .default = col_character(),
    -##   refinebio_age = col_logical(),
    -##   refinebio_cell_line = col_logical(),
    -##   refinebio_compound = col_logical(),
    -##   refinebio_disease_stage = col_logical(),
    -##   refinebio_genetic_information = col_logical(),
    -##   refinebio_processed = col_logical(),
    -##   refinebio_sex = col_logical(),
    -##   refinebio_source_archive_url = col_logical(),
    -##   refinebio_specimen_part = col_logical(),
    -##   refinebio_time = col_logical()
    +##   .default = col_logical(),
    +##   refinebio_accession_code = col_character(),
    +##   experiment_accession = col_character(),
    +##   refinebio_organism = col_character(),
    +##   refinebio_platform = col_character(),
    +##   refinebio_source_database = col_character(),
    +##   refinebio_subject = col_character(),
    +##   refinebio_title = col_character()
     ## )
     ## ℹ Use `spec()` for the full column specifications.
    # Read in data TSV file
    @@ -4049,7 +4046,7 @@ 

    4.2 Import and set up data

    # Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later tibble::column_to_rownames("Gene")
    ## 
    -## ── Column specification ───────────────────────────────────────────────────────────────────────────────
    +## ── Column specification ─────────────────────────────────────────────────────────────────────────
     ## cols(
     ##   .default = col_double(),
     ##   Gene = col_character()
    @@ -4065,7 +4062,7 @@ 

    4.2 Import and set up data

    ## [1] TRUE

    4.2.1 Prepare data for DESeq2

    -

    There are two things we neeed to do to prep our expression data for DESeq2.

    +

    There are two things we need to do to prep our expression data for DESeq2.

    First, we need to make sure all of the values in our data are converted to integers as required by a DESeq2 function we will use later.

    Then, we need to filter out the genes that have not been expressed or that have low expression counts. This is recommended by WGCNA docs for RNA-seq data. Removing low count genes can also help improve your WGCNA results. We are going to do some pre-filtering to keep only genes with 50 or more reads in total across the samples.

    # The next DESeq2 functions need the values to be converted to integers
    @@ -4074,14 +4071,23 @@ 

    4.2.1 Prepare data for DESe as.data.frame() %>% # Only keep rows that have total counts above the cutoff dplyr::filter(rowSums(.) >= 50)

    -

    Another thing we need to do is make sure our main experimental group label is set up. In this case refinebio_treatment has two groups: pre-adt and post-adt. To keep these two treatments in logical (rather than alphabetical) order, we will convert this to a factor with pre-adt as the first level.

    +

    Another thing we need to do is set up our main experimental group variable. Unfortunately the metadata for this dataset are not set up into separate, neat columns, but we can accomplish that ourselves.

    +

    For this study, PBMCs were collected at two time points: during the patients’ first, acute bronchiolitis visit (abbreviated “AV”) and their recovery visit, (called post-convalescence and abbreviated “CV”).

    +

    For handier use of this information, we can create a new variable, time_point, that states this info more clearly. This new time_point variable will have two labels: acute illness and recovering based on the AV or CV coding located in the refinebio_title string variable.

    metadata <- metadata %>%
    -  dplyr::mutate(refinebio_treatment = factor(refinebio_treatment,
    -    levels = c("pre-adt", "post-adt")
    -  ))
    -

    Let’s double check that our factor set up is right.

    -
    levels(metadata$refinebio_treatment)
    -
    ## [1] "pre-adt"  "post-adt"
    + dplyr::mutate( + time_point = dplyr::case_when( + # Create our new variable based on refinebio_title containing AV/CV + stringr::str_detect(refinebio_title, "_AV_") ~ "acute illness", + stringr::str_detect(refinebio_title, "_CV_") ~ "recovering" + ), + # It's easier for future items if this is already set up as a factor + time_point = as.factor(time_point) + )
    +

    Let’s double check that our factor set up is right. We want acute illness to be the first level since it was the first time point collected.

    +
    levels(metadata$time_point)
    +
    ## [1] "acute illness" "recovering"
    +

    Great! We’re all set.

    @@ -4102,7 +4108,9 @@

    4.4 Perform DESeq2 normalization
    # Normalize and transform the data in the `DESeqDataSet` object using the `vst()`
     # function from the `DESEq2` R package
     dds_norm <- vst(dds)
    -

    At this point, if your data has any outliers, you should look into removing them as they can affect your WGCNA results. WGCNA’s tutorial has an example of exploring your data for outliers you can reference.

    +

    At this point, if your data set has any outlier samples, you should look into removing them as they can affect your WGCNA results.

    +

    WGCNA’s tutorial has an example of exploring your data for outliers you can reference.

    +

    For this example data set, we will skip this step (there are no obvious outliers) and proceed.

    4.5 Format normalized data for WGCNA

    @@ -4121,21 +4129,21 @@

    4.6 Determine parameters for WGCN )

    ## Warning: executing %dopar% sequentially: no parallel backend registered
    ##    Power SFT.R.sq  slope truncated.R.sq mean.k. median.k. max.k.
    -## 1      1  0.58200 12.200          0.957 13500.0   13600.0  15500
    -## 2      2  0.44500  5.130          0.972  7630.0    7650.0   9910
    -## 3      3  0.26300  2.570          0.985  4480.0    4450.0   6680
    -## 4      4  0.06480  0.914          0.985  2730.0    2680.0   4720
    -## 5      5  0.00662 -0.236          0.964  1720.0    1660.0   3450
    -## 6      6  0.15900 -1.010          0.965  1120.0    1060.0   2580
    -## 7      7  0.36500 -1.470          0.971   746.0     689.0   1980
    -## 8      8  0.50000 -1.730          0.972   509.0     459.0   1550
    -## 9      9  0.59700 -1.910          0.972   356.0     313.0   1220
    -## 10    10  0.67000 -2.060          0.973   253.0     217.0    982
    -## 11    12  0.74000 -2.260          0.970   135.0     110.0    651
    -## 12    14  0.79400 -2.320          0.978    76.9      58.6    447
    -## 13    16  0.82000 -2.350          0.981    45.9      32.7    315
    -## 14    18  0.83800 -2.360          0.985    28.6      18.9    227
    -## 15    20  0.84500 -2.350          0.987    18.5      11.2    167
    +## 1 1 0.0491 42.50 0.947 13400.0 13400.00 13600 +## 2 2 0.8530 -12.60 0.871 7230.0 7080.00 8430 +## 3 3 0.8800 -5.41 0.856 4120.0 3900.00 5840 +## 4 4 0.8910 -3.28 0.864 2470.0 2230.00 4340 +## 5 5 0.9060 -2.39 0.882 1560.0 1310.00 3380 +## 6 6 0.9140 -1.96 0.895 1030.0 798.00 2740 +## 7 7 0.9220 -1.72 0.908 706.0 496.00 2280 +## 8 8 0.9190 -1.58 0.910 504.0 314.00 1940 +## 9 9 0.9180 -1.48 0.917 371.0 203.00 1680 +## 10 10 0.9080 -1.42 0.915 282.0 134.00 1470 +## 11 12 0.9050 -1.34 0.927 174.0 60.40 1170 +## 12 14 0.8870 -1.31 0.927 116.0 28.60 964 +## 13 16 0.8660 -1.32 0.918 81.7 14.00 810 +## 14 18 0.8560 -1.33 0.921 59.7 7.13 692 +## 15 20 0.8570 -1.33 0.929 45.0 3.71 599

    This sft object has a lot of information, we will want to plot some of it to figure out what our power soft-threshold should be. We have to first calculate a measure of the model fit, the signed \(R^2\), and make that a new variable.

    sft_df <- data.frame(sft$fitIndices) %>%
       dplyr::mutate(model_fit = -sign(slope) * SFT.R.sq)
    @@ -4155,250 +4163,251 @@

    4.6 Determine parameters for WGCN ggtitle("Scale independence") + # This adds some nicer aesthetics to our plot theme_classic() -

    +
    ## Warning: Removed 7 rows containing missing values (geom_text).
    +

    Using this plot we can decide on a power parameter. WGCNA’s authors recommend using a power that has an signed \(R^2\) above 0.80, otherwise they warn your results may be too noisy to be meaningful.

    If you have multiple power values with signed \(R^2\) above 0.80, then picking the one at an inflection point, in other words where the \(R^2\) values seem to have reached their saturation (Zhang and Horvath 2005). You want to a power that gives you a big enough \(R^2\) but is not excessively large.

    -

    So using the plot above, going with a power soft-threshold of 16!

    +

    So using the plot above, going with a power soft-threshold of 2!

    If you find you have all very low \(R^2\) values this may be because there are too many genes with low expression values that are cluttering up the calculations. You can try returning to gene filtering step and choosing a more stringent cutoff (you’ll then need to re-run the transformation and subsequent steps to remake this plot to see if that helped).

    4.7 Run WGCNA!

    -

    We will use the blockwiseModules() function to find gene co-expression modules in WGCNA, using 16 for the power argument like we determined above.

    +

    We will use the blockwiseModules() function to find gene co-expression modules in WGCNA, using 2 for the power argument like we determined above.

    This next step takes some time to run. The blockwise part of the blockwiseModules() function name refers to that these calculations will be done on chunks of your data at a time to help with conserving computing resources.

    Here we are using the default maxBlockSize, 5000 but, you may want to adjust the maxBlockSize argument depending on your computer’s memory. The authors of WGCNA recommend running the largest block your computer can handle and they provide some approximations as to GB of memory of a laptop and what maxBlockSize it should be able to handle:

    • If the reader has access to a large workstation with more than 4 GB of memory, the parameter maxBlockSize can be increased. A 16GB workstation should handle up to 20000 probes; a 32GB workstation should handle perhaps 30000. A 4GB standard desktop or a laptop may handle up to 8000-10000 probes, depending on operating system and other running programs.

    (Langfelder and Horvath 2016)

    -
    bwnet <- blockwiseModules(normalized_counts,
    -  maxBlockSize = 5000, # What size chunks (how many genes) the calculations should be run in
    -  TOMType = "signed", # topological overlap matrix
    -  power = 16, # soft threshold for network construction
    -  numericLabels = TRUE, # Let's use numbers instead of colors for module labels
    -  randomSeed = 1234, # there's some randomness associated with this calculation
    -  # so we should set a seed
    -)
    +
    bwnet <- blockwiseModules(normalized_counts,
    +  maxBlockSize = 5000, # What size chunks (how many genes) the calculations should be run in
    +  TOMType = "signed", # topological overlap matrix
    +  power = 2, # soft threshold for network construction
    +  numericLabels = TRUE, # Let's use numbers instead of colors for module labels
    +  randomSeed = 1234, # there's some randomness associated with this calculation
    +  # so we should set a seed
    +)

    The TOMtype argument specifies what kind of topological overlap matrix (TOM) should be used to make gene modules. You can safely assume for most situations a signed network represents what you want – we want WGCNA to pay attention to directionality. However if you suspect you may benefit from an unsigned network, where positive/negative is ignored see this article to help you figure that out (Langfelder 2018).

    There are a lot of other settings you can tweak – look at ?blockwiseModules help page as well as the WGCNA tutorial (Langfelder and Horvath 2016).

    4.8 Write main WGCNA results object to file

    We will save our whole results object to an RDS file in case we want to return to our original WGCNA results.

    -
    readr::write_rds(bwnet,
    -  file = file.path("results", "SRP133573_wgcna_results.RDS")
    -)
    +
    readr::write_rds(bwnet,
    +  file = file.path("results", "SRP140558_wgcna_results.RDS")
    +)

    4.9 Explore our WGCNA results

    The bwnet object has many parts, storing a lot of information. We can pull out the parts we are most interested in and may want to use use for plotting.

    In bwnet we have a data frame of eigengene module data for each sample in the MEs slot. These represent the collapsed, combined, and normalized expression of the genes that make up each module.

    -
    module_eigengenes <- bwnet$MEs
    -
    -# Print out a preview
    -head(module_eigengenes)
    +
    module_eigengenes <- bwnet$MEs
    +
    +# Print out a preview
    +head(module_eigengenes)

    4.10 Which modules have biggest differences across treatment groups?

    We can also see if our eigengenes relate to our metadata labels. First we double check that our samples are still in order.

    -
    all.equal(metadata$refinebio_accession_code, rownames(module_eigengenes))
    +
    all.equal(metadata$refinebio_accession_code, rownames(module_eigengenes))
    ## [1] TRUE
    -
    # Create the design matrix from the refinebio_treatment variable
    -des_mat <- model.matrix(~ metadata$refinebio_treatment)
    +
    # Create the design matrix from the `time_point` variable
    +des_mat <- model.matrix(~ metadata$time_point)

    Run linear model on each module. Limma wants our tests to be per row, so we also need to transpose so the eigengenes are rows

    -
    # lmFit() needs a transposed version of the matrix
    -fit <- limma::lmFit(t(module_eigengenes), design = des_mat)
    -
    -# Apply empirical Bayes to smooth standard errors
    -fit <- limma::eBayes(fit)
    +
    # lmFit() needs a transposed version of the matrix
    +fit <- limma::lmFit(t(module_eigengenes), design = des_mat)
    +
    +# Apply empirical Bayes to smooth standard errors
    +fit <- limma::eBayes(fit)

    Apply multiple testing correction and obtain stats in a data frame.

    -
    # Apply multiple testing correction and obtain stats
    -stats_df <- limma::topTable(fit, number = ncol(module_eigengenes)) %>%
    -  tibble::rownames_to_column("module")
    +
    # Apply multiple testing correction and obtain stats
    +stats_df <- limma::topTable(fit, number = ncol(module_eigengenes)) %>%
    +  tibble::rownames_to_column("module")
    ## Removing intercept from test coefficients

    Let’s take a look at the results. They are sorted with the most significant results at the top.

    -
    head(stats_df)
    +
    head(stats_df)
    -

    Module 52 seems to be the most differentially expressed across refinebio_treatment groups. Now we can do some investigation into this module.

    +

    Module 18 seems to be the most differentially expressed across time_point groups. Now we can do some investigation into this module.

    -
    -

    4.11 Let’s make plot of module 52

    -

    As a sanity check, let’s use ggplot to see what module 52’s eigengene looks like between treatment groups.

    +
    +

    4.11 Let’s make plot of module 18

    +

    As a sanity check, let’s use ggplot to see what module 18’s eigengene looks like between treatment groups.

    First we need to set up the module eigengene for this module with the sample metadata labels we need.

    -
    module_52_df <- module_eigengenes %>%
    -  tibble::rownames_to_column("accession_code") %>%
    -  # Here we are performing an inner join with a subset of metadata
    -  dplyr::inner_join(metadata %>%
    -    dplyr::select(refinebio_accession_code, refinebio_treatment),
    -  by = c("accession_code" = "refinebio_accession_code")
    -  )
    +
    module_18_df <- module_eigengenes %>%
    +  tibble::rownames_to_column("accession_code") %>%
    +  # Here we are performing an inner join with a subset of metadata
    +  dplyr::inner_join(metadata %>%
    +    dplyr::select(refinebio_accession_code, time_point),
    +  by = c("accession_code" = "refinebio_accession_code")
    +  )

    Now we are ready for plotting.

    -
    ggplot(
    -  module_52_df,
    -  aes(
    -    x = refinebio_treatment,
    -    y = ME52,
    -    color = refinebio_treatment
    -  )
    -) +
    -  ggforce::geom_sina() +
    -  theme_classic()
    -

    -

    This makes sense! Looks like module 52 has elevated expression post treatment.

    +
    ggplot(
    +  module_18_df,
    +  aes(
    +    x = time_point,
    +    y = ME18,
    +    color = time_point
    +  )
    +) +
    +  ggforce::geom_sina() +
    +  theme_classic()
    +

    +

    This makes sense! Looks like module 18 has elevated expression during the acute illness but not when recovering.

    -
    -

    4.12 What genes are a part of module 52?

    +
    +

    4.12 What genes are a part of module 18?

    If you want to know which of your genes make up a modules, you can look at the $colors slot. This is a named list which associates the genes with the module they are a part of. We can turn this into a data frame for handy use.

    -
    gene_module_key <- tibble::enframe(bwnet$colors, name = "gene", value = "module") %>%
    -  # Let's add the `ME` part so its more clear what these numbers are and it matches elsewhere
    -  dplyr::mutate(module = paste0("ME", module))
    -

    Now we can find what genes are a part of module 52.

    -
    gene_module_key %>%
    -  dplyr::filter(module == "ME52")
    +
    gene_module_key <- tibble::enframe(bwnet$colors, name = "gene", value = "module") %>%
    +  # Let's add the `ME` part so its more clear what these numbers are and it matches elsewhere
    +  dplyr::mutate(module = paste0("ME", module))
    +

    Now we can find what genes are a part of module 18.

    +
    gene_module_key %>%
    +  dplyr::filter(module == "ME18")

    Let’s save this gene to module key to a TSV file for future use.

    -
    readr::write_tsv(gene_module_key,
    -  file = file.path("results", "SRP133573_wgcna_gene_to_module.tsv")
    -)
    +
    readr::write_tsv(gene_module_key,
    +  file = file.path("results", "SRP140558_wgcna_gene_to_module.tsv")
    +)

    4.13 Make a custom heatmap function

    We will make a heatmap that summarizes our differentially expressed module. Because we will make a couple of these, it makes sense to make a custom function for making this heatmap.

    -
    make_module_heatmap <- function(module_name,
    -                                expression_mat = normalized_counts,
    -                                metadata_df = metadata,
    -                                gene_module_key_df = gene_module_key,
    -                                module_eigengenes_df = module_eigengenes) {
    -  # Create a summary heatmap of a given module.
    -  #
    -  # Args:
    -  # module_name: a character indicating what module should be plotted, e.g. "ME52"
    -  # expression_mat: The full gene expression matrix. Default is `normalized_counts`.
    -  # metadata_df: a data frame with refinebio_accession_code and refinebio_treatment
    -  #              as columns. Default is `metadata`.
    -  # gene_module_key: a data.frame indicating what genes are a part of what modules. Default is `gene_module_key`.
    -  # module_eigengenes: a sample x eigengene data.frame with samples as rownames. Default is `module_eigengenes`.
    -  #
    -  # Returns:
    -  # A heatmap of expression matrix for a module's genes, with a barplot of the
    -  # eigengene expression for that module.
    -
    -  # Set up the module eigengene with its refinebio_accession_code
    -  module_eigengene <- module_eigengenes_df %>%
    -    dplyr::select(module_name) %>%
    -    tibble::rownames_to_column("refinebio_accession_code")
    -
    -  # Set up column annotation from metadata
    -  col_annot_df <- metadata_df %>%
    -    # Only select the treatment and sample ID columns
    -    dplyr::select(refinebio_accession_code, refinebio_treatment) %>%
    -    # Add on the eigengene expression by joining with sample IDs
    -    dplyr::inner_join(module_eigengene, by = "refinebio_accession_code") %>%
    -    # Arrange by treatment
    -    dplyr::arrange(refinebio_treatment, refinebio_accession_code) %>%
    -    # Store sample
    -    tibble::column_to_rownames("refinebio_accession_code")
    -
    -  # Create the ComplexHeatmap column annotation object
    -  col_annot <- ComplexHeatmap::HeatmapAnnotation(
    -    # Supply treatment labels
    -    refinebio_treatment = col_annot_df$refinebio_treatment,
    -    # Add annotation barplot
    -    module_eigengene = ComplexHeatmap::anno_barplot(dplyr::select(col_annot_df, module_name)),
    -    # Pick colors for each experimental group in refinebio_treatment
    -    col = list(refinebio_treatment = c("post-adt" = "#f1a340", "pre-adt" = "#998ec3"))
    -  )
    -
    -  # Get a vector of the Ensembl gene IDs that correspond to this module
    -  module_genes <- gene_module_key_df %>%
    -    dplyr::filter(module == module_name) %>%
    -    dplyr::pull(gene)
    -
    -  # Set up the gene expression data frame
    -  mod_mat <- expression_mat %>%
    -    t() %>%
    -    as.data.frame() %>%
    -    # Only keep genes from this module
    -    dplyr::filter(rownames(.) %in% module_genes) %>%
    -    # Order the samples to match col_annot_df
    -    dplyr::select(rownames(col_annot_df)) %>%
    -    # Data needs to be a matrix
    -    as.matrix()
    -
    -  # Normalize the gene expression values
    -  mod_mat <- mod_mat %>%
    -    # Scale can work on matrices, but it does it by column so we will need to
    -    # transpose first
    -    t() %>%
    -    scale() %>%
    -    # And now we need to transpose back
    -    t()
    -
    -  # Create a color function based on standardized scale
    -  color_func <- circlize::colorRamp2(
    -    c(-2, 0, 2),
    -    c("#67a9cf", "#f7f7f7", "#ef8a62")
    -  )
    -
    -  # Plot on a heatmap
    -  heatmap <- ComplexHeatmap::Heatmap(mod_mat,
    -    name = module_name,
    -    # Supply color function
    -    col = color_func,
    -    # Supply column annotation
    -    bottom_annotation = col_annot,
    -    # We don't want to cluster samples
    -    cluster_columns = FALSE,
    -    # We don't need to show sample or gene labels
    -    show_row_names = FALSE,
    -    show_column_names = FALSE
    -  )
    -
    -  # Return heatmap
    -  return(heatmap)
    -}
    +
    make_module_heatmap <- function(module_name,
    +                                expression_mat = normalized_counts,
    +                                metadata_df = metadata,
    +                                gene_module_key_df = gene_module_key,
    +                                module_eigengenes_df = module_eigengenes) {
    +  # Create a summary heatmap of a given module.
    +  #
    +  # Args:
    +  # module_name: a character indicating what module should be plotted, e.g. "ME18"
    +  # expression_mat: The full gene expression matrix. Default is `normalized_counts`.
    +  # metadata_df: a data frame with refinebio_accession_code and time_point
    +  #              as columns. Default is `metadata`.
    +  # gene_module_key: a data.frame indicating what genes are a part of what modules. Default is `gene_module_key`.
    +  # module_eigengenes: a sample x eigengene data.frame with samples as rownames. Default is `module_eigengenes`.
    +  #
    +  # Returns:
    +  # A heatmap of expression matrix for a module's genes, with a barplot of the
    +  # eigengene expression for that module.
    +
    +  # Set up the module eigengene with its refinebio_accession_code
    +  module_eigengene <- module_eigengenes_df %>%
    +    dplyr::select(module_name) %>%
    +    tibble::rownames_to_column("refinebio_accession_code")
    +
    +  # Set up column annotation from metadata
    +  col_annot_df <- metadata_df %>%
    +    # Only select the treatment and sample ID columns
    +    dplyr::select(refinebio_accession_code, time_point, refinebio_subject) %>%
    +    # Add on the eigengene expression by joining with sample IDs
    +    dplyr::inner_join(module_eigengene, by = "refinebio_accession_code") %>%
    +    # Arrange by patient and time point
    +    dplyr::arrange(time_point, refinebio_subject) %>%
    +    # Store sample
    +    tibble::column_to_rownames("refinebio_accession_code")
    +
    +  # Create the ComplexHeatmap column annotation object
    +  col_annot <- ComplexHeatmap::HeatmapAnnotation(
    +    # Supply treatment labels
    +    time_point = col_annot_df$time_point,
    +    # Add annotation barplot
    +    module_eigengene = ComplexHeatmap::anno_barplot(dplyr::select(col_annot_df, module_name)),
    +    # Pick colors for each experimental group in time_point
    +    col = list(time_point = c("recovering" = "#f1a340", "acute illness" = "#998ec3"))
    +  )
    +
    +  # Get a vector of the Ensembl gene IDs that correspond to this module
    +  module_genes <- gene_module_key_df %>%
    +    dplyr::filter(module == module_name) %>%
    +    dplyr::pull(gene)
    +
    +  # Set up the gene expression data frame
    +  mod_mat <- expression_mat %>%
    +    t() %>%
    +    as.data.frame() %>%
    +    # Only keep genes from this module
    +    dplyr::filter(rownames(.) %in% module_genes) %>%
    +    # Order the samples to match col_annot_df
    +    dplyr::select(rownames(col_annot_df)) %>%
    +    # Data needs to be a matrix
    +    as.matrix()
    +
    +  # Normalize the gene expression values
    +  mod_mat <- mod_mat %>%
    +    # Scale can work on matrices, but it does it by column so we will need to
    +    # transpose first
    +    t() %>%
    +    scale() %>%
    +    # And now we need to transpose back
    +    t()
    +
    +  # Create a color function based on standardized scale
    +  color_func <- circlize::colorRamp2(
    +    c(-2, 0, 2),
    +    c("#67a9cf", "#f7f7f7", "#ef8a62")
    +  )
    +
    +  # Plot on a heatmap
    +  heatmap <- ComplexHeatmap::Heatmap(mod_mat,
    +    name = module_name,
    +    # Supply color function
    +    col = color_func,
    +    # Supply column annotation
    +    bottom_annotation = col_annot,
    +    # We don't want to cluster samples
    +    cluster_columns = FALSE,
    +    # We don't need to show sample or gene labels
    +    show_row_names = FALSE,
    +    show_column_names = FALSE
    +  )
    +
    +  # Return heatmap
    +  return(heatmap)
    +}

    4.14 Make module heatmaps

    -

    Let’s try out the custom heatmap function with module 52 (our most differentially expressed module).

    -
    mod_52_heatmap <- make_module_heatmap(module_name = "ME52")
    +

    Let’s try out the custom heatmap function with module 18 (our most differentially expressed module).

    +
    mod_18_heatmap <- make_module_heatmap(module_name = "ME18")
    ## Note: Using an external vector in selections is ambiguous.
     ## ℹ Use `all_of(module_name)` instead of `module_name` to silence this message.
     ## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
     ## This message is displayed once per session.
    -
    # Print out the plot
    -mod_52_heatmap
    -

    -

    From the barplot portion of our plot, we can see post-adt samples have higher values for this eigengene for module 52. In the heatmap portion, we can see how the individual genes that make up module 52 have more extreme values (very high or very low) in the post-adt samples.

    +
    # Print out the plot
    +mod_18_heatmap
    +

    +

    From the barplot portion of our plot, we can see recovering samples have higher values for this eigengene for module 18. In the heatmap portion, we can see how the individual genes that make up module 18 are overall higher than in the recovering samples.

    We can save this plot to PDF.

    -
    pdf(file.path("results", "SRP133573_module_52_heatmap.pdf"))
    -mod_52_heatmap
    -dev.off()
    +
    pdf(file.path("results", "SRP140558_module_18_heatmap.pdf"))
    +mod_18_heatmap
    +dev.off()
    ## png 
     ##   2

    For comparison, let’s try out the custom heatmap function with a different, not differentially expressed module.

    -
    mod_10_heatmap <- make_module_heatmap(module_name = "ME10")
    -
    -# Print out the plot
    -mod_10_heatmap
    -

    -

    In this non-significant module’s heatmap, there’s not a particularly strong pattern between pre and post ADT samples. In general the expression of genes in module 10 does not vary much between groups, staying near the overall mean. There are a few samples and some genes that show higher expression, but it is not surprising this does not results in a significant overall difference between the groups.

    +
    mod_25_heatmap <- make_module_heatmap(module_name = "ME25")
    +
    +# Print out the plot
    +mod_25_heatmap
    +

    +

    In this non-significant module’s heatmap, there’s not a particularly strong pattern between acute illness and recovery samples.

    Save this plot also.

    -
    pdf(file.path("results", "SRP133573_module_10_heatmap.pdf"))
    -mod_10_heatmap
    -dev.off()
    +
    pdf(file.path("results", "SRP140558_module_25_heatmap.pdf"))
    +mod_25_heatmap
    +dev.off()
    ## png 
     ##   2
    @@ -4415,8 +4424,8 @@

    5 Resources for further learning<

    6 Session info

    At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

    -
    # Print session info
    -sessionInfo()
    +
    # Print session info
    +sessionInfo()
    ## R version 4.0.2 (2020-06-22)
     ## Platform: x86_64-pc-linux-gnu (64-bit)
     ## Running under: Ubuntu 20.04 LTS
    
    From bbb9d46d8f40688dd8e192d226de32f4ae2fde0b Mon Sep 17 00:00:00 2001
    From: Candace Savonen 
    Date: Tue, 24 Nov 2020 16:28:41 -0500
    Subject: [PATCH 05/12] Change to 7 and incorporate jashapiro review
    
    ---
     .../network-analysis_rnaseq_01_wgcna.Rmd      |  17 +-
     .../network-analysis_rnaseq_01_wgcna.html     | 556 ++++++++++--------
     2 files changed, 326 insertions(+), 247 deletions(-)
    
    diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd
    index 1a264859..5992d7a2 100644
    --- a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd
    +++ b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd
    @@ -397,7 +397,7 @@ ggplot(sft_df, aes(x = Power, y = model_fit, label = Power)) +
       # We will plot what WGCNA recommends as an R^2 cutoff
       geom_hline(yintercept = 0.80, col = "red") +
       # Just in case our values are low, we want to make sure we can still see the 0.80 level
    -  ylim(c(min(sft_df$model_fit), 1)) +
    +  ylim(c(min(sft_df$model_fit), 1.05)) +
       # We can add more sensible labels for our axis
       xlab("Soft Threshold (power)") +
       ylab("Scale Free Topology Model Fit, signed R^2") +
    @@ -412,14 +412,14 @@ WGCNA's authors recommend using a `power` that has an signed $R^2$ above `0.80`,
     If you have multiple power values with signed $R^2$ above `0.80`, then picking the one at an inflection point, in other words where the $R^2$ values seem to have reached their saturation [@Zhang2005].
     You want to a `power` that gives you a big enough $R^2$ but is not excessively large.
     
    -So using the plot above, going with a power soft-threshold of `2`!
    +So using the plot above, going with a power soft-threshold of `7`!
     
     If you find you have all very low $R^2$ values this may be because there are too many genes with low expression values that are cluttering up the calculations. 
     You can try returning to [gene filtering step](#define-a-minimum-counts-cutoff) and choosing a more stringent cutoff (you'll then need to re-run the transformation and subsequent steps to remake this plot to see if that helped). 
     
     ## Run WGCNA!
     
    -We will use the `blockwiseModules()` function to find gene co-expression modules in WGCNA, using `2` for the `power` argument like we determined above. 
    +We will use the `blockwiseModules()` function to find gene co-expression modules in WGCNA, using `7` for the `power` argument like we determined above. 
     
     This next step takes some time to run.
     The `blockwise` part of the `blockwiseModules()` function name refers to that these calculations will be done on chunks of your data at a time to help with conserving computing resources.
    @@ -438,7 +438,7 @@ operating system and other running programs.
     bwnet <- blockwiseModules(normalized_counts,
       maxBlockSize = 5000, # What size chunks (how many genes) the calculations should be run in
       TOMType = "signed", # topological overlap matrix
    -  power = 2, # soft threshold for network construction
    +  power = 7, # soft threshold for network construction
       numericLabels = TRUE, # Let's use numbers instead of colors for module labels
       randomSeed = 1234, # there's some randomness associated with this calculation
       # so we should set a seed
    @@ -546,7 +546,10 @@ ggplot(
         color = time_point
       )
     ) +
    -  ggforce::geom_sina() +
    +  ggforce::geom_sina(maxwidth = 0.3) +
    +  # Putting on a boxplot, but making it see through and so we don't put outliers
    +  # with boxplot
    +  geom_boxplot(width = 0.2, alpha = 0, outlier.shape = NA) +
       theme_classic()
     ```
     
    @@ -692,7 +695,7 @@ mod_18_heatmap <- make_module_heatmap(module_name = "ME18")
     mod_18_heatmap
     ```
     
    -From the barplot portion of our plot, we can see `recovering` samples have higher values for this eigengene for module 18.
    +From the barplot portion of our plot, we can see `acute illness` samples tend to have higher expression values for the module 18 eigengene.
     In the heatmap portion, we can see how the individual genes that make up module 18 are overall higher than in the `recovering` samples. 
     
     We can save this plot to PDF.
    @@ -736,7 +739,7 @@ This helps make your code more reproducible by recording what versions of softwa
     
     ```{r}
     # Print session info
    -sessionInfo()
    +sessioninfo::session_info()
     ```
     
     # References
    diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html
    index 4cbc159d..41230ad8 100644
    --- a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html
    +++ b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html
    @@ -4029,7 +4029,7 @@ 

    4.2 Import and set up data

    # Read in metadata TSV file
     metadata <- readr::read_tsv(metadata_file)
    ## 
    -## ── Column specification ─────────────────────────────────────────────────────────────────────────
    +## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────
     ## cols(
     ##   .default = col_logical(),
     ##   refinebio_accession_code = col_character(),
    @@ -4046,7 +4046,7 @@ 

    4.2 Import and set up data

    # Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later tibble::column_to_rownames("Gene")
    ## 
    -## ── Column specification ─────────────────────────────────────────────────────────────────────────
    +## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────
     ## cols(
     ##   .default = col_double(),
     ##   Gene = col_character()
    @@ -4156,84 +4156,83 @@ 

    4.6 Determine parameters for WGCN # We will plot what WGCNA recommends as an R^2 cutoff geom_hline(yintercept = 0.80, col = "red") + # Just in case our values are low, we want to make sure we can still see the 0.80 level - ylim(c(min(sft_df$model_fit), 1)) + + ylim(c(min(sft_df$model_fit), 1.05)) + # We can add more sensible labels for our axis xlab("Soft Threshold (power)") + ylab("Scale Free Topology Model Fit, signed R^2") + ggtitle("Scale independence") + # This adds some nicer aesthetics to our plot theme_classic()

    -
    ## Warning: Removed 7 rows containing missing values (geom_text).
    -

    +

    Using this plot we can decide on a power parameter. WGCNA’s authors recommend using a power that has an signed \(R^2\) above 0.80, otherwise they warn your results may be too noisy to be meaningful.

    If you have multiple power values with signed \(R^2\) above 0.80, then picking the one at an inflection point, in other words where the \(R^2\) values seem to have reached their saturation (Zhang and Horvath 2005). You want to a power that gives you a big enough \(R^2\) but is not excessively large.

    -

    So using the plot above, going with a power soft-threshold of 2!

    +

    So using the plot above, going with a power soft-threshold of 7!

    If you find you have all very low \(R^2\) values this may be because there are too many genes with low expression values that are cluttering up the calculations. You can try returning to gene filtering step and choosing a more stringent cutoff (you’ll then need to re-run the transformation and subsequent steps to remake this plot to see if that helped).

    4.7 Run WGCNA!

    -

    We will use the blockwiseModules() function to find gene co-expression modules in WGCNA, using 2 for the power argument like we determined above.

    +

    We will use the blockwiseModules() function to find gene co-expression modules in WGCNA, using 7 for the power argument like we determined above.

    This next step takes some time to run. The blockwise part of the blockwiseModules() function name refers to that these calculations will be done on chunks of your data at a time to help with conserving computing resources.

    Here we are using the default maxBlockSize, 5000 but, you may want to adjust the maxBlockSize argument depending on your computer’s memory. The authors of WGCNA recommend running the largest block your computer can handle and they provide some approximations as to GB of memory of a laptop and what maxBlockSize it should be able to handle:

    • If the reader has access to a large workstation with more than 4 GB of memory, the parameter maxBlockSize can be increased. A 16GB workstation should handle up to 20000 probes; a 32GB workstation should handle perhaps 30000. A 4GB standard desktop or a laptop may handle up to 8000-10000 probes, depending on operating system and other running programs.

    (Langfelder and Horvath 2016)

    -
    bwnet <- blockwiseModules(normalized_counts,
    -  maxBlockSize = 5000, # What size chunks (how many genes) the calculations should be run in
    -  TOMType = "signed", # topological overlap matrix
    -  power = 2, # soft threshold for network construction
    -  numericLabels = TRUE, # Let's use numbers instead of colors for module labels
    -  randomSeed = 1234, # there's some randomness associated with this calculation
    -  # so we should set a seed
    -)
    +
    bwnet <- blockwiseModules(normalized_counts,
    +  maxBlockSize = 5000, # What size chunks (how many genes) the calculations should be run in
    +  TOMType = "signed", # topological overlap matrix
    +  power = 7, # soft threshold for network construction
    +  numericLabels = TRUE, # Let's use numbers instead of colors for module labels
    +  randomSeed = 1234, # there's some randomness associated with this calculation
    +  # so we should set a seed
    +)

    The TOMtype argument specifies what kind of topological overlap matrix (TOM) should be used to make gene modules. You can safely assume for most situations a signed network represents what you want – we want WGCNA to pay attention to directionality. However if you suspect you may benefit from an unsigned network, where positive/negative is ignored see this article to help you figure that out (Langfelder 2018).

    There are a lot of other settings you can tweak – look at ?blockwiseModules help page as well as the WGCNA tutorial (Langfelder and Horvath 2016).

    4.8 Write main WGCNA results object to file

    We will save our whole results object to an RDS file in case we want to return to our original WGCNA results.

    -
    readr::write_rds(bwnet,
    -  file = file.path("results", "SRP140558_wgcna_results.RDS")
    -)
    +
    readr::write_rds(bwnet,
    +  file = file.path("results", "SRP140558_wgcna_results.RDS")
    +)

    4.9 Explore our WGCNA results

    The bwnet object has many parts, storing a lot of information. We can pull out the parts we are most interested in and may want to use use for plotting.

    In bwnet we have a data frame of eigengene module data for each sample in the MEs slot. These represent the collapsed, combined, and normalized expression of the genes that make up each module.

    -
    module_eigengenes <- bwnet$MEs
    -
    -# Print out a preview
    -head(module_eigengenes)
    +
    module_eigengenes <- bwnet$MEs
    +
    +# Print out a preview
    +head(module_eigengenes)

    4.10 Which modules have biggest differences across treatment groups?

    We can also see if our eigengenes relate to our metadata labels. First we double check that our samples are still in order.

    -
    all.equal(metadata$refinebio_accession_code, rownames(module_eigengenes))
    +
    all.equal(metadata$refinebio_accession_code, rownames(module_eigengenes))
    ## [1] TRUE
    -
    # Create the design matrix from the `time_point` variable
    -des_mat <- model.matrix(~ metadata$time_point)
    +
    # Create the design matrix from the `time_point` variable
    +des_mat <- model.matrix(~ metadata$time_point)

    Run linear model on each module. Limma wants our tests to be per row, so we also need to transpose so the eigengenes are rows

    -
    # lmFit() needs a transposed version of the matrix
    -fit <- limma::lmFit(t(module_eigengenes), design = des_mat)
    -
    -# Apply empirical Bayes to smooth standard errors
    -fit <- limma::eBayes(fit)
    +
    # lmFit() needs a transposed version of the matrix
    +fit <- limma::lmFit(t(module_eigengenes), design = des_mat)
    +
    +# Apply empirical Bayes to smooth standard errors
    +fit <- limma::eBayes(fit)

    Apply multiple testing correction and obtain stats in a data frame.

    -
    # Apply multiple testing correction and obtain stats
    -stats_df <- limma::topTable(fit, number = ncol(module_eigengenes)) %>%
    -  tibble::rownames_to_column("module")
    +
    # Apply multiple testing correction and obtain stats
    +stats_df <- limma::topTable(fit, number = ncol(module_eigengenes)) %>%
    +  tibble::rownames_to_column("module")
    ## Removing intercept from test coefficients

    Let’s take a look at the results. They are sorted with the most significant results at the top.

    -
    head(stats_df)
    +
    head(stats_df)

    Module 18 seems to be the most differentially expressed across time_point groups. Now we can do some investigation into this module.

    @@ -4242,172 +4241,175 @@

    4.10 Which modules have biggest d

    4.11 Let’s make plot of module 18

    As a sanity check, let’s use ggplot to see what module 18’s eigengene looks like between treatment groups.

    First we need to set up the module eigengene for this module with the sample metadata labels we need.

    -
    module_18_df <- module_eigengenes %>%
    -  tibble::rownames_to_column("accession_code") %>%
    -  # Here we are performing an inner join with a subset of metadata
    -  dplyr::inner_join(metadata %>%
    -    dplyr::select(refinebio_accession_code, time_point),
    -  by = c("accession_code" = "refinebio_accession_code")
    -  )
    +
    module_18_df <- module_eigengenes %>%
    +  tibble::rownames_to_column("accession_code") %>%
    +  # Here we are performing an inner join with a subset of metadata
    +  dplyr::inner_join(metadata %>%
    +    dplyr::select(refinebio_accession_code, time_point),
    +  by = c("accession_code" = "refinebio_accession_code")
    +  )

    Now we are ready for plotting.

    -
    ggplot(
    -  module_18_df,
    -  aes(
    -    x = time_point,
    -    y = ME18,
    -    color = time_point
    -  )
    -) +
    -  ggforce::geom_sina() +
    -  theme_classic()
    -

    +
    ggplot(
    +  module_18_df,
    +  aes(
    +    x = time_point,
    +    y = ME18,
    +    color = time_point
    +  )
    +) +
    +  ggforce::geom_sina(maxwidth = 0.3) +
    +  # Putting on a boxplot, but making it see through and so we don't put outliers
    +  # with boxplot
    +  geom_boxplot(width = 0.2, alpha = 0, outlier.shape = NA) +
    +  theme_classic()
    +

    This makes sense! Looks like module 18 has elevated expression during the acute illness but not when recovering.

    4.12 What genes are a part of module 18?

    If you want to know which of your genes make up a modules, you can look at the $colors slot. This is a named list which associates the genes with the module they are a part of. We can turn this into a data frame for handy use.

    -
    gene_module_key <- tibble::enframe(bwnet$colors, name = "gene", value = "module") %>%
    -  # Let's add the `ME` part so its more clear what these numbers are and it matches elsewhere
    -  dplyr::mutate(module = paste0("ME", module))
    +
    gene_module_key <- tibble::enframe(bwnet$colors, name = "gene", value = "module") %>%
    +  # Let's add the `ME` part so its more clear what these numbers are and it matches elsewhere
    +  dplyr::mutate(module = paste0("ME", module))

    Now we can find what genes are a part of module 18.

    -
    gene_module_key %>%
    -  dplyr::filter(module == "ME18")
    +
    gene_module_key %>%
    +  dplyr::filter(module == "ME18")

    Let’s save this gene to module key to a TSV file for future use.

    -
    readr::write_tsv(gene_module_key,
    -  file = file.path("results", "SRP140558_wgcna_gene_to_module.tsv")
    -)
    +
    readr::write_tsv(gene_module_key,
    +  file = file.path("results", "SRP140558_wgcna_gene_to_module.tsv")
    +)

    4.13 Make a custom heatmap function

    We will make a heatmap that summarizes our differentially expressed module. Because we will make a couple of these, it makes sense to make a custom function for making this heatmap.

    -
    make_module_heatmap <- function(module_name,
    -                                expression_mat = normalized_counts,
    -                                metadata_df = metadata,
    -                                gene_module_key_df = gene_module_key,
    -                                module_eigengenes_df = module_eigengenes) {
    -  # Create a summary heatmap of a given module.
    -  #
    -  # Args:
    -  # module_name: a character indicating what module should be plotted, e.g. "ME18"
    -  # expression_mat: The full gene expression matrix. Default is `normalized_counts`.
    -  # metadata_df: a data frame with refinebio_accession_code and time_point
    -  #              as columns. Default is `metadata`.
    -  # gene_module_key: a data.frame indicating what genes are a part of what modules. Default is `gene_module_key`.
    -  # module_eigengenes: a sample x eigengene data.frame with samples as rownames. Default is `module_eigengenes`.
    -  #
    -  # Returns:
    -  # A heatmap of expression matrix for a module's genes, with a barplot of the
    -  # eigengene expression for that module.
    -
    -  # Set up the module eigengene with its refinebio_accession_code
    -  module_eigengene <- module_eigengenes_df %>%
    -    dplyr::select(module_name) %>%
    -    tibble::rownames_to_column("refinebio_accession_code")
    -
    -  # Set up column annotation from metadata
    -  col_annot_df <- metadata_df %>%
    -    # Only select the treatment and sample ID columns
    -    dplyr::select(refinebio_accession_code, time_point, refinebio_subject) %>%
    -    # Add on the eigengene expression by joining with sample IDs
    -    dplyr::inner_join(module_eigengene, by = "refinebio_accession_code") %>%
    -    # Arrange by patient and time point
    -    dplyr::arrange(time_point, refinebio_subject) %>%
    -    # Store sample
    -    tibble::column_to_rownames("refinebio_accession_code")
    -
    -  # Create the ComplexHeatmap column annotation object
    -  col_annot <- ComplexHeatmap::HeatmapAnnotation(
    -    # Supply treatment labels
    -    time_point = col_annot_df$time_point,
    -    # Add annotation barplot
    -    module_eigengene = ComplexHeatmap::anno_barplot(dplyr::select(col_annot_df, module_name)),
    -    # Pick colors for each experimental group in time_point
    -    col = list(time_point = c("recovering" = "#f1a340", "acute illness" = "#998ec3"))
    -  )
    -
    -  # Get a vector of the Ensembl gene IDs that correspond to this module
    -  module_genes <- gene_module_key_df %>%
    -    dplyr::filter(module == module_name) %>%
    -    dplyr::pull(gene)
    -
    -  # Set up the gene expression data frame
    -  mod_mat <- expression_mat %>%
    -    t() %>%
    -    as.data.frame() %>%
    -    # Only keep genes from this module
    -    dplyr::filter(rownames(.) %in% module_genes) %>%
    -    # Order the samples to match col_annot_df
    -    dplyr::select(rownames(col_annot_df)) %>%
    -    # Data needs to be a matrix
    -    as.matrix()
    -
    -  # Normalize the gene expression values
    -  mod_mat <- mod_mat %>%
    -    # Scale can work on matrices, but it does it by column so we will need to
    -    # transpose first
    -    t() %>%
    -    scale() %>%
    -    # And now we need to transpose back
    -    t()
    -
    -  # Create a color function based on standardized scale
    -  color_func <- circlize::colorRamp2(
    -    c(-2, 0, 2),
    -    c("#67a9cf", "#f7f7f7", "#ef8a62")
    -  )
    -
    -  # Plot on a heatmap
    -  heatmap <- ComplexHeatmap::Heatmap(mod_mat,
    -    name = module_name,
    -    # Supply color function
    -    col = color_func,
    -    # Supply column annotation
    -    bottom_annotation = col_annot,
    -    # We don't want to cluster samples
    -    cluster_columns = FALSE,
    -    # We don't need to show sample or gene labels
    -    show_row_names = FALSE,
    -    show_column_names = FALSE
    -  )
    -
    -  # Return heatmap
    -  return(heatmap)
    -}
    +
    make_module_heatmap <- function(module_name,
    +                                expression_mat = normalized_counts,
    +                                metadata_df = metadata,
    +                                gene_module_key_df = gene_module_key,
    +                                module_eigengenes_df = module_eigengenes) {
    +  # Create a summary heatmap of a given module.
    +  #
    +  # Args:
    +  # module_name: a character indicating what module should be plotted, e.g. "ME18"
    +  # expression_mat: The full gene expression matrix. Default is `normalized_counts`.
    +  # metadata_df: a data frame with refinebio_accession_code and time_point
    +  #              as columns. Default is `metadata`.
    +  # gene_module_key: a data.frame indicating what genes are a part of what modules. Default is `gene_module_key`.
    +  # module_eigengenes: a sample x eigengene data.frame with samples as rownames. Default is `module_eigengenes`.
    +  #
    +  # Returns:
    +  # A heatmap of expression matrix for a module's genes, with a barplot of the
    +  # eigengene expression for that module.
    +
    +  # Set up the module eigengene with its refinebio_accession_code
    +  module_eigengene <- module_eigengenes_df %>%
    +    dplyr::select(module_name) %>%
    +    tibble::rownames_to_column("refinebio_accession_code")
    +
    +  # Set up column annotation from metadata
    +  col_annot_df <- metadata_df %>%
    +    # Only select the treatment and sample ID columns
    +    dplyr::select(refinebio_accession_code, time_point, refinebio_subject) %>%
    +    # Add on the eigengene expression by joining with sample IDs
    +    dplyr::inner_join(module_eigengene, by = "refinebio_accession_code") %>%
    +    # Arrange by patient and time point
    +    dplyr::arrange(time_point, refinebio_subject) %>%
    +    # Store sample
    +    tibble::column_to_rownames("refinebio_accession_code")
    +
    +  # Create the ComplexHeatmap column annotation object
    +  col_annot <- ComplexHeatmap::HeatmapAnnotation(
    +    # Supply treatment labels
    +    time_point = col_annot_df$time_point,
    +    # Add annotation barplot
    +    module_eigengene = ComplexHeatmap::anno_barplot(dplyr::select(col_annot_df, module_name)),
    +    # Pick colors for each experimental group in time_point
    +    col = list(time_point = c("recovering" = "#f1a340", "acute illness" = "#998ec3"))
    +  )
    +
    +  # Get a vector of the Ensembl gene IDs that correspond to this module
    +  module_genes <- gene_module_key_df %>%
    +    dplyr::filter(module == module_name) %>%
    +    dplyr::pull(gene)
    +
    +  # Set up the gene expression data frame
    +  mod_mat <- expression_mat %>%
    +    t() %>%
    +    as.data.frame() %>%
    +    # Only keep genes from this module
    +    dplyr::filter(rownames(.) %in% module_genes) %>%
    +    # Order the samples to match col_annot_df
    +    dplyr::select(rownames(col_annot_df)) %>%
    +    # Data needs to be a matrix
    +    as.matrix()
    +
    +  # Normalize the gene expression values
    +  mod_mat <- mod_mat %>%
    +    # Scale can work on matrices, but it does it by column so we will need to
    +    # transpose first
    +    t() %>%
    +    scale() %>%
    +    # And now we need to transpose back
    +    t()
    +
    +  # Create a color function based on standardized scale
    +  color_func <- circlize::colorRamp2(
    +    c(-2, 0, 2),
    +    c("#67a9cf", "#f7f7f7", "#ef8a62")
    +  )
    +
    +  # Plot on a heatmap
    +  heatmap <- ComplexHeatmap::Heatmap(mod_mat,
    +    name = module_name,
    +    # Supply color function
    +    col = color_func,
    +    # Supply column annotation
    +    bottom_annotation = col_annot,
    +    # We don't want to cluster samples
    +    cluster_columns = FALSE,
    +    # We don't need to show sample or gene labels
    +    show_row_names = FALSE,
    +    show_column_names = FALSE
    +  )
    +
    +  # Return heatmap
    +  return(heatmap)
    +}

    4.14 Make module heatmaps

    Let’s try out the custom heatmap function with module 18 (our most differentially expressed module).

    -
    mod_18_heatmap <- make_module_heatmap(module_name = "ME18")
    +
    mod_18_heatmap <- make_module_heatmap(module_name = "ME18")
    ## Note: Using an external vector in selections is ambiguous.
     ## ℹ Use `all_of(module_name)` instead of `module_name` to silence this message.
     ## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
     ## This message is displayed once per session.
    -
    # Print out the plot
    -mod_18_heatmap
    -

    -

    From the barplot portion of our plot, we can see recovering samples have higher values for this eigengene for module 18. In the heatmap portion, we can see how the individual genes that make up module 18 are overall higher than in the recovering samples.

    +
    # Print out the plot
    +mod_18_heatmap
    +

    +

    From the barplot portion of our plot, we can see acute illness samples tend to have higher expression values for the module 18 eigengene. In the heatmap portion, we can see how the individual genes that make up module 18 are overall higher than in the recovering samples.

    We can save this plot to PDF.

    -
    pdf(file.path("results", "SRP140558_module_18_heatmap.pdf"))
    -mod_18_heatmap
    -dev.off()
    +
    pdf(file.path("results", "SRP140558_module_18_heatmap.pdf"))
    +mod_18_heatmap
    +dev.off()
    ## png 
     ##   2

    For comparison, let’s try out the custom heatmap function with a different, not differentially expressed module.

    -
    mod_25_heatmap <- make_module_heatmap(module_name = "ME25")
    -
    -# Print out the plot
    -mod_25_heatmap
    -

    +
    mod_25_heatmap <- make_module_heatmap(module_name = "ME25")
    +
    +# Print out the plot
    +mod_25_heatmap
    +

    In this non-significant module’s heatmap, there’s not a particularly strong pattern between acute illness and recovery samples.

    Save this plot also.

    -
    pdf(file.path("results", "SRP140558_module_25_heatmap.pdf"))
    -mod_25_heatmap
    -dev.off()
    +
    pdf(file.path("results", "SRP140558_module_25_heatmap.pdf"))
    +mod_25_heatmap
    +dev.off()
    ## png 
     ##   2
    @@ -4424,76 +4426,150 @@

    5 Resources for further learning<

    6 Session info

    At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

    -
    # Print session info
    -sessionInfo()
    -
    ## R version 4.0.2 (2020-06-22)
    -## Platform: x86_64-pc-linux-gnu (64-bit)
    -## Running under: Ubuntu 20.04 LTS
    +
    # Print session info
    +sessioninfo::session_info()
    +
    ## ─ Session info ───────────────────────────────────────────────────────────────
    +##  setting  value                       
    +##  version  R version 4.0.2 (2020-06-22)
    +##  os       Ubuntu 20.04 LTS            
    +##  system   x86_64, linux-gnu           
    +##  ui       X11                         
    +##  language (EN)                        
    +##  collate  en_US.UTF-8                 
    +##  ctype    en_US.UTF-8                 
    +##  tz       Etc/UTC                     
    +##  date     2020-11-24                  
     ## 
    -## Matrix products: default
    -## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
    +## ─ Packages ───────────────────────────────────────────────────────────────────
    +##  package              * version  date       lib source        
    +##  annotate               1.68.0   2020-10-27 [1] Bioconductor  
    +##  AnnotationDbi          1.52.0   2020-10-27 [1] Bioconductor  
    +##  assertthat             0.2.1    2019-03-21 [1] RSPM (R 4.0.0)
    +##  backports              1.1.10   2020-09-15 [1] RSPM (R 4.0.2)
    +##  base64enc              0.1-3    2015-07-28 [1] RSPM (R 4.0.0)
    +##  Biobase              * 2.50.0   2020-10-27 [1] Bioconductor  
    +##  BiocGenerics         * 0.36.0   2020-10-27 [1] Bioconductor  
    +##  BiocParallel           1.24.1   2020-11-06 [1] Bioconductor  
    +##  bit                    4.0.4    2020-08-04 [1] RSPM (R 4.0.2)
    +##  bit64                  4.0.5    2020-08-30 [1] RSPM (R 4.0.2)
    +##  bitops                 1.0-6    2013-08-17 [1] RSPM (R 4.0.0)
    +##  blob                   1.2.1    2020-01-20 [1] RSPM (R 4.0.0)
    +##  Cairo                  1.5-12.2 2020-07-07 [1] RSPM (R 4.0.2)
    +##  checkmate              2.0.0    2020-02-06 [1] RSPM (R 4.0.0)
    +##  circlize               0.4.10   2020-06-15 [1] RSPM (R 4.0.0)
    +##  cli                    2.1.0    2020-10-12 [1] RSPM (R 4.0.2)
    +##  clue                   0.3-57   2019-02-25 [1] RSPM (R 4.0.0)
    +##  cluster                2.1.0    2019-06-19 [1] RSPM (R 4.0.0)
    +##  codetools              0.2-16   2018-12-24 [2] CRAN (R 4.0.2)
    +##  colorspace             1.4-1    2019-03-18 [1] RSPM (R 4.0.0)
    +##  ComplexHeatmap         2.6.0    2020-10-27 [1] Bioconductor  
    +##  crayon                 1.3.4    2017-09-16 [1] RSPM (R 4.0.0)
    +##  data.table             1.13.0   2020-07-24 [1] RSPM (R 4.0.2)
    +##  DBI                    1.1.0    2019-12-15 [1] RSPM (R 4.0.0)
    +##  DelayedArray           0.16.0   2020-10-27 [1] Bioconductor  
    +##  DESeq2               * 1.30.0   2020-10-27 [1] Bioconductor  
    +##  digest                 0.6.25   2020-02-23 [1] RSPM (R 4.0.0)
    +##  doParallel             1.0.15   2019-08-02 [1] RSPM (R 4.0.0)
    +##  dplyr                  1.0.2    2020-08-18 [1] RSPM (R 4.0.2)
    +##  dynamicTreeCut       * 1.63-1   2016-03-11 [1] RSPM (R 4.0.0)
    +##  ellipsis               0.3.1    2020-05-15 [1] RSPM (R 4.0.0)
    +##  evaluate               0.14     2019-05-28 [1] RSPM (R 4.0.0)
    +##  fansi                  0.4.1    2020-01-08 [1] RSPM (R 4.0.0)
    +##  farver                 2.0.3    2020-01-16 [1] RSPM (R 4.0.0)
    +##  fastcluster          * 1.1.25   2018-06-07 [1] RSPM (R 4.0.0)
    +##  foreach                1.5.0    2020-03-30 [1] RSPM (R 4.0.0)
    +##  foreign                0.8-80   2020-05-24 [2] CRAN (R 4.0.2)
    +##  Formula                1.2-3    2018-05-03 [1] RSPM (R 4.0.0)
    +##  genefilter             1.72.0   2020-10-27 [1] Bioconductor  
    +##  geneplotter            1.68.0   2020-10-27 [1] Bioconductor  
    +##  generics               0.0.2    2018-11-29 [1] RSPM (R 4.0.0)
    +##  GenomeInfoDb         * 1.26.0   2020-10-27 [1] Bioconductor  
    +##  GenomeInfoDbData       1.2.4    2020-11-09 [1] Bioconductor  
    +##  GenomicRanges        * 1.42.0   2020-10-27 [1] Bioconductor  
    +##  getopt                 1.20.3   2019-03-22 [1] RSPM (R 4.0.0)
    +##  GetoptLong             1.0.3    2020-10-01 [1] RSPM (R 4.0.2)
    +##  ggforce                0.3.2    2020-06-23 [1] RSPM (R 4.0.2)
    +##  ggplot2              * 3.3.2    2020-06-19 [1] RSPM (R 4.0.1)
    +##  GlobalOptions          0.1.2    2020-06-10 [1] RSPM (R 4.0.0)
    +##  glue                   1.4.2    2020-08-27 [1] RSPM (R 4.0.2)
    +##  GO.db                  3.12.1   2020-11-09 [1] Bioconductor  
    +##  gridExtra              2.3      2017-09-09 [1] RSPM (R 4.0.0)
    +##  gtable                 0.3.0    2019-03-25 [1] RSPM (R 4.0.0)
    +##  Hmisc                  4.4-1    2020-08-10 [1] RSPM (R 4.0.2)
    +##  hms                    0.5.3    2020-01-08 [1] RSPM (R 4.0.0)
    +##  htmlTable              2.1.0    2020-09-16 [1] RSPM (R 4.0.2)
    +##  htmltools              0.5.0    2020-06-16 [1] RSPM (R 4.0.1)
    +##  htmlwidgets            1.5.2    2020-10-03 [1] RSPM (R 4.0.2)
    +##  httr                   1.4.2    2020-07-20 [1] RSPM (R 4.0.2)
    +##  impute                 1.64.0   2020-10-27 [1] Bioconductor  
    +##  IRanges              * 2.24.0   2020-10-27 [1] Bioconductor  
    +##  iterators              1.0.12   2019-07-26 [1] RSPM (R 4.0.0)
    +##  jpeg                   0.1-8.1  2019-10-24 [1] RSPM (R 4.0.0)
    +##  jsonlite               1.7.1    2020-09-07 [1] RSPM (R 4.0.2)
    +##  knitr                  1.30     2020-09-22 [1] RSPM (R 4.0.2)
    +##  labeling               0.3      2014-08-23 [1] RSPM (R 4.0.0)
    +##  lattice                0.20-41  2020-04-02 [2] CRAN (R 4.0.2)
    +##  latticeExtra           0.6-29   2019-12-19 [1] RSPM (R 4.0.0)
    +##  lifecycle              0.2.0    2020-03-06 [1] RSPM (R 4.0.0)
    +##  limma                  3.46.0   2020-10-27 [1] Bioconductor  
    +##  locfit                 1.5-9.4  2020-03-25 [1] RSPM (R 4.0.0)
    +##  magick                 2.4.0    2020-06-23 [1] RSPM (R 4.0.2)
    +##  magrittr             * 1.5      2014-11-22 [1] RSPM (R 4.0.0)
    +##  MASS                   7.3-51.6 2020-04-26 [2] CRAN (R 4.0.2)
    +##  Matrix                 1.2-18   2019-11-27 [2] CRAN (R 4.0.2)
    +##  MatrixGenerics       * 1.2.0    2020-10-27 [1] Bioconductor  
    +##  matrixStats          * 0.57.0   2020-09-25 [1] RSPM (R 4.0.2)
    +##  memoise                1.1.0    2017-04-21 [1] RSPM (R 4.0.0)
    +##  munsell                0.5.0    2018-06-12 [1] RSPM (R 4.0.0)
    +##  nnet                   7.3-14   2020-04-26 [2] CRAN (R 4.0.2)
    +##  optparse             * 1.6.6    2020-04-16 [1] RSPM (R 4.0.0)
    +##  pillar                 1.4.6    2020-07-10 [1] RSPM (R 4.0.2)
    +##  pkgconfig              2.0.3    2019-09-22 [1] RSPM (R 4.0.0)
    +##  png                    0.1-7    2013-12-03 [1] RSPM (R 4.0.0)
    +##  polyclip               1.10-0   2019-03-14 [1] RSPM (R 4.0.0)
    +##  preprocessCore         1.52.0   2020-10-27 [1] Bioconductor  
    +##  ps                     1.4.0    2020-10-07 [1] RSPM (R 4.0.2)
    +##  purrr                  0.3.4    2020-04-17 [1] RSPM (R 4.0.0)
    +##  R.cache                0.14.0   2019-12-06 [1] RSPM (R 4.0.0)
    +##  R.methodsS3            1.8.1    2020-08-26 [1] RSPM (R 4.0.2)
    +##  R.oo                   1.24.0   2020-08-26 [1] RSPM (R 4.0.2)
    +##  R.utils                2.10.1   2020-08-26 [1] RSPM (R 4.0.2)
    +##  R6                     2.4.1    2019-11-12 [1] RSPM (R 4.0.0)
    +##  RColorBrewer           1.1-2    2014-12-07 [1] RSPM (R 4.0.0)
    +##  Rcpp                   1.0.5    2020-07-06 [1] RSPM (R 4.0.2)
    +##  RCurl                  1.98-1.2 2020-04-18 [1] RSPM (R 4.0.0)
    +##  readr                  1.4.0    2020-10-05 [1] RSPM (R 4.0.2)
    +##  rematch2               2.1.2    2020-05-01 [1] RSPM (R 4.0.0)
    +##  rjson                  0.2.20   2018-06-08 [1] RSPM (R 4.0.0)
    +##  rlang                  0.4.8    2020-10-08 [1] RSPM (R 4.0.2)
    +##  rmarkdown              2.4      2020-09-30 [1] RSPM (R 4.0.2)
    +##  rpart                  4.1-15   2019-04-12 [2] CRAN (R 4.0.2)
    +##  RSQLite                2.2.1    2020-09-30 [1] RSPM (R 4.0.2)
    +##  rstudioapi             0.11     2020-02-07 [1] RSPM (R 4.0.0)
    +##  S4Vectors            * 0.28.0   2020-10-27 [1] Bioconductor  
    +##  scales                 1.1.1    2020-05-11 [1] RSPM (R 4.0.0)
    +##  sessioninfo            1.1.1    2018-11-05 [1] RSPM (R 4.0.0)
    +##  shape                  1.4.5    2020-09-13 [1] RSPM (R 4.0.2)
    +##  stringi                1.5.3    2020-09-09 [1] RSPM (R 4.0.2)
    +##  stringr                1.4.0    2019-02-10 [1] RSPM (R 4.0.0)
    +##  styler                 1.3.2    2020-02-23 [1] RSPM (R 4.0.0)
    +##  SummarizedExperiment * 1.20.0   2020-10-27 [1] Bioconductor  
    +##  survival               3.1-12   2020-04-10 [2] CRAN (R 4.0.2)
    +##  tibble                 3.0.4    2020-10-12 [1] RSPM (R 4.0.2)
    +##  tidyselect             1.1.0    2020-05-11 [1] RSPM (R 4.0.0)
    +##  tweenr                 1.0.1    2018-12-14 [1] RSPM (R 4.0.2)
    +##  vctrs                  0.3.4    2020-08-29 [1] RSPM (R 4.0.2)
    +##  WGCNA                * 1.69     2020-02-28 [1] RSPM (R 4.0.2)
    +##  withr                  2.3.0    2020-09-22 [1] RSPM (R 4.0.2)
    +##  xfun                   0.18     2020-09-29 [1] RSPM (R 4.0.2)
    +##  XML                    3.99-0.5 2020-07-23 [1] RSPM (R 4.0.2)
    +##  xtable                 1.8-4    2019-04-21 [1] RSPM (R 4.0.0)
    +##  XVector                0.30.0   2020-10-27 [1] Bioconductor  
    +##  yaml                   2.2.1    2020-02-01 [1] RSPM (R 4.0.0)
    +##  zlibbioc               1.36.0   2020-10-27 [1] Bioconductor  
     ## 
    -## locale:
    -##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
    -##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
    -##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
    -##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
    -##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
    -## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
    -## 
    -## attached base packages:
    -## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
    -## [8] methods   base     
    -## 
    -## other attached packages:
    -##  [1] ggplot2_3.3.2               WGCNA_1.69                 
    -##  [3] fastcluster_1.1.25          dynamicTreeCut_1.63-1      
    -##  [5] magrittr_1.5                DESeq2_1.30.0              
    -##  [7] SummarizedExperiment_1.20.0 Biobase_2.50.0             
    -##  [9] MatrixGenerics_1.2.0        matrixStats_0.57.0         
    -## [11] GenomicRanges_1.42.0        GenomeInfoDb_1.26.0        
    -## [13] IRanges_2.24.0              S4Vectors_0.28.0           
    -## [15] BiocGenerics_0.36.0         optparse_1.6.6             
    -## 
    -## loaded via a namespace (and not attached):
    -##   [1] colorspace_1.4-1       rjson_0.2.20           ellipsis_0.3.1        
    -##   [4] circlize_0.4.10        htmlTable_2.1.0        XVector_0.30.0        
    -##   [7] GlobalOptions_0.1.2    base64enc_0.1-3        clue_0.3-57           
    -##  [10] rstudioapi_0.11        farver_2.0.3           getopt_1.20.3         
    -##  [13] bit64_4.0.5            AnnotationDbi_1.52.0   fansi_0.4.1           
    -##  [16] codetools_0.2-16       splines_4.0.2          R.methodsS3_1.8.1     
    -##  [19] doParallel_1.0.15      impute_1.64.0          geneplotter_1.68.0    
    -##  [22] knitr_1.30             polyclip_1.10-0        jsonlite_1.7.1        
    -##  [25] Formula_1.2-3          Cairo_1.5-12.2         annotate_1.68.0       
    -##  [28] cluster_2.1.0          GO.db_3.12.1           png_0.1-7             
    -##  [31] R.oo_1.24.0            ggforce_0.3.2          readr_1.4.0           
    -##  [34] compiler_4.0.2         httr_1.4.2             backports_1.1.10      
    -##  [37] assertthat_0.2.1       Matrix_1.2-18          limma_3.46.0          
    -##  [40] cli_2.1.0              tweenr_1.0.1           htmltools_0.5.0       
    -##  [43] tools_4.0.2            gtable_0.3.0           glue_1.4.2            
    -##  [46] GenomeInfoDbData_1.2.4 dplyr_1.0.2            Rcpp_1.0.5            
    -##  [49] styler_1.3.2           vctrs_0.3.4            preprocessCore_1.52.0 
    -##  [52] iterators_1.0.12       xfun_0.18              stringr_1.4.0         
    -##  [55] ps_1.4.0               lifecycle_0.2.0        XML_3.99-0.5          
    -##  [58] MASS_7.3-51.6          zlibbioc_1.36.0        scales_1.1.1          
    -##  [61] hms_0.5.3              rematch2_2.1.2         RColorBrewer_1.1-2    
    -##  [64] ComplexHeatmap_2.6.0   yaml_2.2.1             memoise_1.1.0         
    -##  [67] gridExtra_2.3          rpart_4.1-15           latticeExtra_0.6-29   
    -##  [70] stringi_1.5.3          RSQLite_2.2.1          genefilter_1.72.0     
    -##  [73] foreach_1.5.0          checkmate_2.0.0        BiocParallel_1.24.1   
    -##  [76] shape_1.4.5            rlang_0.4.8            pkgconfig_2.0.3       
    -##  [79] bitops_1.0-6           evaluate_0.14          lattice_0.20-41       
    -##  [82] purrr_0.3.4            htmlwidgets_1.5.2      labeling_0.3          
    -##  [85] bit_4.0.4              tidyselect_1.1.0       R6_2.4.1              
    -##  [88] magick_2.4.0           generics_0.0.2         Hmisc_4.4-1           
    -##  [91] DelayedArray_0.16.0    DBI_1.1.0              pillar_1.4.6          
    -##  [94] foreign_0.8-80         withr_2.3.0            survival_3.1-12       
    -##  [97] RCurl_1.98-1.2         nnet_7.3-14            tibble_3.0.4          
    -## [100] crayon_1.3.4           rmarkdown_2.4          GetoptLong_1.0.3      
    -## [103] jpeg_0.1-8.1           locfit_1.5-9.4         grid_4.0.2            
    -## [106] data.table_1.13.0      blob_1.2.1             digest_0.6.25         
    -## [109] xtable_1.8-4           R.cache_0.14.0         R.utils_2.10.1        
    -## [112] munsell_0.5.0
    +## [1] /usr/local/lib/R/site-library +## [2] /usr/local/lib/R/library

    References

    From 4d142df1a0a718f8dcb5b7f7fb635dc8defbdf59 Mon Sep 17 00:00:00 2001 From: Candace Savonen Date: Tue, 24 Nov 2020 16:40:51 -0500 Subject: [PATCH 06/12] Also switch the most sig module! --- .../network-analysis_rnaseq_01_wgcna.Rmd | 35 ++++++++++--------- 1 file changed, 18 insertions(+), 17 deletions(-) diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd index 5992d7a2..76bc5f6d 100644 --- a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd +++ b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd @@ -516,17 +516,17 @@ They are sorted with the most significant results at the top. head(stats_df) ``` -Module 18 seems to be the most differentially expressed across `time_point` groups. +Module 19 seems to be the most differentially expressed across `time_point` groups. Now we can do some investigation into this module. -## Let's make plot of module 18 +## Let's make plot of module 19 As a sanity check, let's use `ggplot` to see what module 18's eigengene looks like between treatment groups. First we need to set up the module eigengene for this module with the sample metadata labels we need. ```{r} -module_18_df <- module_eigengenes %>% +module_19_df <- module_eigengenes %>% tibble::rownames_to_column("accession_code") %>% # Here we are performing an inner join with a subset of metadata dplyr::inner_join(metadata %>% @@ -539,10 +539,10 @@ Now we are ready for plotting. ```{r} ggplot( - module_18_df, + module_19_df, aes( x = time_point, - y = ME18, + y = ME19, color = time_point ) ) + @@ -554,9 +554,9 @@ ggplot( ``` This makes sense! -Looks like module 18 has elevated expression during the acute illness but not when recovering. +Looks like module 19 has elevated expression during the acute illness but not when recovering. -## What genes are a part of module 18? +## What genes are a part of module 19? If you want to know which of your genes make up a modules, you can look at the `$colors` slot. This is a named list which associates the genes with the module they are a part of. @@ -568,11 +568,11 @@ gene_module_key <- tibble::enframe(bwnet$colors, name = "gene", value = "module" dplyr::mutate(module = paste0("ME", module)) ``` -Now we can find what genes are a part of module 18. +Now we can find what genes are a part of module 19. ```{r} gene_module_key %>% - dplyr::filter(module == "ME18") + dplyr::filter(module == "ME19") ``` Let's save this gene to module key to a TSV file for future use. @@ -597,7 +597,7 @@ make_module_heatmap <- function(module_name, # Create a summary heatmap of a given module. # # Args: - # module_name: a character indicating what module should be plotted, e.g. "ME18" + # module_name: a character indicating what module should be plotted, e.g. "ME19" # expression_mat: The full gene expression matrix. Default is `normalized_counts`. # metadata_df: a data frame with refinebio_accession_code and time_point # as columns. Default is `metadata`. @@ -686,23 +686,23 @@ make_module_heatmap <- function(module_name, ## Make module heatmaps -Let's try out the custom heatmap function with module 18 (our most differentially expressed module). +Let's try out the custom heatmap function with module 19 (our most differentially expressed module). ```{r} -mod_18_heatmap <- make_module_heatmap(module_name = "ME18") +mod_19_heatmap <- make_module_heatmap(module_name = "ME19") # Print out the plot -mod_18_heatmap +mod_19_heatmap ``` -From the barplot portion of our plot, we can see `acute illness` samples tend to have higher expression values for the module 18 eigengene. -In the heatmap portion, we can see how the individual genes that make up module 18 are overall higher than in the `recovering` samples. +From the barplot portion of our plot, we can see `acute illness` samples tend to have higher expression values for the module 19 eigengene. +In the heatmap portion, we can see how the individual genes that make up module 19 are overall higher than in the `recovering` samples. We can save this plot to PDF. ```{r} -pdf(file.path("results", "SRP140558_module_18_heatmap.pdf")) -mod_18_heatmap +pdf(file.path("results", "SRP140558_module_19_heatmap.pdf")) +mod_19_heatmap dev.off() ``` @@ -716,6 +716,7 @@ mod_25_heatmap ``` In this non-significant module's heatmap, there's not a particularly strong pattern between acute illness and recovery samples. +Though we can still see the genes in this module seem to be very correlated with each other (which is how we found them in the first place, so this makes sense!). Save this plot also. From 8f674da29baa51391414c3fe689c4f9f16057f09 Mon Sep 17 00:00:00 2001 From: Candace Savonen Date: Wed, 25 Nov 2020 09:05:04 -0500 Subject: [PATCH 07/12] Two comments from jashapiro review --- 04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd index 76bc5f6d..cd8f98d6 100644 --- a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd +++ b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd @@ -13,7 +13,7 @@ output: In this example, we use weighted gene co-expression network analysis (WGCNA) to identify co-expressed gene modules [@Langfelder2008]. WGCNA uses a series of correlations to identify sets of genes that are expressed together in your data set. -This is a fairly intuitive approach to gene network analysis which can aid in interpretation of microarray & rna-seq data. +This is a fairly intuitive approach to gene network analysis which can aid in interpretation of microarray & RNA-seq data. As output, WGCNA gives groups of co-expressed genes as well as an eigengene x sample matrix (where the values for each eigengene represent the summarized expression for a group of co-expressed genes) [@Langfelder2007]. This eigengene x sample data can, in many instances, be used as you would the original gene expression values. @@ -546,10 +546,10 @@ ggplot( color = time_point ) ) + - ggforce::geom_sina(maxwidth = 0.3) + # Putting on a boxplot, but making it see through and so we don't put outliers # with boxplot - geom_boxplot(width = 0.2, alpha = 0, outlier.shape = NA) + + geom_boxplot(width = 0.2, outlier.shape = NA) + + ggforce::geom_sina(maxwidth = 0.3) + theme_classic() ``` From b89cb26cd5fd9db970a70be8112946622e0a9c4a Mon Sep 17 00:00:00 2001 From: Candace Savonen Date: Wed, 25 Nov 2020 09:14:13 -0500 Subject: [PATCH 08/12] Put the comments too --- 04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd index cd8f98d6..64a9e4e1 100644 --- a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd +++ b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd @@ -546,9 +546,9 @@ ggplot( color = time_point ) ) + - # Putting on a boxplot, but making it see through and so we don't put outliers - # with boxplot + # a boxplot with outlier points hidden (they will be in the sina plot) geom_boxplot(width = 0.2, outlier.shape = NA) + + # A sina plot to show all of the individual data points ggforce::geom_sina(maxwidth = 0.3) + theme_classic() ``` From 54575f393d374bd4d34e4ea5d19295f56320bd45 Mon Sep 17 00:00:00 2001 From: GitHub Actions Date: Wed, 25 Nov 2020 14:17:00 +0000 Subject: [PATCH 09/12] Style Rmds --- 04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd index 64a9e4e1..1ca240e9 100644 --- a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd +++ b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd @@ -546,7 +546,7 @@ ggplot( color = time_point ) ) + - # a boxplot with outlier points hidden (they will be in the sina plot) + # a boxplot with outlier points hidden (they will be in the sina plot) geom_boxplot(width = 0.2, outlier.shape = NA) + # A sina plot to show all of the individual data points ggforce::geom_sina(maxwidth = 0.3) + From 682699d6a1f89e160fd89f8b72af168e6248ac18 Mon Sep 17 00:00:00 2001 From: Candace Savonen Date: Wed, 25 Nov 2020 11:14:21 -0500 Subject: [PATCH 10/12] Use all_of() to get rid warning --- 04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd index 1ca240e9..d29045e5 100644 --- a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd +++ b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd @@ -610,8 +610,8 @@ make_module_heatmap <- function(module_name, # Set up the module eigengene with its refinebio_accession_code module_eigengene <- module_eigengenes_df %>% - dplyr::select(module_name) %>% - tibble::rownames_to_column("refinebio_accession_code") + dplyr::select(all_of(module_name)) %>% + tibble::rownames_to_column("refinebio_accession_code") # Set up column annotation from metadata col_annot_df <- metadata_df %>% From 11cccfb375a52e706e306528ee6a5d6ea02f80cb Mon Sep 17 00:00:00 2001 From: GitHub Actions Date: Wed, 25 Nov 2020 16:16:46 +0000 Subject: [PATCH 11/12] Style Rmds --- 04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd index d29045e5..fda3c87f 100644 --- a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd +++ b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd @@ -610,8 +610,8 @@ make_module_heatmap <- function(module_name, # Set up the module eigengene with its refinebio_accession_code module_eigengene <- module_eigengenes_df %>% - dplyr::select(all_of(module_name)) %>% - tibble::rownames_to_column("refinebio_accession_code") + dplyr::select(all_of(module_name)) %>% + tibble::rownames_to_column("refinebio_accession_code") # Set up column annotation from metadata col_annot_df <- metadata_df %>% From 01be8729639e3d3aeb9e78f8bdfb0b6f4310e0d5 Mon Sep 17 00:00:00 2001 From: Candace Savonen Date: Wed, 25 Nov 2020 12:04:33 -0500 Subject: [PATCH 12/12] Re-render --- .../network-analysis_rnaseq_01_wgcna.html | 62 +++++++++---------- 1 file changed, 31 insertions(+), 31 deletions(-) diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html index 41230ad8..d0e555b4 100644 --- a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html +++ b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html @@ -3772,7 +3772,7 @@

    November 2020

    1 Purpose of this analysis

    -

    In this example, we use weighted gene co-expression network analysis (WGCNA) to identify co-expressed gene modules (Langfelder and Horvath 2008). WGCNA uses a series of correlations to identify sets of genes that are expressed together in your data set. This is a fairly intuitive approach to gene network analysis which can aid in interpretation of microarray & rna-seq data.

    +

    In this example, we use weighted gene co-expression network analysis (WGCNA) to identify co-expressed gene modules (Langfelder and Horvath 2008). WGCNA uses a series of correlations to identify sets of genes that are expressed together in your data set. This is a fairly intuitive approach to gene network analysis which can aid in interpretation of microarray & RNA-seq data.

    As output, WGCNA gives groups of co-expressed genes as well as an eigengene x sample matrix (where the values for each eigengene represent the summarized expression for a group of co-expressed genes) (Langfelder and Horvath 2007). This eigengene x sample data can, in many instances, be used as you would the original gene expression values. In this example, we use eigengene x sample data to identify differentially expressed modules between our treatment and control group

    This method does require some computing power, but can still be run locally (on your own computer) for most refine.bio datasets. As with many clustering and network methods, there are some parameters that may need tweaking.

    ⬇️ Jump to the analysis code ⬇️

    @@ -4029,7 +4029,7 @@

    4.2 Import and set up data

    # Read in metadata TSV file
     metadata <- readr::read_tsv(metadata_file)
    ## 
    -## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────
    +## ── Column specification ──────────────────────────────────────────────────────────────
     ## cols(
     ##   .default = col_logical(),
     ##   refinebio_accession_code = col_character(),
    @@ -4046,7 +4046,7 @@ 

    4.2 Import and set up data

    # Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later tibble::column_to_rownames("Gene")
    ## 
    -## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────
    +## ── Column specification ──────────────────────────────────────────────────────────────
     ## cols(
     ##   .default = col_double(),
     ##   Gene = col_character()
    @@ -4235,13 +4235,13 @@ 

    4.10 Which modules have biggest d {"columns":[{"label":[""],"name":["_rn_"],"type":[""],"align":["left"]},{"label":["module"],"name":[1],"type":["chr"],"align":["left"]},{"label":["logFC"],"name":[2],"type":["dbl"],"align":["right"]},{"label":["AveExpr"],"name":[3],"type":["dbl"],"align":["right"]},{"label":["t"],"name":[4],"type":["dbl"],"align":["right"]},{"label":["P.Value"],"name":[5],"type":["dbl"],"align":["right"]},{"label":["adj.P.Val"],"name":[6],"type":["dbl"],"align":["right"]},{"label":["B"],"name":[7],"type":["dbl"],"align":["right"]}],"data":[{"1":"ME19","2":"-0.1503138","3":"3.907325e-17","4":"-4.772737","5":"1.929470e-06","6":"7.524934e-05","7":"4.6103808","_rn_":"1"},{"1":"ME9","2":"0.1340967","3":"-3.747143e-17","4":"4.257813","5":"2.145859e-05","6":"4.184426e-04","7":"2.3318227","_rn_":"2"},{"1":"ME38","2":"-0.1251719","3":"-2.436657e-17","4":"-3.974435","5":"7.268279e-05","6":"9.448762e-04","7":"1.1887138","_rn_":"3"},{"1":"ME14","2":"-0.1214709","3":"-2.076772e-17","4":"-3.856921","5":"1.179246e-04","6":"1.149765e-03","7":"0.7377623","_rn_":"4"},{"1":"ME32","2":"-0.1170631","3":"-7.479596e-17","4":"-3.716965","5":"2.063607e-04","6":"1.609613e-03","7":"0.2183497","_rn_":"5"},{"1":"ME23","2":"-0.1135737","3":"6.686642e-18","4":"-3.606173","5":"3.172480e-04","6":"2.062112e-03","7":"-0.1792164","_rn_":"6"}],"options":{"columns":{"min":{},"max":[10]},"rows":{"min":[10],"max":[10]},"pages":{}}}

    -

    Module 18 seems to be the most differentially expressed across time_point groups. Now we can do some investigation into this module.

    +

    Module 19 seems to be the most differentially expressed across time_point groups. Now we can do some investigation into this module.

    -
    -

    4.11 Let’s make plot of module 18

    +
    +

    4.11 Let’s make plot of module 19

    As a sanity check, let’s use ggplot to see what module 18’s eigengene looks like between treatment groups.

    First we need to set up the module eigengene for this module with the sample metadata labels we need.

    -
    module_18_df <- module_eigengenes %>%
    +
    module_19_df <- module_eigengenes %>%
       tibble::rownames_to_column("accession_code") %>%
       # Here we are performing an inner join with a subset of metadata
       dplyr::inner_join(metadata %>%
    @@ -4250,33 +4250,33 @@ 

    4.11 Let’s make plot of module )

    Now we are ready for plotting.

    ggplot(
    -  module_18_df,
    +  module_19_df,
       aes(
         x = time_point,
    -    y = ME18,
    +    y = ME19,
         color = time_point
       )
     ) +
    -  ggforce::geom_sina(maxwidth = 0.3) +
    -  # Putting on a boxplot, but making it see through and so we don't put outliers
    -  # with boxplot
    -  geom_boxplot(width = 0.2, alpha = 0, outlier.shape = NA) +
    +  # a boxplot with outlier points hidden (they will be in the sina plot)
    +  geom_boxplot(width = 0.2, outlier.shape = NA) +
    +  # A sina plot to show all of the individual data points
    +  ggforce::geom_sina(maxwidth = 0.3) +
       theme_classic()
    -

    -

    This makes sense! Looks like module 18 has elevated expression during the acute illness but not when recovering.

    +

    +

    This makes sense! Looks like module 19 has elevated expression during the acute illness but not when recovering.

    -
    -

    4.12 What genes are a part of module 18?

    +
    +

    4.12 What genes are a part of module 19?

    If you want to know which of your genes make up a modules, you can look at the $colors slot. This is a named list which associates the genes with the module they are a part of. We can turn this into a data frame for handy use.

    gene_module_key <- tibble::enframe(bwnet$colors, name = "gene", value = "module") %>%
       # Let's add the `ME` part so its more clear what these numbers are and it matches elsewhere
       dplyr::mutate(module = paste0("ME", module))
    -

    Now we can find what genes are a part of module 18.

    +

    Now we can find what genes are a part of module 19.

    gene_module_key %>%
    -  dplyr::filter(module == "ME18")
    + dplyr::filter(module == "ME19")

    Let’s save this gene to module key to a TSV file for future use.

    @@ -4295,7 +4295,7 @@

    4.13 Make a custom heatmap functi # Create a summary heatmap of a given module. # # Args: - # module_name: a character indicating what module should be plotted, e.g. "ME18" + # module_name: a character indicating what module should be plotted, e.g. "ME19" # expression_mat: The full gene expression matrix. Default is `normalized_counts`. # metadata_df: a data frame with refinebio_accession_code and time_point # as columns. Default is `metadata`. @@ -4308,7 +4308,7 @@

    4.13 Make a custom heatmap functi # Set up the module eigengene with its refinebio_accession_code module_eigengene <- module_eigengenes_df %>% - dplyr::select(module_name) %>% + dplyr::select(all_of(module_name)) %>% tibble::rownames_to_column("refinebio_accession_code") # Set up column annotation from metadata @@ -4383,19 +4383,19 @@

    4.13 Make a custom heatmap functi

    4.14 Make module heatmaps

    -

    Let’s try out the custom heatmap function with module 18 (our most differentially expressed module).

    -
    mod_18_heatmap <- make_module_heatmap(module_name = "ME18")
    +

    Let’s try out the custom heatmap function with module 19 (our most differentially expressed module).

    +
    mod_19_heatmap <- make_module_heatmap(module_name = "ME19")
    ## Note: Using an external vector in selections is ambiguous.
     ## ℹ Use `all_of(module_name)` instead of `module_name` to silence this message.
     ## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
     ## This message is displayed once per session.
    # Print out the plot
    -mod_18_heatmap
    -

    -

    From the barplot portion of our plot, we can see acute illness samples tend to have higher expression values for the module 18 eigengene. In the heatmap portion, we can see how the individual genes that make up module 18 are overall higher than in the recovering samples.

    +mod_19_heatmap
    +

    +

    From the barplot portion of our plot, we can see acute illness samples tend to have higher expression values for the module 19 eigengene. In the heatmap portion, we can see how the individual genes that make up module 19 are overall higher than in the recovering samples.

    We can save this plot to PDF.

    -
    pdf(file.path("results", "SRP140558_module_18_heatmap.pdf"))
    -mod_18_heatmap
    +
    pdf(file.path("results", "SRP140558_module_19_heatmap.pdf"))
    +mod_19_heatmap
     dev.off()
    ## png 
     ##   2
    @@ -4405,7 +4405,7 @@

    4.14 Make module heatmaps

    # Print out the plot mod_25_heatmap

    -

    In this non-significant module’s heatmap, there’s not a particularly strong pattern between acute illness and recovery samples.

    +

    In this non-significant module’s heatmap, there’s not a particularly strong pattern between acute illness and recovery samples. Though we can still see the genes in this module seem to be very correlated with each other (which is how we found them in the first place, so this makes sense!).

    Save this plot also.

    pdf(file.path("results", "SRP140558_module_25_heatmap.pdf"))
     mod_25_heatmap
    @@ -4438,7 +4438,7 @@ 

    6 Session info

    ## collate en_US.UTF-8 ## ctype en_US.UTF-8 ## tz Etc/UTC -## date 2020-11-24 +## date 2020-11-25 ## ## ─ Packages ─────────────────────────────────────────────────────────────────── ## package * version date lib source