From a42344c807015819b1fa0e60495ebe5359b73a24 Mon Sep 17 00:00:00 2001 From: Joshua Shapiro Date: Tue, 24 Nov 2020 10:43:46 -0500 Subject: [PATCH 1/7] Starting to tighten up advanced intro. --- .../network-analysis_rnaseq_01_wgcna.Rmd | 134 ++++-------------- 1 file changed, 28 insertions(+), 106 deletions(-) diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd index bee42c53..7a8b0232 100644 --- a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd +++ b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd @@ -26,100 +26,24 @@ As with many clustering and network methods, there are some parameters that may # How to run this example -For general information about our tutorials and the basic software packages you will need, please see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-this-tutorial-is-structured). -We recommend taking a look at our [Resources for Learning R](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#resources-for-learning-r) if you have not written code in R before. +For general information about our tutorials and the basic software packages you will need, please see our ['Getting Started' pages](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html). +Below are some brief instructions about the files and directory structure this example expects. +If you need more detailed instructions about how to obtain the data files from refine.bio, please consult one of the earlier examples, such as this one about [clustering and heatmaps](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/clustering_rnaseq_01_heatmap.html). -## Obtain the `.Rmd` file -To run this example yourself, [download the `.Rmd` for this analysis by clicking this link](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/differential_expression_rnaseq_01_rnaseq.Rmd). +## Obtain the `.Rmd` and data files -Clicking this link will most likely send this to your downloads folder on your computer. -Move this `.Rmd` file to where you would like this example and its files to be stored. +To run this example yourself, [download the `.Rmd` for this analysis by clicking this link](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/differential_expression_rnaseq_01_rnaseq.Rmd) and moving the `.Rmd` file to your preferred analysis folder. -You can open this `.Rmd` file in RStudio and follow the rest of these steps from there. (See our [section about getting started with R notebooks](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) if you are unfamiliar with `.Rmd` files.) +The data we are using is from the SRA project SRP133573 as processed by refine.bio. +To obtain this data, go to the [dataset's page on refine.bio](https://www.refine.bio/experiments/SRP133573/identification-of-transcription-factor-relationships-associated-with-androgen-deprivation-therapy-response-and-metastatic-progression-in-prostate-cancer) and click the "Download Now" button. +Follow the instructions, being sure to select "Skip quantile normalization for RNA-seq samples". -## Set up your analysis folders +Once you have downloaded and unzipped the refine.bio dataset, place the `SRP133573` folder in a `data` subdirectory of your analysis folder. -Good file organization is helpful for keeping your data analysis project on track! -We have set up some code that will automatically set up a folder structure for you. -Run this next chunk to set up your folders! +You will also want to create `plots` and `results` folders, so your analysis directory will now contain: -If you have trouble running this chunk, see our [introduction to using `.Rmd`s](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) for more resources and explanations. - -```{r} -# Create the data folder if it doesn't exist -if (!dir.exists("data")) { - dir.create("data") -} - -# Define the file path to the plots directory -plots_dir <- "plots" # Can replace with path to desired output plots directory - -# Create the plots folder if it doesn't exist -if (!dir.exists(plots_dir)) { - dir.create(plots_dir) -} - -# Define the file path to the results directory -results_dir <- "results" # Can replace with path to desired output results directory - -# Create the results folder if it doesn't exist -if (!dir.exists(results_dir)) { - dir.create(results_dir) -} -``` - -In the same place you put this `.Rmd` file, you should now have three new empty folders called `data`, `plots`, and `results`! - -## Obtain the dataset from refine.bio - -For general information about downloading data for these examples, see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-the-data). - -Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/SRP133573/identification-of-transcription-factor-relationships-associated-with-androgen-deprivation-therapy-response-and-metastatic-progression-in-prostate-cancer). - -Click the "Download Now" button on the right side of this screen. - - - -Fill out the pop up window with your email and our Terms and Conditions: - - - -We are going to use non-quantile normalized data for this analysis. -To get this data, you will need to check the box that says "Skip quantile normalization for RNA-seq samples". -Note that this option will only be available for RNA-seq datasets. - - - -It may take a few minutes for the dataset to process. -You will get an email when it is ready. - -## About the dataset we are using for this example - -For this example analysis, we will use this [prostate cancer dataset](https://www.refine.bio/experiments/SRP133573). -The data that we downloaded from refine.bio for this analysis has 175 RNA-seq samples obtained from 20 patients with prostate cancer. -Patients underwent androgen deprivation therapy (ADT) and RNA-seq samples include pre-ADT biopsies and post-ADT prostatectomy specimens. - -## Place the dataset in your new `data/` folder - -refine.bio will send you a download button in the email when it is ready. -Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in `.zip`. -Double clicking should unzip this for you and create a folder of the same name. - - - -For more details on the contents of this folder see [these docs on refine.bio](http://docs.refine.bio/en/latest/main_text.html#downloadable-files). - -The `` folder has the data and metadata TSV files you will need for this example analysis. -Experiment accession ids usually look something like `GSE1235` or `SRP12345`. - -Copy and paste the `SRP133573` folder into your newly created `data/` folder. - -## Check out our file structure! - -Your new analysis folder should contain: - -- The example analysis `.Rmd` you downloaded +- The example analysis `.Rmd` - A folder called "data" which contains: - The `SRP133573` folder which contains: - The gene expression @@ -131,41 +55,39 @@ Your example analysis folder should now look something like this (except with re -In order for our example here to run without a hitch, we need these files to be in these locations so we've constructed a test to check before we get started with the analysis. -These chunks will declare your file paths and double check that your files are in the right place. +## Define file paths -First we will declare our file paths to our data and metadata files, which should be in our data directory. -This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started. +We will define variables for the files and directories we are using in the chunk below, assuming the folder structure we have defined. ```{r} -# Define the file path to the data directory -data_dir <- file.path("data", "SRP133573") # Replace with accession number which will be the name of the folder the files will be in +# Define the file path to the accession data directory +data_dir <- file.path("data", "SRP133573") -# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir` -data_file <- file.path(data_dir, "SRP133573.tsv") # Replace with file path to your dataset +# path to the refine.bio expression matrix +data_file <- file.path(data_dir, "SRP133573.tsv") +# file path to the refine.bio metadata file +metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv") -# Declare the file path to the metadata file using the data directory saved as `data_dir` -metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv") # Replace with file path to your metadata +# Define the file path to the plots directory +plots_dir <- "plots" + +# Define the file path to the results directory +results_dir <- "results" ``` -Now that our file paths are declared, we can use the `file.exists()` function to check that the files are where we specified above. +It is always worth checking that the paths we defined above are correct and the files are where we expect! ```{r} -# Check if the gene expression matrix file is at the file path stored in `data_file` +# Check the gene expression matrix file file.exists(data_file) -# Check if the metadata file is at the file path stored in `metadata_file` +# Check for the metadata file file.exists(metadata_file) ``` -If the chunk above printed out `FALSE` to either of those tests, you won't be able to run this analysis _as is_ until those files are in the appropriate place. - -If the concept of a "file path" is unfamiliar to you; we recommend taking a look at our [section about file paths](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#an-important-note-about-file-paths-and-Rmds). - -# Using a different refine.bio dataset with this analysis? +## Using a different refine.bio dataset with this analysis? If you'd like to adapt an example analysis to use a different dataset from [refine.bio](https://www.refine.bio/), we recommend placing the files in the `data/` directory you created and changing the filenames and paths in the notebook to match these files (we've put comments to signify where you would need to change the code). -We suggest saving plots and results to `plots/` and `results/` directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences. ### Sample size From 4e232f6d12bba54dbf54c03cda5ce6dfb79d5176 Mon Sep 17 00:00:00 2001 From: Joshua Shapiro Date: Tue, 24 Nov 2020 12:14:04 -0500 Subject: [PATCH 2/7] Small wording updates and render --- .../network-analysis_rnaseq_01_wgcna.Rmd | 29 +- .../network-analysis_rnaseq_01_wgcna.html | 888 +++++++++--------- 2 files changed, 473 insertions(+), 444 deletions(-) diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd index 7a8b0232..8d898d40 100644 --- a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd +++ b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd @@ -26,12 +26,12 @@ As with many clustering and network methods, there are some parameters that may # How to run this example -For general information about our tutorials and the basic software packages you will need, please see our ['Getting Started' pages](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html). +For general information about our tutorials and the basic software packages required, please see our ['Getting Started' pages](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html). Below are some brief instructions about the files and directory structure this example expects. If you need more detailed instructions about how to obtain the data files from refine.bio, please consult one of the earlier examples, such as this one about [clustering and heatmaps](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/clustering_rnaseq_01_heatmap.html). -## Obtain the `.Rmd` and data files +## Directory structure and required files To run this example yourself, [download the `.Rmd` for this analysis by clicking this link](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/differential_expression_rnaseq_01_rnaseq.Rmd) and moving the `.Rmd` file to your preferred analysis folder. @@ -41,17 +41,18 @@ Follow the instructions, being sure to select "Skip quantile normalization for R Once you have downloaded and unzipped the refine.bio dataset, place the `SRP133573` folder in a `data` subdirectory of your analysis folder. -You will also want to create `plots` and `results` folders, so your analysis directory will now contain: +You will also want to create `plots` and `results` folders for future use. +At this point, your analysis directory should contain: - The example analysis `.Rmd` -- A folder called "data" which contains: +- A folder called `data` which contains: - The `SRP133573` folder which contains: - The gene expression - The metadata TSV -- A folder for `plots` (currently empty) -- A folder for `results` (currently empty) +- A `plots` folder (currently empty) +- A `results` folder (currently empty) -Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using): +Your analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using): @@ -61,18 +62,18 @@ We will define variables for the files and directories we are using in the chunk ```{r} # Define the file path to the accession data directory -data_dir <- file.path("data", "SRP133573") +data_dir <- file.path("data", "SRP133573") # path to the refine.bio expression matrix -data_file <- file.path(data_dir, "SRP133573.tsv") -# file path to the refine.bio metadata file +data_file <- file.path(data_dir, "SRP133573.tsv") +# file path to the refine.bio metadata file metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv") # Define the file path to the plots directory plots_dir <- "plots" # Define the file path to the results directory -results_dir <- "results" +results_dir <- "results" ``` It is always worth checking that the paths we defined above are correct and the files are where we expect! @@ -92,12 +93,12 @@ From here you can customize this analysis example to fit your own scientific que ### Sample size -Keep in mind when using a different refine.bio dataset with this example, that WGCNA requires at least 15 samples to produce a meaningful result [according to its authors](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/faq.html). +Keep in mind when using a different refine.bio dataset with this example that WGCNA requires at least 15 samples to produce a meaningful result [according to its authors](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/faq.html). So you will need to make sure the dataset you use is sufficiently large. However, note that very large datasets will be difficult to run locally (on a personal laptop) due to the required computing power. While you can adjust some parameters to make this more doable on a laptop, it may decrease the reliability of your result if taken to an extreme (more on this parameter, called `maxBlockSize`, in the [`Run WGCNA!` section](#run-wgcna)). -### Microarray vs RNA-seq +### Microarray vs. RNA-seq WGCNA can be used with both RNA-seq and microarray datasets so long as they are well normalized and filtered. In this example we use RNA-seq and [normalize and transform the data with DESeq2's `vst()`](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#deseq2-transformation-methods), which not only is a method and package we recommend in general, but is also the [authors' specific recommendations for using WGCNA with RNA-seq data](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/faq.html#:~:text=Can%20WGCNA%20be%20used%20to,Yes.&text=Whether%20one%20uses%20RPKM%2C%20FPKM,were%20processed%20the%20same%20way.). @@ -647,7 +648,7 @@ This helps make your code more reproducible by recording what versions of softwa ```{r} # Print session info -sessionInfo() +sessioninfo::session_info() ``` # References diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html index b294fb8d..1311bc5f 100644 --- a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html +++ b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html @@ -3779,70 +3779,17 @@

1 Purpose of this analysis

2 How to run this example

-

For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.

-
-

2.1 Obtain the .Rmd file

-

To run this example yourself, download the .Rmd for this analysis by clicking this link.

-

Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd file to where you would like this example and its files to be stored.

-

You can open this .Rmd file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd files.)

-
-
-

2.2 Set up your analysis folders

-

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

-

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

-
# Create the data folder if it doesn't exist
-if (!dir.exists("data")) {
-  dir.create("data")
-}
-
-# Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
-
-# Create the plots folder if it doesn't exist
-if (!dir.exists(plots_dir)) {
-  dir.create(plots_dir)
-}
-
-# Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
-
-# Create the results folder if it doesn't exist
-if (!dir.exists(results_dir)) {
-  dir.create(results_dir)
-}
-

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

-
-
-

2.3 Obtain the dataset from refine.bio

-

For general information about downloading data for these examples, see our ‘Getting Started’ section.

-

Go to this dataset’s page on refine.bio.

-

Click the “Download Now” button on the right side of this screen.

-

-

Fill out the pop up window with your email and our Terms and Conditions:

-

-

We are going to use non-quantile normalized data for this analysis. To get this data, you will need to check the box that says “Skip quantile normalization for RNA-seq samples”. Note that this option will only be available for RNA-seq datasets.

-

-

It may take a few minutes for the dataset to process. You will get an email when it is ready.

-
-
-

2.4 About the dataset we are using for this example

-

For this example analysis, we will use this prostate cancer dataset. The data that we downloaded from refine.bio for this analysis has 175 RNA-seq samples obtained from 20 patients with prostate cancer. Patients underwent androgen deprivation therapy (ADT) and RNA-seq samples include pre-ADT biopsies and post-ADT prostatectomy specimens.

-
-
-

2.5 Place the dataset in your new data/ folder

-

refine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip. Double clicking should unzip this for you and create a folder of the same name.

-

-

For more details on the contents of this folder see these docs on refine.bio.

-

The <experiment_accession_id> folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235 or SRP12345.

-

Copy and paste the SRP133573 folder into your newly created data/ folder.

-
-
-

2.6 Check out our file structure!

-

Your new analysis folder should contain:

+

For general information about our tutorials and the basic software packages required, please see our ‘Getting Started’ pages. Below are some brief instructions about the files and directory structure this example expects. If you need more detailed instructions about how to obtain the data files from refine.bio, please consult one of the earlier examples, such as this one about clustering and heatmaps.

+
+

2.1 Directory structure and required files

+

To run this example yourself, download the .Rmd for this analysis by clicking this link and moving the .Rmd file to your preferred analysis folder.

+

The data we are using is from the SRA project SRP133573 as processed by refine.bio. To obtain this data, go to the dataset’s page on refine.bio and click the “Download Now” button. Follow the instructions, being sure to select “Skip quantile normalization for RNA-seq samples”.

+

Once you have downloaded and unzipped the refine.bio dataset, place the SRP133573 folder in a data subdirectory of your analysis folder.

+

You will also want to create plots and results folders for future use. At this point, your analysis directory should contain:

    -
  • The example analysis .Rmd you downloaded
    +
  • The example analysis .Rmd
  • -
  • A folder called “data” which contains: +
  • A folder called data which contains:
    • The SRP133573 folder which contains:
        @@ -3852,41 +3799,45 @@

        2.6 Check out our file structure!

  • -
  • A folder for plots (currently empty)
  • -
  • A folder for results (currently empty)
  • +
  • A plots folder (currently empty)
  • +
  • A results folder (currently empty)
-

Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):

+

Your analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):

-

In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

-

First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.

-
# Define the file path to the data directory
-data_dir <- file.path("data", "SRP133573") # Replace with accession number which will be the name of the folder the files will be in
-
-# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
-data_file <- file.path(data_dir, "SRP133573.tsv") # Replace with file path to your dataset
-
-# Declare the file path to the metadata file using the data directory saved as `data_dir`
-metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv") # Replace with file path to your metadata
-

Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

-
# Check if the gene expression matrix file is at the file path stored in `data_file`
-file.exists(data_file)
+
+
+

2.2 Define file paths

+

We will define variables for the files and directories we are using in the chunk below, assuming the folder structure we have defined.

+
# Define the file path to the accession data directory
+data_dir <- file.path("data", "SRP133573")
+
+# path to the refine.bio expression matrix
+data_file <- file.path(data_dir, "SRP133573.tsv")
+# file path to the refine.bio metadata file
+metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv")
+
+# Define the file path to the plots directory
+plots_dir <- "plots"
+
+# Define the file path to the results directory
+results_dir <- "results"
+

It is always worth checking that the paths we defined above are correct and the files are where we expect!

+
# Check the gene expression matrix file
+file.exists(data_file)
## [1] TRUE
-
# Check if the metadata file is at the file path stored in `metadata_file`
-file.exists(metadata_file)
+
# Check for the metadata file
+file.exists(metadata_file)
## [1] TRUE
-

If the chunk above printed out FALSE to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.

-

If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.

-
-
-

3 Using a different refine.bio dataset with this analysis?

-

If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/ directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.

+
+

2.3 Using a different refine.bio dataset with this analysis?

+

If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/ directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). From here you can customize this analysis example to fit your own scientific questions and preferences.

-

3.0.1 Sample size

-

Keep in mind when using a different refine.bio dataset with this example, that WGCNA requires at least 15 samples to produce a meaningful result according to its authors. So you will need to make sure the dataset you use is sufficiently large. However, note that very large datasets will be difficult to run locally (on a personal laptop) due to the required computing power. While you can adjust some parameters to make this more doable on a laptop, it may decrease the reliability of your result if taken to an extreme (more on this parameter, called maxBlockSize, in the Run WGCNA! section).

+

2.3.1 Sample size

+

Keep in mind when using a different refine.bio dataset with this example that WGCNA requires at least 15 samples to produce a meaningful result according to its authors. So you will need to make sure the dataset you use is sufficiently large. However, note that very large datasets will be difficult to run locally (on a personal laptop) due to the required computing power. While you can adjust some parameters to make this more doable on a laptop, it may decrease the reliability of your result if taken to an extreme (more on this parameter, called maxBlockSize, in the Run WGCNA! section).

-
-

3.0.2 Microarray vs RNA-seq

+
+

2.3.2 Microarray vs. RNA-seq

WGCNA can be used with both RNA-seq and microarray datasets so long as they are well normalized and filtered. In this example we use RNA-seq and normalize and transform the data with DESeq2’s vst(), which not only is a method and package we recommend in general, but is also the authors’ specific recommendations for using WGCNA with RNA-seq data.

If you end up wanting to run WGCNA with a microarray dataset, the normalization done by refine.bio should be sufficient, but you will likely want to apply a minimum expression filter as we do in this example. If you have troubles finding a power parameter that yields a sufficient R^2 even after applying a stringent cutoff, you may want to look into using a different dataset.


@@ -3894,41 +3845,42 @@

3.0.2 Microarray vs RNA-seq

 

+
-

4 Identifying co-expression gene modules with WGCNA - RNA-seq

+

3 Identifying co-expression gene modules with WGCNA - RNA-seq

-

4.1 Install libraries

+

3.1 Install libraries

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

We will be using DESeq2 to normalize and transform our RNA-seq data before running WGCNA, so we will need to install that (Love et al. 2014).

Of course, we will need the WGCNA package (Langfelder and Horvath 2008). But WGCNA also requires a package called impute that it sometimes has trouble installing so we recommend installing that first (Hastie et al. 2020).

For plotting purposes will be creating a sina plot and heatmaps which we will need a ggplot2 companion package for, called ggforce as well as the ComplexHeatmap package (Gu 2020).

-
if (!("DESeq2" %in% installed.packages())) {
-  # Install this package if it isn't installed yet
-  BiocManager::install("DESeq2", update = FALSE)
-}
-
-if (!("impute" %in% installed.packages())) {
-  # Install this package if it isn't installed yet
-  BiocManager::install("impute")
-}
-
-if (!("WGCNA" %in% installed.packages())) {
-  # Install this package if it isn't installed yet
-  install.packages("WGCNA")
-}
-
-if (!("ggforce" %in% installed.packages())) {
-  # Install this package if it isn't installed yet
-  install.packages("ggforce")
-}
-
-if (!("ComplexHeatmap" %in% installed.packages())) {
-  # Install this package if it isn't installed yet
-  install.packages("ComplexHeatmap")
-}
+
if (!("DESeq2" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("DESeq2", update = FALSE)
+}
+
+if (!("impute" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("impute")
+}
+
+if (!("WGCNA" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  install.packages("WGCNA")
+}
+
+if (!("ggforce" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  install.packages("ggforce")
+}
+
+if (!("ComplexHeatmap" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  install.packages("ComplexHeatmap")
+}

Attach some of the packages we need for this analysis.

-
# Attach the DESeq2 library
-library(DESeq2)
+
# Attach the DESeq2 library
+library(DESeq2)
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: BiocGenerics
@@ -3995,11 +3947,11 @@

4.1 Install libraries

## The following objects are masked from 'package:matrixStats':
 ## 
 ##     anyMissing, rowMedians
-
# We will need this so we can use the pipe: %>%
-library(magrittr)
-
-# We'll need this for finding gene modules
-library(WGCNA)
+
# We will need this so we can use the pipe: %>%
+library(magrittr)
+
+# We'll need this for finding gene modules
+library(WGCNA)
## Loading required package: dynamicTreeCut
## Loading required package: fastcluster
## 
@@ -4019,17 +3971,17 @@ 

4.1 Install libraries

## The following object is masked from 'package:stats':
 ## 
 ##     cor
-
# We'll be making some plots
-library(ggplot2)
+
# We'll be making some plots
+library(ggplot2)
-

4.2 Import and set up data

+

3.2 Import and set up data

Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read the both TSV files and add them as data frames to your environment.

We stored our file paths as objects named metadata_file and data_file in this previous step.

-
# Read in metadata TSV file
-metadata <- readr::read_tsv(metadata_file)
+
# Read in metadata TSV file
+metadata <- readr::read_tsv(metadata_file)
## 
-## ── Column specification ───────────────────────────────────────────────────────────────────────────────
+## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────
 ## cols(
 ##   .default = col_character(),
 ##   refinebio_age = col_logical(),
@@ -4044,81 +3996,83 @@ 

4.2 Import and set up data

## refinebio_time = col_logical() ## ) ## ℹ Use `spec()` for the full column specifications.
-
# Read in data TSV file
-df <- readr::read_tsv(data_file) %>%
-  # Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later
-  tibble::column_to_rownames("Gene")
+
# Read in data TSV file
+df <- readr::read_tsv(data_file) %>%
+  # Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later
+  tibble::column_to_rownames("Gene")
## 
-## ── Column specification ───────────────────────────────────────────────────────────────────────────────
+## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────
 ## cols(
 ##   .default = col_double(),
 ##   Gene = col_character()
 ## )
 ## ℹ Use `spec()` for the full column specifications.

Let’s ensure that the metadata and data are in the same sample order.

-
# Make the data in the order of the metadata
-df <- df %>%
-  dplyr::select(metadata$refinebio_accession_code)
-
-# Check if this is in the same order
-all.equal(colnames(df), metadata$refinebio_accession_code)
+
# Make the data in the order of the metadata
+df <- df %>%
+  dplyr::select(metadata$refinebio_accession_code)
+
+# Check if this is in the same order
+all.equal(colnames(df), metadata$refinebio_accession_code)
## [1] TRUE
-

4.2.1 Prepare data for DESeq2

+

3.2.1 Prepare data for DESeq2

There are two things we neeed to do to prep our expression data for DESeq2.

First, we need to make sure all of the values in our data are converted to integers as required by a DESeq2 function we will use later.

Then, we need to filter out the genes that have not been expressed or that have low expression counts. This is recommended by WGCNA docs for RNA-seq data. Removing low count genes can also help improve your WGCNA results. We are going to do some pre-filtering to keep only genes with 50 or more reads in total across the samples.

-
# The next DESeq2 functions need the values to be converted to integers
-df <- round(df) %>%
-  # The next steps require a data frame and round() returns a matrix
-  as.data.frame() %>%
-  # Only keep rows that have total counts above the cutoff
-  dplyr::filter(rowSums(.) >= 50)
+
# The next DESeq2 functions need the values to be converted to integers
+df <- round(df) %>%
+  # The next steps require a data frame and round() returns a matrix
+  as.data.frame() %>%
+  # Only keep rows that have total counts above the cutoff
+  dplyr::filter(rowSums(.) >= 50)

Another thing we need to do is make sure our main experimental group label is set up. In this case refinebio_treatment has two groups: pre-adt and post-adt. To keep these two treatments in logical (rather than alphabetical) order, we will convert this to a factor with pre-adt as the first level.

-
metadata <- metadata %>%
-  dplyr::mutate(refinebio_treatment = factor(refinebio_treatment,
-    levels = c("pre-adt", "post-adt")
-  ))
+
metadata <- metadata %>%
+  dplyr::mutate(refinebio_treatment = factor(refinebio_treatment,
+    levels = c("pre-adt", "post-adt")
+  ))

Let’s double check that our factor set up is right.

-
levels(metadata$refinebio_treatment)
+
levels(metadata$refinebio_treatment)
## [1] "pre-adt"  "post-adt"
-

4.3 Create a DESeqDataset

+

3.3 Create a DESeqDataset

We will be using the DESeq2 package for normalizing and transforming our data, which requires us to format our data into a DESeqDataSet object. We turn the data frame (or matrix) into a DESeqDataSet object and specify which variable labels our experimental groups using the design argument (Love et al. 2014). In this chunk of code, we will not provide a specific model to the design argument because we are not performing a differential expression analysis.

-
# Create a `DESeqDataSet` object
-dds <- DESeqDataSetFromMatrix(
-  countData = df, # Our prepped data frame with counts
-  colData = metadata, # Data frame with annotation for our samples
-  design = ~1 # Here we are not specifying a model
-)
+
# Create a `DESeqDataSet` object
+dds <- DESeqDataSetFromMatrix(
+  countData = df, # Our prepped data frame with counts
+  colData = metadata, # Data frame with annotation for our samples
+  design = ~1 # Here we are not specifying a model
+)
## converting counts to integer mode
-

4.4 Perform DESeq2 normalization and transformation

+

3.4 Perform DESeq2 normalization and transformation

We often suggest normalizing and transforming your data for various applications and in this instance WGCNA’s authors suggest using variance stabilizing transformation before running WGCNA.
We are going to use the vst() function from the DESeq2 package to normalize and transform the data. For more information about these transformation methods, see here.

-
# Normalize and transform the data in the `DESeqDataSet` object using the `vst()`
-# function from the `DESEq2` R package
-dds_norm <- vst(dds)
-

At this point, if your data has any outliers, you should look into removing them as they can affect your WGCNA results. WGCNA’s tutorial has an example of exploring your data for outliers you can reference.

+
# Normalize and transform the data in the `DESeqDataSet` object using the `vst()`
+# function from the `DESEq2` R package
+dds_norm <- vst(dds)
+

At this point, if your data set has any outlier samples, you should look into removing them as they can affect your WGCNA results.

+

WGCNA’s tutorial has an example of exploring your data for outliers you can reference.

+

For this example data set, we will skip this step (there are no obvious outliers) and proceed.

-

4.5 Format normalized data for WGCNA

+

3.5 Format normalized data for WGCNA

Extract the normalized counts to a matrix and transpose it so we can pass it to WGCNA.

-
# Retrieve the normalized data from the `DESeqDataSet`
-normalized_counts <- assay(dds_norm) %>%
-  t() # Transpose this data
+
# Retrieve the normalized data from the `DESeqDataSet`
+normalized_counts <- assay(dds_norm) %>%
+  t() # Transpose this data
-

4.6 Determine parameters for WGCNA

+

3.6 Determine parameters for WGCNA

To identify which genes are in the same modules, WGCNA first creates a weighted network to define which genes are near each other. The measure of “adjacency” it uses is based on the correlation matrix, but requires the definition of a threshold value, which in turn depends on a “power” parameter that defines the exponent used when transforming the correlation values. The choice of power parameter will affect the number of modules identified, and the WGCNA modules provides the pickSoftThreshold() function to help identify good choices for this parameter.

-
sft <- pickSoftThreshold(normalized_counts,
-  dataIsExpr = TRUE,
-  corFnc = cor,
-  networkType = "signed"
-)
+
sft <- pickSoftThreshold(normalized_counts,
+  dataIsExpr = TRUE,
+  corFnc = cor,
+  networkType = "signed"
+)
## Warning: executing %dopar% sequentially: no parallel backend registered
##    Power SFT.R.sq  slope truncated.R.sq mean.k. median.k. max.k.
 ## 1      1  0.58200 12.200          0.957 13500.0   13600.0  15500
@@ -4137,24 +4091,24 @@ 

4.6 Determine parameters for WGCN ## 14 18 0.83800 -2.360 0.985 28.6 18.9 227 ## 15 20 0.84500 -2.350 0.987 18.5 11.2 167

This sft object has a lot of information, we will want to plot some of it to figure out what our power soft-threshold should be. We have to first calculate a measure of the model fit, the signed \(R^2\), and make that a new variable.

-
sft_df <- data.frame(sft$fitIndices) %>%
-  dplyr::mutate(model_fit = -sign(slope) * SFT.R.sq)
+
sft_df <- data.frame(sft$fitIndices) %>%
+  dplyr::mutate(model_fit = -sign(slope) * SFT.R.sq)

Now, let’s plot the model fitting by the power soft threshold so we can decide on a soft-threshold for power.

-
ggplot(sft_df, aes(x = Power, y = model_fit, label = Power)) +
-  # Plot the points
-  geom_point() +
-  # We'll put the Power labels slightly above the data points
-  geom_text(nudge_y = 0.1) +
-  # We will plot what WGCNA recommends as an R^2 cutoff
-  geom_hline(yintercept = 0.80, col = "red") +
-  # Just in case our values are low, we want to make sure we can still see the 0.80 level
-  ylim(c(min(sft_df$model_fit), 1)) +
-  # We can add more sensible labels for our axis
-  xlab("Soft Threshold (power)") +
-  ylab("Scale Free Topology Model Fit, signed R^2") +
-  ggtitle("Scale independence") +
-  # This adds some nicer aesthetics to our plot
-  theme_classic()
+
ggplot(sft_df, aes(x = Power, y = model_fit, label = Power)) +
+  # Plot the points
+  geom_point() +
+  # We'll put the Power labels slightly above the data points
+  geom_text(nudge_y = 0.1) +
+  # We will plot what WGCNA recommends as an R^2 cutoff
+  geom_hline(yintercept = 0.80, col = "red") +
+  # Just in case our values are low, we want to make sure we can still see the 0.80 level
+  ylim(c(min(sft_df$model_fit), 1)) +
+  # We can add more sensible labels for our axis
+  xlab("Soft Threshold (power)") +
+  ylab("Scale Free Topology Model Fit, signed R^2") +
+  ggtitle("Scale independence") +
+  # This adds some nicer aesthetics to our plot
+  theme_classic()

Using this plot we can decide on a power parameter. WGCNA’s authors recommend using a power that has an signed \(R^2\) above 0.80, otherwise they warn your results may be too noisy to be meaningful.

If you have multiple power values with signed \(R^2\) above 0.80, then picking the one at an inflection point, in other words where the \(R^2\) values seem to have reached their saturation (Zhang and Horvath 2005). You want to a power that gives you a big enough \(R^2\) but is not excessively large.

@@ -4162,7 +4116,7 @@

4.6 Determine parameters for WGCN

If you find you have all very low \(R^2\) values this may be because there are too many genes with low expression values that are cluttering up the calculations. You can try returning to gene filtering step and choosing a more stringent cutoff (you’ll then need to re-run the transformation and subsequent steps to remake this plot to see if that helped).

-

4.7 Run WGCNA!

+

3.7 Run WGCNA!

We will use the blockwiseModules() function to find gene co-expression modules in WGCNA, using 16 for the power argument like we determined above.

This next step takes some time to run. The blockwise part of the blockwiseModules() function name refers to that these calculations will be done on chunks of your data at a time to help with conserving computing resources.

Here we are using the default maxBlockSize, 5000 but, you may want to adjust the maxBlockSize argument depending on your computer’s memory. The authors of WGCNA recommend running the largest block your computer can handle and they provide some approximations as to GB of memory of a laptop and what maxBlockSize it should be able to handle:

@@ -4170,32 +4124,32 @@

4.7 Run WGCNA!

• If the reader has access to a large workstation with more than 4 GB of memory, the parameter maxBlockSize can be increased. A 16GB workstation should handle up to 20000 probes; a 32GB workstation should handle perhaps 30000. A 4GB standard desktop or a laptop may handle up to 8000-10000 probes, depending on operating system and other running programs.

(Langfelder and Horvath 2016)

-
bwnet <- blockwiseModules(normalized_counts,
-  maxBlockSize = 5000, # What size chunks (how many genes) the calculations should be run in
-  TOMType = "signed", # topological overlap matrix
-  power = 16, # soft threshold for network construction
-  numericLabels = TRUE, # Let's use numbers instead of colors for module labels
-  randomSeed = 1234, # there's some randomness associated with this calculation
-  # so we should set a seed
-)
+
bwnet <- blockwiseModules(normalized_counts,
+  maxBlockSize = 5000, # What size chunks (how many genes) the calculations should be run in
+  TOMType = "signed", # topological overlap matrix
+  power = 16, # soft threshold for network construction
+  numericLabels = TRUE, # Let's use numbers instead of colors for module labels
+  randomSeed = 1234, # there's some randomness associated with this calculation
+  # so we should set a seed
+)

The TOMtype argument specifies what kind of topological overlap matrix (TOM) should be used to make gene modules. You can safely assume for most situations a signed network represents what you want – we want WGCNA to pay attention to directionality. However if you suspect you may benefit from an unsigned network, where positive/negative is ignored see this article to help you figure that out (Langfelder 2018).

There are a lot of other settings you can tweak – look at ?blockwiseModules help page as well as the WGCNA tutorial (Langfelder and Horvath 2016).

-

4.8 Write main WGCNA results object to file

+

3.8 Write main WGCNA results object to file

We will save our whole results object to an RDS file in case we want to return to our original WGCNA results.

-
readr::write_rds(bwnet,
-  file = file.path("results", "SRP133573_wgcna_results.RDS")
-)
+
readr::write_rds(bwnet,
+  file = file.path("results", "SRP133573_wgcna_results.RDS")
+)
-

4.9 Explore our WGCNA results

+

3.9 Explore our WGCNA results

The bwnet object has many parts, storing a lot of information. We can pull out the parts we are most interested in and may want to use use for plotting.

In bwnet we have a data frame of eigengene module data for each sample in the MEs slot. These represent the collapsed, combined, and normalized expression of the genes that make up each module.

-
module_eigengenes <- bwnet$MEs
-
-# Print out a preview
-head(module_eigengenes)
+
module_eigengenes <- bwnet$MEs
+
+# Print out a preview
+head(module_eigengenes)

Let’s save this gene to module key to a TSV file for future use.

-
readr::write_tsv(gene_module_key,
-  file = file.path("results", "SRP133573_wgcna_gene_to_module.tsv")
-)
+
readr::write_tsv(gene_module_key,
+  file = file.path("results", "SRP133573_wgcna_gene_to_module.tsv")
+)
-

4.13 Make a custom heatmap function

+

3.13 Make a custom heatmap function

We will make a heatmap that summarizes our differentially expressed module. Because we will make a couple of these, it makes sense to make a custom function for making this heatmap.

-
make_module_heatmap <- function(module_name,
-                                expression_mat = normalized_counts,
-                                metadata_df = metadata,
-                                gene_module_key_df = gene_module_key,
-                                module_eigengenes_df = module_eigengenes) {
-  # Create a summary heatmap of a given module.
-  #
-  # Args:
-  # module_name: a character indicating what module should be plotted, e.g. "ME52"
-  # expression_mat: The full gene expression matrix. Default is `normalized_counts`.
-  # metadata_df: a data frame with refinebio_accession_code and refinebio_treatment
-  #              as columns. Default is `metadata`.
-  # gene_module_key: a data.frame indicating what genes are a part of what modules. Default is `gene_module_key`.
-  # module_eigengenes: a sample x eigengene data.frame with samples as rownames. Default is `module_eigengenes`.
-  #
-  # Returns:
-  # A heatmap of expression matrix for a module's genes, with a barplot of the
-  # eigengene expression for that module.
-
-  # Set up the module eigengene with its refinebio_accession_code
-  module_eigengene <- module_eigengenes_df %>%
-    dplyr::select(module_name) %>%
-    tibble::rownames_to_column("refinebio_accession_code")
-
-  # Set up column annotation from metadata
-  col_annot_df <- metadata_df %>%
-    # Only select the treatment and sample ID columns
-    dplyr::select(refinebio_accession_code, refinebio_treatment) %>%
-    # Add on the eigengene expression by joining with sample IDs
-    dplyr::inner_join(module_eigengene, by = "refinebio_accession_code") %>%
-    # Arrange by treatment
-    dplyr::arrange(refinebio_treatment, refinebio_accession_code) %>%
-    # Store sample
-    tibble::column_to_rownames("refinebio_accession_code")
-
-  # Create the ComplexHeatmap column annotation object
-  col_annot <- ComplexHeatmap::HeatmapAnnotation(
-    # Supply treatment labels
-    refinebio_treatment = col_annot_df$refinebio_treatment,
-    # Add annotation barplot
-    module_eigengene = ComplexHeatmap::anno_barplot(dplyr::select(col_annot_df, module_name)),
-    # Pick colors for each experimental group in refinebio_treatment
-    col = list(refinebio_treatment = c("post-adt" = "#f1a340", "pre-adt" = "#998ec3"))
-  )
-
-  # Get a vector of the Ensembl gene IDs that correspond to this module
-  module_genes <- gene_module_key_df %>%
-    dplyr::filter(module == module_name) %>%
-    dplyr::pull(gene)
-
-  # Set up the gene expression data frame
-  mod_mat <- expression_mat %>%
-    t() %>%
-    as.data.frame() %>%
-    # Only keep genes from this module
-    dplyr::filter(rownames(.) %in% module_genes) %>%
-    # Order the samples to match col_annot_df
-    dplyr::select(rownames(col_annot_df)) %>%
-    # Data needs to be a matrix
-    as.matrix()
-
-  # Normalize the gene expression values
-  mod_mat <- mod_mat %>%
-    # Scale can work on matrices, but it does it by column so we will need to
-    # transpose first
-    t() %>%
-    scale() %>%
-    # And now we need to transpose back
-    t()
-
-  # Create a color function based on standardized scale
-  color_func <- circlize::colorRamp2(
-    c(-2, 0, 2),
-    c("#67a9cf", "#f7f7f7", "#ef8a62")
-  )
-
-  # Plot on a heatmap
-  heatmap <- ComplexHeatmap::Heatmap(mod_mat,
-    name = module_name,
-    # Supply color function
-    col = color_func,
-    # Supply column annotation
-    bottom_annotation = col_annot,
-    # We don't want to cluster samples
-    cluster_columns = FALSE,
-    # We don't need to show sample or gene labels
-    show_row_names = FALSE,
-    show_column_names = FALSE
-  )
-
-  # Return heatmap
-  return(heatmap)
-}
+
make_module_heatmap <- function(module_name,
+                                expression_mat = normalized_counts,
+                                metadata_df = metadata,
+                                gene_module_key_df = gene_module_key,
+                                module_eigengenes_df = module_eigengenes) {
+  # Create a summary heatmap of a given module.
+  #
+  # Args:
+  # module_name: a character indicating what module should be plotted, e.g. "ME52"
+  # expression_mat: The full gene expression matrix. Default is `normalized_counts`.
+  # metadata_df: a data frame with refinebio_accession_code and refinebio_treatment
+  #              as columns. Default is `metadata`.
+  # gene_module_key: a data.frame indicating what genes are a part of what modules. Default is `gene_module_key`.
+  # module_eigengenes: a sample x eigengene data.frame with samples as rownames. Default is `module_eigengenes`.
+  #
+  # Returns:
+  # A heatmap of expression matrix for a module's genes, with a barplot of the
+  # eigengene expression for that module.
+
+  # Set up the module eigengene with its refinebio_accession_code
+  module_eigengene <- module_eigengenes_df %>%
+    dplyr::select(module_name) %>%
+    tibble::rownames_to_column("refinebio_accession_code")
+
+  # Set up column annotation from metadata
+  col_annot_df <- metadata_df %>%
+    # Only select the treatment and sample ID columns
+    dplyr::select(refinebio_accession_code, refinebio_treatment) %>%
+    # Add on the eigengene expression by joining with sample IDs
+    dplyr::inner_join(module_eigengene, by = "refinebio_accession_code") %>%
+    # Arrange by treatment
+    dplyr::arrange(refinebio_treatment, refinebio_accession_code) %>%
+    # Store sample
+    tibble::column_to_rownames("refinebio_accession_code")
+
+  # Create the ComplexHeatmap column annotation object
+  col_annot <- ComplexHeatmap::HeatmapAnnotation(
+    # Supply treatment labels
+    refinebio_treatment = col_annot_df$refinebio_treatment,
+    # Add annotation barplot
+    module_eigengene = ComplexHeatmap::anno_barplot(dplyr::select(col_annot_df, module_name)),
+    # Pick colors for each experimental group in refinebio_treatment
+    col = list(refinebio_treatment = c("post-adt" = "#f1a340", "pre-adt" = "#998ec3"))
+  )
+
+  # Get a vector of the Ensembl gene IDs that correspond to this module
+  module_genes <- gene_module_key_df %>%
+    dplyr::filter(module == module_name) %>%
+    dplyr::pull(gene)
+
+  # Set up the gene expression data frame
+  mod_mat <- expression_mat %>%
+    t() %>%
+    as.data.frame() %>%
+    # Only keep genes from this module
+    dplyr::filter(rownames(.) %in% module_genes) %>%
+    # Order the samples to match col_annot_df
+    dplyr::select(rownames(col_annot_df)) %>%
+    # Data needs to be a matrix
+    as.matrix()
+
+  # Normalize the gene expression values
+  mod_mat <- mod_mat %>%
+    # Scale can work on matrices, but it does it by column so we will need to
+    # transpose first
+    t() %>%
+    scale() %>%
+    # And now we need to transpose back
+    t()
+
+  # Create a color function based on standardized scale
+  color_func <- circlize::colorRamp2(
+    c(-2, 0, 2),
+    c("#67a9cf", "#f7f7f7", "#ef8a62")
+  )
+
+  # Plot on a heatmap
+  heatmap <- ComplexHeatmap::Heatmap(mod_mat,
+    name = module_name,
+    # Supply color function
+    col = color_func,
+    # Supply column annotation
+    bottom_annotation = col_annot,
+    # We don't want to cluster samples
+    cluster_columns = FALSE,
+    # We don't need to show sample or gene labels
+    show_row_names = FALSE,
+    show_column_names = FALSE
+  )
+
+  # Return heatmap
+  return(heatmap)
+}
-

4.14 Make module heatmaps

+

3.14 Make module heatmaps

Let’s try out the custom heatmap function with module 52 (our most differentially expressed module).

-
mod_52_heatmap <- make_module_heatmap(module_name = "ME52")
+
mod_52_heatmap <- make_module_heatmap(module_name = "ME52")
## Note: Using an external vector in selections is ambiguous.
 ## ℹ Use `all_of(module_name)` instead of `module_name` to silence this message.
 ## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
 ## This message is displayed once per session.
-
# Print out the plot
-mod_52_heatmap
-

+
# Print out the plot
+mod_52_heatmap
+

From the barplot portion of our plot, we can see post-adt samples have higher values for this eigengene for module 52. In the heatmap portion, we can see how the individual genes that make up module 52 have more extreme values (very high or very low) in the post-adt samples.

We can save this plot to PDF.

-
pdf(file.path("results", "SRP133573_module_52_heatmap.pdf"))
-mod_52_heatmap
-dev.off()
+
pdf(file.path("results", "SRP133573_module_52_heatmap.pdf"))
+mod_52_heatmap
+dev.off()
## png 
 ##   2

For comparison, let’s try out the custom heatmap function with a different, not differentially expressed module.

-
mod_10_heatmap <- make_module_heatmap(module_name = "ME10")
-
-# Print out the plot
-mod_10_heatmap
-

+
mod_10_heatmap <- make_module_heatmap(module_name = "ME10")
+
+# Print out the plot
+mod_10_heatmap
+

In this non-significant module’s heatmap, there’s not a particularly strong pattern between pre and post ADT samples. In general the expression of genes in module 10 does not vary much between groups, staying near the overall mean. There are a few samples and some genes that show higher expression, but it is not surprising this does not results in a significant overall difference between the groups.

Save this plot also.

-
pdf(file.path("results", "SRP133573_module_10_heatmap.pdf"))
-mod_10_heatmap
-dev.off()
+
pdf(file.path("results", "SRP133573_module_10_heatmap.pdf"))
+mod_10_heatmap
+dev.off()
## png 
 ##   2
-

5 Resources for further learning

+

4 Resources for further learning

  • WGCNA FAQ page (Langfelder and Horvath 2016).
  • WGCNA tutorial (Langfelder and Horvath 2016).
  • @@ -4413,78 +4367,152 @@

    5 Resources for further learning<

-

6 Session info

+

5 Session info

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

-
# Print session info
-sessionInfo()
-
## R version 4.0.2 (2020-06-22)
-## Platform: x86_64-pc-linux-gnu (64-bit)
-## Running under: Ubuntu 20.04 LTS
-## 
-## Matrix products: default
-## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
-## 
-## locale:
-##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
-##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
-##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
-##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
-##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
-## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
-## 
-## attached base packages:
-## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
-## [8] methods   base     
+
# Print session info
+sessioninfo::session_info()
+
## ─ Session info ───────────────────────────────────────────────────────────────
+##  setting  value                       
+##  version  R version 4.0.2 (2020-06-22)
+##  os       Ubuntu 20.04 LTS            
+##  system   x86_64, linux-gnu           
+##  ui       X11                         
+##  language (EN)                        
+##  collate  en_US.UTF-8                 
+##  ctype    en_US.UTF-8                 
+##  tz       Etc/UTC                     
+##  date     2020-11-24                  
 ## 
-## other attached packages:
-##  [1] ggplot2_3.3.2               WGCNA_1.69                 
-##  [3] fastcluster_1.1.25          dynamicTreeCut_1.63-1      
-##  [5] magrittr_1.5                DESeq2_1.30.0              
-##  [7] SummarizedExperiment_1.20.0 Biobase_2.50.0             
-##  [9] MatrixGenerics_1.2.0        matrixStats_0.57.0         
-## [11] GenomicRanges_1.42.0        GenomeInfoDb_1.26.0        
-## [13] IRanges_2.24.0              S4Vectors_0.28.0           
-## [15] BiocGenerics_0.36.0         optparse_1.6.6             
+## ─ Packages ───────────────────────────────────────────────────────────────────
+##  package              * version  date       lib source        
+##  annotate               1.68.0   2020-10-27 [1] Bioconductor  
+##  AnnotationDbi          1.52.0   2020-10-27 [1] Bioconductor  
+##  assertthat             0.2.1    2019-03-21 [1] RSPM (R 4.0.0)
+##  backports              1.1.10   2020-09-15 [1] RSPM (R 4.0.2)
+##  base64enc              0.1-3    2015-07-28 [1] RSPM (R 4.0.0)
+##  Biobase              * 2.50.0   2020-10-27 [1] Bioconductor  
+##  BiocGenerics         * 0.36.0   2020-10-27 [1] Bioconductor  
+##  BiocParallel           1.24.1   2020-11-06 [1] Bioconductor  
+##  bit                    4.0.4    2020-08-04 [1] RSPM (R 4.0.2)
+##  bit64                  4.0.5    2020-08-30 [1] RSPM (R 4.0.2)
+##  bitops                 1.0-6    2013-08-17 [1] RSPM (R 4.0.0)
+##  blob                   1.2.1    2020-01-20 [1] RSPM (R 4.0.0)
+##  Cairo                  1.5-12.2 2020-07-07 [1] RSPM (R 4.0.2)
+##  checkmate              2.0.0    2020-02-06 [1] RSPM (R 4.0.0)
+##  circlize               0.4.10   2020-06-15 [1] RSPM (R 4.0.0)
+##  cli                    2.1.0    2020-10-12 [1] RSPM (R 4.0.2)
+##  clue                   0.3-57   2019-02-25 [1] RSPM (R 4.0.0)
+##  cluster                2.1.0    2019-06-19 [1] RSPM (R 4.0.0)
+##  codetools              0.2-16   2018-12-24 [2] CRAN (R 4.0.2)
+##  colorspace             1.4-1    2019-03-18 [1] RSPM (R 4.0.0)
+##  ComplexHeatmap         2.6.2    2020-11-12 [1] Bioconductor  
+##  crayon                 1.3.4    2017-09-16 [1] RSPM (R 4.0.0)
+##  data.table             1.13.0   2020-07-24 [1] RSPM (R 4.0.2)
+##  DBI                    1.1.0    2019-12-15 [1] RSPM (R 4.0.0)
+##  DelayedArray           0.16.0   2020-10-27 [1] Bioconductor  
+##  DESeq2               * 1.30.0   2020-10-27 [1] Bioconductor  
+##  digest                 0.6.25   2020-02-23 [1] RSPM (R 4.0.0)
+##  doParallel             1.0.15   2019-08-02 [1] RSPM (R 4.0.0)
+##  dplyr                  1.0.2    2020-08-18 [1] RSPM (R 4.0.2)
+##  dynamicTreeCut       * 1.63-1   2016-03-11 [1] RSPM (R 4.0.0)
+##  ellipsis               0.3.1    2020-05-15 [1] RSPM (R 4.0.0)
+##  evaluate               0.14     2019-05-28 [1] RSPM (R 4.0.0)
+##  fansi                  0.4.1    2020-01-08 [1] RSPM (R 4.0.0)
+##  farver                 2.0.3    2020-01-16 [1] RSPM (R 4.0.0)
+##  fastcluster          * 1.1.25   2018-06-07 [1] RSPM (R 4.0.0)
+##  foreach                1.5.0    2020-03-30 [1] RSPM (R 4.0.0)
+##  foreign                0.8-80   2020-05-24 [2] CRAN (R 4.0.2)
+##  Formula                1.2-3    2018-05-03 [1] RSPM (R 4.0.0)
+##  genefilter             1.72.0   2020-10-27 [1] Bioconductor  
+##  geneplotter            1.68.0   2020-10-27 [1] Bioconductor  
+##  generics               0.0.2    2018-11-29 [1] RSPM (R 4.0.0)
+##  GenomeInfoDb         * 1.26.0   2020-10-27 [1] Bioconductor  
+##  GenomeInfoDbData       1.2.4    2020-11-20 [1] Bioconductor  
+##  GenomicRanges        * 1.42.0   2020-10-27 [1] Bioconductor  
+##  getopt                 1.20.3   2019-03-22 [1] RSPM (R 4.0.0)
+##  GetoptLong             1.0.3    2020-10-01 [1] RSPM (R 4.0.2)
+##  ggforce                0.3.2    2020-06-23 [1] RSPM (R 4.0.2)
+##  ggplot2              * 3.3.2    2020-06-19 [1] RSPM (R 4.0.1)
+##  GlobalOptions          0.1.2    2020-06-10 [1] RSPM (R 4.0.0)
+##  glue                   1.4.2    2020-08-27 [1] RSPM (R 4.0.2)
+##  GO.db                  3.12.1   2020-11-20 [1] Bioconductor  
+##  gridExtra              2.3      2017-09-09 [1] RSPM (R 4.0.0)
+##  gtable                 0.3.0    2019-03-25 [1] RSPM (R 4.0.0)
+##  Hmisc                  4.4-1    2020-08-10 [1] RSPM (R 4.0.2)
+##  hms                    0.5.3    2020-01-08 [1] RSPM (R 4.0.0)
+##  htmlTable              2.1.0    2020-09-16 [1] RSPM (R 4.0.2)
+##  htmltools              0.5.0    2020-06-16 [1] RSPM (R 4.0.1)
+##  htmlwidgets            1.5.2    2020-10-03 [1] RSPM (R 4.0.2)
+##  httr                   1.4.2    2020-07-20 [1] RSPM (R 4.0.2)
+##  impute                 1.64.0   2020-10-27 [1] Bioconductor  
+##  IRanges              * 2.24.0   2020-10-27 [1] Bioconductor  
+##  iterators              1.0.12   2019-07-26 [1] RSPM (R 4.0.0)
+##  jpeg                   0.1-8.1  2019-10-24 [1] RSPM (R 4.0.0)
+##  jsonlite               1.7.1    2020-09-07 [1] RSPM (R 4.0.2)
+##  knitr                  1.30     2020-09-22 [1] RSPM (R 4.0.2)
+##  labeling               0.3      2014-08-23 [1] RSPM (R 4.0.0)
+##  lattice                0.20-41  2020-04-02 [2] CRAN (R 4.0.2)
+##  latticeExtra           0.6-29   2019-12-19 [1] RSPM (R 4.0.0)
+##  lifecycle              0.2.0    2020-03-06 [1] RSPM (R 4.0.0)
+##  limma                  3.46.0   2020-10-27 [1] Bioconductor  
+##  locfit                 1.5-9.4  2020-03-25 [1] RSPM (R 4.0.0)
+##  magick                 2.4.0    2020-06-23 [1] RSPM (R 4.0.2)
+##  magrittr             * 1.5      2014-11-22 [1] RSPM (R 4.0.0)
+##  MASS                   7.3-51.6 2020-04-26 [2] CRAN (R 4.0.2)
+##  Matrix                 1.2-18   2019-11-27 [2] CRAN (R 4.0.2)
+##  MatrixGenerics       * 1.2.0    2020-10-27 [1] Bioconductor  
+##  matrixStats          * 0.57.0   2020-09-25 [1] RSPM (R 4.0.2)
+##  memoise                1.1.0    2017-04-21 [1] RSPM (R 4.0.0)
+##  munsell                0.5.0    2018-06-12 [1] RSPM (R 4.0.0)
+##  nnet                   7.3-14   2020-04-26 [2] CRAN (R 4.0.2)
+##  optparse             * 1.6.6    2020-04-16 [1] RSPM (R 4.0.0)
+##  pillar                 1.4.6    2020-07-10 [1] RSPM (R 4.0.2)
+##  pkgconfig              2.0.3    2019-09-22 [1] RSPM (R 4.0.0)
+##  png                    0.1-7    2013-12-03 [1] RSPM (R 4.0.0)
+##  polyclip               1.10-0   2019-03-14 [1] RSPM (R 4.0.0)
+##  preprocessCore         1.52.0   2020-10-27 [1] Bioconductor  
+##  ps                     1.4.0    2020-10-07 [1] RSPM (R 4.0.2)
+##  purrr                  0.3.4    2020-04-17 [1] RSPM (R 4.0.0)
+##  R.cache                0.14.0   2019-12-06 [1] RSPM (R 4.0.0)
+##  R.methodsS3            1.8.1    2020-08-26 [1] RSPM (R 4.0.2)
+##  R.oo                   1.24.0   2020-08-26 [1] RSPM (R 4.0.2)
+##  R.utils                2.10.1   2020-08-26 [1] RSPM (R 4.0.2)
+##  R6                     2.4.1    2019-11-12 [1] RSPM (R 4.0.0)
+##  RColorBrewer           1.1-2    2014-12-07 [1] RSPM (R 4.0.0)
+##  Rcpp                   1.0.5    2020-07-06 [1] RSPM (R 4.0.2)
+##  RCurl                  1.98-1.2 2020-04-18 [1] RSPM (R 4.0.0)
+##  readr                  1.4.0    2020-10-05 [1] RSPM (R 4.0.2)
+##  rematch2               2.1.2    2020-05-01 [1] RSPM (R 4.0.0)
+##  rjson                  0.2.20   2018-06-08 [1] RSPM (R 4.0.0)
+##  rlang                  0.4.8    2020-10-08 [1] RSPM (R 4.0.2)
+##  rmarkdown              2.4      2020-09-30 [1] RSPM (R 4.0.2)
+##  rpart                  4.1-15   2019-04-12 [2] CRAN (R 4.0.2)
+##  RSQLite                2.2.1    2020-09-30 [1] RSPM (R 4.0.2)
+##  rstudioapi             0.11     2020-02-07 [1] RSPM (R 4.0.0)
+##  S4Vectors            * 0.28.0   2020-10-27 [1] Bioconductor  
+##  scales                 1.1.1    2020-05-11 [1] RSPM (R 4.0.0)
+##  sessioninfo            1.1.1    2018-11-05 [1] RSPM (R 4.0.0)
+##  shape                  1.4.5    2020-09-13 [1] RSPM (R 4.0.2)
+##  stringi                1.5.3    2020-09-09 [1] RSPM (R 4.0.2)
+##  stringr                1.4.0    2019-02-10 [1] RSPM (R 4.0.0)
+##  styler                 1.3.2    2020-02-23 [1] RSPM (R 4.0.0)
+##  SummarizedExperiment * 1.20.0   2020-10-27 [1] Bioconductor  
+##  survival               3.1-12   2020-04-10 [2] CRAN (R 4.0.2)
+##  tibble                 3.0.4    2020-10-12 [1] RSPM (R 4.0.2)
+##  tidyselect             1.1.0    2020-05-11 [1] RSPM (R 4.0.0)
+##  tweenr                 1.0.1    2018-12-14 [1] RSPM (R 4.0.2)
+##  vctrs                  0.3.4    2020-08-29 [1] RSPM (R 4.0.2)
+##  WGCNA                * 1.69     2020-02-28 [1] RSPM (R 4.0.2)
+##  withr                  2.3.0    2020-09-22 [1] RSPM (R 4.0.2)
+##  xfun                   0.18     2020-09-29 [1] RSPM (R 4.0.2)
+##  XML                    3.99-0.5 2020-07-23 [1] RSPM (R 4.0.2)
+##  xtable                 1.8-4    2019-04-21 [1] RSPM (R 4.0.0)
+##  XVector                0.30.0   2020-10-27 [1] Bioconductor  
+##  yaml                   2.2.1    2020-02-01 [1] RSPM (R 4.0.0)
+##  zlibbioc               1.36.0   2020-10-27 [1] Bioconductor  
 ## 
-## loaded via a namespace (and not attached):
-##   [1] colorspace_1.4-1       rjson_0.2.20           ellipsis_0.3.1        
-##   [4] circlize_0.4.10        htmlTable_2.1.0        XVector_0.30.0        
-##   [7] GlobalOptions_0.1.2    base64enc_0.1-3        clue_0.3-57           
-##  [10] rstudioapi_0.11        farver_2.0.3           getopt_1.20.3         
-##  [13] bit64_4.0.5            AnnotationDbi_1.52.0   fansi_0.4.1           
-##  [16] codetools_0.2-16       splines_4.0.2          R.methodsS3_1.8.1     
-##  [19] doParallel_1.0.15      impute_1.64.0          geneplotter_1.68.0    
-##  [22] knitr_1.30             polyclip_1.10-0        jsonlite_1.7.1        
-##  [25] Formula_1.2-3          Cairo_1.5-12.2         annotate_1.68.0       
-##  [28] cluster_2.1.0          GO.db_3.12.1           png_0.1-7             
-##  [31] R.oo_1.24.0            ggforce_0.3.2          readr_1.4.0           
-##  [34] compiler_4.0.2         httr_1.4.2             backports_1.1.10      
-##  [37] assertthat_0.2.1       Matrix_1.2-18          limma_3.46.0          
-##  [40] cli_2.1.0              tweenr_1.0.1           htmltools_0.5.0       
-##  [43] tools_4.0.2            gtable_0.3.0           glue_1.4.2            
-##  [46] GenomeInfoDbData_1.2.4 dplyr_1.0.2            Rcpp_1.0.5            
-##  [49] styler_1.3.2           vctrs_0.3.4            preprocessCore_1.52.0 
-##  [52] iterators_1.0.12       xfun_0.18              stringr_1.4.0         
-##  [55] ps_1.4.0               lifecycle_0.2.0        XML_3.99-0.5          
-##  [58] MASS_7.3-51.6          zlibbioc_1.36.0        scales_1.1.1          
-##  [61] hms_0.5.3              rematch2_2.1.2         RColorBrewer_1.1-2    
-##  [64] ComplexHeatmap_2.6.0   yaml_2.2.1             memoise_1.1.0         
-##  [67] gridExtra_2.3          rpart_4.1-15           latticeExtra_0.6-29   
-##  [70] stringi_1.5.3          RSQLite_2.2.1          genefilter_1.72.0     
-##  [73] foreach_1.5.0          checkmate_2.0.0        BiocParallel_1.24.1   
-##  [76] shape_1.4.5            rlang_0.4.8            pkgconfig_2.0.3       
-##  [79] bitops_1.0-6           evaluate_0.14          lattice_0.20-41       
-##  [82] purrr_0.3.4            htmlwidgets_1.5.2      labeling_0.3          
-##  [85] bit_4.0.4              tidyselect_1.1.0       R6_2.4.1              
-##  [88] magick_2.4.0           generics_0.0.2         Hmisc_4.4-1           
-##  [91] DelayedArray_0.16.0    DBI_1.1.0              pillar_1.4.6          
-##  [94] foreign_0.8-80         withr_2.3.0            survival_3.1-12       
-##  [97] RCurl_1.98-1.2         nnet_7.3-14            tibble_3.0.4          
-## [100] crayon_1.3.4           rmarkdown_2.4          GetoptLong_1.0.3      
-## [103] jpeg_0.1-8.1           locfit_1.5-9.4         grid_4.0.2            
-## [106] data.table_1.13.0      blob_1.2.1             digest_0.6.25         
-## [109] xtable_1.8-4           R.cache_0.14.0         R.utils_2.10.1        
-## [112] munsell_0.5.0
+## [1] /usr/local/lib/R/site-library +## [2] /usr/local/lib/R/library

References

From c6494ac0f2bab66f395d9525900b94dc57b7b8b3 Mon Sep 17 00:00:00 2001 From: Joshua Shapiro Date: Tue, 24 Nov 2020 13:43:25 -0500 Subject: [PATCH 3/7] Spelling and dictionaries --- 04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd | 4 ++-- 04-advanced-topics/network-analysis_rnaseq_01_wgcna.html | 6 +++--- components/dictionary.txt | 3 +++ 3 files changed, 8 insertions(+), 5 deletions(-) diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd index 8d898d40..ede756c5 100644 --- a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd +++ b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd @@ -13,7 +13,7 @@ output: In this example, we use weighted gene co-expression network analysis (WGCNA) to identify co-expressed gene modules [@Langfelder2008]. WGCNA uses a series of correlations to identify sets of genes that are expressed together in your data set. -This is a fairly intuitive approach to gene network analysis which can aid in interpretation of microarray & RNAseq data. +This is a fairly intuitive approach to gene network analysis which can aid in interpretation of microarray & RNA-seq data. As output, WGCNA gives groups of co-expressed genes as well as an eigengene x sample matrix (where the values for each eigengene represent the summarized expression for a group of co-expressed genes) [@Langfelder2007]. This eigengene x sample data can, in many instances, be used as you would the original gene expression values. @@ -196,7 +196,7 @@ all.equal(colnames(df), metadata$refinebio_accession_code) ### Prepare data for `DESeq2` -There are two things we neeed to do to prep our expression data for DESeq2. +There are two things we need to do to prep our expression data for DESeq2. First, we need to make sure all of the values in our data are converted to integers as required by a `DESeq2` function we will use later. diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html index 1311bc5f..7a66b4fd 100644 --- a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html +++ b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html @@ -3772,7 +3772,7 @@

November 2020

1 Purpose of this analysis

-

In this example, we use weighted gene co-expression network analysis (WGCNA) to identify co-expressed gene modules (Langfelder and Horvath 2008). WGCNA uses a series of correlations to identify sets of genes that are expressed together in your data set. This is a fairly intuitive approach to gene network analysis which can aid in interpretation of microarray & RNAseq data.

+

In this example, we use weighted gene co-expression network analysis (WGCNA) to identify co-expressed gene modules (Langfelder and Horvath 2008). WGCNA uses a series of correlations to identify sets of genes that are expressed together in your data set. This is a fairly intuitive approach to gene network analysis which can aid in interpretation of microarray & RNA-seq data.

As output, WGCNA gives groups of co-expressed genes as well as an eigengene x sample matrix (where the values for each eigengene represent the summarized expression for a group of co-expressed genes) (Langfelder and Horvath 2007). This eigengene x sample data can, in many instances, be used as you would the original gene expression values. In this example, we use eigengene x sample data to identify differentially expressed modules between our treatment and control group

This method does require some computing power, but can still be run locally (on your own computer) for most refine.bio datasets. As with many clustering and network methods, there are some parameters that may need tweaking.

⬇️ Jump to the analysis code ⬇️

@@ -4017,7 +4017,7 @@

3.2 Import and set up data

## [1] TRUE

3.2.1 Prepare data for DESeq2

-

There are two things we neeed to do to prep our expression data for DESeq2.

+

There are two things we need to do to prep our expression data for DESeq2.

First, we need to make sure all of the values in our data are converted to integers as required by a DESeq2 function we will use later.

Then, we need to filter out the genes that have not been expressed or that have low expression counts. This is recommended by WGCNA docs for RNA-seq data. Removing low count genes can also help improve your WGCNA results. We are going to do some pre-filtering to keep only genes with 50 or more reads in total across the samples.

# The next DESeq2 functions need the values to be converted to integers
@@ -4205,7 +4205,7 @@ 

3.11 Let’s make plot of module ) + ggforce::geom_sina() + theme_classic()

-

+

This makes sense! Looks like module 52 has elevated expression post treatment.

diff --git a/components/dictionary.txt b/components/dictionary.txt index 2290522d..1cb5098d 100644 --- a/components/dictionary.txt +++ b/components/dictionary.txt @@ -26,6 +26,7 @@ Cmd ColorBrewer Compara ComplexHeatmap +ComplexHeatmap's CREB Crouser cytosolic @@ -122,6 +123,8 @@ RMA RPKMs RStudio ssGSEA +SRA +SRP StatQuest Subramanian SuperSeries From 131e007428328fe15c18cd3970f8872e76ceaa9b Mon Sep 17 00:00:00 2001 From: Joshua Shapiro Date: Tue, 24 Nov 2020 14:48:29 -0500 Subject: [PATCH 4/7] alphabatize dict --- components/dictionary.txt | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/components/dictionary.txt b/components/dictionary.txt index 1cb5098d..c0e0686e 100644 --- a/components/dictionary.txt +++ b/components/dictionary.txt @@ -1,14 +1,15 @@ +`.Rmd` actin ADT al AML AnaLysis +arounds ASXL Azacytidine -arounds Baumer -Benjamini BeadArray +Benjamini bioconductor bioinformatics Brainarray @@ -45,20 +46,21 @@ displaystyle DocToc dorsoventral duplications +e.g. ECM edgeR edgeR's eigengene -e.g. ENSDARG Ensembl ENSG Entrez et FACS +FPKM frac functionalize -FPKM +GC GeneChips Genenames generalizable @@ -66,7 +68,6 @@ ggplot GitHub glioblastoma glioma -GC GSE GSEA GSVA @@ -90,8 +91,8 @@ maxBlockSize medulloblastoma microarray’s molecularly -musculus MSigDB +musculus myeloid nb NES @@ -109,28 +110,27 @@ PLX PNAS PNG polymerase +pre prostatectomy PRPS -pre -`.Rmd` -QuSAGE -qusage QN +qusage +QuSAGE README rerio ribosomes RMA RPKMs RStudio -ssGSEA SRA SRP +ssGSEA StatQuest -Subramanian -SuperSeries str +Subramanian subtype subtypes +SuperSeries tidyverse TPM TPMs From de2eca53989329e16057f2513cdc5555f3160c2b Mon Sep 17 00:00:00 2001 From: Joshua Shapiro Date: Wed, 25 Nov 2020 10:37:37 -0500 Subject: [PATCH 5/7] Change SRP in intro Other changes will come with merging in data section. --- .../network-analysis_rnaseq_01_wgcna.Rmd | 17 ++++++++++------- 1 file changed, 10 insertions(+), 7 deletions(-) diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd index ede756c5..1fe53d35 100644 --- a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd +++ b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd @@ -35,18 +35,21 @@ If you need more detailed instructions about how to obtain the data files from r To run this example yourself, [download the `.Rmd` for this analysis by clicking this link](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/differential_expression_rnaseq_01_rnaseq.Rmd) and moving the `.Rmd` file to your preferred analysis folder. -The data we are using is from the SRA project SRP133573 as processed by refine.bio. -To obtain this data, go to the [dataset's page on refine.bio](https://www.refine.bio/experiments/SRP133573/identification-of-transcription-factor-relationships-associated-with-androgen-deprivation-therapy-response-and-metastatic-progression-in-prostate-cancer) and click the "Download Now" button. +The data we are using is from the SRA project SRP140558 as processed by refine.bio. +To obtain this data, go to the [dataset's page on refine.bio](https://www.refine.bio/experiments/SRP140558/identification-of-transcription-factor-relationships-associated-with-androgen-deprivation-therapy-response-and-metastatic-progression-in-prostate-cancer) and click the "Download Now" button. Follow the instructions, being sure to select "Skip quantile normalization for RNA-seq samples". -Once you have downloaded and unzipped the refine.bio dataset, place the `SRP133573` folder in a `data` subdirectory of your analysis folder. + + + +Once you have downloaded and unzipped the refine.bio dataset, place the `SRP140558` folder in a `data` subdirectory of your analysis folder. You will also want to create `plots` and `results` folders for future use. At this point, your analysis directory should contain: - The example analysis `.Rmd` - A folder called `data` which contains: - - The `SRP133573` folder which contains: + - The `SRP140558` folder which contains: - The gene expression - The metadata TSV - A `plots` folder (currently empty) @@ -62,12 +65,12 @@ We will define variables for the files and directories we are using in the chunk ```{r} # Define the file path to the accession data directory -data_dir <- file.path("data", "SRP133573") +data_dir <- file.path("data", "SRP140558") # path to the refine.bio expression matrix -data_file <- file.path(data_dir, "SRP133573.tsv") +data_file <- file.path(data_dir, "SRP140558.tsv") # file path to the refine.bio metadata file -metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv") +metadata_file <- file.path(data_dir, "metadata_SRP140558.tsv") # Define the file path to the plots directory plots_dir <- "plots" From 8c5ceebcc59100d05c533071c71ed949e19a4719 Mon Sep 17 00:00:00 2001 From: Joshua Shapiro Date: Wed, 25 Nov 2020 12:46:05 -0500 Subject: [PATCH 6/7] Rewording and adding back directory creation. --- .../network-analysis_rnaseq_01_wgcna.Rmd | 24 ++- .../network-analysis_rnaseq_01_wgcna.html | 166 ++++++++++-------- 2 files changed, 107 insertions(+), 83 deletions(-) diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd index 7146ab8b..d2de7a14 100644 --- a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd +++ b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd @@ -41,27 +41,26 @@ Follow the instructions, being sure to select "Skip quantile normalization for R - Once you have downloaded and unzipped the refine.bio dataset, place the `SRP140558` folder in a `data` subdirectory of your analysis folder. -You will also want to create `plots` and `results` folders for future use. -At this point, your analysis directory should contain: +We will also create `plots` and `results` folders for future use. +The analysis folder will have the following content: -- The example analysis `.Rmd` +- The example analysis `.Rmd` notebook - A folder called `data` which contains: - The `SRP140558` folder which contains: - The gene expression - The metadata TSV -- A `plots` folder (currently empty) -- A `results` folder (currently empty) +- A `plots` folder +- A `results` folder -Your analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using): +Your analysis folder will end up looking something like this (except with respective experiment accession ID and analysis notebook name you are using): ## Define file paths -We will define variables for the files and directories we are using in the chunk below, assuming the folder structure we have defined. +We will define variables for the files and directories we are using in the chunk below. ```{r} # Define the file path to the accession data directory @@ -73,10 +72,17 @@ data_file <- file.path(data_dir, "SRP140558.tsv") metadata_file <- file.path(data_dir, "metadata_SRP140558.tsv") # Define the file path to the plots directory +# (create it if missing) plots_dir <- "plots" +if (!dir.exists(plots_dir)) { + dir.create(plots_dir) +} # Define the file path to the results directory results_dir <- "results" +if (!dir.exists(results_dir)) { + dir.create(results_dir) +} ``` It is always worth checking that the paths we defined above are correct and the files are where we expect! @@ -89,6 +95,7 @@ file.exists(data_file) file.exists(metadata_file) ``` + ## Using a different refine.bio dataset with this analysis? If you'd like to adapt an example analysis to use a different dataset from [refine.bio](https://www.refine.bio/), we recommend placing the files in the `data/` directory you created and changing the filenames and paths in the notebook to match these files (we've put comments to signify where you would need to change the code). @@ -109,7 +116,6 @@ In this example we use RNA-seq and [normalize and transform the data with DESeq2 If you end up wanting to run WGCNA with a microarray dataset, the normalization done by refine.bio _should_ be sufficient, but you will likely want to [apply a minimum expression filter](#define-a-minimum-counts-cutoff) as we do in this example. If you have troubles finding a `power` parameter that yields a sufficient R^2 even after applying a stringent cutoff, you may want to look into using a different dataset. -***   diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html index 4853c2ea..d97c41c4 100644 --- a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html +++ b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html @@ -3783,12 +3783,12 @@

2 How to run this example

2.1 Directory structure and required files

To run this example yourself, download the .Rmd for this analysis by clicking this link and moving the .Rmd file to your preferred analysis folder.

-

The data we are using is from the SRA project SRP133573 as processed by refine.bio. To obtain this data, go to the dataset’s page on refine.bio and click the “Download Now” button. Follow the instructions, being sure to select “Skip quantile normalization for RNA-seq samples”.

-

Once you have downloaded and unzipped the refine.bio dataset, place the SRP133573 folder in a data subdirectory of your analysis folder.

-

You will also want to create plots and results folders for future use. At this point, your analysis directory should contain:

+

The data we are using is from the SRA project SRP140558 as processed by refine.bio. To obtain this data, go to the dataset’s page on refine.bio and click the “Download Now” button. Follow the instructions, being sure to select “Skip quantile normalization for RNA-seq samples”.

+

+

Once you have downloaded and unzipped the refine.bio dataset, place the SRP140558 folder in a data subdirectory of your analysis folder.

+

We will also create plots and results folders for future use. The analysis folder will have the following content:

    -
  • The example analysis .Rmd
    -
  • +
  • The example analysis .Rmd notebook
  • A folder called data which contains:
    • The SRP140558 folder which contains: @@ -3799,28 +3799,35 @@

      2.1 Directory structure and requi

-
  • A plots folder (currently empty)
  • -
  • A results folder (currently empty)
  • +
  • A plots folder
  • +
  • A results folder
  • -

    Your analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):

    +

    Your analysis folder will end up looking something like this (except with respective experiment accession ID and analysis notebook name you are using):

    2.2 Define file paths

    -

    We will define variables for the files and directories we are using in the chunk below, assuming the folder structure we have defined.

    +

    We will define variables for the files and directories we are using in the chunk below.

    # Define the file path to the accession data directory
    -data_dir <- file.path("data", "SRP133573")
    +data_dir <- file.path("data", "SRP140558")
     
     # path to the refine.bio expression matrix
    -data_file <- file.path(data_dir, "SRP133573.tsv")
    +data_file <- file.path(data_dir, "SRP140558.tsv")
     # file path to the refine.bio metadata file
    -metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv")
    +metadata_file <- file.path(data_dir, "metadata_SRP140558.tsv")
     
     # Define the file path to the plots directory
    -plots_dir <- "plots"
    -
    -# Define the file path to the results directory
    -results_dir <- "results"
    +# (create it if missing) +plots_dir <- "plots" +if (!dir.exists(plots_dir)) { + dir.create(plots_dir) +} + +# Define the file path to the results directory +results_dir <- "results" +if (!dir.exists(results_dir)) { + dir.create(results_dir) +}

    It is always worth checking that the paths we defined above are correct and the files are where we expect!

    # Check the gene expression matrix file
     file.exists(data_file)
    @@ -3840,7 +3847,6 @@

    2.3.1 Sample size

    2.3.2 Microarray vs. RNA-seq

    WGCNA can be used with both RNA-seq and microarray datasets so long as they are well normalized and filtered. In this example we use RNA-seq and normalize and transform the data with DESeq2’s vst(), which not only is a method and package we recommend in general, but is also the authors’ specific recommendations for using WGCNA with RNA-seq data.

    If you end up wanting to run WGCNA with a microarray dataset, the normalization done by refine.bio should be sufficient, but you will likely want to apply a minimum expression filter as we do in this example. If you have troubles finding a power parameter that yields a sufficient R^2 even after applying a stringent cutoff, you may want to look into using a different dataset.

    -

     

    @@ -3981,7 +3987,7 @@

    3.2 Import and set up data

    # Read in metadata TSV file
     metadata <- readr::read_tsv(metadata_file)
    ## 
    -## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────
    +## ── Column specification ───────────────────────────────────────────────────────────────────────────────────
     ## cols(
     ##   .default = col_logical(),
     ##   refinebio_accession_code = col_character(),
    @@ -3998,7 +4004,7 @@ 

    3.2 Import and set up data

    # Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later tibble::column_to_rownames("Gene")
    ## 
    -## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────
    +## ── Column specification ───────────────────────────────────────────────────────────────────────────────────
     ## cols(
     ##   .default = col_double(),
     ##   Gene = col_character()
    @@ -4023,14 +4029,23 @@ 

    3.2.1 Prepare data for DESe as.data.frame() %>% # Only keep rows that have total counts above the cutoff dplyr::filter(rowSums(.) >= 50)

    -

    Another thing we need to do is make sure our main experimental group label is set up. In this case refinebio_treatment has two groups: pre-adt and post-adt. To keep these two treatments in logical (rather than alphabetical) order, we will convert this to a factor with pre-adt as the first level.

    +

    Another thing we need to do is set up our main experimental group variable. Unfortunately the metadata for this dataset are not set up into separate, neat columns, but we can accomplish that ourselves.

    +

    For this study, PBMCs were collected at two time points: during the patients’ first, acute bronchiolitis visit (abbreviated “AV”) and their recovery visit, (called post-convalescence and abbreviated “CV”).

    +

    For handier use of this information, we can create a new variable, time_point, that states this info more clearly. This new time_point variable will have two labels: acute illness and recovering based on the AV or CV coding located in the refinebio_title string variable.

    metadata <- metadata %>%
    -  dplyr::mutate(refinebio_treatment = factor(refinebio_treatment,
    -    levels = c("pre-adt", "post-adt")
    -  ))
    -

    Let’s double check that our factor set up is right.

    -
    levels(metadata$refinebio_treatment)
    -
    ## [1] "pre-adt"  "post-adt"
    + dplyr::mutate( + time_point = dplyr::case_when( + # Create our new variable based on refinebio_title containing AV/CV + stringr::str_detect(refinebio_title, "_AV_") ~ "acute illness", + stringr::str_detect(refinebio_title, "_CV_") ~ "recovering" + ), + # It's easier for future items if this is already set up as a factor + time_point = as.factor(time_point) + )
    +

    Let’s double check that our factor set up is right. We want acute illness to be the first level since it was the first time point collected.

    +
    levels(metadata$time_point)
    +
    ## [1] "acute illness" "recovering"
    +

    Great! We’re all set.

    @@ -4099,14 +4114,14 @@

    3.6 Determine parameters for WGCN # We will plot what WGCNA recommends as an R^2 cutoff geom_hline(yintercept = 0.80, col = "red") + # Just in case our values are low, we want to make sure we can still see the 0.80 level - ylim(c(min(sft_df$model_fit), 1)) + + ylim(c(min(sft_df$model_fit), 1.05)) + # We can add more sensible labels for our axis xlab("Soft Threshold (power)") + ylab("Scale Free Topology Model Fit, signed R^2") + ggtitle("Scale independence") + # This adds some nicer aesthetics to our plot theme_classic()

    -

    +

    Using this plot we can decide on a power parameter. WGCNA’s authors recommend using a power that has an signed \(R^2\) above 0.80, otherwise they warn your results may be too noisy to be meaningful.

    If you have multiple power values with signed \(R^2\) above 0.80, then picking the one at an inflection point, in other words where the \(R^2\) values seem to have reached their saturation (Zhang and Horvath 2005). You want to a power that gives you a big enough \(R^2\) but is not excessively large.

    So using the plot above, going with a power soft-threshold of 7!

    @@ -4114,7 +4129,7 @@

    3.6 Determine parameters for WGCN

    3.7 Run WGCNA!

    -

    We will use the blockwiseModules() function to find gene co-expression modules in WGCNA, using 16 for the power argument like we determined above.

    +

    We will use the blockwiseModules() function to find gene co-expression modules in WGCNA, using 7 for the power argument like we determined above.

    This next step takes some time to run. The blockwise part of the blockwiseModules() function name refers to that these calculations will be done on chunks of your data at a time to help with conserving computing resources.

    Here we are using the default maxBlockSize, 5000 but, you may want to adjust the maxBlockSize argument depending on your computer’s memory. The authors of WGCNA recommend running the largest block your computer can handle and they provide some approximations as to GB of memory of a laptop and what maxBlockSize it should be able to handle:

    @@ -4124,7 +4139,7 @@

    3.7 Run WGCNA!

    bwnet <- blockwiseModules(normalized_counts,
       maxBlockSize = 5000, # What size chunks (how many genes) the calculations should be run in
       TOMType = "signed", # topological overlap matrix
    -  power = 16, # soft threshold for network construction
    +  power = 7, # soft threshold for network construction
       numericLabels = TRUE, # Let's use numbers instead of colors for module labels
       randomSeed = 1234, # there's some randomness associated with this calculation
       # so we should set a seed
    @@ -4136,7 +4151,7 @@ 

    3.7 Run WGCNA!

    3.8 Write main WGCNA results object to file

    We will save our whole results object to an RDS file in case we want to return to our original WGCNA results.

    readr::write_rds(bwnet,
    -  file = file.path("results", "SRP133573_wgcna_results.RDS")
    +  file = file.path("results", "SRP140558_wgcna_results.RDS")
     )
    @@ -4158,8 +4173,8 @@

    3.10 Which modules have biggest d

    We can also see if our eigengenes relate to our metadata labels. First we double check that our samples are still in order.

    all.equal(metadata$refinebio_accession_code, rownames(module_eigengenes))
    ## [1] TRUE
    -
    # Create the design matrix from the refinebio_treatment variable
    -des_mat <- model.matrix(~ metadata$refinebio_treatment)
    +
    # Create the design matrix from the `time_point` variable
    +des_mat <- model.matrix(~ metadata$time_point)

    Run linear model on each module. Limma wants our tests to be per row, so we also need to transpose so the eigengenes are rows

    # lmFit() needs a transposed version of the matrix
     fit <- limma::lmFit(t(module_eigengenes), design = des_mat)
    @@ -4180,40 +4195,43 @@ 

    3.10 Which modules have biggest d

    Module 19 seems to be the most differentially expressed across time_point groups. Now we can do some investigation into this module.

    -
    -

    3.11 Let’s make plot of module 52

    -

    As a sanity check, let’s use ggplot to see what module 52’s eigengene looks like between treatment groups.

    +
    +

    3.11 Let’s make plot of module 19

    +

    As a sanity check, let’s use ggplot to see what module 18’s eigengene looks like between treatment groups.

    First we need to set up the module eigengene for this module with the sample metadata labels we need.

    -
    module_52_df <- module_eigengenes %>%
    +
    module_19_df <- module_eigengenes %>%
       tibble::rownames_to_column("accession_code") %>%
       # Here we are performing an inner join with a subset of metadata
       dplyr::inner_join(metadata %>%
    -    dplyr::select(refinebio_accession_code, refinebio_treatment),
    +    dplyr::select(refinebio_accession_code, time_point),
       by = c("accession_code" = "refinebio_accession_code")
       )

    Now we are ready for plotting.

    ggplot(
    -  module_52_df,
    +  module_19_df,
       aes(
    -    x = refinebio_treatment,
    -    y = ME52,
    -    color = refinebio_treatment
    +    x = time_point,
    +    y = ME19,
    +    color = time_point
       )
     ) +
    -  ggforce::geom_sina() +
    -  theme_classic()
    -

    -

    This makes sense! Looks like module 52 has elevated expression post treatment.

    + # a boxplot with outlier points hidden (they will be in the sina plot) + geom_boxplot(width = 0.2, outlier.shape = NA) + + # A sina plot to show all of the individual data points + ggforce::geom_sina(maxwidth = 0.3) + + theme_classic()
    +

    +

    This makes sense! Looks like module 19 has elevated expression during the acute illness but not when recovering.

    -
    -

    3.12 What genes are a part of module 52?

    +
    +

    3.12 What genes are a part of module 19?

    If you want to know which of your genes make up a modules, you can look at the $colors slot. This is a named list which associates the genes with the module they are a part of. We can turn this into a data frame for handy use.

    gene_module_key <- tibble::enframe(bwnet$colors, name = "gene", value = "module") %>%
       # Let's add the `ME` part so its more clear what these numbers are and it matches elsewhere
       dplyr::mutate(module = paste0("ME", module))
    -

    Now we can find what genes are a part of module 52.

    +

    Now we can find what genes are a part of module 19.

    gene_module_key %>%
    -  dplyr::filter(module == "ME52")
    + dplyr::filter(module == "ME19")