AlexsLemonade · cansavvy · Nov 20, 2020 · Nov 16, 2020 · Nov 16, 2020 · Nov 17, 2020
diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd
@@ -233,6 +233,9 @@ library(magrittr)
 
 # We'll need this for finding gene modules
 library(WGCNA)
+
+# We'll be making some plots
+library(ggplot2)
 ```
 
 ## Import and set up data
@@ -320,6 +323,103 @@ normalized_counts <- assay(dds_norm) %>%
  t() # Transpose this data
 ```
 
+## Determine parameters for WGCNA
+
+To identify which genes are in the same modules, WGCNA first creates a weighted network to define which genes are near each other. 
+The measure of "adjacency" it uses is based on the correlation matrix, but requires the definition of a threshold value, which in turn depends on a "power" parameter that defines the exponent used when transforming the correlation values. 
+The choice of power parameter will affect the number of modules identified, and the WGCNA modules provides the `pickSoftThreshold()` function to help identify good choices for this parameter. 
+
+```{r}
+sft <- pickSoftThreshold(normalized_counts,
+ dataIsExpr = TRUE,
+ corFnc = cor,
+ networkType = "signed"
+)
+```
+
+This `sft` object has a lot of information, we will want to plot some of it to figure out what our `power` soft-threshold should be. 
+We have to first calculate a measure of the model fit, the signed $R^2$, and make that a new variable. 
+
+```{r}
+sft_df <- data.frame(sft$fitIndices) %>%
+ dplyr::mutate(model_fit = -sign(slope) * SFT.R.sq)
+```
+
+Now, let's plot the model fitting by the `power` soft threshold so we can decide on a soft-threshold for power. 
+
+```{r}
+ggplot(sft_df, aes(x = Power, y = model_fit, label = Power)) +
+ # Plot the points
+ geom_point() +
+ # We'll put the Power labels slightly above the data points
+ geom_text(nudge_y = 0.1) +
+ # We will plot what WGCNA recommends as an R^2 cutoff
+ geom_hline(yintercept = 0.80, col = "red") +
+ # Just in case our values are low, we want to make sure we can still see the 0.80 level
+ ylim(c(min(sft_df$model_fit), 1)) +
+ # We can add more sensible labels for our axis
+ xlab("Soft Threshold (power)") +
+ ylab("Scale Free Topology Model Fit, signed R^2") +
+ ggtitle("Scale independence") +
+ # This adds some nicer aesthetics to our plot
+ theme_classic()
+```
+
+Using this plot we can decide on a power parameter. 
+WGCNA's authors recommend using a `power` that has an signed $R^2$ above `0.80`, otherwise they warn your results may be too noisy to be meaningful. 
+
+If you have multiple power values with signed $R^2$ above `0.80`, then picking the one at an inflection point, in other words where the $R^2$ values seem to have reached their saturation [@Zhang2005].
+You want to a `power` that gives you a big enough $R^2$ but is not excessively large.
+
+So using the plot above, going with a power soft-threshold of `16`!
+
+If you find you have all very low $R^2$ values this may be because there are too many genes with low expression values that are cluttering up the calculations. 
+You can try returning to [gene filtering step](#define-a-minimum-counts-cutoff) and choosing a more stringent cutoff (you'll then need to re-run the transformation and subsequent steps to remake this plot to see if that helped). 
+
+## Run WGCNA!
+
+We will use the `blockwiseModules()` function to find gene co-expression modules in WGCNA, using `16` for the `power` argument like we determined above. 
+
+This next step takes some time to run.
+The `blockwise` part of the `blockwiseModules()` function name refers to that these calculations will be done on chunks of your data at a time to help with conserving computing resources.
+
+Here we are using the default `maxBlockSize`, 5000 but, you may want to adjust the `maxBlockSize` argument depending on your computer's memory.
+The authors of WGCNA recommend running [the largest block your computer can handle](https://peterlangfelder.com/2018/11/25/blockwise-network-analysis-of-large-data/) and they provide some approximations as to GB of memory of a laptop and what `maxBlockSize` it should be able to handle:
+
+> • If the reader has access to a large workstation with more than 4 GB of memory, the parameter maxBlockSize
+can be increased. A 16GB workstation should handle up to 20000 probes; a 32GB workstation should handle
+perhaps 30000. A 4GB standard desktop or a laptop may handle up to 8000-10000 probes, depending on
+operating system and other running programs.
+
+[@Langfelder2016]
+
+```{r}
+bwnet <- blockwiseModules(normalized_counts,
+ maxBlockSize = 5000, # What size chunks (how many genes) the calculations should be run in
+ TOMType = "signed", # topological overlap matrix
+ power = 16, # soft threshold for network construction
+ numericLabels = TRUE, # Let's use numbers instead of colors for module labels
+ randomSeed = 1234, # there's some randomness associated with this calculation
+ # so we should set a seed
+)
+```
+
+The `TOMtype` argument specifies what kind of topological overlap matrix (TOM) should be used to make gene modules.
+You can safely assume for most situations a `signed` network represents what you want -- we want WGCNA to pay attention to directionality. 
+However if you suspect you may benefit from an `unsigned` network, where positive/negative is ignored see [this article](https://peterlangfelder.com/2018/11/25/signed-or-unsigned-which-network-type-is-preferable/) to help you figure that out [@Langfelder2018].
+
+There are a lot of other settings you can tweak -- look at `?blockwiseModules` help page as well as the [WGCNA tutorial](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/) [@Langfelder2016].
+
+## Write main WGCNA results object to file
+
+We will save our whole results object to an RDS file in case we want to return to our original WGCNA results.
+
+```{r}
+readr::write_rds(bwnet,
+ file = file.path("results", "SRP133573_wgcna_results.RDS")
+)
+```
+
 _Next sections addressed in upcoming PR_
 
 # Resources for further learning