ready for release

hds-sandbox · Aug 24, 2022 · e991841 · e991841
1 parent 1a9a26f
commit e991841
Show file tree

Hide file tree

Showing 100 changed files with 24,614 additions and 208 deletions.
diff --git a/Notebooks/06a_count_matrix.Rmd b/Notebooks/06a_count_matrix.Rmd
@@ -10,7 +10,13 @@ knit: (function(inputFile, encoding) {
 output:
   # To create PDF report, uncomment below
   #pdf_document:
-  #  toc: yes
+  #  toc: yes  
+  html_document:
+    theme: yeti # nice theme for the webpage
+    toc: yes # table of contents
+    toc_float: yes # table of contents "floats" in the document
+    df_print: paged # data frames are interactive
+    dev: png # what format do you want for the figures?
 ---
 
 ```{r knitr, include = FALSE}
@@ -21,12 +27,10 @@ knitr::opts_chunk$set(autodep        = TRUE,
                       echo           = TRUE,
                       error          = FALSE,
                       fig.align      = "center",
-                      fig.path       = paste0("./img/", DOCNAME, "/"),
-                      #fig.width      = 10,
-                      #fig.height     = 8,
+                      fig.path       = paste0("./img/", DOCNAME, "/"), #images will be put in this folder, under the notebook name
                       message        = FALSE,
                       warning        = FALSE,
-                      eval           = FALSE)
+                      eval           = TRUE)
 ```
 
 Approximate time: 20 minutes
@@ -37,28 +41,28 @@ Approximate time: 20 minutes
 -   Describe the RNA-seq and the differential gene expression analysis workflow
 -   Explain why negative binomial distribution is used to model RNA-seq count data
 
-### Loading libraries
+## Loading libraries
 
 For this analysis we will be using several R packages, some which have been installed from CRAN and others from Bioconductor. To use these packages (and the functions contained within them), we need to **load the libraries.** 
 
 ```{r}
 library(tidyverse)
-library(RColorBrewer)
 library(DESeq2)
-library(pheatmap)
-library(DEGreport)
-library(ggrepel)
+
+# And with this last line of code, we set our working directory to the folder with this notebook.
+# This way, the relative paths will work without issues
+setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
 ```
 
-### Loading data
+## Loading data
 
 To load the data into our current environment, we will be using the `read_table` function. We need to provide the path to each file. By default the function expects tab-delimited files, which is what we have.
 
 ```{r}
 ## Load in data
-data <- read_table("/work/introduction_bulkRNAseq_analysis/Data/Mov10_full_counts.txt") 
+data <- read_table("../Data/Mov10_full_counts.txt") 
 
-meta <- read_table("/work/introduction_bulkRNAseq_analysis/Data/Mov10_full_meta.txt")
+meta <- read_table("../Data/Mov10_full_meta.txt")
 ```
 
 Use `class()` to inspect our data and make sure we are working with data frames:
@@ -69,20 +73,20 @@ class(meta)
 class(data)
 ```
 
-### Viewing data
+## Viewing data
 
 Make sure your datasets contain the expected samples / information before proceeding to perfom any type of analysis.
 
 ```{r}
-View(meta)
-View(data)
+head(meta)
+head(data)
 ```
 
-### RNA-seq count distribution
+## RNA-seq count distribution
 
 To determine the appropriate statistical model, we need information about the distribution of counts. To get an idea about how RNA-seq counts are distributed, let's plot the counts of all the samples: 
 
-```{r}
+```{r count_distribution}
 pdata <- data %>% 
   gather(key = Sample, value = Count, -GeneSymbol)
 
@@ -96,28 +100,27 @@ ggplot(pdata) +
 
 If we zoom in close to zero, we can see a large number of genes with counts of zero:
 
-```{r}
+```{r count_distribution_zoom}
 ggplot(pdata) +
    geom_histogram(aes(x = Count), stat = "bin", bins = 200) + 
    xlim(-5, 500)  +
    xlab("Raw expression counts") +
    ylab("Number of genes")
 ```
 
-
 These images illustrate some common features of RNA-seq count data, including a **low number of counts associated with a large proportion of genes**, and a long right tail due to the **lack of any upper limit for expression**. Unlike microarray data, which has a dynamic range maximum limited due to when the probes max out, there is no limit of maximum expression for RNA-seq data. Due to the differences in these technologies, the statistical models used to fit the data are different between the two methods.
 
-### Modeling count data
+## Modeling count data
 
 RNAseq count data can be modeled using a **Poisson distribution**. this particular distribution is fitting for data where the **number of cases is very large but the probability of an event occurring is very small**. To give you an example, think of the lottery: many people buy lottery tickets (high number of cases), but only very few win (the probability of the event is small).
 
 With RNA-Seq data, **a very large number of RNAs are represented and the probability of pulling out a particular transcript is very small**. Thus, it would be an appropriate situation to use the Poisson distribution. However, a unique property of this distribution is that the mean == variance. Realistically, with RNA-Seq data there is always some biological variation present across the replicates (within a sample class). Genes with larger average expression levels will tend to have larger observed variances across replicates.
 
 The model that fits best, given this type of variability observed for replicates, is the **Negative Binomial (NB) model**. Essentially, **the NB model is a good approximation for data where the mean \< variance**, as is the case with RNA-Seq count data.
 
-Run the following code to plot the *mean versus variance* for the 'Mov10 overexpression' replicates:
+Run the following code to plot the *mean versus variance* for our data:
 
-```{r}
+```{r mean_vs_variance}
 df <- data %>% 
   rowwise() %>% 
   summarise(mean_counts = mean(Mov10_kd_2:Irrel_kd_3), 

diff --git a/Notebooks/06b_count_normalization.Rmd b/Notebooks/06b_count_normalization.Rmd
@@ -10,8 +10,13 @@ knit: (function(inputFile, encoding) {
 output:
   # To create PDF report, uncomment below
   #pdf_document:
-  #  toc: yes
-
+  #  toc: yes  
+  html_document:
+    theme: yeti # nice theme for the webpage
+    toc: yes # table of contents
+    toc_float: yes # table of contents "floats" in the document
+    df_print: paged # data frames are interactive
+    dev: png # what format do you want for the figures?
 ---
 
 ```{r knitr, include = FALSE}
@@ -22,14 +27,28 @@ knitr::opts_chunk$set(autodep        = TRUE,
                       echo           = TRUE,
                       error          = FALSE,
                       fig.align      = "center",
-                      fig.path       = paste0("./img/", DOCNAME, "/"),
-                      #fig.width      = 10,
-                      #fig.height     = 8,
+                      fig.path       = paste0("./img/", DOCNAME, "/"), #images will be put in this folder, under the notebook name
                       message        = FALSE,
                       warning        = FALSE,
-                      eval           = FALSE)
+                      eval           = TRUE)
+```
+
+```{r setup, include = FALSE, echo = FALSE}
+# DO NOT RUN IF YOU HAVE ALREADY RUN PREVIOUS NOTEBOOKS
+# This chunk is ONLY necessary if you want to knit this document into a pdf!!
+library(tidyverse)
+library(DESeq2)
+
+# And with this last line of code, we set our working directory to the folder with this notebook.
+# This way, the relative paths will work without issues
+setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
+
+# Load objects so it can be knitted
+data <- read_table("../Data/Mov10_full_counts.txt") 
+meta <- read_table("../Data/Mov10_full_meta.txt")
 ```
 
+
 Approximate time: 40 minutes
 
 ## Learning Objectives 
@@ -113,15 +132,15 @@ Let's start by creating the `DESeqDataSet` object and then we can talk a bit mor
 
 ```{r}
 ## Create DESeq2Dataset object
-dds <- DESeqDataSetFromMatrix(countData = data.frame(data[,-1], row.names = data$GeneSymbol), 
-                              colData = data.frame(meta[,-1], row.names = meta$samplename), 
+dds <- DESeqDataSetFromMatrix(countData = data %>% column_to_rownames("GeneSymbol"), 
+                              colData = meta %>% column_to_rownames("samplename"), 
                               design = ~ sampletype)
 ```
 
-You can use DESeq-specific functions to access the different slots and retrieve information, if you wish. For example, suppose we wanted the original count matrix we would use `counts()` (*Note: we nested it within the `View()` function so that rather than getting printed in the console we can see it in the script editor*) :
+You can use DESeq-specific functions to access the different slots and retrieve information, if you wish. For example, suppose we wanted the original count matrix we would use `counts()`:
 
 ```{r}
-View(counts(dds))
+head(counts(dds))
 ```
 
 As we go through the workflow we will use the relevant functions to check what information gets stored inside our object.
@@ -146,10 +165,11 @@ Now, to retrieve the normalized counts matrix from `dds`, we use the `counts()`
 
 ```{r}
 normalized_counts <- counts(dds, normalized=TRUE)
+head(normalized_counts)
 ```
 
 We can save this normalized data matrix to file for later use:
 
 ```{r}
-write.table(normalized_counts, file="/work/introduction_bulkRNAseq_analysis/Results/normalized_counts.txt", sep="\t", quote=F)
+write.table(normalized_counts, file="../Results/normalized_counts.txt", sep="\t", quote=F)
 ```
diff --git a/Notebooks/07_exploratory_analysis.Rmd b/Notebooks/07_exploratory_analysis.Rmd
@@ -1,5 +1,5 @@
 ---
-title: "Exploratory analysis of bulk RNAseq"
+title: "Count normalization with DESeq2"
 author: "You!"
 date: '`r Sys.Date()`'
 knit: (function(inputFile, encoding) { 
@@ -10,8 +10,13 @@ knit: (function(inputFile, encoding) {
 output:
   # To create PDF report, uncomment below
   #pdf_document:
-  #  toc: yes
-
+  #  toc: yes  
+  html_document:
+    theme: yeti # nice theme for the webpage
+    toc: yes # table of contents
+    toc_float: yes # table of contents "floats" in the document
+    df_print: paged # data frames are interactive
+    dev: png # what format do you want for the figures?
 ---
 
 ```{r knitr, include = FALSE}
@@ -22,12 +27,30 @@ knitr::opts_chunk$set(autodep        = TRUE,
                       echo           = TRUE,
                       error          = FALSE,
                       fig.align      = "center",
-                      fig.path       = paste0("./img/", DOCNAME, "/"),
-                      #fig.width      = 10,
-                      #fig.height     = 8,
+                      fig.path       = paste0("./img/", DOCNAME, "/"), #images will be put in this folder, under the notebook name
                       message        = FALSE,
                       warning        = FALSE,
-                      eval           = FALSE)
+                      eval           = TRUE)
+```
+
+```{r setup, include = FALSE, echo = FALSE}
+# DO NOT RUN IF YOU HAVE ALREADY RUN PREVIOUS NOTEBOOKS
+# This chunk is ONLY necessary if you want to knit this document into a pdf!!
+library(tidyverse)
+library(DESeq2)
+
+# And with this last line of code, we set our working directory to the folder with this notebook.
+# This way, the relative paths will work without issues
+setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
+
+# Load objects so it can be knitted
+data <- read_table("../Data/Mov10_full_counts.txt") 
+meta <- read_table("../Data/Mov10_full_meta.txt")
+
+# Create dds object
+dds <- DESeqDataSetFromMatrix(countData = data %>% column_to_rownames("GeneSymbol"), 
+                              colData = meta %>% column_to_rownames("samplename"), 
+                              design = ~ sampletype)
 ```
 
 Approximate time: 80 minutes
@@ -68,7 +91,7 @@ DESeq2 has a built-in function for generating PCA plots using `ggplot2` under th
 
 The function `plotPCA()` requires two arguments as input: a `DESeqTransform` object and the "intgroup" (interesting group), i.e. the name of the column in our metadata that has information about the experimental sample groups. 
 
-```{r}
+```{r DESeq_PCA}
 ### Plot PCA 
 plotPCA(rld, intgroup="sampletype")
 ```
@@ -100,7 +123,7 @@ pca <- prcomp(t(rld_mat)) # perform PCA
 df <- cbind(meta, pca$x) # Create data frame with metadata and PC3 and PC4 values for input to ggplot
 ```
 
-```{r}
+```{r custom_PCA}
 # ggplot with info for all PCAs
 ggplot(df) + geom_point(aes(x=PC3, y=PC4, color = sampletype))
 ```
@@ -133,7 +156,7 @@ You will notice that they match the names we have given our samples in the metad
 
 Now, let's plot the heatmap!
 
-```{r}
+```{r dist_heatmap}
 ### Load pheatmap package
 library(pheatmap)
 
@@ -155,9 +178,21 @@ Instead of using distances between expression patterns, check the Pearson correl
 There are many arguments and options for the `pheatmap()` function. You could, for example, change the color scale used, remove the dendograms, avoid clustering or even scale the values per row or per column.
 
 ```{r}
-heat.colors <- RColorBrewer::brewer.pal(6, "Blues")
-pheatmap(rld_cor, annotation = meta, color = heat.colors, border_color=NA, fontsize = 10, 
+library(RColorBrewer)
+heat.colors <- brewer.pal(6, "Blues") # Colors from the RColorBrewer package (only 6)
+heat.colors <- colorRampPalette(heat.colors)(100) # Interpolate 100 colors
+```
+
+
+```{r custom_heatmap}
+pheatmap(sampleDistMatrix, annotation = meta %>% column_to_rownames("samplename"), 
+         color = heat.colors, border_color=NA, fontsize = 10, 
          fontsize_row = 10, height=20)
 ```       
 
 You can check all the colors that RColorBrewer offers by using the following command: `display.brewer.all()`
+
+```{r RcolorBrewer_colors}
+display.brewer.all()
+```
+
diff --git a/Notebooks/08_extra_contrast_design.Rmd b/Notebooks/08_extra_contrast_design.Rmd
@@ -10,8 +10,13 @@ knit: (function(inputFile, encoding) {
 output:
   # To create PDF report, uncomment below
   #pdf_document:
-  #  toc: yes
-
+  #  toc: yes  
+  html_document:
+    theme: yeti # nice theme for the webpage
+    toc: yes # table of contents
+    toc_float: yes # table of contents "floats" in the document
+    df_print: paged # data frames are interactive
+    dev: png # what format do you want for the figures?
 ---
 
 ```{r knitr, include = FALSE}
@@ -22,12 +27,21 @@ knitr::opts_chunk$set(autodep        = TRUE,
                       echo           = TRUE,
                       error          = FALSE,
                       fig.align      = "center",
-                      fig.path       = paste0("./img/", DOCNAME, "/"),
-                      #fig.width      = 10,
-                      #fig.height     = 8,
+                      fig.path       = paste0("./img/", DOCNAME, "/"), #images will be put in this folder, under the notebook name
                       message        = FALSE,
                       warning        = FALSE,
-                      eval           = FALSE)
+                      eval           = TRUE)
+```
+
+```{r setup, include = FALSE, echo = FALSE}
+# DO NOT RUN IF YOU HAVE ALREADY RUN PREVIOUS NOTEBOOKS
+# This chunk is ONLY necessary if you want to knit this document into a pdf!!
+library(tidyverse)
+library(DESeq2)
+
+# And with this last line of code, we set our working directory to the folder with this notebook.
+# This way, the relative paths will work without issues
+setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
 ```
 
 Approximate time: 40 minutes