Skip to content

Commit

Permalink
ready for release
Browse files Browse the repository at this point in the history
  • Loading branch information
joseale2310 committed Aug 24, 2022
1 parent 1a9a26f commit e991841
Show file tree
Hide file tree
Showing 100 changed files with 24,614 additions and 208 deletions.
49 changes: 26 additions & 23 deletions Notebooks/06a_count_matrix.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,13 @@ knit: (function(inputFile, encoding) {
output:
# To create PDF report, uncomment below
#pdf_document:
# toc: yes
# toc: yes
html_document:
theme: yeti # nice theme for the webpage
toc: yes # table of contents
toc_float: yes # table of contents "floats" in the document
df_print: paged # data frames are interactive
dev: png # what format do you want for the figures?
---

```{r knitr, include = FALSE}
Expand All @@ -21,12 +27,10 @@ knitr::opts_chunk$set(autodep = TRUE,
echo = TRUE,
error = FALSE,
fig.align = "center",
fig.path = paste0("./img/", DOCNAME, "/"),
#fig.width = 10,
#fig.height = 8,
fig.path = paste0("./img/", DOCNAME, "/"), #images will be put in this folder, under the notebook name
message = FALSE,
warning = FALSE,
eval = FALSE)
eval = TRUE)
```

Approximate time: 20 minutes
Expand All @@ -37,28 +41,28 @@ Approximate time: 20 minutes
- Describe the RNA-seq and the differential gene expression analysis workflow
- Explain why negative binomial distribution is used to model RNA-seq count data

### Loading libraries
## Loading libraries

For this analysis we will be using several R packages, some which have been installed from CRAN and others from Bioconductor. To use these packages (and the functions contained within them), we need to **load the libraries.**

```{r}
library(tidyverse)
library(RColorBrewer)
library(DESeq2)
library(pheatmap)
library(DEGreport)
library(ggrepel)
# And with this last line of code, we set our working directory to the folder with this notebook.
# This way, the relative paths will work without issues
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
```

### Loading data
## Loading data

To load the data into our current environment, we will be using the `read_table` function. We need to provide the path to each file. By default the function expects tab-delimited files, which is what we have.

```{r}
## Load in data
data <- read_table("/work/introduction_bulkRNAseq_analysis/Data/Mov10_full_counts.txt")
data <- read_table("../Data/Mov10_full_counts.txt")
meta <- read_table("/work/introduction_bulkRNAseq_analysis/Data/Mov10_full_meta.txt")
meta <- read_table("../Data/Mov10_full_meta.txt")
```

Use `class()` to inspect our data and make sure we are working with data frames:
Expand All @@ -69,20 +73,20 @@ class(meta)
class(data)
```

### Viewing data
## Viewing data

Make sure your datasets contain the expected samples / information before proceeding to perfom any type of analysis.

```{r}
View(meta)
View(data)
head(meta)
head(data)
```

### RNA-seq count distribution
## RNA-seq count distribution

To determine the appropriate statistical model, we need information about the distribution of counts. To get an idea about how RNA-seq counts are distributed, let's plot the counts of all the samples:

```{r}
```{r count_distribution}
pdata <- data %>%
gather(key = Sample, value = Count, -GeneSymbol)
Expand All @@ -96,28 +100,27 @@ ggplot(pdata) +

If we zoom in close to zero, we can see a large number of genes with counts of zero:

```{r}
```{r count_distribution_zoom}
ggplot(pdata) +
geom_histogram(aes(x = Count), stat = "bin", bins = 200) +
xlim(-5, 500) +
xlab("Raw expression counts") +
ylab("Number of genes")
```


These images illustrate some common features of RNA-seq count data, including a **low number of counts associated with a large proportion of genes**, and a long right tail due to the **lack of any upper limit for expression**. Unlike microarray data, which has a dynamic range maximum limited due to when the probes max out, there is no limit of maximum expression for RNA-seq data. Due to the differences in these technologies, the statistical models used to fit the data are different between the two methods.

### Modeling count data
## Modeling count data

RNAseq count data can be modeled using a **Poisson distribution**. this particular distribution is fitting for data where the **number of cases is very large but the probability of an event occurring is very small**. To give you an example, think of the lottery: many people buy lottery tickets (high number of cases), but only very few win (the probability of the event is small).

With RNA-Seq data, **a very large number of RNAs are represented and the probability of pulling out a particular transcript is very small**. Thus, it would be an appropriate situation to use the Poisson distribution. However, a unique property of this distribution is that the mean == variance. Realistically, with RNA-Seq data there is always some biological variation present across the replicates (within a sample class). Genes with larger average expression levels will tend to have larger observed variances across replicates.

The model that fits best, given this type of variability observed for replicates, is the **Negative Binomial (NB) model**. Essentially, **the NB model is a good approximation for data where the mean \< variance**, as is the case with RNA-Seq count data.

Run the following code to plot the *mean versus variance* for the 'Mov10 overexpression' replicates:
Run the following code to plot the *mean versus variance* for our data:

```{r}
```{r mean_vs_variance}
df <- data %>%
rowwise() %>%
summarise(mean_counts = mean(Mov10_kd_2:Irrel_kd_3),
Expand Down
42 changes: 31 additions & 11 deletions Notebooks/06b_count_normalization.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -10,8 +10,13 @@ knit: (function(inputFile, encoding) {
output:
# To create PDF report, uncomment below
#pdf_document:
# toc: yes

# toc: yes
html_document:
theme: yeti # nice theme for the webpage
toc: yes # table of contents
toc_float: yes # table of contents "floats" in the document
df_print: paged # data frames are interactive
dev: png # what format do you want for the figures?
---

```{r knitr, include = FALSE}
Expand All @@ -22,14 +27,28 @@ knitr::opts_chunk$set(autodep = TRUE,
echo = TRUE,
error = FALSE,
fig.align = "center",
fig.path = paste0("./img/", DOCNAME, "/"),
#fig.width = 10,
#fig.height = 8,
fig.path = paste0("./img/", DOCNAME, "/"), #images will be put in this folder, under the notebook name
message = FALSE,
warning = FALSE,
eval = FALSE)
eval = TRUE)
```

```{r setup, include = FALSE, echo = FALSE}
# DO NOT RUN IF YOU HAVE ALREADY RUN PREVIOUS NOTEBOOKS
# This chunk is ONLY necessary if you want to knit this document into a pdf!!
library(tidyverse)
library(DESeq2)
# And with this last line of code, we set our working directory to the folder with this notebook.
# This way, the relative paths will work without issues
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
# Load objects so it can be knitted
data <- read_table("../Data/Mov10_full_counts.txt")
meta <- read_table("../Data/Mov10_full_meta.txt")
```


Approximate time: 40 minutes

## Learning Objectives
Expand Down Expand Up @@ -113,15 +132,15 @@ Let's start by creating the `DESeqDataSet` object and then we can talk a bit mor

```{r}
## Create DESeq2Dataset object
dds <- DESeqDataSetFromMatrix(countData = data.frame(data[,-1], row.names = data$GeneSymbol),
colData = data.frame(meta[,-1], row.names = meta$samplename),
dds <- DESeqDataSetFromMatrix(countData = data %>% column_to_rownames("GeneSymbol"),
colData = meta %>% column_to_rownames("samplename"),
design = ~ sampletype)
```

You can use DESeq-specific functions to access the different slots and retrieve information, if you wish. For example, suppose we wanted the original count matrix we would use `counts()` (*Note: we nested it within the `View()` function so that rather than getting printed in the console we can see it in the script editor*) :
You can use DESeq-specific functions to access the different slots and retrieve information, if you wish. For example, suppose we wanted the original count matrix we would use `counts()`:

```{r}
View(counts(dds))
head(counts(dds))
```

As we go through the workflow we will use the relevant functions to check what information gets stored inside our object.
Expand All @@ -146,10 +165,11 @@ Now, to retrieve the normalized counts matrix from `dds`, we use the `counts()`

```{r}
normalized_counts <- counts(dds, normalized=TRUE)
head(normalized_counts)
```

We can save this normalized data matrix to file for later use:

```{r}
write.table(normalized_counts, file="/work/introduction_bulkRNAseq_analysis/Results/normalized_counts.txt", sep="\t", quote=F)
write.table(normalized_counts, file="../Results/normalized_counts.txt", sep="\t", quote=F)
```
59 changes: 47 additions & 12 deletions Notebooks/07_exploratory_analysis.Rmd
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
---
title: "Exploratory analysis of bulk RNAseq"
title: "Count normalization with DESeq2"
author: "You!"
date: '`r Sys.Date()`'
knit: (function(inputFile, encoding) {
Expand All @@ -10,8 +10,13 @@ knit: (function(inputFile, encoding) {
output:
# To create PDF report, uncomment below
#pdf_document:
# toc: yes

# toc: yes
html_document:
theme: yeti # nice theme for the webpage
toc: yes # table of contents
toc_float: yes # table of contents "floats" in the document
df_print: paged # data frames are interactive
dev: png # what format do you want for the figures?
---

```{r knitr, include = FALSE}
Expand All @@ -22,12 +27,30 @@ knitr::opts_chunk$set(autodep = TRUE,
echo = TRUE,
error = FALSE,
fig.align = "center",
fig.path = paste0("./img/", DOCNAME, "/"),
#fig.width = 10,
#fig.height = 8,
fig.path = paste0("./img/", DOCNAME, "/"), #images will be put in this folder, under the notebook name
message = FALSE,
warning = FALSE,
eval = FALSE)
eval = TRUE)
```

```{r setup, include = FALSE, echo = FALSE}
# DO NOT RUN IF YOU HAVE ALREADY RUN PREVIOUS NOTEBOOKS
# This chunk is ONLY necessary if you want to knit this document into a pdf!!
library(tidyverse)
library(DESeq2)
# And with this last line of code, we set our working directory to the folder with this notebook.
# This way, the relative paths will work without issues
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
# Load objects so it can be knitted
data <- read_table("../Data/Mov10_full_counts.txt")
meta <- read_table("../Data/Mov10_full_meta.txt")
# Create dds object
dds <- DESeqDataSetFromMatrix(countData = data %>% column_to_rownames("GeneSymbol"),
colData = meta %>% column_to_rownames("samplename"),
design = ~ sampletype)
```

Approximate time: 80 minutes
Expand Down Expand Up @@ -68,7 +91,7 @@ DESeq2 has a built-in function for generating PCA plots using `ggplot2` under th

The function `plotPCA()` requires two arguments as input: a `DESeqTransform` object and the "intgroup" (interesting group), i.e. the name of the column in our metadata that has information about the experimental sample groups.

```{r}
```{r DESeq_PCA}
### Plot PCA
plotPCA(rld, intgroup="sampletype")
```
Expand Down Expand Up @@ -100,7 +123,7 @@ pca <- prcomp(t(rld_mat)) # perform PCA
df <- cbind(meta, pca$x) # Create data frame with metadata and PC3 and PC4 values for input to ggplot
```

```{r}
```{r custom_PCA}
# ggplot with info for all PCAs
ggplot(df) + geom_point(aes(x=PC3, y=PC4, color = sampletype))
```
Expand Down Expand Up @@ -133,7 +156,7 @@ You will notice that they match the names we have given our samples in the metad

Now, let's plot the heatmap!

```{r}
```{r dist_heatmap}
### Load pheatmap package
library(pheatmap)
Expand All @@ -155,9 +178,21 @@ Instead of using distances between expression patterns, check the Pearson correl
There are many arguments and options for the `pheatmap()` function. You could, for example, change the color scale used, remove the dendograms, avoid clustering or even scale the values per row or per column.

```{r}
heat.colors <- RColorBrewer::brewer.pal(6, "Blues")
pheatmap(rld_cor, annotation = meta, color = heat.colors, border_color=NA, fontsize = 10,
library(RColorBrewer)
heat.colors <- brewer.pal(6, "Blues") # Colors from the RColorBrewer package (only 6)
heat.colors <- colorRampPalette(heat.colors)(100) # Interpolate 100 colors
```


```{r custom_heatmap}
pheatmap(sampleDistMatrix, annotation = meta %>% column_to_rownames("samplename"),
color = heat.colors, border_color=NA, fontsize = 10,
fontsize_row = 10, height=20)
```

You can check all the colors that RColorBrewer offers by using the following command: `display.brewer.all()`

```{r RcolorBrewer_colors}
display.brewer.all()
```

26 changes: 20 additions & 6 deletions Notebooks/08_extra_contrast_design.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -10,8 +10,13 @@ knit: (function(inputFile, encoding) {
output:
# To create PDF report, uncomment below
#pdf_document:
# toc: yes

# toc: yes
html_document:
theme: yeti # nice theme for the webpage
toc: yes # table of contents
toc_float: yes # table of contents "floats" in the document
df_print: paged # data frames are interactive
dev: png # what format do you want for the figures?
---

```{r knitr, include = FALSE}
Expand All @@ -22,12 +27,21 @@ knitr::opts_chunk$set(autodep = TRUE,
echo = TRUE,
error = FALSE,
fig.align = "center",
fig.path = paste0("./img/", DOCNAME, "/"),
#fig.width = 10,
#fig.height = 8,
fig.path = paste0("./img/", DOCNAME, "/"), #images will be put in this folder, under the notebook name
message = FALSE,
warning = FALSE,
eval = FALSE)
eval = TRUE)
```

```{r setup, include = FALSE, echo = FALSE}
# DO NOT RUN IF YOU HAVE ALREADY RUN PREVIOUS NOTEBOOKS
# This chunk is ONLY necessary if you want to knit this document into a pdf!!
library(tidyverse)
library(DESeq2)
# And with this last line of code, we set our working directory to the folder with this notebook.
# This way, the relative paths will work without issues
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
```

Approximate time: 40 minutes
Expand Down
Loading

0 comments on commit e991841

Please sign in to comment.