lab_preprocessing.Rmd

---
title: "Data preprocessing"
subtitle: Workshop on RNA-Seq
editor_options:
  chunk_output_type: console
output:
  bookdown::html_document2:
    highlight: textmate
    toc: true
    toc_float:
      collapsed: true
      smooth_scroll: true
      print: false
    toc_depth: 4
    number_sections: true
    df_print: default
    code_folding: none
    self_contained: false
    keep_md: false
    encoding: 'UTF-8'
    css: "assets/lab.css"
    include:
      after_body: assets/footer-lab.html
---

```{r,child="assets/header-lab.Rmd"}
```

<div class="boxy boxy-exclamation boxy-yellow">
Set `~/RNAseq/labs/` as your working directory (i.e. your working directory should be the directory where you saved this .Rmd file). Create a subdirectory named `data` in this directory (`~/RNAseq/labs/data/`) for input and output files.
</div>

Data preprocessing is done in R. Load the necessary R packages and source the download function.

```{r}
# plotting
library(DESeq2) # rna-seq
library(stringr)
library(rafalib) # nice plot arrangement
rafalib::mypar(mar=c(6,2.5,2.5,1)) #sets nice arrangement for the whole document

# source download function
source("https://raw.githubusercontent.com/NBISweden/workshop-RNAseq/master/assets/scripts.R")
```

<br>

# Preparation

The first step is some data wrangling and clean-up to prepare the data for analyses. Download the count table generated by featureCounts (named **gene_counts_original.tsv** in the repository) into the `data` subdirectory.

```{r}
download_data("data/gene_counts_original.tsv")
```

Download the metadata table.

```{r}
download_data("data/metadata_original.csv")
```

Read in the files to R.

```{r}
co <- read.delim("data/gene_counts_original.tsv",
                 sep = "\t",
                 header = TRUE,
                 stringsAsFactors = FALSE,
                 comment.char = "#")
mo <- read.csv("data/metadata_original.csv",
               header = TRUE)
```

Inspect how the files look like.

```{r}
head(co)
head(mo)
```

We need the count table to have only the counts (numeric data) with gene IDs as row names. And the metadata table must only include information for the samples that are in the count table. And in the same order.

```{r}
cr <- co[,7:12]
rownames(cr) <- co$Geneid
colnames(cr) <- substr(colnames(cr),42,43)
mr <- mo[mo$No %in% as.integer(colnames(cr)),]
colnames(cr) <- mr$SampleName
rownames(mr) <- mr$SampleName
```

Inspect the new count table and metadata table. Ensure that labels match.

```{r}
head(cr)
mr
all.equal(colnames(cr),rownames(mr))
```

Finally save the data in the **data** directory.

```{r}
write.csv(cr,file="data/gene_counts_raw.csv",quote=FALSE)
write.csv(mr,file="data/metadata_raw.csv",quote=FALSE)
```

# Filtering

The next step is to remove samples and genes that are uninformative. We read in the count table (*You can skip this step if continuing from the previous step*).

```{r}
download_data("data/gene_counts_raw.csv")
cr <- read.csv("data/gene_counts_raw.csv",
               header = TRUE,
               stringsAsFactors = FALSE,
               row.names = 1)
head(cr)
str(cr)
```

The count data shows read counts across samples and genes. The columns denote samples and rows denote genes.

Read in the metadata (*You can skip this step if continuing from the previous step*). Each row corresponds to a sample.

```{r}
download_data("data/metadata_raw.csv")
mr <- read.csv("data/metadata_raw.csv",
               header = TRUE,
               stringsAsFactors = FALSE,
               row.names = 1)
head(mr)
str(mr)
```

The most relevant metadata columns are **sampleName** and **Group**. It is important to check that the number of columns of data match the number of rows of metadata. And that the column names of data match the row names of metadata.

```{r}
all.equal(colnames(cr),rownames(mr))
```

Let's visualise the distribution of counts using a boxplot and density plot.

```{r}
rafalib::mypar(1,2, mar=c(6,3,3,2))
boxplot(log2(as.matrix(cr)+1),
        ylab = expression('Log'[2]~'Read counts'), las=2, main="Raw data")
hist(log2(as.matrix(cr)+1),
     ylab = "", las = 2, main = "Raw data")
par(mfrow=c(1,1))
```

On the boxplot, the median values are zero across all samples. This means that half the values in each sample are zeros. On the histogram, we see a huge peak of zeros. This data set would benefit from a low count filtering.

We can check if any samples need to be discarded based on the number of genes detected. We create a barplot of genes detected across samples.

```{r}
barplot(colSums(cr > 3), las = 2) # Remove ylab from here
mtext("Number of detected genes", side = 2, line = 3) # Add ylab manually
abline(h = median(colSums(cr>3)))
```

On average, about `r sprintf("%0.5f",mean(colSums(cr>3)))` genes are detected. All samples are more or less close to the average. None of the samples look bad enough to be removed.

<i class="fas fa-comments"></i> What does `cr>3` do? Why did we use 3? Is it better than using `cr>0`?

And we can create a similar plot for detection rate across genes.

```{r}
barplot(rowSums(cr>3), xlab = "Genes", ylab = "Number of samples", names.arg = "")
abline(h = median(rowSums(cr>3)), col="red")
```

This is hard to see. It's perhaps easier to plot a histogram.

```{r}
hist(rowSums(cr>3))
```

There are a lot of genes that are not expressed (ie; zero on x-axis) in any sample. These can be removed completely. We are mostly interested in the peak on the right. These are genes that are expressed in all samples. We don't want to be too stringent, so we will choose to keep genes that are expressed in at least 3 samples since our groups have 3 samples each.

Below, rather than using zero as the minimum value for detection, we used minimum of 5 reads in at least 3 samples (since each of test groups consist of 3 samples).

```{r}
# remove genes with low counts
keep_genes <- rowSums( cr > 5 ) >= 3
cf <- cr[keep_genes,]
```

<i class="fas fa-comments"></i> How would the results change if we used total number of samples (ie; 6 for this dataset) in the code above? Are there any drawbacks to doing that?

Distribution of the filtered counts looks like below. Compare this to the previous boxplot above.

```{r}
boxplot(log2(as.matrix(cf)+1),ylab=expression('Log'[2]~'Read counts'),las=2,main="Filtered data")
```

In addition, compare the histogram of filtered counts below to the raw data above.

```{r}
hist(rowSums(cf>3))
```

The missingness in the data set is reduced. The filtering process has removed `r nrow(cr)-nrow(cf)` genes with low counts.

Since no samples were discarded, the metadata file will remain the same. And we can check that the labels are in the same order in counts and metadata.

```{r}
all.equal(colnames(cf),rownames(mr))
```

At this point, we can save the filtered data.

```{r}
write.csv(cf, "data/counts_filtered.csv", quote = F)
```

# Normalisation

The raw count data needs to be corrected for various biases before statistical inference. If the dataset is to be used in an R package for differential gene expression such as **DESeq2**, **edgeR** or **Limma**, you must provide the **raw data** directly. This is because, these packages handle the correction and transformation internally. In addition, these packages do not control for gene length. Therefore, for custom analyses and gene-to-gene comparison, the raw data needs to be normalised.

## CPM/TPM

For analysis other than DGE, the data set must be corrected before use. The most basic correction required is sequencing depth. This is achieved using rescaling the counts to counts per 1 million (CPM).

```{r}
download_data("data/counts_filtered.csv")
cf <- read.csv("data/counts_filtered.csv",
               stringsAsFactors = F,
               row.names = 1)

download_data("data/metadata_raw.csv")
if(!exists("mr")) mr <- read.csv("data/metadata_raw.csv",
                                 stringsAsFactors=F,
                                 row.names=1)

all.equal(colnames(cf),rownames(mr))
```

```{r}
cc <- cf / colSums(cf) * 1e6 
log2_cc <- log2( cc + 1 )
boxplot(log2_cc,ylab=expression('Log'[2]~'Read counts'), las=2, main="Log2 CPM")
```

But, CPM data has some drawbacks. It is not suitable for within-sample comparisons. The total number of reads per sample varies from sample to sample. This also makes it harder to compare one experiment to another. In addition, gene length is not controlled for in this correction. RPKM/FPKM normalisations correct for gene length, but they are not recommended because they are not comparable between samples.

A better correction method that resolves sequencing depth and gene length is TPM (transcripts-per-million). The code for computing TPM is simple.

```{r}
#' @title Compute TPM from a read count matrix
#' @param counts A numeric data.frame of read counts with samples (columns) and genes (rows).
#' @param len A vector of gene cds length equal to number of rows of dfr.
#'
#' https://support.bioconductor.org/p/91218/
#'
tpm <- function(counts,len) {
  x <- counts/(len/1000)
  return(x*1e6/colSums(x))
}
```

We read in the gene length information.

```{r}
co <- read.delim("data/gene_counts_original.tsv",
                 sep="\t",
                 header=TRUE,
                 stringsAsFactors=F,
                 comment.char="#")
g <- data.frame( ensembl_gene_id = co$Geneid , 
                 transcript_length = co$Length,
                 stringsAsFactors = F, row.names = co$Geneid)
g <- g[!duplicated(g$ensembl_gene_id),]
```

Next, we find shared genes between count data and annotation data and match their order.

```{r}
igenes <- intersect(rownames(cf),g$ensembl_gene_id)
g1 <- g[igenes,]
cf1 <- cf[igenes,]
all.equal(rownames(cf1),g1$ensembl_gene_id)
```

And then we run the `tpm()` function on the count data using the gene lengths. And then we create a boxplot of the resulting values.

```{r}
ct <- tpm(cf1,g1$transcript_length)
log2_ct <- log2( ct + 1 )
boxplot(log2_ct,ylab=expression('Log'[2]~'TPM'),las=2,main="Log2 TPM")
write.csv(ct,"data/counts_tpm.csv",quote=F)
```

This is the distribution of TPM counts.

## DESeq2

DESeq2 internally corrects counts for sequencing depth and RNA compositional bias using **Median of ratios** method. The details of this method are described further in the DGE lab. To run this method, we create a DESeq2 object using the count data and metadata.

```{r}
library(DESeq2)
mr$Group <- factor(mr$Group)
d <- DESeqDataSetFromMatrix(countData=cf,colData=mr,design=~Group)
d <- DESeq2::estimateSizeFactors(d,type="ratio")
cd <- counts(d,normalized=TRUE)
saveRDS(cd,"data/gene_counts_normalised_deseq2.Rds")
```

```{r}
cd <- readRDS("data/gene_counts_normalised_deseq2.Rds")
log2_cd <- log2(cd + 1)
boxplot(log2_cd,ylab=expression('Log'[2]~'Read counts'),las=2,main="DESeq2")
```

## VST

For the purpose of exploratory analysis such as MDS, PCA, clustering etc, VST (variance-stabilizing-transformation) is recommended. VST is also run using DESeq2. As in the previous step, a DESeq2 object is created.

```{r}
library(DESeq2)
mr$Group <- factor(mr$Group)
d <- DESeqDataSetFromMatrix(countData=cf,colData=mr,design=~Group)
d <- DESeq2::estimateSizeFactors(d,type="ratio")
d <- DESeq2::estimateDispersions(d)
cv <- as.data.frame(assay(varianceStabilizingTransformation(d,blind=T)),check.names=F)

write.csv(cv,"data/counts_vst.csv",quote=FALSE)

boxplot(cv,ylab=expression('Log'[2]~'Read counts'),las=2,main="VST")
```

The effect of VST transformation can be clearly seen in a mean vs variance plot.

```{r,fig.height=7,fig.width=7}
rowVar <- function(x) apply(x,1,var)
rafalib::mypar(mfrow=c(3,2))
plot(log2(rowMeans(cf)),log2(rowVar(cf)),
     xlab='Mean raw counts (log scale)',
     ylab='Variance raw counts (log scale)',
     main="Raw counts (filtered)",cex=.1)
plot(rowMeans(log2(cf +1)),rowVar(log2(cf +1)),
     xlab='Mean log-transformed counts',
     ylab='Variance log-transformed counts',
     main="Log-transformed counts",cex=.1)
plot(rowMeans(log2_cc),rowVar(log2_cc),
     xlab="Mean log CPM",
     ylab='Variance log CPM',main="Log CPM",cex=.1)
plot(rowMeans(log2_ct),rowVar(log2_ct),
     xlab="Mean log TPM",
     ylab='Variance log TPM',main="Log TPM",cex=.1)
plot(rowMeans(log2_cd),rowVar(log2_cd),
     xlab='Mean DESeq2',ylab='Variance DESeq2',main="DESeq2",cex=.1)
plot(rowMeans(cv),rowVar(cv),
     xlab='Mean VST-transformed',
     ylab='Variance VST-transformed',main="VST",cex=.1)
rafalib::mypar(mar=c(6,2.5,2.5,1))
```

For RNA-seq data, as the mean count value increases, the variance increases. There is a strong almost linear relationship as seen in the figures. The statistical methods such as PCA expects similar variance across the range of mean values. If not, the higher variance genes will contribute more than the lower variance genes. Such data is said to be heteroscedastic and needs to be corrected. One option is log transformation (with pseudocount), but this tends to inflate the contribution of the low variance genes. To obtain similar variance across the whole range of mean values, DESeq2 offers two methods VST (variance stabilising transformation) and RLOG (regularised log transformation).

As the name suggests, VST transformation stabilizes variance across the whole range of count values. VST is recommended for clustering or visualisation. It is not intended for differential gene expression. If the size factors vary dramatically between samples, then RLOG transformation is recommended. A comparable approach is voom transformation from the R package limma.

## Conclusion

Finally, we can compare all of the various transformations in a single plot.

```{r,fig.height=3.5,fig.width=9}
rafalib::mypar(1,4,mar=c(6,2.5,2.5,1) )
boxplot(as.matrix(log2(cc+1)),ylab=expression('Log'[2]~'Read counts'),las=2,main="Log CPM", ylim = c(0,16))
boxplot(as.matrix(log2(ct+1)),ylab=expression('Log'[2]~'Read counts'),las=2,main="Log TPM", ylim = c(0,16))
boxplot(as.matrix(log2(cd+1)),ylab=expression('Log'[2]~'Read counts'),las=2,main="DESeq2", ylim = c(0,16))
boxplot(as.matrix(cv),ylab=expression('Log'[2]~'Read counts'),las=2,main="VST", ylim = c(0,16))
rafalib::mypar(mar=c(6,2.5,2.5,1))
```

At this point, we can save the filtered data.

```{r}
# write.csv(cc,"data/counts_cpm.csv",quote=F)
write.csv(ct,"data/counts_tpm.csv",quote=F)
# write.csv(cd,"data/counts_deseq2.csv",quote=F)
# write.csv(cv,"data/counts_vst.csv",quote=F)
```

<i class="fas fa-comments"></i> Would it be possible to have one perfect normalisation method for all types of analyses? Is there any drawback to using gene length corrected counts in differential gene expression analyses?

***