-
Notifications
You must be signed in to change notification settings - Fork 14
/
lab_preprocessing.Rmd
387 lines (296 loc) · 14.3 KB
/
lab_preprocessing.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
---
title: "Data preprocessing"
subtitle: Workshop on RNA-Seq
editor_options:
chunk_output_type: console
output:
bookdown::html_document2:
highlight: textmate
toc: true
toc_float:
collapsed: true
smooth_scroll: true
print: false
toc_depth: 4
number_sections: true
df_print: default
code_folding: none
self_contained: false
keep_md: false
encoding: 'UTF-8'
css: "assets/lab.css"
include:
after_body: assets/footer-lab.html
---
```{r,child="assets/header-lab.Rmd"}
```
<div class="boxy boxy-exclamation boxy-yellow">
Set `~/RNAseq/labs/` as your working directory (i.e. your working directory should be the directory where you saved this .Rmd file). Create a subdirectory named `data` in this directory (`~/RNAseq/labs/data/`) for input and output files.
</div>
Data preprocessing is done in R. Load the necessary R packages and source the download function.
```{r}
# plotting
library(DESeq2) # rna-seq
library(stringr)
library(rafalib) # nice plot arrangement
rafalib::mypar(mar=c(6,2.5,2.5,1)) #sets nice arrangement for the whole document
# source download function
source("https://raw.githubusercontent.com/NBISweden/workshop-RNAseq/master/assets/scripts.R")
```
<br>
# Preparation
The first step is some data wrangling and clean-up to prepare the data for analyses. Download the count table generated by featureCounts (named **gene_counts_original.tsv** in the repository) into the `data` subdirectory.
```{r}
download_data("data/gene_counts_original.tsv")
```
Download the metadata table.
```{r}
download_data("data/metadata_original.csv")
```
Read in the files to R.
```{r}
co <- read.delim("data/gene_counts_original.tsv",
sep = "\t",
header = TRUE,
stringsAsFactors = FALSE,
comment.char = "#")
mo <- read.csv("data/metadata_original.csv",
header = TRUE)
```
Inspect how the files look like.
```{r}
head(co)
head(mo)
```
We need the count table to have only the counts (numeric data) with gene IDs as row names. And the metadata table must only include information for the samples that are in the count table. And in the same order.
```{r}
cr <- co[,7:12]
rownames(cr) <- co$Geneid
colnames(cr) <- substr(colnames(cr),42,43)
mr <- mo[mo$No %in% as.integer(colnames(cr)),]
colnames(cr) <- mr$SampleName
rownames(mr) <- mr$SampleName
```
Inspect the new count table and metadata table. Ensure that labels match.
```{r}
head(cr)
mr
all.equal(colnames(cr),rownames(mr))
```
Finally save the data in the **data** directory.
```{r}
write.csv(cr,file="data/gene_counts_raw.csv",quote=FALSE)
write.csv(mr,file="data/metadata_raw.csv",quote=FALSE)
```
# Filtering
The next step is to remove samples and genes that are uninformative. We read in the count table (*You can skip this step if continuing from the previous step*).
```{r}
download_data("data/gene_counts_raw.csv")
cr <- read.csv("data/gene_counts_raw.csv",
header = TRUE,
stringsAsFactors = FALSE,
row.names = 1)
head(cr)
str(cr)
```
The count data shows read counts across samples and genes. The columns denote samples and rows denote genes.
Read in the metadata (*You can skip this step if continuing from the previous step*). Each row corresponds to a sample.
```{r}
download_data("data/metadata_raw.csv")
mr <- read.csv("data/metadata_raw.csv",
header = TRUE,
stringsAsFactors = FALSE,
row.names = 1)
head(mr)
str(mr)
```
The most relevant metadata columns are **sampleName** and **Group**. It is important to check that the number of columns of data match the number of rows of metadata. And that the column names of data match the row names of metadata.
```{r}
all.equal(colnames(cr),rownames(mr))
```
Let's visualise the distribution of counts using a boxplot and density plot.
```{r}
rafalib::mypar(1,2, mar=c(6,3,3,2))
boxplot(log2(as.matrix(cr)+1),
ylab = expression('Log'[2]~'Read counts'), las=2, main="Raw data")
hist(log2(as.matrix(cr)+1),
ylab = "", las = 2, main = "Raw data")
par(mfrow=c(1,1))
```
On the boxplot, the median values are zero across all samples. This means that half the values in each sample are zeros. On the histogram, we see a huge peak of zeros. This data set would benefit from a low count filtering.
We can check if any samples need to be discarded based on the number of genes detected. We create a barplot of genes detected across samples.
```{r}
barplot(colSums(cr > 3), las = 2) # Remove ylab from here
mtext("Number of detected genes", side = 2, line = 3) # Add ylab manually
abline(h = median(colSums(cr>3)))
```
On average, about `r sprintf("%0.5f",mean(colSums(cr>3)))` genes are detected. All samples are more or less close to the average. None of the samples look bad enough to be removed.
<i class="fas fa-comments"></i> What does `cr>3` do? Why did we use 3? Is it better than using `cr>0`?
And we can create a similar plot for detection rate across genes.
```{r}
barplot(rowSums(cr>3), xlab = "Genes", ylab = "Number of samples", names.arg = "")
abline(h = median(rowSums(cr>3)), col="red")
```
This is hard to see. It's perhaps easier to plot a histogram.
```{r}
hist(rowSums(cr>3))
```
There are a lot of genes that are not expressed (ie; zero on x-axis) in any sample. These can be removed completely. We are mostly interested in the peak on the right. These are genes that are expressed in all samples. We don't want to be too stringent, so we will choose to keep genes that are expressed in at least 3 samples since our groups have 3 samples each.
Below, rather than using zero as the minimum value for detection, we used minimum of 5 reads in at least 3 samples (since each of test groups consist of 3 samples).
```{r}
# remove genes with low counts
keep_genes <- rowSums( cr > 5 ) >= 3
cf <- cr[keep_genes,]
```
<i class="fas fa-comments"></i> How would the results change if we used total number of samples (ie; 6 for this dataset) in the code above? Are there any drawbacks to doing that?
Distribution of the filtered counts looks like below. Compare this to the previous boxplot above.
```{r}
boxplot(log2(as.matrix(cf)+1),ylab=expression('Log'[2]~'Read counts'),las=2,main="Filtered data")
```
In addition, compare the histogram of filtered counts below to the raw data above.
```{r}
hist(rowSums(cf>3))
```
The missingness in the data set is reduced. The filtering process has removed `r nrow(cr)-nrow(cf)` genes with low counts.
Since no samples were discarded, the metadata file will remain the same. And we can check that the labels are in the same order in counts and metadata.
```{r}
all.equal(colnames(cf),rownames(mr))
```
At this point, we can save the filtered data.
```{r}
write.csv(cf, "data/counts_filtered.csv", quote = F)
```
# Normalisation
The raw count data needs to be corrected for various biases before statistical inference. If the dataset is to be used in an R package for differential gene expression such as **DESeq2**, **edgeR** or **Limma**, you must provide the **raw data** directly. This is because, these packages handle the correction and transformation internally. In addition, these packages do not control for gene length. Therefore, for custom analyses and gene-to-gene comparison, the raw data needs to be normalised.
## CPM/TPM
For analysis other than DGE, the data set must be corrected before use. The most basic correction required is sequencing depth. This is achieved using rescaling the counts to counts per 1 million (CPM).
```{r}
download_data("data/counts_filtered.csv")
cf <- read.csv("data/counts_filtered.csv",
stringsAsFactors = F,
row.names = 1)
download_data("data/metadata_raw.csv")
if(!exists("mr")) mr <- read.csv("data/metadata_raw.csv",
stringsAsFactors=F,
row.names=1)
all.equal(colnames(cf),rownames(mr))
```
```{r}
cc <- cf / colSums(cf) * 1e6
log2_cc <- log2( cc + 1 )
boxplot(log2_cc,ylab=expression('Log'[2]~'Read counts'), las=2, main="Log2 CPM")
```
But, CPM data has some drawbacks. It is not suitable for within-sample comparisons. The total number of reads per sample varies from sample to sample. This also makes it harder to compare one experiment to another. In addition, gene length is not controlled for in this correction. RPKM/FPKM normalisations correct for gene length, but they are not recommended because they are not comparable between samples.
A better correction method that resolves sequencing depth and gene length is TPM (transcripts-per-million). The code for computing TPM is simple.
```{r}
#' @title Compute TPM from a read count matrix
#' @param counts A numeric data.frame of read counts with samples (columns) and genes (rows).
#' @param len A vector of gene cds length equal to number of rows of dfr.
#'
#' https://support.bioconductor.org/p/91218/
#'
tpm <- function(counts,len) {
x <- counts/(len/1000)
return(x*1e6/colSums(x))
}
```
We read in the gene length information.
```{r}
co <- read.delim("data/gene_counts_original.tsv",
sep="\t",
header=TRUE,
stringsAsFactors=F,
comment.char="#")
g <- data.frame( ensembl_gene_id = co$Geneid ,
transcript_length = co$Length,
stringsAsFactors = F, row.names = co$Geneid)
g <- g[!duplicated(g$ensembl_gene_id),]
```
Next, we find shared genes between count data and annotation data and match their order.
```{r}
igenes <- intersect(rownames(cf),g$ensembl_gene_id)
g1 <- g[igenes,]
cf1 <- cf[igenes,]
all.equal(rownames(cf1),g1$ensembl_gene_id)
```
And then we run the `tpm()` function on the count data using the gene lengths. And then we create a boxplot of the resulting values.
```{r}
ct <- tpm(cf1,g1$transcript_length)
log2_ct <- log2( ct + 1 )
boxplot(log2_ct,ylab=expression('Log'[2]~'TPM'),las=2,main="Log2 TPM")
write.csv(ct,"data/counts_tpm.csv",quote=F)
```
This is the distribution of TPM counts.
## DESeq2
DESeq2 internally corrects counts for sequencing depth and RNA compositional bias using **Median of ratios** method. The details of this method are described further in the DGE lab. To run this method, we create a DESeq2 object using the count data and metadata.
```{r}
library(DESeq2)
mr$Group <- factor(mr$Group)
d <- DESeqDataSetFromMatrix(countData=cf,colData=mr,design=~Group)
d <- DESeq2::estimateSizeFactors(d,type="ratio")
cd <- counts(d,normalized=TRUE)
saveRDS(cd,"data/gene_counts_normalised_deseq2.Rds")
```
```{r}
cd <- readRDS("data/gene_counts_normalised_deseq2.Rds")
log2_cd <- log2(cd + 1)
boxplot(log2_cd,ylab=expression('Log'[2]~'Read counts'),las=2,main="DESeq2")
```
## VST
For the purpose of exploratory analysis such as MDS, PCA, clustering etc, VST (variance-stabilizing-transformation) is recommended. VST is also run using DESeq2. As in the previous step, a DESeq2 object is created.
```{r}
library(DESeq2)
mr$Group <- factor(mr$Group)
d <- DESeqDataSetFromMatrix(countData=cf,colData=mr,design=~Group)
d <- DESeq2::estimateSizeFactors(d,type="ratio")
d <- DESeq2::estimateDispersions(d)
cv <- as.data.frame(assay(varianceStabilizingTransformation(d,blind=T)),check.names=F)
write.csv(cv,"data/counts_vst.csv",quote=FALSE)
boxplot(cv,ylab=expression('Log'[2]~'Read counts'),las=2,main="VST")
```
The effect of VST transformation can be clearly seen in a mean vs variance plot.
```{r,fig.height=7,fig.width=7}
rowVar <- function(x) apply(x,1,var)
rafalib::mypar(mfrow=c(3,2))
plot(log2(rowMeans(cf)),log2(rowVar(cf)),
xlab='Mean raw counts (log scale)',
ylab='Variance raw counts (log scale)',
main="Raw counts (filtered)",cex=.1)
plot(rowMeans(log2(cf +1)),rowVar(log2(cf +1)),
xlab='Mean log-transformed counts',
ylab='Variance log-transformed counts',
main="Log-transformed counts",cex=.1)
plot(rowMeans(log2_cc),rowVar(log2_cc),
xlab="Mean log CPM",
ylab='Variance log CPM',main="Log CPM",cex=.1)
plot(rowMeans(log2_ct),rowVar(log2_ct),
xlab="Mean log TPM",
ylab='Variance log TPM',main="Log TPM",cex=.1)
plot(rowMeans(log2_cd),rowVar(log2_cd),
xlab='Mean DESeq2',ylab='Variance DESeq2',main="DESeq2",cex=.1)
plot(rowMeans(cv),rowVar(cv),
xlab='Mean VST-transformed',
ylab='Variance VST-transformed',main="VST",cex=.1)
rafalib::mypar(mar=c(6,2.5,2.5,1))
```
For RNA-seq data, as the mean count value increases, the variance increases. There is a strong almost linear relationship as seen in the figures. The statistical methods such as PCA expects similar variance across the range of mean values. If not, the higher variance genes will contribute more than the lower variance genes. Such data is said to be heteroscedastic and needs to be corrected. One option is log transformation (with pseudocount), but this tends to inflate the contribution of the low variance genes. To obtain similar variance across the whole range of mean values, DESeq2 offers two methods VST (variance stabilising transformation) and RLOG (regularised log transformation).
As the name suggests, VST transformation stabilizes variance across the whole range of count values. VST is recommended for clustering or visualisation. It is not intended for differential gene expression. If the size factors vary dramatically between samples, then RLOG transformation is recommended. A comparable approach is voom transformation from the R package limma.
## Conclusion
Finally, we can compare all of the various transformations in a single plot.
```{r,fig.height=3.5,fig.width=9}
rafalib::mypar(1,4,mar=c(6,2.5,2.5,1) )
boxplot(as.matrix(log2(cc+1)),ylab=expression('Log'[2]~'Read counts'),las=2,main="Log CPM", ylim = c(0,16))
boxplot(as.matrix(log2(ct+1)),ylab=expression('Log'[2]~'Read counts'),las=2,main="Log TPM", ylim = c(0,16))
boxplot(as.matrix(log2(cd+1)),ylab=expression('Log'[2]~'Read counts'),las=2,main="DESeq2", ylim = c(0,16))
boxplot(as.matrix(cv),ylab=expression('Log'[2]~'Read counts'),las=2,main="VST", ylim = c(0,16))
rafalib::mypar(mar=c(6,2.5,2.5,1))
```
At this point, we can save the filtered data.
```{r}
# write.csv(cc,"data/counts_cpm.csv",quote=F)
write.csv(ct,"data/counts_tpm.csv",quote=F)
# write.csv(cd,"data/counts_deseq2.csv",quote=F)
# write.csv(cv,"data/counts_vst.csv",quote=F)
```
<i class="fas fa-comments"></i> Would it be possible to have one perfect normalisation method for all types of analyses? Is there any drawback to using gene length corrected counts in differential gene expression analyses?
***