-
Notifications
You must be signed in to change notification settings - Fork 3
/
36_CummeRbund.Rmd
314 lines (264 loc) · 9.85 KB
/
36_CummeRbund.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
Data analysis of CuffDiff output using CummeRbund
===================================================
Install packages
---------------------------------------------------
```{r}
#
# http://bioconductor.org/packages/devel/bioc/vignettes/cummeRbund/inst/doc/cummeRbund-manual.pdf
# http://compbio.mit.edu/cummeRbund/manual_2_0.html
# http://compbio.mit.edu/cummeRbund/manual_2_0.html#sthash.QPb8ZXGK.dpuf
#source("http://bioconductor.org/biocLite.R")
#biocLite("cummeRbund")
#install.packages('RSQLite')
#install.packages('ggplot2')
#install.packages('reshape2')
#install.packages('plyr')
#install.packages('fastcluster')
#install.packages('rtracklayer')
#install.packages('Gviz')
#install.packages('BiocGenerics')
#install.packages('Hmisc')
```
Load the library and get the directory with the data
```{r}
library(cummeRbund)
setwd('~/Desktop/temp/04_CuffDiff_withLabel/')
dir()
```
```{r}
cuff<-readCufflinks()
cuff
```
Add annotations
```{r}
annot<-read.table("genes_annotation_unique.tab",sep="\t")
#head(annot[,2])
#addFeatures(cuff,annot,level="genes")
```
Global statistics and Quality Control
```{r}
disp<-dispersionPlot(genes(cuff))
disp
```
The squared coefficient of variation is a normalized measure of cross-replicate variability that can be useful for evaluating the quality your RNA-seq data. Differences in CV2 can result in lower numbers of differentially expressed genes due to a higher degree of variability between replicate fpkm estimates.
```{r}
genes.scv<-fpkmSCVPlot(genes(cuff))
fpkmSCVPlot(genes(cuff))
isoforms.scv<-fpkmSCVPlot(isoforms(cuff))
fpkmSCVPlot(isoforms(cuff))
```
To assess the distributions of FPKM scores across samples, you can use the csDensity plot (Figure 1).
```{r}
dens<-csDensity(genes(cuff))
dens
densRep<-csDensity(genes(cuff),replicates=T)
densRep
#Boxplots can be visualized using the csBoxplot method (Figure 2).
b<-csBoxplot(genes(cuff))
b
brep<-csBoxplot(genes(cuff),replicates=T)
brep
#A matrix of pairwise scatterplots can be drawn using the csScatterMatrix() method.
s<-csScatterMatrix(genes(cuff))
s
```
Individual Pairwise comparisons can be made by using csScatter . You must specify the sample names to use for
the x and y axes
```{r}
s<-csScatter(genes(cuff),"Pad","Ripe",smooth=T)
s
dend<-csDendro(genes(cuff))
dend.rep<-csDendro(genes(cuff),replicates=T)
```
MvsA plots can be useful to determine any systematic bias that may be present between conditions. The CuffData
method MAplot() can be used to examine these intensity vs fold-change plots. You must specify the sample names
to use for the pairwise comparison with x and y:
```{r}
m<-MAplot(genes(cuff),"Pad","Ripe")
m
mCount<-MAplot(genes(cuff),"Pad","Ripe",useCount=T)
mCount
```
Volcano plots are also available for the CuffData objects.
```{r}
v<-csVolcanoMatrix(genes(cuff))
v
#For individual pairwise comparisons, you must specify the comparisons by sample name.
v<-csVolcano(genes(cuff),"Pink","Ripe")
v
```
6 Accessing Data
Cuffdiff run information
Run-level information such as run parameters, and sample information can be accessed from a CuffSet object by
using the runInfo and replicates methods:
```{r}
runInfo(cuff)
replicates(cuff)
Features/Annotation
```
Feature-level information can be accessed directly from a CuffData object using the fpkm, repFpkm, count, diffData,
or annotation methods:
```{r}
gene.features<-annotation(genes(cuff))
head(gene.features)
```
```{r}
gene.fpkm<-fpkm(genes(cuff))
head(gene.fpkm)
gene.repFpkm<-repFpkm(genes(cuff))
head(gene.repFpkm)
gene.counts<-count(genes(cuff))
head(gene.counts)
gene.diff<-diffData(genes(cuff))
head(gene.diff)
```
Condition and feature names
Vectors of sample names and feature names are available by using the samples and featureNames methods:
```{r}
sample.names<-samples(genes(cuff))
head(sample.names)
gene.featurenames<-featureNames(genes(cuff))
head(gene.featurenames)
```
Convenience functions
To facilitate Bioconductor-like operations, an 'FPKM-matrix' can be returned easily using the fpkmMatrix method:
```{r}
gene.matrix<-fpkmMatrix(genes(cuff))
head(gene.matrix)
```
A matrix of replicate FPKM values can be retrieved by using repFpkmMatrix
```{r}
gene.rep.matrix<-repFpkmMatrix(genes(cuff))
head(gene.rep.matrix)
```
Similarly, a matrix of normalized counts can be generated by using countMatrix
```{r}
gene.count.matrix<-countMatrix(genes(cuff))
head(gene.count.matrix)
```
7 Creating Gene Sets
Gene Sets (stored in a CuffGeneSet object) can be created using the getGenes method on a CuffSet object. You
must first create a vector of 'gene id' or 'gene short name' values to identify the genes you wish to select:
```{r}
data(sampleData)
myGeneIds<-sampleIDs
myGeneIds
# ABA
myGeneIds<-c('CUFF.48633','CUFF.23339')
myGeneIds
myGenes<-getGenes(cuff,myGeneIds)
myGenes
expressionBarplot(myGenes) + ylab("number of subjects") +
labs(fill="Sample types")
expressionPlot(myGenes)
```
7.1 Geneset level plots
There are several plotting functions available for gene-set-level visualization:
The csHeatmap() function is a plotting wrapper that takes as input either a CuffGeneSet or a CuffFeatureSet
object (essentially a collection of genes and/or features) and produces a heatmap of FPKM expression values. The
'cluster' argument can be used to re-order either 'row', 'column', or 'both' dimensions of this matrix. By default, the
Jensen-Shannon distance is used as the clustering metric, however, any function that produces a dist object can be
passed to the 'cluster' argument as well.
```{r}
h<-csHeatmap(myGenes,cluster='both')
h
h.rep<-csHeatmap(myGenes,cluster='both',replicates=T)
h.rep
```
If you prefer barplots over heatmaps for genesets (although this is not necessarily recommended for large gene
sets). You can use the expressionBarplot() method on a CuffFeatureSet or a CuffGeneSet object.
```{r}
b<-expressionBarplot(myGenes, xlab="1,2,3")
b
#The csScatter() method can be used to produce scatter plot comparisons between any two conditions.
s<-csScatter(myGenes,"Pink","Ripe",smooth=T)
s
#The volcano plot is a useful visualization to compare fold change between any two conditions and significance(-log P-values).
v<-csVolcano(myGenes,"Pink","Ripe")
v
```
Similar plots can be made for all sub-level features of a CuGeneSet class by specifying which slot you would like
to plot (eg. isoforms(myGenes),TSS(myGenes),CDS(myGenes)).
```{r}
ih<-csHeatmap(isoforms(myGenes),cluster='both',labRow=F)
ih
th<-csHeatmap(TSS(myGenes),cluster='both,',labRow=F)
th
```
8 Individual Genes
An individual CuffGene object can be created by using the getGene() function for a given 'gene id' or 'gene short name'.
As of cummeRbund v2:0 you can also use isoform id, tss group id, or cds id values to retrieve the corresponding
parent gene object.
```{r}
head(myGeneIds)
myGeneId<-"CUFF.51383" ## splicing variant
myGene<-getGene(cuff,myGeneId)
myGene
```
8.1 Gene-level plots
```{r}
gl<-expressionPlot(myGene)
gl
gl.rep<-expressionPlot(myGene,replicates=TRUE)
gl.rep
gl.iso.rep<-expressionPlot(isoforms(myGene),replicates=T)
gl.iso.rep
gl.cds.rep<-expressionPlot(CDS(myGene),replicates=T)
gl.cds.rep
gb<-expressionBarplot(myGene)
gb
gb.rep<-expressionBarplot(myGene,replicates=T)
gb.rep
igb<-expressionBarplot(isoforms(myGene),replicates=T)
igb
```
9.1 Overview of significant features
The sigMatrix() function can provide you with a \quick{and{dirty" view of the number of signifcant features of a
particular type, and at a given alpha (0:05 by default). For example:
```{r}
sigMatrix(cuff,level='genes',alpha=0.05)
```
9.3 Exploring the relationships between conditions
9.3.1 Distance matrix
Similarities between conditions and/or replicates can provide useful insight into the relationship between various
groupings of conditions and can aid in identifying outlier replicates that do not behave as expected. cummeRbund
provides the csDistHeat() method to visualize the pairwise similarities between conditions:
```{r}
csDistHeat(genes(cuff))
csDistHeat(genes(cuff),replicates=T)
```
9.3.2 Dimensionality reduction
Dimensionality reduction is an informative approach for clustering and exploring the relationships between conditions.
It can be useful for feature selection as well as identifying the sources of variability within your data. To this end, we
have applied two dierent dimensionality reduction strategies in cummeRbund: principal component analysis (PCA)
and multi-dimensional scaling (MDS). We provide the two wrapper methods, PCAplot and MDSplot
```{r}
genes.PCA<-PCAplot(genes(cuff),"PC1","PC2")
genes.PCA
genes.MDS<-MDSplot(genes(cuff))
genes.MDS
genes.PCA.rep<-PCAplot(genes(cuff),"PC1","PC2",replicates=T)
genes.PCA.rep
genes.MDS.rep<-MDSplot(genes(cuff),replicates=T)
genes.MDS.rep
```
9.4 Partitioning
K-means clustering is a useful tool that can be helpful in identifying clusters of genes with similar expression profiles. In fact, these profiles are learned from the data during the clustering. csCluster() uses the pam() method from the clustering package to perform the partitioning around medoids. In this case however, the distance metric used by default is the Jensen-Shannon distance instead of the default Euclidean distance. Prior to performing this particular partitioning, the user must choose the number of clusters (K) into which the expression profiles should be divided.
```{r}
ic<-csCluster(myGenes,k=4)
head(ic$cluster)
#- See more at: http://compbio.mit.edu/cummeRbund/manual_2_0.html#sthash.xOG6Ej9O.dpuf
```
9.6 Finding similar genes
Another common question in large-scale gene expression analyses is 'How can I and genes with similar expression
probles to gene x?'. We have implemented a method, fndSimilar to allow you to identify a fxed number of the most
similar genes to a given gene of interest. For example, if you wanted to and the 20 genes most similar to "PINK1",
you could do the following:
```{r}
mySimilar<-findSimilar(cuff,"CUFF.11214",n=20)
mySimilar.expression<-expressionPlot(mySimilar,logMode=T,showErrorbars=F)
mySimilar.expression
```
```{r}
sessionInfo()
```