36_CummeRbund.Rmd

Data analysis of CuffDiff output using CummeRbund
===================================================


Install packages
---------------------------------------------------
```{r}
# 
# http://bioconductor.org/packages/devel/bioc/vignettes/cummeRbund/inst/doc/cummeRbund-manual.pdf
# http://compbio.mit.edu/cummeRbund/manual_2_0.html
# http://compbio.mit.edu/cummeRbund/manual_2_0.html#sthash.QPb8ZXGK.dpuf
#source("http://bioconductor.org/biocLite.R")
#biocLite("cummeRbund")
#install.packages('RSQLite')
#install.packages('ggplot2')
#install.packages('reshape2')
#install.packages('plyr')
#install.packages('fastcluster')
#install.packages('rtracklayer')
#install.packages('Gviz')
#install.packages('BiocGenerics')
#install.packages('Hmisc')
```

Load the library and get the directory with the data
```{r}
library(cummeRbund)
setwd('~/Desktop/temp/04_CuffDiff_withLabel/')
dir()
```

```{r}
cuff<-readCufflinks()
cuff
```

Add annotations
```{r}
annot<-read.table("genes_annotation_unique.tab",sep="\t")
#head(annot[,2])
#addFeatures(cuff,annot,level="genes")
```

Global statistics and Quality Control
```{r}
disp<-dispersionPlot(genes(cuff))
disp
```
The squared coefficient of variation is a normalized measure of cross-replicate variability that can be useful for evaluating the quality your RNA-seq data. Differences in CV2 can result in lower numbers of differentially expressed genes due to a higher degree of variability between replicate fpkm estimates.

```{r}
genes.scv<-fpkmSCVPlot(genes(cuff))
fpkmSCVPlot(genes(cuff))
isoforms.scv<-fpkmSCVPlot(isoforms(cuff))
fpkmSCVPlot(isoforms(cuff))
```


To assess the distributions of FPKM scores across samples, you can use the csDensity plot (Figure 1).
```{r}
dens<-csDensity(genes(cuff))
dens
densRep<-csDensity(genes(cuff),replicates=T)
densRep
#Boxplots can be visualized using the csBoxplot method (Figure 2).
b<-csBoxplot(genes(cuff))
b
brep<-csBoxplot(genes(cuff),replicates=T)
brep
#A matrix of pairwise scatterplots can be drawn using the csScatterMatrix() method.
s<-csScatterMatrix(genes(cuff))
s
```

Individual Pairwise comparisons can be made by using csScatter . You must specify the sample names to use for
the x and y axes

```{r}
s<-csScatter(genes(cuff),"Pad","Ripe",smooth=T)
s
dend<-csDendro(genes(cuff))
dend.rep<-csDendro(genes(cuff),replicates=T)
```

MvsA plots can be useful to determine any systematic bias that may be present between conditions. The CuffData
method MAplot() can be used to examine these intensity vs fold-change plots. You must specify the sample names
to use for the pairwise comparison with x and y:

```{r}
m<-MAplot(genes(cuff),"Pad","Ripe")
m
mCount<-MAplot(genes(cuff),"Pad","Ripe",useCount=T)
mCount
```

Volcano plots are also available for the CuffData objects.
```{r}
v<-csVolcanoMatrix(genes(cuff))
v
#For individual pairwise comparisons, you must specify the comparisons by sample name.
v<-csVolcano(genes(cuff),"Pink","Ripe")
v
```

6 Accessing Data
Cuffdiff run information
Run-level information such as run parameters, and sample information can be accessed from a CuffSet object by
using the runInfo and replicates methods:

```{r}
runInfo(cuff)
replicates(cuff)
Features/Annotation
```
Feature-level information can be accessed directly from a CuffData object using the fpkm, repFpkm, count, diffData,
or annotation methods:
```{r}
gene.features<-annotation(genes(cuff))
head(gene.features)
```

```{r}
gene.fpkm<-fpkm(genes(cuff))
head(gene.fpkm)
gene.repFpkm<-repFpkm(genes(cuff))
head(gene.repFpkm)
gene.counts<-count(genes(cuff))
head(gene.counts)
gene.diff<-diffData(genes(cuff))
head(gene.diff)
```


Condition and feature names
Vectors of sample names and feature names are available by using the samples and featureNames methods:
```{r}
sample.names<-samples(genes(cuff))
head(sample.names)
gene.featurenames<-featureNames(genes(cuff))
head(gene.featurenames)
```


Convenience functions
To facilitate Bioconductor-like operations, an 'FPKM-matrix' can be returned easily using the fpkmMatrix method:
```{r}
gene.matrix<-fpkmMatrix(genes(cuff))
head(gene.matrix)
```

A matrix of replicate FPKM values can be retrieved by using repFpkmMatrix
```{r}
gene.rep.matrix<-repFpkmMatrix(genes(cuff))
head(gene.rep.matrix)
```

Similarly, a matrix of normalized counts can be generated by using countMatrix
```{r}
gene.count.matrix<-countMatrix(genes(cuff))
head(gene.count.matrix)
```

7 Creating Gene Sets
Gene Sets (stored in a CuffGeneSet object) can be created using the getGenes method on a CuffSet object. You
must first create a vector of 'gene id' or 'gene short name' values to identify the genes you wish to select:
```{r}
data(sampleData)
myGeneIds<-sampleIDs
myGeneIds
# ABA
myGeneIds<-c('CUFF.48633','CUFF.23339')
myGeneIds
myGenes<-getGenes(cuff,myGeneIds)
myGenes
expressionBarplot(myGenes) + ylab("number of subjects") +
   labs(fill="Sample types")
expressionPlot(myGenes)
```


7.1 Geneset level plots
There are several plotting functions available for gene-set-level visualization:
The csHeatmap() function is a plotting wrapper that takes as input either a CuffGeneSet or a CuffFeatureSet
object (essentially a collection of genes and/or features) and produces a heatmap of FPKM expression values. The
'cluster' argument can be used to re-order either 'row', 'column', or 'both' dimensions of this matrix. By default, the
Jensen-Shannon distance is used as the clustering metric, however, any function that produces a dist object can be
passed to the 'cluster' argument as well.

```{r}
h<-csHeatmap(myGenes,cluster='both')
h
h.rep<-csHeatmap(myGenes,cluster='both',replicates=T)
h.rep
```


If you prefer barplots over heatmaps for genesets (although this is not necessarily recommended for large gene
sets). You can use the expressionBarplot() method on a CuffFeatureSet or a CuffGeneSet object.

```{r}
b<-expressionBarplot(myGenes, xlab="1,2,3")
b
#The csScatter() method can be used to produce scatter plot comparisons between any two conditions.
s<-csScatter(myGenes,"Pink","Ripe",smooth=T)
s
#The volcano plot is a useful visualization to compare fold change between any two conditions and significance(-log P-values).
v<-csVolcano(myGenes,"Pink","Ripe")
v
```

Similar plots can be made for all sub-level features of a CuGeneSet class by specifying which slot you would like
to plot (eg. isoforms(myGenes),TSS(myGenes),CDS(myGenes)).
```{r}
ih<-csHeatmap(isoforms(myGenes),cluster='both',labRow=F)
ih
th<-csHeatmap(TSS(myGenes),cluster='both,',labRow=F)
th
```

8 Individual Genes
An individual CuffGene object can be created by using the getGene() function for a given 'gene id' or 'gene short name'.
As of cummeRbund v2:0 you can also use isoform id, tss group id, or cds id values to retrieve the corresponding
parent gene object.
```{r}
head(myGeneIds)
myGeneId<-"CUFF.51383" ## splicing variant
myGene<-getGene(cuff,myGeneId)
myGene
```

8.1 Gene-level plots
```{r}
gl<-expressionPlot(myGene)
gl
gl.rep<-expressionPlot(myGene,replicates=TRUE)
gl.rep
gl.iso.rep<-expressionPlot(isoforms(myGene),replicates=T)
gl.iso.rep
gl.cds.rep<-expressionPlot(CDS(myGene),replicates=T)
gl.cds.rep

gb<-expressionBarplot(myGene)
gb
gb.rep<-expressionBarplot(myGene,replicates=T)
gb.rep
igb<-expressionBarplot(isoforms(myGene),replicates=T)
igb

```


9.1 Overview of significant features
The sigMatrix() function can provide you with a \quick{and{dirty" view of the number of signifcant features of a
particular type, and at a given alpha (0:05 by default). For example:
```{r}
sigMatrix(cuff,level='genes',alpha=0.05)
```

9.3 Exploring the relationships between conditions
9.3.1 Distance matrix
Similarities between conditions and/or replicates can provide useful insight into the relationship between various
groupings of conditions and can aid in identifying outlier replicates that do not behave as expected. cummeRbund
provides the csDistHeat() method to visualize the pairwise similarities between conditions:
```{r}
csDistHeat(genes(cuff))
csDistHeat(genes(cuff),replicates=T)
```


9.3.2 Dimensionality reduction
Dimensionality reduction is an informative approach for clustering and exploring the relationships between conditions.
It can be useful for feature selection as well as identifying the sources of variability within your data. To this end, we
have applied two dierent dimensionality reduction strategies in cummeRbund: principal component analysis (PCA)
and multi-dimensional scaling (MDS). We provide the two wrapper methods, PCAplot and MDSplot

```{r}
genes.PCA<-PCAplot(genes(cuff),"PC1","PC2")
genes.PCA
genes.MDS<-MDSplot(genes(cuff))
genes.MDS
genes.PCA.rep<-PCAplot(genes(cuff),"PC1","PC2",replicates=T)
genes.PCA.rep
genes.MDS.rep<-MDSplot(genes(cuff),replicates=T)
genes.MDS.rep
```


9.4  Partitioning
K-means clustering is a useful tool that can be helpful in identifying clusters of genes with similar expression profiles. In fact, these profiles are learned from the data during the clustering. csCluster() uses the pam() method from the clustering package to perform the partitioning around medoids. In this case however, the distance metric used by default is the Jensen-Shannon distance instead of the default Euclidean distance. Prior to performing this particular partitioning, the user must choose the number of clusters (K) into which the expression profiles should be divided.
 
 ```{r}
ic<-csCluster(myGenes,k=4)
head(ic$cluster)
#- See more at: http://compbio.mit.edu/cummeRbund/manual_2_0.html#sthash.xOG6Ej9O.dpuf
```

9.6 Finding similar genes
Another common question in large-scale gene expression analyses is 'How can I and genes with similar expression
probles to gene x?'. We have implemented a method, fndSimilar to allow you to identify a fxed number of the most
similar genes to a given gene of interest. For example, if you wanted to and the 20 genes most similar to "PINK1",
you could do the following:

```{r}
mySimilar<-findSimilar(cuff,"CUFF.11214",n=20)
mySimilar.expression<-expressionPlot(mySimilar,logMode=T,showErrorbars=F)
mySimilar.expression
 ```


```{r}
sessionInfo()
```