Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update methylation.Rmd #107

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
69 changes: 5 additions & 64 deletions methyl/methylation.Rmd
Original file line number Diff line number Diff line change
@@ -1,44 +1,28 @@
---
layout: page
title: Analyzing DNA methylation data
---


```{r options, echo=FALSE}
library(knitr)
opts_chunk$set(fig.path=paste0("figure/", sub("(.*).Rmd","\\1",basename(knitr:::knit_concord$get('infile'))), "-"))
```


In this unit we will show an example of analyzing methylation data. We will use colon cancer data from TCGA. The data was created with the Illumina 450K array and we have already processed the raw data to create matrix with methylation measurements. The script that creates these ojects is here: https://github.com/genomicsclass/labs/blob/master/Rscripts/read_tcga_meth.R

Let's begin by loading the data
loading the data
```{r,message=FALSE}
# devtools::install_github("genomicsclass/coloncancermeth")
library(S4Vectors)
library(coloncancermeth)
data(coloncancermeth)
```

We know have three tables one containing the methylation data, one with information about the samples or columns of the data matrix, and granges object with the genomic location of the CpGs represetned in the rows of the data matrix

```{r}
dim(meth) ##this is the methylation data
dim(pd) ##this is sample information
length(gr)
```

The `pd` object includes clinical information. One the columns tells us if the sample is from colon cancer or from normal tissue

```{r}
colnames(pd)
table(pd$Status)
normalIndex <- which(pd$Status=="normal")
cancerlIndex <- which(pd$Status=="cancer")
```


Let's start by taking a quick look at the distribution of methylation measurements for the normal samples:

```{r}
i=normalIndex[1]
plot(density(meth[,i],from=0,to=1),main="",ylim=c(0,3),type="n")
Expand All @@ -50,25 +34,16 @@ for(i in cancerlIndex){
lines(density(meth[,i],from=0,to=1),col=2)
}
```

We are interested in finding regions of the genome that are different between cancer and normal samples. Furthermore, we want regions that are consistenly different therefore we can treat this as an inference problem. We can compute a t-statistic for each CpG:

```{r,message=FALSE}
library(limma)
X<-model.matrix(~pd$Status)
fit<-lmFit(meth,X)
eb <- eBayes(fit)
```

A volcano plot reveals many differences:

```{r}
library(rafalib)
splot(fit$coef[,2],-log10(eb$p.value[,2]),xlab="Effect size",ylab="-log10 p-value")
```

If we have reason to believe for DNA methylation to have an effect on gene expression a region of the genome needs to be affected, not just a single CpG, we should look beyond. Here is plot of the region surrounding the top hit:

```{r,message=FALSE}
library(GenomicRanges)
i <- which.min(eb$p.value[,2])
Expand All @@ -81,58 +56,28 @@ pos=start(gr)
plot(pos[Index],fit$coef[Index,2],type="b",xlab="genomic location",ylab="difference")
matplot(pos[Index],meth[Index,],col=cols,xlab="genomic location")
```

We can search for these regions explicitly instead of searching for single points, as explained by Jaffe and Irizarry (2012) [http://www.ncbi.nlm.nih.gov/pubmed/22422453].

If we are going to perform regional analysis we first have to define a region. But one issue is that not only do we have to separate the analysis by chromosome but that within each chromosome we usually have big gaps creating subgroups of regions to be analyzed.

```{r}
chr1Index <- which(chr=="chr1")
hist(log10(diff(pos[chr1Index])),main="",xlab="log 10 method")
```

We can create groups in the following way.

```{r,message=FALSE}
# BiocManager::install("bumphunter")
library(bumphunter)
cl=clusterMaker(chr,pos,maxGap=500)
table(table(cl)) ##shows the number of regions with 1,2,3, ... points in them
```


Now let's consider two example regions:

```{r}
###Select the region with the smallest value
Index<- which(cl==cl[which.min(fit$coef[,2])])
matplot(pos[Index],meth[Index,],col=cols,pch=1,xlab="genomic location",ylab="methylation")

x1=pos[Index]
y1=fit$coef[Index,2]
plot(x1,y1,xlab="genomic location",ylab="Methylation difference",ylim=c(-1,1))
abline(h=0,lty=2)
abline(h=c(-.1,.1),lty=2)
```

This region shows only a single CpG as different. In contrast, notice this region:

```{r}
Index=which(cl==72201) ##we know this is a good example from analysis we have already performed

##Example this region
Index=which(cl==72201)
matplot(pos[Index],meth[Index,],col=cols,pch=1,xlab="genomic location",ylab="methylation")

x2=pos[Index]
y2=fit$coef[Index,2]
plot(x2,y2,xlab="genomic location",ylab="Methylation difference",ylim=c(-1,1))
abline(h=0,lty=2)
abline(h=c(-.1,.1),lty=2)
```

<a name="DMR"></a>

If we are interested in prioritizing regions over single points, we need an alternative approach. If we assume that the real signal is smooth, we could use statistical smoothing techniques such as loess. Here is an example two regions above

```{r}
lfit <- loess(y1~x1,degree=1,family="symmetric",span=1/2)
plot(x1,y1,xlab="genomic location",ylab="Methylation difference",ylim=c(-1,1))
Expand All @@ -145,9 +90,6 @@ abline(h=c(-.1,0,.1),lty=2)
lines(x2,lfit$fitted,col=2)
```


The bumphunter automates this procedure:

```{r}
res<-bumphunter(meth,X,chr=chr,pos=pos,cluster=cl,cutoff=0.1,B=0)
tab<-res$table
Expand All @@ -162,5 +104,4 @@ plot(pos[Index],res$fitted[Index,1],xlab="genomic location",ylab="Methylation di
abline(h=c(-0.1,0,.1),lty=2)
```

The function also allows from smoothing and permutation based inference for the regions. However, we do not recommend running the function with these options without the ability to parallelize.

The function also allows from smoothing and permutation based inference for the regions.