how to use the methylation data when to do boxplot ? #324

DengEr-1993 · 2024-08-08T15:47:47Z

Hello Alex,

I have a small question about RRBS methylation data.

It is my question that I post before.

My question this time is how to use RRBS methylation data to do a boxplot based on CpG sites values.

You are analysing RRBS data comparing two groups with two replicates.
You aggregated counts over* tiling windows*, which *sums up the
methylated/unmethylated
base counts over the *tilling windows accross genome (see code

methylKit/R/regionalize.R

Lines 121 to 181 in 388770d

setMethod("regionCounts", signature(object="methylRaw",regions="GRanges"),

function(object,regions,cov.bases,strand.aware,save.db=FALSE,...){

#require(GenomicRanges)

# sort regions

regions <- sortSeqlevels(regions)

regions <- sort(regions,ignore.strand=TRUE)

# overlap object with regions

# convert object to GRanges

if(!strand.aware){

g.meth=as(object,"GRanges")

strand(g.meth)="*"

mat=IRanges::as.matrix( findOverlaps(regions,g.meth ) )

#mat=matchMatrix( findOverlaps(regions,g.meth ) )

}else{

mat=IRanges::as.matrix( findOverlaps(regions,as(object,"GRanges")) )

#mat=matchMatrix( findOverlaps(regions,as(object,"GRanges")) )

}

#require(data.table)

# create a temporary data.table row ids from regions and counts from object

coverage=numCs=numTs=id=covered=NULL

df=data.frame(id = mat[, 1], getData(object)[mat[, 2], c(5, 6, 7)])

dt=data.table(df)

#dt=data.table(id=mat[,1],object[mat[,2],c(6,7,8)] ) worked with data.table 1.7.7

# use data.table to sum up counts per region

sum.dt=dt[,list(coverage=sum(coverage),

numCs =sum(numCs),

numTs =sum(numTs),covered=length(numTs)),by=id]

sum.dt=sum.dt[sum.dt$covered>=cov.bases,]

temp.df=as.data.frame(regions) # get regions to a dataframe

# look for values with "name" in it, eg. "tx_name" or "name"

# valuesList = names(values(regions))

# nameid = valuesList[grep (valuesList, pattern="name")]

#create id string for the new object to be returned

#ids have to be unique and we can not assume GRanges objects will

#have a name attribute

if("name" %in% names(temp.df))

{

new.ids=paste(temp.df[sum.dt$id,"seqnames"],temp.df[sum.dt$id,"start"],

temp.df[sum.dt$id,"end"],temp.df[sum.dt$id,"name"],sep=".")

}else{

new.ids=paste(temp.df[sum.dt$id,"seqnames"],temp.df[sum.dt$id,"start"],

temp.df[sum.dt$id,"end"],sep=".")

}

#create a new methylRaw object to return

new.data=data.frame(#id =new.ids,

chr =temp.df[sum.dt$id,"seqnames"],

start =temp.df[sum.dt$id,"start"],

end =temp.df[sum.dt$id,"end"],

strand =temp.df[sum.dt$id,"strand"],

coverage=sum.dt$coverage,

numCs =sum.dt$numCs,

numTs =sum.dt$numTs)

),
so methylation level per region would be the total CpG sites percentage
(sum(freqC)/sum(Coverage)).

But if I do a boxplot, I should calculate the mean methylation level of one dmr which may contain many CpGs.

For example, below is a dmr which contains 6 CpG sites. But because here it shows us the methylation percentage not freqC.

chr1.836516 chr1 836516 F 4 100.00 0
chr1.836543 chr1 836543 F 4 0.00 100
chr1.836941 chr1 836941 F 19 68.42 31.58
chr1.836942 chr1 836942 R 1 100.00 0
chr1.836953 chr1 836953 F 19 78.95 21.05
chr1.836954 chr1 836954 R 1 100.00 0

So if I do boxplot for this dmr,

Mean Methylation Level (Group 1) = (100.00+0.00+68.42+100.00+78.95+100.00)/6 =74.56 or 0.7456

However, this value must be different compared to sum(freqC)/sum(Coverage) , right ?

For sum(freqC)/sum(Coverage) ,

it's :

(4 * 1 + 4* 0+ .... 1* 1)/(4+4+19+1+19+1)= 0.70833333

Do you think I can do boxplot this way if I want to show all CpGs methylation level ?

Can I ignore this difference in some degree.

Because I think it is the only way to show the methylation level by boxplot.

I am looking forward to your response.

Best regards,
Xiangyi

alexg9010 · 2024-08-08T19:36:49Z

Hi Xiangyi,

I think when dealing with this low amount of coverage, a difference of 5% can safely be ignored. Your DMRs should show much higher differences to be significant.

If you are visualizing single DMRs, averaging the methylation per CpG would allow you to create a point for each CpG and is the only way to get the boxplot.

Using the mean per DMR gives you a weighted estimate of the methylation and is more biased towards higher coverage CpGs, but this seems reasonable given the low coverage. But you do not get methylation information per CpG, so this can not be used for the boxplot. However when reporting average methylation per DMR, this is probably the value you should use.

Best,
Alex

.

DengEr-1993 · 2024-08-09T08:58:29Z

Dear Alex,

Thank you so much for your response.

I think when dealing with this low amount of coverage, a difference of 5% can safely be ignored. Your DMRs should show much higher differences to be significant.

Well, I understand what you mean. Yes, we used a more strict criteria to do dmr calling. Here I just show you an example.

If you are visualizing single DMRs, averaging the methylation per CpG would allow you to create a point for each CpG and is the only way to get the boxplot.

I agree with you that it is only way to show all CpG cites methylation levels in a boxplot.

Using the mean per DMR gives you a weighted estimate of the methylation and is more biased towards higher coverage CpGs, but this seems reasonable given the low coverage. But you do not get methylation information per CpG, so this can not be used for the boxplot. However when reporting average methylation per DMR, this is probably the value you should use.

But I am a little confused. the mean per DMR you mean sum(freqC)/sum(Coverage) , right ? It can not be used for the boxplot.

I think I know it must be more complicated process of dmr calling.

when reporting average methylation per DMR, this is probably the value you should use

You mean average methylation per DMR of DMR calling, or sum(freqC)/sum(Coverage).

If so, I get you.

------------------------ Last question
For our data, we did paired-comparison DMR calling among 6 groups. So we got many DMRs in each comparisons.

And in order to see how much degree these DMRs show, we load data into IGV.

In some degree, we can see the difference, for example, one dmr, low methylation of A, compared to B, but also can be compared to the rest groups.

However, I was wondering if somebody may ask how do you know it is significant that this dmr in A versus other rest groups ?
This dmr may just comes from one paired comparison, for example, A versus B and it may show us really low methylation only in A but high or full methylation in B and the other 4 groups.

So we want to do a boxplot based on the same DMRs and use t.test to show significance between all comparisons.

Could you give me some advice on my last question if you understand my description.

Thanks again.

Best,
Xiangyi

alexg9010 · 2024-08-09T13:32:17Z

Hi Xiangyi, Correct, with "mean per DMR" I am talking about `sum(freqC)/sum(Coverage)` per DMR, in contrast to "averaging the methylation per CpG" within the DMR which would be `(numC1/Coverage1 + ... numCN/CoverageN) / N`. Concerning your question, what you suggest should be fine, since you get the significance between the groups using the t.test. Best, Alex Am Fr., 9. Aug. 2024 um 10:58 Uhr schrieb Charite Learner < ***@***.***>:

…

Dear Alex, Thank you so much for your response. I think when dealing with this low amount of coverage, a difference of 5% can safely be ignored. Your DMRs should show much higher differences to be significant. Well, I understand what you mean. Yes, we used a more strict criteria to do dmr calling. Here I just show you an example. If you are visualizing single DMRs, averaging the methylation per CpG would allow you to create a point for each CpG and is the only way to get the boxplot. I agree with you that it is only way to show all CpG cites methylation levels in a boxplot. Using the mean per DMR gives you a weighted estimate of the methylation and is more biased towards higher coverage CpGs, but this seems reasonable given the low coverage. But you do not get methylation information per CpG, so this can not be used for the boxplot. However when reporting average methylation per DMR, this is probably the value you should use. But I am a little confused. the mean per DMR you mean sum(freqC)/sum(Coverage) , right ? It can not be used for the boxplot. I think I know it must be more complicated process of dmr calling. when reporting average methylation per DMR, this is probably the value you should use You mean average methylation per DMR of DMR calling, or sum(freqC)/sum(Coverage). If so, I get you. ------------------------ Last question For our data, we did paired-comparison DMR calling among 6 groups. So we got many DMRs in each comparisons. And in order to see how much degree these DMRs show, we load data into IGV. In some degree, we can see the difference, for example, one dmr, low methylation of A, compared to B, but also can be compared to the rest groups. However, I was wondering if somebody may ask how do you know it is significant that this dmr in A versus other rest groups ? This dmr may just comes from one paired comparison, for example, A versus B and it may show us really low methylation only in A but high or full methylation in B and the other 4 groups. So we want to do a boxplot based on the same DMRs and use t.test to show significance between all comparisons. Could you give me some advice on my last question if you understand my description. Thanks again. Best, Xiangyi — Reply to this email directly, view it on GitHub <#324 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ADK7JD3VDECGH4Z2YQ5JEJDZQSAEVAVCNFSM6AAAAABMGYUAH2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENZXGQ4DKNZUGE> . You are receiving this because you commented.Message ID: ***@***.***>

DengEr-1993 · 2024-08-09T15:47:18Z

Hi Xiangyi, Correct, with "mean per DMR" I am talking about sum(freqC)/sum(Coverage) per DMR, in contrast to "averaging the methylation per CpG" within the DMR which would be (numC1/Coverage1 + ... numCN/CoverageN) / N. Concerning your question, what you suggest should be fine, since you get the significance between the groups using the t.test. Best, Alex Am Fr., 9. Aug. 2024 um 10:58 Uhr schrieb Charite Learner < @.>:
…
Dear Alex, Thank you so much for your response. I think when dealing with this low amount of coverage, a difference of 5% can safely be ignored. Your DMRs should show much higher differences to be significant. Well, I understand what you mean. Yes, we used a more strict criteria to do dmr calling. Here I just show you an example. If you are visualizing single DMRs, averaging the methylation per CpG would allow you to create a point for each CpG and is the only way to get the boxplot. I agree with you that it is only way to show all CpG cites methylation levels in a boxplot. Using the mean per DMR gives you a weighted estimate of the methylation and is more biased towards higher coverage CpGs, but this seems reasonable given the low coverage. But you do not get methylation information per CpG, so this can not be used for the boxplot. However when reporting average methylation per DMR, this is probably the value you should use. But I am a little confused. the mean per DMR you mean sum(freqC)/sum(Coverage) , right ? It can not be used for the boxplot. I think I know it must be more complicated process of dmr calling. when reporting average methylation per DMR, this is probably the value you should use You mean average methylation per DMR of DMR calling, or sum(freqC)/sum(Coverage). If so, I get you. ------------------------ Last question For our data, we did paired-comparison DMR calling among 6 groups. So we got many DMRs in each comparisons. And in order to see how much degree these DMRs show, we load data into IGV. In some degree, we can see the difference, for example, one dmr, low methylation of A, compared to B, but also can be compared to the rest groups. However, I was wondering if somebody may ask how do you know it is significant that this dmr in A versus other rest groups ? This dmr may just comes from one paired comparison, for example, A versus B and it may show us really low methylation only in A but high or full methylation in B and the other 4 groups. So we want to do a boxplot based on the same DMRs and use t.test to show significance between all comparisons. Could you give me some advice on my last question if you understand my description. Thanks again. Best, Xiangyi — Reply to this email directly, view it on GitHub <#324 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADK7JD3VDECGH4Z2YQ5JEJDZQSAEVAVCNFSM6AAAAABMGYUAH2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENZXGQ4DKNZUGE . You are receiving this because you commented.Message ID: @.>

Many thanks !

Have a nice weekend.

Best regards,
Xiangyi

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

how to use the methylation data when to do boxplot ? #324

how to use the methylation data when to do boxplot ? #324

DengEr-1993 commented Aug 8, 2024

alexg9010 commented Aug 8, 2024

DengEr-1993 commented Aug 9, 2024

alexg9010 commented Aug 9, 2024 via email

DengEr-1993 commented Aug 9, 2024

how to use the methylation data when to do boxplot ? #324

how to use the methylation data when to do boxplot ? #324

Comments

DengEr-1993 commented Aug 8, 2024

alexg9010 commented Aug 8, 2024

DengEr-1993 commented Aug 9, 2024

alexg9010 commented Aug 9, 2024 via email

DengEr-1993 commented Aug 9, 2024