
q-values for heavily-weighted tests are often much lower than the original p-values #5

Open
fweghorst opened this issue Sep 2, 2022 · 1 comment


fweghorst commented Sep 2, 2022

Thanks for producing this much-needed package. I have one concern: for heavily weighted tests, the lm_qvalue function often returns q-values far below the original p-values, sometimes approaching or reaching significance even when the original p-value was well above 0.50.

I've attached a sample of my data, as well as the R code used to obtain them, to demonstrate the issue. The data were obtained with the DESeq2 package, followed by lm_qvalue using per-gene read counts as weights. Some heavily weighted genes behave as expected (e.g. ENSGALG00010000588 has a low p-value that is adjusted only slightly upward), but other heavily weighted genes demonstrate the above-mentioned issue (e.g. ENSGALG00010020480 has a p-value of 0.737 that is lowered to 0.0705).

I understand that methods of FDR control that estimate the proportion of true null hypotheses (such as swfdr and Storey's q-value procedure) can yield q-values lower than the original p-values, but I did not expect the reduction to be as large as what I am seeing with swfdr. Any explanation would be much appreciated!

swfdr.csv
```r
# sample metadata and transcript quantification import
coldata <- read.csv("/dfs6/pub/fweghors/CREVASSE_ref/swishcoldata.csv")
library("tximeta")
library("DESeq2")
library("SummarizedExperiment")
y <- tximeta(coldata)
y <- summarizeToGene(y)
y$condition <- factor(y$condition, levels = c("VEH", "DEVD"))
y$batch <- factor(y$batch, levels = c(1, 2))

# differential expression with batch as a covariate
dds <- DESeqDataSet(y, design = ~ condition + batch)
dds <- DESeq(dds)
res <- results(dds, name = "condition_DEVD_vs_VEH")

# per-gene total read counts across the 10 samples, used as weights
counts <- assay(dds)[, 1:10]
weights <- rowSums(counts)
res <- cbind(res, weights)

# weighted FDR adjustment with swfdr
library("swfdr")
pvalues <- res$pvalue
qv <- lm_qvalue(pvalues, X = weights)
padjust <- qv$qvalues
res_adj <- cbind(res, padjust)
```

jtleek (Contributor) commented Sep 11, 2022

Hi!

A few points - though I'm not 100% sure any will totally solve the problem :/

  1. In general, the most common way for q-values to be lower (including much lower) than their corresponding p-values is when the estimated prior probability that a hypothesis is null is very low. For example, if almost all the p-values from a study are very small, then methods like those of Storey, ours, etc. will estimate π₀ near 0, and any large p-values will still map to small q-values.
  2. In the code above, it looks like the weights you are using are per-gene total counts. Those weights then correspond to different prior probabilities π₀ given the weight. These conditional priors are likely very low for some values of the weights, and so you get much smaller q-values than p-values in those cases.
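Point 1 can be illustrated with a small simulation. This is a minimal sketch with made-up p-values (not the reporter's data), using a Storey-type π₀ estimate and the corresponding q-value formula:

```r
# When most p-values are tiny, the Storey-type estimate of pi_0 is near 0,
# and even a large p-value maps to a small q-value, since roughly
# q(p) = pi_0 * p / Fhat(p), where Fhat(p) is the fraction of p-values <= p.
set.seed(1)
p <- c(runif(950, 0, 1e-3),  # 950 strongly non-null tests
       runif(50))            # 50 null tests, uniform on [0, 1]

lambda <- 0.5
pi0 <- mean(p > lambda) / (1 - lambda)  # Storey's estimator; close to 0
pi0

# q-value for one large p-value, say p = 0.7:
p0 <- 0.7
q0 <- pi0 * p0 / mean(p <= p0)
q0                                      # much smaller than 0.7
```

With almost every test non-null, π₀ comes out near 0.05, so even p = 0.7 is pulled down by more than an order of magnitude. That is the same mechanism swfdr applies, except conditionally on the weights.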

What I would do next to diagnose is plot weight vs p-value and weight vs q-value. I would also plot q-value vs p-value, but colored by weight. These diagnostics should help you discover which values of the weights have very low prior probabilities of the null, leading to this behavior.
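A sketch of those diagnostics in base R, assuming the `res_adj` object from the script above with its `weights`, `pvalue`, and `padjust` columns (names taken from the reporter's code):

```r
df <- as.data.frame(res_adj)

# weight vs p-value and weight vs q-value
plot(df$weights, df$pvalue, log = "x",
     xlab = "weight (total counts)", ylab = "p-value")
plot(df$weights, df$padjust, log = "x",
     xlab = "weight (total counts)", ylab = "q-value")

# q-value vs p-value, colored by weight tertile
tert <- cut(df$weights, quantile(df$weights, c(0, 1/3, 2/3, 1), na.rm = TRUE),
            include.lowest = TRUE)
plot(df$pvalue, df$padjust, col = as.integer(tert),
     xlab = "p-value", ylab = "q-value")
abline(0, 1, lty = 2)  # q = p reference line
```

Points falling well below the dashed q = p line, concentrated in one weight tertile, would indicate which weight range is getting a very low estimated π₀.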
