
add weights in pseudoBulkDGE for edgeR analysis #107

Open
koenvandenberge wants to merge 4 commits into master
Conversation


@koenvandenberge commented Feb 1, 2023

Hi,

In some cases, one might want to add weights to the dispersion estimation and fitting of the edgeR model when using the pseudoBulkDGE function. In my case these would be sample-level weights (much like voom with sample quality weights), but I imagine that observation-level weights may also be of interest, so I have allowed a matrix of weights with the same dimensions as the expression matrix.
This seems to be possible with only a minor adjustment.
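A minimal sketch of the intended usage, assuming the new argument is simply called `weights`; the input object and column names here (`sce`, `summed`, `celltype`, `sample`, `condition`, `conditiontreated`) are hypothetical placeholders, and the weights themselves are just random numbers for illustration:

```r
library(scuttle)
library(scran)

# Pseudo-bulk the single-cell data per cell type and sample.
summed <- aggregateAcrossCells(sce, ids=colData(sce)[, c("celltype", "sample")])

# Hypothetical sample-level quality weights, one per pseudo-bulk profile,
# expanded to a gene-by-sample matrix as proposed in this PR.
sample.w <- runif(ncol(summed), 0.5, 2)
w <- matrix(sample.w, nrow=nrow(summed), ncol=ncol(summed), byrow=TRUE)

res <- pseudoBulkDGE(summed,
    label=summed$celltype,
    design=~condition,
    coef="conditiontreated",
    weights=w) # proposed argument, not in released scran
```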

Let me know what you think.

@LTLA
Collaborator

LTLA commented Feb 2, 2023

In principle, the ability to include weights would be useful. However, there are a minor problem and a major one.

The minor problem is that `weights` needs to be subsetted by `chosen` to match the subsetting of `x` and `col.data`.

The major problem is that I don't see how these weights can be easily computed by your average user without asking them to go through the edgeR pipeline for each comparison. I assume that you're computing weights using knowledge of the design matrix, plus some normalization and filtering information, etc. If that's the case, you probably were already halfway through some kind of edgeR analysis, so why bother using pseudoBulkDGE at all?

How are you deriving your weights? Unlike limma, we don't have many weighting functions in edgeR, and the interpretation of the weights isn't so clear-cut either... something to do with scaling the QL dispersion, IIRC. But if you do have a function, a more ergonomic approach would be to allow users to pass a weighting function that runs on `y` (possibly also passing in `curdesign`) and is expected to emit a vector/matrix of weights for each label.
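To make that concrete, a rough sketch of such a callback; the signature and argument names are hypothetical, mirroring the internal `y` and `curdesign` mentioned above:

```r
# Hypothetical weighting callback: given the per-label DGEList and its design
# matrix, return a vector (one weight per sample) or a matrix (one per observation).
my.weighter <- function(y, curdesign) {
    w <- rep(1, ncol(y))  # placeholder: the actual weighting scheme is up to the user
    matrix(w, nrow=nrow(y), ncol=ncol(y), byrow=TRUE)
}

# Hypothetical usage, if pseudoBulkDGE accepted a function here:
# pseudoBulkDGE(summed, label=summed$celltype, design=~condition,
#     coef="conditiontreated", weights=my.weighter)
```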

@koenvandenberge
Author

Thank you for looking into this; I missed the subsetting and have adapted the pull request.

On the major problem, I think it depends. Personally, I am only interested in sample-level weights that do not depend on the gene. These can be estimated for all cell types in one go on the aggregated SCE and then provided to pseudoBulkDGE, without first having to write a custom loop across cell types. Even if the weights were gene-specific, they could be subsetted according to the filterByExpr filtering used in pseudoBulkDGE, after first calculating them on the full pseudo-bulk data.
I also think sample-level weights can sometimes be calculated from the colData alone, e.g., for sample purity/contamination effects or even inverse probability weighting approaches. These require working on the full pseudo-bulk data first to estimate the (inverse) probabilities/weights, so it seems suboptimal to do this separately for every cell type via a function.
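For what it's worth, one way the "all cell types in one go" step could look; this uses limma's arrayWeights() purely as an example of a per-sample weighting scheme on the full pseudo-bulk matrix, and reuses the hypothetical `summed`/`condition` names from above:

```r
library(edgeR)
library(limma)

# Sample-level weights estimated once on the full pseudo-bulk matrix.
d <- DGEList(counts(summed))
d <- calcNormFactors(d)
logcpm <- cpm(d, log=TRUE)

design.all <- model.matrix(~condition, data=as.data.frame(colData(summed)))
sample.w <- arrayWeights(logcpm, design=design.all)  # one weight per pseudo-bulk column

# Expand to the gene-by-sample matrix expected by the proposed weights= argument.
w <- matrix(sample.w, nrow=nrow(summed), ncol=ncol(summed), byrow=TRUE)
```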

Regarding the interpretation, how much of an issue do you think it is? You'll know better, but it seems like it may be more clear-cut when not using the QL framework and so, if weights are used, the estimateDisp and glmFit workflow could be used instead, although that might be taking it too far for scran.

I understand that the suggestion may be niche, and if you think it does not fit within scran, feel free to decline the pull request. I was just hoping to make my future life a bit easier.

@LTLA
Collaborator

LTLA commented Feb 4, 2023

The proposed changes to the code seem fine enough... provided the weights are sensible.

And I do worry about whether you have sensible weights to pass in. I was staring at edgeR's C++ code to jog my memory, and I think that the weights are being applied to the mean-variance function for each observation. So: if you were to say that an observation has a user-supplied weight of 2, you're claiming it's half as variable as it should be for its mean. This has the (superficially desired) effect of increasing its influence on the coefficient estimates, but it's also a very specific meaning, and I don't think you can just throw in any value and expect type I error control to be maintained.
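To state that a bit more concretely (this is my loose reading, not documented edgeR behaviour): the prior weight effectively divides the (quasi-)variance function for that observation,

$$\operatorname{var}(y_{gi}) \approx \frac{\sigma^2_g}{w_{gi}}\left(\mu_{gi} + \phi_g\,\mu_{gi}^2\right),$$

where $\phi_g$ is the NB dispersion and $\sigma^2_g$ the quasi-dispersion, so setting $w_{gi} = 2$ asserts that the observation is half as variable as expected for its mean.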

On that note: IIRC, as soon as you have a non-unity weight, you're already halfway into QL territory, as the QL methods also involve scaling the variance function. Specifically, you can't just divide the variance by some arbitrary user-supplied weight and still consider the observations to follow a negative binomial distribution - these non-normal models don't scale up and down like that, and that's why we have QL. So the choice between QL/non-QL shouldn't be the issue here.

Having said all of that, I barely remember any of this stuff, and my grasp of it was fairly tenuous to begin with, so it might just all be fine. Perhaps @gksmyth or @yunshun could provide their thoughts. I remember some discussion about weighting a few years ago, but I can't remember what the conclusion was.

@gksmyth

gksmyth commented Feb 4, 2023

Please be careful. Weights used for edgeR glm fitting and weights input to NB dispersion estimation are not the same thing. For glms and the QL pipeline, weights have a clear meaning in terms of the quasi-dispersions. For non-QL functions like estimateDisp(), weights only have a meaning in terms of interpreting observations as averages, which I suspect will not have the flexibility that @koenvandenberge is after.

Inverse probability weights are designed to correct biases in the linear model estimates and are not related to variances or dispersions. As far as I can see, inverse probability weights would not work correctly with dispersion estimation functions or with differential expression tests.
