
<vsel_object>$summaries$<sub[[k]]_or_ref>$w and weights in get_stat() #330

Closed · fweber144 opened this issue Jul 6, 2022 · 5 comments

fweber144 (Collaborator) commented Jul 6, 2022

I'm sorry, but I still don't understand what element <vsel_object>$summaries$<sub[[k]]_or_ref>$w is supposed to contain: The original observation weights (e.g., the observation-specific numbers of trials in a binomial family with > 1 trials for at least one observation)? Or the weights of the CV folds the observations are in? Or some combination of the two? Or something completely different (although I guess it must have to do with the observations because it has length $N$ with $N$ denoting the number of observations)?

@AlejandroCatalina, could you clarify this? Sorry for bothering you, but this is an important issue because it affects all users with nontrivial observation weights (i.e., where at least one observation weight is not 1). Using Git blame, I tried to find the original commit introducing this element w, but I ended up at commit b05a8f4, a large commit with many changes that is hard to disentangle.

Related questions concerning this element w already came up in #94, #173, and #188 (note the later comments in #94), but have never really been clarified.

The reason why I stumbled across this again is related to #329: Putting #329 into a "real application" context shows that the weights in projpred:::auc() are not specified correctly:

set.seed(6834)
nobs <- 100L
mu <- binomial()$linkinv(-0.42 + rnorm(nobs))
# Observation-specific numbers of trials, i.e., nontrivial observation weights:
trials <- sample.int(30L, size = nobs, replace = TRUE)
y <- rbinom(nobs, size = trials, prob = mu)
dat <- data.frame(y = y, trials = trials)

library(rstanarm)
options(mc.cores = parallel::detectCores(logical = FALSE))
# Intercept-only reference model for the aggregated binomial response:
r_fit <- stan_glm(
  cbind(y, trials - y) ~ 1,
  family = binomial(),
  data = dat,
  seed = 101
)

library(projpred)
vs <- varsel(r_fit,
             method = "forward",
             seed = 12354)
# Step into projpred:::auc() to inspect the weights it receives:
debug(projpred:::auc)
summary(vs, stats = "auc", seed = 43465)

Debugging the last line shows that inside projpred:::auc(), the weights are all 1, even though the original ones in dat$trials are not. While tracking down this issue, I realized that argument weights of get_stat() is probably misspecified in general (i.e., all stats computed by get_stat() are affected). And since all calls of get_stat() are within .tabulate_stats(), it was easy to see that <vsel_object>$summaries$<sub[[k]]_or_ref>$w is the problem here. (Note that despite this issue here, #329 is important on its own because it shows that even if the weights were specified correctly, the returned AUC value would be incorrect.)
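
To make concrete what "taking the observation weights into account" would mean for the AUC here, a minimal sketch (purely illustrative, not projpred's implementation; the helper auc_expanded() is hypothetical): for a binomial family with > 1 trials, weighting by the numbers of trials amounts to expanding each aggregated observation into its underlying binary observations and computing the AUC on the expanded data:

auc_expanded <- function(y, trials, mu) {
  # Expand: y_i successes and (trials_i - y_i) failures per observation i.
  y_bin <- unlist(mapply(function(y_i, n_i) c(rep(1L, y_i), rep(0L, n_i - y_i)),
                         y, trials, SIMPLIFY = FALSE))
  mu_bin <- rep(mu, times = trials)
  # Mann-Whitney form of the AUC (ties counted with weight 1/2 via average ranks):
  r <- rank(mu_bin)
  n1 <- sum(y_bin == 1L)
  n0 <- sum(y_bin == 0L)
  (sum(r[y_bin == 1L]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
# E.g., with the data from above and the reference model's fitted probabilities:
# auc_expanded(dat$y, dat$trials, fitted(r_fit))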

fweber144 added a commit to fweber144/projpred that referenced this issue Jul 6, 2022
@fweber144 fweber144 changed the title <vsel_object>$summaries$<sub_or_ref>$w and weights in get_stat() <vsel_object>$summaries$<sub[[k]]_or_ref>$w and weights in get_stat() Jul 6, 2022
@fweber144 (Collaborator, Author)

In contrast to cv_varsel(), it seems like varsel() never returns this element <vsel_object>$summaries$<sub[[k]]_or_ref>$w, so is w supposed to contain the weights of the CV folds the observations are in? But then, what about the original observation weights? The example above shows that at least in case of the AUC, argument weights of get_stat() needs the observation weights.

I also noticed that <vsel_object>$summaries$ref$w (so w for the reference model) is always NULL which is strange because it looks like lines

summ <- summ_ref
res <- get_stat(summ$mu, summ$lppd, varsel$d_test, stat, mu.bs = mu.bs,
                lppd.bs = lppd.bs, weights = summ$w, alpha = alpha, ...)
expect a non-NULL element w also for the reference model (which would also make sense if w is supposed to contain the weights of the CV folds the observations are in because the reference model's performance is also cross-validated if the submodels' performances are).
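
For reference, this can be checked interactively with the vs object from the reprex above (a quick sketch; the list structure is assumed from the notation <vsel_object>$summaries$<sub[[k]]_or_ref>$w used in this issue and may differ across projpred versions):

str(vs$summaries$sub[[1]]$w)  # not returned by varsel()
str(vs$summaries$ref$w)       # NULL for the reference model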

@fweber144 (Collaborator, Author)

So it seems like element <vsel_object>$summaries$<sub[[k]]_or_ref>$w is supposed to contain the weights of the CV folds the observations are in. The actual observation weights (those supplied by the user) are not taken into account appropriately (in get_stat()), so this needs to be fixed.

@fweber144 fweber144 mentioned this issue Jul 27, 2022
fweber144 added a commit that referenced this issue Jul 27, 2022
AlejandroCatalina (Collaborator) commented Oct 11, 2022 via email

@fweber144 (Collaborator, Author)

Thanks for the confirmation :)

@fweber144 (Collaborator, Author)

Just for the sake of completeness:

The actual observation weights (those supplied by the user) are not taken into account appropriately (in get_stat()), so this needs to be fixed.

from #330 (comment) has been addressed by 35d41ce (within PR #344) and 0e68163 (which fixes #342).
