Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Subsampled LOO #94

Open
fweber144 opened this issue Mar 19, 2021 · 16 comments
Open

Subsampled LOO #94

fweber144 opened this issue Mar 19, 2021 · 16 comments
Labels
bug Bugs.

Comments

@fweber144
Copy link
Collaborator

In cv_varsel.refmodel(), the observation-specific weights are hard-coded to be all equal (line

w <- rep(1, NCOL(solution_terms_cv_ch))
), but in case of subsampling LOO-CV, they might be different (as the comment says there, too). Thus, shouldn't that line be modified?

@AlejandroCatalina
Copy link
Collaborator

Yes. This is a bit less straightforward to fix as we can compute both approximate LOO or full LOO, so I think the right weights should be returned by loo_varsel and kfold_varsel themselves. I'll work on this after resolving the easier issues.

@AlejandroCatalina AlejandroCatalina added the bug Bugs. label Apr 14, 2021
@fweber144
Copy link
Collaborator Author

Related comments/discussions may be found in issue #173 (enumeration point 4) and PR #188. Now that I understand the purpose of w better, I think constant weights for LOO-CV are ok (no matter whether subsampled or not) but they might be inappropriate for K-fold CV with folds of very different sizes.

@fweber144
Copy link
Collaborator Author

fweber144 commented Oct 11, 2021

I have now read Magnusson et al. (2019) and Magnusson et al. (2020), but I'm still confused how projpred wants to perform the subsampled LOO CV. More importantly, I have the strong feeling that the current implementation is not correct. What I learned from Magnusson et al. (2019) and Magnusson et al. (2020) makes me think the problems with the implementation in projpred get deeper than I initially thought.

What is important for projpred is that:

  1. Magnusson et al. (2019) propose to perform probability-proportional-to-size sampling (PPS) combined with the Hansen-Hurwitz (HH) estimator.
  2. Magnusson et al. (2020) propose to perform simple random sampling (SRS) combined with the difference estimator. The reason for this new proposal is that the approach by Magnusson et al. (2019) is not optimal for model comparisons (as explained by Magnusson et al., 2020).

However, in my opinion, neither the approach by Magnusson et al. (2019) nor that by Magnusson et al. (2020) is currently used in projpred. The key equation from Magnusson et al. (2019) is equation (4). The key equation from Magnusson et al. (2020) is equation (7). However, I can't find any of these two equations in projpred. The appropriate place would probably be in lines

projpred/R/summary_funs.R

Lines 171 to 179 in dd4bc36

if (!is.null(lppd.bs)) {
value <- sum((lppd - lppd.bs) * weights, na.rm = TRUE)
value.se <- weighted.sd(lppd - lppd.bs, weights,
na.rm = TRUE) / sqrt(n_notna) * n_notna
} else {
value <- sum(lppd * weights, na.rm = TRUE)
value.se <- weighted.sd(lppd, weights,
na.rm = TRUE) / sqrt(n_notna) * n_notna
}
(these lines only refer to the ELPD as performance statistic; projpred has more performance statistics, of course), but this does not correspond to equation (4) from Magnusson et al. (2019) and neither to equation (7) from Magnusson et al. (2020). The SE estimators from these lines also don't correspond to the SE estimators from Magnusson et al. (2019, equation (6)) and Magnusson et al. (2020, equation (9)). The subsampling itself (performed in lines

projpred/R/cv_varsel.R

Lines 839 to 840 in dd4bc36

w <- exp(lppd - max(lppd))
inds <- sample(seq_along(lppd), size = nloo, prob = w)
) also confuses me. For example, for the approach by Magnusson et al. (2019), I would have expected the logarithms of the predictive densities (not the predictive densities themselves) in the definition of w, and then sampling with replacement.

So if the current implementation of subsampled LOO CV in projpred is not correct, then I think projpred should not support subsampled LOO CV until this is fixed.

Of course, I could be totally wrong and neither Magnusson et al. (2019) nor Magnusson et al. (2020) are relevant for subsampled LOO CV in projpred. But then, I would be even more confused because there is this comment:

projpred/R/cv_varsel.R

Lines 821 to 825 in dd4bc36

## decide which points to go through in the validation based on
## proportional-to-size subsampling as implemented in Magnusson, M., Riis
## Andersen, M., Jonasson, J. and Vehtari, A. (2019). Leave-One-Out
## Cross-Validation for Large Data. In International Conference on Machine
## Learning.
which indicates that projpred wants to use the approach by Magnusson et al. (2019).

If my assumption is correct and projpred needs a fix, then a way to implement either the approach by Magnusson et al. (2019) or that by Magnusson et al. (2020) might be to rely on loo::loo_subsample() (see this vignette). This is just a rough idea, though, and I don't know if it is really feasible.

Of course, given these new findings, my statement

[...] I think constant weights for LOO-CV are ok (no matter whether subsampled or not) [...]

from the comment above (see also enumeration point 4 in #173) is not correct. At that time, I hadn't realized yet that projpred (probably) wants to use the approach by Magnusson et al. (2019) which requires sampling with replacement, which in turn means that each observation can occur more than once in the subsample (in contrast to my statement

[...] after all, they all represent a single observation [...]

from #188) so that when aggregating across the subsampled LOO CV folds, non-constant weights make sense (but note that these would have to be inverse-probability weights, in contrast to the currently used non-inversed probability weights).

References

Magnusson, M., Andersen, M., Jonasson, J., and Vehtari, A. (2019). Bayesian leave-one-out cross-validation for large data. In Proceedings of the 36th International Conference on Machine Learning, 4244-4253. URL: https://proceedings.mlr.press/v97/magnusson19a.html.

Magnusson, M., Vehtari, A., Jonasson, J., and Andersen, M. (2020). Leave-one-out cross-validation for Bayesian model comparison in large data. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, 341-351. URL: https://proceedings.mlr.press/v108/magnusson20a.html.

@fweber144
Copy link
Collaborator Author

An alternative to not supporting subsampled LOO CV at all anymore might be to throw a warning concerning a potentially incorrect implementation.

fweber144 added a commit to fweber144/projpred that referenced this issue Dec 9, 2021
AlejandroCatalina pushed a commit that referenced this issue Dec 9, 2021
* Add a warning for the experimental `Student_t()` family (see issue #233).

* Add a warning for the experimental subsampled LOO CV (see issue #94).

* Adapt the tests to the warning for subsampled LOO CV.

* Additive models: Add a note in the docs and a warning in `init_refmodel()` that

this is currently only an experimental feature. See the comment at
<github.com//pull/237#issuecomment-961151653>.

* Tests: Exclude additive models (GAMs and GAMMs) for now, following

commit "Additive models: Add a note in the docs and a warning in `init_refmodel()` that [...]".

* Add a warning for the Gamma family (support for this is only experimental; see

issue #240).
fweber144 added a commit that referenced this issue Apr 1, 2022
* can we run for 2 hours?

* minor fix

* average dis?

* force numeric

* drop dimension if vector

* fix lpd computation #105

* speed up dependencies?

* does this work?

* fake commit

* Update NEWS file. (#111)

* expect length to be n

* bump version

* error parsing

* Fix #106 (#112)

Fix output dimensions of lpd output when only 1 observation is present.

* Update NEWS file and bump version number (#113)

* Add missing `else` cases for `weights` and `offset` in `.extract_model_data()`. (#114)

* Fix formula parsing with multidimensional interactions (#103)

For now, `t2(x, z)` can be added without `x` or `z`.

* fix #115

with a single predictor solution_terms_cv_ch is a vector, ensure it's
always a matrix

* Update/fix NEWS file. (#117)

* update NEWS

* Fixes for argument `nterms` of `proj_helper()` (#110)

* In proj_helper(): `list(...)$nterms` is always `NULL` as there is a predefined argument `nterms`.

* Fix a bug for calls like
```r
mylinpred <- proj_linpred(project(myfit,
                                  solution_terms = c("V3")),
                          nterms = 2)
```
which previously raised an error (but should work without raising an error) and for calls like
```r
mylinpred <- proj_linpred(project(myfit,
                                  solution_terms = c("V3")),
                          nterms = 1)
```
which previously did not raise an error (but should raise an error).

* Fix what is probably a precedence typo, see `?Syntax`.

* In proj_helper(): Use `count_terms_chosen(x$solution_terms)` instead of `(length(x$solution_terms) + 1)` to be better prepared for no-intercept models.

* Argument `nterms` in proj_helper() is passed to project(), so it doesn't match the output of count_terms_chosen() (since count_terms_chosen() counts the intercept but argument `nterms` of project() does not). Therefore, add +1 to `nterms` (if specified). This also affects the behavior of `nterms` in the case `inherits(object, "projection") || (length(object) > 0 && inherits(object[[1]], "projection"))`, i.e. when `nterms` is used to filter `object`.

* Do not use `nterms` in the case `inherits(object, "projection") || (length(object) > 0 && inherits(object[[1]], "projection"))`, i.e. do not use `nterms` to filter `object` if it results from project() (filtering wasn't possible before commit 8909d38e372f5945364d6083d33a02856555cf32 anyway).

* Fixup commit cd04242196ca6d915db9b8c79bc5ba1f3939577f.

* Update/fix NEWS file and add contributor (#118)

* fix nclusters[_pred] handling!!

only use clusters for less than 20 draws

* don't use 5 clusters for kfold predictions

* make sure we are using rstanarm

* Fix a `fetch_submodel()` for intercept-only submodels (#119)

* Fix fetch_submodel() for intercept-only "datafits".

* Fix a unit test according to the change by commit 9264ad830b71385e77bb5e153b44237ed1f70951.

* Update NEWS file.

* Miscellaneous fixes and improvements (#92)

* Increase development-specific version number.

* Remove "*.Rproj" from Git ignore list.

* Add RStudio's default .Rproj file.

* projpred.Rproj: Add "StripTrailingWhitespace: Yes".

* projpred.Rproj: Add "AutoAppendNewline: Yes".

* Add "/src/projpred.dll" to Git ignore list.

* In the .Rproj file: Set "PackageRoxygenize" field to common values.

* NEWS.md: Remove unnecessary whitespace.

* Docs for varsel(): Correct typos and improve formatting (and reflow the roxygen2 comments and then re-document).

* Docs for varsel(): Correct a possible typo in the documentation of argument `object`.

* Docs for varsel(): Remove defaults for `ndraws` and `ndraws_pred` as they could be misunderstood by the user (in fact, the default for `ndraws` and `ndraws_pred` is to be ignored since `nclusters` and `nclusters_pred` are specified by default). Correct the default for `nclusters` (20 instead of 10).

Correct the default for `nclusters_pred` mentioned in a comment in the code (400 instead of 5).

NEWS.md: Incorporate the increase of the defaults for arguments `nclusters` and `nclusters_pred` (already at earlier versions, 2.0.3 and 2.0.4).

* Docs for varsel(): Re-document.

* Docs for project(): Correct a possible typo for argument `nterms`, namely replaced "If a list, [...]" by "If a numeric vector, [...]". Also reflow the corresponding roxygen2 comment and re-document.

* Docs for project(): Unify two `@param regul` items (one of which had slipped into the previous `@param` item).

* Slightly improve an error message in project().

* Comment `fit_glm_callback()` (does not seem to be used).

* Use version number 2.0.4.9000 instead of 2.0.4.9001 (to avoid too fine-grained version numbers).

* Remove an unused ellipsis causing the message `** byte-compile and prepare package for lazy loading\n Note: ... may be used in an incorrect context` upon package installation.

* Simplify/generalize documentation of argument `object` of proj_linpred() and proj_predict() and a related error message.

* Re-document.

* Re-document.

* After the merge with 'develop': Re-apply fixes and formatting in the documentation of varsel.R

* Correct documentation of argument `seed` of varsel.refmodel().

* Fix unit tests (to match commit 4dfb4b74979ee4f1ff78ef931bd9cf57c074d244).

* Improve some test titles.

* Use backticks consistently to indicate code.

* Fix typo in the NEWS file.

* Improve an error message in proj_helper().

* Slightly improve documentation for argument `integrated` of proj_linpred().

* Correct documentation for proj_linpred() and proj_predict(): Only proj_linpred() supports output element `lpd`. Also slightly improve the formatting.

* Update NEWS file.

* Avoid the `no visible binding for global variable` note in R CMD check (see <https://dplyr.tidyverse.org/articles/programming.html#eliminating-r-cmd-check-notes>).

* Fix a syntax error in a commented part of the vignettes' YAML headers.

* Function where() is not exported by package tidyselect, so redefine it in misc.R to avoid a note in R CMD check which would occur for usage of tidyselect:::where().

* Remove a mistakable comment in compute_lpd().

* Slightly improve a comment in proj_linpred().

* Improve an error message in the unit tests.

* Simplify definition of `ynew` in proj_linpred().

* Simplify calls to extract_model_data().

* Simplify the div_minimizer() call in project_submodel() (addresses GitHub issue #88).

* Remove an unnecessary `form <- refmodel$formula` definition.

* Slightly improve an error messsage for `nterms` in proj_helper().

* Use as.vector() instead of as.numeric() (for consistency with the `pred` definition above).

* Revert the creation of a new object `predictions` (introduced by commit 26d83e63c8a7f28bff44a9be002ed72c65ed273f) which is not necessary anymore since commit be0c1dc5639bbf5e00f004f1ce6ded904de208c4.

* Remove argument `nterms` in proj_linpred(), proj_predict(), and proj_helper() since it may simply be passed via the ellipsis.

* Improve documentation of the ellipsis for proj_linpred() and proj_predict().

* Rename the local function proj_predict() to proj_predict_local() to distinguish it from the exported function proj_predict().

* After commit 32a1a53285eb0127cb29503a87fc07674269ecbf, there is no need anymore for proj_helper()'s internal objects `nterms` and `projected_sizes`.

* Use `projs` and `proj` consistently in `proj_helper()` (and thereby save the memory for `proj`).

* Improve the documentation of the return value of proj_linpred() and proj_predict().

* NEWS file: Create a new heading for the current version.

* Revert "fake commit"

This reverts commit 1388efb27bed17ca1158cab48d5a82561474cd8d.

* Remove packages MASS and optimx from Imports field to avoid the R CMD check note
```
Namespaces in Imports field not imported from:

  ‘MASS’ ‘optimx’

  All declared Imports should be used.
```

* Rename the local functions passed by proj_linpred() and proj_predict() to proj_helper() (so that they now contain a midfix "_linpred_" and "_predict_", respectively).

* Move proj_linpred_aux() and proj_predict_aux() out of proj_linpred() and proj_predict() (respectively) to make debugging easier.

* Replace `proj_predict_local` by `onesub_fun` to avoid confusion with proj_predict().

* Be explicit about the origin of the objects used in onesub_fun() (for better readability, but also to avoid the R CMD check note `no visible binding for global variable`).

* Fixup last commit f1c1545c900403001c6f015d3550da2505448c04 ("
Be explicit about the origin of the objects used in onesub_fun() (for better readability, but also to avoid the R CMD check note `no visible binding for global variable`).
"): Also handle proj_predict()'s argument `ndraws`.

* Set the seed for proj_predict() inside of proj_predict(), not inside of proj_helper() (to show explicitly that proj_linpred() does not require a PRNG).

* Avoid a clash between proj_predict()'s former arguments `ndraws` and `seed` with project()'s arguments `ndraws` and `seed`.

* Fix test_misc.R (due to commit d9fea25a649f1290643415e44807ccceec1da20e).

* Fix test_proj_pred.R (due to commit d9fea25a649f1290643415e44807ccceec1da20e).

* Fix a test name.

* Fix a NEWS entry (see <https://discourse.mc-stan.org/t/projpred-behavior-of-argument-search-terms/19132>).

* Fix the documentation of argument `search_terms` in varsel() and cv_varsel() since `search_terms` is usually a character vector, not a list (it seems like `search_terms` can also be a list, but doesn't seem to be the usual case, see e.g. <https://discourse.mc-stan.org/t/projpred-behavior-of-argument-search-terms/19132> or <https://discourse.mc-stan.org/t/projection-predictive-feature-selection-for-multilevel-phylogenetic-models/21402>). Also mention that the intercept needs to be included explicitly. Use the same documentation of argument `search_terms` in varsel() and cv_varsel().

* Replace `source(file.path("helpers", "SW.R"))` by `source(testthat::test_path("helpers", "SW.R"))` to make debugging of tests easier.

* Move the helper functions proj_linpred_aux() and proj_predict_aux() *below* their respective main function (proj_linpred() and proj_predict(), respectively).

* Incorporate the fix from commit 387d73a1706ad07e7b84922ef0322c26bad03bc2 of branch fweber144:fix_nterms_proj_helper (see PR #110).

* Add argument `nterms_filter` to proj_linpred() and proj_predict().

* Fixup commit c1f4c17d571199cb353dae451af58c99a2055015 for the case `inherits(object, "projection")`.

* Replace `nterms_filter` by `filter_nterms` to avoid a clash with argument `nterms` which may be passed via `...`.

* Fix a NEWS entry.

* Fix a test name.

* Improve the documentation for proj_linpred()'s and proj_predict()'s return value.

* Fix a typo in the documentation of argument `offsetnew`.

* Re-document.

* Fix a comment.

* Save memory by avoiding a separate definition of `p_sel <- search_path$p_sel`.

* Break lines in long paste()-d test names.

* Add another line break in a test name.

* Re-indent the test files.

* Reflow the test files to margin width 80.

* Re-indent the R files.

* Re-document.

* Reflow the R files to margin width 80.

* Re-document.

* Apply `min(NCOL(refmodel$mu), [...])` also to user-specified `ndraws`, `nclusters`, `ndraws_pred`, and `nclusters_pred`. Remove unnecessary `ndraws <- ` assignments.

* Pre-specify the defaults for `ndraws` and `ndraws_pred` in the function arguments. Update the documentation.

* Use the same behavior for arguments `ndraws` and `nclusters` in project() as in varsel() and cv_varsel().

* Minor wording improvement in the documentation.

* Update NEWS file.

* Update NEWS file.

* Fix the position (projpred version) of some NEWS entries added by PR #92. (#120)

* refmodel: Remove brmsfit-specific code (#122)

* Fix outdated tests (`ndraws` and `nclusters` are now internally cut off at `NCOL(refmodel$mu)`, so there is no error thrown by project() anymore). (#121)

* Add checks for required arguments passed via the ellipsis. (#123)

* Disallow no-intercept models, improvements for efficiency (#124)

* Avoid calling pseudo_data() if not necessary.

* Remove an unnecessary assignment.

* Remove unused code.

* Throw an error for a reference model without an intercept (see PR #100 and in particular <https://github.com/stan-dev/projpred/pull/100#issuecomment-809464069>).

* Insert GitHub PR number.

* hotfix: explicitly check formula object

* hotfix: fix readme example

* Fix `ndraws` remnants (#128)

* Fix a bug caused by the fact that `ndraws` must be non-NULL now (since commit 16d5afd40908207ebdc6da32e3b6671bf6c581b5).

* Fix a warning message.

* Fix output of print.vselsummary() (either `ndraws` or `nclusters` is used, but not both) and take into account that `ndraws` must be non-NULL now (since commit 16d5afd40908207ebdc6da32e3b6671bf6c581b5).

* Fix PR #128 (#129)

* Suppress an expected warning when using as.matrix.projection() for a clustered projection.

* Fix a bug introduced by PR #128.

* vignettes/quickstart_glmm.Rmd: Remove the term `(1 | phylo)` from the Poisson model (since we have (#133)

`identical(data_pois$phylo, paste0("sp_", data_pois$obs))`,
so including both `obs` and `phylo` doesn't make sense).

* Change arguments `size_sub` and `seed_sub` (#135)

* Apply `size_sub` only to clustered projection. For non-clustered projection, all draws are used.

* Update documentation.

* Rename argument `seed_sub` to `seed_ppd`.

* Update unit tests to the new handling of `size_sub`.

* Rename argument `size_sub` to `nclusters_resample`.

* Update NEWS.md

* Use brms's naming convention also for submodels of class `"subfit"` (#132)

* Resolve an S3 generic/method inconsistency (issued by R CMD check).

* For submodels of class "subfit", make the column names of as.matrix.projection()'s output matrix consistent with other classes of submodels (by fixing coef.subfit()).

* Update NEWS.md

* Update unit tests in test_as_matrix.R to also reflect the use of brms's naming convention.

* NEWS.md: Insert GitHub PR number #132.

* Minor cleaning.

* Move the project() calls out of SW().

* Avoid varsel() in the unit tests for as.matrix.projection() (for the sake of speed).

* Re-add SW() for the binomial reference model (with > 1 trials), see GitHub issue #136.

* Minor improvements for as.matrix.projection()'s unit tests.

* Refactor the unit tests for as.matrix.projection() so they may be extended more easily.

* as.matrix.projection(): Also test the case of a single (clustered) draw.

* Reduce the number of draws to be tested (for the sake of speed).

* Add unit tests for as.matrix.projection() in case of Gaussian and binomial reference models with multilevel but without additive terms.

* Fix `as.matrix.projection()` for 1 (clustered) draw (#130)

* Simplify as.matrix.projection().

* Fix a bug (hard-coded use of as.matrix.lm() for the case of a single draw after projection).

* Add `@keywords internal` to as.matrix.projection().

* Export methods which were previously unexported (necessary to fix the recent commit "Simplify as.matrix.projection().").

* Further simplify as.matrix.projection().

* Remove the as.matrix.noquote() method (doesn't seem to be used and overwrites as.matrix.noquote() from the "base" package).

* Resolve an S3 generic/method inconsistency (NOTEd by R CMD check).

* Do not import the "tidyselect" package (already imported by "dplyr" and causing a NOTE in R CMD check since tidyselect:::where() is imported only indirectly).

* Add as.matrix.glmerMod() which is currently redirecting to as.matrix.lmerMod().

* Add a comment to `%:::%`.

* Use `%:::%` consistently.

* Move `%:::%` from R/gamm4.R to R/misc.R

* Throw an error for unavailable as.matrix.<class_name>() methods.

* Add missing ellipses.

* Add as.matrix.gamm4().

* Update NEWS.md

* Fix typo in the documentation.

* Fix `plot.vsel()` if there is just the intercept-only submodel (#138)

* Fix a bug for `nterms_max` in plot.vsel() if there is just the intercept-only submodel.

* Add a more informative error message in plot.vsel() if there is just the intercept-only submodel.

* Update NEWS.md

* Fix error for L1 search applied to an empty (i.e. intercept-only) reference model (#139)

* Fix a bug for L1 search applied to an empty (i.e. intercept-only) reference model.

* Update NEWS.md

* Minor modifications (mostly no-op) (#140)

* Fix a comment.

* Throw a proper error message for nonsupported families.

* Fix documentation of project()'s ellipsis.

* Fix an error message.

* Fix a bug in compute_lpd() (wrong margin taken for averaging across the posterior draws). But this bug shouldn't be relevant as the corresponding case (`integrated && !is.null(dim(lpd))`) should not occur.

* compute_lpd(): Throw an error in the case `integrated && !is.null(dim(lpd))` as it should not occur.

* Replace a hard-coded method by its generic.

* Remove a line which is unnecessary due to replace_intercept_name(), replace_population_names() and the avoidance of "alpha".

* Remove README.md from .Rbuildignore (to show the README also on the CRAN website).

* Add warnings for argument `solution_terms` of project().

* Fix documentation.

* Replace remaining occurrences of numeric `solution_terms` by character.

* Update NEWS.md

* Remove unused functions.

* Improve documentation for `nterms_max` (state explicitly that it doesn't count the intercept).

* Improve documentation for `nterms` (state explicitly that it doesn't count the intercept).

* Move the common part of `varsel()` and `cv_varsel()`'s documentation out in a single file (a roxygen "template") and refer to that.

* Fix a bug for L1 search applied to an empty (i.e. intercept-only) reference model.

* Update NEWS.md

* Add the `man-roxygen` folder to .Rbuildignore

* Unify the documentation for argument `verbose`.

* Fix an error message (previously giving `argument is missing, with no default`).

* Fix documentation of `nterms_max`.

* Improve formatting of some documentation.

* Fix documentation of `method`.

* Improve documentation of `method`.

* Minor wording improvements in the documentation.

* Fix strange list names.

* Update NEWS.md

* Clarify docs for `cvfun`/`cvfits` (`cvfits` takes precedence over `cvfun`).

* Clarify docs for `cvfits` (mentioning attributes `"K"` and `"folds"`).

* Remove a trailing comma.

* Clarify docs for `cvfits` (mentioning sublist `"fits"`).

* Minor wording improvement in the documentation of argument `object`.

* Minor wording improvement in the documentation.

* Improve documentation of argument `object` (in the "get-refmodel" documentation tag).

* Improve documentation of argument `formula` (in the "get-refmodel" documentation tag).

* Further improve documentation of argument `formula` (in the "get-refmodel" documentation tag).

* Improve title and description in the "get-refmodel" documentation tag.

* Improve documentation of several arguments in the "get-refmodel" documentation tag.

* Improve documentation of several arguments in the "get-refmodel" documentation tag.

* Add an error message for `is.null(extract_model_data)`.

* Harmonize error messages.

* Remove argument `fetch_data` of `get_refmodel.default()` as this argument seems to be a remnant of older code.

* Remove argument `offset` of `get_refmodel.default()` as this argument seems to be a remnant of older code.

* Set defaults for `get_refmodel.default()`'s arguments where possible.

* Remove argument `wobs` of `get_refmodel.default()` as this argument seems to be a remnant of older code.

* Update NEWS.md

* NEWS.md: Insert GitHub PR number.

* Fix PR #135 (#141)

* Revert the order of `proj_linpred()`'s and `proj_predict()`'s arguments as it is on branch `master`.

* Rename argument `nclusters_resample` to `nresample_clusters` to avoid a clash in partial matching when supplying `project()`'s argument `nclusters` via `...`.

* Rename argument `seed_ppd` to `ppd_seed` to avoid a clash in partial matching when supplying `project()`'s argument `seed` via `...`.

* Fix `extract_model_data()` in `get_refmodel.stanreg()` (#142)

* Fix a bug in get_refmodel.stanreg().

* Update NEWS.md

* Enhancements (and fixes) for `proj_linpred()` and `proj_predict()` (#143)

* `test_datafit.R`: Improve indentation and remove a trailing comma.

* Add brackets to if() to avoid a strange error in RStudio.

* Avoid an unnecessary repetition of proj_linpred() (otherwise, `predd_list` would be unused).

* `test_proj_pred.R`: Generalize the value supplied to `info`, making the local objects `i_inf` obsolete.

* Improve indentation.

* Generalize the iterator for `i` (always `seq_len(nfams)` with `nfams <- length(fam_nms)`).

* Improve/correct test names and add test "output of proj_linpred is sensible with \"refmodel\" object as input".

* Iterate over names, not indices (therefore also remove `nfams`).

* Add expectations.

* Use integer instead of numeric (to facilitate comparisons).

* Speed up the tests from `test_proj_pred.R` by specifying (where necessary) `nclusters` and `nclusters_pred` in calls to varsel(), project(), proj_linpred(), and proj_predict().

* Fix an expectation after using `nclusters` and `nclusters_pred` more extensively.

* Avoid the soft-deprecated argument `info` of `testthat::expect_equal()` and friends by

following the advice from `?testthat::quasi_label` (see also
`?"equality-expectations"`).

* Remove old argument `intercept`.

* Add object `nresample_clusters_tst` and fix an incorrect expectation.

* Move projpred-related code out of SW() where warnings are not expected.

* Remove unnecessary specifications of `nresample_clusters`.

* Move some code for better readability.

* Argument `nterms_max` doesn't count the intercept, so reduce to `nterms`.

* Improve/correct test names and add test "output of proj_predict is sensible with \"refmodel\" object as input".

* Use a consistent order of arguments `transform` and `integrated`.

* Efficiency improvement in proj_linpred_aux(): Use `crossprod(x, y)` for `t(x) %*% y`.

* Let `proj_linpred()` always return a S x N matrix, no matter the value of `integrated`.

* Minor wording improvements in the documentation of `proj_linpred()`.

* Add dim() tests for proj_linpred() and improve some of the existing tests.

* Add test "proj_linpred's output is also correct in edge cases".

* Remove duplicated test/expectation.

* Extend the output structure (dimensionality) tests for `proj_predict()`.

* Use a consistent test name structure.

* Add test "proj_predict: output structure is also correct in edge cases".

* Remove unnecessary uses of `y`.

* Also test `ndraws` instead of `nclusters`.

* `proj_linpred_aux()`: Do not transpose `pred` internally; only at the very end. This is needed for families where the link function and the inverse link function are no simple scalar functions.

Such families may arise, e.g., in the augmented-data projection.

* Remove remnant argument `integrated´ from `compute_lpd()`.

* Average the LPD (across the projected posterior draws) instead of taking the LPD at the averaged linear predictors.

* Simplify `log_weighted_mean_exp()`.

* Docs: Use math mode.

* Update NEWS.md

* Enhance documentation.

* Use `testthat::expect_named()`.

* Fix an example (remnant occurrence of argument `nv`).

* Enhance formatting and comments in the example for `proj_linpred()` and `proj_predict()`.

* Insert GitHub PR number in NEWS.md

* Offer the possibility to run `proj_linpred()` without supplying the response in `newdata`. Before, this reprex:
```r
library(rstanarm)
data(bball1970)
options(mc.cores = parallel::detectCores(logical = FALSE))
myfit <- stan_glm(
  cbind(Hits, AB - Hits) ~ 1,
  data = bball1970,
  family = binomial("logit"),
  prior_intercept = normal(-1, 1),
  seed = 101
)
library(projpred)
set.seed(36467)
i_resampled <- sample.int(nrow(bball1970))
stopifnot(identical(sort(i_resampled),
                    seq_len(nrow(bball1970))))
mynewdata <- bball1970[i_resampled,
                       setdiff(names(bball1970), "Hits"),
                       drop = FALSE]
mypl <- proj_linpred(myfit,
                     solution_terms = character(),
                     nclusters = 2,
                     newdata = mynewdata)
```
threw the following error:
```
Error in eval(formula[[2]], data, environment(formula)) :
  object 'Hits' not found
```
Now, it returns `NULL` for output element `lpd`.

* Update NEWS.md

* `proj-pred` docs: Use `\code{\link{<function_name>}}`.

* `proj-pred` docs: Enhance the documentation of arguments `newdata`, `offsetnew`, `weightsnew`, `transform`, and `integrated`.

* Fix documentation for `ppd_seed`.

* Fix documentation of `weightsnew`.

* `proj-pred` docs: Use "only." instead of "only:" to avoid ambiguities.

* `proj-pred` docs: Minor improvements.

* Re-document.

* Documentation of `ppd_seed`: Clarify the meaning of the abbreviation "PPD".

* Replace `ppd_seed` by `.seed`.

* Minor modifications, mostly documentation (#153)

* Add argument `extract_y` also to `extract_model_data()` in `get_refmodel.default()`.

* Let the user pass arguments via the ellipsis (`...`) from `get_refmodel.default()` to `init_refmodel()`. Enhance documentation in `refmodel.R`. Minor cleaning in `refmodel.R`.

* Re-document.

* Minor cleaning in the docs of `project.R`.

* Fix documentation (KL output of `project()`).

* Update and fix documentation (output of `project()`).

* Re-indent documentation (output of `project()`).

* Re-document.

* Minor enhancements in the documentation of `project()`.

* Remove `glm_forward_c()` and `glm_forward()` (unused).

* Improve documentation of argument `seed` and perform minor cleaning.

* Re-document.

* Add a missing ellipsis.

* `linear_multilevel_mle()`: Remove unused argument `regul`.

* Use a roxygen template for arguments `newdata`, `offsetnew`, `weightsnew`.

* Enhance a comment.

* Don't apply the inverse-link within `ref_predfun_datafit()`.

* In `ref_predfun_datafit()`, the case `!is.null(fit)` should not occur (since a `"refmodel"` of additional class `"datafit"` is only possible if `object` (which is later used as `fit`) is `NULL`).

* Simplify `ref_predfun_datafit()`.

* Document that argument `ref_predfun` is degenerate (

i.e., ignored and the internal placeholder always returns `NA`) if argument
`object` is `NULL` (i.e., for a `"refmodel"` of additional class `"datafit"`).

* Minor cleaning in `varsel.R`.

* Minor cleaning in `cv_varsel.R`.

* Minor cleaning in `methods.R`.

* Enhance documentation for `project()`'s argument `nterms`.

* Remove `fit_glm_callback()` (commented).

* Minor cleaning in `divergence_minimizers.R`.

* `parse_args_varsel()`: Improve readability of `nterms_max`-related code.

* `parse_args_varsel()`: Improve readability of `nterms_max`-related and `intercept`-related code.

* `NEWS.md`: Shorten commit hashes to 7 characters.

* Update `NEWS.md`.

* fix #161: avoid Inf nclusters_pred when both NULL

* fix #162: offsetnew not always used in predict.refmodel

* fix #164: suggest_size consistently shouldn't include the intercept

* fix #160: wrong handling of cvfits$omitted

* fix #159: don't include intercept in solution_terms

* fix #158: don't add main effects to match solution_terms

* fix #157: provide informative error when projecting fit object with
cv_search = FALSE

* fix #144: refer to lme4::ranef

* fix #146: error when expanding the formula with a naked group effect

* Fix the fix for issue #159 (no. 2) (#166)

* Fix the fix for issue #159: Consistently return type `character`.

* Make sure `search_forward([...])$solution_terms` has type `character`. (This commit should only be relevant for models without an intercept.)

* Simplify `fetch_submodel()` for the case `nterms == 0`.

* Consistently use `head()`.

* Fix an issue with `filter_nterms` caused by the fix for issue #159.

* Generalize the fix for the `filter_nterms` issue caused by the fix for issue #159 and use this generalized fix when constructing the names for `proj_helper()`'s output object.

* fix #169: set 0 relative penalty for the intercept

* fix #167: add offset slot to d_test

* fix #146: \\|

* Fix `proj_predfun()` for GLMMs (#174)

* `linear_multilevel_proj_predfun()`: Ensure a 1-column matrix as output if there is only 1 projected (clustered) draw, consistently with `linear_proj_predfun()` and `additive_proj_predfun()`.

* `additive_proj_predfun()`: Remove a now obsolete `as.matrix()` call.

* fix #171: l1 search is only supported for GLMs

* Fix tests (#170)

* Adopt `test_as_matrix.R` to the fix of issue #158.

* Adopt `test_project.R` to the fix of issue #159.

* Adopt `test_varsel.R` to the fix of issue #167.

* fix #172: pass weigts[inds]

* Compared to v1.0.6, Rcpp v1.0.7 introduced a minor change in the automatically created `src/RcppExports.cpp` file. (#175)

* fix #167: add d_test offset to relevant places

* Fix `proj_predfun()` for `"datafit"`s (where the case `is.null(beta)` is possible, so `predict.subfit()` needs to make sure to return a matrix in that case). (#177)

* Fix the names of `summary.vsel()$selection` for `"vsel"` objects created by `varsel()` (where `is.null(cv_suffix)`). (#179)

* fix search forward when search_terms are not consecutive in size

* update rcpp

* minor fix undefined variable

* Fix PR #142 (#184)

* Re-apply Rcpp PR #175 (i.e., revert commit 2529121133c1acf1b8a8905e611c79480c39bcf7). (#187)

* Fix commit e529ec1080e45de2b27aaaa6efde8ff5285e0f90 ("minor fix undefined variable"). (#188)

* Collection of minor changes (#189)

* Avoid `R CMD check` NOTE "no visible global function definition for ‘tail’".

* Add package optimx to `Suggests:` field (may be needed when running tests via `R CMD check`).

* Comment `get_replace_response()` (seems to be unused).

* Revert/adjust an incorrect comment modification introduced by PR #153.

* `ref_predfun_datafit()`: Remove unused argument `offset`.

* `print.vselsummary()`: Remove object `nterms_max` (and replace its later occurrence) to

make it clear that `print.vselsummary()` has no purpose other than printing.

* Minor cleaning.

* Define `hf()` (from `kfold_varsel()`) globally.

* Comment `.loo_subsample()` (seems to be unused).

* Fix a typo in a comment.

* Consistently use `inherits()`.

* `project()`: Improve readability of a modification of `cv_search`.

* Remove an unnecessary comment.

* `parse_args_cv_varsel()`: Remove unnecessary redefinitions of `nclusters` and `nclusters_pred`.

* `.get_sub_summaries()`: Remove unused `solution_terms` object.

* Enhance a comment.

* Replace the abbreviation `object$fam$` by `object$family$` to avoid ambiguities.

* Remove unused object `w` in `predict.subfit()`.

* After line `sub_fit <- refmodel$div_minimizer(`: Swap order of arguments `family` and `weights` (for consistency with the definitions of the divergence minimizers).

* Package **rngtools** is not necessary (use `.Random.seed` as is already done at similar places).

* Minor cleaning.

* Fix commit e68c6fd6d19d68e9c121a6a94b740517bc085d3d. (#192)

* Fix issue #185. (#193)

* `select_possible_terms_size()`: Avoid that variables in an interaction term are ordered differently than in the reference model. (#191)

* Fix `"datafit"`s with binomial family and two-column response (#194)

* Improve a comment concerning `.get_standard_y()`.

* `.get_standard_y()`: Add an error for a two-column response and a non-binomial family. Improve an existing error message.

* Fix a bug introduced by PR #193 when calculating `mu` for `"datafit"`s with a binomial family and a two-column response. Also fix another (potential) bug (in `mu <- y / weights`, `weights` could have length zero).

* Fix a bug in `.get_standard_y()` (the second column of a two-column response is the number of failures, not the number of trials).

* Use `unname()` in `.get_standard_y()` consistently.

* Use `+` instead of `rowSums()` to avoid converting integer to numeric.

* Fix offset issues (#196)

* Add a workaround for rstanarm issues stan-dev/rstanarm#541 and stan-dev/rstanarm#542.

* Since `posterior_linpred()` is supposed to include the offsets in its result, subtract them in `ref_predfun()`.

* Improve comments.

* Adopt the tests to PR #196. (#197)

* Enhance/fix `fit_glmer_callback()` (#201)

* Improve indentation.

* `fit_glmer_callback()`: Pass ellipsis (`...`) to `lme4::lmer()` and `lme4::glmer()`.

* Avoid the `preprocess_data()` call in `fit_glmer_callback()` since:
  - the scaling of predictors changes their regression coefficients;
  - `preprocess_data()` is not adopted to categorical predictors;
  - `preprocess_data()`'s implementation for interactions is probably incorrect.

* `fit_glmer_callback()`: Handle the error "pwrssUpdate did not converge in (maxit) iterations".

* Fix PR #196 for `"brmsfit"`s. (#203)

* Fix observation weights (#198)

* Fix observation weights for L1 search and remove other code which is obsolete.

* Don't normalize observation weights.

* Fix a bug in `proj_linpred()` (for `integrated = TRUE` and `is.null(ynew)`). (#209)

* `print.vselsummary()`: Don't print the formula environment. (#214)

* Refactored tests (#217)

* Clean-up code related to reference models (#219)

* Remove argument `folds` (#220)

* `init_refmodel()`: Remove argument `folds` (effectively unused).

* Don't run the brms tests for brms version <= 2.16.1 (see the

warning message added here for details).

* Fixes for the binomial family (#221)

* Improve warnings and errors for the binomial family.

* Fix issue #136.

* Adapt the tests.

* Add test "binomial family with 1-column response and weights which are not all ones warns"

* Fix in `print.vsel()` (argument `digits` ignored) (#222)

* Miscellaneous modifications and fixes (#223)

* Use the S3 system for `solution_terms()`. This also allows to have a `solution_terms.projection()` method.

* Add warnings to `project()` when `cv_search` is modified. Improve error messages. Use the new `solution_terms()` S3 generic.

* Adapt the tests.

* Add a warning when `cv_search` is modified in `varsel()` or `cv_varsel()`. Make the

defaults clear from the function signature. This also fixes a
bug: `is.null(cv_search)` was not the default anymore (neither in `project()`, nor
in `varsel()`, nor in `cv_varsel()`), especially not for `datafit`s where this
condition is used to set `cv_search` to `FALSE`.

* Fix a bug in `predict_multilevel_callback()` (concerning inheritance), namely:

```r
library(rstanarm)
data(bball1970)
options(mc.cores = parallel::detectCores(logical = FALSE))
myfit <- stan_glmer(
  cbind(Hits, AB - Hits) ~ (1 | Player),
  data = bball1970,
  family = binomial("logit"),
  prior_intercept = normal(-1, 1),
  seed = 101
)
library(projpred)
myprj <- project(myfit,
                 solution_terms = c("(1 | Player)"),
                 nclusters = 2)
bball1970_new <- data.frame(Player = "newplayer1", Hits = 11, AB = 45)
mylinpred <- proj_linpred(myprj, newdata = bball1970_new)
```

* Fix a typo.

* Improve readability in `.loo_subsample_pps()`.

* Add an error to `as.matrix.projection()` for objects based on `"datafit"`s.

* Add an error to `project()` for `"datafit"` objects.

* Adapt the tests.

* Remove defaults from `parse_args_cv_varsel()` (should be set in `cv_varsel()` and then `cv_varsel()` should pass all these arguments to `parse_args_cv_varsel()`.

* Clean up `parse_args_cv_varsel()`. Add a warning when `cv_method` is modified. Make the defaults clear from the function signature.

* Adapt the tests.

* `project()`: Move the error for `"datafit"`s to the beginning (
otherwise, `cv_search = FALSE` causes the error "Please provide an `object` of class "vsel" or use `cv_search = TRUE`."
).

* Simplify `.init_kfold_refmodel()` and thereby resolve an inconsistency (which

doesn't seem to have had any consequences up to now, though): Argument `obs` of
`k_refmodel$fetch_data()` referred to the original dataset whereas argument
`obs` of `k_refmodel$family$mu_fun()` referred to the k-fold dataset. This also
affected how `newdata` was subsetted.

* Minor cleaning in `.get_sub_summaries()`.

* Use `as.data.frame()` in `fetch_data()`.

* Remove argument `newdata` of `fetch_data_wrapper()` (which

becomes `$fetch_data()`) since this was only used in
`$mu_fun()` and can be solved more easily there.

* `predict.refmodel()`: Replace `inherits(ynew, "numeric")` by `is.numeric(ynew)` (since
`inherits(1L, "numeric")` is `FALSE` but `is.numeric(1L)` is `TRUE`
).

* Adapt the tests.

* `get_refmodel.stanreg()`: Add an input check for `object$data`.

* Adapt the tests.

* Move `fetch_data()` unchanged from divergence_minimizers.R to refmodel.R (the more appropriate place).

* Improve documentation of `div_minimizer`.

* Minor cleaning in `misc.R`.

* Minor cleaning in `project.R`.

* `project()`: Improve the warning message when dropping solution terms and notify the user of the possible solution terms.

* Adapt the tests.

* `as.matrix.projection()`: Check the class of all submodel fits (there are as many of them as there are clusters of draws), not just of the first one.

* `predict.refmodel()`: Set a default of `newdata = NULL`.

* `predict.refmodel()`: Simplify input checks and remove incorrect ones (`offsetnew` and `weightsnew` may be numeric vectors, for example).

* `predict.refmodel()`: Remove some vertical whitespace.

* `proj_helper()`: Minor cleaning.

* `cv_varsel.R`: Minor cleaning.

* `ref_predfun()` for `"datafit"`s: Minor cleaning.

* Minor cleaning in `methods.R`.

* Ensure that all calls to `extract_model_data()` have all arguments named. This

is important for `extract_model_data()` functions which have `...` instead of
explicit `wrhs` and `orhs` arguments (see, e.g., brms version <= 2.16.1).

* Check the weights and offsets after all calls to `extract_model_data()` (and if

necessary, adapt them). This is important for `extract_model_data()` functions
which return weights and offsets of length 0 (see, e.g., brms version <=
2.16.1).

* Comment out code related to Pareto k-values (currently not needed).

* Minor wording improvement of the error message `ynew must be a numerical vector`.

* In the `!validate_search` part of `loo_varsel()`, the call to

`.get_sub_summaries()` is not needed.

* `varsel()` and `kfold_varsel()`: Rename `p_sub` to `submodels` (for consistency with `cv_varsel()` and to avoid confusion with `p_sel`, `p_pred`, and `p_ref`).

* Minor improvement of an error message for `nloo`.

* `suggest_size.vsel()`: Replace `pct = 0.0` by `pct = 0`.

* Remove argument `weights` (the observation weights) for prediction (not needed). This (#224)

fixes issue #163.

* Fixes for the predictive variance functions (#225)

* Fix a bug in the binomial and Poisson `predvar()`s (currently

not throwing an error in `var[, s, drop = FALSE]` (in `linear_mle()`) due to
lazy evaluation, but unsafe since non-lazy evaluation might be needed in the
future
).

* Adapt the tests.

* Always ensure a N x S matrix as output when calling `family$predvar()` for multiple (clustered) draws, even for N = 1.

* Docs: Activate Markdown syntax (#226)

* Turn on Markdown support for **roxygen2**.

* Re-document with Markdown support enabled (changes

indentation and prepends another `\` to `\` occurrences
).

* Fix percentages after enabling Markdown support in the **roxygen2** documentation.

* Handle issue stan-dev/rstanarm#546. (#227)

* Fix a bug introduced by PR #196 (concerning `predict.refmodel()` in (#228)

* Extend tests (#229)

* Activate tests for GAMs and GAMMs.

* Activate the tests for all families (`fam_nms`).

* Activate `run_cvfits_all`.

* Activate `run_brms` (based on `NOT_CRAN`).

* Activate more `"brmsfit"` tests (more precisely, all `fam_nms`).

* Activate more `"brmsfit"` tests (more precisely, all `mod_nms`).

* For speed reasons: Revert commit "Activate more `"brmsfit"` tests (more precisely, all `mod_nms`)."

* Activate more `"brmsfit"` tests (more precisely, the special formula).

* Restructured and modified divergence minimizers (#230)

* Minor improvements in the general package documentation (and start using some Markdown syntax).

* Call the "callback" fitters in a consistent and safer way and use the ellipsis (`...`) consistently. This

is a key step for letting `fit_glmer_callback()` pass `var` and `regul` to `fit_glm_ridge_callback()`.

* Pass `regul` via `...`.

* Pass the correct `var` from `fit_glmer_callback()` to

`fit_glm_ridge_callback()`. Improve documentation of arguments `ref_predfun`,
`proj_predfun`, and `div_minimizer`.

* `divergence_minimizers.R`: Exclude arguments from `...` which cannot be passed to the final submodel fitters.

* Remove `suppressMessages()` and `suppressWarnings()` where those calls might hide important problems to the user (apart from GAMs and GAMMs; they need to be checked separately).

* After avoiding `suppressMessages()`, explicitly ignore singularity problems in the **lme4** fitters (by default) to avoid messages for these problems again.

* Adapt the tests (now,

`regul` can also have an effect for GLMMs since `fit_glmer_callback()` passes
`regul` via `...` to `fit_glm_ridge_callback()`
).

* `divergence_minimizers.R`: Improve some comments.

* `validate_response_formula()`: Always return a list. This makes it possible to simplify the divergence minimizers.

* Adapt the tests to the modified `validate_response_formula()` (and use `nobsv` predictive variances in `var_crr` of `test_div_minimizer.R`).

* `proj_predfun` functions: Now that `validate_response_formula()` always returns

a list (and the divergence minimizers have been adapted accordingly): Always
expect a list (and simplify the `proj_predfun` functions accordingly).

* `predict_multilevel_callback()` is only used once, so avoid an explicit definition (to keep debugging simple).

* Re-introduce `suppressMessages()` and `suppressWarnings()` when calling

the **lme4** fitters (because of
`Warning in pwrssUpdate([...]): non-integer x = [...]`).

* Simplify the divergence minimizers so that `lapply()` is

used around the call to `refmodel$div_minimizer()` in
`project_submodel()`. This simplifies the introduction of
`foreach()` later when parallelizing the projection (even
though this simplification is not necessary for the
parallelization).

* Introduce a separate formula argument excluding multilevel terms for

additive models. This is necessary to avoid a breaking change for
users with a custom divergence minimizer for additive models.

* Fix `test_div_minimizer.R`

* Restructure `test_div_minimizer.R`

* `proj_predfun` docs: The return value must always be a matrix.

* Minor improvement in the docs for the return value of `div_minimizer`.

* Improve readability of `split_formula_random_gamm4()`.

* Determine the `div_minimizer`s and `proj_predfun`s based on the submodel's formula, not the

reference model's formula. Also simplify the definitions and calls to the
`fit_<...>_callback()` functions by passing as many arguments as possible via
`...`.

* `fit_gamm_callback()`: Move `suppressMessages(suppressWarnings())` to the innermost place (to avoid

suppressing messages and warnings from code outside of the `gamm4()` call.

* Ensure that `fit_glmer_callback()` doesn't get lost in an infinite loop.

* Omit the offsets for the "brnll" `fam_nm` to have some scenarios without offsets (apart from the workarounds for rstanarm GAM and GAMM fits). (#231)

* Refactor `search_forward()` (#232)

* `search_forward()`: Remove unnecessary call to `min()`.

* Simplify (and improve readability of) `search_forward()`.

* Remove obsolete name `"sub_fit"` (that name was confusing because

it was repeated for all model sizes, so the list
`<vsel_object>$search_path$sub_fits` had only duplicated names in
case of forward search).

* Major enhancements in the docs (#233)

* Update roxygen2 version.

* Consistently use `#'` (not `##'`) for roxygen2 comments.

* Speed up the examples in the documentation (and minor other improvements in the docs).

* Major enhancements in the documentation.

* Move documentation for `extract_model_data` to section "Details".

* Enhance documentation (continued).

* Fix issue #190 (note in documentation concerning interactions in L1 search).

* Improve formatting in the `CITATION` file (make it more consistent).

* Enhance documentation (continued).

* Add `print()` to examples (otherwise, only output from the last line will be printed).

* Add documentation for `as.matrix.projection()`.

* Show usage of the `posterior` package (in the `as.matrix.projection()` example).

* Argument `regul` is only relevant for submodels which are GLMs. Unify the documentation for `regul` (two occurrences).

* Show usage of **bayesplot** in the example for `as.matrix.projection()`.

* Enhance documentation (continued; `d_test`).

* Use `@inheritParams varsel` instead of `@template args-vsel`.

* Use a consistent (and enhanced) documentation for arguments `seed` and `.seed`.

* Enhance documentation (continued; fix bugs).

* Use roxygen2 syntax also in a non-roxygen2 comment (to avoid inconsistencies in case that comment is turned into a roxygen2 comment).

* Use roxygen2 syntax also in an invisible documentation (to avoid inconsistencies in case that documentation is turned on).

* In `DESCRIPTION`, put the `Package:` field at the beginning because some search routines require this.

* Enhance the `Description:` field in `DESCRIPTION`.

* Re-order the `DESCRIPTION` fields (affects the ordering in the PDF manual).

* Use Markdown syntax consistently.

* Resolve minor inconsistencies in the documentation.

* Resolve minor inconsistencies regarding the abbreviation "CV".

* Enhance documentation (continued; indentation and other formatting).

* Enhance documentation (continued; `man-roxygen/args-newdata.R`).

* Enhance documentation (continued; mainly `refmodel-init-get`; this probably also

solves issue #156).

* Enhance documentation (continued; `predict.refmodel()`).

* Enhance documentation (continued; mainly `project()`; this probably also solves

a part of issue #154 (namely the documentation) for now).

* Document the now consistent behavior of `suggest_size()` (see issue #164).

* Minor wording improvements.

* Enhance documentation (continued; `projection-linpred-predict`).

* Rename documentation tag `projection-linpred-predict` to `pred-projection`.

* Enhance documentation (continued; `pred-projection`).

* Enhance documentation (continued; mainly `varsel()`).

* Enhance documentation (continued; mainly `cv_varsel()`).

* Use a consistent language for "response" and "predictors".

* Minor improvements in `data.R`.

* Minor improvement for `acc`/`pctcorr`.

* Minor wording improvement.

* Make journal volume numbers bold (as in other R docs).

* Add references for PSIS-LOO CV.

* Fix a typo in a reference.

* Examples: Add a note concerning `chains = 2, iter = 500`.

* Add an example for a custom reference model via `init_refmodel()` (which

partly solves issue #125) and perform minor improvements in related
example code.

* `DESCRIPTION`: Add the `Date:` field (important for the `CITATION` file).

* Fix a typo in the docs (`as.matrix.projection()` example).

* Minor wording improvement in the docs (`as.matrix.projection()` example).

* Docs: Remove "[stat<...>]" for arXiv links, according to the CRAN convention.

* Docs: Add a comment concerning the relevance of

`ndraws<_pred>` and `nclusters<_pred>` in case of `cv_search = FALSE`.

* Docs: Use the terms "search step" and "evaluation step" (as a repeating pattern, this

should be easier to recognize for readers).

* Docs: Avoid the `@details [...] Notes: <list>` construction (looks

strange in the final documentation). Unfortunately, `@note` can't
be used since otherwise, `cv_varsel()`'s documentation couldn't
inherit that section from `varsel()`'s documentation.

* There are no Gaussian-only `stats` (at least it isn't implemented this

way). Therefore, adapt the documentation.

* Docs: Add more details concerning argument `method`.

* Minor wording improvements in the general package documentation (help topic

`projpred-package`).

* References in the docs: Replace "doi:[<...>](<...>)" by "DOI: [<...>](<...>)". In

the docs, this does not seem to be necessary, but in the vignette, the former
way caused the DOI hyperlinks to be rendered incorrectly.

* Docs: Handle section "References" consistently.

* Replace "arxiv" by "arXiv".

* Docs: Replace "search step" and "evaluation step" by

"search part" and "evaluation part", respectively. The reason is a possible
confusion of the term "search step" with the steps of a stepwise variable
selection, see e.g. `?step`. The second reason is that the term "step" might
suggest an EM-like alternation of steps (which is not the case here, since the
search is completed first and only after that, the evaluation takes place).

* Minor wording improvements in the general package documentation (available

at ``?`projpred-package` ``).

* Fix a reflow bug in the general package documentation (available

at ``?`projpred-package` ``).

* `DESCRIPTION`: Minor cleaning.

* `DESCRIPTION`: Reflow to margin width 80.

* Activate the examples in Travis CI checks.

* Miscellaneous modifications and fixes (again) (#234)

* Remove `as.matrix.list()` (neither used nor needed).

* Define the default `baseline` in the `vsel` methods directly (instead of

using `NULL`) and remove the `is.null()` case in `.validate_baseline()`.

* Docs: Mention that for argument `baseline`, the best submodel is decided in terms of `stats[1]`.

* Docs of `summary.vsel()`: Minor wording improvements for argument `type`.

* Docs of argument `baseline`: Mention the relevance.

* Simplify `cvfun()` in `get_refmodel.stanreg()`.

* Set a visible default for argument `K` of `cv_varsel()`.

* Improve y-axis label in `plot.vsel()`.

* There are no Gaussian-only `stats` (at least it isn't implemented this

way). Therefore, adapt the tests.

* Fix issue #210.

* Adapt the tests to the fix for issue #210.

* Harmonize definitions and usages of `%<name>%` binary infix operators.

* Unify the equivalent special binary (infix) operators `%||%` and `%ORifNULL%`.

* Avoid function definitions without curly braces because they're not shown in

RStudio's document outline.

* Remove `.get_proj_handle()` (not necessary and had an incorrect default for `regul`).

* `search_forward()`: Remove unused argument `increasing_order`.

* `projfun.R`: Enhance comments.

* `predict.subfit()`: Don't access `subfit$x` if not necessary.

* Minor cleaning.

* Remove more unused methods.

* `get_refmodel.stanreg()`: Throw an error if **rstanarm** is not installed.

* Replace "search path" by "solution path" (at least where relevant and easily possible) to

avoid a (purely optical) confusion with "search part".

* Parallelization of the projection (#235)

* Parallelize the projection (i.e., parallelize across the projected draws). A

related discussion may be found in issue #77.

* Introduce global option `projpred.nprjdraws_parallel`.

* Extend docs for parallelization.

* Improve the documentation for global option `projpred.nprjdraws_parallel` (and

adapt the code correspondingly).

* Increase the default for option `projpred.nprjdraws_parallel` from

100 to 200 to ensure a speed improvement even for `devtools::load_all()` instead
of `library(projpred)` (for `devtools::load_all()`, parallelization seems to be
a bit slower).

* Iterate over the objects in the `foreach()` loop themselves instead of their index. This should

avoid the export of those (large) objects.

* Use the `.export` argument in the `foreach()` call to handle exports manually. This could

avoid the unnecessary export of (large) objects which are unused.

* Use the `.noexport` argument in the `foreach()` call to handle exports manually. This could

avoid the unnecessary export of (large) objects which are unused.

* Check the most important (largest) objects which should not be exported.

* Set `.noexport` manually.

* Explain why we have the global option `projpred.nprjdraws_parallel`.

* Set `.export` manually (necessary for the `doFuture` package with `options(doFuture.foreach.export = ".export")`).

* Make the parallelization of the projection an optional feature.

* Add comments concerning the parallelization.

* Rename option `projpred.nprjdraws_parallel` to `projpred.prll_prj_trigger`.

* Throw errors if packages **foreach** and **iterators** are not installed.

* Add package **parallel** to `Suggests:`.

* Add package **doFuture** to `Suggests:` (only needed for the parallel tests, though, which are not run on CRAN by default).

* Add packages **future** and **future.callr** to `Suggests:`.

* `bootstrap()`: Remove argument `oobfun` which is not used within **projpred** (and makes future modifications of `bootstrap()` more complicated).

* `bootstrap()`: Replace `seq.int()` by `seq_len()`, as recommended in `?seq.int`.

* `bootstrap()`: Improve a comment.

* Parallelize `bootstrap()`.

* Extend the docs for the parallelized projection (in ``?`projpred-package` ``).

* Test the `"noclust"` setting (for projecting from a reference model) more often.

* Avoid a failing "`offsetnew` works" test due to numerical inaccuracies for extreme values.

* Add test "PRNG is not taking place where not expected".

* Simplify the test "PRNG is not taking place where not expected".

* Rename test
"PRNG is not taking place where not expected"
to
"non-clustered projection does not require a seed".

* Add tests for parallelization.

* Move the code for the sequential `bootstrap()` run from `setup.R` to `test_parallel.R`.

* Revert the parallelization of the bootstrap. The first reason is

that in order for the parallel test results to match the sequential ones
exactly, we would have to use doRNG::`%dorng%`() also in the sequential case of
`bootstrap()` and then, for the sequential results in the tests, not register
any **foreach** backend at all or use `registerDoSEQ()`. The second reason is
that, because of the parallelization overhead, the parallelized bootstrap didn't
provide a speed improvement, at least not up to the number of bootstrap samples
used by default (2000).

* Even if not on CRAN, `R CMD check` imposes a limit of 2 cores (at least by the defaults used in RStudio).

* `vignettes/quickstart_glmm.Rmd`: Replace `ndraws = 10` by `nclusters = 10` (not necessary, but didactically better since an `ndraws` <= 20 is passed to `nclusters`).

* `vignettes/quickstart_glmm.Rmd`: Fix typo.

* `vignettes/quickstart_glmm.Rmd`: Remove unused argument `ns`.

* `vignettes/quickstart_glmm.Rmd`: Fix the parameter names supplied to `mcmc_areas()`.

* vignettes/children/SETTINGS-knitr.txt: Improve the check for the `params$eval` parameter.

* Use `nclusters = 20` (or `nclusters_pred = 20`). Also improve the wording of some text.

* Further reduce the runtime for the vignette.

* vignettes/quickstart_glmm.Rmd: Error corrected (

but `(1 | obs)` and `(1 | phylo)` should be equally important as they are both
somehow estimated to have a hierarchical SD of exactly 1 in the submodels; this
needs to be inspected
).

* vignettes/quickstart_glmm.Rmd: Minor wording improvements.

* vignettes/quickstart_glmm.Rmd: Remove the term `(1 | phylo)` from the model (since we have

`identical(data_pois$phylo, paste0("sp_", data_pois$obs))`, so including both
`obs` and `phylo` doesn't make sense).

* Further reduce the runtime for the vignettes.

* vignettes/quickstart.Rmd: Remove unused argument `ns`.

* Add `rmarkdown::` to the output type of the vignettes.

* Vignettes: Activate the "vignette" YAML field.

* Adapt ignore lists to vignettes (as done by `usethis::use_vignette("<vignette_name>")`).

* vignettes/quickstart_glmm.Rmd: Further reduce the runtime by reducing the number of individuals in the `VerbAgg` dataset.

* Add package "tidyr" to `Suggests` field (needed for building the GLMM vignette).

* Add package "nlme" to `Suggests` field (needed for building the GLMM vignette).

* Remove the html files from folder "vignettes" since CRAN takes the vignettes from "inst/doc" (created automatically by devtools::build(), for example).

* vignettes/quickstart.Rmd: Further reduce the runtime by restricting `nterms_max`.

* Add `rmarkdown` to the `VignetteBuilder` field in the DESCRIPTION file (see <https://cran.r-project.org/doc/manuals/r-release/R-exts.html>).

* Fix indentation.

* Replace the two former vignettes (GLM and GLMM) by a single new one which uses `datasets::CO2` as the only example.

* Update the link in the documentation to the vignette.

* Minor wording improvement after the two vignettes have been replaced by a single one.

* For the new vignette: Add package `datasets` (part of basic R installation) to `Suggests:` and

remove packages `tidyr` and `nlme`.

* Update .gitignore after the two vignettes have been replaced by a single one.

* Remove unused `SETTINGS-<...>.txt` files.

* Change figure size.

* Vignette: Fix a bug (`cv_varsel()` code was not shown).

* Update `.Rbuildignore`.

* Use `within()` for `CO2`.

* Use cache only where necessary and reasonable (see <https://bookdown.org/yihui/rmarkdown-cookbook/cache.html>).

* Since `SETTINGS-knitr.txt` is only used once, we can move its content to the place where it is used.

* Simplify the setup chunk.

* Minor fixes and improvements in the vignette.

* Handle the encoding as suggested by `usethis::use_vignette()`.

* Use the default `comment = '##'` instead of `comment = NA`.

* Add section "Troubleshooting" and explain `varsel()` a bit more.

* Use the terms "search step" and "evaluation step" (as a repeating pattern, this

should be easier to recognize for readers).

* Add more info concerning the failing projected posterior predictive check (PPPC).

* Attach the **rstanarm** package in a separate chunk (needed for caching to work).

* Minor wording improvements.

* Switch back to the GLM example for the vignette (i.e., don't use the

`CO2` GLMM example anymore) and introduce parallelization.

* Use the term "MCMC" more consistently.

* Fix incorrectly rendered DOI hyperlinks.

* Vignette: Replace "search step" and "evaluation step" by

"search part" and "evaluation part", respectively. The reason is a possible
confusion of the term "search step" with the steps of a stepwise variable
selection, see e.g. `?step`. The second reason is that the term "step" might
suggest an EM-like alternation of steps (which is not the case here, since the
search is completed first and only after that, the evaluation takes place).

* Fix a URL.

* Minor wording improvement in the vignette.

* Fix the ordering of the references in the vignette.

* Adapt the package description to the new vignette.

* Update `README.md` to the replacement of the two former vignettes by a single new one.

* Activate the vignette in `R CMD check` and `R CMD build`.

* Activate the vignette in Travis CI checks (and builds).

* Don't use markup in the "title" YAML field because

this causes the following warning when building the **pkgdown** website (see
also r-lib/pkgdown#989); note that setting the option mentioned in the warning
below only helps when building the standalone vignette, not when building the
**pkgdown** website:
```
The vignette title specified in \VignetteIndexEntry{} is different from the
title in the YAML metadata. The former is "projpred: Projection predictive
feature selection", and the latter is "**projpred**: Projection predictive
feature selection". If that is intentional, you may set
options(rmarkdown.html_vignette.check_title = FALSE) to suppress this check.
```

* Add newlines after figures (necessary for the **pkgdown** website).

* Fix <https://github.com/stan-dev/projpred/pull/237#discussion_r732711453>.

* Mention other possible types of reference models. Minor

improvements in some other sections, too.

* Additive models: Add notes that this is currently only an experimental feature.

* Move the content of the setup chunk back to its own

file `vignettes/children/SETTINGS-knitr.Rmd`. This is done for consistency with
other Stan-related R packages. However, note that these "children" files differ
slightly between the Stan-related R packages. Here, the templates were
**cmdstanr**, **bayesplot**, and **rstanarm**.

* Analogously to **cmdstanr** and **bayesplot**, add a TOC to the vignette.

* Wording improvements (see PR review).

* Update `README` (#245)

* Docs: Follow-up for PR #233 etc. (#246)

* Docs: Replace the term "credible" (intervals/bounds) by

"confidence" (intervals/bounds), following the discussion at PR #233. Also
perform some more or less related (minor) improvements.

* Docs: Add reference for Gelman and Hill (2006) (used to be

Gelman and Hill, 2007), following the discussion at PR #233.

* Doc topic `refmodel-init-get`: Ensure that sections don't

seem like subsections of section "Value".

* Refactor `summary.vsel()` to improve readability. (#247)

* Tests: Minor enhancements (#248)

* Tests: Adapt a special case to the fact that the submodel fitter is now

decided based on the submodel's (not the reference model's) formula.

* Minor improvements in `run_snaps` and `run_prll` (and the associated comments).

* Minor improvements in the comments for `run_snaps` and `run_prll`.

* Handle issue stan-dev/rstanarm#551 and other `cvfun` changes (#249)

* Remove `...` from `cvfun` definitions since all calls to `cvfun()` (which is currently only a single one) currently only use argument `folds`.

* Improve an error message related to `cvfun`.

* `cvfun` in `get_refmodel.stanreg()`: Force `cores = 1` in the call to `rstanarm::kfold()` (see issue stan-dev/rstanarm#551).

* Vignette: Set `options(mc.cores = ncores)` to make use of parallelization in `rstanarm::kfold()`.

* Offsets: Wrapper for `ref_predfun` (#250)

* Move the subtraction of the offsets out of the default `ref_predfun` by

introducing a wrapper function around the user-specified (or default)
`ref_predfun`. This has the advantage of avoiding a breaking change for users
with custom `ref_predfun`s.

* Fix a bug when calling `ref_predfun()` with actual `newdata` on

a `stanreg` object with offsets. This was already anticipated in PR #196 and is
now necessary because the subtraction of the offsets was moved out of the
default `ref_predfun` and into a wrapper function. However, this fix is still
just a fragile quick-and-dirty solution. It is more desirable to adopt the
**brms** way, as discussed in PR #196 and issue #218.

* Docs, comments, defaults (#251)

* Argument `transform` of `posterior_linpred()` has a default of `FALSE`.

* `misc.R`: Add a comment concerning the origin of `.is.wholenumber()`.

* Minor wording improvement in docs and a comment.

* Reformat and reword some larger comments in `.get_refdist()`.

* `fetch_data()`: Argument `newdata` has a default of `NULL`.

* Warnings for experimental features (#252)

* Add a warning for the experimental `Student_t()` family (see issue #233).

* Add a warning for the experimental subsampled LOO CV (see issue #94).

* Adapt the tests to the warning for subsampled LOO CV.

* Additive models: Add a note in the docs and a warning in `init_refmodel()` that

this is currently only an experimental feature. See the comment at
<github.com/stan-dev/projpred/pull/237#issuecomment-961151653>.

* Tests: Exclude additive models (GAMs and GAMMs) for now, following

commit "Additive models: Add a note in the docs and a warning in `init_refmodel()` that [...]".

* Add a warning for the Gamma family (support for this is only experimental; see

issue #240).

* Fix issue #242 (#253)

* Fix issue #242 (Refactors and bugs in `.init_submodel()`), i.e., perform th…
fweber144 added a commit to fweber144/modelselection that referenced this issue May 10, 2022
experimental in projpred (see stan-dev/projpred#94). Setting `nloo = n` should not change results compared to omitting argument `nloo`, but I guess it's better not to point to an experimental feature where not needed.
@fweber144
Copy link
Collaborator Author

See also

I also noticed that <vsel_object>$summaries$ref$w (so w for the reference model) is always NULL which is strange because it looks like lines

summ <- summ_ref
res <- get_stat(summ$mu, summ$lppd, varsel$d_test, stat, mu.bs = mu.bs,
lppd.bs = lppd.bs, weights = summ$w, alpha = alpha, ...)
expect a non-NULL element w also for the reference model (which would also make sense if w is supposed to contain the weights of the CV folds the observations are in because the reference model's performance is also cross-validated if the submodels' performances are).

in this comment. When fixing subsampled LOO CV, I guess that also needs to be fixed.

@fweber144
Copy link
Collaborator Author

More issues/questions concerning subsampled LOO CV:

  1. NAs in get_stat()'s baseline objects mu.bs and lppd.bs are not handled consistently across the different stats (I think the handling from RMSE and AUC also needs to be used for all other stats). Furthermore, I think n_notna needs to be updated according to the NAs in the baseline objects mu.bs and lppd.bs if baseline = "best". I have now (PR Fix #349 #350) also added n_notna.bs which could perhaps be re-used.
  2. For the AUC, the modified mu doesn't even seem to be used (see lines

    projpred/R/summary_funs.R

    Lines 295 to 307 in e7080db

    auc.data <- cbind(y, mu, wcv)
    if (!is.null(mu.bs)) {
    mu.bs[is.na(mu)] <- NA # compute the AUCs using only those points
    mu[is.na(mu.bs)] <- NA # for which both mu and mu.bs are non-NA
    auc.data.bs <- cbind(y, mu.bs, wcv)
    value <- auc(auc.data) - auc(auc.data.bs)
    value.bootstrap1 <- bootstrap(auc.data, auc, ...)
    value.bootstrap2 <- bootstrap(auc.data.bs, auc, ...)
    value.se <- sd(value.bootstrap1 - value.bootstrap2, na.rm = TRUE)
    lq_uq <- quantile(value.bootstrap1 - value.bootstrap2,
    probs = c(alpha_half, one_minus_alpha_half),
    names = FALSE, na.rm = TRUE)
    } else {
    ).
  3. It seems like loo_varsel()$d_test always contains all observations as test points, but line
    res_diff <- get_stat(summ$mu, summ$lppd, varsel$d_test, stat,
    doesn't reduce this down to the subsampled observations. Is this on purpose?

@fweber144 fweber144 changed the title LOO subsampling weights in cv_varsel.refmodel() Subsampled LOO Jun 30, 2023
fweber144 added a commit to fweber144/projpred that referenced this issue Nov 3, 2023
(distracts from the fact that because of possible bugs and possible
methodological deficiencies described in stan-dev#94, the `nloo` feature is experimental
in the first place).
fweber144 added a commit to fweber144/projpred that referenced this issue Nov 3, 2023
(distracts from the fact that because of possible bugs and possible
methodological deficiencies described in stan-dev#94, the `nloo` feature is experimental
in the first place).
fweber144 added a commit to fweber144/projpred that referenced this issue Nov 6, 2023
(distracts from the fact that because of possible bugs and possible
methodological deficiencies described in stan-dev#94, the `nloo` feature is experimental
in the first place).
fweber144 added a commit that referenced this issue Nov 7, 2023
(distracts from the fact that because of possible bugs and possible
methodological deficiencies described in #94, the `nloo` feature is experimental
in the first place).
fweber144 added a commit to fweber144/projpred that referenced this issue Nov 9, 2023
subsampled PSIS-LOO CV (`nloo`) with `deltas = TRUE` and `baseline = "best"`.

For `baseline = "ref"`, this is only a refactor improving the safety and
readability of `get_stat()`'s handling of `NA`s because for `baseline = "ref"`,
we should always have no `NA`s in `lppd.bs` and `mu.bs`, so in that case,
`n_notna` did not require an adjustment and also because the math operations
connecting `mu` and `mu.bs` (analogously for `lppd` and `lppd.bs`) ensured that
only the "inner join" of non-`NA` elements (i.e., the set of observations for
which both `mu` and `mu.bs` (analogously for `lppd` and `lppd.bs`) are not `NA`)
is used.

This addresses question 1 from
<stan-dev#94 (comment)>.
fweber144 added a commit to fweber144/projpred that referenced this issue Nov 9, 2023
@fweber144
Copy link
Collaborator Author

The comment from #94 (comment) has been addressed by commit 2bd6b4e (within PR #475).

Question 1 from #94 (comment) has now been addressed by commit 7232148 (within PR #475).

Question 2 from #94 (comment) has now been addressed by commit d109c37 (within PR #475).

Question 3 from #94 (comment) doesn't need to be addressed (i.e., it can be ignored) because that behavior was probably indeed on purpose (non-subsampled observations have NA as predicted value and so they will be ignored by get_stat()).

@fweber144
Copy link
Collaborator Author

For the future: In the code, I have added comments starting with TODO (subsampled PSIS-LOO CV) which need to be addressed. However, addressing these comments does not guarantee a correctly behaving subsampled PSIS-LOO CV (in particular, there are probably deeper methodological issues that need to be addressed as well, see above).

@fweber144
Copy link
Collaborator Author

And a question to @AlejandroCatalina: Is it on purpose that the bootstrap samples are drawn from the set of all observations (i.e., with the NA observations included)? Couldn't it also make sense to draw the bootstrap samples from only the subsampled observations (i.e., the non-NA observations)?

@AlejandroCatalinaF
Copy link

Yeah it would be okay too I think

@fweber144
Copy link
Collaborator Author

@avehtari: Still to do (or to check) here:

@avehtari
Copy link
Collaborator

(Sorry for not checking this 2 years ago!)

  • loo_ref_oscale <- apply(loglik_forPSIS + lw, 2, log_sum_exp) at

    loo_ref_oscale <- apply(loglik_forPSIS + lw, 2, log_sum_exp)

    is equivalent to pointwise elpd_loo's

  • loo_subsample_pps function has wcv <- exp(lppd - max(lppd)) which is wrong, and should be either abs(lppd) or -lppd + max(lppd) +1.

  • loo_subsample_pps returns the weights, but they are not used elsewhere in the code, and thus the current code is not actually doing PPS

  • After re-reading both PPS and difference estimator papers, I think we should use difference estimator.

  • The main purpose of using subsampling LOO in projpred is when validate_search=TRUE so that we can reduce the number of times the search is repeated. If we assume that full LOO has been used for the full data search path, we can use full data LOO for model size k as an approximation for all models of size k. When estimating the LOO performance in validate_search for a model size k, the first term in eq (7) is the full data model size k model elpd (using the reference model PSIS-LOO weights), and in the second term \pi_j is the sum of pointwise elpds from different search paths for model size k. We get the elpd difference to the reference model correspondingly. The benefit of difference estimator with SRS is that we get the diff_se via eq (9).

  • Difference estimator with SRS is using resampling without replacement, and there are no weights.

@avehtari
Copy link
Collaborator

Here is a demonstration of subsampling LOO with simple random sampling and difference estimator

library(rstanarm)
library(projpred)
library(tidyr)
library(ggplot2)

data("df_gaussian")
dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x)

# Prior guess for the number of relevant (i.e., non-zero) regression
# coefficients:
p0 <- 5
# Number of observations:
N <- nrow(dat_gauss)
# Hyperprior scale for tau, the global shrinkage parameter (note that for the
# Gaussian family, 'rstanarm' will automatically scale this by the residual
# standard deviation):
tau0 <- p0 / (D - p0) * 1 / sqrt(N)
( D <- sum(grepl("^X", names(dat_gauss))) )

set.seed(5078022)
refm_fit <- stan_glm(
  y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10 + X11 + X12 + X13 + X14 +
    X15 + X16 + X17 + X18 + X19 + X20,
  family = gaussian(),
  data = dat_gauss,
  prior = hs(global_scale = tau0),
  ### Only for the sake of speed (not recommended in general):
  chains = 2, iter = 1000,
  ###
  refresh = 0
)


refm_obj <- get_refmodel(refm_fit)

# Fast with validate_search = FALSE
cvvs_fast <- cv_varsel(
  refm_obj,
  validate_search = FALSE,
  nterms_max = 20
)

plot(cvvs_fast, alpha=0.1, deltas=TRUE,
     text_angle=45, size_position = 'primary_x_top') +
  geom_vline(xintercept=seq(0,20,by=5), colour='black', alpha=0.1)

# For comparison purposes slow with validate_search = TRUE and no subloo
cvvs <- cv_varsel(
  cvvs_fast,
  validate_search = TRUE
)

plot(cvvs, alpha=0.1, deltas=TRUE, show_cv_proportions=FALSE,
     text_angle=45, size_position = 'primary_x_top') +
  geom_vline(xintercept=seq(0,20,by=5), colour='black', alpha=0.1)

# Subsampling LOO with difference estimator and simple resampling
# starts with the simple resampling. For demonstration, we use a
# subsample of nloo=25. nloo should not be too small, but in this case
# the full N is only 100, so we use quite small nloo for demonstration
# purposes.
nloo=25
set.seed(68)
# simple random sample
sub_inds=sample(1:N, nloo)

# For subsampling LOO with difference estimator we need surrogate
# results that are close to full loo results. In case of search we can
# expect that the fast submodel LOO with validate_search=FALSE to be a
# good surrogate.
#
cvvs_fast_perf <- performances(cvvs_fast)
cvvs_perf <- performances(cvvs)
# For Intercept only and the full model the validate_search=TRUE
# results are the same as with validate_search=FALSE.
elpd_diffe <- numeric()
k <- 0
elpd_diffe[k+1] = sum(cvvs_fast$summaries$sub[[k+1]]$lppd)
k <- 20
elpd_diffe[k+1] = sum(cvvs_fast$summaries$sub[[k+1]]$lppd)
elpd_sub <- elpd_diffe
# For the rest we use the difference estimator
for (k in 1:(20-1)) {
  # proxy is the the corresponding validate_search=FALSE result
  elpd_loo_hat <- cvvs_fast$summaries$sub[[k+1]]$lppd
  # For demonstration purposes, instead of running the search with
  # subsampling, we subsample results from the run with
  # validate_search=TRUE and nloo=N
  elpd_loo_sub <- cvvs$summaries$sub[[k+1]]$lppd[sub_inds]
  # The difference estimator
  elpd_diffe[k+1] <- sum(elpd_loo_hat) + N/nloo * sum(elpd_loo_sub - elpd_loo_hat[sub_inds])
  # As comparison compute the simple subsample average
  elpd_sub[k+1] <- N/nloo * sum(elpd_loo_sub)
}

# Compare 1) validate_search=FALSE elpd, 2) validate_search=TRUE with
# subsampling difference estimator elpd, 3) validate_search=TRUE with
# subsampling with subsample average
data.frame(k=0:20,
           fast=cvvs_fast_perf$submodels[,'elpd']-cvvs_perf$submodels[,'elpd'],
           diffe=elpd_diffe-cvvs_perf$submodels[,'elpd'],
           simple=elpd_sub-cvvs_perf$submodels[,'elpd'])|>
  pivot_longer(cols = !k,
               names_to = 'method',
               values_to = 'elpd') |>
  ggplot(aes(x=k, y=elpd, color=method)) +
  geom_line() +
  labs(y='difference to slow elpd')

image

@avehtari
Copy link
Collaborator

loo package has diff_srs_est() function, which estimates also the variance (needed for SEs), but the code has some mismatch with the corresponding paper, and the estimated variance can be negative, so waiting stan-dev/loo#237 to be solved, before continuing here

@fweber144
Copy link
Collaborator Author

Thanks for the clarifications. Now it's clear to me what projpred is supposed to do in case of subsampled LOO.

@fweber144
Copy link
Collaborator Author

Concerning

loo_subsample_pps returns the weights, but they are not used elsewhere in the code

I think these weights (called wcv in the code) are used at some places:

  • In auc():
    wcv <- x[, 3]
  • In get_stat():

    projpred/R/summary_funs.R

    Lines 286 to 287 in 9860c0e

    get_stat <- function(mu, lppd, y_wobs_test, stat, mu.bs = NULL, lppd.bs = NULL,
    wcv = NULL, alpha = 0.1, ...) {
    (calls of get_stat() with non-NULL argument wcv are:

    projpred/R/summary_funs.R

    Lines 215 to 217 in 9860c0e

    res_diff <- get_stat(summ$mu, summ$lppd, varsel$y_wobs_test, stat,
    mu.bs = summ_ref$mu, lppd.bs = summ_ref$lppd,
    wcv = summ$wcv, alpha = alpha, ...)

    projpred/R/summary_funs.R

    Lines 241 to 246 in 9860c0e

    res <- get_stat(summ$mu, summ$lppd, varsel$y_wobs_test, stat,
    mu.bs = mu.bs, lppd.bs = lppd.bs, wcv = summ$wcv,
    alpha = alpha, ...)
    diff <- get_stat(summ$mu, summ$lppd, varsel$y_wobs_test, stat,
    mu.bs = summ_ref$mu, lppd.bs = summ_ref$lppd,
    wcv = summ$wcv, alpha = alpha, ...)
    ).

I just wanted to add this for the sake of completeness. As you said, simple random sampling (SRS, combined with the difference estimator) doesn't require weights.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bugs.
Projects
None yet
Development

No branches or pull requests

4 participants