Merge pull request #372 from stan-dev/pareto_diags_vignette

new Pareto-k diagnostics vignette
stan-dev · Jun 28, 2024 · 470075c · 470075c
2 parents 44fc43e + 128cf09
commit 470075c
Show file tree

Hide file tree

Showing 2 changed files with 246 additions and 2 deletions.
diff --git a/vignettes/pareto_diagnostics.Rmd b/vignettes/pareto_diagnostics.Rmd
@@ -0,0 +1,244 @@
+---
+title: "Pareto-khat diagnostics"
+author: "Aki Vehtari"
+output: 
+  rmarkdown::html_vignette:
+    toc: true
+    toc_depth: 3
+vignette: >
+  %\VignetteIndexEntry{Pareto-khat diagnostics}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+## Introduction 
+
+The paper
+
+* Aki Vehtari, Daniel Simpson, Andrew Gelman, Yuling Yao, and
+  Jonah Gabry (2024). Pareto smoothed importance sampling.
+  *Journal of Machine Learning Research*, 25(72):1-58.
+
+presents Pareto smoothed importance sampling, but also
+Pareto-$\hat{k}$ diagnostic that can be used when estimating any
+expectation based on finite sample. This vignette illustrates the use of
+these diagnostics.
+The individual diagnostic functions are `pareto_khat()`, `pareto_min_ss()`, `pareto_convergence_rate()` and `pareto_khat_threshold()`. The function `pareto_diags()` will return all of these.
+
+Additionally, the `pareto_smooth()` function can be used to transform draws by smoothing the tail(s).
+## Example
+
+```{r setup}
+library(posterior)
+library(dplyr)
+options(pillar.neg = FALSE, pillar.subtle=FALSE, pillar.sigfig=2)
+```
+
+### Simulated data
+
+Generate `xn` a simulated MCMC sample with 4 chains each with 1000
+iterations using AR process with marginal normal(0,1)
+
+```{r simulate-data-1}
+N <- 1000
+phi <- 0.3
+set.seed(6534)
+dr <- array(data=replicate(4,as.numeric(arima.sim(n = N,
+                                                list(ar = c(phi)),
+                                                sd = sqrt((1-phi^2))))),
+         dim=c(N,4,1)) %>%
+  as_draws_df() %>%
+  set_variables('xn')
+```
+
+Transform `xn` via cdf-inverse-cdf so that we have variables that
+have marginally distributions $t_3$, $t_{2.5}$, $t_2$, $t_{1.5}$,
+and $t_1$. These all have thick tails. In addition $t_2$,
+$t_{1.5}$, and $t_1$ have infinite variance, and $t_1$ (aka Cauchy)
+has infinite mean. 
+
+```{r simulate-data-2}
+drt <- dr %>%
+  mutate_variables(xt3=qt(pnorm(xn), df=3),
+                   xt2_5=qt(pnorm(xn), df=2.5),
+                   xt2=qt(pnorm(xn), df=2),
+                   xt1_5=qt(pnorm(xn), df=1.5),
+                   xt1=qt(pnorm(xn), df=1))
+```
+
+### MCMC convergence diagnostics
+
+We examine the draws with the default `summarise_draws()`.
+
+```{r summarise_draws}
+drt %>%
+  summarise_draws()
+```
+
+All the usual convergence diagnostics $\widehat{R}$, Bulk-ESS, and
+Tail-ESS look good, which is fine as they have been designed to
+work also with infinite variance (Vehtari et al., 2020).
+
+If these variables would present variables of interest for which
+we would like to estimate means, we would be also interested in
+Monte Carlo standard error (MCSE, see case study [How many
+iterations to run and how many digits to
+report](https://users.aalto.fi/~ave/casestudies/Digits/digits.html)).
+
+```{r summarise_draws-mcse}
+drt %>%
+  summarise_draws(mean, sd, mcse_mean, ess_bulk, ess_basic)
+```
+
+Here MCSE for mean is based on standard deviation and Basic-ESS,
+but these assume finite variance. We did sample also from
+distributions with infinite variance, but given a finite sample size, the
+empirical variance estimates are always finite, and thus we get
+overoptimistic MCSE.
+
+## Pareto-$\hat{k}$
+
+To diagnose whether our variables of interest may have infinite
+variance and even infinite mean, we can use Pareto-$\hat{k}$
+diagnostic.
+
+```{r summarise_draws-pareto_khat}
+drt %>%
+  summarise_draws(mean, sd, mcse_mean, ess_basic, pareto_khat)
+```
+
+$\hat{k} \leq 0$ indicates that all moments exist, and the inverse
+of positive $\hat{k}$ tells estimate for the number of finite (fractional)
+moments. Thus, $\hat{k}\geq 1/2$ indicates infinite variance,
+and $\hat{k}\geq 1$ indicates infinite mean. Sometimes very thick
+distribution tails may affect also sampling, but assuming sampling
+did go well, and we would be interested only in quantiles, infinite
+variance and mean are not a problem. But if we are interested in mean,
+then we need to care about the number of (fractional) moments. Here we
+see $\hat{k} \geq 1/2$ for $t_2$, $t_{1.5}$, and $t_{1}$, and
+we should not trust their `mcse_mean` values. Without trustworthy MCSE
+estimate we don't have good estimate of how accurate the mean estimate is.
+Furthermore, as $\hat{k} \geq 1$ for $t_{1}$, the mean is not finite and
+the mean estimate is not valid.
+
+## Pareto smoothing
+
+If we really do need those mean estimates, we can improve
+trustworthiness by Pareto smoothing,  which replaces
+extreme tail draws with expected ordered statistics of Pareto
+distribution fitted to the tails of the distribution. Pareto
+smoothed mean estimate (computed using Pareto smoothed draws) has
+finite variance with a cost of some bias which we know when it is
+negligible. As a thumb rule when $\hat{k}<0.7$, the bias is
+negligible.
+
+We do Pareto smoothing for all the variables.
+
+```{r pareto_smooth}
+drts <- drt %>% 
+  mutate_variables(xt3_s=pareto_smooth(xt3),
+                   xt2_5_s=pareto_smooth(xt2_5),
+                   xt2_s=pareto_smooth(xt2),
+                   xt1_5_s=pareto_smooth(xt1_5),
+                   xt1_s=pareto_smooth(xt1)) %>%
+  subset_draws(variable="_s", regex=TRUE)
+```
+
+Now the `mcse_mean` values are more trustworthy when $\hat{k} < 0.7$.
+When $\hat{k}>0.7$ both bias and variance grow so fast that Pareto smoothing
+rarely helps (see more details in the paper).
+
+```{r summarise_draws-pareto_khat-2}
+drts %>%
+  summarise_draws(mean, mcse_mean, ess_basic, pareto_khat)
+```
+
+
+## Minimum sample size required
+
+The bias and variance depend on the sample size, and we can
+use additional diagnostic `min_ss` which tells the minimum sample size needed
+so that `mcse_mean` can be trusted.
+
+```{r summarise_draws-min_ss}
+drt %>%
+  summarise_draws(mean, mcse_mean, ess_basic,
+                  pareto_khat, min_ss=pareto_min_ss)
+```
+
+Here required `min_ss` is smaller than `ess_basic` for all except $t_1$, for
+which there is no hope.
+
+## Convergence rate
+
+Given finite variance, the central limit theorem states that to halve
+MCSE we need four times bigger sample size. With Pareto smoothing,
+we can go further, but the convergence rate decreases when $\hat{k}$ increases.
+
+```{r summarise_draws-conv_rate}
+drt %>%
+  summarise_draws(mean, mcse_mean, ess_basic,
+                  pareto_khat, min_ss=pareto_min_ss,
+                  conv_rate=pareto_convergence_rate)
+```
+
+We see that with $t_2$, $t_{1.5}$, and $t_1$  we need $4^{1/0.86}\approx 5$,
+$4^{1/0.60}\approx 10$, and $4^{1/0}\approx \infty$ times bigger sample sizes to
+halve MCSE for mean.
+
+## Pareto-$\hat{k}$-threshold
+
+The final Pareto diagnostic, $\hat{k}$-threshold, is useful for smaller sample sizes. Here we select only 100 iterations per chain to get total of 400 draws.
+
+```{r summarise_draws-khat_threshold}
+drt %>%
+  subset_draws(iteration=1:100) %>%
+  summarise_draws(mean, mcse_mean, ess_basic,
+                  pareto_khat, min_ss=pareto_min_ss,
+                  khat_thres=pareto_khat_threshold,
+                  conv_rate=pareto_convergence_rate)
+```
+
+With only 400 draws, we can trust the Pareto smoothed result only when
+$\hat{k}<0.62$. For $t_{1.5}$ $\hat{k}\approx 0.64$, and `min_ss` reveals
+we would probably need more than 560 draws to be on the safe side.
+
+## Pareto diagnostics
+
+We can get all these diagnostics with `pareto_diags()`, and it's
+easy to use it also for derived quantities.
+
+```{r summarise_draws-pareto_diags}
+drt %>%
+  mutate_variables(xt2_5_sq=xt2_5^2) %>%
+  subset_draws(variable="xt2_5_sq") %>%
+  summarise_draws(mean, mcse_mean,
+                  pareto_diags)
+```
+
+## Discussion
+
+All these diagnostics are presented in Section 3 and summarized in
+Table 1 in PSIS paper (Vehtari et al., 2024).
+
+If you don't need to estimate means of thick tailed distributions,
+and there are no sampling issues due to thick tails, then you don't
+need to check existence of finite variance, and thus there is no
+need to check Pareto-$\hat{k}$ for all the parameters and derived
+quantities.
+
+It is possible that the distribution has finite variance, but
+pre-asymptotically given a finite sample size the behavior can be
+similar to infinite variance. Thus the diagnostic is useful even in
+cases where theory guarantees finite variance.
+
+## Reference
+
+Vehtari, A., Simpson, D., Gelman, A., Yao, Y., & Gabry, J. (2024).
+Pareto smoothed importance sampling.
+*Journal of Machine Learning Research*, 25(72):1-58.
+
+Vehtari A., Gelman A., Simpson D., Carpenter B., & Bürkner P. C. (2020).
+Rank-normalization, folding, and localization: An improved Rhat for assessing
+convergence of MCMC. *Bayesian Analysis*, 16(2):667-718.
+
diff --git a/vignettes/posterior.Rmd b/vignettes/posterior.Rmd
@@ -396,11 +396,11 @@ with `draws` objects, please open an issue at
 
 ## References
 
-Gelman A., Carlin J. B., Stern H. S., David B. Dunson D. B., Aki Vehtari A.,
+Gelman A., Carlin J. B., Stern H. S., David B. Dunson D. B., Vehtari, A.,
 & Rubin D. B. (2013). *Bayesian Data Analysis, Third Edition*. Chapman and
 Hall/CRC.
 
 Vehtari A., Gelman A., Simpson D., Carpenter B., & Bürkner P. C. (2020).
 Rank-normalization, folding, and localization: An improved Rhat for assessing
-convergence of MCMC. *Bayesian Analysis*.
+convergence of MCMC. *Bayesian Analysis*, 16(2):667-718.