-
-
Notifications
You must be signed in to change notification settings - Fork 24
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #372 from stan-dev/pareto_diags_vignette
new Pareto-k diagnostics vignette
- Loading branch information
Showing
2 changed files
with
246 additions
and
2 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,244 @@ | ||
--- | ||
title: "Pareto-khat diagnostics" | ||
author: "Aki Vehtari" | ||
output: | ||
rmarkdown::html_vignette: | ||
toc: true | ||
toc_depth: 3 | ||
vignette: > | ||
%\VignetteIndexEntry{Pareto-khat diagnostics} | ||
%\VignetteEngine{knitr::rmarkdown} | ||
%\VignetteEncoding{UTF-8} | ||
--- | ||
|
||
## Introduction | ||
|
||
The paper | ||
|
||
* Aki Vehtari, Daniel Simpson, Andrew Gelman, Yuling Yao, and | ||
Jonah Gabry (2024). Pareto smoothed importance sampling. | ||
*Journal of Machine Learning Research*, 25(72):1-58. | ||
|
||
presents Pareto smoothed importance sampling, but also | ||
Pareto-$\hat{k}$ diagnostic that can be used when estimating any | ||
expectation based on finite sample. This vignette illustrates the use of | ||
these diagnostics. | ||
The individual diagnostic functions are `pareto_khat()`, `pareto_min_ss()`, `pareto_convergence_rate()` and `pareto_khat_threshold()`. The function `pareto_diags()` will return all of these. | ||
|
||
Additionally, the `pareto_smooth()` function can be used to transform draws by smoothing the tail(s). | ||
## Example | ||
|
||
```{r setup} | ||
library(posterior) | ||
library(dplyr) | ||
options(pillar.neg = FALSE, pillar.subtle=FALSE, pillar.sigfig=2) | ||
``` | ||
|
||
### Simulated data | ||
|
||
Generate `xn` a simulated MCMC sample with 4 chains each with 1000 | ||
iterations using AR process with marginal normal(0,1) | ||
|
||
```{r simulate-data-1} | ||
N <- 1000 | ||
phi <- 0.3 | ||
set.seed(6534) | ||
dr <- array(data=replicate(4,as.numeric(arima.sim(n = N, | ||
list(ar = c(phi)), | ||
sd = sqrt((1-phi^2))))), | ||
dim=c(N,4,1)) %>% | ||
as_draws_df() %>% | ||
set_variables('xn') | ||
``` | ||
|
||
Transform `xn` via cdf-inverse-cdf so that we have variables that | ||
have marginally distributions $t_3$, $t_{2.5}$, $t_2$, $t_{1.5}$, | ||
and $t_1$. These all have thick tails. In addition $t_2$, | ||
$t_{1.5}$, and $t_1$ have infinite variance, and $t_1$ (aka Cauchy) | ||
has infinite mean. | ||
|
||
```{r simulate-data-2} | ||
drt <- dr %>% | ||
mutate_variables(xt3=qt(pnorm(xn), df=3), | ||
xt2_5=qt(pnorm(xn), df=2.5), | ||
xt2=qt(pnorm(xn), df=2), | ||
xt1_5=qt(pnorm(xn), df=1.5), | ||
xt1=qt(pnorm(xn), df=1)) | ||
``` | ||
|
||
### MCMC convergence diagnostics | ||
|
||
We examine the draws with the default `summarise_draws()`. | ||
|
||
```{r summarise_draws} | ||
drt %>% | ||
summarise_draws() | ||
``` | ||
|
||
All the usual convergence diagnostics $\widehat{R}$, Bulk-ESS, and | ||
Tail-ESS look good, which is fine as they have been designed to | ||
work also with infinite variance (Vehtari et al., 2020). | ||
|
||
If these variables would present variables of interest for which | ||
we would like to estimate means, we would be also interested in | ||
Monte Carlo standard error (MCSE, see case study [How many | ||
iterations to run and how many digits to | ||
report](https://users.aalto.fi/~ave/casestudies/Digits/digits.html)). | ||
|
||
```{r summarise_draws-mcse} | ||
drt %>% | ||
summarise_draws(mean, sd, mcse_mean, ess_bulk, ess_basic) | ||
``` | ||
|
||
Here MCSE for mean is based on standard deviation and Basic-ESS, | ||
but these assume finite variance. We did sample also from | ||
distributions with infinite variance, but given a finite sample size, the | ||
empirical variance estimates are always finite, and thus we get | ||
overoptimistic MCSE. | ||
|
||
## Pareto-$\hat{k}$ | ||
|
||
To diagnose whether our variables of interest may have infinite | ||
variance and even infinite mean, we can use Pareto-$\hat{k}$ | ||
diagnostic. | ||
|
||
```{r summarise_draws-pareto_khat} | ||
drt %>% | ||
summarise_draws(mean, sd, mcse_mean, ess_basic, pareto_khat) | ||
``` | ||
|
||
$\hat{k} \leq 0$ indicates that all moments exist, and the inverse | ||
of positive $\hat{k}$ tells estimate for the number of finite (fractional) | ||
moments. Thus, $\hat{k}\geq 1/2$ indicates infinite variance, | ||
and $\hat{k}\geq 1$ indicates infinite mean. Sometimes very thick | ||
distribution tails may affect also sampling, but assuming sampling | ||
did go well, and we would be interested only in quantiles, infinite | ||
variance and mean are not a problem. But if we are interested in mean, | ||
then we need to care about the number of (fractional) moments. Here we | ||
see $\hat{k} \geq 1/2$ for $t_2$, $t_{1.5}$, and $t_{1}$, and | ||
we should not trust their `mcse_mean` values. Without trustworthy MCSE | ||
estimate we don't have good estimate of how accurate the mean estimate is. | ||
Furthermore, as $\hat{k} \geq 1$ for $t_{1}$, the mean is not finite and | ||
the mean estimate is not valid. | ||
|
||
## Pareto smoothing | ||
|
||
If we really do need those mean estimates, we can improve | ||
trustworthiness by Pareto smoothing, which replaces | ||
extreme tail draws with expected ordered statistics of Pareto | ||
distribution fitted to the tails of the distribution. Pareto | ||
smoothed mean estimate (computed using Pareto smoothed draws) has | ||
finite variance with a cost of some bias which we know when it is | ||
negligible. As a thumb rule when $\hat{k}<0.7$, the bias is | ||
negligible. | ||
|
||
We do Pareto smoothing for all the variables. | ||
|
||
```{r pareto_smooth} | ||
drts <- drt %>% | ||
mutate_variables(xt3_s=pareto_smooth(xt3), | ||
xt2_5_s=pareto_smooth(xt2_5), | ||
xt2_s=pareto_smooth(xt2), | ||
xt1_5_s=pareto_smooth(xt1_5), | ||
xt1_s=pareto_smooth(xt1)) %>% | ||
subset_draws(variable="_s", regex=TRUE) | ||
``` | ||
|
||
Now the `mcse_mean` values are more trustworthy when $\hat{k} < 0.7$. | ||
When $\hat{k}>0.7$ both bias and variance grow so fast that Pareto smoothing | ||
rarely helps (see more details in the paper). | ||
|
||
```{r summarise_draws-pareto_khat-2} | ||
drts %>% | ||
summarise_draws(mean, mcse_mean, ess_basic, pareto_khat) | ||
``` | ||
|
||
|
||
## Minimum sample size required | ||
|
||
The bias and variance depend on the sample size, and we can | ||
use additional diagnostic `min_ss` which tells the minimum sample size needed | ||
so that `mcse_mean` can be trusted. | ||
|
||
```{r summarise_draws-min_ss} | ||
drt %>% | ||
summarise_draws(mean, mcse_mean, ess_basic, | ||
pareto_khat, min_ss=pareto_min_ss) | ||
``` | ||
|
||
Here required `min_ss` is smaller than `ess_basic` for all except $t_1$, for | ||
which there is no hope. | ||
|
||
## Convergence rate | ||
|
||
Given finite variance, the central limit theorem states that to halve | ||
MCSE we need four times bigger sample size. With Pareto smoothing, | ||
we can go further, but the convergence rate decreases when $\hat{k}$ increases. | ||
|
||
```{r summarise_draws-conv_rate} | ||
drt %>% | ||
summarise_draws(mean, mcse_mean, ess_basic, | ||
pareto_khat, min_ss=pareto_min_ss, | ||
conv_rate=pareto_convergence_rate) | ||
``` | ||
|
||
We see that with $t_2$, $t_{1.5}$, and $t_1$ we need $4^{1/0.86}\approx 5$, | ||
$4^{1/0.60}\approx 10$, and $4^{1/0}\approx \infty$ times bigger sample sizes to | ||
halve MCSE for mean. | ||
|
||
## Pareto-$\hat{k}$-threshold | ||
|
||
The final Pareto diagnostic, $\hat{k}$-threshold, is useful for smaller sample sizes. Here we select only 100 iterations per chain to get total of 400 draws. | ||
|
||
```{r summarise_draws-khat_threshold} | ||
drt %>% | ||
subset_draws(iteration=1:100) %>% | ||
summarise_draws(mean, mcse_mean, ess_basic, | ||
pareto_khat, min_ss=pareto_min_ss, | ||
khat_thres=pareto_khat_threshold, | ||
conv_rate=pareto_convergence_rate) | ||
``` | ||
|
||
With only 400 draws, we can trust the Pareto smoothed result only when | ||
$\hat{k}<0.62$. For $t_{1.5}$ $\hat{k}\approx 0.64$, and `min_ss` reveals | ||
we would probably need more than 560 draws to be on the safe side. | ||
|
||
## Pareto diagnostics | ||
|
||
We can get all these diagnostics with `pareto_diags()`, and it's | ||
easy to use it also for derived quantities. | ||
|
||
```{r summarise_draws-pareto_diags} | ||
drt %>% | ||
mutate_variables(xt2_5_sq=xt2_5^2) %>% | ||
subset_draws(variable="xt2_5_sq") %>% | ||
summarise_draws(mean, mcse_mean, | ||
pareto_diags) | ||
``` | ||
|
||
## Discussion | ||
|
||
All these diagnostics are presented in Section 3 and summarized in | ||
Table 1 in PSIS paper (Vehtari et al., 2024). | ||
|
||
If you don't need to estimate means of thick tailed distributions, | ||
and there are no sampling issues due to thick tails, then you don't | ||
need to check existence of finite variance, and thus there is no | ||
need to check Pareto-$\hat{k}$ for all the parameters and derived | ||
quantities. | ||
|
||
It is possible that the distribution has finite variance, but | ||
pre-asymptotically given a finite sample size the behavior can be | ||
similar to infinite variance. Thus the diagnostic is useful even in | ||
cases where theory guarantees finite variance. | ||
|
||
## Reference | ||
|
||
Vehtari, A., Simpson, D., Gelman, A., Yao, Y., & Gabry, J. (2024). | ||
Pareto smoothed importance sampling. | ||
*Journal of Machine Learning Research*, 25(72):1-58. | ||
|
||
Vehtari A., Gelman A., Simpson D., Carpenter B., & Bürkner P. C. (2020). | ||
Rank-normalization, folding, and localization: An improved Rhat for assessing | ||
convergence of MCMC. *Bayesian Analysis*, 16(2):667-718. | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters