Merge pull request #836 from stan-dev/issue/835-new-diagnostics
update CmdStan Guide and Reference Manual for updated `stansummary` and explain rank-normalized split R-hat.
WardBrian authored Dec 10, 2024
2 parents 1782f1a + 78d628f commit 24d1d7e
Showing 2 changed files with 102 additions and 53 deletions.
84 changes: 44 additions & 40 deletions src/cmdstan-guide/stansummary.qmd
diagnostic statistics on the sampler chains, reported in the following order:

- Mean - sample mean
- MCSE - Monte Carlo Standard Error, a measure of the amount of noise in the sample
- StdDev - sample standard deviation - the standard deviation around the sample mean.
- MAD - Median Absolute Deviation - the median absolute deviation around the sample median.
- Quantiles - default 5%, 50%, 95%
- ESS_bulk - bulk effective sample size, measured over the entire sample
- ESS_tail - tail effective sample size, measured over the 5% and 95% tails
- R_hat - $\hat{R}$ statistic, an MCMC convergence diagnostic

When reviewing the `stansummary` output, it is important to check the final three
output columns first - these are the diagnostic statistics on MCMC convergence and
effective sample size.
An $\hat{R}$ statistic greater than $1$ indicates potential convergence problems:
the sample may not be representative of the target posterior, and estimates of the
mean and all other summary statistics are therefore likely to be invalid. A value of
$1.01$ can be used as a generic threshold for deciding whether more iterations or
further convergence analysis is needed, but other thresholds can be used depending
on the specific use case.
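
As a quick illustration of the diagnostic (a minimal sketch, assuming the Python
`arviz` package, whose default $\hat{R}$ method is the rank-normalized
split-$\hat{R}$ of @Vehtari+etal:2021:Rhat), well-mixed chains yield values near
$1$, while chains stuck in different regions do not:

```python
import numpy as np
import arviz as az

rng = np.random.default_rng(0)
mixed = rng.normal(size=(4, 1000))         # 4 chains exploring the same target
stuck = mixed + 3 * np.arange(4)[:, None]  # 4 chains centered at different values

print(az.rhat(az.convert_to_dataset(mixed))["x"].item())  # close to 1.00
print(az.rhat(az.convert_to_dataset(stuck))["x"].item())  # far above 1.01
```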

Estimation by sampling produces an approximate value for the model parameters;
the MCSE statistic indicates the amount of uncertainty in the estimate.
The MCSE column is therefore placed next to the sample mean column,
so that the estimate can easily be compared against its standard error.

For more information, see the
[Posterior Analysis](https://mc-stan.org/docs/reference-manual/analysis.html)
chapter of the Stan Reference Manual which describes both the theory and practice of MCMC
estimation techniques.

The statistics - Mean, StdDev, MAD, and Quantiles - are computed directly from all draws across all chains.
The diagnostic statistics - ESS_bulk, ESS_tail, and R_hat - are computed from the rank-normalized,
folded, and split chains according to the definitions of @Vehtari+etal:2021:Rhat.
The MCSE statistic is computed using the split chain R_hat and autocorrelations.
The summary statistics and the algorithms used to compute them are described in sections
[Notation for samples](https://mc-stan.org/docs/reference-manual/analysis.html#notation-for-samples-chains-and-draws)
and
[Effective Sample Size](https://mc-stan.org/docs/reference-manual/analysis.html#effective-sample-size.section).
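
For the posterior mean, the relationship between MCSE and effective sample size
is simple enough to sketch directly (a minimal illustration, not `stansummary`'s
exact code path; it assumes an ESS estimate is already available):

```python
import numpy as np

def mcse_mean(draws, ess):
    """MCSE of the posterior mean: posterior sd / sqrt(ESS).

    `draws` are pooled across all chains; `ess` comes from an
    effective-sample-size estimator such as ESS_bulk.
    """
    return np.std(draws, ddof=1) / np.sqrt(ess)
```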

## Building the `stansummary` command

The CmdStan makefile task `build` compiles the `stansummary` utility
into the `bin` directory.
output files to get the following report:

```
> bin/stansummary eight_*.csv
Input files: eight_1.csv, eight_2.csv, eight_3.csv, eight_4.csv
Inference for Stan model: eight_schools_model
4 chains: each with iter=1000; warmup=1000; thin=1; 1000 iterations saved.
Warmup took (0.031, 0.031, 0.038, 0.031) seconds, 0.13 seconds total
Sampling took (0.032, 0.039, 0.026, 0.029) seconds, 0.13 seconds total
                  Mean   MCSE  StdDev    MAD     5%    50%    95%  ESS_bulk  ESS_tail  R_hat
lp__               -18   0.36     5.3    5.5    -26    -18   -8.7       214       217    1.0
accept_stat__     0.80  0.021    0.29  0.068  0.033   0.95   1.00       268       314    1.0
stepsize__        0.20    nan   0.040  0.029   0.14   0.21   0.25       nan       nan    nan
treedepth__        3.7   0.22    0.88    1.5    2.0    4.0    5.0        26        18    1.1
n_leapfrog__        17    3.1      10     12    3.0     15     31        30        21    1.1
divergent__      0.022    nan    0.15   0.00   0.00   0.00   0.00       nan       nan    nan
energy__            23   0.38     5.7    5.8     13     23     32       223       286    1.0
mu                 7.9   0.16     5.3    4.9  -0.35    7.9     16       977       472    1.0
theta[1]            12   0.32     8.8    7.4  -0.19     11     28       727       525    1.0
theta[2]           7.8   0.17     6.4    6.0   -2.8    7.8     18      1298      2137    1.0
theta[3]           5.9   0.20     8.1    6.8   -8.0    6.5     18      1493      1720    1.0
theta[4]           7.5   0.17     6.9    6.3   -3.4    7.6     18      1517      1935    1.0
theta[5]           4.9   0.20     6.3    6.1   -6.1    5.1     15       998      1357    1.0
theta[6]           5.8   0.18     6.8    6.1   -6.3    6.2     16      1434      1431    1.0
theta[7]            11   0.25     7.0    6.4   0.42     10     23       741       562    1.0
theta[8]           8.6   0.21     8.1    7.2   -3.6    8.3     22      1444      1850    1.0
tau                7.3   0.31     5.8    4.4    1.5    5.7     18       226      1230    1.0
Samples were drawn using hmc with nuts.
For each parameter, ESS_bulk and ESS_tail measure the effective sample size for the entire sample
(bulk) and for the .05 and .95 tails (tail),
and R_hat measures the potential scale reduction on split chains. At convergence R_hat will be
very close to 1.00.
```
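
For programmatic checks, the same report can be written to a file with
`stansummary`'s `--csv_filename` option and then filtered, for example with
pandas (a sketch; the CSV column names are assumed to match the console header
above, and the ESS cutoff of 400 follows the rough recommendation of
@Vehtari+etal:2021:Rhat):

```python
import pandas as pd

# assumes a prior run of: bin/stansummary --csv_filename=summary.csv eight_*.csv
summary = pd.read_csv("summary.csv", comment="#", index_col="name")

# flag parameters whose diagnostics warrant a closer look
flagged = summary[(summary["R_hat"] > 1.01)
                  | (summary["ESS_bulk"] < 400)
                  | (summary["ESS_tail"] < 400)]
print(flagged[["R_hat", "ESS_bulk", "ESS_tail"]])
```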

The console output information consists of
71 changes: 58 additions & 13 deletions src/reference-manual/analysis.qmd
pagetitle: Posterior Analysis
# Posterior Analysis {#analysis.chapter}

Stan uses Markov chain Monte Carlo (MCMC) techniques to generate
draws from the posterior distribution for full Bayesian inference.
Markov chain Monte Carlo (MCMC) methods were developed for situations
in which it is not straightforward to make independent draws
[@Metropolis:1953].
despite the fact that it is not actually a Markov chain.
Stan's Laplace algorithm produces a sample from a normal approximation
centered at the mode of a distribution in the unconstrained space.
If the mode is a maximum a posteriori (MAP) estimate,
the sample provides an estimate of the mean and standard deviation
of the posterior distribution.
If the mode is a maximum likelihood estimate (MLE),
the sample provides an estimate of the standard error of the likelihood.
applies. These problems are addressed in the next two sections.

Stan's posterior analysis tools compute a number of summary
statistics, estimates, and diagnostics for Markov chain Monte Carlo
(MCMC) sample. Stan's estimators and diagnostics are more robust in
the face of non-convergence, antithetical sampling, and long-term
Markov chain correlations than most of the other tools available. The
algorithms Stan uses to achieve this are described in this chapter.


## Convergence

By definition, a Markov chain samples from the target
distribution only after it has converged to equilibrium (i.e.,
equilibrium is defined as being achieved when $p(\theta^{(n)})$ is the
target density). The following point cannot be expressed strongly
One way to monitor whether a chain has converged to the equilibrium
distribution is to compare its behavior to other randomly initialized
chains. This is the motivation for the @GelmanRubin:1992 potential
scale reduction statistic, $\hat{R}$. The $\hat{R}$ statistic
measures the ratio of the average variance of draws within each
chain to the variance of the pooled draws across chains; if all
chains are at equilibrium, these will be the same and $\hat{R}$ will
be one. If the chains have not converged to a common distribution,
the $\hat{R}$ statistic will be greater than one.

Gelman and Rubin's recommendation is that the independent Markov
chains be initialized with diffuse starting values for the parameters
and sampled until all values for $\hat{R}$ are below some threshold.
@Vehtari+etal:2021:Rhat suggest using a threshold of $1.01$ in general, but
other thresholds can be used depending on the use case. Stan
allows users to specify initial values for parameters and it is also
able to draw diffuse random initializations automatically satisfying
the declared parameter constraints.

The $\hat{R}$ statistic is defined for a set of $M$ Markov chains,
$\theta_m$, each of which has $N$ draws $\theta^{(n)}_m$. The
*between-chain variance* estimate is

$$
B = \frac{N}{M-1} \sum_{m=1}^M \left( \bar{\theta}_m - \bar{\theta} \right)^2,
$$

where $\bar{\theta}_m$ is the mean of the draws in chain $m$ and
$\bar{\theta}$ is the mean of the pooled draws across all chains.
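
The corresponding computation is short enough to sketch in code (a minimal
illustration of the classic $\hat{R}$ ingredients, using the standard
definitions of the between-chain variance $B$, the within-chain variance $W$,
and the pooled variance estimate; it is not Stan's implementation):

```python
import numpy as np

def basic_rhat(draws):
    """Classic R-hat for draws of shape (M chains, N iterations)."""
    m, n = draws.shape
    chain_means = draws.mean(axis=1)
    b = n * chain_means.var(ddof=1)        # between-chain variance estimate B
    w = draws.var(axis=1, ddof=1).mean()   # within-chain variance estimate W
    var_plus = (n - 1) / n * w + b / n     # overestimate of posterior variance
    return np.sqrt(var_plus / w)
```

In practice, each chain is also split in half and the halves are treated as
separate chains (split-$\hat{R}$); for example, a chain that is slowly trending
will have split-$\hat{R}$ greater than one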
because the first half of each chain has not mixed with the second
half.


### Rank normalization helps when there are heavy tails {-}

Split R-hat and the effective sample size (ESS) are well defined only if
the marginal posteriors have finite mean and variance.
Therefore, following @Vehtari+etal:2021:Rhat, we compute the rank normalized
parameter values and then feed them into the formulas for split R-hat and ESS.

Rank normalization proceeds as follows:

* First, replace each value $\theta^{(nm)}$ by its rank $r^{(nm)}$ within the pooled
draws from all chains. Average ranks are used for ties to conserve
the number of unique values of discrete quantities.

* Second, transform the ranks to normal scores using the inverse normal
transformation and a fractional offset, as sketched in code below:

$$
z^{(nm)} = \Phi^{-1} \left( \frac{r^{(nm)} - 3/8}{S + 1/4} \right),
$$

where $S = MN$ is the total number of draws pooled across all chains.
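
A minimal sketch of this transformation (assuming NumPy and SciPy; not Stan's
implementation):

```python
import numpy as np
from scipy.stats import norm, rankdata

def rank_normalize(draws):
    """Rank-normalize draws of shape (M chains, N iterations).

    Ranks are taken over the pooled draws, with average ranks for ties,
    and then mapped to normal scores via the fractional offset above.
    """
    s = draws.size                                        # S = M * N
    ranks = rankdata(draws, method="average").reshape(draws.shape)
    return norm.ppf((ranks - 3/8) / (s + 1/4))
```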

To further improve sensitivity to chains having different scales,
rank normalized R-hat is also computed for the corresponding *folded*
draws $\zeta^{(nm)}$, the absolute deviations from the median,

$$
\zeta^{(nm)} = \left|\theta^{(nm)}-{\rm median}(\theta)\right|.
$$
The rank normalized split-$\widehat{R}$ measure computed on the
$\zeta^{(nm)}$ values is called *folded-split*-$\widehat{R}$.
This measures convergence in the
tails rather than in the bulk of the distribution.

To obtain a single conservative $\widehat{R}$ estimate, the maximum of
rank normalized split-$\widehat{R}$ and rank normalized
folded-split-$\widehat{R}$ is reported for each parameter.
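
Putting the pieces together (a sketch under the same assumptions as above,
reusing the hypothetical `rank_normalize` helper; Stan's implementation differs
in details such as the handling of odd chain lengths and constant draws):

```python
import numpy as np

def rank_normalized_rhat(draws):
    """Rank-normalized split-R-hat, maximized over bulk and folded draws.

    `draws` has shape (M chains, N iterations); N is assumed even here.
    """
    m, n = draws.shape
    half = n // 2
    split = np.vstack([draws[:, :half], draws[:, half:]])  # 2M chains of length N/2

    def split_rhat(chains):
        means = chains.mean(axis=1)
        b = half * means.var(ddof=1)             # between-chain variance
        w = chains.var(axis=1, ddof=1).mean()    # within-chain variance
        var_plus = (half - 1) / half * w + b / half
        return np.sqrt(var_plus / w)

    bulk = rank_normalize(split)
    folded = rank_normalize(np.abs(split - np.median(split)))
    return max(split_rhat(bulk), split_rhat(folded))
```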

Bulk-ESS is defined as ESS for rank normalized split chains. Tail-ESS
is defined as the minimum ESS for the 5% and 95% quantiles. See
[Effective Sample Size section](#effective-sample-size.section) for
details on how ESS is estimated.
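
Tail-ESS can be sketched as the minimum ESS of the indicator draws for the two
tail quantiles (assuming some ESS estimator `ess_fn`, such as the one sketched
in the section linked above):

```python
import numpy as np

def tail_ess(draws, ess_fn):
    """Minimum ESS over indicators for the 5% and 95% quantiles.

    `ess_fn` takes draws of the same shape as `draws` and returns an ESS.
    """
    lo, hi = np.quantile(draws, [0.05, 0.95])
    return min(ess_fn((draws <= lo).astype(float)),
               ess_fn((draws <= hi).astype(float)))
```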

### Convergence is global {-}

A question that often arises is whether it is acceptable to monitor
analytically.
So what we do in practice is hope that the finite number of draws is
large enough for the expectations to be reasonably accurate. Removing
warmup iterations improves the accuracy of the expectations but there
is no guarantee that removing any finite number of draws will be
enough.


distribution for the log posterior density (`lp__`) almost
never looks Gaussian, instead it features long tails that can lead to
large $\hat{R}$ even in the large $N$ limit. Tweaks to $\hat{R}$,
such as using quantiles in place of raw values, have the flavor of
making the sample of interest more Gaussian and hence the $\hat{R}$
statistic more accurate.


and can apply the standard tests.
## Effective sample size {#effective-sample-size.section}

The second technical difficulty posed by MCMC methods is that the
draws will typically be autocorrelated (or anticorrelated) within a
chain. This increases (or reduces) the uncertainty of the estimation of posterior
quantities of interest, such as means, variances, or quantiles; see
@Geyer:2011.
The autocorrelation at lag $t \geq 0$ for a process with mean $\mu$ and
variance $\sigma^2$ works out to

$$
\rho_t
= \frac{1}{\sigma^2} \int_{\Theta} \theta^{(n)} \, \theta^{(n+t)} \, p(\theta)
\, d\theta - \frac{\mu^2}{\sigma^2}.
$$

The effective sample size of $N$ draws generated by a process with
autocorrelations $\rho_t$ is defined by
$$
N_{\mathrm{eff}} = \frac{N}{1 + 2 \sum_{t=1}^{\infty} \rho_t}.
$$

correlation.
In practice, the probability function in question cannot be tractably
integrated and thus the autocorrelation cannot be calculated, nor the
effective sample size. Instead, these quantities must be estimated
from the draws themselves. The rest of this section describes an
autocorrelation- and split-$\hat{R}$-based effective sample
size estimator, based on multiple chains. As before, each chain
$\theta_m$ will be assumed to be of length $N$.
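
A simplified, single-chain sketch of such an estimator (FFT-based
autocorrelation estimates plus Geyer's initial-positive-sequence truncation;
Stan's multi-chain estimator additionally combines within- and between-chain
variances and uses the rank-normalization described earlier):

```python
import numpy as np

def ess_single_chain(x):
    """Simplified ESS sketch: N / (1 + 2 * sum of autocorrelations),
    truncating the sum with Geyer's initial positive sequence."""
    n = len(x)
    x = np.asarray(x, dtype=float) - np.mean(x)
    f = np.fft.rfft(x, 2 * n)                    # zero-padded FFT
    acov = np.fft.irfft(f * np.conj(f))[:n] / n  # autocovariance estimates
    rho = acov / acov[0]                         # autocorrelation estimates
    tau = -1.0                                   # accumulates 1 + 2 * sum(rho_t)
    t = 0
    while t + 1 < n:
        pair = rho[t] + rho[t + 1]               # Geyer pair rho_2k + rho_{2k+1}
        if pair <= 0:                            # stop at first non-positive pair
            break
        tau += 2 * pair
        t += 2
    return n / tau
```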
