Improve step size adaptation target #3105
About two and a half years ago issue #2789 <#2789> pointed out a problem with the calculation of the adaptation target accept_stat. The issue was closed after a pull request with a proposed fix was completed, but unfortunately the pull request was rejected and the problem is still present in Stan 2.29. This also came up on the forum last summer <https://discourse.mc-stan.org/t/acceptance-ratio-in-nuts/22752>.
Background reading: https://arxiv.org/abs/1411.6669 <https://arxiv.org/abs/1411.6669> finds the optimal stepsize for static HMC. In short: the ratio of accepted proposals to gradient evaluations is maximized when the average Metropolis acceptance probability is 0.8.
The acceptance probability is related to the energy error in the trajectory
w = exp(-H(q,p))
accept_prob = min(1,w_2/w_1)
where w_1 is the Boltzmann weight of the initial point and w_2 of the final point. The sampler selects the candidate sample with probability accept_prob and adaptation adjusts the step size up or down depending on whether accept_prob is above or below 0.8.
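As a minimal sketch of the quantities just described: the function names and the `target` parameter below are illustrative only and are not taken from Stan's source. The acceptance probability is computed from the energy error rather than from the raw weights, which avoids overflow when the energies are large.

```python
import math

def metropolis_accept_prob(h_initial, h_final):
    """Return min(1, w_2 / w_1) for Boltzmann weights w = exp(-H)."""
    energy_error = h_final - h_initial
    if energy_error <= 0.0:
        # w_2 >= w_1: the acceptance probability saturates at 1.
        return 1.0
    return math.exp(-energy_error)

def step_size_direction(accept_prob, target=0.8):
    """Illustrative rule: raise the step size above the target, lower it below."""
    return "increase" if accept_prob > target else "decrease"

print(metropolis_accept_prob(0.0, math.log(2.0)))  # energy went up by log 2
print(step_size_direction(metropolis_accept_prob(0.0, math.log(2.0))))
```

The actual adaptation in Stan uses dual averaging on a running statistic rather than a per-transition decision; the second function only illustrates the direction of the update.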
To be precise the derivation in that paper is defined not just for static integration times but also for the original Hybrid Monte Carlo implementation of Hamiltonian Monte Carlo where only the final state of the numerical Hamiltonian trajectory is considered. This approach is especially wasteful as it will reflect the proposal even if there are intermediate states in the generated numerical trajectory that have small error and are far away from the initial state.
Before we try to apply this to Stan, I'll mention one thing that bugs me about this use of Metropolis acceptance probability as the adaptation target for static HMC. And that is: a significant number (about 1/3) of transitions have w_2 > w_1 and an acceptance probability saturating to 1. But the magnitude of the energy error should be informative even in this case; it would be more efficient to not ignore it.
The saturation of the Metropolis acceptance probability does shed a bit of information, but the entire calculation is predicated around that particular mathematical form. In particular the Metropolis form defines the objective function (note that unlike similar analyses performed for random walk Metropolis and some Langevin methods this analysis does not assume a diffusive limit and so there’s no direct connection to any notion of expected jump distance or effective sample size). The form also introduces a kink in the Metropolis acceptance probability function when w_{2} = w_{1} which makes the calculation particularly annoying (and removes a few powers of the step size in the final approximation accuracy).
Various averages of the Metropolis acceptance probabilities across the states in a numerical Hamiltonian trajectory are not used because they provide the most information but rather because they provide the closest proxy to Hybrid Monte Carlo transitions for which the optimization calculation is well-defined.
The current acceptance statistic approximates the multinomial transition as an ensemble of Metropolis transitions from the initial state to each of the other states in the final numerical Hamiltonian trajectory. Each transition is given equal weight, which results in a conservative estimate as those states with large error that are unlikely to be sampled by the multinomial transition have just as much influence as those states with smaller errors that are more likely to be sampled.
The proposed change in #2789 <#2789> went from uniform weights across this ensemble of proxy Metropolis transitions to the multinomial weights, allowing the final states that are more likely under the actual multinomial sampling to influence the adaptation statistic more. The intuition is that this provides a better approximation of the exact multinomial transition as an ensemble of Metropolis transitions, but this was also backed up by extensive numerical studies that showed that the effective sample size per gradient evaluation improved uniformly across dimensions, correlations, and different tail behaviors.
The multinomial weights define only the _marginal_ selection probabilities, and the Metropolis-like selection of the sub-trajectories that is biased away from the initial state does introduce more complicated _conditional_ selection probabilities. The challenge with trying to incorporate these is combining all of the conditional sub-tree probabilities into global weights that can be used to average over the entire numerical trajectory.
Ideally the original analysis would be directly generalized for a multinomial sampler, but coming up with an appropriate objective function is difficult. Some of the recent work on iterative delayed-rejection Markov transitions inspired by the Hamiltonian Monte Carlo implementations in Stan could be of use here.
Either way, I'm going to point out that--just like with static HMC--you can reduce the variance without changing the mean by replacing the Metropolis probability with the symmetric acceptance statistic.
The issue here isn’t so much reducing the variance of the acceptance statistic to stabilize the adaptation; it’s coming up with something that is as compatible as possible with the theoretical optimal adaptation calculation. Reduced variance and stabilized adaptation would certainly be nice, but only conditional on having a meaningful adaptation target.
So, what now? The last doubling acceptance rate is an attractive proxy for effective sample size per computational cost (in fact, in the above example effective sample size also goes up and down in sync with fluctuations in the acceptance rate) if you could somehow make it strictly monotonic. One possible monotonic proxy can be created by taking the symmetric last doubling acceptance statistic
I think that the first step in considering a more heuristic acceptance statistic/proxy that isn’t an average of weighted Metropolis acceptance probabilities, and hence somewhat connected to the original analysis for why 0.8 is the more robustly optimal adaptation target, is an empirical study similar to what I performed for #2789 <#2789>. This would help clarify if there are significant improvements, how robust those improvements might be, and whether 0.8 is still reasonably optimal (and hence if the theoretical calculation still has any relevance).
|
(oops, took longer than I expected to get back to this...)
Yes, sorry, the important difference is not static vs dynamic integration time but multinomial sampling from the whole trajectory vs endpoints only.
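The optimality criterion depends only on the expectation value of the objective function. There are many different objective functions that give the same expectation value.
a0 = 1 if the proposal was accepted else 0
a1 = min(1,w_2/w_1)
a2 = 2*min(w_1,w_2)/(w_1+w_2)
any linear mixture of the above
These all give the same adapted step size but a2 has the smallest variance and hence the fastest convergence.
These also differ when trying to generalize to multinomial selection.
a0 is not suitable for averaging over many proxy transitions
if the weights are uniform then a1 and a2 lead to the same target step size but a2 again has a smaller variance
if the weights are Boltzmann weights then a1 always leads to a target step size that is larger than that of a2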
The part I don't understand is why use a weighting that provides "a better approximation" (than uniform) instead of just using the exact multinomial sampling probabilities as weights. The next paragraph ("The challenge [...] is combining all of the conditional sub-tree probabilities") sounds to me like you meant to say "yes, it would be better to weight by the actual selection probabilities but those are too difficult to compute" but I'm not sure.
While "last-doubling-lower-bound accept_stat" was derived from different considerations, numerically it is fairly similar to the accept_stat in the original NUTS paper. It can be computed by substituting the mean weight of the first subtree for the weight of the initial point in the expression for the original accept_stat.
Taking inspiration from the tests you reported in https://discourse.mc-stan.org/t/request-for-volunteers-to-test-adaptation-tweak/9532/53 here are some plots. All with diagonal mass matrix, 4 chains and 2000 adaptation iterations.
I wasn't sure if "correlated normal rho=0.5" meant every off-diagonal entry in the correlation matrix is 0.5, or only adjacent variables have correlation 0.5 and long-distance correlations fall off accordingly. Looking at these results, it was the latter.
Let's throw in Neal's funnel too, x[1] ~ normal(0,1), x[2:] ~ normal(0, exp(x[1])). A red dot means there are post-warmup divergences and this time adapt_delta=0.9.
|
the entire calculation is predicated around that particular mathematical form. In particular the Metropolis form defines the objective function
The optimality criterion depends only on the expectation value of the objective function. There are many different objective functions that give the same expectation value.
a0 = 1 if the proposal was accepted else 0
a1 = min(1,w_2/w_1)
a2 = 2*min(w_1,w_2)/(w_1+w_2)
any linear mixture of the above
These all give the same adapted step size but a2 has the smallest variance and hence the fastest convergence.
I agree that the step size optimization procedure will hold for any objective function with the same expectation value as the Metropolis acceptance probability, but do you have a reference for E[a2] = E[a1] and Var[a2] < Var[a1]?
Equivalent objective functions with smaller variance is definitely a compelling argument.
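A quick numerical check of the claim, offered as a sketch rather than a reference. It assumes the energy error Delta = H_2 - H_1 is normally distributed with mean equal to half its variance, which is the form consistent with a reversible, volume-preserving integrator started from the stationary distribution. The value of `sigma`, the sample size, and the variable names are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: Delta = H_2 - H_1 ~ Normal(sigma^2 / 2, sigma^2).
sigma = 1.0
delta = rng.normal(loc=0.5 * sigma**2, scale=sigma, size=200_000)
ratio = np.exp(-delta)  # w_2 / w_1

a1 = np.minimum(1.0, ratio)                        # min(1, w_2/w_1)
a2 = 2.0 * np.minimum(1.0, ratio) / (1.0 + ratio)  # 2*min(w_1,w_2)/(w_1+w_2)

print("mean a1, a2:", a1.mean(), a2.mean())
print("var  a1, a2:", a1.var(), a2.var())
```

Under this assumption a2 equals the conditional expectation of a1 given the magnitude of the energy error, which is why the means agree and the variance of a2 cannot be larger. A simulation like this is evidence for the Gaussian case only, not a proof for general energy error distributions.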
These also differ when trying to generalize to multinomial selection.
a0 is not suitable for averaging over many proxy transitions
if the weights are uniform then a1 and a2 lead to the same target stepsize but a2 again has a smaller variance
if the weights are Boltzmann-weights then a1 always leads to a target stepsize that is larger than that of a2
I can’t speak for a2, but when using Boltzmann weights to average across the proxy Metropolis transitions an increased step size is expected.
The Metropolis proposal is limited by the Hamiltonian error at the final state, no matter what Hamiltonian errors are achieved within the numerical trajectory. Increasing the step size increases the Hamiltonian errors, making it more likely to reject the final state and go back to the initial state. Because the Hamiltonian error oscillates, however, proposals to intermediate states can be much less likely to reject than the final state even with increased step sizes. In other words the objective based on the Metropolis proposal between two states will be too conservative in expectation.
The question is how to weight the Metropolis proposals to those intermediate states. The current adaptation uses a uniform weighting which considers those states with large errors just as much as those with small errors, but the multinomial sampling will always focus on those intermediate states with smaller errors. Consequently the uniform weighting will again be too conservative and lead to smaller adapted step sizes.
The empirical studies backed this up: using the Boltzmann weights led to higher adapted step sizes and cheaper Hamiltonian transitions without affecting the effective sample size for the standard expectands.
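The difference between the two weightings can be sketched for a single trajectory as follows. This is an illustration under simplifying assumptions, not Stan's implementation: the function name is hypothetical, the energies are made-up numbers, and the weights follow the convention w = exp(-H) used earlier in the thread.

```python
import numpy as np

def proxy_accept_stats(h_initial, h_states):
    """Average proxy Metropolis acceptance probabilities over trajectory states.

    Returns (uniform_average, boltzmann_weighted_average).
    """
    h_states = np.asarray(h_states, dtype=float)
    log_ratio = h_initial - h_states                 # log(w_i / w_0)
    accept = np.exp(np.minimum(log_ratio, 0.0))      # min(1, w_i / w_0)
    weights = np.exp(log_ratio - log_ratio.max())    # proportional to w_i
    uniform = accept.mean()
    weighted = np.sum(weights * accept) / np.sum(weights)
    return float(uniform), float(weighted)

# Hypothetical energies along one numerical trajectory.
uniform, weighted = proxy_accept_stats(0.0, [0.1, 2.0, -0.3, 3.0])
print(uniform, weighted)
```

Because the acceptance probability is a non-decreasing function of the weight, the weighted average is never smaller than the uniform one, which matches the expectation that Boltzmann weighting adapts to larger step sizes.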
The part I don't understand is why use a weighting that provides "a better approximation" (than uniform) instead of just using the exact multinomial sampling probabilities as weights. The next paragraph ("The challenge [...] is combining all of the conditional sub-tree probabilities") sounds to me like you meant to say "yes, it would be better to weight by the actual selection probabilities but those are too difficult to compute" but I'm not sure.
The update that I proposed does use the exact multinomial sampling probabilities as weights!
I think the confusion may be in to what approximation I was referring.
The multinomial transition is only approximated by a mixture of Metropolis proposals, and so any objective function based on a mixture of Metropolis proposals will only approximate the optimal step size of the multinomial transition. The Boltzmann mixture of Metropolis proposals, however, can be implemented exactly.
At the same time the actual Hamiltonian transition implemented in Stan is also not exactly a multinomial transition. Within each subtree T_k the proposed final state is chosen multinomially,
P[ z_n \in T_k ] = exp( H(z_0) - H(z_n) ) / sum_{z \in T_k} exp( H(z_0) - H(z) ).
When combining subtrees T_k and T_{k + 1} we then need to choose between the proposed final states in each subtree, z_k \in T_k and z_{k + 1} \in T_{k + 1}. Multinomial weights across the entire trajectory can be implemented by sampling a binary variable with probabilities
P [ z_k ] = 1 - P [ z_{k + 1} ] = P[ T_k ] / ( P [ T_k ] + P [ T_{k + 1} ] )
where
P [ T_k ] = sum_{z \in T_k} exp( H(z_0) - H(z) ).
Stan, however, uses a Metropolis probability to decide between the subtree proposals,
P [ z_k ] = min(1, P [ T_k ] / P [ T_{k + 1} ] ),
which favors transitions to the new subtree which contains states that are further away from the initial state.
Keeping track of the exact multinomial weights is straightforward, but the subtree probabilities are harder to propagate through the recursion.
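To make the contrast concrete, here is a small sketch of the two rules for choosing between the proposal from the existing trajectory and the proposal from the newly built subtree. The function and argument names are hypothetical; `weight_old` and `weight_new` stand for the summed Boltzmann weights of the existing trajectory and the new subtree, both assumed positive.

```python
def prob_new_multinomial(weight_old, weight_new):
    """Probability of moving to the new subtree under exact multinomial weights."""
    return weight_new / (weight_old + weight_new)

def prob_new_metropolis(weight_old, weight_new):
    """Probability of moving to the new subtree under a Metropolis-style rule."""
    return min(1.0, weight_new / weight_old)

for w_old, w_new in [(1.0, 1.0), (1.0, 0.5), (0.5, 1.0)]:
    print(w_old, w_new,
          prob_new_multinomial(w_old, w_new),
          prob_new_metropolis(w_old, w_new))
```

With equal weights the multinomial rule moves to the new subtree half of the time while the Metropolis-style rule always moves, which is the bias toward states further from the initial point described above.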
While "last-doubling-lower-bound accept_stat" was derived from different considerations, numerically it is fairly similar to the accept_stat in the original NUTS paper. It can be computed by substituting the mean weight of the first subtree for the weight of the initial point in the expression for the original accept_stat.
Taking inspiration from the tests you reported in https://discourse.mc-stan.org/t/request-for-volunteers-to-test-adaptation-tweak/9532/53 <https://discourse.mc-stan.org/t/request-for-volunteers-to-test-adaptation-tweak/9532/53> here's some plots. All with diagonal mass matrix, 4 chains and 2000 adaptation iterations.
How many runs were done for each dimension? Are you plotting the average? What does the effective sample size vs number of gradient evaluations look like?
I wasn't sure if "correlated normal rho=0.5" meant every off-diagonal entry in the correlation matrix is 0.5, or only adjacent variables have correlation 0.5 and long-distance correlations fall off accordingly. Looking at these results, it was the latter.
Yes, the latter.
Let's throw in Neal's funnel too, x[1] ~ normal(0,1), x[2:] ~ normal(0, exp(x[1])). A red dot means there are post-warmup divergences and this time adapt_delta=0.9.
Were there E-FMI warnings for the higher dimensionality targets?
|
My thinking on this is that HMC generates not just a Markov chain of points
but also a Markov chain of trajectories
with
but I'm pretty sure the distribution this samples from in the long run is the same as sampling T from the trajectory distribution and then drawing an initial point q conditional on that trajectory
where the sums range over all (so, to answer your question, no, I don't have a reference, just a "I'm pretty sure...")
The confusion was what "multinomial" refers to as I would have said "unbiased multinomial weights" for what you call "multinomial sampling weights" and "biased multinomial weights" for subtree-probability adjusted weights. (I don't even know why it's called multinomial since there's only one draw per subtree...)
It's not insurmountable. The current recursion accumulates subtree weights and sums after every doubling,
sum_accept_stat += sum_accept_stat_subtree;
weight += weight_subtree;
and the final
accept_stat = (1-p)*accept_stat + p*sum_accept_stat_subtree/weight_subtree;
where
Only one run per dimension; I figured adjacent dimensions are similar enough to be approximate replicates and the zig-zag gives a feel for run-to-run variability.
I forgot to check that. Running again with So, very bad E-FMI. (Unsurprising, I guess, since ESS was also quite low.) I was going to write more about adapting arxiv:1411.6669 to multinomial selections but re-reading it I realized I misunderstood it and am utterly confused.
and computes the cost as
This is the expected cost to the first accepted proposal starting from a random initial state. But isn't the quantity of interest the total cost divided by the total number of acceptances in a long-running chain? First time reading, I thought these were the same thing but
They're not equal! Why does the paper (and Beskos et al) optimize residence time instead of the total number of acceptances? |
If you want a paywalled article, the abstract of https://doi.org/10.1016/S0378-3758(99)00079-8 begins
which sounds a lot like filling in the details for the "I'm pretty sure" part but I can't promise you'll get your money's worth if you pay Elsevier $9.50 for 24 hours of access. |
You can find a pdf on Google Scholar. |
Let me respond a little bit out of order.
Only one run per dimension; I figured adjacent dimensions are similar enough to be approximate replicates and the zig-zag gives a feel for run-to-run variability.
n_leapfrog and accept_stat are averages, step size is for the first chain, and ESS is the first dimension (effectively a random direction, except for the funnel where it's the funnel direction (probably the slowest mixing?))
What does the effective sample size vs number of gradient evaluations look like?
The cross-over in performance for the IID normal model is interesting. In performance testing I stick to direct expectations to avoid any complications of the “bulk ESS” heuristic, but the cross over could also be due to the metric that you are proposing.
So, very bad E-FMI. (Unsurprising, I guess, since ESS was also quite low.)
Keep in mind that low E-FMI suggests that a central limit theorem doesn’t hold for all of the expectands in which case quantities like the estimated effective sample size won’t correspond to anything meaningful. At best it would quantify how quickly the Markov chain Monte Carlo estimators converge to a biased target.
In this case as the dimension grows the geometry of the funnel obstructs even exact trajectories from venturing down into the neck of the funnel. Markov chain Monte Carlo estimators of the average acceptance probability will then be biased high because they don’t take into account the high-curvatures, and low acceptance probabilities, in that neck, which results in the adaptation converging to a higher-than-desired step size. In lower dimensions the obstruction isn’t as bad and numerical trajectories venture deep enough into the funnel to diverge and provide a more immediate warning of problems.
This is the expected cost to the first accepted proposal starting from a random initial state. But isn't the quantity of interest the total cost divided by the total number of acceptances in a long-running chain? First time reading, I thought these were the same thing but
The expected number of proposals to first acceptance is the expected residence time E[1/a(q)]. It is also the expected number of proposals to the next acceptance when you start from a random point in a very long chain
The number of acceptances in a chain is proportional to the probability that, when picking a random point from the chain, the next proposal is accepted. The frequency of the state q in the chain is p(q) so this number is sum_q(p(q)a(q))=E[a(q)] and consequently the long-term average number of proposals per acceptance is 1/E[a(q)].
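A two-state toy example shows how the two quantities come apart. The acceptance probabilities and state frequencies below are made up purely for illustration.

```python
# Hypothetical two-state target: state frequencies p(q) and
# acceptance probabilities a(q) for proposals started at each state.
p = [0.5, 0.5]
a = [0.5, 1.0]

# Expected number of proposals to the first acceptance, E[1/a(q)].
expected_residence = sum(p_q / a_q for p_q, a_q in zip(p, a))

# Long-run proposals per acceptance, 1/E[a(q)].
long_run = 1.0 / sum(p_q * a_q for p_q, a_q in zip(p, a))

print(expected_residence, long_run)
```

By Jensen's inequality E[1/a(q)] is never smaller than 1/E[a(q)], with equality only when a(q) is constant across states.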
They're not equal! Why does the paper (and Beskos et al) optimize residence time instead of the total number of acceptances?
This all goes back to the Roberts, Gilks, and Gelman paper. There the authors show that for random walk Metropolis one can construct a diffusion limit where the expected number of proposals to first acceptance is directly related to the asymptotic variance, and hence becomes a viable target for adapting the configuration of the random walk Metropolis proposal distribution. This equivalence can also be verified empirically — adapting to the expected number of proposals to first acceptance gives the optimal effective sample size per iteration and vice versa.
Hamiltonian Monte Carlo doesn’t have a nice diffusion limit like random walk Metropolis, but Beskos et al used the expected number of proposals to first acceptance criterion heuristically anyways. When using numerical Hamiltonian trajectories as a Metropolis proposal, i.e. the original “Hybrid” Monte Carlo, this heuristic also does well empirically, at least for IID targets. See for example Figure 3 of https://arxiv.org/abs/1411.6669 <https://arxiv.org/abs/1411.6669>.
If you want a paywalled article, the abstract of https://doi.org/10.1016/S0378-3758(99)00079-8 <https://doi.org/10.1016/S0378-3758(99)00079-8> begins
We introduce a form of Rao–Blackwellization for Markov chains which uses the transition distribution for conditioning. We show that for reversible Markov chains, this form of Rao–Blackwellization always reduces the asymptotic variance
which sounds a lot like filling in the details for the "I'm pretty sure" part but I can't promise you'll get your money's worth if you pay Elsevier $9.50 for 24 hours of access.
All glory to the website that rhymes with ply-shrub that we do not name explicitly.
The result in the paper can be applied to the form of Hamiltonian Monte Carlo we use in Stan, but I don’t think it’s directly applicable to the adaptation question.
By construction the multinomial (categorical if one prefers, although I use multinomial for all numbers of trials) sampler is amenable to the kind of Rao-Blackwellization mentioned in that paper. Instead of computing estimates
E[f] \approx (1 / N) sum_{n = 1}^{N} f(q_n)
we could compute an iterated expectation
E[f]
\approx (1 / N) sum_{n = 1}^{N} E_{T_n}[f]
= (1 / N) sum_{n = 1}^{N} [ sum_{i = 1}^{N_{T_n}} w_i f(q_i) ] / [ sum_{i = 1}^{N_{T_n}} w_i ].
where T_{n} is the entire numerical trajectory at the nth iteration of the Markov chain.
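The two estimators can be sketched side by side. This is an illustration only: the function names are hypothetical, and the inputs are assumed to be the values of f and the multinomial weights stored for every state of every trajectory, which Stan does not retain.

```python
import numpy as np

def single_state_estimate(f_selected):
    """Standard estimator: average f over the one selected state per iteration."""
    return float(np.mean(f_selected))

def trajectory_averaged_estimate(f_trajectories, w_trajectories):
    """Iterated-expectation estimator: weighted average of f within each
    trajectory, then a plain average across iterations."""
    per_iteration = [
        np.sum(np.asarray(w) * np.asarray(f)) / np.sum(w)
        for f, w in zip(f_trajectories, w_trajectories)
    ]
    return float(np.mean(per_iteration))

# Two made-up trajectories of different lengths.
f_traj = [[1.0, 3.0], [2.0, 2.0, 2.0]]
w_traj = [[1.0, 1.0], [1.0, 2.0, 3.0]]
print(trajectory_averaged_estimate(f_traj, w_traj))
```

The second estimator needs every state of every trajectory (or all expectands specified in advance), which is the storage overhead weighed against the gain in the discussion that follows.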
Indeed this does result in higher effective sample sizes, although not as much as one might hope. I did experiments on this way back when I first introduced the multinomial state selection scheme and the improvement wasn’t more than 5% or so. The problem is that neighboring states in each numerical trajectory are so correlated that there’s not much more information beyond a single, well-chosen state. So technically there is an improvement but it’s not much, and the overhead of having to store the entire numerical trajectories (or specify all expectands of interest ahead of time so that the Markov chain Monte Carlo estimators can be evaluated online) easily overwhelmed that small gain.
Were we to use that Rao-Blackwellized estimator then I agree that your suggested “acceptance statistic” would be appropriate. Because we don’t use that estimator, however, I think that that acceptance statistic would in general lead to an over-aggressive adaptation relative to the estimator that we do use. That said because the effect of the Rao-Blackwellization is small here I think that they should perform similarly, although not equivalently.
Empirically I think this is backed up by some of your plots. If all of these acceptance statistics had the same expectation value but different variances then we should see the rolling estimators converge to the same value but at different speeds. Your plots show different speeds but also different asymptotic values, consistent with the fact that the different statistics are targeting slightly different circumstances.
This is the heart of the issue: we need to mold the adaptation to the actual sampler that we use. The current acceptance statistic in Stan is a subpar approximation to the multinomial sampling (or whatever one might call the “inter-subtrajectory multinomial sampling with intra-subtrajectory Metropolis sampling” method that’s currently in Stan), which leads to an over-conservative adaptation and leaves performance on the table. I think that your suggestion goes a bit too far since we don’t use the Rao-Blackwellized estimator, whereas in my opinion the acceptance statistic in the old PR is the best match to the current sampler behavior both theoretically and empirically.
|
I proposed two changes
In my opinion (1) takes us closer to the current sampler behavior and the effect of (2) is not theoretically understood but could plausibly be an improvement.
I must say, Michael, sometimes I find it very difficult to follow your logic. Rao-Blackwellization does not change the expected value of the estimator and previously you said
but now you've decided that Rao-Blackwellizing the adaptation objective is appropriate only if every other estimator is Rao-Blackwellized as well.
I don't think piecemeal replies to individual statements are getting us on the same page so I'll just recap the whole theory behind the step size adaptation.
Random-Walk Metropolis-Hastings MCMC
The story begins with the Metropolis-Hastings algorithm. The algorithm samples from the distribution
The theory for the optimal “step size”
They assume a target distribution that decomposes into a very large number of IID variables
Due to the IID assumption the potential is a sum
The ensemble is a sum of independent random variables so (assuming suitable regularity conditions) the central limit theorem applies and the ensemble
All that is to say, the ensemble follows the normal distribution
In the limit $N\to\infty$ (where also
Varying acceptance probability
In the above discussion the Metropolis acceptance probability is entirely determined by the ensemble average and does not meaningfully depend on any individual component. Consequently the diffusion coefficient is constant across the entire typical set and (as Roberts, Gelman, and Gilks point out) all measures of efficiency are equivalent.
Rao-Blackwellization
One more thing before we move on to Hamiltonian Monte Carlo. The adaptation algorithm needs an estimate of the average acceptance probability. The obvious estimator is just the acceptance probability for the current transition
This should be a more stable estimator with the same mean. I haven't tested it in RWMH but this marginalization can also be applied to Stan's current accept_stat. (It's not obvious at all that the same formula would be correct with a more complicated algorithm like NUTS; that it works in this specific case is a convenient coincidence.) My previous plots labelled this the “symmetric” accept_stat. Here's a direct comparison:
My takeaways are:
Metropolis HMC
Now, let's get to HMC. “Metropolis HMC” (a term I just made up, usually called just HMC) is like RWMH but the proposal
The relevant paper is
They analyze the Hamiltonian flow and find that in the limit of either infinite dimensions or vanishing step size the distribution of the energy error converges to normal. (though they silently switch to (equivalent)
The cost of building the trajectory is
and the expectations $\mathbb{E}_{q}\left[\mathbb{E}_{p}\left[\cdot\right]\right]$ can be computed from the previous result $-\Delta\left(q,q^{\star}\right)\sim N\left(-\frac{1}{2}\alpha\epsilon^{4},\alpha\epsilon^{4}\right)$.
Figure 3b of the paper compares these theoretical bounds to empirical efficiency estimates:
You may notice that the d=1000 line follows the lower bound very closely. I believe that is because in high dimensions $a\left(q\right)=\mathbb{E}_{p}\left[a\left(q,p\right)\right]$ is constant across the typical set and consequently $\mathbb{E}_{q}\left[a\left(q\right)^{-1}\right]=\mathbb{E}_{q}\left[a\left(q\right)\right]^{-1}$ (i.e. the so-called lower bound is actually exact.)
Their recommended adaptation target is the minimum of the cost upper bound at a=0.8.
In theory, Rao-Blackwellizing the acceptance probability estimator should work the same here as with RWMH.
Multinomial HMC
An HMC trajectory contains
The theory for optimal tuning of multinomial transitions is actually pretty simple: there's no theory, we just take the initial point and the selected final point, then pretend this was a Metropolis HMC transition all along. This can be Rao-Blackwellized in a couple of different ways:
Multinomial HMC, take two
Let's try that again. The total cost of building the trajectory is $\frac{L}{\epsilon}$ and, provided L is short enough that movement is diffusive, effective samples per iteration is determined by the diffusion coefficient $D\left(q\right)=\mathbb{E}_{p}\left[l^{2}\left(q,p\right)\right]$ where $l^{2}\left(q,p\right)$ is the expected squared distance of the transition. With Metropolis HMC we had $l^{2}\left(q,p\right)=a\left(q,p\right)L^{2}$ but multinomial sampling allows for a more complex relationship. The previous bounds still apply
(Of course, the challenge is finding the distribution of
The precise distance is difficult to keep track of but if we replace “metric distance” with “integration time” and approximate it as being constant inside a subtree then we get a “multinomial acceptance statistic”
I'll end this post with a sampling cost vs accept stat plot similar to the “Optimizing the step size” figure 3b. The lower and upper bounds in that paper were for Metropolis HMC with fixed integration time but NUTS chooses a power-of-two multiple of the step size. Changing the integration time by a factor of two may change the efficiency also by a factor of two so I've plotted the bounds as bands instead of lines.
The blue line is 100 dimensional iid normal, the orange is also 100 dimensional normal but correlated. Top left accept stat is Stan's current one; top right is the one from the old pull request; bottom left is the "multinomial" one described in this post, and bottom right is the same but with the last doubling modified as I described in the first post of this thread. Effective sample size is measured with
The lower and upper bounds in the top plots are those from the “Optimizing the step size” paper.
The lower and upper bounds in the bottom plots are $\frac{1}{\epsilon\mathbb{E}\left[a_{M}\left(q,p\right)\right]} $ and $\frac{1}{\epsilon}\mathbb{E}\left[\frac{1}{a_{M}\left(q,p\right)}\right] $ estimated via a messy numerical calculation built on dubious assumptions about the shape and distribution of Hamiltonian trajectories. |
Thanks @nhuurre this is neat! Quick Q. For the last graph would it be better to show the 100 IID normal and correlated normal on two separate graphs instead of breaking them up into 4 graphs? (i.e. all 100 IID normal on one graph color coded by the different acceptance statistics and all correlated normal on another graph). Also are all of the graphs on the same y axis scale?
Am I interpreting that bottom left graph correctly that it tends to not do as well as the other accept stats almost everywhere? I'm not sure how to think about the blue line there zig zagging. If you could talk more about what you infer from the last graph that would be nice as I wasn't totally sure what I should be pulling from it.
Also wanted to clarify, are the bands in that graph the approximate bounds of the step size? So for the bottom two it's taking smaller step sizes at lower acceptance stats?
For the second Rao-Blackwellization in the last graph, does the final ESS / n_leapfrog have to do twice the leapfrogs because it computes both sides? I think I might just be misunderstanding that graph / math.
Also is this all written with Stan's HMC? If so would it be possible to see ESS/second? If not then I can understand leaving it out as raw speed is going to be different depending on the language. Also if you have a script for all of this it might be neat to apply it to the models in posteriordb just to get an idea of how all this works across more models.
Do you have a PR that combines these ideas together? It sounds like that is what you would like to have put into the algorithms.
|
The different "accept stats" aren't exactly measuring the same thing so I wouldn't put them all on the x-axis of the same graph. The previous horizontal zig-zag is now seen as vertical zig-zag in the top left plot. It is due to step size resonances that cause a "surprising" increase in acceptance probability (and a small increase in ESS).
Yeah, I'm not sure really. I guess to me it looks like there's not much difference between the different accept_stats, in every case the empirical lines are roughly consistent with the theoretical bounds, with the minimum cost / maximum efficiency predicted around
I have a local branch that computes various |
It's hard not to respond piecemeal because there are multiple misunderstandings threaded throughout that complicate inline responses. Let me try to clarify a few important points that I think may be causing some of the confusion.

Cost Functions For Optimizing Metropolis-Hastings Transitions
The standard argument for this form doesn't really consider the diffusion limit at all. Instead it introduces a cost function based on how expensive it will be to generate an accepted proposal. At the initial point and the number of rejections until an acceptance The average number of rejections across all initial points is then that harmonic mean,
Betancourt et al (2014) doesn't make any mention of a diffusion limit in motivating its cost function. On page 3 it instead lays out the above argument, adding in the cost of generating each numerical trajectory in those intermediate rejections. Consequently the results apply for any trajectory length within the assumption that the cost of intermediate rejections dominates the overall performance of the sampler.

This last assumption is not to be taken for granted. This cost function is missing any quantification of the effectiveness of an accepted proposal, instead treating all acceptances as the same. In cases where there is a related diffusion limit this makes sense, but there's no guarantee that it will generalize. That said, it does seem to capture the optimal behavior in practice even when Hamiltonian trajectories are long enough that they really can't be considered diffusive by any metric.

The Many Meanings of Rao-Blackwell

One of the sources of confusion here is the term "Rao-Blackwell" being used to describe similar but formally distinct operations.

For example "Rao-Blackwell" is often used to describe a Markov transition that generalizes the standard Metropolis-Hastings transition. These generalized transitions attempt auxiliary proposals after each rejection, drastically increasing the probability of an acceptance. Note that this also fundamentally changes the relationship between the proposal step size and the average acceptance probability, and hence what the optimal step size or optimal average acceptance probability will be.

"Rao-Blackwell" can also be used to describe Markov chain Monte Carlo estimators that take advantage of multiple intermediate states generated in a proposal. For example consider Hamiltonian Monte Carlo. Any Hamiltonian Monte Carlo method will generate an entire numerical Hamiltonian trajectory with each state offering useful information. 
Even if the final Markov transition only uses one of these points (the end point for the original Hybrid Monte Carlo sampler, a random point based on the Boltzmann weights for the multinomial sampler) those intermediate points can be used in the construction of Markov chain Monte Carlo estimators with a little bit of care. When implemented correctly these Rao-Blackwellized Markov chain Monte Carlo estimators will be more precise than the standard Markov chain Monte Carlo estimators that use only the Markov chain states, which then modifies the relationship between the sampler configuration and the estimator effective sample sizes. In particular these Rao-Blackwellized estimators can afford a large step size, and hence lower cost, without compromising the effective sample size, and hence overall sampler performance.

The subtlety here is that there are now two notions of a "Rao-Blackwellized average acceptance probability estimator".

Notion One

One notion arises when one endeavors to estimate an appropriate optimization statistic for a given sampler -- for example the Hybrid Monte Carlo sampler or the sliced NUTS sampler or the multinomial sampler -- using the entire intermediate numerical Hamiltonian trajectory instead of just the initial and accepted states. This first requires, however, the construction of an optimization statistic appropriate to the desired sampler.

The basis of this issue is that Stan currently uses an estimator for an optimization statistic appropriate for the Hybrid Monte Carlo sampler, not the multinomial sampler, and this results in suboptimal behavior when trying to optimize the multinomial sampler. In my original pull request I introduced an estimator of a modified optimization statistic appropriate to the multinomial sampler. This estimator uses all of the states in each numerical Hamiltonian trajectory and I demonstrated that it yielded better performance after adaptation. 
Because it uses all of the states in each numerical trajectory it could be considered a Rao-Blackwell estimator.

@nhuurre you introduced another estimator which you claimed, at least according to my understanding, estimated the same multinomial optimization statistic but with a smaller variance by squeezing more information from each numerical trajectory (or generating more states entirely). You haven't yet shown any calculations demonstrating these claims -- same expectation but smaller variance for yours -- and the numerical experiments have yet to directly compare my proposed method to yours. As I mentioned earlier in the thread the smoking gun for "Rao-Blackwellization" would be the same asymptotic sampler performance but faster convergence for the estimator with the smaller variance.

Notion Two

The second notion that we might consider is how to optimize performance when using Rao-Blackwellized Markov chain Monte Carlo estimators, that is instead of using as the final Markov chain Monte Carlo output using $$ where The analyses in Gilks et al (1997) and Betancourt et al (2014) concern how to optimize Metropolis-Hastings samplers when the output is That said one might consider using a modified optimization statistic that would lead to approximately optimal behavior for the

Which Notion?

My previous claim is that the optimization statistic introduced by @nhuurre is well suited to the particular optimization problem of Notion 2. In other words I don't think that its expectation is actually equivalent to the multinomial optimization statistic, in which case it would not be a Rao-Blackwellized average acceptance probability estimator in the sense of Notion 1. Instead it would be, or so I claim, a Rao-Blackwellized average acceptance probability in the sense of Notion 2. 
I agree that a proper Rao-Blackwellization in the sense of Notion 1 will always be better, but I don't think that there has been any verification that any of these optimization statistics can be considered Rao-Blackwellizations of others.

Empirical Results

The n_leapfrog / ESS versus average acceptance probability (really the "optimization statistic", since most of these aren't actually Metropolis acceptance probabilities) plot is great. I also think that it very clearly supports the claims of my original pull request.

In the top-left plot we see that for these two problems the current Stan behavior is optimized not for the default The top-right plot shows that, again at least for these problems, the modified optimization statistic in my original pull request results in optimal behavior right around In my opinion the spikes exhibited in the "last doubling accept rate" in the lower-left would be far too fragile for practical application, and the "last doubling lower bound" gives similar results to the top-right within the resolution of the experiments.

To flip the results around -- what if anything suggests that the "last doubling lower bound" optimization statistic is better than the "Boltzmann weights"?

Odds and Ends
Only one adaptation window uses a single |
Does GitHub's mathjax support need triple back tick blocks to render correctly? The docs imply that $$ blocks should be sufficient but it doesn't look like they're rendering. |
As far as I can tell, there's no way to get a displayed equation in GitHub-flavored Markdown. The double |
It does work since this May: https://github.blog/2022-05-19-math-support-in-markdown/ https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/writing-mathematical-expressions But no idea why it does not show in Michael's case. |
Thanks for the tip, Rok. I didn't think it worked in markdown, because I was trying to use it like Michael. The trick is putting blank lines before and after. So really dumb and completely undocumented, just like the rest of markdown. Well, not undocumented, just misleadingly documented:
No, that doesn't work, as this example shows:
It just produces abc but if you offset the display math with blank lines,
it produces abc def |
Thanks, @rok-cesnovar! The mathjax rendering doesn't seem to be perfect in some of the more elaborate equations but hopefully the edits are easier to read. |
GitHub's MathJax is a huge pain in the neck. I had great difficulty getting my previous post to render correctly in the preview and apparently some of it has now broken anyway.

@betanalpha
If we can agree on this then at least some progress has been made.
This is highly misleading phrasing. The average cost per accepted proposal in a long chain is irrelevant; the "standard argument" you speak of only considers the cost to the first accepted proposal in a new chain. (Which is surprising because that's not how the sampler actually behaves.)
The harmonic mean is the average number of rejections before the first acceptance across all initial points. That omission was a major stumbling block for me because for a long time I held the misunderstanding that these papers were minimizing the total number of rejections in a long-running chain (which, in retrospect, is obviously not a harmonic mean, but at the time it seemed such a self-evidently correct cost function that I did not stop and think).
Betancourt et al (2014) has only a brief motivation that does not adequately justify the use of the harmonic mean rather than the arithmetic mean. When questioned on this, Betancourt (2022, pers. comm.) referred to Roberts et al (1997), which does use a diffusion limit and does not distinguish between the harmonic and arithmetic means.
Once again, this type of language makes it sound as if the quantity of interest is the asymptotic number of acceptances, but that would be the arithmetic mean of the acceptance probability. If the argument is (as I now suspect it is) that the asymptotic variance of the MCMC estimate is proportional to the variance of the lengths of the sequences of rejections then you should say the important thing is not the mean of the geometric distribution but its variance-to-mean ratio which coincidentally happens to be equal to the mean.
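For reference, the geometric-distribution facts this argument leans on can be written out explicitly. The notation is an assumption made for illustration: $a$ is the acceptance probability at a fixed initial point (with $0 < a < 1$), $N$ is the number of proposals up to and including the first acceptance, and $K = N - 1$ is the number of rejections before it.

$$
\mathbb{E}[N] = \frac{1}{a}, \qquad
\mathbb{E}[K] = \frac{1-a}{a}, \qquad
\operatorname{Var}[N] = \operatorname{Var}[K] = \frac{1-a}{a^{2}}, \qquad
\frac{\operatorname{Var}[K]}{\mathbb{E}[K]} = \frac{1}{a} = \mathbb{E}[N].
$$

Under this convention the variance-to-mean ratio of the rejection count equals the mean number of proposals, and averaging $1/a$ over initial points gives $\mathbb{E}\left[1/a\right]$, the reciprocal of the harmonic mean of $a$.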
I omitted the calculation for multinomial sampling but I did show one for Metropolis-Hastings and Hybrid Monte Carlo schemes, and that is what "the proposed change to the old PR" number 2 uses. Do you agree that this is a valid Rao-Blackwellization?

Now, for multinomial sampling Stan currently uses the "Gelman-Hoffman" acceptance statistic

The obvious "Rao-Blackwellization" is to sum over all potential initial points in the trajectory

But that's inconvenient because it requires summing all N(N-1)/2 pairs. Instead, the plots were made with a symmetrized Gelman-Hoffman statistic

This has the same mean because it has the same "Rao-Blackwellization" (relabel the indices in one sum):
It's not clear to me how these analyses depend on the specifics of the MCMC estimators used. Could you elaborate on why that makes a difference and what the analysis looks like for
I've tried to follow Betancourt et al. (2014) who say the lower bound is maximized at a=0.6 and the upper bound is minimized at a=0.8. However this terminology is confusing because in my opinion their plot is upside down, showing inefficiency rather than (more natural) efficiency.
All else equal, longer transitions are better than shorter ones and I think the acceptance statistic should take that into account somehow. "Boltzmann weights" treats all points equally while "last doubling lower bound" has a strong preference for the far end of the trajectory. That makes it more robust in pathological examples like one I posted on the forums way back.
Summary:
The step size adaptation targets a suboptimal "acceptance statistic".
Description:
About two and a half years ago issue #2789 pointed out a problem with the calculation of the adaptation target accept_stat. The issue was closed after a pull request with a proposed fix was completed but unfortunately the pull request was rejected and the problem is still present in Stan 2.29. This also came up on the forum last summer.

Background reading: https://arxiv.org/abs/1411.6669 finds the optimal stepsize for static HMC. In short: the ratio of accepted proposals to gradient evaluations is maximized when the average Metropolis acceptance probability is 0.8.

The acceptance probability is related to the energy error in the trajectory

w = exp(-H(q,p))
accept_prob = min(1,w_2/w_1)

where w_1 is the Boltzmann weight of the initial point and w_2 of the final point. The sampler selects the candidate sample with probability accept_prob and adaptation adjusts the step size up or down depending on whether accept_prob is above or below 0.8.

Before we try to apply this to Stan, I'll mention one thing that bugs me about this use of Metropolis acceptance probability as the adaptation target for static HMC. And that is: a significant number (about 1/3) of transitions have w_2 > w_1 and an acceptance probability saturating to 1. But the magnitude of the energy error should be informative even in this case; it would be more efficient to not ignore it.

To remedy this recall that the transition kernel is reversible: if q_1 -> q_2 is possible then so is q_2 -> q_1 and the rates at which these will happen are proportional to the Boltzmann weights w_1 and w_2. Taking the weighted average of acceptance probabilities in both directions gives a symmetric acceptance statistic
which has the same mean as the saturating acceptance probability but smaller variance.
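One reading of the symmetric statistic described above, assuming the forward and reverse Metropolis probabilities are averaged with weights w_1 and w_2 respectively, is sketched below. Under that assumption the average simplifies to 2 min(w_1, w_2) / (w_1 + w_2). The function names are hypothetical and this is not Stan code.

```python
def metropolis_accept(w1, w2):
    """Saturating Metropolis acceptance probability for a move from weight w1 to w2."""
    return min(1.0, w2 / w1)


def symmetric_accept(w1, w2):
    """Average of forward and reverse acceptance probabilities, weighted by w1 and w2."""
    forward = metropolis_accept(w1, w2)   # q_1 -> q_2, occurs at a rate proportional to w1
    reverse = metropolis_accept(w2, w1)   # q_2 -> q_1, occurs at a rate proportional to w2
    return (w1 * forward + w2 * reverse) / (w1 + w2)


w1, w2 = 1.0, 0.5
sym = symmetric_accept(w1, w2)
closed_form = 2.0 * min(w1, w2) / (w1 + w2)
print(sym, closed_form)
```

Unlike the Metropolis probability, this statistic is below 1 whenever w_1 and w_2 differ, so transitions with w_2 > w_1 still report the size of the energy error.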
Now, of course, Stan's sampler is not static HMC. Instead of considering only the last point for acceptance or rejection, Stan chooses a Boltzmann-weighted multinomial sample from the entire trajectory, with bias towards the last doubling. Adaptation strategy is similar but now the acceptance statistic cannot simply be a Metropolis acceptance probability.
The original Hoffman & Gelman NUTS paper proposed using the average of accept_prob for all points in the last half of the trajectory as the proxy accept statistic and that's how Stan calculated it until #352 changed it to be the average over all points in the entire trajectory (even the unconditionally rejected ones, should the trajectory terminate early).

I'm not sure which is better. On one hand short jumps that don't reach the last doubling should not be ignored, on the other hand they contribute much less to the effective sample size and I don't think you should count them as a full accepted sample when trying to maximize the number of accepted samples per computational effort.
Either way, I'm going to point out that--just like with static HMC--you can reduce the variance without changing the mean by replacing the Metropolis probability with the symmetric acceptance statistic.
Next up, issue #2789. The proposal there is that instead of a simple arithmetic average, accept_stat should be calculated as a weighted average. This makes some amount of sense but there are two issues that concern me.
Also, it is not obvious whether the statistic to average should be the (saturating) Metropolis acceptance probability or the symmetrized acceptance statistic--unlike with the previous accept_stat proxies these give different results with Boltzmann weighting.
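To make the candidates concrete, here is a small sketch of the four combinations discussed so far: an arithmetic or Boltzmann-weighted average over trajectory points, of either the Metropolis probability or the symmetric statistic. It assumes the symmetric statistic has the form 2 min(w_0, w_i) / (w_0 + w_i) and that the weighted average uses each point's own Boltzmann weight. The function names and example weights are hypothetical, and this is not how Stan implements the calculation.

```python
def metropolis(w0, wi):
    """Saturating Metropolis probability of moving from the initial point to point i."""
    return min(1.0, wi / w0)


def symmetric(w0, wi):
    """Symmetrized acceptance statistic (assumed form, see the lead-in)."""
    return 2.0 * min(w0, wi) / (w0 + wi)


def arithmetic_mean(stat, w0, weights):
    """Plain average of the statistic over all trajectory points."""
    return sum(stat(w0, wi) for wi in weights) / len(weights)


def boltzmann_mean(stat, w0, weights):
    """Average of the statistic with each point weighted by its Boltzmann weight."""
    return sum(wi * stat(w0, wi) for wi in weights) / sum(weights)


w0 = 1.0                             # weight of the initial point (hypothetical)
trajectory = [1.2, 0.8, 0.3, 0.05]   # weights of the other points (hypothetical)

results = {
    (avg.__name__, stat.__name__): avg(stat, w0, trajectory)
    for avg in (arithmetic_mean, boltzmann_mean)
    for stat in (metropolis, symmetric)
}
for key, value in results.items():
    print(key, round(value, 4))
```

A single trajectory cannot demonstrate the equal-mean claim, which concerns expectations over many transitions; the sketch only pins down what is being averaged in each case.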
Finally, let's consider an alternative idea: what if you use the probability of selecting from the last doubling as the accept_stat proxy?
The sampler implements the last-doubling-bias in a very Metropolis-like fashion: if the total weight of the second subtree is greater than the first select a sample from the second, otherwise select it with probability W_2/W_1 (where the Ws are summed Boltzmann weights over the entire subtree). You could use this probability as the adaptation proxy with the rationale that the last doubling is what gives you most effective samples.

There's two problems with this idea:
I think the non-monotonicity in this plot mostly results from the symmetry of the 50-dimensional normal distribution. In a normal distribution the sampler moves in circles around the mean; at first the energy error increases until the simulated particle has reached 90 degrees from the starting point; but then starts decreasing and returns to almost zero when the movement is at 180 degrees around. If the stepsize is a power-of-two fraction of a full cycle then the two halves of the final trajectory are always symmetrical and have equal weight regardless of pointwise energy errors within them.
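The last-doubling selection rule described above can be sketched as follows, assuming each subtree is represented simply as a list of pointwise Boltzmann weights (a hypothetical representation, not Stan's tree data structure). The first example mirrors the symmetric situation from the preceding paragraph: two halves with equal summed weight give a selection probability of 1 even though individual points have sizeable energy errors.

```python
def last_doubling_accept(first_subtree, second_subtree):
    """Probability of selecting the sample from the second (newest) subtree.

    Each argument is a list of pointwise Boltzmann weights; W_1 and W_2 are their sums.
    """
    W1 = sum(first_subtree)
    W2 = sum(second_subtree)
    return min(1.0, W2 / W1)


# Mirror-image halves: pointwise weights vary, summed weights are equal.
first = [1.0, 0.5, 0.25, 0.5]
second = [0.5, 0.25, 0.5, 1.0]
p_symmetric = last_doubling_accept(first, second)

# Steadily degrading trajectory: the second half carries less total weight.
p_degrading = last_doubling_accept([1.0, 0.75, 0.5, 0.25], [0.5, 0.25, 0.25, 0.25])
print(p_symmetric, p_degrading)
```

Because only the subtree totals enter the formula, this proxy cannot distinguish a trajectory with no energy error from one whose errors cancel between the two halves.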
So, what now? The last doubling acceptance rate is an attractive proxy for effective sample size per computational cost (in fact, in the above example effective sample size also goes up and down in sync with fluctuations in the acceptance rate) if you could somehow make it strictly monotonic. One possible monotonic proxy can be created by taking the symmetric last doubling acceptance statistic
and applying Jensen's inequality to derive a lower bound for it
where w_1 is the average weight in the first subtree and i ranges over the second subtree.

Current Version:
v2.29.0