\frame{
\sffamily
\frametitle{The Laplace approximation}
\begin{itemize}
\item An alternative to variational approximations is the Laplace approximation.
\item The Laplace approximation is a simple yet powerful asymptotic expansion of the posterior distribution that goes back to Laplace in 1774!
\item There are different variants of the technique, but the main argument is similar in all cases: Start by assuming that $y_1, \ldots, y_n$ are independent and identically distributed with $y_i \mid \bftheta \sim p(y_i \mid \bftheta)$ so that the posterior takes the form
$$
% p(\bfy \mid \bftheta) p(\bftheta)
p( \bftheta \mid \bfy ) \propto \exp \left\{ - n h_n(\bftheta) \right\} ,
$$
where $h_n(\bftheta) = -\frac{1}{n} \sum_{i=1}^{n} \log p(y_i \mid \bftheta) - \frac{\log p(\bftheta)}{n}$.
\end{itemize}
}
\frame{
\sffamily
\frametitle{The Laplace approximation}
\begin{itemize}
\item Now, take a second-order Taylor expansion of $h_n(\bftheta)$ around a point $\bftheta_0$ and write
\begin{multline*}
h_n(\bftheta) = h_n(\bftheta_0) + \left[ \nabla h_n \left( \bftheta_0 \right) \right]^{T} \left[ \bftheta - \bftheta_0 \right] \\
%
+ \frac{1}{2} \left[ \bftheta - \bftheta_0 \right]^{T} \left[ \nabla \nabla h_n \left( \bftheta_0 \right) \right] \left[ \bftheta - \bftheta_0 \right] + R_n(\bftheta, \bftheta_0) ,
\end{multline*}
where $\nabla h_n \left( \bftheta \right)$ is the gradient of $h_n$ and $\nabla \nabla h_n \left( \bftheta \right)$ is the Hessian matrix of $h_n$.
\item It is not difficult to show that, under standard regularity conditions, $ \lim_{n \to \infty} R_n(\bftheta, \bftheta_0) = 0$.
\end{itemize}
}
\frame{
\sffamily
\frametitle{The Laplace approximation}
\begin{itemize}
\item The previous argument suggests the approximation
\begin{multline*}
p( \bftheta \mid \bfy ) \overset{\boldsymbol{\cdot}}{\propto} \exp \left\{ - n
\left[ \nabla h_n \left( \bftheta_0 \right) \right]^{T} \left[ \bftheta - \bftheta_0 \right] \right. \\
%
\left. - \frac{n}{2} \left[ \bftheta - \bftheta_0 \right]^{T} \left[ \nabla \nabla h_n \left( \bftheta_0 \right) \right] \left[ \bftheta - \bftheta_0 \right] \right\}
\end{multline*}
which corresponds to a Gaussian kernel!
\item Note that the contribution of the prior to $h_n$ vanishes at rate $1/n$, faster than the contribution from the likelihood. Therefore, alternatively,
\begin{multline*}
p( \bftheta \mid \bfy ) \overset{\boldsymbol{\cdot}}{\propto} \exp \left\{
\left[ \left. \nabla \log p\left( \bfy \mid \bftheta \right) \right|_{\bftheta= \bftheta_0}\right]^{T} \left[ \bftheta - \bftheta_0 \right] \right. \\
%
\left. - \frac{1}{2} \left[ \bftheta - \bftheta_0 \right]^{T} \left[ \left. - \nabla \nabla \log p \left( \bfy \mid \bftheta \right) \right|_{\bftheta = \bftheta_0} \right] \left[ \bftheta - \bftheta_0 \right] \right\} .
\end{multline*}
\end{itemize}
}
\frame{
\sffamily
\frametitle{The Laplace approximation}
\begin{itemize}
\item Typically $\bftheta_0$ is taken to be either the MLE of $\bftheta$ or the maximum a posteriori (MAP) estimator, which makes the gradient term vanish and leads to simplifications.
\begin{itemize}
\item For example, discarding the contribution of the prior and letting $\bftheta_0 = \hat{\bftheta}$ be the MLE (so that the gradient term vanishes) we have
\begin{multline*}
p( \bftheta \mid \bfy ) \overset{\boldsymbol{\cdot}}{\propto} \\
%
\exp \left\{ - \frac{1}{2} \left[ \bftheta - \hat{\bftheta} \right]^{T} \left[ \left. - \nabla \nabla \log p \left( \bfy \mid \bftheta \right) \right|_{\bftheta = \hat{\bftheta}} \right] \left[ \bftheta - \hat{\bftheta} \right] \right\}
\end{multline*}
where the matrix $\left. - \nabla \nabla \log p \left( \bfy \mid \bftheta \right) \right|_{\bftheta = \hat{\bftheta}}$ is called the \textit{observed} information matrix. That is,
$$
\bftheta \mid \bfy \overset{\boldsymbol{\cdot}}{\sim} \normal\left( \hat{\bftheta} , \left[ \left. - \nabla \nabla \log p \left( \bfy \mid \bftheta \right) \right|_{\bftheta = \hat{\bftheta}} \right]^{-1} \right)
$$
\end{itemize}
\item This last version of the approximation is the basis for the ``Bayesian CLT'', as well as for the derivation of the BIC. (A short numerical sketch follows on the next slide.)
\end{itemize}
}
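\begin{frame}[fragile]
\sffamily
\frametitle{The Laplace approximation}
A minimal R sketch (not part of the original notes) of the mode-centered approximation: \texttt{optim} returns the maximizer and the Hessian of the negative log-posterior, whose inverse is the approximate posterior covariance. The Poisson model and Gaussian prior below are illustrative placeholders.
\begin{verbatim}
# Log-posterior for a toy model: Poisson likelihood, log link,
# vague Gaussian prior on theta = log(lambda).
log_post <- function(theta, y) {
  sum(dpois(y, exp(theta), log = TRUE)) +
    dnorm(theta, mean = 0, sd = 10, log = TRUE)
}
set.seed(1)
y   <- rpois(50, lambda = 2)
fit <- optim(0, function(th) -log_post(th, y),
             method = "BFGS", hessian = TRUE)
mode_hat <- fit$par            # posterior mode (theta_0)
var_hat  <- solve(fit$hessian) # inverse observed information
\end{verbatim}
\end{frame}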
\frame{
\sffamily
\frametitle{The Laplace approximation}
\begin{itemize}
\item Although technically the result is valid for any parameterization, making appropriate transformations (e.g., logarithm, logit) might improve the accuracy.
\item The Laplace approximation can be very accurate for large $n$, but in complicated hierarchical models with a large number of hyperparameters the approximation can be very poor!
\item When the posterior moments are available in closed form, it might simply be better to use the posterior mean and variance as the moments for the normal approximation.
\item Even if they are not very accurate, Laplace approximations are sometimes used to guide the construction of proposal distributions in some of the Monte Carlo schemes discussed before.
\end{itemize}
}
\frame{
\sffamily
\frametitle{The Laplace approximation}
\begin{itemize}
\item {\bf Example (Poisson-Gamma models): } Consider data $y_1, \ldots, y_n$ where $y_i \sim \Poi (\lambda)$ and $\lambda \sim \Gam(a, b)$. The posterior distribution (which is known to be a $\Gam\left( a + \sum_{i=1}^{n} y_i , n + b \right)$ distribution) can be written as $p(\lambda \mid \bfy) \propto \exp\left\{ - n h_n(\lambda) \right\}$, where
$$
h_n(\lambda) = \lambda \frac{n+b}{n} - \frac{a - 1 + \sum_{i=1}^{n} y_i}{n} \log \lambda .
$$
It is easy to see that the posterior mode in this case is $\tilde{\lambda} = \frac{a - 1 + \sum_{i=1}^{n} y_i}{n+b}$. Also, $\frac{\partial^2 h_n}{\partial \lambda^2} = \frac{a - 1 + \sum_{i=1}^{n} y_i}{n} \frac{1}{\lambda^2}$, which, evaluated at $\tilde{\lambda}$, leads to
$$
\lambda \mid \bfy \overset{\boldsymbol{\cdot}}{\sim} \normal \left( \frac{a - 1 + \sum_{i=1}^{n} y_i}{n+b} , \frac{a - 1 + \sum_{i=1}^{n} y_i}{(n+b)^2}
\right) .
$$
\end{itemize}
}
\frame{
\sffamily
\frametitle{The Laplace approximation}
\begin{itemize}
\item {\bf Example (Poisson-Gamma models, cont): } Alternatively, if we discard the prior and center the approximation on the MLE we get the familiar expression
$$
\lambda \mid \bfy \overset{\boldsymbol{\cdot}}{\sim} \normal \left( \frac{1}{n} \sum_{i=1}^{n} y_i , \sum_{i=1}^{n} \frac{y_i}{n^2}
\right).
$$
(Note that this is equivalent to setting $a=1$ and $b=0$ in the first approximation).
\vspace{1mm}
If we are interested in approximating the posterior mean, then we have
\begin{center}
\begin{tabular}{|l|c|} \hline
Approximation & Posterior mean \\ \hline
Exact & $\frac{a + \sum_{i=1}^{n}y_i}{b + n}$ \\
Like + Prior & $\frac{a -1 + \sum_{i=1}^{n}y_i}{b + n}$ \\
Like only & $\frac{\sum_{i=1}^{n}y_i}{n}$ \\ \hline
\end{tabular}
\end{center}
\vspace{1mm}
If $a$ is large and $n$ small, the differences can be substantial! (A quick numerical check follows on the next slide.)
\end{itemize}
}
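\begin{frame}[fragile]
\sffamily
\frametitle{The Laplace approximation}
A quick R check of the three posterior-mean formulas, using the illustrative values from the plots that follow ($a=2$, $b=1/5$, $\bar{y} = 2$) with $n=10$:
\begin{verbatim}
a <- 2; b <- 1/5; n <- 10; sum_y <- 2 * n  # ybar = 2
c(exact      = (a + sum_y) / (b + n),      # 2.157
  like_prior = (a - 1 + sum_y) / (b + n),  # 2.059
  like_only  = sum_y / n)                  # 2.000
\end{verbatim}
\end{frame}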
\frame{
\sffamily
\frametitle{The Laplace approximation}
\begin{itemize}
\item {\bf Example (Poisson-Gamma models): } The following graphs show a comparison of the two approximations against the true posterior for $a=2$, $b=1/5$, and $\bar{y} = 2$, with $n=10$ (left) and $n=50$ (right).
\includegraphics[height=5.2cm,angle=0]{plots/laplaceapprox_poissongamma_n10.pdf}
\includegraphics[height=5.2cm,angle=0]{plots/laplaceapprox_poissongamma_n50.pdf}
\end{itemize}
}
\frame{
\sffamily
\frametitle{The Laplace approximation}
\begin{itemize}
\item {\bf Example (Poisson-Gamma models): } The following table compares the exact posterior probability of the interval $(l,u)$ against these two approximations for different intervals and sample sizes. (A sketch on the next slide reproduces the first row.)
\begin{center}
\begin{tabular}{ccccc}
$(l,u)$ & $n$ & Exact & Like + Prior & Like only \\ \hline\hline
$(1.5, 2.8)$ & & 0.849 & 0.844 & 0.831 \\
$(1.8, 2.3)$ & $10$ & 0.420 & 0.422 & 0.421 \\
$(2.5, \infty)$ & & 0.218 & 0.163 & 0.132 \\ \hline
$(1.5, 2.8)$ & & 0.998 & 0.995 & 0.994 \\
$(1.8, 2.3)$ & $50$ & 0.783 & 0.780 & 0.775 \\
$(2.5, \infty)$ & & 0.014 & 0.007 & 0.006 \\ \hline
\end{tabular}
\end{center}
\end{itemize}
}
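\begin{frame}[fragile]
\sffamily
\frametitle{The Laplace approximation}
An R sketch (not from the original notes) of how the first row of the table ($n=10$, interval $(1.5, 2.8)$) can be reproduced:
\begin{verbatim}
a <- 2; b <- 1/5; n <- 10; sum_y <- 2 * n
l <- 1.5; u <- 2.8
# Exact: Gamma(a + sum(y), b + n) posterior
exact <- pgamma(u, a + sum_y, b + n) - pgamma(l, a + sum_y, b + n)
# Like + prior: normal centered at the posterior mode
m1 <- (a - 1 + sum_y) / (b + n); s1 <- sqrt(a - 1 + sum_y) / (b + n)
# Like only: normal centered at the MLE
m2 <- sum_y / n; s2 <- sqrt(sum_y) / n
c(exact      = exact,                                # 0.849
  like_prior = pnorm(u, m1, s1) - pnorm(l, m1, s1),  # 0.844
  like_only  = pnorm(u, m2, s2) - pnorm(l, m2, s2))  # 0.831
\end{verbatim}
\end{frame}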
\frame{
\sffamily
\frametitle{Integrated Nested Laplace Approximations}
\begin{itemize}
\item INLA is designed for hierarchical models:
\begin{align*}
y_i \mid \theta_i, \bfeta_1 &\sim p(y_i \mid \theta_i, \bfeta_1) & \bftheta \mid \bfeta_2 & \sim \normal \left( \mathbf{0} , \mathbf{Q}(\bfeta_2)^{-1} \right) ,
\end{align*}
where $\bfeta = (\bfeta_1, \bfeta_2) \sim p(\bfeta_1, \bfeta_2)$ and $m = \mbox{dim} (\bfeta_1) + \mbox{dim} (\bfeta_2)$ is small (usually $\le 6$).
\item Particularly useful if the precision matrix $\mathbf{Q}(\bfeta_2)$ is sparse.
\item Goal is to provide approximations to
$$
p(\theta_i \mid \bfy) = \int p(\theta_i \mid \bfeta, \bfy) p(\bfeta \mid \bfy) \dd \bfeta
$$
and
$$
p(\eta_j \mid \bfy) = \int p(\bfeta \mid \bfy) \dd \bfeta_{-j} .
$$
\item Uses Laplace approximations twice (hence the name).
\end{itemize}
}
\frame{
\sffamily
\frametitle{Approximating $p(\bfeta \mid \bfy)$}
\begin{itemize}
\item Start by approximating $p(\bfeta \mid \bfy)$ by
$$
\tilde{p}(\bfeta \mid \bfy) \propto \left. \frac{p(\bftheta, \bfeta, \bfy)}{\tilde{p}_G(\bftheta \mid \bfeta, \bfy)} \right|_{\bftheta = \hat{\bftheta}(\bfeta)}
$$
where $\tilde{p}_G(\bftheta \mid \bfeta, \bfy)$ is a Gaussian approximation to the full conditional of $\bftheta$
$$
p(\bftheta \mid \bfeta, \bfy) \propto \exp \left\{ -\frac{1}{2} \bftheta^T \mathbf{Q}\left(\bfeta_2\right) \bftheta + \sum_{i} \log\left\{ p(y_i \mid \theta_i, \bfeta_1) \right\} \right\}
$$
and $\hat{\bftheta}(\bfeta) = \arg\max_{\bftheta} p(\bftheta \mid \bfeta, \bfy)$. This is just the Laplace approximation around the posterior mode. (A toy sketch of this step follows on the next slide.)
\end{itemize}
}
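\begin{frame}[fragile]
\sffamily
\frametitle{Approximating $p(\bfeta \mid \bfy)$}
A toy R sketch (illustrative assumptions throughout) of the Gaussian approximation $\tilde{p}_G(\bftheta \mid \bfeta, \bfy)$ for Poisson counts with a log link: Newton--Raphson on the mode, with the curvature $\mathbf{Q} + \mbox{diag}(c)$ as the approximating precision. Here $\mathbf{Q}$ is a placeholder for $\mathbf{Q}(\bfeta_2)$.
\begin{verbatim}
set.seed(1)
n <- 20; y <- rpois(n, 3)
Q <- 2 * diag(n)                     # placeholder precision Q(eta2)
theta <- rep(0, n)
for (it in 1:50) {                   # Newton-Raphson iterations
  c_i <- exp(theta)                  # -(d^2/dtheta^2) log-likelihood
  b_i <- y - exp(theta) + c_i * theta
  theta_new <- solve(Q + diag(c_i), b_i)
  if (max(abs(theta_new - theta)) < 1e-8) break
  theta <- theta_new
}
# Approximation: theta | eta, y ~ N(theta_new, solve(Q + diag(c_i)))
\end{verbatim}
\end{frame}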
\frame{
\sffamily
\frametitle{Approximating $p(\bfeta \mid \bfy)$}
\begin{itemize}
\item Generally speaking, $\tilde{p}(\bfeta \mid \bfy)$ will not be tractable even if $\tilde{p}_G(\bftheta \mid \bfeta, \bfy)$ is. Hence, this approximation by itself is not too helpful for computing an approximation to $p(\theta_i \mid \bfy)$.
\item We go one step further and approximate the approximation using a discrete set of points,
$$
\tilde{p}^{*}(\bfeta \mid \bfy) = \sum_{i_1 \in \mathcal{I}_1} \cdots \sum_{i_m \in \mathcal{I}_m} w_{i_1, \cdots, i_m} \delta_{(\tilde{\eta}_{i_1}, \ldots, \tilde{\eta}_{i_m})} (\bfeta)
$$
where the points $(\tilde{\eta}_{i_1}, \ldots, \tilde{\eta}_{i_m})$ for $(i_1, \ldots, i_m) \in \mathcal{I}_1 \times \cdots \times \mathcal{I}_m$ form a regular grid and
$$
w_{i_1, \cdots, i_m} \propto \tilde{p}(\tilde{\eta}_{i_1}, \ldots, \tilde{\eta}_{i_m} \mid \bfy) .
$$
\item To select the grid points we use a second Laplace expansion (in this case of $\tilde{p}(\bfeta \mid \bfy)$). See the next slide; a toy sketch of the discretization follows it.
\end{itemize}
}
\frame{
\sffamily
\frametitle{Approximating $p(\bfeta \mid \bfy)$}
\begin{center}
\includegraphics[width=11cm,angle=0]{plots/INLA.pdf} \\
\hfill From \cite{rue2009approximate}.
\end{center}
}
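\begin{frame}[fragile]
\sffamily
\frametitle{Approximating $p(\bfeta \mid \bfy)$}
A toy R sketch of the discretization step, with $m = 2$ and a made-up stand-in for $\log \tilde{p}(\bfeta \mid \bfy)$ (in practice each evaluation requires a Laplace approximation), on a simple axis-aligned grid:
\begin{verbatim}
log_ptilde <- function(eta) -sum((eta - c(1, -0.5))^2)  # stand-in
grid <- expand.grid(eta1 = seq(-1,   3,   length.out = 15),
                    eta2 = seq(-2.5, 1.5, length.out = 15))
lw <- apply(grid, 1, log_ptilde)
w  <- exp(lw - max(lw)); w <- w / sum(w)  # normalized weights
colSums(grid * w)                         # e.g. posterior mean of eta
\end{verbatim}
Note the weights only require $\tilde{p}$ up to a constant; the normalization happens over the grid.
\end{frame}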
\frame{
\sffamily
\frametitle{Approximating $p(\eta_j \mid \bfy)$}
\begin{itemize}
\item It is natural to approximate $p(\eta_j \mid \bfy)$ by
$$
\tilde{p}(\eta_j \mid \bfy) = \int \tilde{p}(\bfeta \mid \bfy) \dd \bfeta_{-j} .
$$
\item The integral can be computed using standard numerical integration techniques.
\item However, it is much more efficient to construct the numerical integrator by reusing the grid of values used in the definition of $\tilde{p}^{*}(\bfeta \mid \bfy)$ (which is aligned with the principal axes of the distribution) rather than building a completely new grid aligned with the original axes (see the sketch on the next slide).
\end{itemize}
}
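\begin{frame}[fragile]
\sffamily
\frametitle{Approximating $p(\eta_j \mid \bfy)$}
Continuing the toy grid sketch from two slides back (\texttt{grid} and \texttt{w} as defined there), marginalizing reduces to summing the weights over the remaining coordinates:
\begin{verbatim}
marg_eta1 <- tapply(w, grid$eta1, sum)  # p(eta_1 | y) on the grid,
                                        # up to the grid spacing
\end{verbatim}
\end{frame}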
\frame{
\sffamily
\frametitle{Approximating $p(\theta_i \mid \bfy)$}
\begin{itemize}
\item If we had access to $p(\theta_i \mid \bfeta, \bfy)$ in closed form, the approximation $\tilde{p}^{*}(\bfeta \mid \bfy)$ could be used to compute $\hat{p}(\theta_i \mid \bfy)$ as
\begin{align*}
\hat{p}(\theta_i \mid \bfy) &= \int p(\theta_i \mid \bfeta, \bfy) \tilde{p}^{*}(\bfeta \mid \bfy) \dd \bfeta \\
& = \sum_{i_1 \in \mathcal{I}_1} \cdots \sum_{i_m \in \mathcal{I}_m} w_{i_1, \cdots, i_m}
p(\theta_i \mid (\tilde{\eta}_{i_1}, \ldots, \tilde{\eta}_{i_m}), \bfy)
\end{align*}
\item However, $p(\theta_i \mid \bfeta, \bfy)$ is not available in closed form, so we first need to approximate it. One option is to reuse the Gaussian approximation $\tilde{p}_G(\bftheta \mid \bfeta, \bfy)$ we already computed, taking its $i$-th marginal (a toy sketch of the resulting mixture follows this slide). Then
$$
\hat{p}^{*}(\theta_i \mid \bfy) = \sum_{i_1 \in \mathcal{I}_1} \cdots \sum_{i_m \in \mathcal{I}_m} w_{i_1, \cdots, i_m}
\tilde{p}_G(\theta_i \mid (\tilde{\eta}_{i_1}, \ldots, \tilde{\eta}_{i_m}), \bfy)
$$
\end{itemize}
}
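\begin{frame}[fragile]
\sffamily
\frametitle{Approximating $p(\theta_i \mid \bfy)$}
A toy R sketch of the final step: the marginal of $\theta_i$ is a weighted sum of conditional Gaussians over the grid. The conditional moments below are made-up stand-ins for the $i$-th marginal of $\tilde{p}_G$, and the weights are placeholders.
\begin{verbatim}
cond_mean <- function(eta) eta[1]            # stand-ins for the i-th
cond_sd   <- function(eta) exp(eta[2] / 2)   # marginal of p_G
grid <- expand.grid(eta1 = seq(-1,   3,   length.out = 15),
                    eta2 = seq(-2.5, 1.5, length.out = 15))
w  <- rep(1 / nrow(grid), nrow(grid))        # placeholder weights
th <- seq(-4, 6, length.out = 200)
dens <- rowSums(sapply(seq_len(nrow(grid)), function(k) {
  ek <- as.numeric(grid[k, ])
  w[k] * dnorm(th, cond_mean(ek), cond_sd(ek))
}))
plot(th, dens, type = "l")                   # approximate marginal
\end{verbatim}
\end{frame}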
\frame{
\sffamily
\frametitle{Approximating $p(\theta_i \mid \bfeta, \bfy)$}
\begin{itemize}
\item Reusing $\tilde{p}_G(\bftheta \mid \bfeta, \bfy)$ is expedient, but does not account for skewness and other non-Gaussian features of the conditional posterior.
\item An alternative is to use another Laplace expansion
$$
\tilde{p}_{LA}(\theta_i \mid \bfeta, \bfy) \propto \left. \frac{p(\bftheta, \bfeta, \bfy)}{\tilde{p}_{GG}(\bftheta_{-i} \mid \theta_i, \bfeta, \bfy)} \right|_{\bftheta_{-i} = \hat{\bftheta}_{-i} (\theta_i, \bfeta)}
$$
where $\tilde{p}_{GG}(\bftheta_{-i} \mid \theta_i, \bfeta, \bfy)$ is a Gaussian approximation to $p(\bftheta_{-i} \mid \theta_i, \bfeta, \bfy)$ (which is different from the conditional implied by the Gaussian approximation to $\bftheta \mid \bfeta, \bfy$).
\end{itemize}
}
\frame{
\sffamily
\frametitle{Laplace-based methods and NIMBLE}
\begin{itemize}
\item Currently your software options are R-INLA and the related TMB software.
\item NIMBLE is very close to having automatic differentiation, which is important for Laplace-based methods as they need gradients.
\item At that point, one of our target algorithms will be INLA, so we hope to see that in NIMBLE within a year.
\item Our hope is that by providing these methods in NIMBLE, users can:
\begin{itemize}
\item use BUGS code to specify their model,
\item use distributions and model structures not provided in R-INLA, and
\item more readily compare results with other algorithms.
\end{itemize}
\end{itemize}
}