\chapter{Probability in Practice: Deterministic Estimators}
If we can't compute expectations with respect to our target distribution
analytically, one immediate strategy is to replace it with a simpler one.
In other words, we can approximate a complex target distribution, $\pi$,
with a simpler distribution, $\widetilde{\pi}$, whose expectations, or at
least some expectations of practical interest, are known analytically,
%
\begin{equation*}
\EE_{\pi} \! \left[ f \right]
\approx
\EE_{\widetilde{\pi}} \! \left[ f \right].
\end{equation*}
\emph{Deterministic estimators} use various criteria to identify an
optimal approximating distribution so that expectations can be
approximated deterministically.
\section{Modal Estimators}
Ideally all expectations with respect to our approximating distribution
would be analytic so that we could use it to approximate any
expectation with respect to the target distribution. The only
probability distribution that fits this criterion is the \emph{Dirac distribution},
$\delta_{\tilde{\theta}}$, that assigns all probability to a single point
in the sample space, $\tilde{\theta}$,
%
\begin{equation*}
\delta_{\tilde{\theta}} \! \left[ E \right] =
\left\{
\begin{array}{rr}
0, & \tilde{\theta} \notin E \\
1, & \tilde{\theta} \in E
\end{array}
\right. .
\end{equation*}
%
Because all probability concentrates at $\tilde{\theta}$, expectations
are trivial,
%
\begin{equation*}
\EE_{\delta} \! \left[ f \right]
=
f ( \tilde{\theta} ).
\end{equation*}
Where, however, should we assign all probability to best approximate
the target distribution and its typical set? One of the simplest, and
consequently most popular, deterministic estimation strategies is
\emph{modal estimation}, where we approximate the target distribution
with a Dirac distribution at the mode of the probability density function,
%
\begin{equation*}
\hat{\theta} = \underset{\theta \, \in \, \Theta}{\mathrm{argmax}} \, p \! \left( \theta \right).
\end{equation*}
%
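In practice the mode is rarely available in closed form and is instead found
numerically, typically by minimizing the negative logarithm of the density with
an off-the-shelf optimizer. The following sketch is purely illustrative: the
placeholder target, \texttt{neg\_log\_density}, and the initial point are
assumptions for demonstration rather than part of any particular model.
%
\begin{verbatim}
# Minimal sketch of numerical modal estimation; the target below is a
# placeholder standard normal and stands in for -log p(theta).
import numpy as np
from scipy.optimize import minimize

def neg_log_density(theta):
    return 0.5 * np.dot(theta, theta)

theta_init = np.ones(5)                       # arbitrary starting point
fit = minimize(neg_log_density, theta_init, method="BFGS")
theta_hat = fit.x                             # modal estimate

# Any expectation is then "estimated" by evaluating f at the mode,
# E_pi[f] ~ f(theta_hat).
\end{verbatim}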
Modal estimation, however, immediately contradicts the intuition provided by
concentration of measure: it utilizes a single point in the sample space
that lies outside of the typical set and depends entirely on the choice of
representation! Why, then, is modal estimation so ubiquitous?
Modal estimators are seductive because the optimization on which they
rely is relatively computationally inexpensive. Moreover, in some very
simple cases modal estimators constructed from the probability density
function in \emph{some} representations can be reasonably accurate for
\emph{some} functions. For example, if the target probability distribution
is sufficiently simple that the typical set is approximately symmetric
\emph{and} if a representation can be found such that the mode of the
probability density function lies in the center of that typical set, then the
corresponding modal estimator might yield reasonably accurate estimates
for the mean, $\EE_{\pi} \! \left[ \theta \right]$
(Figure \ref{fig:good_modal_bad_modal}a). Other choices of
representation can just as easily yield terribly inaccurate estimators
(Figure \ref{fig:good_modal_bad_modal}b).
\begin{figure*}
\centering
\subfigure[]{
\begin{tikzpicture}[scale=0.3, thick]
\begin{scope}
\clip (-12, -6) rectangle (12, 10);
\foreach \i in {0, 0.05,..., 1} {
\draw[line width={30 * \i}, opacity={exp(-8 * \i)}, dark]
(-10, 0) .. controls (-10, 15) and (-5, 5) .. (0, 5)
.. controls (5, 5) and (10, 8) .. (10, 0)
.. controls (10, -8) and (5, -3) .. (0, -3)
.. controls (-5, -3) and (-10, -5) .. (-10, 0);
}
\end{scope}
\fill[color=dark] (0, 1) circle (7pt);
\fill[color=light] (0, 1) circle (5pt)
node[left, color=black] { $\EE_{\pi} \! \left[ \theta \right]$ };
\fill[color=dark] (1, 0.5) circle (7pt);
\fill[color=green] (1, 0.5) circle (5pt)
node[right, color=black] { $\hat{\theta}$ };
\end{tikzpicture}
}
\subfigure[]{
\begin{tikzpicture}[scale=0.3, thick]
\begin{scope}
\clip (-12, -6) rectangle (12, 10);
\foreach \i in {0, 0.05,..., 1} {
\draw[line width={30 * \i}, opacity={exp(-8 * \i)}, dark]
(-10, 0) .. controls (-10, 15) and (-5, 5) .. (0, 5)
.. controls (5, 5) and (10, 8) .. (10, 0)
.. controls (10, -8) and (5, -3) .. (0, -3)
.. controls (-5, -3) and (-10, -5) .. (-10, 0);
}
\end{scope}
\fill[color=dark] (0, 1) circle (7pt);
\fill[color=light] (0, 1) circle (5pt)
node[left, color=black] { $\EE_{\pi} \! \left[ \theta \right]$ };
\fill[color=dark] (5, 0.5) circle (7pt);
\fill[color=green] (5, 0.5) circle (5pt)
node[right, color=black] { $\hat{\theta}$ };
\end{tikzpicture}
}
\caption{(a) In simple cases a prescient choice of representation can
yield a modal estimator that well approximates some expectations,
such as the mean of the real parameters $\theta$. (b) Poor choices
of the representation, however, yield very inaccurate estimates, even
in these simple problems.
}
\label{fig:good_modal_bad_modal}
\end{figure*}
Moreover, because they rely on a point estimate, modal estimators are
terrible at approximating expectations that depend on the breadth of
the typical set, such as the variance. Furthermore, identifying the optimal
representation for a particular target function, even if one exists, is
extremely challenging in practice. Worse, we have no generic means
of even quantifying the error in these estimators for a generic target
distribution. Ultimately this strong sensitivity to the choice of a particular
representation and the inability to validate the accuracy of the estimators
make modal estimation extremely fragile in practice.
\section{Laplace Estimators}
The fragility of modal estimators can be partially resolved by generalizing
them to \emph{Laplace estimators}, which approximate the target probability
density function with a Gaussian density function,
%
\begin{equation*}
p \! \left( \theta \right) \approx
\mathcal{N} \! \left( \theta \mid \mu, \Sigma \right),
\end{equation*}
%
where the mean is given by the modal estimate,
%
\begin{equation*}
\mu = \hat{\theta},
\end{equation*}
%
and the inverse covariance is given by the negative Hessian of the log
probability density function evaluated at the mode,
%
\begin{equation*}
\left( \Sigma^{-1} \right)_{ij} = - \left.
\frac{ \partial^{2} \log p \! \left( \theta \right) }
{ \partial \theta_{i} \partial \theta_{j} }
\right|_{\hat{\theta}}.
\end{equation*}
%
Expectations are then estimated with Gaussian integrals
%
\begin{equation*}
\EE_{\pi} \! \left[ f \right]
\approx
\int_{\Theta} f \! \left( \theta \right)
\mathcal{N} \! \left( \theta \mid \mu, \Sigma \right) \dd^{D} \theta,
\end{equation*}
%
which often admit analytic solutions.
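As a rough numerical sketch, and reusing the hypothetical
\texttt{neg\_log\_density} placeholder from the previous section, a Laplace
estimator can be assembled by locating the mode and then finite-differencing
the negative log density to approximate the Hessian; none of the choices
below are prescriptive.
%
\begin{verbatim}
# Minimal sketch of a Laplace estimator built on finite differences.
import numpy as np
from scipy.optimize import minimize

def neg_log_density(theta):
    return 0.5 * np.dot(theta, theta)        # placeholder -log p(theta)

theta_hat = minimize(neg_log_density, np.ones(5), method="BFGS").x

def hessian(f, x, eps=1e-4):
    # Central finite-difference estimate of the Hessian of f at x.
    D = x.size
    H = np.empty((D, D))
    for i in range(D):
        for j in range(D):
            ei, ej = np.zeros(D), np.zeros(D)
            ei[i], ej[j] = eps, eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
    return H

# The Hessian of -log p at the mode is the inverse covariance.
Sigma = np.linalg.inv(hessian(neg_log_density, theta_hat))
mu = theta_hat

# Gaussian estimates of, e.g., the component means and variances:
mean_est = mu
var_est = np.diag(Sigma)
\end{verbatim}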
The accuracy of a Laplace approximation depends on how well the
mode and the Hessian of the log probability density function quantify the
geometry of the typical set (Figure \ref{fig:laplace}). Unfortunately the
conditions that are necessary for these estimates to be reasonably
accurate hold only for very simple probability distributions, and then
only if an appropriate representation can be found.
\begin{figure*}
\centering
\begin{tikzpicture}[scale=0.35, thick]
\begin{scope}
\clip (-12, -6) rectangle (12, 10);
\foreach \i in {0, 0.05,..., 1} {
\draw[line width={30 * \i}, opacity={exp(-8 * \i)}, dark]
(-10, 0) .. controls (-10, 15) and (-5, 5) .. (0, 5)
.. controls (5, 5) and (10, 8) .. (10, 0)
.. controls (10, -8) and (5, -3) .. (0, -3)
.. controls (-5, -3) and (-10, -5) .. (-10, 0);
}
\foreach \i in {0, 0.05,..., 1} {
\draw[line width={30 * \i}, opacity={exp(-8 * \i)}, green]
(1, 0.5) ellipse (10 and 5);
}
\end{scope}
\fill[color=dark] (1, 0.5) circle (7pt);
\fill[color=green] (1, 0.5) circle (5pt)
node[right, color=black] { $\hat{\theta}$ };
\end{tikzpicture}
\caption{For simple probability distributions in well-chosen
representations, the local geometry around the mode of a
probability density function can quantify the geometry of the
entire typical set, yielding accurate Laplace estimators. For
more complex probability distributions, however, this local
information poorly quantifies the global geometry of the
typical set and Laplace estimators suffer from large biases.
}
\label{fig:laplace}
\end{figure*}
As with the simpler modal estimators, the dependence on the particular
representation manifests as fragility of the corresponding estimators,
and we have no generic means of quantifying that error in practice.
If we want robust estimation of probabilities and expectations then we
need strategies that do not depend on the irrelevant properties of any
particular representation.
\section{Variational Estimators}
In order to construct an approximation that is not sensitive to the choice
of a particular representation we need to frame the problem as an
optimization over a space of approximating distributions directly.
Optimizations over spaces of probability distributions fall into a class
of algorithms known as \emph{variational methods}.
Variational methods are characterized by two choices: the variational
family and a divergence function. The \emph{variational family},
$\mathcal{U}$, is a set of probability distributions such that at least
some expectations can be computed analytically. In order to identify
the best approximation to the target distribution we then define a
\emph{divergence function} on the space, $\mathcal{P}$, of all probability
distributions over the given sample space,
%
\begin{align*}
D &:
\mathcal{P} \times \mathcal{P}
\rightarrow \mathbb{R}^{+}
\\
& \quad \upsilon_{1}, \upsilon_{2}
\mapsto D \! \left( \upsilon_{1} \mid\mid \upsilon_{2} \right),
\end{align*}
%
which is zero if the two arguments are the same and increases as they
deviate from each other.
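A common concrete choice, for example, is the Kullback-Leibler divergence,
%
\begin{equation*}
\mathrm{KL} \! \left( \upsilon_{1} \mid\mid \upsilon_{2} \right)
=
\int_{\Theta} \log
\frac{ \upsilon_{1} \! \left( \theta \right) }
     { \upsilon_{2} \! \left( \theta \right) }
\, \upsilon_{1} \! \left( \theta \right) \dd^{D} \theta,
\end{equation*}
%
where, with a slight abuse of notation, $\upsilon_{1} \! \left( \theta \right)$
and $\upsilon_{2} \! \left( \theta \right)$ denote the corresponding probability
density functions. The Kullback-Leibler divergence vanishes exactly when its two
arguments are equal but, like many useful divergence functions, it is not
symmetric, so the ordering of the target and approximating distributions matters.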
The best approximating distribution is then defined by the variational
objective (Figure \ref{fig:variational_cartoon}),
%
\begin{equation*}
\upsilon^{*}
=
\underset{\upsilon \, \in \, \mathcal{U}}{\mathrm{argmin}} \,
D \! \left( \pi \mid\mid \upsilon \right).
\end{equation*}
%
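To make the abstraction concrete, the following toy sketch fits a
one-dimensional normal family to a placeholder target by numerically minimizing
the Kullback-Leibler divergence
$\mathrm{KL} \! \left( \pi \mid\mid \upsilon \right)$ with quadrature; the
target density, the divergence, and the optimizer are all illustrative
assumptions rather than recommendations.
%
\begin{verbatim}
# Toy sketch of a variational fit: a normal family, a placeholder
# mixture target, and a quadrature-based KL(pi || v) objective.
import numpy as np
from scipy import stats
from scipy.integrate import quad
from scipy.optimize import minimize

def log_target(theta):
    # Placeholder target: an uneven mixture of two normal densities.
    return np.log(0.7 * stats.norm.pdf(theta, -1.0, 1.0)
                  + 0.3 * stats.norm.pdf(theta, 3.0, 0.5))

def kl_objective(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    def integrand(theta):
        p = np.exp(log_target(theta))
        return p * (log_target(theta)
                    - stats.norm.logpdf(theta, mu, sigma))
    value, _ = quad(integrand, -15.0, 15.0, limit=200)
    return value

fit = minimize(kl_objective, x0=np.array([0.0, 0.0]),
               method="Nelder-Mead")
mu_star, sigma_star = fit.x[0], np.exp(fit.x[1])
\end{verbatim}
%
With this particular ordering of the arguments the optimal normal ends up
matching the mean and variance of the target, which is one way of seeing how
the choice of divergence shapes the character of the resulting approximation.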
Although straightforward to define, this variational optimization can be
quite challenging in practice. Depending on how the target distribution
interacts with the choice of variational family and divergence
function, the variational objective might feature multiple critical points
and we may not be able to find the global optimum
(Figure \ref{fig:variational_challenges}).
\begin{figure*}
\centering
%
\subfigure[]{
\begin{tikzpicture}[scale=0.225, thick]
\draw[color=white] (-15, 0) -- (15, 0);
\fill[mid] (0, 0) ellipse (13 and 7);
\node at (11, -6) {$\mathcal{P}$};
\fill[light] (-4, 2) ellipse (6 and 3);
\node[color=white] at (-2, -2.5) {$\mathcal{U}$};
\fill[color=white] (4, 2) circle (8pt)
node[right, color=white] {$\pi$};
\end{tikzpicture}
}
\subfigure[]{
\begin{tikzpicture}[scale=0.225, thick]
\draw[color=white] (-15, 0) -- (15, 0);
\fill[mid] (0, 0) ellipse (13 and 7);
\node at (11, -6) {$\mathcal{P}$};
\draw[color=light, dashed] (-4, 2) ellipse (6 and 3);
\node[color=white] at (-2, -2.5) {$\mathcal{U}$};
\begin{scope}
\clip (-4, 2) ellipse (6 and 3);
\foreach \i in {0, 0.05,..., 1} {
\fill[opacity={exp(-5 * \i*\i)}, light] (0, 2) circle ({4 * \i});
}
\end{scope}
\fill[color=white] (2, 2) circle (8pt)
node[above, color=white] {$\upsilon^{*}$};
\fill[color=white] (4, 2) circle (8pt)
node[right, color=white] {$\pi$};
\end{tikzpicture}
}
\caption{(a) A variational family, $\mathcal{U}$, is a set of probability
distributions taken from the set, $\mathcal{P}$, of all probability
distributions over the same sample space as the target distribution.
(b) Adding a divergence function distinguishes which elements of
$\mathcal{U}$ are good approximations to the target distribution,
allowing us to identify the best approximation, $\upsilon^{*}$.
}
\label{fig:variational_cartoon}
\end{figure*}
\begin{figure*}
\centering
\begin{tikzpicture}[scale=0.225, thick]
\draw[color=white] (-15, 0) -- (15, 0);
\fill[mid] (0, 0) ellipse (13 and 7);
\node at (11, -6) {$\mathcal{P}$};
\draw[color=light, dashed] (-4, 2) ellipse (6 and 3);
\node[color=white] at (-2, -2.5) {$\mathcal{U}$};
\begin{scope}
\clip (-4, 2) ellipse (6 and 3);
\foreach \i in {0, 0.05,..., 1} {
\fill[opacity={exp(-5 * \i*\i)}, light] (0, 2) circle ({4 * \i});
}
\foreach \i in {0, 0.05,..., 1} {
\fill[opacity={exp(-5 * \i*\i)}, light] (-4, 3) ellipse ({4 * \i} and {3 * \i});
}
\foreach \i in {0, 0.05,..., 1} {
\fill[opacity={exp(-5 * \i*\i)}, light] (-8, 1) ellipse ({4 * \i} and {5 * \i});
}
\foreach \i in {0, 0.05,..., 1} {
\fill[opacity={exp(-5 * \i*\i)}, light] (-3, -1) ellipse ({3 * \i} and {2 * \i});
}
\end{scope}
\fill[color=white] (2, 2) circle (8pt)
node[above, color=white] {$\upsilon^{*}$};
\fill[color=white] (4, 2) circle (8pt)
node[right, color=white] {$\pi$};
\end{tikzpicture}
\caption{Typically different elements of the variational family are able to
capture different characteristics of the target distribution and the variational
objective manifests multiple optima. Even if a unique global optimum,
$\upsilon^{*}$, exists, it will be difficult to find and we may be left with only
a suboptimal local optimum.
}
\label{fig:variational_challenges}
\end{figure*}
Even if we could find a global optimum, however, there is no guarantee
that the chosen variational family is rich enough that the best approximating
distribution will yield accurate estimates for all relevant expectations. For
example, some divergence functions are biased towards variational solutions
that underestimate the breadth of the typical set while others tend to
significantly overestimate it (Figure \ref{fig:variational_approximations}).
Here the means may be approximated adequately well, but the variances
would not be.
Variational methods are relatively new to statistics and at the moment
there are no generic methods for quantifying the error in variational
estimators.
\begin{figure*}
\centering
\subfigure[]{
\begin{tikzpicture}[scale=0.35, thick]
\begin{scope}
\clip (-12, -6) rectangle (12, 10);
\foreach \i in {0, 0.05,..., 1} {
\draw[line width={30 * \i}, opacity={exp(-8 * \i)}, dark]
(-10, 0) .. controls (-10, 15) and (-5, 5) .. (0, 5)
.. controls (5, 5) and (10, 8) .. (10, 0)
.. controls (10, -8) and (5, -3) .. (0, -3)
.. controls (-5, -3) and (-10, -5) .. (-10, 0);
}
\foreach \i in {0, 0.05,..., 1} {
\draw[line width={30 * \i}, opacity={exp(-8 * \i)}, green]
(0, 1) ellipse (9.75 and 3.75);
}
\end{scope}
\end{tikzpicture}
}
\subfigure[]{
\begin{tikzpicture}[scale=0.35, thick]
\begin{scope}
\clip (-14, -8) rectangle (14, 14);
\foreach \i in {0, 0.05,..., 1} {
\draw[line width={30 * \i}, opacity={exp(-8 * \i)}, dark]
(-10, 0) .. controls (-10, 15) and (-5, 5) .. (0, 5)
.. controls (5, 5) and (10, 8) .. (10, 0)
.. controls (10, -8) and (5, -3) .. (0, -3)
.. controls (-5, -3) and (-10, -5) .. (-10, 0);
}
\foreach \i in {0, 0.05,..., 1} {
\draw[line width={30 * \i}, opacity={exp(-8 * \i)}, green]
(0, 1.75) ellipse (12 and 9);
}
\end{scope}
\end{tikzpicture}
}
\caption{Intuition about the effects of a particular variational divergence
function can be developed by considering how the typical set of an
approximating distribution (green) interacts with the typical set of the target
distribution (red). (a) Some divergence functions favor approximating
distributions that collapse into the interior of the target typical set,
resulting in an underestimate of the breadth of the typical set. (b)
Others, however, favor approximating distributions that expand around
the exterior of the target typical set, resulting in an overestimate of that breadth.
}
\label{fig:variational_approximations}
\end{figure*}