\chapter*{Preface}
\addcontentsline{toc}{chapter}{Preface}
Naturally curious as a species, we all have a strong intuition
for how we can better understand the world around us. We
first make observations and then try to glean insight from
those observations. Formalizing this intuition into a robust
methodology for inference, however, is a delicate process.

Consider, for example, a ball flying through the air, slowly
falling under the force of gravity. After diligently recording the
positions of the ball over its trajectory we want to identify a
physical model that quantitatively describes its motion. Perfect
position measurements would strongly constrain the possible
models, allowing us to exactly recover not only the trajectory
taken by the ball but also any latent model parameters, such as
the acceleration due to gravity, $g$ (Figure \ref{fig:motivating_example}a).
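For example, if we assume that the ball is launched horizontally from an initial
height $y_{0}$ with a constant horizontal speed $v$, and we ignore air resistance,
then one candidate physical model relates the vertical position of the ball to its
horizontal position through the latent parameter $g$,
%
\begin{equation*}
y \! \left( x \right) = y_{0} - \frac{g}{2 v^{2}} \, x^{2}.
\end{equation*}
%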
Of course we cannot make perfect measurements in practice.
The observations we can make are limited by an inherent variability
that obscures the underlying trajectory; multiple trajectories, and
hence multiple physical models, are all consistent with the noisy
measurements (Figure \ref{fig:motivating_example}b). In order
to learn from these realistic measurements we need not only
models of the ball's motion and models of the variability in the
measurements, but also a means of quantifying the uncertainty
in which of those models are consistent with the observations.
\emph{The only conclusive statement we can make is how uncertain
we are.}
\begin{figure*}
\centering
\subfigure[]{
\begin{tikzpicture}[scale=0.25, thick]
\draw[dashed, rotate=180, color=gray70] (0, -10) parabola (-20, 0);
\fill[color=dark] (2, -0.025 * 2 * 2 + 10) circle (7pt);
\fill[color=light] (2, -0.025 * 2 * 2 + 10) circle (5pt);
\fill[color=dark] (5, -0.025 * 5 * 5 + 10) circle (7pt);
\fill[color=light] (5, -0.025 * 5 * 5 + 10) circle (5pt);
\fill[color=dark] (7, -0.025 * 7 * 7 + 10) circle (7pt);
\fill[color=light] (7, -0.025 * 7 * 7 + 10) circle (5pt);
\fill[color=dark] (10, -0.025 * 10 * 10 + 10) circle (7pt);
\fill[color=light] (10, -0.025 * 10 * 10 + 10) circle (5pt);
\fill[color=dark] (13, -0.025 * 13 * 13 + 10) circle (7pt);
\fill[color=light] (13, -0.025 * 13 * 13 + 10) circle (5pt);
\fill[color=dark] (14, -0.025 * 14 * 14 + 10) circle (7pt);
\fill[color=light] (14, -0.025 * 14 * 14 + 10) circle (5pt);
\fill[color=dark] (17, -0.025 * 17 * 17 + 10) circle (7pt);
\fill[color=light] (17, -0.025 * 17 * 17 + 10) circle (5pt);
\fill[color=dark] (18, -0.025 * 18 * 18 + 10) circle (7pt);
\fill[color=light] (18, -0.025 * 18 * 18 + 10) circle (5pt);
\fill[color=dark] (19, -0.025 * 19 * 19 + 10) circle (7pt);
\fill[color=light] (19, -0.025 * 19 * 19 + 10) circle (5pt);
\draw[->] (22, 13) node[above] {$g$} -- (22, 10);
\draw[->] (0, 0) -- (25,0) node[right] {$x$};
\draw[->] (0, 0) -- (0,15) node[above] {$y$};
\end{tikzpicture}
}
\subfigure[]{
\begin{tikzpicture}[scale=0.25, thick]
\draw[dashed, rotate=180, color=gray70] (0, -9) parabola (-19, 0);
\draw[dashed, rotate=180, color=gray70] (0, -11) parabola (-20, 0);
\draw[dashed, rotate=180, color=gray70] (0, -12) parabola (-20.25, 0);
\draw[dashed, rotate=180, color=gray70] (0, -10) parabola (-21, 0);
\fill[color=dark] (2, -0.025 * 2 * 2 + 10 -1.123067470) circle (7pt);
\fill[color=light] (2, -0.025 * 2 * 2 + 10 -1.123067470) circle (5pt);
\fill[color=dark] (5, -0.025 * 5 * 5 + 10 -0.330845947) circle (7pt);
\fill[color=light] (5, -0.025 * 5 * 5 + 10 -0.330845947) circle (5pt);
\fill[color=dark] (7, -0.025 * 7 * 7 + 10 -0.642779020) circle (7pt);
\fill[color=light] (7, -0.025 * 7 * 7 + 10 -0.642779020) circle (5pt);
\fill[color=dark] (10, -0.025 * 10 * 10 + 10 + 2.174536596) circle (7pt);
\fill[color=light] (10, -0.025 * 10 * 10 + 10 + 2.174536596) circle (5pt);
\fill[color=dark] (13, -0.025 * 13 * 13 + 10 -1.043929042) circle (7pt);
\fill[color=light] (13, -0.025 * 13 * 13 + 10 -1.043929042) circle (5pt);
\fill[color=dark] (14, -0.025 * 14 * 14 + 10 -0.526778959) circle (7pt);
\fill[color=light] (14, -0.025 * 14 * 14 + 10 -0.526778959) circle (5pt);
\fill[color=dark] (17, -0.025 * 17 * 17 + 10 + 1.290826126) circle (7pt);
\fill[color=light] (17, -0.025 * 17 * 17 + 10 + 1.290826126) circle (5pt);
\fill[color=dark] (18, -0.025 * 18 * 18 + 10 + 0.008352532) circle (7pt);
\fill[color=light] (18, -0.025 * 18 * 18 + 10 + 0.008352532) circle (5pt);
\fill[color=dark] (19, -0.025 * 19 * 19 + 10 -0.508723224) circle (7pt);
\fill[color=light] (19, -0.025 * 19 * 19 + 10 -0.508723224) circle (5pt);
\draw[->] (22, 13) node[above] {$g?$} -- (22, 10);
\draw[->] (0, 0) -- (25,0) node[right] {$x$};
\draw[->] (0, 0) -- (0,15) node[above] {$y$};
\end{tikzpicture}
}
\caption{(a) Perfect measurements of a ball falling under the influence
of gravity would strongly constrain any physical model of that motion,
including latent parameters such as the strength of gravity, $g$.
(b) In practice, however, measurements are inherently variable, which
limits our ability to infer the exact trajectory and hence any model of the
underlying motion. Here, as in all measurements, uncertainty is intrinsic
to learning.}
\label{fig:motivating_example}
\end{figure*}

This simple example demonstrates an intrinsic principle that underlies
science, industry, medicine, and any other field that attempts to learn from
observations: uncertainty is inherent to learning and decision making.
In particular, if we want to develop any formal methodology for inference
and decision making then we first need a formal procedure for quantifying
and manipulating uncertainty itself. \emph{Bayesian inference} uses
\emph{probability theory} to quantify all forms of uncertainty, including not
only the intrinsic variability of measurements but also ignorance in the
learning process itself. This unified perspective provides an elegant and
powerful approach for first making inferences and then making robust
decisions.
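To preview where all of this leads, and writing informally for the moment,
Bayesian inference updates a \emph{prior distribution} over the model
configurations, $\pi(\theta)$, into a \emph{posterior distribution},
$\pi(\theta \mid \tilde{y})$, that incorporates the information in the observed
data, $\tilde{y}$, through \emph{Bayes' theorem},
%
\begin{equation*}
\pi \! \left( \theta \mid \tilde{y} \right)
= \frac{ \pi \! \left( \tilde{y} \mid \theta \right) \, \pi \! \left( \theta \right) }
       { \pi \! \left( \tilde{y} \right) }.
\end{equation*}
%
We will not be able to define any of these objects precisely until the necessary
probability theory has been developed.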
\section{Our Objectives}
Our goal in this review is for the reader to learn how probability
theory can be used to quantify uncertainty both in theory and in practice,
ultimately motivating Bayesian inference itself. Unfortunately, this seemingly
straightforward objective ends up being quickly complicated by the subtle
and often counterintuitive nature of probability theory itself.

Many introductory treatments of Bayesian inference ignore these subtleties
entirely, oversimplifying the subject and neglecting many of its finer technical
aspects. Unfortunately, these technicalities are not irrelevant -- indeed they
often have a strong influence on practical applications of the theory. Without
at least a conceptual understanding of these technicalities, the reader is then
subject to dangerous fallacies and, ultimately, fragile analyses.

On the other hand, the more formal treatments of probability theory that do
address these subtleties are typically mired in abstract, mathematical pedantry
with little or no discussion of how the theory applies to inference. A few
treatments will introduce the application of these ideas to frequentist inference,
but rarely is there any discussion of the more general Bayesian perspective.

In this review we attempt an intermediate treatment where we provide a deeper
introduction to probability theory and Bayesian inference than most introductory
references, but focus entirely on concepts rather than proofs. We hope that
this will give Stan users the foundational understanding they need to properly
wield Bayesian inference in practice and take full advantage of their measurements.
\section{Our Strategy}
Probability theory, like much of higher mathematics, is challenging because
of its abstraction; the theory is defined in complete generality with only abstract
objects and manipulations. While much can be proved about the behavior
of these abstractions, there is no general way to explicitly specify these objects
and implement their manipulations. Without explicit examples it then becomes
difficult to develop accurate intuition and understanding.

To apply probability theory in practice we need to first define a \emph{representation}
that maps these abstractions into an explicit context where we can compute.
For example, we can map into a discrete space where manipulations reduce
to counting or a real space where manipulations reduce to integration. The
challenge in using these representations is to ensure that the calculations depend
only on the properties of the original abstract system and not on any irrelevant details
of the representation itself. Understanding exactly what details are relevant or
not, however, requires at least some comprehension of the abstract theory.
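For instance, anticipating objects that we will define carefully later on, the
probability allocated to a subset $S$ might be computed in a discrete
representation by summing a probability mass function and in a real
representation by integrating a probability density function,
%
\begin{align*}
P \! \left[ S \right] &= \sum_{s \in S} p \! \left( s \right),
&
P \! \left[ S \right] &= \int_{S} \pi \! \left( x \right) \mathrm{d} x.
\end{align*}
%
The abstract probability is the same in both cases; only the explicit calculation
changes with the representation.
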
Many introductions to probability and statistics simply jump into one of these
representations immediately, ignoring the abstractions, and the subtleties as to
which operations are valid, altogether. This approach inevitably leads to
confused frustration. \emph{What do you mean a probability \emph{density} isn't
a probability? What do you mean most observations won't be near the mode
of the probability density?} Unfortunately, this confusion manifests not just in
frustration but also in incorrect statistical analyses that can have serious consequences.
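To preview how the first of these questions is eventually resolved: a probability
density can take values greater than one, for example the density of a uniform
probability distribution over the interval $[0, 1/2]$,
%
\begin{equation*}
\pi \! \left( x \right) = 2, \quad 0 \le x \le \frac{1}{2},
\end{equation*}
%
and only its integrals over subsets, such as
$\int_{0}^{1/4} \pi \! \left( x \right) \mathrm{d} x = 1/2$,
define probabilities.
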
In this introduction we will not ignore the abstractions of probability theory.
As with more formal treatments we will instead begin by discussing the abstract
theory and only then discuss representations and the explicit computations that
they allow. Unlike those treatments, however, our presentation will be largely
conceptual, meaning lots of definitions of abstract objects and their manipulations
but little concern for any technical details that do not end up affecting the
application of the theory in practice. As we develop representations we'll discuss
more and more specific examples to develop as much intuitive context as we can while
also emphasizing the limitations of these representations.

We'll first introduce logic as a means of quantifying information with certainty,
and then probability theory as a means of quantifying uncertainty about that
information. In both cases we'll first consider abstract definitions and then the
explicit representations of these concepts needed in practice. Next we'll discuss
how to implement probabilistic computations and survey many popular computational
methods. Finally we'll show how all of these ideas come together in Bayesian
inference.

This will not be an easy journey. It requires a significant investment from the
reader, but that investment will be rewarded with a robust understanding of
statistics that will enable some amazing science.
\section{Mathematical Background and Notation}
A thorough review of probability theory and its application requires a nontrivial
mathematical background. We have attempted to make this review as
self-contained as possible regarding probability theory itself, but we do have to
assume that the reader is comfortable with the basics of set theory and differential
and integral calculus over the real numbers. We highly encourage anyone whose
math might be rusty to brush up before proceeding.

Throughout we will use common set theory notation. If $A$ is a set
then any element of the set is written as $a \in A$ while a subset is written
as $S \subset A$. Sets are also sometimes denoted by their elements, for
example $A = \left\{ a_{1}, \ldots, a_{N} \right\}$. The \emph{set builder
notation} is similarly used to denote subsets as
$S = \left\{ a \in A \mid \cdot \right\}$, where $\cdot$ is the condition identifying
which elements of $A$ are in the subset $S \subset A$. For example, we
can define the positive real numbers as
%
\begin{equation*}
\RR^{+} = \left\{ x \in \RR \mid x > 0 \right\}.
\end{equation*}
%
The \emph{union} of two sets, $A \cup B$, is the combination of all
elements in either set, while the \emph{intersection} of two sets,
$A \cap B$, contains only those elements that appear in both sets.
If $S \subset A$ is a subset then its \emph{complement}, $S^{c}$, is
the collection of all elements of $A$ not in $S$.
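For example, if $A = \left\{1, 2, 3, 4 \right\}$ and $B = \left\{3, 4, 5 \right\}$ then
%
\begin{align*}
A \cup B &= \left\{1, 2, 3, 4, 5 \right\},
\\
A \cap B &= \left\{3, 4 \right\},
\end{align*}
%
while the complement of the subset $S = \left\{1, 2 \right\} \subset A$ is
$S^{c} = \left\{3, 4 \right\}$.
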
\emph{Spaces} are sets endowed with a structure called a
\emph{topology} that allows us to separate ``well-behaved subsets''
from ``pathological subsets''. We will assume that all of our sets
have such a structure and consequently, for all intents and purposes,
\emph{set} and \emph{space} will be used interchangeably. Topologies are also
useful for characterizing spaces. For example, the familiar properties
of discrete spaces and continuous spaces, such as the real numbers,
are defined by their respective topologies.

Throughout we will use the common notation for maps from one set into
another, $f : A \rightarrow B$, which defines $f$ as a map taking elements
of the set $A$ to elements of the set $B$. In other words, $f(a) \in B$ for
any $a \in A$. Sometimes we will be more explicit regarding the action
on a given point and write
%
\begin{align*}
f &: A \rightarrow B \\
&\quad a \mapsto f \! \left( a \right).
\end{align*}
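For example, the map that squares a real number would be written as
%
\begin{align*}
f &: \RR \rightarrow \RR \\
&\quad x \mapsto x^{2}.
\end{align*}
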
At times we will be less precise. For example, when discussing computation
we will liberally use $\approx$ to denote when two objects are approximately
equal, or when we \emph{assume} that they are approximately equal, without
making any effort to formally define what ``approximately equal'' means.

Similarly, we will make no attempt at the full mathematical rigor necessary for
a complete understanding of the intricacies of probability theory, and instead
focus on developing a high-level, conceptual intuition. In particular, in many
places we will appeal to vague notions like ``well-behaved'', as their technical
definitions do not offer much pedagogical benefit.