\section{Model selection}
\frame{
\sffamily
\frametitle{Model selection and hypothesis testing}
\begin{itemize}
\item Most of the discussion so far has focused on Monte Carlo methods for point and interval estimation.
\item We move now to discuss MCMC methods in the context of model selection and hypothesis testing.
\item The approach we will focus on turns the model selection problem into an estimation problem for a discrete random variable (the model indicator).
\end{itemize}
}
\frame{
\sffamily
\frametitle{Variable selection}
\begin{itemize}
\item {\bf Example (variable selection in linear models):} Consider a linear model of the form $\bfy = \bfX \bftheta + \bfepsilon$ with $\bfepsilon \sim \normal( \mathbf{0}, \sigma^2 \mathbf{I} )$, $\mbox{dim} \{ \bfy \} = n$, and $\mbox{dim} \{ \bftheta \} = p$. We are interested in selecting a subset of the columns of $\bfX$ that explain the response $\bfy$.
\vspace{1mm}
Since we have a total of $p$ variables, we need to select among $2^p$ possible models. Each of these models can be represented using a vector $\bfgamma = (\gamma_1, \ldots, \gamma_p)$ where $\gamma_k = 1$ if variable $k$ is in the model, and $\gamma_k = 0$ otherwise.
\vspace{1mm}
To address this problem from a Bayesian perspective we need to compute the posterior probability associated with each model. To do so we need a prior distribution $p(\bftheta, \sigma^2, \bfgamma)$, which we will conveniently break down as $p(\bftheta \mid \sigma^2, \bfgamma) p(\sigma^2 \mid \bfgamma) p( \bfgamma)$.
\end{itemize}
}
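\begin{frame}[fragile]
\sffamily
\frametitle{Variable selection: representing the model space}
As a quick illustration (not part of the original analysis), the $2^p$ models can be represented in R as the rows of a binary matrix, one row per $\bfgamma$ vector; $p = 3$ here is a hypothetical small example:
\begin{verbatim}
## Each row of `models` is one gamma vector
p <- 3
models <- as.matrix(expand.grid(rep(list(0:1), p)))
colnames(models) <- paste0("gamma", 1:p)
nrow(models)   # 2^p = 8 models
\end{verbatim}
\end{frame}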
\frame{
\sffamily
\frametitle{Variable selection}
\begin{itemize}
\item {\bf Example (variable selection in linear models, cont):} For the regression coefficients, we focus on priors of the form
$$
\bftheta \mid \sigma^2 , \bfgamma \sim \normal \left(\bftheta_{\bfgamma} \mid \mathbf{0}, c \sigma^2
\left\{ \bfX_{\bfgamma}^{T}\bfX_{\bfgamma} \right\}^{-1} \right) \prod_{k : \gamma_k = 0} \delta_{0}(\theta_k) ,
$$
where $\bfX_{\bfgamma}$ denotes the matrix composed of the columns of $\bfX$ for which the associated $\gamma_k = 1$, and $\delta_0(\cdot)$ denotes the degenerate measure at $0$.
\vspace{1mm}
For the variance we use $\sigma^2 \mid \bfgamma \sim \IGam(a,b)$ (which is in fact independent of $\bfgamma$).
\vspace{1mm}
For the model indicator we set $p(\bfgamma) = \frac{\Gamma(1 + \sum_{k=1}^{p} \gamma_k) \Gamma(1 + p - \sum_{k=1}^{p} \gamma_k)}{\Gamma(p + 2)}$ (a Beta-Bernoulli prior).
\vspace{1mm}
Finally, we set $c = n$ (more generally, we could place a prior on $c$ centered at $n$).
\end{itemize}
}
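\begin{frame}[fragile]
\sffamily
\frametitle{Variable selection: evaluating the prior}
A minimal R sketch of the Beta-Bernoulli prior on $\bfgamma$ from the previous slide, working on the log scale for numerical stability (the function name is ours):
\begin{verbatim}
## log p(gamma) = lgamma(1 + s) + lgamma(1 + p - s)
##                - lgamma(p + 2),  where s = sum(gamma)
log_prior <- function(gamma) {
  p <- length(gamma); s <- sum(gamma)
  lgamma(1 + s) + lgamma(1 + p - s) - lgamma(p + 2)
}
log_prior(c(1, 0, 1))   # example evaluation for p = 3
\end{verbatim}
\end{frame}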
\frame{
\sffamily
\frametitle{Variable selection}
\begin{itemize}
\item {\bf Example (variable selection in linear models, cont):} Note that in this analysis we are really interested in the marginal posterior $p(\bfgamma \mid \bfy)$, and we can think of $\bftheta$ and $\sigma^2$ as ``nuisance'' parameters. Integrating them out we have
{\scriptsize
\begin{multline*}
p(\bfy \mid \bfgamma) = \left(\frac{1}{2\pi}\right)^{\frac{n}{2}} \left|
\mathbf{I} + c\bfX_{\bfgamma} \left(\bfX_{\bfgamma}^{T} \bfX_{\bfgamma}\right)^{-1} \bfX_{\bfgamma}^T \right|^{-\frac{1}{2}} \\
%
\frac{b^a}{\Gamma(a)}
\frac{\Gamma \left( a + \frac{n}{2} \right)}
{\left( b + \frac{\bfy^T \left\{ \mathbf{I} + c\bfX_{\bfgamma} \left(\bfX_{\bfgamma}^{T} \bfX_{\bfgamma}\right)^{-1} \bfX_{\bfgamma}^T \right\}^{-1}\bfy}{2} \right)^{a + \frac{n}{2}}} ,
\end{multline*}
}
and the posterior distribution is simply
$$
p(\bfgamma \mid \bfy) = \frac{p(\bfy \mid \bfgamma)p(\bfgamma)}{\sum_{\bfgamma'}p(\bfy \mid \bfgamma')p(\bfgamma')} .
$$
\end{itemize}
}
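\begin{frame}[fragile]
\sffamily
\frametitle{Variable selection: evaluating the marginal likelihood}
A sketch of $\log p(\bfy \mid \bfgamma)$, transcribing the formula on the previous slide; the defaults $a = b = 1$ are illustrative choices, not part of the original specification:
\begin{verbatim}
log_marginal <- function(y, X, gamma, a = 1, b = 1,
                         c = length(y)) {
  n <- length(y)
  M <- diag(n)                     # null model: M = I
  if (any(gamma == 1)) {
    Xg <- X[, gamma == 1, drop = FALSE]
    M  <- M + c * Xg %*% solve(crossprod(Xg), t(Xg))
  }
  -0.5 * n * log(2 * pi) -
    0.5 * as.numeric(determinant(M)$modulus) +
    a * log(b) - lgamma(a) + lgamma(a + n / 2) -
    (a + n / 2) * log(b + drop(t(y) %*% solve(M, y)) / 2)
}
\end{verbatim}
\end{frame}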
\frame{
\sffamily
\frametitle{Variable selection}
\begin{itemize}
\item {\bf Example (variable selection in linear models, cont):} Because $\bfgamma$ can only take a finite number of values, getting these posterior probabilities should be straightforward:
\begin{itemize}
\item This is true when $p$ is relatively small (say $p \le 20$). In that case the number of models is manageable (with $p=20$ we have $2^{20} = 1{,}048{,}576$ models), and we can actually compute (and store) the probabilities for each model with the resources available on an ordinary computer.
\item However, when $p$ is large this is no longer feasible (roughly, every 10 additional variables multiplies the number of models by about 1,000, so you can easily run out of atoms in the universe in a typical genetics problem ...)
\end{itemize}
\vspace{1mm}
Hence, even in discrete problems with an ``analytical'' solution, you might want to use simulation to make the problem more manageable.
\end{itemize}
}
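\begin{frame}[fragile]
\sffamily
\frametitle{Variable selection: exact enumeration for small $p$}
For small $p$ we can compute $p(\bfgamma \mid \bfy)$ exactly by enumerating all $2^p$ models. This sketch reuses the \texttt{log\_prior} and \texttt{log\_marginal} functions from the previous slides:
\begin{verbatim}
posterior_enumerate <- function(y, X) {
  p <- ncol(X)
  models <- as.matrix(expand.grid(rep(list(0:1), p)))
  logp <- apply(models, 1, function(g)
    log_marginal(y, X, g) + log_prior(g))
  w <- exp(logp - max(logp))   # stabilize before normalizing
  cbind(models, prob = w / sum(w))
}
\end{verbatim}
\end{frame}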
\frame{
\sffamily
\frametitle{Variable selection: A Metropolis-Hastings algorithm}
\begin{itemize}
\item {\bf Example (variable selection in linear models, cont): } A couple of options for constructing a Metropolis-Hastings algorithm:
\begin{itemize}
\item A simple random walk proposal can be constructed by choosing one entry $\gamma_k$ of $\bfgamma$ uniformly at random and generating a new model by setting
$$
\bfvartheta = (\gamma_1, \ldots, \gamma_{k-1}, 1 - \gamma_k, \gamma_{k+1}, \ldots, \gamma_{p}) .
$$
\item An independence proposal can be constructed by generating a new model uniformly at random from the entire model space, i.e., by letting $\vartheta_k \sim \Ber(1/2)$ independently for each $k$.
\end{itemize}
Since both proposals are symmetric, in both cases the acceptance probability reduces to
$$
\rho(\bfvartheta, \bfgamma) = \min \left\{ 1, \frac{p(\bfvartheta \mid \bfy)}{p(\bfgamma \mid \bfy)} \right\} .
$$
\end{itemize}
}
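\begin{frame}[fragile]
\sffamily
\frametitle{Variable selection: a Metropolis-Hastings sketch}
A minimal R implementation of the single-flip random walk sampler described above (the flip proposal is symmetric, so only the posterior ratio appears in the acceptance step). It reuses \texttt{log\_marginal} and \texttt{log\_prior} from the earlier sketches:
\begin{verbatim}
mh_varsel <- function(y, X, n_iter = 10000) {
  p <- ncol(X)
  gamma <- rbinom(p, 1, 0.5)        # random starting model
  lp <- log_marginal(y, X, gamma) + log_prior(gamma)
  draws <- matrix(NA, n_iter, p)
  for (t in 1:n_iter) {
    prop <- gamma
    k <- sample.int(p, 1)           # pick one entry at random
    prop[k] <- 1 - prop[k]          # ... and flip it
    lp_prop <- log_marginal(y, X, prop) + log_prior(prop)
    if (log(runif(1)) < lp_prop - lp) {  # accept w.p. min{1, ratio}
      gamma <- prop; lp <- lp_prop
    }
    draws[t, ] <- gamma
  }
  colMeans(draws)   # posterior inclusion probabilities
}
\end{verbatim}
\end{frame}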
\frame{
\sffamily
\frametitle{Variable selection}
\begin{itemize}
\item {\bf Example (variable selection in a GLMM):} \cite{gelman2007data} analyzes a dataset on the support that unmarried fathers provide to their children's mothers. The key features of the data and model are:
\begin{itemize}
\item observations of binary outcomes for each mother/father pair of informal support to the mother
\item observations grouped within 20 cities (with random effects to account for correlation)
\item individual-level predictors
\item city-level predictors including the intensity of child support enforcement by the city
\end{itemize}
\vspace{1mm}
\textbf{Scientific question:} does enforcement by the city affect whether support is provided?
\vspace{1mm}
\textbf{Model:} probit regression with variable selection.
\end{itemize}
}