# Bayesian inference {#bi}
This chapter deals with Bayesian inference.
The students are expected to acquire the following knowledge:
- How to set a prior distribution.
- How to compute the posterior distribution.
- How to compute the posterior predictive distribution.
- How to use sampling for inference.
<style>
.fold-btn {
float: right;
margin: 5px 5px 0 0;
}
.fold {
border: 1px solid black;
min-height: 40px;
}
</style>
<script type="text/javascript">
$(document).ready(function() {
$folds = $(".fold");
$folds.wrapInner("<div class=\"fold-blck\">"); // wrap a div container around content
$folds.prepend("<button class=\"fold-btn\">Unfold</button>"); // add a button
$(".fold-blck").toggle(); // fold all blocks
$(".fold-btn").on("click", function() { // add onClick event
$(this).text($(this).text() === "Fold" ? "Unfold" : "Fold"); // toggle the button text between "Fold" and "Unfold"
$(this).next(".fold-blck").toggle("linear"); // "linear" easing; the speed and the animation itself can be customized further
})
});
</script>
```{r, echo = FALSE, warning = FALSE, message = FALSE}
togs <- TRUE
library(ggplot2)
library(dplyr)
library(reshape2)
library(tidyr)
library(MASS)
library(ggforce)
library(mixtools)
library(ade4)
library(ggfortify)
library(numDeriv)
# togs <- FALSE
```
## Conjugate priors
```{exercise, name = "Poisson-gamma model"}
Let us assume a Poisson likelihood and a gamma prior on the Poisson mean parameter (this is a [conjugate prior](https://en.wikipedia.org/wiki/Conjugate_prior)).
a. Derive the posterior distribution.
b. Below we have some data representing the number of goals scored in football matches. Choose a sensible prior for these data (draw the gamma density if necessary) and justify it. Compute the posterior. <span style="color:blue">Compute an interval such that the probability that the true mean lies in it is 95%. What is the probability that the true mean is greater than 2.5?</span>
c. Back to theory: compute the prior predictive and the posterior predictive distributions. Discuss why the posterior predictive is overdispersed rather than Poisson.
d. <span style="color:blue">Draw histograms of the prior predictive and the posterior predictive for the data from (b). Discuss.</span>
e. <span style="color:blue">Generate 10 and 100 samples from a Poisson distribution and compare the posteriors under a flat prior and under a prior concentrated away from the truth.</span>
```
```{r, echo = T, message = FALSE, eval = F, warning=FALSE}
x <- c(3, 2, 1, 1, 5, 4, 0, 0, 4, 3)
```
<div class="fold">
```{solution, echo = togs}
a.
\begin{align*}
p(\lambda | X) &= \frac{p(X | \lambda) p(\lambda)}{\int_0^\infty p(X | \lambda) p(\lambda) d\lambda} \\
&\propto p(X | \lambda) p(\lambda) \\
&= \Big(\prod_{i=1}^n \frac{1}{x_i!} \lambda^{x_i} e^{-\lambda}\Big) \frac{\beta^\alpha}{\Gamma(\alpha)} \lambda^{\alpha - 1} e^{-\beta \lambda} \\
&\propto \lambda^{\sum_{i=1}^n x_i + \alpha - 1} e^{- \lambda (n + \beta)}
\end{align*}
We recognize this as the shape of a gamma distribution, therefore
\begin{align*}
\lambda | X \sim \text{gamma}(\alpha + \sum_{i=1}^n x_i, \beta + n)
\end{align*}
c. For the prior predictive, we have
\begin{align*}
p(x^*) &= \int_0^\infty p(x^*, \lambda) d\lambda \\
&= \int_0^\infty p(x^* | \lambda) p(\lambda) d\lambda \\
&= \int_0^\infty \frac{1}{x^*!} \lambda^{x^*} e^{-\lambda} \frac{\beta^\alpha}{\Gamma(\alpha)} \lambda^{\alpha - 1} e^{-\beta \lambda} d\lambda \\
&= \frac{\beta^\alpha}{\Gamma(x^* + 1)\Gamma(\alpha)} \int_0^\infty \lambda^{x^* + \alpha - 1} e^{-\lambda (1 + \beta)} d\lambda \\
&= \frac{\beta^\alpha}{\Gamma(x^* + 1)\Gamma(\alpha)} \frac{\Gamma(x^* + \alpha)}{(1 + \beta)^{x^* + \alpha}} \int_0^\infty \frac{(1 + \beta)^{x^* + \alpha}}{\Gamma(x^* + \alpha)} \lambda^{x^* + \alpha - 1} e^{-\lambda (1 + \beta)} d\lambda \\
&= \frac{\beta^\alpha}{\Gamma(x^* + 1)\Gamma(\alpha)} \frac{\Gamma(x^* + \alpha)}{(1 + \beta)^{x^* + \alpha}} \\
&= \frac{\Gamma(x^* + \alpha)}{\Gamma(x^* + 1)\Gamma(\alpha)} \Big(\frac{\beta}{1 + \beta}\Big)^\alpha \Big(\frac{1}{1 + \beta}\Big)^{x^*},
\end{align*}
which we recognize as the negative binomial distribution with $r = \alpha$ and $p = \frac{1}{\beta + 1}$. For the posterior predictive, the calculation is the same, only now the parameters are $r = \alpha + \sum_{i=1}^n x_i$ and $p = \frac{1}{\beta + n + 1}$.
There are two sources of uncertainty in the predictive distribution: the uncertainty about the population parameter and the variability of sampling from the population. When $n$ is large, the former becomes very small and the posterior predictive approaches a Poisson distribution. But when $n$ is small, the parameter uncertainty remains substantial, resulting in an overdispersed predictive distribution. A numerical check of this follows below.
```
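
The two-step structure of the posterior predictive also suggests a quick numerical sanity check (a sketch, reusing the data and the gamma(1, 1) prior from the solution below): drawing $\lambda$ from the gamma posterior and then $x^*$ from a Poisson should match direct draws from the derived negative binomial.

```{r, echo = togs, message = FALSE, eval = togs, warning=FALSE}
# Monte Carlo check: gamma posterior draws -> Poisson draws should match
# the negative binomial posterior predictive derived above.
set.seed(1)
x      <- c(3, 2, 1, 1, 5, 4, 0, 0, 4, 3)
palpha <- 1
pbeta  <- 1
lambda <- rgamma(100000, palpha + sum(x), pbeta + length(x))
pp_mc  <- rpois(100000, lambda)
pp_nb  <- rnbinom(100000, size = palpha + sum(x),
                  prob = 1 - 1 / (pbeta + length(x) + 1))
c(mean(pp_mc), mean(pp_nb)) # means agree
c(var(pp_mc), var(pp_nb))   # variances agree and exceed the mean (overdispersion)
```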
```{r, echo = togs, message = FALSE, eval = togs, warning=FALSE}
x <- c(3, 2, 1, 1, 5, 4, 0, 0, 4, 3)

# b
# quick visual check of the prior
ggplot(data = data.frame(x = seq(0, 5, by = 0.01)), aes(x = x)) +
  stat_function(fun = dgamma, args = list(shape = 1, rate = 1))
palpha <- 1
pbeta  <- 1
alpha_post <- palpha + sum(x)
beta_post  <- pbeta + length(x)

# posterior density
ggplot(data = data.frame(x = seq(0, 5, by = 0.01)), aes(x = x)) +
  stat_function(fun = dgamma, args = list(shape = alpha_post, rate = beta_post))

# probability that the true mean is greater than 2.5
1 - pgamma(2.5, alpha_post, beta_post)

# 95% credible interval
qgamma(c(0.025, 0.975), alpha_post, beta_post)

# d
prior_pred <- rnbinom(1000, size = palpha, prob = 1 - 1 / (pbeta + 1))
post_pred  <- rnbinom(1000, size = palpha + sum(x),
                      prob = 1 - 1 / (pbeta + length(x) + 1))
df <- data.frame(prior = prior_pred, posterior = post_pred)
df <- gather(df)
ggplot(df, aes(x = value, fill = key)) +
  geom_histogram(position = "dodge")

# e
set.seed(1)
x1 <- rpois(10, 2.5)
x2 <- rpois(100, 2.5)
alpha_flat <- 1
beta_flat  <- 0.1
alpha_conc <- 50
beta_conc  <- 10
n <- 10000
df_flat <- data.frame(x1 = rgamma(n, alpha_flat + sum(x1), beta_flat + 10),
                      x2 = rgamma(n, alpha_flat + sum(x2), beta_flat + 100),
                      type = "flat")
df_flat <- tidyr::gather(df_flat, key = "key", value = "value", -type)
df_conc <- data.frame(x1 = rgamma(n, alpha_conc + sum(x1), beta_conc + 10),
                      x2 = rgamma(n, alpha_conc + sum(x2), beta_conc + 100),
                      type = "conc")
df_conc <- tidyr::gather(df_conc, key = "key", value = "value", -type)
df <- rbind(df_flat, df_conc)
ggplot(data = df, aes(x = value, color = type)) +
  facet_wrap(~ key) +
  geom_density()
```
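
To quantify what the density plots show, we can also compare the analytic posterior means (a gamma$(\alpha, \beta)$ distribution has mean $\alpha / \beta$); this is a short sketch reusing the variables defined above.

```{r, echo = togs, message = FALSE, eval = togs, warning=FALSE}
# Posterior means under each prior (the true mean is 2.5). With n = 100 the
# data dominate both priors; with n = 10 the concentrated prior still pulls
# the posterior mean towards its prior mean of 50 / 10 = 5.
c(flat_n10  = (alpha_flat + sum(x1)) / (beta_flat + 10),
  conc_n10  = (alpha_conc + sum(x1)) / (beta_conc + 10),
  flat_n100 = (alpha_flat + sum(x2)) / (beta_flat + 100),
  conc_n100 = (alpha_conc + sum(x2)) / (beta_conc + 100))
```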
</div>
## Posterior sampling
```{exercise, name = "Bayesian logistic regression"}
In Chapter 15 we implemented an MLE for logistic regression (see the code below). For this model no conjugate prior exists, which complicates the calculation of the posterior. However, we can sample from the posterior by applying rejection sampling to its unnormalized numerator.
a. <span style="color:blue">Set a sensible prior distribution on $\beta$ and use rejection sampling to find the posterior distribution.</span>
b. <span style="color:blue">In (a) you will get a distribution of the parameter $\beta$. Plot the probabilities (as in exercise \@ref(exr:logisticmle)) for each sample of $\beta$ and compare them to the truth.</span>
Hint: We can use rejection sampling even for functions that are not PDFs -- they do not have to sum/integrate to 1. We just need a suitable envelope that we know how to sample from, for example a suitably scaled uniform distribution. A toy sketch of this idea follows the code below.
```
```{r, echo = T, message = FALSE, eval = F, warning=FALSE}
set.seed(1)
inv_log <- function (z) {
  return (1 / (1 + exp(-z)))
}
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = inv_log(1.2 * x))
l_logistic <- function (beta, X, y) {
  logl <- -sum(y * log(inv_log(as.vector(beta %*% X))) +
                 (1 - y) * log(1 - inv_log(as.vector(beta %*% X))))
  return(logl)
}
my_optim <- optim(par = 0.5, fn = l_logistic, method = "L-BFGS-B",
                  lower = 0, upper = 10, X = x, y = y)
my_optim$par
```
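
Before opening the solution, here is a minimal sketch of the hint on a toy target (the function `f` and the bound `M` below are made up for illustration and are not part of the exercise): we sample candidates from a uniform envelope and accept them proportionally to the unnormalized target.

```{r, echo = T, message = FALSE, eval = F, warning=FALSE}
# Toy illustration of the hint: rejection sampling from an unnormalized
# function using a scaled uniform envelope.
set.seed(1)
f <- function (x) exp(-x^2 / 2) * (2 + sin(5 * x)) # unnormalized target
M <- 3.1 # an upper bound for f on [-4, 4], since f <= 3 everywhere
samps <- c()
while (length(samps) < 1000) {
  cand <- runif(1, -4, 4) # draw a candidate from the uniform envelope
  u    <- runif(1)
  if (u < f(cand) / M) samps <- c(samps, cand) # accept with probability f / M
}
plot(density(samps))
```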
<div class="fold">
```{r, echo = togs, message = FALSE, eval = togs, warning=FALSE}
# Let's say we believe that the mean of beta is 0.5. Since we are not very
# sure about this, we give it a relatively large standard deviation: a normal
# prior with mean 0.5 and standard deviation 5. There is no single correct
# choice here; the prior expresses our belief about the parameter values.
set.seed(1)
inv_log <- function (z) {
  return (1 / (1 + exp(-z)))
}
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = inv_log(1.2 * x))
l_logistic <- function (beta, X, y) {
  logl <- -sum(y * log(inv_log(as.vector(beta %*% X))) +
                 (1 - y) * log(1 - inv_log(as.vector(beta %*% X))))
  if (is.nan(logl)) logl <- Inf
  return(logl)
}
my_optim <- optim(par = 0.5, fn = l_logistic, method = "L-BFGS-B",
                  lower = 0, upper = 10, X = x, y = y)
my_optim$par

# the likelihood (not the log-likelihood) -- together with the prior this is
# the numerator of the posterior
f_logistic <- function (beta, X, y) {
  lik <- prod(inv_log(as.vector(beta %*% X))^y *
                (1 - inv_log(as.vector(beta %*% X)))^(1 - y))
  return(lik)
}

# evaluate the unnormalized posterior on a grid
a <- seq(0, 3, by = 0.01)
my_l <- c()
for (i in a) {
  my_l <- c(my_l, f_logistic(i, x, y) * dnorm(i, 0.5, 5))
}
plot(a, my_l, type = "l")

# a scaled uniform envelope; the scaling factor was found by trial and error
envlp <- 10^(-25.8) * dunif(a, -5, 5)
tmp <- data.frame(envel = envlp, l = my_l, t = a)
tmp <- gather(tmp, key = "key", value = "value", -t)
ggplot(tmp, aes(x = t, y = value, color = key)) +
  geom_line() # envelope OK

# rejection sampling from the unnormalized posterior
set.seed(1)
nsamps <- 1000
samps  <- c()
for (i in 1:nsamps) {
  tmp <- runif(1, -5, 5)
  u   <- runif(1, 0, 1)
  if (u < (f_logistic(tmp, x, y) * dnorm(tmp, 0.5, 5)) /
      (10^(-25.8) * dunif(tmp, -5, 5))) {
    samps <- c(samps, tmp)
  }
}
plot(density(samps))
mean(samps)
median(samps)

# b: plot the predicted probabilities for each posterior sample of beta
truth_p <- data.frame(x = x, prob = inv_log(1.2 * x), type = "truth")
preds <- inv_log(x %*% t(samps))
preds <- gather(cbind(as.data.frame(preds), x = x), key = "key",
                value = "value", -x)
ggplot(preds, aes(x = x, y = value)) +
  geom_line(aes(group = key), color = "gray", alpha = 0.7) +
  geom_point(data = truth_p, aes(y = prob), color = "red", alpha = 0.7) +
  theme_bw()
```
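
As a brief follow-up (a sketch reusing `samps` and `nsamps` from above), we can summarize the posterior draws against the true value $\beta = 1.2$ and check how efficient the sampler was:

```{r, echo = togs, message = FALSE, eval = togs, warning=FALSE}
quantile(samps, c(0.025, 0.975)) # 95% credible interval for beta
length(samps) / nsamps           # acceptance rate of the rejection sampler
```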
</div>