From ce77ed62176c0cd48f7f8445131dce1dee10cdee Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Facundo=20Mu=C3=B1oz?= Date: Sun, 23 Jun 2024 12:16:04 +0200 Subject: [PATCH] Remove old version copies of vignettes. --- vignettes/p10.Rmd | 291 --------- vignettes/p8.Rmd | 1451 --------------------------------------------- vignettes/p9.Rmd | 464 --------------- 3 files changed, 2206 deletions(-) delete mode 100644 vignettes/p10.Rmd delete mode 100644 vignettes/p8.Rmd delete mode 100644 vignettes/p9.Rmd diff --git a/vignettes/p10.Rmd b/vignettes/p10.Rmd deleted file mode 100644 index 0b9cdec..0000000 --- a/vignettes/p10.Rmd +++ /dev/null @@ -1,291 +0,0 @@ ---- -title: 'Practical 7: Bayesian Hierarchical Modelling' -author: "VIBASS" -output: - html_vignette: - fig_caption: yes - number_sections: yes - toc: yes - fig_width: 6 - fig_height: 4 -vignette: > - %\VignetteIndexEntry{Practical 7: Bayesian Hierarchical Modelling} - %\VignetteEngine{knitr::rmarkdown} - %\VignetteEncoding{UTF-8} ---- - -# Bayesian Hierarchical Modelling - -In this last practical we will consider the analysis of Bayesian hierarchical -models. As explained in the previous lecture, hierarchical models provide a -convenient tool to define models so that the different sources of variation in -the data are clearly identified. Bayesian inference for highly structured -hierarchical models can be difficult and may require the use of Markov chain -Monte Carlo methods. However, packages such as `BayesX` and `INLA`, to mention -just two, provide a very convenient way to fit and make inference about certain -types of Bayesian hierarchical models. - -Regarding software, we will use `INLA` to fit the models in the examples from -this point on. The reason is that it is a popular software for Bayesian -inference and very fast. - -# Linear Mixed Models - -Linear mixed models were defined in the lecture as follows: - -$$ -Y_{ij} = X_{ij}\beta +\phi_i+\epsilon_{ij} -$$ - -Here, Y_{ij} represents observation $j$ in group $i$, X_{ij} are a vector of -covariates with coefficients $\beta$, $\phi_i$ i.i.d. random effects and -$\epsilon_{ij}$ a Gaussian error term. The distribution of the random effects -$\phi_i$ is Gaussian with zero mean and precision $\tau_{\phi}$. - - -# Multilevel Modelling - -Multilevel models are a particular type of mixed-effects models in which -observations are nested within groups, so that group effects are modelled using -random effects. A typical example is that of students nested within classes. - -For the next example, the `nlschools` data set (in package `MASS`) will be used. -This data set records data about students' performance (in particular, about a -language score test) and other variables. The variables in this data set are: - -* `lang`, language score test. - -* `IQ`, verbal IQ. - -* `class`, class ID. - -* `GS`, class size as number of eighth-grade pupils recorded in the class. - -* `SES`, social-economic status of pupil’s family. - -* `COMB`, whether the pupils are taught in the multi-grade class with 7th-grade students. - -The data set can be loaded and summarised as follows: - -```{r} -library("MASS") -data("nlschools") -summary(nlschools) -``` - -The model to fit will take `lang` as the response variable and include -`IQ`, `GS`, `SES` and `COMB` as covariates (i.e., fixed effects). 
This model -can easily be fit with `INLA` as follows: - -```{r message = FALSE, warning = FALSE} -library("INLA") -m1 <- inla(lang ~ IQ + GS + SES + COMB, data = nlschools) - -summary(m1) -``` - -Note that the previous model only includes fixed effects. The data set includes -`class` as the class ID to which each student belongs. Class effects can have an -impact on the performance of the students, with students in the same class -performing similarly in the language test. - -Very conveniently, `INLA` can include random effects in the model by adding a -term in the right hand side of the formula that defined the model. Specifically, -the term to add is `f(class, model = "iid")` (see code below for the -full model). -This will create a random effect indexed over variable `class` and which is of -type `iid`, i.e., the random effects are independent and identically distributed -using a normal distribution with zero mean and precision $\tau$. - -Before fitting the model, the between-class variability can be explored by -means of boxplots: - -```{r fig = TRUE, fig.width = 15, fig.height = 5} -boxplot(lang ~ class, data = nlschools, las = 2) -``` - -The code to fit the model with random effects is: - -```{r} -m2 <- inla( - lang ~ IQ + GS + SES + COMB + f(class, model = "iid"), - data = nlschools -) - -summary(m2) -``` - - -# Generalised Linear Mixed Models - -Mixed effects models can also be considered within the context of generalised -linear models. In this case, the linear predictor of observation $i$, $\eta_i$, -can be defined as - -$$ -\eta_i = X_{ij}\beta +\phi_i -$$ - -Compared to the previous setting of linear mixed effects models, note that now -the distribution of the response could be other than Gaussian and that -observations are not necessarily nested within groups. - -# Poisson regression - - -In this practical we will use the North Carolina Sudden Infant Death Syndrome -(SIDS) data set. It is available in the `spData` package and it can be loaded -using: - -```{r message = FALSE, warning = FALSE} -library(spData) -data(nc.sids) -summary(nc.sids) -``` - -A full description of the data set is provided in the associated manual page -(check with `?nc.sids`) but in this practical we will only consider these -variables: - -* `BIR74`, number of births (1974-78). - -* `SID74`, number of SID deaths (1974-78). - -* `NWBIR74`, number of non-white births (1974-78). - -These variables are measured at the county level in North Carolina, of which -there are 100. - -Because `SID74` records the number of SID deaths, the model is Poisson: - -$$ -O_i \mid \mu_i \sim Po(\mu_i),\ i=1,\ldots, 100 -$$ -Here, $O_i$ represents the number of cases in county $i$ and $\mu_i$ the mean. -In addition, mean $\mu_i$ will be written as $\mu_i = E_i \theta_i$, where $E_i$ -is the *expected* number of cases and $\theta_i$ the relative risk in county $i$. - -The relative risk $\theta_i$ is often modelled, on the log-scale, to be equal -to a linear predictor: - -$$ -\log(\theta_i) = \beta_0 + \ldots -$$ - -The expected number of cases is computed by multiplying the number of births in -county $i$ to the overall mortality rate - -$$ -r = \frac{\sum_{i=1}^{100}O_i}{\sum_{i=1}^{100}B_i} -$$ -where $B_i$ represents the total number of births in country $i$. Hence, -the expected number of cases in county $i$ is $E_i = r B_i$. 
- - -```{r} -# Overall mortality rate -r74 <- sum(nc.sids$SID74) / sum(nc.sids$BIR74) -# Expected cases -nc.sids$EXP74 <- r74 * nc.sids$BIR74 -``` - -A common measure of relative risk is the *standardised mortality ratio* -($O_i / E_i$): - -```{r} -nc.sids$SMR74 <- nc.sids$SID74 / nc.sids$EXP74 -``` - -A summary of the SMR can be obtained: - -```{r fig = TRUE} -hist(nc.sids$SMR, xlab = "SMR") -``` - -Values above 1 indicate that the county has more observed deaths than expected -and that there might be an increased risk in the area. - - -As a covariate, we will compute the proportion of non-white births: - -```{r} -nc.sids$NWPROP74 <- nc.sids$NWBIR74/ nc.sids$BIR74 -``` - -There is a clear relationship between the SMR and the proportion of non-white -births in a county: - -```{r fig = TRUE} -plot(nc.sids$NWPROP74, nc.sids$SMR74) - -# Correlation -cor(nc.sids$NWPROP74, nc.sids$SMR74) -``` - - -A simple Poisson regression can be fit by using the following code: - -```{r} -m1nc <- inla( - SID74 ~ 1 + NWPROP74, - family = "poisson", - E = nc.sids$EXP74, - data = nc.sids -) -summary(m1nc) -``` - -Random effects can also be included to account for intrinsic differences between -the counties: - -```{r} -# Index for random effects -nc.sids$ID <- 1:nrow(nc.sids) - -# Model WITH covariate -m2nc <- inla( - SID74 ~ 1 + NWPROP74 + f(ID, model = "iid"), - family = "poisson", - E = nc.sids$EXP74, - data = as.data.frame(nc.sids) -) - -summary(m2nc) -``` - -The role of the covariate can be explored by fitting a model without it: - - -```{r} -# Model WITHOUT covariate -m3nc <- inla( - SID74 ~ 1 + f(ID, model = "iid"), - family = "poisson", - E = nc.sids$EXP74, - data = as.data.frame(nc.sids) -) - -summary(m3nc) -``` - -Now, notice the decrease in the estimate of the precision of the random effects -(i.e., the variance increases). This means that values of the random effects -are now larger than in the previous case as the random effects pick some of the -effect explained by the covariate. - - -```{r fig = TRUE} -par(mfrow = c(1, 2)) -boxplot(m2nc$summary.random$ID$mean, ylim = c(-1, 1), main = "With NWPROP74") -boxplot(m3nc$summary.random$ID$mean, ylim = c(-1, 1), main = "Without NWPROP74") -``` - -# Further Extensions - -Spatial random effects can be defined not to be independent and identically -distributed. Instead, spatial or temporal correlation can be considered when -defining them. For example, in the North Carolina SIDS data set, it is common -to consider that two counties that are neighbours (i.e., share a boundary) -will have similar relative risks. This can be taken into account in the model -but assuming that the random effects are spatially autocorrelated. This is out -of the scope of this introductory course but feel free to ask about this!! diff --git a/vignettes/p8.Rmd b/vignettes/p8.Rmd deleted file mode 100644 index 3ed85cf..0000000 --- a/vignettes/p8.Rmd +++ /dev/null @@ -1,1451 +0,0 @@ ---- -title: 'Practical 5: Numerical approaches' -author: "VIBASS" -output: - html_vignette: - fig_caption: yes - number_sections: yes - toc: yes - fig_width: 6 - fig_height: 4 -vignette: > - %\VignetteIndexEntry{Practical 5: Numerical approaches} - %\VignetteEngine{knitr::rmarkdown} - %\VignetteEncoding{UTF-8} ---- - - -# Introduction - -In previous practicals you have used Bayesian models with conjugate -priors where the posterior distribution can be easily worked out. In -general, this is seldom the case and other approaches need to be -considered. 
In particular, Importance Sampling and Markov Chain Monte -Carlo (MCMC) methods can be used to draw samples from the -posterior distribution that are in turn used to obtain estimates of the -posterior mean and variance and other quantities of interest. - -# Importance Sampling - -As described in the previous lecture, Importance Sampling (IS) is an -algorithm to estimate some quantities of interest of a target -distribution by sampling from a different (proposal) distribution and -reweighting the samples using importance weights. In Bayesian inference, -IS can be used to sample from the posterior distribution when the -normalizing constant is not known because -$$ -\pi(\theta \mid \mathbf{y}) \propto L(\theta \mid \mathbf{y}) \pi(\theta), -$$ -where $\mathbf{y}$ represents the observed data, -$L(\theta \mid \mathbf{y})$ the likelihood function and -$\pi(\theta)$ the prior distribution on $\theta$. - -If $g(\cdot)$ is a proposal distribution, and $\{\theta^{(m)}\}_{m=1}^M$ -are $M$ samples from that distribution, then the importance weights are -$$ -w(\theta^{(m)}) = \frac{L(\theta^{(m)} \mid \mathbf{y})\,\pi(\theta^{(m)})} - {g(\theta^{(m)})} . -$$ -\noindent -When the normalizing constant in the posterior distribution is not -known, the importance weights are rescaled to sum to one. Note that this -rescaling is done by the denominator in the expression at point 2 on -slide 9/22 of the Numerical Approaches slides you have just seen. In practice, -rescaling removes the need for the denominator and simplifies the -calculations throughout (we do it once, rather than every time). - -Hence, the posterior mean can be computed as -$$ -E\left(\theta \mid \mathbf{y}\right) = \mu = \int \theta\, \pi(\theta \mid \mathbf{y}) \mathop{d\theta} - \simeq \sum_{m=1}^M \theta^{(m)}\, w(\theta^{(m)}) = \hat{\mu}. -$$ -\noindent -Similarly, the posterior variance can be computed as -$$ -\mbox{Var}\left(\theta \mid \mathbf{y}\right) = -\sigma^2 = \int (\theta - \mu)^2\, \pi(\theta \mid \mathbf{y}) \mathop{d\theta} - \simeq \sum_{m=1}^M (\theta^{(m)})^2\, w(\theta^{(m)}) - \hat{\mu}^2 . -$$ - - - -# The Metropolis-Hastings Algorithm - -The Metropolis-Hastings (M-H) algorithm is a popular MCMC method to -obtain samples from the posterior distribution of an ensemble of -parameters. -In the examples below we will only consider models with one parameter, -but the M-H algorithm can be used on models with a large number of -parameters. - -The M-H algorithm works in a very simple way. At every step of the -algorithm a new movement is proposed using a *proposal distribution*. -This movement is accepted with a known probability, which implies that -the movement can be rejected so that the algorithm stays at the same -state in the current iteration. - -Hence, in order to code the M-H algorithm for a set of parameters -$\theta$ we need to define: - -* A function to draw observations from the proposal distribution, given -its current state. This will be denoted by $q(\cdot\mid\cdot)$, so that -the density of a new proposal $\theta^*$ given a current state -$\theta^{(m)}$ is given by $q(\theta^*\mid\theta^{(m)})$. - -From the Bayesian model, we already know: - -* A prior distribution on the parameters of interest, i.e., -$\pi(\theta)$. - -* The likelihood of the parameter $\theta$ given the observed data -$\mathbf{y}$, i.e., $L(\theta\mid\mathbf{y})$. 
- -At step $m$, a new value is drawn from $q(\cdot\mid\theta^{(m)})$ and it -is accepted with probability: - -$$ -\alpha = \min\left\{1, -\frac{L(\theta^*\mid\mathbf{y})\,\pi(\theta^{*})\,q(\theta^{(m)} -\mid\theta^{*})} -{L(\theta^{(m)}\mid\mathbf{y})\,\pi(\theta^{(m)})\,q(\theta^{*} -\mid\theta^{(m)})}\right\} -$$ - - -If the value is accepted, then the current state is set to the proposed -value, i.e., $\theta^{(m+1)} = \theta^{*}$. -Otherwise we keep the previous value, so -$\theta^{(m+1)} = \theta^{(m)}$. - -# Example: Binomial-Beta Model - -The first example that will be considered is based on the data set -collected on the number of red M&M's in tube $i$ with $n_i$ M&M's. -The number of red M&M's obtained depends on the total number of M&M's -extracted and the actual proportion of red ones $\theta$. -Given the proportion $\theta$, the number of red M&M's obtained -follows a Binomial distribution. Because $\theta$ takes values between -0 and 1, a sensible choice for a prior is the Beta distribution. - -The model can be stated as follows: -$$ -\begin{array}{rcl} -Y_i \mid \theta & \sim & \text{Bi}(n_i,\, \theta)\\ -\theta & \sim & \text{Be}(\alpha,\, \beta) -\end{array} -$$ -to allow for multiple tubes. - -In particular, we will consider a vague uniform prior in the $[0,1]$ interval, which corresponds to $\alpha=\beta=1$. - -You can use the following data set for this exercise: - -```{r eval = TRUE} -mmdata <- data.frame(MMs = c(20, 22, 24), red = c(5, 8, 9)) -``` - -These data reproduce different counts of red M&M's in three different -tubes. Variable `MMs` records the total number of M&M's $n_i$ in tube $i$ -and `red` the actual number of red M&M's ($y_i$ in the model). - -## Importance sampling - -Although the posterior distribution is known in closed form, IS can -be used to estimate the posterior mean and variance. Given that -the parameter $\theta$ is bounded, a uniform distribution in the interval $[0,1]$ will -be used. This is probably not very efficient (as it is likely not -to be close to the actual posterior) but it will provide a straightforward simulation strategy. - -```{r} -n_simulations <- 10000 -set.seed(12) -theta_sim <- runif(n_simulations) -``` - -Next, importance weights are computed in two steps. First, the -ratio between the likelihood times the prior and the density of the -proposal distribution is computed. -Secondly, weights are re-scaled to sum to one. - -```{r} -# Log-Likelihood (for each value of theta_sim) -loglik_binom <- sapply(theta_sim, function(theta) { - sum(dbinom(mmdata$red, mmdata$MMs, theta, log = TRUE)) -}) - -# Log-weights: log-lik + log-prior - log-proposal_distribution -log_ww <- loglik_binom + dbeta(theta_sim, 1, 1, log = TRUE) - log(1) - -# Re-scale weights to sum up to one -log_ww <- log_ww - max(log_ww) -ww <- exp(log_ww) -ww <- ww / sum(ww) - -``` - - -The importance weights can be summarized using a histogram (see below). -The distribution of weights shows that most samples are far from the -regions of high posterior density. - - -```{r fig = TRUE} -hist(ww, xlab = "Importance weights") -``` - -The posterior mean and variance can be computed as follows: - -```{r} -# Posterior mean -post_mean <- sum(theta_sim * ww) -post_mean - -# Posterior variance -post_var <- sum(theta_sim^2 * ww)- post_mean^2 -post_var -``` - - -Finally, an estimate of the posterior density -$\pi(\theta \mid \mathbf{y})$ -of the parameter can be obtained by using *weighted* kernel density -estimation. 
- -\hline -**Aside: weighted kernel density estimation** Standard kernel -density estimation is a way of producing a non-parametric estimate of -the distribution of a continuous quantity given a sample. A *kernel* -function is selected (typically a Normal density), and one of these is -placed centred on each sample point. The sum of these functions produces -the kernel density estimate (after scaling - dividing by the number of -sample points). A *weighted* kernel density estimate simply includes -weights in the sum of the kernel functions. In both weighted and -unweighted forms of kernel density estimation, the key parameter -controlling the smoothness of the resulting density estimate is the -*bandwidth* (equivalent to the standard deviation if using a Normal -kernel function); larger values give smoother density estimates, and -smaller values give *noisier* densities. -\hline - -```{r fig = TRUE} -plot(density(theta_sim, weights = ww, bw = 0.01), - main = "Posterior density") -``` - -Note that the value of the bandwidth used (argument `bw`) has been -set manually to provide a sensible, smooth curve. - -Similarly, a sample from the posterior distribution can be obtained by -resampling the original values of theta with their corresponding weights. - -```{r} -post_theta_sim <- sample(theta_sim, prob = ww, replace = TRUE) -hist(post_theta_sim, freq = FALSE) -``` - - -## Metropolis-Hastings - -As stated above, the implementation of the M-H algorithm requires a -proposal distribution to obtain new values of the parameter $\theta$. -Usually, the proposal distribution is defined so that the proposed -movement depends on the current value. -However, in this case we will use a uniform -distribution between 0 and 1 as our proposal distribution. - -First of all, we will define the proposal distribution, prior -and likelihood of the model: - -```{r} -# Proposal distribution: sampling -rq <- function() { - runif(1) -} - -# Proposal distribution: log-density -logdq <- function(x) { - dunif(x, log = TRUE) -} - -# Prior distribution: Beta(1, 1), log-density -logprior <- function(theta) { - dbeta(theta, 1, 1, log = TRUE) -} - -# Log-Likelihood -loglik <- function(y, theta, N) { - res <- sum(dbinom(y, N, theta, log = TRUE)) -} -``` - -Note that all densities and the likelihood are computed on the log-scale. - -Next, an implementation of the M-H algorithm is as follows: - -```{r} -# Number of iterations -n.iter <- 40500 - -# Simulations of the parameter -theta <- rep(NA, n.iter) - -# Initial value -theta[1] <- 0.5 - -# Data -y <- mmdata$red -N <- mmdata$MMs - -for(i in 2:n.iter) { - new.theta <- rq() - - # Log-Acceptance probability - logacc.prob <- loglik(y, new.theta, N) + logprior(new.theta) + logdq(theta[i - 1]) - logacc.prob <- logacc.prob - loglik(y, theta[i - 1], N) - logprior(theta[i - 1]) - - logdq(new.theta) - logacc.prob <- min(0, logacc.prob) # Note that 0 = log(1) - - if(log(runif(1)) < logacc.prob) { - # Accept - theta[i] <- new.theta - } else { - # Reject - theta[i] <- theta[i - 1] - } -} -``` - -The simulations we have generated are not independent of one another; each -is dependent on the previous one. This has two consequences: the chain is -dependent on the initial, starting value of the parameter(s); and the sampling -chain itself will exhibit *autocorrelation*. - -For this reason, we will remove the first 500 iterations to reduce the -dependence of the sampling on the starting value; and we will keep only every -10^th^ simulation to reduce the autocorrelation in the sampled series. 
-The 500 iterations we discard are known as the **burn-in** sample, and -the process of keeping only every 10^th^ value is called **thinning**. - -After that, we will compute summary statistics -and display a density of the simulations. - -```{r} -# Remove burn-in -theta <- theta[-c(1:500)] - -# Thinning -theta <- theta[seq(1, length(theta), by = 10)] - -# Summary statistics -summary(theta) - -par(mfrow = c(1, 2)) -plot(theta, type = "l", main = "MCMC samples", ylab = expression(theta)) -plot(density(theta), main = "Posterior density", xlab = expression(theta)) -``` - -## Exercises - - -### Performance of the proposal distribution - -The proposal distribution plays a crucial role in IS and it should be -as close to the posterior as possible. As a way of measuring how good -a proposal distribution is, it is possible to compute the -*effective* sample size as follows: - -$$ -\text{ESS} = \frac{(\sum_{m=1}^M w(\theta^{(m)}))^2}{\sum_{m=1}^M w(\theta^{(m)})^2}. -$$ - -* Compute the effective sample size for the previous example. -How is this related to the number of IS samples (`n_simulations`)? - -
Solution

```{r}
# Effective sample size computed from the importance weights
ESS <- function(ww){
  (sum(ww)^2) / sum(ww^2)
}
ESS(ww)
n_simulations
```

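As a small aside, because the weights `ww` have already been rescaled to sum
to one, the numerator of the ESS is 1 and the expression simplifies; the
following line should match `ESS(ww)` above:

```{r}
# With normalised weights, the ESS reduces to 1 / sum(ww^2)
1 / sum(ww^2)
```
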
- - -* Use a different proposal distribution and check how sampling weights, -ESS and point estimates differ from those in the current example. For -example, a $\text{Be}(20,\, 10)$ will put more mass on high values of -$\theta$, unlike the actual posterior distribution. -What differences do you find with -the example presented here using a uniform proposal distribution? Why -do you think that these differences appear? - - -
Solution - - ```{r} - n_simulations <- 10000 - set.seed(12) - theta_sim <- rbeta(n_simulations, 20, 10) - loglik_binom <- sapply(theta_sim, function(theta) { - sum(dbinom(mmdata$red, mmdata$MMs, theta, log = TRUE)) - }) - log_ww <- loglik_binom + dbeta(theta_sim, 1, 1, log = TRUE) - dbeta(theta_sim, 20, 10, log = TRUE) - log_ww <- log_ww - max(log_ww) - ww <- exp(log_ww) - ww <- ww / sum(ww) - ``` - - ```{r fig = TRUE} - hist(ww, xlab = "Importance weights") - ``` - - ```{r} - (post_mean <- sum(theta_sim * ww)) - (post_var <- sum(theta_sim^2 * ww)- post_mean^2) - ``` - - ```{r fig = TRUE} - plot(density(theta_sim, weights = ww, bw = 0.025), main = "Posterior density") - ``` - - ```{r} - ESS(ww) - n_simulations - ``` - -
- - - -### Sampling from the actual posterior - -Use the posterior distribution (which for this particular case is known -in a closed form) as the proposal distribution. - -* What is the distribution of the importance weights now? - -
Solution - - ```{r} - n_simulations <- 10000 - set.seed(12) - theta_sim <- rbeta(n_simulations, sum(mmdata$red) + 1, sum(mmdata$MMs) -sum(mmdata$red) + 1) - loglik_binom <- sapply(theta_sim, function(theta) { - sum(dbinom(mmdata$red, mmdata$MMs, theta, log = TRUE)) - }) - log_ww <- loglik_binom + dbeta(theta_sim, 1, 1, log = TRUE) - dbeta(theta_sim, sum(mmdata$red) + 1, sum(mmdata$MMs) -sum(mmdata$red) + 1, log = TRUE) - log_ww <- log_ww - max(log_ww) - ww <- exp(log_ww) - ww <- ww / sum(ww) - ``` - - ```{r fig = TRUE} - hist(ww, xlab = "Importance weights") - ``` - - ```{r} - (post_mean <- sum(theta_sim * ww)) - (post_var <- sum(theta_sim^2 * ww)- post_mean^2) - ``` - - ```{r fig = TRUE} - plot(density(theta_sim, weights = ww, bw = 0.01), main = "Posterior density") - ``` - -
- - -* Compute the effective sample size. How large is it? Why do you think this happens? - -
Solution - - ```{r} - ESS(ww) - n_simulations - ``` - -
- - -### Non-conjugate prior - -IS and M-H are algorithms that can be used to make inference about -$\theta$ when the posterior density of the parameter is not available -in closed form. This is the case with models which have non-conjugate -priors. As an exercise, try to obtain the posterior density of the same -model with the following non-conjugate prior: - -$$ -\pi(\theta) \propto (1-\theta)^2,\quad \theta\in[0,1]. -$$ - -This prior has the following shape: - -```{r} -curve((1-x)^2, xlab = expression(theta), ylab = NA, yaxt = 'n', bty = 'n') -``` - -The interpretation of this prior is that smaller values are favoured over -higher values, i.e., our prior information is that the parameter is more -likely to have values close to 0 than values close to 1. - -Note that the prior is specified up to a normalising constant (which in -this case is not difficult to compute). However, this constant is not -needed to implement both the IS and M-H algorithms. In the case of IS, -the constant will cancel when the weights are re-scaled to sum to one -and in the case of the M-H algorithm the constant will cancel when the -acceptance ratio is computed. - -* Use the IS algorithm to compute and plot the weights using this -non-conjugate distribution, using them to estimate the posterior mean -and variance. Finally, show the posterior density estimate. - -
Solution - - ```{r} - n_simulations <- 10000 - set.seed(12) - theta_sim <- rbeta(n_simulations,20,10) - loglik_binom <- sapply(theta_sim, function(theta) { - sum(dbinom(mmdata$red, mmdata$MMs, theta, log = TRUE)) - }) - log_ww <- loglik_binom + 2*log(1-theta_sim) - dbeta(theta_sim, 20, 10, log = TRUE) - log_ww <- log_ww - max(log_ww) - ww <- exp(log_ww) - ww <- ww / sum(ww) - ``` - - ```{r fig = TRUE} - hist(ww, xlab = "Importance weights") - ``` - - ```{r} - (post_mean <- sum(theta_sim * ww)) - (post_var <- sum(theta_sim^2 * ww)- post_mean^2) - ``` - - ```{r fig = TRUE} - plot(density(theta_sim, weights = ww, bw = 0.025), main = "Posterior density") - ``` - -
- -* Use the M-H algorithm to produce the posterior density estimate. - -
Solution - - ```{r} - # Proposal distribution: sampling - rq <- function() { - runif(1) - } - - # Proposal distribution: log-density - logdq <- function(x) { - dunif(x, log = TRUE) - } - - # Prior distribution: (1-theta)^2, log-density - logprior <- function(theta) { - 2*log(1-theta) - } - - # Log-Likelihood - loglik <- function(y, theta, N) { - res <- sum(dbinom(y, N, theta, log = TRUE)) - } - ``` - - ```{r} - # Number of iterations - n.iter <- 40500 - - # Simulations of the parameter - theta <- rep(NA, n.iter) - - # Initial value - theta[1] <- 0.5 - - # Data - y <- mmdata$red - N <- mmdata$MMs - - for(i in 2:n.iter) { - new.theta <- rq() - - # Log-Acceptance probability - logacc.prob <- loglik(y, new.theta, N) + logprior(new.theta) + logdq(theta[i - 1]) - logacc.prob <- logacc.prob - loglik(y, theta[i - 1], N) - logprior(theta[i - 1]) - - logdq(new.theta) - logacc.prob <- min(0, logacc.prob) # Note that 0 = log(1) - - if(log(runif(1)) < logacc.prob) { - # Accept - theta[i] <- new.theta - } else { - # Reject - theta[i] <- theta[i - 1] - } - } - ``` - - ```{r} - # Remove burn-in - theta <- theta[-c(1:500)] - - # Thinning - theta <- theta[seq(1, length(theta), by = 10)] - - # Summary statistics - summary(theta) - ``` - - ```{r fig = TRUE} - par(mfrow = c(1, 2)) - plot(theta, type = "l", main = "MCMC samples", ylab = expression(theta)) - plot(density(theta), main = "Posterior density", xlab = expression(theta)) - ``` - -
- -# Example: Poisson-Gamma Model - -The second example will be based on the *Game of Thrones* data set. -Remember that this is made of the observed number of u's on -a page of a book of Game of Thrones. The model can be stated as: - -$$ -\begin{array}{rcl} -y_i \mid \lambda & \sim & \text{Po}(\lambda)\\ -\lambda & \sim & \text{Ga}(\alpha,\, \beta) -\end{array} -$$ - -In particular, the prior on $\lambda$ will be a Gamma distribution with -parameters $0.01$ and $0.01$, which is centred at 1 and has a small -precision (i.e., large variance). - -We will denote the observed values by `y` in the `R` code. The data collected can be loaded with: - -```{r eval = TRUE} -GoTdata <- data.frame(Us = c(25, 29, 27, 27, 25, 27, 22, 26, 27, 29, 23, - 28, 25, 24, 22, 25, 23, 29, 23, 28, 21, 29, - 28, 23, 28)) -y <- GoTdata$Us -``` - - -## Importance sampling - -Now the parameter of interest is not bounded, so the proposal -distribution needs to be chosen with care. We will use a log-Normal -distribution with mean 3 and standard deviation equal to 0.5 in the log -scale. This will ensure that all the sampled values are positive (because -$\lambda$ cannot take negative values) and that the sample values are -reasonable (i.e., they are not too small or too large). Note that this -proposal distribution has been chosen having in mind the problem at hand -and that it may not work well with other problems. - - -```{r} -n_simulations <- 10000 -set.seed(1) -lambda_sim <- rlnorm(n_simulations, 3, 0.5) -``` - -Next, importance weights are computed in two steps, as in the previous -example. Note that now the likelihood, prior and proposal distribution -are different from the ones in the Binomial example. - -```{r} -# Log-Likelihood (for each value of lambda_sim) -loglik_pois <- sapply(lambda_sim, function(LAMBDA) { - sum(dpois(GoTdata$Us, LAMBDA, log = TRUE)) -}) - -# Log-weights: log-lik + log-prior - log-proposal_distribution -log_ww <- loglik_pois + dgamma(lambda_sim, 0.01, 0.01, log = TRUE) - dlnorm(lambda_sim, 3, 0.5, log = TRUE) - -# Re-scale weights to sum up to one -log_ww <- log_ww - max(log_ww) -ww <- exp(log_ww) -ww <- ww / sum(ww) - -``` - - -The importance weights can be summarized using a histogram: - - -```{r fig = TRUE} -hist(ww, xlab = "Importance weights") -``` - -The posterior mean and variance can be computed as follows: - -```{r} -# Posterior mean -(post_mean <- sum(lambda_sim * ww)) - -# Posterior variance -(post_var <- sum(lambda_sim^2 * ww)- post_mean^2) -``` - - -Finally, an estimate of the posterior density of the parameter can be -obtained by using weighted kernel density estimation: - - -```{r fig = TRUE} -plot(density(lambda_sim, weights = ww, bw = 0.5) , main = "Posterior density", xlim = c(10,40)) -``` - -Note that the value of the bandwidth used (argument `bw`) has been set -manually to provide a realistically smooth density function. - -Similarly, a sample from the posterior distribution can be obtained by -resampling the original values of $\theta$ with their corresponding -weights. - -```{r} -post_lambda_sim <- sample(lambda_sim, prob = ww, replace = TRUE) -hist(post_lambda_sim, freq = FALSE) -``` - - - - - -## Metropolis-Hastings - -Similarly to the previous example, we will set the proposal -distribution, as the model has been fully defined above. -In this case, the proposal distribution is a log-Normal distribution -centred at the logarithm of the current value with precision $100$. 
- - -```{r} -# Proposal distribution: sampling -rq <- function(lambda) { - rlnorm(1, meanlog = log(lambda), sdlog = sqrt(1 / 100)) -} - -# Proposal distribution: log-density -logdq <- function(new.lambda, lambda) { - dlnorm(new.lambda, meanlog = log(lambda), sdlog = sqrt(1 / 100), log = TRUE) -} - -# Prior distribution: Ga(0.01, 0.01) -logprior <- function(lambda) { - dgamma(lambda, 0.01, 0.01, log = TRUE) -} - -# LogLikelihood -loglik <- function(y, lambda) { - res <- sum(dpois(y, lambda, log = TRUE)) -} -``` - - -With these definitions we can actually use the same implementation of the -M-H that we have used in the previous section. - -```{r} -# Number of iterations -n.iter <- 40500 - -# Simulations of the parameter -lambda <- rep(NA, n.iter) - -# Initial value -lambda[1] <- 30 - -for(i in 2:n.iter) { - new.lambda <- rq(lambda[i - 1]) - - # Log-Acceptance probability - logacc.prob <- loglik(y, new.lambda) + logprior(new.lambda) + - logdq(lambda[i - 1], new.lambda) - logacc.prob <- logacc.prob - loglik(y, lambda[i - 1]) - logprior(lambda[i - 1]) - - logdq(new.lambda, lambda[i - 1]) - logacc.prob <- min(0, logacc.prob)#0 = log(1) - - if(log(runif(1)) < logacc.prob) { - # Accept - lambda[i] <- new.lambda - } else { - # Reject - lambda[i] <- lambda[i - 1] - } -} -``` - - -The same burn-in and thinning as in the Beta-Binomial example will be -used. Furthermore, summary statistics and plots will be computed now: - - -```{r} -# Remove burn-in -lambda <- lambda[-c(1:500)] - -# Thinning -lambda <- lambda[seq(1, length(lambda), by = 10)] - -# Summary statistics -summary(lambda) - -par(mfrow = c(1, 2)) -plot(lambda, type = "l", main = "MCMC samples", ylab = expression(lambda)) -plot(density(lambda), main = "Posterior density", xlab = expression(lambda)) -``` - - -## Exercises - - -### Performance of the proposal distribution - -Similarly to the exercises in the example on Binomial data, you will -now assess the impact of the proposal distribution on the inference -process. - - -* Compute the effective sample size for the previous example. -How is this related to the number of IS samples (`n_simulations`)? - -
Solution - - ```{r} - ESS(ww) - n_simulations - ``` - -
* Use a different proposal distribution and check how sampling weights,
ESS and point estimates differ from those in the current example. For
example, a $\text{Ga}(5,\, 0.1)$ will put more mass on values
around 40, unlike the actual posterior distribution. What differences
do you find from the example presented here, which uses a log-Normal
proposal distribution? Why do you think these differences appear?

Solution - - ```{r} - n_simulations <- 10000 - set.seed(12) - lambda_sim <- rgamma(n_simulations,5,0.1) - loglik_pois <- sapply(lambda_sim, function(LAMBDA) { - sum(dpois(GoTdata$Us, LAMBDA, log = TRUE)) - }) - log_ww <- loglik_pois + dgamma(lambda_sim, 0.01, 0.01, log = TRUE) - dgamma(lambda_sim, 5, 0.1, log=TRUE) - log_ww <- log_ww - max(log_ww) - ww <- exp(log_ww) - ww <- ww / sum(ww) - ``` - - ```{r fig = TRUE} - hist(ww, xlab = "Importance weights") - ``` - - ```{r} - post_mean <- sum(lambda_sim * ww) - post_mean - post_var <- sum(lambda_sim^2 * ww)- post_mean^2 - post_var - ``` - - ```{r fig = TRUE} - plot(density(lambda_sim, weights = ww, bw = 0.5), main = "Posterior density", xlim = c(10,40)) - ``` - - ```{r} - ESS(ww) - n_simulations - ``` - -
- - -### Sampling from the actual posterior - -Use the posterior distribution (which for this particular case is known -in a closed form) as the proposal distribution. - -* What is the distribution of the importance weights now? - -
Solution - - ```{r} - n_simulations <- 10000 - set.seed(12) - lambda_sim <- rgamma(n_simulations, sum(GoTdata$Us) + 0.01, length(GoTdata$Us) + 0.01) - loglik_pois <- sapply(lambda_sim, function(LAMBDA) { - sum(dpois(GoTdata$Us, LAMBDA, log = TRUE)) - }) - log_ww <- loglik_pois + dgamma(lambda_sim, 0.01, 0.01, log = TRUE) - dgamma(lambda_sim, sum(GoTdata$Us) + 0.01, length(GoTdata$Us) + 0.01, log=TRUE) - log_ww <- log_ww - max(log_ww) - ww <- exp(log_ww) - ww <- ww / sum(ww) - ``` - - ```{r fig = TRUE} - hist(ww, xlab = "Importance weights") - ``` - - ```{r} - (post_mean <- sum(lambda_sim * ww)) - (post_var <- sum(lambda_sim^2 * ww)- post_mean^2) - ``` - - ```{r fig = TRUE} - plot(density(lambda_sim, weights = ww, bw = 0.5) , main = "Posterior density") - ``` - -
- - -* Compute the effective sample size. How large is it? Why do you think this happens? - -
Solution - - ```{r} - ESS(ww) - n_simulations - ``` - -
- - - -### Non-conjugate prior - -Just as in the Binomial example, non-conjugate priors can be used -for the $\lambda$ parameter. In this case, given that $\lambda$ is -positive, care must be taken when choosing the prior. For this exercise, -try to use a log-Normal prior with mean 4 and standard deviation 1. -This will provide a prior of $\lambda$ as seen in the next plot: - -```{r} -curve( - dlnorm(x, 4, 1), - xlim = c(0, 600), n = 1e3, - xlab = expression(lambda), ylab = NA, - yaxt = 'n', bty = 'n' -) -``` - -Hence, with this prior small values of the average number of u's are -given a higher prior density. - -Parameter estimates for a model with this prior can be obtained using IS and M-H algorithms. - -* Fit the model with the non-conjugate prior using the IS algorithm. - -
Solution - - ```{r} - n_simulations <- 10000 - set.seed(1) - lambda_sim <- rlnorm(n_simulations, 3, 0.5) - ``` - - ```{r} - # Log-Likelihood (for each value of lambda_sim) - loglik_pois <- sapply(lambda_sim, function(LAMBDA) { - sum(dpois(GoTdata$Us, LAMBDA, log = TRUE)) - }) - - # Log-weights: log-lik + log-prior - log-proposal_distribution - log_ww <- loglik_pois + dlnorm(lambda_sim, 4, 1, log = TRUE) - dlnorm(lambda_sim, 3, 0.5, log = TRUE) - - # Re-scale weights to sum up to one - log_ww <- log_ww - max(log_ww) - ww <- exp(log_ww) - ww <- ww / sum(ww) - ``` - - ```{r fig = TRUE} - hist(ww, xlab = "Importance weights") - ``` - - ```{r} - # Posterior mean - (post_mean <- sum(lambda_sim * ww)) - # Posterior variance - (post_var <- sum(lambda_sim^2 * ww)- post_mean^2) - ``` - - ```{r fig = TRUE} - plot(density(lambda_sim, weights = ww, bw = 0.5) , main = "Posterior density", xlim = c(10,40)) - ``` - - ```{r} - ESS <- function(ww){ - (sum(ww)^2)/sum(ww^2) - } - ESS(ww) - n_simulations - ``` - -
- -* Fit the model with the non-conjugate prior using the M-H algorithm. - -
Solution

```{r}
# Proposal distribution: sampling
rq <- function(lambda) {
  rlnorm(1, meanlog = log(lambda), sdlog = sqrt(1 / 100))
}

# Proposal distribution: log-density
logdq <- function(new.lambda, lambda) {
  dlnorm(new.lambda, meanlog = log(lambda), sdlog = sqrt(1 / 100), log = TRUE)
}

# Prior distribution: log-Normal(4, 1), log-density
logprior <- function(lambda) {
  dlnorm(lambda, 4, 1, log = TRUE)
}

# Log-Likelihood
loglik <- function(y, lambda) {
  res <- sum(dpois(y, lambda, log = TRUE))
}
```

```{r}
# Number of iterations
n.iter <- 40500

# Simulations of the parameter
lambda <- rep(NA, n.iter)

# Initial value
lambda[1] <- 30

for(i in 2:n.iter) {
  new.lambda <- rq(lambda[i - 1])

  # Log-Acceptance probability
  logacc.prob <- loglik(y, new.lambda) + logprior(new.lambda) +
    logdq(lambda[i - 1], new.lambda)
  logacc.prob <- logacc.prob - loglik(y, lambda[i - 1]) - logprior(lambda[i - 1]) -
    logdq(new.lambda, lambda[i - 1])
  logacc.prob <- min(0, logacc.prob) # Note that 0 = log(1)

  if(log(runif(1)) < logacc.prob) {
    # Accept
    lambda[i] <- new.lambda
  } else {
    # Reject
    lambda[i] <- lambda[i - 1]
  }
}
```

```{r}
# Remove burn-in
lambda <- lambda[-c(1:500)]

# Thinning
lambda <- lambda[seq(1, length(lambda), by = 10)]

# Summary statistics
summary(lambda)
```

```{r fig = TRUE}
par(mfrow = c(1, 2))
plot(lambda, type = "l", main = "MCMC samples", ylab = expression(lambda))
plot(density(lambda), main = "Posterior density", xlab = expression(lambda))
```

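As a quick, optional consistency check (assuming the `lambda` samples from the
M-H run above are available), the posterior mean and variance can be compared
with the importance sampling estimates obtained for the same non-conjugate
prior in the previous exercise:

```{r}
# Posterior mean and variance from the M-H samples; these should be
# broadly similar to the importance sampling estimates above
mean(lambda)
var(lambda)
```
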
- -# Gibbs Sampling - -As we have seen in the theory session, Gibbs Sampling is an MCMC method -which allows us to estimate one parameter at a time. This is very useful -for models which have lots of parameters, as in a sense it reduces a -very large multidimensional inference problem to a set of single -dimension problems. - -To recap, in order to generate a random sample from the joint density -$g\left(\mathbf{\theta}\right)= -g\left(\theta_1,\theta_2,\ldots,\theta_D\right)$ for a model with -$d$ parameters, we use the following algorithm: - -1. Start with an initial set -$\mathbf{\theta}^{(0)}= -\left(\theta_1^{(0)},\theta_2^{(0)},\ldots,\theta_D^{(0)}\right)$ -2. Generate $\theta_1^{(1)}$ from the conditional distribution -$\theta_1 \mid \left(\theta_2^{(0)},\ldots,\theta_D^{(0)}\right)$ -3. Generate $\theta_2^{(1)}$ from the conditional distribution -$\theta_2 \mid \left(\theta_1^{(1)},\theta_3^{(0)},\ldots,\theta_D^{(0)}\right)$ -4. $\quad\quad\cdots$ -5. Generate $\theta_D^{(1)}$ from the conditional distribution -$\theta_D \mid \left(\theta_1^{(1)},\theta_2^{(1)},\ldots,\theta_{D-1}^{(1)}\right)$ -6. Iterate from Step 2. - -As with Metropolis-Hastings, in Gibbs Sampling we typically discard the -initial simulations (the burn-in period), reducing the dependence on the -initial set of parameter values. - -As with the other MCMC algorithms, the resulting simulations approximate a -random sample from the posterior distribution. - -# Example: Simple Linear Regression - -We will illustrate the use of Gibbs Sampling on a simple linear -regression model. Recall that we saw yesterday that we can obtain an -analytical solution for a Bayesian linear regression, but that more -complex models require a simulation approach. - -The approach we take here for Gibbs Sampling on a simple linear -regression model is very easily generalised to more complex models, -the key is that whatever your model looks like, you update one -parameter at a time, using the conditional distribution of that -parameter given the current values of all the other parameters. - -The simple linear regression model we will analyse here is a reduced -version of the general linear regression model we saw yesterday: -$$ -Y_i = \beta_0+\beta_1x_i+\epsilon_i -$$ -for response variable $\mathbf{Y}=\left(Y_1,Y_2,\ldots,Y_n\right)$, -explanatory variable $\mathbf{x}=\left(x_1,x_2,\ldots,x_n\right)$ and -residual vector $\mathbf{\epsilon}= -\left(\epsilon_1,\epsilon_2,\ldots,\epsilon_n\right)$ for a sample of -size $n$, where $\beta_0$ is the regression intercept, -$\beta_1$ is the regression slope, and where the -$\epsilon_i$ are independent with -$\epsilon_i\sim \textrm{N}\left(0,\sigma^2\right)$ $\forall i=1,2,\ldots,n$. For -convenience, we refer to the combined set of $\mathbf{Y}$ and -$\mathbf{x}$ data as $\mathcal{D}$. We also define -$\hat{\mathbf{y}}$ to be the fitted response vector (i.e., -$\hat{y}_i = \beta_0 + \beta_1 x_i$ from the regression equation) using the -current values of the parameters $\beta_0$, $\beta_1$ and precision $\tau = 1$ -(remember that $\tau=\frac{1}{\sigma^2}$) from the Gibbs Sampling simulations. - -For Bayesian inference, it is simpler to work with precisions $\tau$ -rather than with variances $\sigma^2$. 
Given priors -$$ -\begin{align*} -\pi(\tau) &= \textrm{Ga}\left(\alpha, \beta\right), \\ -\pi(\beta_0) &= \textrm{N}\left(\mu_{\beta_0}, \tau_{\beta_0}\right), \quad\textrm{and} \\ -\pi(\beta_1) &= \textrm{N}\left(\mu_{\beta_1}, \tau_{\beta_1}\right) -\end{align*} -$$ -then by combining with the likelihood we can derive the full conditional distributions for each parameter. For $\tau$, we have -$$ -\begin{align*} -\pi\left(\tau|\mathcal{D}, \beta_0, \beta_1\right) &\propto -\pi(\tau)\prod_{i=1}^nL\left(y_i|\hat{y_i}\right), \\ -&= \tau^{\alpha-1}e^{-\beta\tau} -\tau^{\frac{n}{2}}\exp\left\{- -\frac{\tau}{2}\sum_{i=1}^n(y_i-\hat{y_i})^2\right\}, \\ -&= \tau^{\alpha-1 + \frac{n}{2}}\exp\left\{-\beta\tau- -\frac{\tau}{2}\sum_{i=1}^n(y_i-\hat{y_i})^2\right\},\\ -\textrm{so} \quad \tau|\mathcal{D}, \beta_0, \beta_1 &\sim -\textrm{Ga}\left(\alpha+\frac{n}{2}, -\beta +\frac{1}{2}\sum_{i=1}^n(y_i-\hat{y_i})^2\right). -\end{align*} -$$ -For $\beta_0$ we will need to expand $\hat{y}_i=\beta_0+\beta_1x_i$, -and then we have -$$ -\begin{align*} -p(\beta_0|\mathcal{D}, \beta_1, \tau) &\propto -\pi(\beta_0)\prod_{i=1}^nL\left(y_i|\hat{y_i}\right), \\ -&= \tau_{\beta_0}^{\frac{1}{2}}\exp\left\{- -\frac{\tau_{\beta_0}}{2}(\beta_0-\mu_{\beta_0})^2\right\}\tau^\frac{n}{2} -\exp\left\{-\frac{\tau}{2}\sum_{i=1}^n(y_i-\hat{y_i})^2\right\}, \\ -&\propto \exp\left\{-\frac{\tau_{\beta_0}}{2}(\beta_0-\mu_{\beta_0})^2- -\frac{\tau}{2}\sum_{i=1}^n(y_i-\beta_0-x_i \beta_1)^2\right\}, \\ -&= \exp \left\{ -\frac{1}{2}\left(\beta_0^2(\tau_{\beta_0} + n\tau) + -\beta_0\left(-2\tau_{\beta_0}\mu_{\beta_0} - 2\tau\sum_{i=1}^n -(y_i - x_i\beta_1)\right) \right) + C \right\}, \\ -\textrm{so} \quad \beta_0|\mathcal{D}, \beta_1, \tau &\sim -\mathcal{N}\left((\tau_{\beta_0} + n\tau)^{-1} -\left(\tau_{\beta_0}\mu_{\beta_0} + \tau\sum_{i=1}^n (y_i - x_i\beta_1)\right), -\tau_{\beta_0} + n\tau\right). -\end{align*} -$$ - -In the above, $C$ represents a quantity that is constant with respect to $\beta_0$ and hence can be ignored. Finally, for $\beta_1$ we have -$$ -\begin{align*} -p(\beta_1|\mathcal{D}, \beta_0, \tau) &\propto -\pi(\beta_1)\prod_{i=1}^nL\left(y_i|\hat{y_i}\right), \\ -&= \tau_{\beta_1}^{\frac{1}{2}}\exp\left\{- -\frac{\tau_{\beta_1}}{2}(\beta_1-\mu_{\beta_1})^2\right\}\tau^\frac{n}{2} -\exp\left\{-\frac{\tau}{2}\sum_{i=1}^n(y_i-\hat{y_i})^2\right\}, \\ -&\propto \exp\left\{-\frac{\tau_{\beta_1}}{2}(\beta_1-\mu_{\beta_1})^2- -\frac{\tau}{2}\sum_{i=1}^n(y_i-\beta_0-x_i \beta_1)^2\right\}, \\ -&= \exp \left\{ -\frac{1}{2}\left(\beta_1^2\left(\tau_{\beta_1} + -\tau\sum_{i=1}^n x_i^2\right) + -\beta_1\left(-2\tau_{\beta_1}\mu_{\beta_1} - 2\tau\sum_{i=1}^n -(y_i - \beta_0)x_i\right) \right) + C \right\}, \\ -\textrm{so} \quad \beta_1|\mathcal{D}, \beta_0, \tau -&\sim \textrm{N}\left((\tau_{\beta_1} + \tau\sum_{i=1}^n x_i^2)^{-1} -\left(\tau_{\beta_1}\mu_{\beta_1} + \tau\sum_{i=1}^n (y_i - \beta_0)x_i\right), -\tau_{\beta_1} + \tau\sum_{i=1}^n x_i^2\right). -\end{align*} -$$ -This gives us an easy way to run Gibbs Sampling for linear regression; we have -standard distributions as full conditionals and there are standard -functions in R to obtain the simulations. 
We shall do this now for an -example in ecology, looking at the relationship between water pollution and -mayfly size - the data come from the book *Statistics for Ecologists Using R and Excel 2nd edition* by Mark Gardener (ISBN 9781784271398), see [the publisher's webpage](https://pelagicpublishing.com/products/statistics-for-ecologists-using-r-and-excel-gardener-2nd-edition). - -The data are as follows: - -- `length` - the length of a mayfly in mm; - -- `BOD` - biological oxygen demand in mg of oxygen per litre, effectively a -measure of organic pollution (since more organic pollution requires more oxygen -to break it down). - -The data can be read into R: -```{r} -# Read in data -BOD <- c(200,180,135,120,110,120,95,168,180,195,158,145,140,145,165,187, - 190,157,90,235,200,55,87,97,95) -mayfly.length <- c(20,21,22,23,21,20,19,16,15,14,21,21,21,20,19,18,17,19,21,13, - 16,25,24,23,22) -# Create data frame for the Gibbs Sampling, needs x and y -Data <- data.frame(x=BOD,y=mayfly.length) -``` - -We provide here some simple functions for running the analysis by Gibbs -Sampling, using the full conditionals derived above. The functions assume -you have a data frame with two columns, the response `y` and the covariate -`x`. - -```{r} -# Function to update tau, the precision -sample_tau <- function(data, beta0, beta1, alphatau, betatau) { - rgamma(1, - shape = alphatau+ nrow(data) / 2, - rate = betatau+ 0.5 * sum((data$y - (beta0 + data$x*beta1))^2) - ) -} - -# Function to update beta0, the regression intercept -sample_beta0 <- function(data, beta1, tau, mu0, tau0) { - prec <- tau0 + tau * nrow(data) - mean <- (tau0*mu0 + tau * sum(data$y - data$x*beta1)) / prec - rnorm(1, mean = mean, sd = 1 / sqrt(prec)) -} - -# Function to update beta1, the regression slope -sample_beta1 <- function(data, beta0, tau, mu1, tau1) { - prec <- tau1 + tau * sum(data$x * data$x) - mean <- (tau1*mu1 + tau * sum((data$y - beta0) * data$x)) / prec - rnorm(1, mean = mean, sd = 1 / sqrt(prec)) -} - -# Function to run the Gibbs Sampling, where you specify the number of -# simulations `m`, the starting values for each of the three regression -# parameters (`tau_start` etc), and the parameters for the prior -# distributions of tau, beta0 and beta1 -gibbs_sample <- function(data, - tau_start, - beta0_start, - beta1_start, - m, - alpha_tau, - beta_tau, - mu_beta0, - tau_beta0, - mu_beta1, - tau_beta1) { - tau <- numeric(m) - beta0 <- numeric(m) - beta1 <- numeric(m) - tau[1] <- tau_start - beta0[1] <- beta0_start - beta1[1] <- beta1_start - - for (i in 2:m) { - tau[i] <- - sample_tau(data, beta0[i-1], beta1[i-1], alpha_tau, beta_tau) - beta0[i] <- - sample_beta0(data, beta1[i-1], tau[i], mu_beta0, tau_beta0) - beta1[i] <- - sample_beta1(data, beta0[i], tau[i], mu_beta1, tau_beta1) - } - - Iteration <- 1:m - data.frame(Iteration,beta0,beta1,tau) -} -``` - -## Exercises - -We will use Gibbs Sampling to fit a Bayesian Linear Regression model to the -mayfly data. We will use the following prior distributions for the regression -parameters: -$$ -\begin{align*} -\pi(\tau) &= \textrm{Ga}\left(1, 1\right), \\ -\pi(\beta_0) &= \textrm{N}\left(0, 0.0001\right), \quad\textrm{and} \\ -\pi(\beta_1) &= \textrm{N}\left(0, 0.0001\right). -\end{align*} -$$ -We will set the initial values of all parameters to 1, i.e. $\beta_0^{(0)}=\beta_1^{(0)}=\tau^{(0)}=1$. - -### Data exploration - -- Investigate the data to see whether a linear regression model would be -sensible. [Hint: a scatterplot and a correlation coefficient could be helpful.] - -
Solution - - ```{r fig = TRUE} - # Scatterplot - plot(BOD,mayfly.length) - # Correlation with hypothesis test - cor.test(BOD,mayfly.length) - ``` - -
- -- Run a frequentist simple linear regression using the `lm` function in R; this -will be useful for comparison with our Bayesian analysis. - -
Solution - - ```{r} - # Linear Regression using lm() - linreg <- lm(mayfly.length ~ BOD, data = Data) - summary(linreg) - ``` - -
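To help with the comparison requested later, the frequentist 95% confidence
intervals can also be extracted; this is an optional addition to the solution
above, using the `linreg` fit just obtained:

```{r}
# Frequentist 95% confidence intervals for the regression coefficients,
# to compare with the Bayesian credible intervals obtained later
confint(linreg)
```
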
- -### Running the Gibbs Sampler - -- Use the code above to fit a Bayesian simple linear regression using `length` -as the response variable. Ensure you have a burn-in period so that the initial -simulations are discarded. Use the output from `gibbs_sample` to estimate the -regression parameters (with uncertainty measures). Also plot the estimates of -the posterior distributions. - -
Solution

```{r fig = TRUE}
# Bayesian Linear Regression using a Gibbs Sampler

# Set the number of iterations of the Gibbs Sampler - including how many to
# discard as the burn-in
m.burnin <- 500
m.keep <- 1000
m <- m.burnin + m.keep

# Obtain the samples
results1 <- gibbs_sample(Data, 1, 1, 1, m, 1, 1, 0, 0.0001, 0, 0.0001)
# Remove the burn-in
results1 <- results1[-(1:m.burnin), ]
# Add variance and standard deviation columns
results1$Variance <- 1 / results1$tau
results1$StdDev <- sqrt(results1$Variance)
# Estimates are the column means
apply(results1[, -1], 2, mean)
# Also look at uncertainty
apply(results1[, -1], 2, sd)
apply(results1[, -1], 2, quantile, probs = c(0.025, 0.975))

# Kernel density plots for beta0, beta1 and the residual standard deviation
plot(density(results1$beta0, bw = 0.5),
     main = "Posterior density for beta0")
plot(density(results1$beta1, bw = 0.25),
     main = "Posterior density for beta1")
plot(density(results1$StdDev, bw = 0.075),
     main = "Posterior density for StdDev")
```

- Use the function `ts.plot()` to view the sampled chains from the Gibbs
sampler and assess the autocorrelation visually. Is there any visual evidence
of autocorrelation, or do the samples look independent? (An optional `acf()`
check is sketched after the solution below.)

Solution - - ```{r fig = TRUE} - # Plot sampled series - ts.plot(results1$beta0,ylab="beta0",xlab="Iteration") - ts.plot(results1$beta1,ylab="beta1",xlab="Iteration") - ts.plot(results1$StdDev,ylab="sigma",xlab="Iteration") - ``` - -
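The visual impression from the trace plots can be backed up with
autocorrelation plots. This is an optional addition (assuming the `results1`
data frame from the solutions above) using the base-R `acf()` function:

```{r fig = TRUE}
# Autocorrelation functions of the sampled chains; bars well above zero
# at low lags indicate strong autocorrelation
par(mfrow = c(1, 2))
acf(results1$beta0, main = "beta0")
acf(results1$beta1, main = "beta1")
```
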
- How do the results compare with the frequentist output?

### Reducing the autocorrelation by mean-centring the covariate

- One method for reducing the autocorrelation in the sampling chains of the
regression parameters is to mean-centre the covariate(s); this works because
it reduces the dependence between the regression intercept and slope(s). Do
this for the current example, noting that you will need to correct the
estimate of the regression intercept afterwards. (A comparison of the chain
autocorrelations before and after mean-centring is sketched after the
solution below.)

Solution - - ```{r fig = TRUE} - # Mean-centre the x covariate - DataC <- Data - meanx <- mean(DataC$x) - DataC$x <- DataC$x - meanx - - # Bayesian Linear Regression using a Gibbs Sampler - - # Set the number of iterations of the Gibbs Sampler - including how many to - # discard as the burn-in - m.burnin <- 500 - m.keep <- 1000 - m <- m.burnin + m.keep - - # Obtain the samples - results2 <- gibbs_sample(DataC, 1, 1, 1, m, 1, 1, 0, 0.0001, 0, 0.0001) - # Remove the burn-in - results2 <- results2[-(1:m.burnin), ] - - # Correct the effect of the mean-centering on the intercept - results2$beta0 <- results2$beta0 - meanx * results2$beta1 - - # Add variance and standard deviation columns - results2$Variance <- 1 / results2$tau - results2$StdDev <- sqrt(results2$Variance) - # Estimates are the column means - apply(results2[, -1], 2, mean) - # Also look at uncertainty - apply(results2[, -1], 2, sd) - apply(results2[, -1], 2, quantile, probs = c(0.025, 0.975)) - - # Kernel density plots for beta0, beta1 and the residual standard - # deviation - plot(density(results2$beta0, bw = 0.5), - main = "Posterior density for beta0") - plot(density(results2$beta1, bw = 0.25), - main = "Posterior density for beta1") - plot(density(results2$StdDev, bw = 0.075), - main = "Posterior density for StdDev") - - # Plot sampled series - ts.plot(results2$beta0, ylab = "beta0", xlab = "Iteration") - ts.plot(results2$beta1, ylab = "beta1", xlab = "Iteration") - ts.plot(results2$StdDev, ylab = "sigma", xlab = "Iteration") - ``` - -
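As a final optional check (assuming the `results2` data frame from the
solution above), the autocorrelation functions of the new chains can be
compared with those shown after the previous exercise; the reduction in
autocorrelation after mean-centring should be clear:

```{r fig = TRUE}
# Autocorrelation of the chains based on the mean-centred covariate;
# compare with the acf plots for the original parameterisation above
par(mfrow = c(1, 2))
acf(results2$beta0, main = "beta0 (mean-centred x)")
acf(results2$beta1, main = "beta1 (mean-centred x)")
```
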
- diff --git a/vignettes/p9.Rmd b/vignettes/p9.Rmd deleted file mode 100644 index b454afb..0000000 --- a/vignettes/p9.Rmd +++ /dev/null @@ -1,464 +0,0 @@ ---- -title: 'Practical 6: Software and GLMs' -author: "VIBASS" -output: - html_vignette: - fig_caption: yes - number_sections: yes - toc: yes - fig_width: 6 - fig_height: 4 -vignette: > - %\VignetteIndexEntry{Practical 6: Software and GLMs} - %\VignetteEngine{knitr::rmarkdown} - %\VignetteEncoding{UTF-8} ---- - -```{r setup, include=FALSE} -knitr::opts_chunk$set( - echo = TRUE, - collapse = TRUE, - comment = "#>" -) -``` - -# Software for Bayesian Statistical Analysis - -So far, simple Bayesian models with conjugate priors have been -considered. As explained in previous practicals, when the posterior -distribution is not available in closed form, MCMC algorithms such as the -Metropolis-Hastings or -Gibbs Sampling can be used to obtain samples from it. - -In general, posterior distributions are seldom available in closed form and -implementing MCMC algorithms for complex models can be technically difficult -and very time-consuming. - -For this reason, in this Practical we start by looking at a number of `R` -packages to fit Bayesian statistical models. These packages will equip us -with tools which can be used to deal with more complex models efficiently, -without us having to do a lot of extra coding ourselves. Fitting Bayesian -models in `R` will then be much like fitting non-Bayesian models, using -model-fitting functions at the command line, and using standard syntax for -model specification. - -## BayesX and INLA - -In particular, the following two software packages will be considered: - -* `BayesX` - -* `INLA` - -`BayesX` (http://www.bayesx.org/) implements MCMC methods to obtain samples -from the joint posterior and is conveniently accessed from R via the package -`R2BayesX`. - -`INLA` (https://www.r-inla.org/) is based on producing (accurate) -approximations to the marginal posterior distributions of the model parameters. -Although this can be enough most of the time, making multivariate inference -with `INLA` can be difficult or impossible. However, in many cases this is -not needed and `INLA` can fit some classes of models in a fraction of the -time it takes with MCMC. - -Both `R2BayesX` and `INLA` have a very simple interface to define models -using a `formula` (in the same way as with `glm()` and `gam()` functions). - -While `R2BayesX` can be installed from CRAN, `INLA` is not on CRAN and -needs to be installed from a specific repository. - -## Other Bayesian Software - -* Package `MCMCpack` in R contains functions such as `MCMClogit()`, `MCMCPoisson()` and -`MCMCprobit()` for fitting specific kinds of models. -* A classic MCMC program is `BUGS`, (Bayesian Analysis using Gibbs Sampling) -described in Lunn et al. (2000): -[http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml](http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml). -BUGS can be used through graphical interfaces `WinBUGS` and `OpenBUGS`. Both of -these packages can be called from within R using packages `R2WinBUGS` and -`R2OpenBUGS`. -* `JAGS`, which stands for “just another Gibbs sampler”. Can also be called from -R using package `r2jags`. -* The `NIMBLE` package extends `BUGS` and implements MCMC and other methods -for Bayesian inference. You can get it from https://r-nimble.org, and is best -run directly from R. -* The `Stan` software implements Hamiltonian Monte Carlo and other methods for -fit hierarchical Bayesian models. 
It is available from https://mc-stan.org.
-
-# Bayesian Logistic Regression
-
-## Model Formulation
-
-To summarise the model formulation presented in the lecture, given a response
-variable $Y_i$ representing the count of a number of successes from a given
-number of trials $n_i$ with success probability $\theta_i$, we have
-
-* $(Y_i \mid \boldsymbol \theta_i) \sim\mbox{Bi}(n_i, \theta_i),\,\, i.i.d.\,\,, i=1, \ldots, m$
-\begin{align*}
-  \mbox{logit}(\theta_i) & =\eta_i \nonumber\\
-\eta_{i} & =\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}=\boldsymbol x_i\boldsymbol \beta\nonumber
-\end{align*}
-assuming the logit link function and with linear predictor $\eta_{i}$.
-
-## Example: Fake News
-
-The `fake_news` data set in the `bayesrules` package in `R` contains
-information about 150 news articles, some real news and some fake news.
-
-In this example, we will look at trying to predict whether a news article
-is fake or not given three explanatory variables.
-
-We can use the following code to extract the variables we want from the
-data set:
-
-```{r}
-library(bayesrules)
-fakenews <- fake_news[,c("type","title_has_excl","title_words","negative")]
-```
-
-The response variable `type` takes values `fake` or `real`, which should be
-self-explanatory. The three explanatory variables are:
-
-* `title_has_excl`, whether or not the article title contains an exclamation mark (values `TRUE` or `FALSE`);
-
-* `title_words`, the number of words in the title (a positive integer); and
-
-* `negative`, a sentiment rating, recorded on a continuous scale.
-
-In the exercises to follow, we will examine whether the chance of an article
-being fake news is related to these three covariates.
-
-## Fitting Bayesian Logistic Regression Models
-
-`BayesX` makes inference via MCMC; as noted above, it is accessed from R via
-the `R2BayesX` package, whose syntax for model fitting is very similar to that
-for fitting non-Bayesian models using `glm()`. If you do not yet have it
-installed, you can install it in the usual way from CRAN.
-
-The package must be loaded into R:
-```{r}
-library(R2BayesX)
-```
-
-The syntax for fitting a Bayesian Logistic Regression Model with one response
-variable and three explanatory variables will be like so:
-
-```{r eval=FALSE}
-model1 <- bayesx(formula = y ~ x1 + x2 + x3,
-                 data = data.set,
-                 family = "binomial")
-```
-
-Alternatively, we can obtain a Bayesian model fit without using MCMC with the
-INLA software, implemented in R via the `INLA` package. If you do not have this
-package installed already, note that it is not on CRAN; you will need to
-install it via
-
-```{r eval=FALSE}
-install.packages("INLA",repos=c(getOption("repos"),INLA="https://inla.r-inla-download.org/R/stable"), dep=TRUE)
-```
-
-After this, the package can be loaded into R:
-```{r}
-library(INLA)
-```
-
-
-## Model Fitting
-
-Note that the variable `title_has_excl` will need to be either replaced by or
-converted to a factor, for example
-
-```{r}
-fakenews$titlehasexcl <- as.factor(fakenews$title_has_excl)
-```
-
-Functions `summary` and `confint` produce a summary (including parameter
-estimates etc.) and credible intervals for the parameters, respectively.
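-
-For example, for the illustrative model `model1` defined above (a sketch, not
-run here), the calls would simply be:
-
-```{r eval=FALSE}
-# Posterior summary of the fitted model, and credible intervals for the
-# regression coefficients
-summary(model1)
-confint(model1)
-```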
- -In order to be able to obtain output plots from BayesX, it seems that we need -to create a new version of the response variable of type logical: - -```{r} -fakenews$typeFAKE <- fakenews$type == "fake" -``` - -## Exercises - -* Perform an exploratory assessment of the fake news data set, in particular -looking at the possible relationships between the explanatory variables -and the fake/real response variable `typeFAKE`. You may wish to use the R -function `boxplot()` here. - -
Solution - - ```{r fig = TRUE} - # Is there a link between the fakeness and whether the title has an exclamation mark? - table(fakenews$title_has_excl, fakenews$typeFAKE) - # For the quantitative variables, look at boxplots on fake vs real - boxplot(fakenews$title_words ~ fakenews$typeFAKE) - boxplot(fakenews$negative ~ fakenews$typeFAKE) - - ``` - -
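-
-If helpful, the counts in the table above can also be expressed as proportions
-within each level of `title_has_excl` (an optional extension to the solution):
-
-    ```{r eval = FALSE}
-    # Proportion of real and fake articles among titles with and without an
-    # exclamation mark (each row sums to 1)
-    prop.table(table(fakenews$title_has_excl, fakenews$typeFAKE), margin = 1)
-    ```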
- -* Fit a Bayesian model in BayesX using the fake news `typeFAKE` variable as -response and the others as covariates. Examine the output; does the model fit -well, and is there any evidence that any of the explanatory variables are -associated with changes in probability of an article being fake or not? - -
Solution - - ```{r fig = TRUE} - # Produce the BayesX output - bayesx.output <- bayesx(formula = typeFAKE ~ titlehasexcl + title_words + negative, - data = fakenews, - family = "binomial", - method = "MCMC", - iter = 15000, - burnin = 5000) - summary(bayesx.output) - confint(bayesx.output) - ``` - -
- -* Produce plots of the MCMC sample traces and the estimated posterior -distributions for the model parameters. Does it seem like convergence has been -achieved? - -
Solution
-
-    ```{r fig = TRUE, fig.width = 5, fig.height = 10}
-    # Traces can be obtained separately
-    plot(bayesx.output, which = "coef-samples")
-    ```
-
-    ```{r fig = TRUE}
-    # And the density plots one-by-one
-    par(mfrow = c(2, 2))
-    plot(density(samples(bayesx.output)[, "titlehasexclTRUE"]), main = "Title Has Excl")
-    plot(density(samples(bayesx.output)[, "title_words"]), main = "Title Words")
-    plot(density(samples(bayesx.output)[, "negative"]), main = "Negative")
-    ```
-
- -* Fit the Bayesian model without MCMC using `INLA`; note that the summary output -provides credible intervals for each parameter to help us make inference. -Also, in INLA a Binomial response needs to be entered as type integer, so we -need another conversion: - ```{r} - fakenews$typeFAKE.int <- as.integer(fakenews$typeFAKE) - ``` - -
Solution - - ```{r fig = TRUE} - # Fit model - note similarity with bayesx syntax - inla.output <- inla(formula = typeFAKE.int ~ titlehasexcl + title_words + negative, - data = fakenews, - family = "binomial") - # Summarise output - summary(inla.output) - ``` -
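-
-Because the estimates are on the logit scale, it can also help to look at them
-on the odds-ratio scale. This is an optional sketch; it assumes the usual
-column names of the `summary.fixed` element of the `inla` output:
-
-    ```{r eval = FALSE}
-    # Posterior means and 95% credible interval limits, exponentiated to the
-    # odds-ratio scale
-    exp(inla.output$summary.fixed[, c("mean", "0.025quant", "0.975quant")])
-    ```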
- -* Fit a non-Bayesian model using `glm()` for comparison. How do the model fits -compare? -
Solution
-
-    ```{r fig = TRUE}
-    # Fit model - note similarity with bayesx syntax
-    glm.output <- glm(formula = typeFAKE ~ titlehasexcl + title_words + negative,
-                      data = fakenews,
-                      family = "binomial")
-    # Summarise output
-    summary(glm.output)
-    # Perform ANOVA on each variable in turn
-    drop1(glm.output, test = "Chisq")
-    ```
-
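-
-As an optional sketch (assuming the `inla.output` object from the previous
-exercise is still available), the maximum-likelihood estimates can be lined up
-against the posterior means:
-
-    ```{r eval = FALSE}
-    # MLEs from glm() next to posterior means from INLA; with INLA's vague
-    # default priors the two sets of estimates should be very similar
-    round(cbind(glm = coef(glm.output),
-                INLA = inla.output$summary.fixed[, "mean"]), 3)
-    ```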
- -# Bayesian Poisson Regression - -## Model Formulation - -To summarise the model formulation presented in the lecture, given a response -variable $Y_i$ representing the counts occurring from a process with mean -parameter $\lambda_i$: - -* $(Y_i \mid \boldsymbol \lambda_i) \sim\mbox{Po}(\lambda_i),\,\, i.i.d.\,\,, i=1, \ldots, n$ -\begin{align*} - \mbox{log}(\lambda_i) & =\eta_i \nonumber\\ -\eta_{i} & =\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}=\boldsymbol x_i\boldsymbol \beta\nonumber -\end{align*} -assuming the log link function and with linear predictor $\eta_{i}$. - -## Example: Emergency Room Complaints - -For this example we will use the `esdcomp` data set, which is available in the -`faraway` package. This data set records complaints about emergency room -doctors. In particular, data was recorded on 44 doctors working in an -emergency service at a hospital to study the factors affecting the number of -complaints received. - -The response variable that we will use is `complaints`, an integer count of the -number of complaints received. It is expected that the number of complaints will -scale by the number of visits (contained in the `visits` column), so we are -modelling the rate of complaints per visit - thus we will need to include a new -variable `log.visits` as an offset. - -The three explanatory variables we will use in the analysis are: - -* `residency`, whether or not the doctor is still in residency training (values -`N` or `Y`); - -* `gender`, the gender of the doctor (values `F` or `M`); and - -* `revenue`, dollars per hour earned by the doctor, recorded as an integer. - -Our simple aim here is to assess whether the seniority, gender or income of the -doctor is linked with the rate of complaints against that doctor. - -We can use the following code to extract the data we want without having to load -the whole package: - -```{r} -esdcomp <- faraway::esdcomp -``` - -## Fitting Bayesian Poisson Regression Models - -Again we can use `BayesX` and `INLA` to fit this form of Bayesian generalised -linear model. - -If not loaded already, the packages must be loaded into R: -```{r echo=FALSE} -library(R2BayesX) -library(INLA) -``` - -In BayesX, the syntax for fitting a Bayesian Poisson Regression Model with one -response variable, three explanatory variables and an offset will be like so: - -```{r eval=FALSE} -model1 <- bayesx(formula = y~x1+x2+x3+offset(w), - data = data.set, - family="poisson") -``` - -As noted above we need to include an offset in this analysis; since -for a Poisson GLM we will be using a log() link function by default, we must -compute the log of the number of visits and include that in the data set -`esdcomp`: - -```{r} -esdcomp$log.visits <- log(esdcomp$visits) -``` - -The offset term in the model is then written - -`offset(log.visits)` - -in the call to `bayesx`. - -## Exercises - -* Perform an exploratory assessment of the emergency room complaints data set, -particularly how the response variable `complaints` varies with the proposed -explanatory variables relative to the number of visits. To do this, create -another variable which is the ratio of `complaints` to `visits`. - -
Solution
-
-    ```{r fig = TRUE}
-    # Compute the ratio
-    esdcomp$ratio <- esdcomp$complaints / esdcomp$visits
-    # Plot the link with revenue
-    plot(esdcomp$revenue, esdcomp$ratio)
-    # Use boxplots against residency and gender
-    boxplot(esdcomp$ratio ~ esdcomp$residency)
-    boxplot(esdcomp$ratio ~ esdcomp$gender)
-    ```
-
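-
-The group-wise complaint rates can also be summarised numerically (an optional
-extension to the plots above):
-
-    ```{r eval = FALSE}
-    # Mean complaints-per-visit ratio by residency status and gender
-    aggregate(ratio ~ residency + gender, data = esdcomp, FUN = mean)
-    ```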
- -* Fit a Bayesian model in BayesX using the `complaints` variable as Poisson -response and the others as covariates. Examine the output; does the model fit -well, and is there any evidence that any of the explanatory variables are -associated with the rate of complaints? - -
Solution
-
-    ```{r fig = TRUE}
-    # Produce the BayesX output
-    esdcomp$log.visits <- log(esdcomp$visits)
-    bayesx.output <- bayesx(formula = complaints ~ offset(log.visits) + residency + gender + revenue,
-                            data = esdcomp,
-                            family = "poisson")
-    # Summarise output
-    summary(bayesx.output)
-    ```
-
- - -* Produce plots of the MCMC sample traces and the estimated posterior -distributions for the model parameters. Does it seem like convergence has been -achieved? - -
Solution
-
-    ```{r fig = TRUE, fig.width = 5, fig.height = 10}
-    # An overall plot of sample traces and density estimates
-    # plot(samples(bayesx.output))
-    # Traces can be obtained separately
-    plot(bayesx.output, which = "coef-samples")
-    ```
-
-    ```{r fig = TRUE}
-    # And the density plots one-by-one
-    par(mfrow = c(2, 2))
-    plot(density(samples(bayesx.output)[, "residencyY"]), main = "Residency")
-    plot(density(samples(bayesx.output)[, "genderM"]), main = "Gender")
-    plot(density(samples(bayesx.output)[, "revenue"]), main = "Revenue")
-    ```
-
- -* Fit the Bayesian model without MCMC using `INLA`; note that the summary output -provides credible intervals for each parameter to help us make inference. - -
Solution - - ```{r} - # Fit model - note similarity with bayesx syntax - inla.output <- inla(formula = complaints ~ offset(log.visits) + residency + gender + revenue, - data = esdcomp, - family = "poisson") - # Summarise output - summary(inla.output) - ``` - -
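-
-As an aside (an optional sketch, not part of the original solution), the same
-model can be specified in `INLA` by passing the exposure through the `E`
-argument instead of including the log-offset in the formula:
-
-    ```{r eval = FALSE}
-    # Alternative specification: supply the number of visits as the exposure E,
-    # rather than using offset(log.visits) in the formula
-    inla.output.E <- inla(complaints ~ residency + gender + revenue,
-                          data = esdcomp,
-                          family = "poisson",
-                          E = esdcomp$visits)
-    summary(inla.output.E)
-    ```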
- -* Fit a non-Bayesian model using `glm()` for comparison. How do the model fits -compare? -
Solution - - ```{r fig = TRUE} - # Fit model - note similarity with bayesx syntax - esdcomp$log.visits <- log(esdcomp$visits) - glm.output <- glm(formula = complaints ~ offset(log.visits) + residency + gender + revenue, - data = esdcomp, - family = "poisson") - # Summarise output - summary(glm.output) - # Perform ANOVA on each variable in turn - drop1(glm.output, test = "Chisq") - ``` - -
-