From 7fc2afc45ef349b235836651c9dbff3da0c89925 Mon Sep 17 00:00:00 2001 From: Mark J Brewer Date: Wed, 3 Jul 2024 16:46:17 +0100 Subject: [PATCH] Big Update - Now Using BayesX Removed most of the MCMCpack (apart from the simple example in Practical 5), now reverted to using BayesX throughout, much happier! --- DESCRIPTION | 8 +- vignettes/p5.Rmd | 7 +- vignettes/p6.Rmd | 459 ++++++++++++++++++++++++----------------------- vignettes/p7.Rmd | 384 +++++++++++++++++---------------------- 4 files changed, 406 insertions(+), 452 deletions(-) diff --git a/DESCRIPTION b/DESCRIPTION index 6bc4e2c..d293481 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,7 +1,7 @@ Type: Package Package: vibass Title: Valencia International Bayesian Summer School -Version: 0.0.41 +Version: 0.0.43 Authors@R: c(person(given = "VIBASS7", role = c("aut")), @@ -46,11 +46,12 @@ Imports: crayon, dplyr, extraDistr, + faraway, ggplot2, golem, knitr, magrittr, - MCMCpack, + R2BayesX, rlang, rstudioapi, shiny (>= 1.5), @@ -60,7 +61,6 @@ Suggests: bayesrules, coda, cowplot, - faraway, hrbrthemes, htmlwidgets, INLA, @@ -68,9 +68,9 @@ Suggests: LaplacesDemon, magick, MASS, + MCMCpack, plotly, rmarkdown, - R2BayesX, pacman, png, spData, diff --git a/vignettes/p5.Rmd b/vignettes/p5.Rmd index fbdf414..ed1b60d 100644 --- a/vignettes/p5.Rmd +++ b/vignettes/p5.Rmd @@ -487,10 +487,11 @@ package, using the so-called "full conditional" distributions - that is, the conditional distributions referred to in the [Section on Gibbs Sampling](#Gibbs). -We will use the R package MCMCpack to run the Gibbs Sampling here and for the -other examples coming in this and the next two Practicals. +We will use the R package MCMCpack to run the Gibbs Sampling for this simple +example, although we will use more advanced software in Practicals 6 and 7 for +more complex examples. -We will study an example from ecology, looking at the relationship between water pollution and mayfly size - the data come from the book *Statistics for +We will study an problem from ecology, looking at the relationship between water pollution and mayfly size - the data come from the book *Statistics for Ecologists Using R and Excel 2nd edition* by Mark Gardener (ISBN 9781784271398), see [the publisher's webpage](https://pelagicpublishing.com/products/statistics-for-ecologists-using-r-and-excel-gardener-2nd-edition). diff --git a/vignettes/p6.Rmd b/vignettes/p6.Rmd index 489f01c..5a06a75 100644 --- a/vignettes/p6.Rmd +++ b/vignettes/p6.Rmd @@ -22,6 +22,7 @@ knitr::opts_chunk$set( ) ``` + # Software for Bayesian Statistical Analysis So far, simple Bayesian models with conjugate priors have been @@ -34,23 +35,33 @@ In general, posterior distributions are seldom available in closed form and implementing MCMC algorithms for complex models can be technically difficult and very time-consuming. -For this reason, in this Practical we start by listing a number of `R` -packages to fit Bayesian statistical models. These packages equip researchers -with tools for dealing with complex models efficiently, -without having to do a lot of extra mathematics and coding. Fitting Bayesian -models in `R` is then much like fitting non-Bayesian models, using +For this reason, in this Practical we start by looking at a number of `R` +packages to fit Bayesian statistical models. These packages will equip us +with tools which can be used to deal with more complex models efficiently, +without us having to do a lot of extra coding ourselves. Fitting Bayesian +models in `R` will then be much like fitting non-Bayesian models, using model-fitting functions at the command line, and using standard syntax for model specification. -## Specific Software +## BayesX + +In particular, the following software package will be considered: + +* `BayesX` -* We have already seen package `MCMCpack` in R, which contains functions -such as `MCMClogit()`, `MCMCPoisson()` and `MCMCprobit()` for fitting -specific kinds of models via simple function calls. -* `BayesX` (http://www.bayesx.org/) implements MCMC methods to obtain samples +`BayesX` (http://www.bayesx.org/) implements MCMC methods to obtain samples from the joint posterior and is conveniently accessed from R via the package -`R2BayesX`. It has a very simple interface to define models. +`R2BayesX`. + +`R2BayesX` has a very simple interface to define models using a `formula` (in the same way as with `glm()` and `gam()` functions). + +`R2BayesX` can be installed from CRAN. + +## Other Bayesian Software + +* Package `MCMCpack` in R contains functions such as `MCMClogit()`, `MCMCPoisson()` and +`MCMCprobit()` for fitting specific kinds of models. * `INLA` (https://www.r-inla.org/) is based on producing (accurate) approximations to the marginal posterior distributions of the model parameters. Although this can be enough most of the time, making multivariate inference @@ -62,21 +73,18 @@ specific website where it can be downloaded: [https://www.r-inla.org/download-install](https://www.r-inla.org/download-install) * A classic MCMC program is `BUGS`, (Bayesian Analysis using Gibbs Sampling) described in Lunn et al. (2000): -[http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml](http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml). -`BUGS` can be used through graphical interfaces `WinBUGS` and `OpenBUGS`. Both of + [http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml](http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml). +BUGS can be used through graphical interfaces `WinBUGS` and `OpenBUGS`. Both of these packages can be called from within R using packages `R2WinBUGS` and -`R2OpenBUGS`. There also exists software called -[`multibugs`](https://www.multibugs.org/) which is a parallel implementation of -BUGS. +`R2OpenBUGS`. * `JAGS`, which stands for “just another Gibbs sampler”. Can also be called from R using package `r2jags`. * The `NIMBLE` package extends `BUGS` and implements MCMC and other methods -for Bayesian inference. You can get it from -[https://r-nimble.org](https://r-nimble.org), and is best +for Bayesian inference. You can get it from https://r-nimble.org, and is best run directly from R. * The `Stan` software implements Hamiltonian Monte Carlo and other methods for -fitting hierarchical Bayesian models. It is available from -[https://mc-stan.org](https://mc-stan.org). +fit hierarchical Bayesian models. It is available from https://mc-stan.org. + # Bayesian Logistic Regression @@ -87,8 +95,8 @@ variable $Y_i$ representing the count of a number of successes from a given number of trials $n_i$ with success probability $\theta_i$, we have * $(Y_i \mid \boldsymbol \theta_i) \sim\mbox{Bi}(n_i, \theta_i),\,\, i.i.d.\,\,, i=1, \ldots, m$ -\begin{align*} - \mbox{logit}(\theta_i) & =\eta_i \nonumber\\ + \begin{align*} +\mbox{logit}(\theta_i) & =\eta_i \nonumber\\ \eta_{i} & =\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}=\boldsymbol x_i\boldsymbol \beta\nonumber \end{align*} assuming the logit link function and with linear predictor $\eta_{i}$. @@ -123,24 +131,24 @@ being fake news is related to the three covariates here. ## Fitting Bayesian Logistic Regression Models -We will analyse these data using the `MCMClogit()` function in package -`MCMCpack`. Note that this function uses a Metropolis-Hastings algorithm to -fit the model, rather than the Gibbs Sampler as used by `MCMCregress()`. +`BayesX` makes inference via MCMC, via the `R2BayesX` package which as noted +makes the syntax for model fitting very similar to that for fitting +non-Bayesian models using `glm()` in R. If you do not yet have it installed, +you can install it in the usual way from CRAN. -The package must first be loaded into R: +The package must be loaded into R: ```{r} -library(MCMCpack) +library(R2BayesX) ``` The syntax for fitting a Bayesian Logistic Regression Model with one response variable and three explanatory variables will be like so: ```{r eval=FALSE} -model1 <- MCMClogit(formula = y ~ x1 + x2 + x3, - data = data.set) +model1 <- bayesx(formula = y ~ x1 + x2 + x3, + data = data.set, + family = "binomial") ``` -with extra options for burn-in (`burnin`), MCMC samples (`mcmc`), and thinning -(`thin`) as with `MCMCregress()`. ## Model Fitting @@ -151,11 +159,11 @@ converted to a factor, for example fakenews$titlehasexcl <- as.factor(fakenews$title_has_excl) ``` -Function `summary` produces a summary (including parameter -estimates etc) for the parameters. +Functions `summary` and `confint` produce a summary (including parameter + estimates etc) and confidence intervals for the parameters, respectively. - - +In order to be able to obtain output plots from BayesX, it seems that we need +to create a new version of the response variable of type logical: ```{r} fakenews$typeFAKE <- fakenews$type == "fake" @@ -165,247 +173,244 @@ fakenews$typeFAKE <- fakenews$type == "fake" * Perform an exploratory assessment of the fake news data set, in particular looking at the possible relationships between the explanatory variables -and the fake/real response variable `typeFAKE`. You may wish to use the R +and the fake/real response variable `typeFAKE`. You may wish to use the R function `boxplot()` here. -
Solution - - ```{r fig = TRUE} - # Is there a link between the fakeness and whether the title has an exclamation mark? - table(fakenews$title_has_excl, fakenews$type) - # For the quantitative variables, look at boxplots on fake vs real - boxplot(fakenews$title_words ~ fakenews$type) - boxplot(fakenews$negative ~ fakenews$type) - - ``` - -
- -* Fit a Bayesian model in MCMCpack using the fake news `typeFAKE` variable as +
Solution + + ```{r fig = TRUE} + # Is there a link between the fakeness and whether the title has an exclamation mark? + table(fakenews$title_has_excl, fakenews$typeFAKE) + # For the quantitative variables, look at boxplots on fake vs real + boxplot(fakenews$title_words ~ fakenews$typeFAKE) + boxplot(fakenews$negative ~ fakenews$typeFAKE) + + ``` + +
+ +* Fit a Bayesian model in BayesX using the fake news `typeFAKE` variable as response and the others as covariates. Examine the output; does the model fit well, and is there any evidence that any of the explanatory variables are associated with changes in probability of an article being fake or not? -
Solution - - ```{r fig = TRUE} - # Fit the logistic regression model - results1 <- MCMClogit(formula = typeFAKE ~ titlehasexcl + title_words + negative, - data = fakenews, - burnin = 5000, mcmc = 15000, thin=1, - beta.start = NA, # uses maximum likelihood estimates as starting values - b0 = c(0,0,0,0), B0 = c(0.0001,0.0001,0.0001,0.0001), - verbose=1000) - summary(results1) - ``` - -
+
Solution + + ```{r fig = TRUE} + # Produce the BayesX output + bayesx.output <- bayesx(formula = typeFAKE ~ titlehasexcl + title_words + negative, + data = fakenews, + family = "binomial", + method = "MCMC", + iter = 15000, + burnin = 5000) + summary(bayesx.output) + confint(bayesx.output) + ``` + +
* Produce plots of the MCMC sample traces and the estimated posterior distributions for the model parameters. Does it seem like convergence has been achieved? -
Solution - - ```{r fig = TRUE, fig.width = 5, fig.height = 10} - # Trace plots - par(mfrow=c(2,2)) - traceplot(results1) - ``` - - ```{r fig = TRUE} - # And the density plots - par(mfrow=c(2,2)) - densplot(results1) - - ``` - -
+
Solution + + ```{r fig = TRUE, fig.width = 5, fig.height = 10} + # Traces can be obtained separately + plot(bayesx.output,which = "coef-samples") + ``` + + ```{r fig = TRUE} + # And the density plots one-by-one + par(mfrow=c(2,2)) + plot(density(samples(bayesx.output)[,"titlehasexclTRUE"]),main="Title Has Excl") + plot(density(samples(bayesx.output)[,"title_words"]),main="Title Words") + plot(density(samples(bayesx.output)[,"negative"]),main="Negative") + ``` + +
* Fit a non-Bayesian model using `glm()` for comparison. How do the model fits compare? -
Solution - - ```{r fig = TRUE} - # Fit model - note similarity with bayesx syntax - glm.output <- glm(formula = typeFAKE ~ titlehasexcl + title_words + negative, - data = fakenews, - family = "binomial") - # Summarise output - summary(glm.output) - # Perform ANOVA on each variable in turn - drop1(glm.output,test="Chisq") - ``` - -
- + +
Solution + + ```{r fig = TRUE} + # Fit model - note similarity with bayesx syntax + glm.output <- glm(formula = typeFAKE ~ titlehasexcl + title_words + negative, + data = fakenews, + family = "binomial") + # Summarise output + summary(glm.output) + # Perform ANOVA on each variable in turn + drop1(glm.output,test="Chisq") + ``` + +
+ + # Bayesian Poisson Regression ## Model Formulation To summarise the model formulation presented in the lecture, given a response -variable $Y_i$ representing the counts occurring from a process with mean +variable $Y_i$ representing the counts occurring from a process with mean parameter $\lambda_i$: -* $(Y_i \mid \boldsymbol \lambda_i) \sim\mbox{Po}(\lambda_i),\,\, i.i.d.\,\,, i=1, \ldots, n$ -\begin{align*} - \mbox{log}(\lambda_i) & =\eta_i \nonumber\\ + * $(Y_i \mid \boldsymbol \lambda_i) \sim\mbox{Po}(\lambda_i),\,\, i.i.d.\,\,, i=1, \ldots, n$ + \begin{align*} +\mbox{log}(\lambda_i) & =\eta_i \nonumber\\ \eta_{i} & =\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}=\boldsymbol x_i\boldsymbol \beta\nonumber \end{align*} assuming the log link function and with linear predictor $\eta_{i}$. -The following example considers a slightly more complex relationship. +## Example: Emergency Room Complaints -## Example: Salmonella +For this example we will use the `esdcomp` data set, which is available in the +`faraway` package. This data set records complaints about emergency room +doctors. In particular, data was recorded on 44 doctors working in an +emergency service at a hospital to study the factors affecting the number of +complaints received. -For this example we will use the `salmonella` data set, which can be read in -via the code -```{r} -salmonella <- read.csv("..\\data\\salmonella.csv") -``` +The response variable that we will use is `complaints`, an integer count of the +number of complaints received. It is expected that the number of complaints will +scale by the number of visits (contained in the `visits` column), so we are +modelling the rate of complaints per visit - thus we will need to include a new +variable `log.visits` as an offset. + +The three explanatory variables we will use in the analysis are: -This data set appeared in Breslow (1984) where Poisson methods were developed -for analysis. The data record the number of colonies of TA98 Salmonella which -are measured for different dosage levels of quinoline (in `mg` per plate) for -three replicate plates for each dosage level. +* `residency`, whether or not the doctor is still in residency training (values + `N` or `Y`); -The number of colonies is the response variable - an integer count; -and we are interested in the relationship between the numbers of -Salmonella colonies and dose. Theory suggests a dose-response curve of the -form -$ y_i = \beta_0 + \beta_1 \log \left(x_i+10\right) + \beta_2x_i $ +* `gender`, the gender of the doctor (values `F` or `M`); and -Thus we have a single explanatory variable (dosage level) which appears twice -in the formula to represent the curved relatioship. +* `revenue`, dollars per hour earned by the doctor, recorded as an integer. + +Our simple aim here is to assess whether the seniority, gender or income of the +doctor is linked with the rate of complaints against that doctor. + +We can use the following code to extract the data we want without having to load +the whole package: + +```{r} +esdcomp <- faraway::esdcomp +``` ## Fitting Bayesian Poisson Regression Models -Again we can use package `MCMCpack` to fit this form of Bayesian generalised -linear model, this time using the function `MCMCpoisson()`. +Again we can use `BayesX` to fit this form of Bayesian generalised +linear model. If not loaded already, the package must be loaded into R: ```{r echo=FALSE} -library(MCMCpack) +library(R2BayesX) ``` -The syntax for fitting a Bayesian Poisson Regression Model with one -response variable and two explanatory variables would be -like so: +In BayesX, the syntax for fitting a Bayesian Poisson Regression Model with one +response variable, three explanatory variables and an offset will be like so: + ```{r eval=FALSE} -results2 <- MCMCpoisson(formula = y~x1+x2, - data = data.set) +model1 <- bayesx(formula = y~x1+x2+x3+offset(w), + data = data.set, + family="poisson") +``` + +As noted above we need to include an offset in this analysis; since +for a Poisson GLM we will be using a log() link function by default, we must +compute the log of the number of visits and include that in the data set +`esdcomp`: + +```{r} +esdcomp$log.visits <- log(esdcomp$visits) ``` -In this example, the first explanatory variable ($x_1) is $\log(x_1+10)$, and -this will need to be calculated before fitting the model. -For a Poisson GLM we will be using a log() link function by default. +The offset term in the model is then written + +`offset(log.visits)` + +in the call to `bayesx`. ## Exercises -* Perform an exploratory assessment of the salmonella data set, -particularly how the number of Salmonella colonies varies with -the dosage of quinoline. - -
Solution - - ```{r fig = TRUE} - # Plot the relationship between number of colonies and dose - plot(salmonella$Dose,salmonella$y) - ``` - -
- -* Fit a Bayesian model using `MCMCpoisson` using the `y` variable (colonies of -salmonella) as -Poisson response and the dose as covariate. Examine the output; does the -model fit well, and is there any statistical -evidence that dosage level is associated with the number of salmonella -colonies? In particular, do the 95% credible intervals for $\beta_1$ and -$\beta_2$ contain zero? - -
Solution - - ```{r fig = TRUE} - # Fit model - salmonella$logDose10 <- log(salmonella$Dose+10) - results2 <- MCMCpoisson(formula = y ~ logDose10 + Dose, - data = salmonella, - burnin = 5000, mcmc = 10000, thin=10, - beta.start = c(NA,NA,NA), # uses maximum likelihood estimates as starting values when NA - b0 = c(0,1,1), - B0 = c(0.0001,0.0001,0.0001), - verbose=1000) - # Summarise output - summary(results2) - ``` - -
+* Perform an exploratory assessment of the emergency room complaints data set, +particularly how the response variable `complaints` varies with the proposed +explanatory variables relative to the number of visits. To do this, create +another variable which is the ratio of `complaints` to `visits`. + +
Solution + + ```{r fig = TRUE} + # Compute the ratio + esdcomp$ratio <- esdcomp$complaints / esdcomp$visits + # Plot the link with revenue + plot(esdcomp$revenue,esdcomp$ratio) + # Use boxplots against residency and gender + boxplot(esdcomp$ratio ~ esdcomp$residency) + boxplot(esdcomp$ratio ~ esdcomp$gender) + ``` + +
+ +* Fit a Bayesian model in BayesX using the `complaints` variable as Poisson +response and the others as covariates. Examine the output; does the model fit +well, and is there any evidence that any of the explanatory variables are +associated with the rate of complaints? + +
Solution + + ```{r fig = TRUE} + # Fit model - note similarity with bayesx syntax + esdcomp$log.visits <- log(esdcomp$visits) + bayesx.output <- bayesx(formula = complaints ~ offset(log.visits) + residency + gender + revenue, + data = esdcomp, + family = "poisson") + # Summarise output + summary(bayesx.output) + ``` + +
* Produce plots of the MCMC sample traces and the estimated posterior distributions for the model parameters. Does it seem like convergence has been achieved? -
Solution - - ```{r fig = TRUE, fig.width = 7, fig.height = 9} - # Traces can be obtained separately - par(mfrow =c(2, 2)) - traceplot(results2) - ``` - - ```{r fig = TRUE, fig.width = 7, fig.height = 7} - # And the density plots - par(mfrow =c(2, 2)) - densplot(results2) - ``` - -
- -* Plot the curved relationship from the fitted model, showing how Salmonella -colonies are predicted to vary with dose. Add the 95% prediction intervals; -do the intervals look wide enough, that is, do they capture most of the data -points? -
Solution - - ```{r fig = TRUE} - plot(salmonella$Dose,salmonella$y,ylab="Number of colonies") - x.predict <- seq(0,1000,length=1001) - beta.0 <- summary(results2)[[1]]["(Intercept)","Mean"] - beta.1 <- summary(results2)[[1]]["logDose10","Mean"] - beta.2 <- summary(results2)[[1]]["Dose","Mean"] - y.predict <- exp(beta.0 + beta.1 * log(x.predict+10) + beta.2 * x.predict) - lines(x.predict,y.predict) - # Now let's plot the prediction intervals - here we need to use the full set of MCMC samples - x.pred.matrix <- as.matrix(x.predict) # Needed to use apply() - y.pred.matrix <- apply(x.pred.matrix,1,function(x){results2[,"(Intercept)"]+results2[,"logDose10"]*log(x+10)+results2[,"Dose"]*x}) - y.predict.lower <- exp(apply(y.pred.matrix,2,function(x){quantile(x,0.025)})) - lines(x.predict,y.predict.lower,lty=2) - x.pred.matrix <- as.matrix(x.predict) # Needed to use apply() - y.pred.matrix <- apply(x.pred.matrix,1,function(x){results2[,"(Intercept)"]+results2[,"logDose10"]*log(x+10)+results2[,"Dose"]*x}) - y.predict.upper <- exp(apply(y.pred.matrix,2,function(x){quantile(x,0.975)})) - lines(x.predict,y.predict.upper,lty=2) - ``` - -
- +
Solution + + ```{r fig = TRUE, fig.width = 5, fig.height = 10} + # An overall plot of sample traces and density estimates + # plot(samples(bayesx.output)) + # Traces can be obtained separately + plot(bayesx.output,which = "coef-samples") + ``` + + ```{r fig = TRUE} + # And the density plots one-by-one + par(mfrow =c(2, 2)) + plot(density(samples(bayesx.output)[, "residencyY"]), main = "Residency") + plot(density(samples(bayesx.output)[, "genderM"]), main = "Gender") + plot(density(samples(bayesx.output)[, "revenue"]), main = "Revenue") + ``` + +
+ * Fit a non-Bayesian model using `glm()` for comparison. How do the model fits compare? -
Solution - - ```{r fig = TRUE} - # Fit model - note similarity with MCMCpack syntax - glm.output <- glm(formula = y ~ logDose10 + Dose, - data = salmonella, - family = "poisson") - # Summarise output - summary(glm.output) - # Perform ANOVA on each variable in turn - drop1(glm.output, test = "Chisq") - ``` - -
- -## Reference - -Breslow, N. E. (1984). Extra-Poisson Variation in Log-Linear Models. Journal of the Royal Statistical Society. Series C (Applied Statistics), 33(1), 38–44. https://doi.org/10.2307/2347661 + +
Solution + + ```{r fig = TRUE} + # Fit model - note similarity with bayesx syntax + esdcomp$log.visits <- log(esdcomp$visits) + glm.output <- glm(formula = complaints ~ offset(log.visits) + residency + gender + revenue, + data = esdcomp, + family = "poisson") + # Summarise output + summary(glm.output) + # Perform ANOVA on each variable in turn + drop1(glm.output, test = "Chisq") + ``` + +
diff --git a/vignettes/p7.Rmd b/vignettes/p7.Rmd index 0a3fade..f204baf 100644 --- a/vignettes/p7.Rmd +++ b/vignettes/p7.Rmd @@ -21,13 +21,13 @@ models. As explained in the previous lecture, hierarchical models provide a convenient tool to define models so that the different sources of variation in the data are clearly identified. Bayesian inference for highly structured hierarchical models can be difficult and may require the use of Markov chain -Monte Carlo methods. However, packages such as `BayesX` and `INLA`, to mention -just two, provide a very convenient way to fit and make inference about complex -Bayesian hierarchical models. +Monte Carlo methods. -Regarding software, we will use `MCMCpack` to fit the models in the fairly -straightforward examples in this Practical - more complex models would require -another package. +However, packages such as `BayesX` and `INLA`, to mention +just two, provide a very convenient way to fit and make inference about certain +types of Bayesian hierarchical models. We will continue using BayesX, although +you can try out using `INLA` if you look at the optional, additional material +in Practical 8. # Linear Mixed Models @@ -75,21 +75,13 @@ summary(nlschools) The model to fit will take `lang` as the response variable and include `IQ`, `GS`, `SES` and `COMB` as covariates (i.e., fixed effects). This model -can easily be fitted with `MCMCregress` as follows: +can easily be fit with `BayesX` (using the `R2BayesX` package) as follows: ```{r message = FALSE, warning = FALSE} -library(MCMCpack) -burnin <- 5000 -mcmc <- 10000 -thin <- 10 -results1 <- MCMCregress(formula = lang ~ IQ + GS + SES + COMB, - data = nlschools, - b0 = 0, B0 = 0.0001, - c0 = 2, d0 = 2, # Because the prior is Ga(c0/2,d0/2), - beta.start = c(1,1), - burnin=burnin, mcmc=mcmc, thin=thin, - verbose=1000) -summary(results1) +library("R2BayesX") +m1 <- bayesx(lang ~ IQ + GS + SES + COMB, data = nlschools) + +summary(m1) ``` Note that the previous model only includes fixed effects. The data set includes @@ -97,13 +89,15 @@ Note that the previous model only includes fixed effects. The data set includes impact on the performance of the students, with students in the same class performing similarly in the language test. -Very conveniently, the function `MCMChregress` in `MCMCpack" -can include random effects in the model. The argument `group` takes the name -of the grouping variable (in quotes). The fixed effects are written as a -standard model formula using the argument `fixed`, while the argument `random` -(as you might expect!) contains the random effects model formula, this time -using a one-sided formula (as the response variable was already declared -in the fixed model formula). This is best illustrated via example, as below. +Very conveniently, `Bayesx` can include random effects in the model by adding a +term in the right hand side of the formula that defined the model. Specifically, +the term to add is `sx(class, bs = "re")` (see code below for the +full model). +This will create a random effect indexed over variable `class` and which is of +type `re`, i.e., it provides random effects which are independent and +identically distributed using a normal distribution with zero mean and +precision $\tau$ - there are clear similarities with a residual terms here +(although that is at the observation level, rather than the group level). Before fitting the model, the between-class variability can be explored by means of boxplots: @@ -112,126 +106,17 @@ means of boxplots: boxplot(lang ~ class, data = nlschools, las = 2) ``` -The we wish to use has a random effect indexed over grouping variable -`class` and random effects which are independent and identically distributed -using a normal distribution with zero mean and precision $\tau_\phi$. -The function `MCMChregress()` places a Wishart prior on $\tau_\phi$, -but studying this is beyond the scope of this course. For our purposes, -we would like a vague prior for this quantity (similar to the Uniform prior -placed on $\sigma_\phi$ in the lectures), and this equates to placing an -Inverse Wishart prior on $\sigma^2_\phi$. We use a IW(1,1) prior here, to -give the most uninformative prior (via arguments $r$ and $R$ to -`MCMChregress()`). - -The normal priors for the fixed effect regression parameters have means set -by `mubeta` and variances by `Vbeta`; we use fixed values of 0 for each mean -and 0.000001 for each variance. - -Note that in `MCMChregress()`, the argument `verbose` only takes the values `0` -and `1`, representing a switch where a progress bar is shown if the value `1` -is chosen. - -The code to fit the model with a single random effect allowing for a different -mean level per group is: +The code to fit the model with random effects is: ```{r} -results2 <- MCMChregress( - fixed = lang ~ IQ + GS + SES + COMB, - random = ~ 1, - group = "class", - data = nlschools, - #burnin=1000, mcmc=10000, thin=1, verbose=1, - burnin=100, mcmc=1000, thin=1, verbose=1, - #beta.start=0, sigma2.start=1, # Starting values for - mubeta=0, Vbeta=1.0E6, # Mean and variance of prior distributions for each random effect - r=1, R=1 # To give an uninformative prior on sigma_phi +m2 <- bayesx( + lang ~ IQ + GS + SES + COMB + sx(class, bs = "re"), + data = nlschools ) +summary(m2) ``` -If we look at -```{r} -summary(results2) -``` -all we see is a brief description of the stucture of the MCMC output contained -within `results2`. This means that to look at the results, we need to do a -little more work. - -There are two objects in `results2`; `mcmc` contains the MCMC simulations, and -`Y.pred` contains predictions for the response variable, `lang` in this -example. - -Doing -```{r} -summary(results2$mcmc) -``` -gives summary statistics for *everything* in the model - including all of the -random effects. Most often we want to study the fixed effects; we can select -*only* those terms using the somewhat clunky but powerful code below,. -making use of `grepl`, a function for matching regular expressions: -```{r} -summary(results2$mcmc[ , grepl("beta", colnames(results2$mcmc), fixed=TRUE)]) -``` -This works because all the names of the fixed effects (including the overall -Intercept) contain the string `beta`. - -We are also interested in the variance terms for the random effects and for -the residuals - the relative sizes of these in informative on whether there is -more variability within the groups or between the groups. - -For the residual variance, we use -```{r} -summary(results2$mcmc[ , grepl("sigma2", colnames(results2$mcmc), fixed=TRUE)]) -``` -and for the random effects variance, we use -```{r} -summary(results2$mcmc[ , grepl("VCOV", colnames(results2$mcmc), fixed=TRUE)]) -``` -The random effect variance term looks more complicated because for more -complex models (with more than a single random effect term) it will be a matrix. - -In this example, we can see that the random effect variance is much larger than -the residual variance, suggesting that there is more variability between the -groups than within the groups. - -Finally, we can study some trace plots and some density plots. - -```{r fig = TRUE, fig.width = 7, fig.height = 9} - par(mfrow=c(3,2)) - ts.plot(results2$mcmc[,"beta.(Intercept)"], ylab = "beta.Intercept", xlab = "Iteration") - ts.plot(results2$mcmc[,"beta.IQ"], ylab = "beta.IQ", xlab = "Iteration") - ts.plot(results2$mcmc[,"beta.GS"], ylab = "beta.GS", xlab = "Iteration") - ts.plot(results2$mcmc[,"beta.SES"], ylab = "beta.SES", xlab = "Iteration") - ts.plot(results2$mcmc[,"beta.COMB1"], ylab = "beta.COMB1", xlab = "Iteration") -``` - -```{r fig = TRUE, fig.width = 7, fig.height = 9} - par(mfrow=c(3,2)) - ts.plot(results2$mcmc[,"VCV.(Intercept).(Intercept)"], ylab = "RE Variance", xlab = "Iteration") - ts.plot(results2$mcmc[,"sigma2"], ylab = "Residual Variance", xlab = "Iteration") -``` - -```{r fig = TRUE, fig.width = 7, fig.height = 9} - par(mfrow=c(3,2)) - plot(density(results2$mcmc[,"beta.(Intercept)"], bw = 5), - main = "Posterior density for beta.Intercept") - plot(density(results2$mcmc[,"beta.IQ"], bw = 0.2), - main = "Posterior density for beta.IQ") - plot(density(results2$mcmc[,"beta.GS"], bw = 0.75), - main = "Posterior density for beta.GS") - plot(density(results2$mcmc[,"beta.SES"], bw = 0.2), - main = "Posterior density for beta.SES") - plot(density(results2$mcmc[,"beta.COMB1"], bw = 5), - main = "Posterior density for beta.COMB1") -``` - -```{r fig = TRUE, fig.width = 7, fig.height = 9} - par(mfrow=c(3,2)) - plot(density(results2$mcmc[,"VCV.(Intercept).(Intercept)"], bw = 0.5), - main = "Posterior density for RE Variance") - plot(density(results2$mcmc[,"sigma2"], bw = 0.5), - main = "Posterior density for Residual Variance") -``` # Generalised Linear Mixed Models @@ -247,106 +132,169 @@ Compared to the previous setting of linear mixed effects models, note that now the distribution of the response could be other than Gaussian and that observations are not necessarily nested within groups. -# Poisson GLMM - overdispersion +# Poisson regression + +In this practical we will use the North Carolina Sudden Infant Death Syndrome +(SIDS) data set. It is available in the `spData` package and it can be loaded +using: + +```{r message = FALSE, warning = FALSE} +library(spData) +data(nc.sids) +summary(nc.sids) +``` + +A full description of the data set is provided in the associated manual page +(check with `?nc.sids`) but in this practical we will only consider these +variables: + +* `BIR74`, number of births (1974-78). + +* `SID74`, number of SID deaths (1974-78). + +* `NWBIR74`, number of non-white births (1974-78). + +These variables are measured at the county level in North Carolina, of which +there are 100. + +Because `SID74` records the number of SID deaths, the model is Poisson: + +$$ +O_i \mid \mu_i \sim Po(\mu_i),\ i=1,\ldots, 100 +$$ +Here, $O_i$ represents the number of cases in county $i$ and $\mu_i$ the mean. +In addition, mean $\mu_i$ will be written as $\mu_i = E_i \theta_i$, where $E_i$ +is the *expected* number of cases and $\theta_i$ the relative risk in county $i$. + +The relative risk $\theta_i$ is often modelled, on the log-scale, to be equal +to a linear predictor: + +$$ +\log(\theta_i) = \beta_0 + \ldots +$$ + +The expected number of cases is computed by multiplying the number of births in +county $i$ to the overall mortality rate -In the previous practical we looked at the Salmonella example with a Poisson -GLM, and we saw evidence that a standard GLM was not adequate to describe the -amount of variability in the data set, in terms of the number of colonies. +$$ +r = \frac{\sum_{i=1}^{100}O_i}{\sum_{i=1}^{100}B_i} +$$ +where $B_i$ represents the total number of births in country $i$. Hence, +the expected number of cases in county $i$ is $E_i = r B_i$. -In this practical we will fit a Poisson GLMM as a mechanism for introducing -extra variation into the model; another way of thinking of this is that we -will be including a term in the model to deal with *overdispersion*. -If you need to load the data set in again: ```{r} -salmonella <- read.csv("..\\data\\salmonella.csv") +# Overall mortality rate +r74 <- sum(nc.sids$SID74) / sum(nc.sids$BIR74) +# Expected cases +nc.sids$EXP74 <- r74 * nc.sids$BIR74 ``` -If not loaded already, the package must be loaded into R: -```{r echo=FALSE} -library(MCMCpack) +A common measure of relative risk is the *standardised mortality ratio* +($O_i / E_i$): + +```{r} +nc.sids$SMR74 <- nc.sids$SID74 / nc.sids$EXP74 ``` -If not loaded already, the package must be loaded into R: -```{r echo=FALSE} -library(R2BayesX) +A summary of the SMR can be obtained: + +```{r fig = TRUE} +hist(nc.sids$SMR, xlab = "SMR") ``` -A Poisson GLMM to deal with overdispersion simply has a separate random effect -term $\lambda_i$ for each data point; hence our model becomes: -$ y_i = \beta_0 + \beta_1 \log \left(x_i+10\right) + \beta_2x_i + \lambda_i$ -where -$\lambda_i\sim\mbox{N}\left(0,\sigma^2)$. +Values above 1 indicate that the county has more observed deaths than expected +and that there might be an increased risk in the area. + + +As a covariate, we will compute the proportion of non-white births: -`MCMCpack` places an Inverse Wishart prior distribution on the random effect -distribution variance term; we need to specify the variance matrix, which in -our case will be a 1 by 1 matrix. +```{r} +nc.sids$NWPROP74 <- nc.sids$NWBIR74/ nc.sids$BIR74 +``` + +There is a clear relationship between the SMR and the proportion of non-white +births in a county: + +```{r fig = TRUE} +plot(nc.sids$NWPROP74, nc.sids$SMR74) + +# Correlation +cor(nc.sids$NWPROP74, nc.sids$SMR74) +``` -Now the model fitting: -```{r fig=TRUE} + +A simple Poisson regression can be fit by using the following code: + +```{r} +m1nc <- bayesx( + SID74 ~ 1 + NWPROP74, + family = "poisson", + offset = log(nc.sids$EXP74), + data = nc.sids +) +summary(m1nc) +``` + +Random effects can also be included to account for intrinsic differences between +the counties: + +```{r} # Index for random effects -salmonella$ID <- 1:nrow(salmonella) -salmonella$ID -ID <- salmonella$ID - -# Fit the model -salmonella$logDose10 <- log(salmonella$Dose+10) -results3 <- bayesx(y ~ logDose10 + Dose + sx(ID, bs = "re"), - family="poisson", data = salmonella) -summary(results3) -plot(results3) - -# Fit the model -# salmonella$log.Dose10 <- log(salmonella$Dose+10) -# results3 <- MCMChpoisson(fixed = y ~ log.Dose10 + Dose, -# random = ~ 1, -# group = "ID", -# data = salmonella, -# burnin = 5000, mcmc = 10000, thin=10, -# # beta.start = c(NA,NA,NA), # uses maximum likelihood estimates as starting values when NA -# mubeta = c(0,1,1), -# Vbeta = 0.0001, -# R = diag(1,1), -# FixOD=1, verbose=1) -# # Summarise output -# summary(results3) +nc.sids$ID <- 1:nrow(nc.sids) + +# Model WITH covariate +m2nc <- bayesx( + SID74 ~ 1 + NWPROP74 + sx(ID, bs = "re"), + family = "poisson", + offset = log(nc.sids$EXP74), + data = as.data.frame(nc.sids) +) +summary(m2nc) ``` -## Exercises - -* Use BayesX - -```{r fig = TRUE, echo=TRUE} - plot(salmonella$Dose,salmonella$y)#,ylab="Number of colonies") - x.predict <- seq(0,1000,length=1001) - beta.0 <- coef(results3)["(Intercept)","Mean"] - beta.0 - beta.1 <- coef(results3)["logDose10","Mean"] - beta.2 <- coef(results3)["Dose","Mean"] - y.predict <- exp(beta.0 + beta.1 * log(x.predict+10) + beta.2 * x.predict) - lines(x.predict,y.predict) - # Now let's plot the prediction intervals - here we need to use the full set of MCMC samples, and add on a contribution for the random effect variance - beta.samples <- samples(results3) # Extract MCMC samples for the betas - RE.sd <- sqrt(mean(samples(results3, term = "sx(ID):re")[,"Var"])) # RE standard deviation - x.pred.matrix <- as.matrix(x.predict) # Needed to use apply() - y.pred.matrix <- apply(x.pred.matrix,1,function(x){beta.samples[,"Intercept"]+beta.samples[,"logDose10"]*log(x+10)+beta.samples[,"Dose"]*x}) - y.predict.lower <- apply(y.pred.matrix,2,function(x){quantile(x,0.025)}) - y.predict.lower <- exp(y.predict.lower)#-1.96*RE.sd) - lines(x.predict,y.predict.lower,lty=2) - y.predict.upper <- apply(y.pred.matrix,2,function(x){quantile(x,0.975)}) - y.predict.upper <- exp(y.predict.upper)#+1.96*RE.sd) - lines(x.predict,y.predict.upper,lty=2) +We can see the fitted relationship with `NWPROP74`: +```{r fig = TRUE} +x.predict <- seq(0,1,length=1000) +y.predict <- exp(coef(m2nc)["(Intercept)","Mean"]+coef(m2nc)["NWPROP74","Mean"]*x.predict) +par(mfrow = c(1, 1)) +plot(nc.sids$NWPROP74, nc.sids$SMR74) +lines(x.predict,y.predict) ``` +The role of the covariate can be explored by fitting a model without it: +```{r} +# Model WITHOUT covariate +m3nc <- bayesx( + SID74 ~ 1 + sx(ID, bs = "re"), + family = "poisson", + offset = log(nc.sids$EXP74), + data = as.data.frame(nc.sids) +) + +summary(m3nc) +``` + +Now, notice the decrease in the estimate of the precision of the random effects +(i.e., the variance increases). This means that values of the random effects +are now larger than in the previous case as the random effects pick some of the +effect explained by the covariate. + + +```{r fig = TRUE} +par(mfrow = c(1, 2)) +boxplot(m2nc$effects$`sx(ID):re`$Mean, ylim = c(-1, 1), main = "With NWPROP74") +boxplot(m3nc$effects$`sx(ID):re`$Mean, ylim = c(-1, 1), main = "Without NWPROP74") +``` + # Further Extensions Spatial random effects can be defined not to be independent and identically distributed. Instead, spatial or temporal correlation can be considered when -defining them. With spatial area data sets for example, it is common -to consider that two areas that are neighbours (i.e., share a boundary) -will have some correlation between them, This can be taken into account in -GLMMs, +defining them. For example, in the North Carolina SIDS data set, it is common +to consider that two counties that are neighbours (i.e., share a boundary) +will have similar relative risks. This can be taken into account in the model but assuming that the random effects are spatially autocorrelated. This is out of the scope of this introductory course but feel free to ask about this!!