---
title: "Modelling Mondays"
author: "Argyris Stringaris"
date: "05-06-2024"
format: 
  html:
    toc: true
    toc-depth: 3
    toc-title: Table of contents
editor: visual
---

# Week 1: The Likelihood Function

## Motivation

What is a Data Generating Process (DGP), and what is a likelihood function?

Typically, we think of a DGP as a mathematical formula that gives rise to a distribution. For example, the IQ curve can be generated through the Gaussian, named after Carl Friedrich Gauß, who is very much worth reading about, also in the novel *Measuring the World* by Daniel Kehlmann (where the parallel lives of Gauß and Humboldt are presented).

![](Carl_Friedrich_Gauss_1840_by_Jensen.jpg)

![](measuring_the_world.jpg)

But in a more abstract way, the question is: what are the mechanisms through which a set of data is generated, be it voting patterns, brain data or league games?

Consider, for example, a sample of the general population filling in a questionnaire about depression. Figure 1a shows a typical pattern, that of a right-skewed truncated distribution. The "mechanism" that gives rise to the right skew is that far more people have few symptoms than many, so most scores sit close to the zero mark. The distribution is also truncated because scores can't go below zero and can't go above the maximum sum score of the scale. By contrast, Figure 1b shows the depression scores of a clinical population.

```{r echo = F, warning = F, message = F }
library(stevemisc)
library(tidyverse)

samples <- c(10^2, 10^3, 10^4, 10^5, 10^6, 5*10^6, 10*10^6)
mean_a <- 4.9 
sd <- 4.49
mean_b <- 12


df_pop <- data.frame(

values = rbnorm(samples[4], mean_a, sd, 0, 26, round = TRUE, seed = 1974), 

origin = rep("general population", samples[4])
)

df_comm <- data.frame(

values = rbnorm(samples[4], mean_b, sd, 0, 26, round = TRUE, seed = 1974),

origin = rep("clinical population", samples[4])
)


df_gen_vs_clin <- rbind(df_pop, df_comm)


df_gen_vs_clin %>% 
ggplot(aes(x = values, fill= origin))+
  geom_histogram(position = "identity", alpha = 0.4, bins = 40)+
  ggtitle ("Two Data Generating Processes",
           subtitle = "Score on a Depression Questionnaire")
  
  
```

In general, we always want to consider the DGP so as to:

a\) understand what gives rise to the data.

b\) mathematically describe (at least) how the data arise.

c\) estimate parameters (related to b)

d\) simulate the process to study it better.

## The omniscient person: knowing the DGP and the correct parameter.

This is someone who knows the function and its parameters, and is therefore certain about the DGP. Let's say that they know that they are dealing with the normal distribution, which is formalised as:

$$
f(x \mid \theta, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \cdot e^{-\frac{(x - \theta)^2}{2\sigma^2}}
$$ {#eq-1}

where *x* is the point at which the probability density function is evaluated, $\theta$ is the mean (location parameter) of the normal distribution, and $\sigma$ is the standard deviation (spread parameter).

```{r echo = F, warning = F, message = F}
# this is a function for the pdf, you can get it through dnorm.
pdf_normal <- function(x, theta, sigma) {
  density <- 1 / (sqrt(2 * pi * sigma^2)) * exp(-((x - theta)^2) / (2 * sigma^2))
  return(density)
}


x <- seq(from = 40, to = 180,by = 1) #giving it a typical range for IQs
theta <- 100 # we know this as the mean
sigma <- 15 # the sd

pdf_value <- pdf_normal(x, theta, sigma)

df_values_and_parameter <- data.frame(x, pdf_value)
df_values_and_parameter %>% 
  ggplot(aes(x, pdf_value))+
geom_point()


# sum(pdf_value[x < 100]) # check what this sums up to

```

You will all recognise this as the standard IQ curve.

Note that Equation 1 phrases the situation as:

$$f(x \mid \theta, \sigma)$$

i.e. we ask how probable the data are, given the parameters $\theta$ and $\sigma$.

The situation where you are certain about the correct parameter and only need to know the frequency of individual values or set of values is a very convenient one to be in. Often however, in the real world we may have an intuition about what the DGP might be but not know the parameter(s). That is when we ask about the likelihood.
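As a small sketch of that convenient situation, and assuming the IQ parameters used above (mean 100, SD 15), base R's `pnorm()` gives the share of people expected in a range of values. The variable names are only illustrative.

```{r}
# Assumed parameters, taken from the IQ example above
iq_mean <- 100
iq_sd <- 15

# Probability that an IQ falls between 85 and 115 (within one SD of the mean)
p_within_1sd <- pnorm(115, mean = iq_mean, sd = iq_sd) -
  pnorm(85, mean = iq_mean, sd = iq_sd)
p_within_1sd

# Probability that an IQ is above 130 (more than two SDs above the mean)
p_above_130 <- pnorm(130, mean = iq_mean, sd = iq_sd, lower.tail = FALSE)
p_above_130
```

Because the parameters are known, no estimation is involved: these are direct statements about how often such values should occur.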

## A real person: having data, intuiting the DGP, and not knowing the parameter.

Consider having collected some data, having some intuition about the DGP, and needing to find the parameter among a set of candidate parameters.

This is a more likely situation, which I will illustrate here by trying to recover the mean parameter from synthetic data.

For the moment it is safe to say that what we are trying to do here is to invert the process above, i.e. what you do with the probability density function. Instead of asking what data are likely to occur given parameters (such as the mean and SD) that you *already* know, here you ask: what is the most likely parameter to have given rise to the data I have?

$$
L(\mu, \sigma^2 | x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)
$$ {#eq-2}

Equation 2 states precisely that: what is the likelihood of this mean and variance, given all these data points? (Here $\mu$ plays the role that $\theta$ played in Equation 1.) The right-hand side of Equation 2 contains the PDF, as above, but it evaluates the density at each data point and multiplies these values all together; this is what that giant Greek $\Pi$ stands for, the product.

Notice that when I tried this with fewer data points, I was able to compute the likelihood directly, but when I increased them, I needed the natural log: the product of many small densities becomes too small for the computer to represent and is rounded to zero, whereas the sum of their logs stays finite. Try it for yourself.
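To see why the log is needed, here is a minimal sketch that assumes synthetic IQ data of the kind used in these notes (mean 100, SD 15). The names `iq_sample`, `lik_10`, `lik_1000` and `loglik_1000` are illustrative only.

```{r}
set.seed(1974)
iq_sample <- rnorm(1000, mean = 100, sd = 15) # illustrative synthetic IQ scores

# Raw likelihood at mu = 100: a product of many small densities
lik_10 <- prod(dnorm(iq_sample[1:10], mean = 100, sd = 15)) # still representable
lik_1000 <- prod(dnorm(iq_sample, mean = 100, sd = 15))     # underflows to 0

# Log-likelihood: a sum of logs, which stays finite
loglik_1000 <- sum(dnorm(iq_sample, mean = 100, sd = 15, log = TRUE))

c(lik_10 = lik_10, lik_1000 = lik_1000, loglik_1000 = loglik_1000)
```

With ten observations the product is tiny but non-zero; with a thousand it is rounded to exactly zero, so every candidate parameter would look equally (un)likely. The log-likelihood avoids this.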

```{r echo = F, warning = F, message = F}
# I am using the above pdf to define the log-likelihood function 

log_likelihood_normal <-  function(x, theta, sigma) {
  log_pdf <- log(1 / (sqrt(2 * pi * sigma^2)) * exp(-((x - theta)^2) / (2 * sigma^2)))
  log_likelihood <- sum(log_pdf)  # Log-likelihood for the entire dataset
  return(log_likelihood)
}


#I could also have done this. Note what happens if you leave out the log...
# log_likelihood_normal <- function(mu, x) {
#   log_pdf <- dnorm(x, mean = mu, sd = 15, log = TRUE)  # Log of probability density function
#   log_likelihood <- sum(log_pdf)  # Log-likelihood for the entire dataset
#   return(log_likelihood)
# }
# 

# I am making synthetic iq data
set.seed(1974)  
data <- rnorm(1000, mean = 100, sd = 15) 

# log likelihood for the different mean values
mu_values <- seq(40, 160, by = 0.1)  # Range of mu values to test
log_likelihood_values <- sapply(mu_values, function(mu) log_likelihood_normal(data, mu, 15))

# This gives me the MLE
mle <- mu_values[which.max(log_likelihood_values)] # close to 100!


df_log_likelihood <- data.frame(mu_values, log_likelihood_values)
df_log_likelihood %>% 
  ggplot(aes(x = mu_values, y = log_likelihood_values))+
  geom_point()+
  geom_vline(xintercept = mle, colour = "red")+
  ggtitle("A Likelihood Function for IQ")+
  ylab("Log-Likelihood")+
  xlab("Mean IQ points")

```
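For a normal model with known $\sigma$, the value of the mean that maximises the likelihood is the sample mean, so the grid search can be checked against it. The sketch below assumes $\sigma$ is fixed at 15, as in the chunk above, and uses base R's `optimize()` in place of a grid; `iq_check` and `loglik_mu` are illustrative names.

```{r}
set.seed(1974)
iq_check <- rnorm(1000, mean = 100, sd = 15) # synthetic IQ data, same recipe as above

# Log-likelihood of mu, with sigma assumed known (15)
loglik_mu <- function(mu) sum(dnorm(iq_check, mean = mu, sd = 15, log = TRUE))

# Numerical maximisation over the same range as the grid
opt <- optimize(loglik_mu, interval = c(40, 160), maximum = TRUE)

c(numerical_mle = opt$maximum, sample_mean = mean(iq_check))
```

The two numbers should agree up to the tolerance of the optimiser, which is a useful sanity check on any hand-written likelihood function.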

## Likelihood for the linear regression model

Now let's turn to the simple linear regression model. Let's start by asking how to think formally of the data generating mechanism of any linear model. It can be written as:

$$
y_i = \beta_0 + \beta_1 x_i + \epsilon_i 
$$ {#eq-5}

where $\epsilon_i$ follows a normal distribution with mean zero and variance $\sigma^2$. The corresponding likelihood is:

$$
L(\beta_0, \beta_1 | x_1, y_1, x_2, y_2, \ldots, x_n, y_n) = \prod_{i=1}^{n} f(y_i | \beta_0 + \beta_1 x_i)
$$ {#eq-6}

where

$$
f(y_i \mid \beta_0 + \beta_1 x_i)
$$ {#eq-7}

is the probability density function (PDF) of the normal distribution with mean $\mu_i = \beta_0 + \beta_1 x_i$ and constant variance $\sigma^2$.

To demonstrate this, I will first create synthetic data and then evaluate the log-likelihood over a grid of candidate values for $\beta_0$ and $\beta_1$.

```{r echo = F, warning = F, message = F}
# Some synthetic data
set.seed(1974)  
n <- 100  # obs
x <- rnorm(n, mean = 5, sd = 2)  # predictor
epsilon <- rnorm(n, mean = 0, sd = 1)  # error 
beta0 <- 2  # intercept arbitrary
beta1 <- 0.6  # slope also arbitrary
y <- beta0 + beta1 * x + epsilon  # simulate the dependent variable

# Define the log-likelihood function for simple linear regression
stan_data_slope <- function(beta0, beta1, x, y) {
  mu <- beta0 + beta1 * x  # predicted values
  # I have used dnorm to save space; I could equally have used the
  # spelt-out function that I created above. The error sd is fixed at 1.
  log_likelihood <- sum(dnorm(y, mean = mu, sd = 1, log = TRUE))
  return(log_likelihood)
}

# The log-likelihood for different values of beta0 and beta1
# REMIND ME TO TELL YOU ABOUT THE PLAUSIBLE RANGE!
beta0_values <- seq(0, 4, by = 0.1)
beta1_values <- seq(0, 1, by = 0.01)

# outer() evaluates the function for every combination of the two parameter
# vectors; Vectorize() is needed because the function expects one value of
# each parameter at a time. This can be mind-boggling, so I have created an
# example with a simpler input and function below.
log_likelihood_values <- outer(beta0_values, beta1_values,
                               Vectorize(function(b0, b1) stan_data_slope(b0, b1, x, y)))

# The MLE for beta0 and beta1: row index = beta0, column index = beta1
max_indices <- which(log_likelihood_values == max(log_likelihood_values), arr.ind = TRUE)
mle_beta0 <- beta0_values[max_indices[1, "row"]]
mle_beta1 <- beta1_values[max_indices[1, "col"]]

# Plot log-likelihood surface A PAIN!
# library(plot3D)
# persp3D(beta0_values, beta1_values, log_likelihood_values, xlab = "Beta0", ylab = "Beta1", zlab = "Log-Likelihood",
#         main = "Log-Likelihood Surface for Simple Linear Regression")
# points3D(mle_beta0, mle_beta1, max(log_likelihood_values), col = "red", pch = 16)
# 

```
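With normally distributed errors, the coefficients that maximise the likelihood coincide with the ordinary least squares estimates, so `lm()` offers an independent check on the grid search. The sketch below regenerates data with the same recipe as above (true intercept 2, slope 0.6, error SD 1) under new, illustrative names, and uses `optim()` rather than a grid.

```{r}
set.seed(1974)
n_obs <- 100
x_sim <- rnorm(n_obs, mean = 5, sd = 2)                    # predictor
y_sim <- 2 + 0.6 * x_sim + rnorm(n_obs, mean = 0, sd = 1)  # outcome

# Negative log-likelihood of (intercept, slope); error SD assumed known (1)
neg_loglik <- function(par) {
  -sum(dnorm(y_sim, mean = par[1] + par[2] * x_sim, sd = 1, log = TRUE))
}

# Maximise the likelihood numerically (optim minimises, hence the minus sign)
fit_ml <- optim(par = c(0, 0), fn = neg_loglik, method = "BFGS")

# Ordinary least squares for comparison
fit_ols <- coef(lm(y_sim ~ x_sim))

rbind(maximum_likelihood = fit_ml$par, least_squares = unname(fit_ols))
```

The two rows should match closely; the grid-based estimates can only be as precise as the grid spacing (0.1 for the intercept and 0.01 for the slope).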

In the code chunk below, I explain how the outer product and vectorisation work.

```{r warning = F, message = F , echo = F}
# Show the above with a simple function
simple_function <- function(a, b) {
  return((a + b)^2)
}

# Create input vectors
input_vector_a <- c(1, 2, 3)
input_vector_b <- c(10, 20, 30)

# Apply the function to all combinations of elements from the input vectors
result_matrix <- outer(input_vector_a, input_vector_b, Vectorize(simple_function))

# Rows correspond to input_vector_a, columns to input_vector_b
result_matrix


```

# Week 2: The Binomial Distribution

Before we move on to more complex models, let's consider the binomial distribution (the Bernoulli distribution is its special case for a single trial). Here are two situations. In the first one, you are a Creator (let's say a game creator, rather than The Creator); in the other you are a detective.

## The binomial from a creator's point of view.

Let us assume that you are trying to create a game for which you must create sequences of binary events, let's say decisions between a state *H* and a state *T*. Basically, you want either of these two to appear with probability $p = 0.5$, i.e. each has a 50% chance of appearing.

I want to take us back to something which whilst obvious, is often forgotten, namely that probabilities are things that we can understand "in the long run".

Here is what I mean. The way to create the game above is to invoke the binomial distribution. This is the following:

$$
f(k;n,p) = \binom{n}{k} p^k (1-p)^{n-k}
$$ {#eq-3}

where $\binom{n}{k}$ is the binomial coefficient (explained next), followed by the probabilities.

Let's explain this. The binomial coefficient basically says: if I have *n* objects and want to choose *k* of them, in how many ways can this be done? Think of the following example. I have the letters ABCD; how many ways are there to combine two letters (sequence doesn't matter, e.g. AB = BA)? There are 6 possible ways: AB, AC, AD, BC, BD, CD. To play around with it, look at the code below.

```{r echo = F, warning = F, message  = F}
some_objects <- c("blue", "red", "green", "black") # this the total set of possible objects, four colours in this case. 
n <- length(some_objects) # here I just get the total number of colours
k <- 0: n # here I create the vector of possible numbers to choose, anything from nothing to the maximum, i.e. 0,1,2,3,4
how_many_ways <- choose(n, k) # this now gives us the number of ways that k objects can be chosen out of n; play around with it and the above numbers
ways_to_choose <- paste("We have" , how_many_ways , "way(s) to choose", k , "objects out of", n)
print(ways_to_choose)
```
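To list the combinations themselves rather than only count them, base R's `combn()` can be used. This is a small sketch for the ABCD example; `letters_abcd`, `pairs_matrix` and `pairs_listed` are illustrative names.

```{r}
letters_abcd <- c("A", "B", "C", "D")

# All unordered pairs: each column of the combn() result is one pair
pairs_matrix <- combn(letters_abcd, 2)
pairs_listed <- apply(pairs_matrix, 2, paste, collapse = "")
pairs_listed

# The count agrees with the binomial coefficient
length(pairs_listed) == choose(4, 2)
```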

In Equation 3, this quantity is then multiplied by the product $p^k \times (1-p)^{n-k}$. This product is the probability of one particular sequence containing *k* successes and *n - k* failures, for a success probability *p*. If you substitute the numbers 1 to 4 for *k* (representing possible numbers of Heads in 4 coin tosses), you get

$p^1 \times (1-p)^{4-1}$

$p^2 \times (1-p)^{4-2}$

$p^3 \times (1-p)^{4-3}$

$p^4 \times (1-p)^{4-4}$

Each of these is then multiplied by the number of ways *k* objects can be chosen out of *n* total objects (e.g. the number of ways of getting 4 Heads in 10 throws).

A priori, which one of these outcomes would you expect to be more likely for a fair coin?

```{r echo = F, warning = F, message  = F}

p <- 0.5
n <- 4
k <- 1:4
the_product<-0
the_ways_to_choose <- 0
multipl_the_two <- 0

outcomes <- 0
for(i in 1:n){

  the_product[i] <- p^k[i] * (1-p)^(n-k[i]) # p^k*(1-p)^(n-k)
  
  the_ways_to_choose[i] <- choose(n, k[i]) # the coefficient of choosing k items from n
  
  multipl_the_two <- the_product*the_ways_to_choose # this is the formula
  
}
multipl_the_two # these now are the probabilities for each k-value, e.g. that we have 1, 2, 3 or four times Heads.

dbinom(1:4, 4, 0.5) # check whether what I have done above fits with what the R inbuilt function would give you

# play around with the coin

```
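The idea that probabilities are understood "in the long run" can be checked by simulation. The sketch below assumes a fair coin tossed 4 times per game and uses `rbinom()` to play many games; the number of games and the variable names are arbitrary choices for illustration.

```{r}
set.seed(1974)
n_tosses <- 4      # tosses per game
n_games <- 100000  # number of simulated games (the "long run")

# Number of Heads in each simulated game of a fair coin
heads_per_game <- rbinom(n_games, size = n_tosses, prob = 0.5)

# Observed relative frequencies of 0 to 4 Heads, next to the theoretical probabilities
observed <- as.numeric(table(factor(heads_per_game, levels = 0:n_tosses))) / n_games
theoretical <- dbinom(0:n_tosses, size = n_tosses, prob = 0.5)

round(rbind(observed, theoretical), 3)
```

Try reducing `n_games` to 20 or so: the observed frequencies then wander noticeably away from the theoretical ones, which is exactly the point about the long run.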

### Exercise 1.1

Modify the above code to create a game where a coin is tossed 100 times.

a\) Calculate the probabilities for each possible outcome *k* and store them in a vector.

b\) Find the outcome *k* with the maximum probability.

c\) Plot all possible outcomes against their probabilities.

d\) Check against the standard inbuilt R function.

```{r echo = F, warning = F, message = F}

p <- 0.5
n <- 100
k <- 0:n # these are all the possible outcomes, from 0 to 100 Heads

the_product <- p^k * (1 - p)^(n - k) # p^k*(1-p)^(n-k), vectorised over k

the_ways_to_choose <- choose(n, k) # the coefficient of choosing k items from n

multipl_the_two <- the_product * the_ways_to_choose # This is 1a

k[which.max(multipl_the_two)] # this is 1b: the value of k (not its index) with the highest probability

plot(k, multipl_the_two) # this is 1c

all.equal(multipl_the_two, dbinom(k, n, p)) # this is 1d


```

## The binomial for a detective

Now, suppose that you are the new detective in town. Your first case is that of "Nick the Shark", against whom there are several allegations of setting up fraudulent games. All you have to go by is a sheet of paper where all the outcomes of coin tosses were recorded by one of your informants. There were 271 outcomes that were Heads. You are now asked to find out whether the coin was fair or not.

You would ask for the help of a statistician, but they are all away at a big conference and you must appear in front of the judge who decides whether the person can remain in detention or not.

The judge has an exceptional understanding of numbers for a legal person and asks you to prove to her your case that Nick is indeed a swindler, as you say, and not the upright occasionally gambling citizen that the defendant maintains that he is.

You spend the night writing out all outcomes of the coin tosses, all 630 of them.

```{r echo = F, warning = F, message = F}
# Here I have created the log-likelihood function for a binomial
likelihood_binomial <- function(p, k, n) {
  log_pmf <- log(choose(n, k) * p^k * (1 - p)^(n - k))  # log of the PMF
  log_likelihood <- sum(log_pmf)  # summing logs is the same as logging the product
  return(log_likelihood)
}


 k <-271 # Number of Heads in Nick's games
 n <- 630  # Total number of coin tosses in Nick's game

# Calculate the likelihood for different values of p
p_values <- seq(0.1, 0.8, by = 0.01)
likelihood_values <- sapply(p_values, function(p) likelihood_binomial(p, k, n))
p_values[which.max(likelihood_values)]
# now plot
df_likelihood_binomial <- data.frame(p_values, likelihood_values)

df_likelihood_binomial %>% 
  
  ggplot (aes(x = p_values, y = likelihood_values))+
  geom_point()+
  ggtitle("Log-Likelihood for Coin Tosses (Binomial Distribution)")+
  xlab("probabilities")+
  ylab("log-likelihood values")+
  geom_vline(xintercept = p_values[which.max(likelihood_values)], colour = "red")+
  geom_vline(xintercept = 0.5, colour = "blue")
  

```

This is impressive: you have evidence that the parameter that maximises the likelihood is different to 0.5. But the judge gets back at you and says: all this could easily be due to chance. After all, probability is a matter of doing "experiments in the long run".

You are stunned at the unprecedented numeracy of a lawyer. She warns you that she will throw out the case and you will not get the arrest warrant issued due to insufficient evidence.

How can you demonstrate that the difference between the blue and the red line is not simply due to chance?

## The likelihood ratio test

The question is whether the likelihood at 0.5 (the null) is different to the likelihood at what you found to be the maximum likelihood in the observed data. What if you took the ratio of these two?

Indeed, the likelihood ratio test will allow you to answer the numerate judge's question. Because you have taken logs, the problem simplifies to a subtraction (the log of a ratio is the difference of the logs).

$$
\text{Likelihood Ratio} = -2 \times (\text{Log Likelihood}_{0.5} - \text{Log Likelihood}_{ML})
$$ {#eq-4}

Now, you may wonder what that *2* is doing there. Do not worry about it for the moment; its presence allows you to assume that this difference follows a chi-squared distribution. Don't forget the minus sign: the log-likelihood at the maximum is the larger of the two, so without the minus sign the statistic would be negative, whereas the values of the chi-squared distribution are all positive.

```{r echo = F, warning = F, message = F}

# first find the log-likelihood values at 0.5 and at the MLE

lr_0.5 <- likelihood_binomial(0.5, k, n)  # the log-likelihood at the null, which is 0.5; computed directly rather than by matching p_values == 0.5, which relies on exact floating-point equality
lr_ML <- max(df_likelihood_binomial$likelihood_values) # the log-likelihood at the maximum likelihood estimate, i.e. its highest value on the grid above

LLR <- -2 * (lr_0.5 - lr_ML)  # this is the statistic you are looking for, the LLR = log likelihood ratio

df <- 1 # there is only one degree of freedom in this subtraction

alpha <- 0.05 # if you fancy this for a significance value

# now take a chi-squared distribution and find the critical value, i.e. the value on the chi-squared distribution that is significant for this alpha and df. 
critical_value <- qchisq(1 - alpha, df)

# As you can see, our LLR is much bigger than the critical value, hence it is significant. If you want to know the exact p-value

p_value_LLR_test <- pchisq(LLR, df, lower.tail = FALSE) # and this is your value

p_value_LLR_test

```
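The same test can be sketched without the grid. For the binomial, the maximum likelihood estimate has the closed form $k/n$, and `dbinom(..., log = TRUE)` returns the log-likelihood directly. The variable names below are illustrative, and the numbers are those of Nick's games (271 Heads in 630 tosses).

```{r}
heads <- 271  # number of Heads reported by the informant
tosses <- 630 # total number of recorded tosses

p_hat <- heads / tosses # closed-form maximum likelihood estimate of p

# Log-likelihoods at the null (fair coin) and at the estimate
ll_null <- dbinom(heads, size = tosses, prob = 0.5, log = TRUE)
ll_mle <- dbinom(heads, size = tosses, prob = p_hat, log = TRUE)

lr_statistic <- -2 * (ll_null - ll_mle)
lr_p_value <- pchisq(lr_statistic, df = 1, lower.tail = FALSE)

c(p_hat = p_hat, lr_statistic = lr_statistic, p_value = lr_p_value)
```

The result should be very close to the grid-based one above; any small difference comes from the grid only allowing values of *p* in steps of 0.01.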

After this you can go back to the smart judge and convince her that your finding is very unlikely to have occurred by chance. To help you phrase things better to the judge, I have given you the exercise below.

### Exercise 1.2

a\) How exactly would you phrase your finding? How unlikely is it that Nick's games have occurred by chance?

b\) How would you construct standard errors and confidence intervals around those estimates? How would you phrase the findings about the confidence intervals?

c\) Can you plot confidence intervals on the graph with the red and blue line?

Bonus Questions:

d\) You do a debrief with your team of detectives and informants. On this occasion, your informant had gathered 630 games. But what if he had sampled fewer, or more? Can you find out how many games he would have needed to gather for you to be able to demonstrate this difference to the judge (e.g. would 30 games be enough)? What do you call such a question in science?

## Small diversion: is coin tossing fair?

![](coin_tossing.jpg)

Check out this article here: <https://www.economist.com/science-and-technology/2023/10/15/how-to-predict-the-outcome-of-a-coin-toss>

# Week 3: Probability theory and The Bayes Theorem

## Of Men and Homicides

Before we revisit the above from a Bayesian perspective, we will need a small detour into probability theory.

About 90% of homicides in Europe are committed by men. How justified is it to say that "men are murderers"? Think about this question also by replacing men with immigrant men, foreign men, or foreigners more generally. Think about what may be true in the aggregate (and generates stereotypes) and what is valuable at the individual level. Let's try to tackle this problem in a number of ways.

Let there be a population where

the probability of homicide in a country is 2.3 in 100,000, i.e. $P(homicide) = 2.3 \times 10^{-5}$,

the probability of being a man in that population is 50%, i.e. $P(man) = 0.5$, and

the probability that, if there is a homicide, the perpetrator is a man is 90%, i.e. $P(man \mid homicide) = 0.9$.

The question is: what is the probability that, if I see a man on the street, he is a murderer, i.e. $P(homicide \mid man)$?

Let's arrive at this step by step.

First, let's remind ourselves what the probability is of two events occurring together:

$$
P(A \cap B) = P(A) \cdot P(B)
$$

This says that the probability of two events co-occurring is the product of the probabilities of each event. However, this is only true if the two events are independent of each other, i.e. the occurrence of A has nothing to do with the occurrence of B. Is this the case here with men and homicides? What more general rule can we apply? Let's try to understand this graphically.

![](probability_set.jpg)

When you re-arrange this, you arrive at the very important following formula.

$$
P(A \cap B) = P(A \mid B) \cdot P(B)
$$

This formula allows you to calculate conditional probabilities if you have the joint ones and a prevalence, and vice versa. But this does not quite help us because we don't know the joint probability of being a man and committing a homicide. For this we employ a trick. The equation above, with the roles of A and B swapped, also holds this way:

$$
P(A \cap B) = P(B\mid A) \cdot P(A)
$$

which then means that,

$$
P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A)
$$ {#eq-10}

So now you can calculate either of the two conditional probabilities if you know the other three quantities.
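The difference between the independence formula and the general product rule can be seen with the three numbers given above. This is only a sketch with illustrative variable names; it computes the joint probability of being a man and committing a homicide in both ways.

```{r}
p_hom <- 2.3e-5         # P(homicide), as given above
p_man <- 0.5            # P(man)
p_man_given_hom <- 0.9  # P(man | homicide)

# Joint probability from the general product rule: P(man | homicide) * P(homicide)
p_joint <- p_man_given_hom * p_hom

# What the joint probability would be if the two events were independent
p_joint_if_independent <- p_man * p_hom

c(general_rule = p_joint, if_independent = p_joint_if_independent)
```

The two values differ, which is another way of saying that gender and homicide are not independent in this population.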

Let's apply this to our homicide example. Remember we need to estimate: $P(homicide|man)$

Therefore, rearranging and substituting our terms into Equation 10, we get:

$$
P(homicide \mid man) = \frac{P(man \mid homicide) \cdot P(homicide)}{P(man)}
$$ {#eq-11}

**CONGRATULATIONS: you have just entered the world of the Reverend Thomas Bayes! This is his theorem applied to homicides!**

As we will see further down, Bayes links back to the likelihood that we have been discussing and allows us to use priors.

### Exercise 3.1

a\) Calculate the conditional probability $P(homicide \mid man)$.

b\) Do so for a country like Brazil too, where the probability of homicide is about 10-fold higher.

c\) Comment on whether calling men, foreigners etc. murderers may be considered stereotyping. What does it mean for individual prediction and what does it mean for public health and safety?

## Men, Homicides and marginal probabilities

Now let's look at the problem from a different angle. Let's create a table that captures the above in a representative sample, n = 100,000, of the population in Brazil, where the probability of homicide is about $20/10^5$, the gender ratio is assumed to be equal and the probability that a homicide is committed by a man is 0.9.

```{r echo = F, message = F, warning = F}
# Total obs
n <- 10^5

# Probabilities as above
p_male <- 0.5
p_female <- 0.5
p_homicide_total <- 20 / 100000 #this is the rate for Brazil
p_homicide_male <- 0.9 * p_homicide_total
p_homicide_female <- 0.1 * p_homicide_total

# simulate gender 
genders <- sample(c("Male", "Female"), size = n, replace = TRUE, prob = c(p_male, p_female))

# Initialize homicide data
homicides <- rep("No", n)

# number of homicides committed by males and females
num_homicides_total <- round(p_homicide_total * n)
num_homicides_male <- round(p_homicide_male * n)
num_homicides_female <- num_homicides_total - num_homicides_male

# assign homicides to males and females
male_indices <- which(genders == "Male")
female_indices <- which(genders == "Female")

male_homicide_indices <- sample(male_indices, num_homicides_male, replace = FALSE)
female_homicide_indices <- sample(female_indices, num_homicides_female, replace = FALSE)

homicides[male_homicide_indices] <- "Yes"
homicides[female_homicide_indices] <- "Yes"

# the contingency table
contingency_table <- table(genders, homicides)

contingency_table


```

How do you calculate $P(homicide \mid male)$ here? Do it by hand. Do it also after substituting the European homicide probability given above. Do you get the same results?

```{r echo = F, message = F, warning = F}
num_males <- sum(genders == "Male")
num_male_homicides <- contingency_table["Male", "Yes"]
p_homicide_given_male <- num_male_homicides / num_males
paste("P(homicide | male) =", round(p_homicide_given_male,5))
```

Congratulations, you have just used a **marginal probability**: you summed the two outcomes for men, the Nos and the Yeses, and used the sum as the denominator. This is fundamental in Bayesian statistics, as we shall see in the next few sessions. Here it is very simple. By the way, do it for women too; what is it? It is an order of magnitude less, as you might expect, but both are very low. Ask yourselves, would gender be a good test to detect homicides?

## Homicides and Probability Trees

Now let's say that a new company comes and tells you that its test has excellent sensitivity and specificity, both 95%, for detecting the scent of a criminal. What conditional probabilities do sensitivity and specificity refer to?

Exactly: sensitivity is $P(Test+ \mid Criminal+)$, where + denotes having the characteristic (test positivity and being a criminal), i.e. how likely the test is to be positive if you are a criminal. The specificity is $P(Test- \mid Criminal-)$, i.e. how likely the test is to be negative if you are not a criminal.

Below is a probability tree. Where do you find the sensitivity and where do you find the specificity? And how do you estimate the reverse of the sensitivity? This is the key question: you are not that much interested in how the test performs in criminals, but rather **how the test behaves** in the population you are likely to encounter. This is given by $P(Criminal+ \mid Test+)$, i.e. the probability that you are indeed a criminal if you have a positive test. How do you calculate this and what is your denominator?

![](crime%20probability%20tree.jpg)

This quantity is fundamental to all medical tests and indeed all tests where you want to draw inferences about the goodness of the test in a given population. It is the **Positive Predictive Value and it is a Bayesian quantity.** Using very basic algebra and the rules derived above, I will try to demonstrate this in the picture below.

![](Image_1.jpeg)

The two key equations here are:

$$
P(homicide+ \mid test+) = \frac{P(test+ \mid homicide+) \cdot P(homicide+)}{P(test+)}
$$ {#eq-12}

Notice the similarity of this equation with that of Equation 11. I have also added in red some nomenclature that we will be encountering very soon, in the next lesson. Now, as I have derived above, there is another way to derive the same quantity using the known properties of the sensitivity and specificity and the prevalence of the population, without needing any other information.

$$
P(homicide+ \mid test+) = \frac{P(test+ \mid homicide+) \cdot P(homicide+)}{\text{sensitivity} \cdot P(homicide+) + (1-\text{specificity}) \cdot P(homicide-)}
$$ {#eq-13}

Look at the denominators of both Eq. 12 and 13. These are the marginal likelihoods (also called the evidence). They are cumbersome, but not nearly as complex as what we will be encountering soon, even for solving the same simple binomial problem that we had solved in the last lesson. Indeed, these denominators are often analytically intractable and require approaches such as MCMC algorithms.

We will come to all this.

Meanwhile, I am going to give you some extra code that allows you to play around with sensitivities, specificities, PPV, NPV, but also with the chances of having a disease if someone tells you that you have a negative test (always very important).
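A minimal sketch of such code follows. The helper `predictive_values()` is a hypothetical function written for these notes, not part of any package; it applies Equation 13 and its counterpart for negative tests. The 10% prevalence in the second call is an arbitrary value chosen for comparison.

```{r}
# Hypothetical helper: predictive values from sensitivity, specificity and prevalence
predictive_values <- function(sensitivity, specificity, prevalence) {
  # Marginal probability of a positive test (the denominator of Equation 13)
  p_test_pos <- sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
  p_test_neg <- 1 - p_test_pos

  ppv <- sensitivity * prevalence / p_test_pos        # P(condition+ | test+)
  npv <- specificity * (1 - prevalence) / p_test_neg  # P(condition- | test-)

  c(PPV = ppv, NPV = npv, P_condition_given_negative = 1 - npv)
}

# The scent test above (95% sensitivity and specificity) at a rate of 2.3 in 100,000
scent_test <- predictive_values(sensitivity = 0.95, specificity = 0.95, prevalence = 2.3e-5)
scent_test

# The same test for a condition with 10% prevalence (arbitrary value for comparison)
common_condition <- predictive_values(sensitivity = 0.95, specificity = 0.95, prevalence = 0.10)
common_condition
```

Note how the same test yields a very different positive predictive value depending on the prevalence; the third element is the chance of having the condition despite a negative test.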

Try to understand the basic notion of Bayes' theorem and apply it to various situations of interest to you, like medical tests, exam results etc. Next time we will pick up again the problem of Nick the Shark and play around with the Bayesian estimation of the finding.

# Week 4: Some more on crime and punishment

## Contingency tables and expected values (a little detour)

Before we delve into some Bayesian stuff, it may be good to remind ourselves of some very simple principles that would help us decide whether men are more likely to commit crimes according to common standards of significance.

Just to remind you, this was our contingency table:

```{r echo = F, warning = F, message = F}
contingency_table
```

How would you decide whether the differences are "significant"? You did come up with probabilities for this problem above for males and females. But how would you know that they differ?

We will treat this problem using a standard frequentist approach and then turn to a Bayesian answer later in our meetings. What we need to do is create expected values for each cell in the above example. What would you do?

First, you will need the marginals. There are three types of marginals, the row marginals, the column marginals, and the totals.

The command below gives you the row marginals.

```{r}
marginSums(contingency_table,1)
```

This one gives you the column marginals.

```{r}
marginSums(contingency_table,2)
```

And here is the grand total.

```{r}
marginSums(contingency_table)
```

Or, to create the whole table with its margins, do

```{r}
addmargins(contingency_table)
```

$P(gender == female \cap homicide == no) = P(gender == female \mid homicide == no) \cdot P(homicide == no)$

$P(gender == female \cap homicide == yes) = P(gender == female \mid homicide == yes) \cdot P(homicide == yes)$

$\sum_{homicide} P(gender == female \mid homicide) \cdot P(homicide)$

Can you think of a way to get expected values now?

There are two principal ways to think about it, which are complementary: either think of the "rule of three" from primary school maths, or think in a Bayesian (or rather, generally more abstract) way about it. I will start with the simple way.

Let's start with females who do not kill: how many would we expect? This is the top left-hand corner cell. We **observe** 50167 who have not committed murder and 2 who have. How many would we have expected in each of these two cells? How do we use the term expected? In the sense that we ignore the observed cell values and say: there are 50169 females (the row total) among $10^5$ people (the overall total). How many would there be among the 99980 (the total of the No column, in which the first cell is situated)? It follows that we obtain the **expected value** by calculating $50169 \times 99980/10^5$, which is 50159 after rounding.

If you do the same for each one of the cells, you get the following table of **expected** values.

```{r}
exp_table <- chisq.test(contingency_table)$expected
exp_table
```
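The rule of three can be applied to all cells at once: each expected count is the row total times the column total, divided by the grand total. The sketch below uses the counts quoted in the text; the male row (49813 and 18) is not quoted there but is derived from the quoted totals, so treat it as an assumption. `obs_counts` and `expected_by_hand` are illustrative names.

```{r}
# Observed counts quoted in the text; the male row is derived from the quoted totals
# (50169 females among 100000 people, 99980 "No" and 20 "Yes" overall)
obs_counts <- matrix(c(50167, 2,
                       49813, 18),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(genders = c("Female", "Male"),
                                     homicides = c("No", "Yes")))

# Expected count in each cell = row total * column total / grand total
expected_by_hand <- outer(rowSums(obs_counts), colSums(obs_counts)) / sum(obs_counts)
expected_by_hand

# These are the same values that chisq.test() reports
all.equal(unname(expected_by_hand), unname(chisq.test(obs_counts)$expected))
```

`outer()` here does the same job as in the regression example earlier: it forms every combination of a row total with a column total.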

Compare the two tables. What do you see?

How can you formalise this into a statistical answer? This will be an exercise for next time (see below). For the moment, and in the interest of abstracting, what exactly did I do here when I used the rule of three?

Let's write out some simple operations in fancy terms.

What is the probability of being female and not being a murderer on the basis of the observed data?

$$
P(gender == female\cap homicide == no) = 50167/10^5 = 0.5017
$$ {#eq-14}

But, let's go even more fancy, let's apply Bayes' theorem and the relationship between joint and conditional probabilities.

$$
P(gender == female \cap homicide == no) = P(gender == female \mid homicide == no) \cdot P(homicide == no)
$$ {#eq-15}

This gives us the same result, as you can verify by multiplying the two fractions: $50167/99980 \times 99980/10^5$.

This is a bit ridiculous and a near tautology, but humour me for a bit. What if I asked you what the following quantity is (which is the equivalent of Equations 6.8 and 6.9 in the Farrell and Lewandowsky book)?

$$
P(gender == female) = \sum_{\theta} P(\text{female} \mid \text{homiicide}) \cdot P(\text{homicide})
$$ {#eq-16}

You needn't worry of course, because all you have to do to calculate eq-16 is to add eq-15 and eq-17 below (which is its counterpart, the cell right next to it, i.e. murdering females).

$$
P(gender == female \cap homicide == yes) = P(gender == female |homicide == yes).P(homicide == yes)
$$ {#eq-17}

Which gives you (2/20\*20/10\^5) and is the same as doing the following simpler calculation.

$$
P(gender == female \cap homicide == yes) = 2/10^5 = 2 \times 10^{-5}
$$ {#eq-18}

Indeed, when you add the results of eqs 15 and 17, you get the marginal probability, which is all that equation 16 is asking you to do, except in fancy formalism:

(50167/99980\*99980/10\^5) + (2/20\*20/10\^5) = 0.50169, which is the same as what you would get if you simply divided the marginal for gender by the total, i.e. 50169/10\^5.
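If you want to check the arithmetic, here is a minimal sketch using only the counts quoted above. Each joint probability is written as conditional times marginal, and the two are then summed over the homicide categories. The variable names are my own.

```{r}
n_all <- 10^5   # overall total
n_no  <- 99980  # column total, homicide == no
n_yes <- 20     # column total, homicide == yes

# joint = conditional * marginal
joint_female_no  <- (50167 / n_no) * (n_no / n_all)
joint_female_yes <- (2 / n_yes) * (n_yes / n_all)

# summing over the homicide categories gives the marginal for female
marginal_female <- joint_female_no + joint_female_yes
marginal_female
50169 / n_all   # the same number, read directly off the row total
```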

**PHEW!**

Does all this make sense? It is quite simple but can be mind-boggling.

Here is an exercise to consolidate expected values, we will follow up with the Bayesian stuff below.

### Exercise 4.1

*Once you have solved this exercise, you may get a fundamental insight about model evaluation, at least I did when I grasped this.*

a\) Look at the table above with the observed values; let's call this table O_table. Also look at the one with the expected values, E_table. Try to conceive of them as locations on some imaginary map. All those numbers are nothing but locations on that multi-dimensional map. Indeed, each table is a matrix and a matrix can be thought of as a location in a space that has the matrix's dimensions. The question arises then: how far away is O_table from E_table? What simple mathematical operation would allow you to answer this question?

b\) How can you tweak that simple mathematical operation to calculate the distance?

c\) If you want to do the statistical estimation you will need some extra tools. Hint: what is a common way, e.g. used in simple regression estimation, to get rid of annoying signs (positive, negative), without taking the absolute value though? Another hint: you will need a distribution for deciding.

## Back to Bayes and meeting the beta distribution

From the discussion above, we need to remember equation 16, which I am writing here in its general form:

$$
P(y) = \sum_{\theta} P(\text{y} \mid \theta) \cdot P(\theta)
$$ {#eq-19}

This is the broad definition of the marginal likelihood and it is going to pop up very often. Equation 20 is simply its instantiation for continuous variables:

$$
P(y) = \int_{\theta} P(\text{y} \mid \theta) \cdot P(\theta)d\theta
$$ {#eq-20}

Then of course the fundamental Bayesian equation is written as:

$$
P(\theta\mid y) = \frac {P(y \mid\theta ). P(\theta)}{\sum_{\theta} P(\text{y} \mid \theta) \cdot P(\theta)}
$$ {#eq-21}

or, for continuous quantities,

$$
P(\theta\mid y) = \frac {P(y \mid\theta ). P(\theta)}{\int_{\theta} P(\text{y} \mid \theta) \cdot P(\theta)d\theta}
$$ {#eq-22}

IMPORTANT: $P(\theta\mid y)$ is the posterior distribution, it is what every Bayesian analysis strives for.
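To see equations 19 and 21 at work on numbers we already know, here is a small sketch that goes back to the homicide table. It treats homicide as the "parameter" and gender as the "data", and asks for the probability of having committed a homicide given that the person is female. The counts are those quoted in the previous section; the variable names are mine.

```{r}
n_people <- 10^5

prior_yes <- 20 / n_people      # P(homicide == yes)
prior_no  <- 99980 / n_people   # P(homicide == no)

lik_female_given_yes <- 2 / 20          # P(female | homicide == yes)
lik_female_given_no  <- 50167 / 99980   # P(female | homicide == no)

# eq-19: the marginal (the Bayes denominator), summed over both categories
evidence_female <- lik_female_given_yes * prior_yes +
  lik_female_given_no * prior_no

# eq-21: the posterior
post_yes_given_female <- lik_female_given_yes * prior_yes / evidence_female
post_yes_given_female
```

You can convince yourself that this is just 2/50169, i.e. the murdering females divided by all females.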

I will skip the chapter on analytic methods for obtaining posteriors in the book in favour of a more conceptual understanding of the beta distribution.

You will all know about the debate between Bayesians and Frequentists. It has been raging for years and it usually focuses on the issue of the **priors** and whether Bayesians are unduly **subjective**. Indeed, Bayesians argue that when you try to make a statement about data, you ought to take prior knowledge into account. Perhaps more importantly, they extend the argument to say, well, you should actually update your model as new knowledge accumulates! Only in this way will you be able to be fair to the state of the world.

I won't go into the various arguments, except to say that even frequentists make a lot of decisions that require scientific **judgement**. The most notable one is the likelihood model that we choose. As we have seen, this is key to all modelling of data. Bayesians would say that their approach provides a principled way of assessing the probability of parameters but also of models. How? We shall see below. I have added some materials about voting patterns and confidence intervals that you may want to study.

For the moment, let's revisit, Nick the gambler.

How would you go about evaluating his honesty in a Bayesian way?

You remember that we used the binomial distribution as our likelihood model to estimate the likelihood of what Nick came up with.

For this let's turn to the beta distribution and highlight some interesting features.

I will start at the end, with a rewriting of equation 6.24 in the book for obtaining the posterior distribution of a coin toss with *n* tosses and *k* heads:

$$
P(\theta \mid k,n) = beta(\theta|\alpha + k, \beta + n -k) 
$$ {#eq-23}

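This basically says that if you toss a coin and want to use a Bayesian approach (and you choose, as most Bayesians would, the beta distribution as a prior), all you have to do is add the observed counts to the prior's parameters: the number of heads *k* to $\alpha$ and the number of tails *n* - *k* to $\beta$!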

That seems super simple, but requires a lot of maths to arrive at. We will discuss all this at the next lesson. You may want to look here until then: <https://www.statlect.com/probability-distributions/beta-distribution>

But let's start with some basics.

Let's get an intuition for the beta. I am writing out its probability density function:

$$
f(x; \alpha, \beta) = \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)}
$$ {#eq-24}

We can glean that it has bits that look like the binomial in the numerator (and actually also in the denominator). We will delve into this next time.
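As a quick sanity check of the formula, the sketch below codes the density by hand and compares it with R's built-in `dbeta()`. The function name `beta_pdf_by_hand` and the parameter values 2 and 5 are arbitrary choices of mine; `beta(a, b)` is base R's implementation of $B(\alpha, \beta)$.

```{r}
# the beta density written out exactly as in the formula above
beta_pdf_by_hand <- function(x, a, b) {
  x^(a - 1) * (1 - x)^(b - 1) / beta(a, b)
}

x_check <- c(0.1, 0.25, 0.5, 0.9)

by_hand  <- beta_pdf_by_hand(x_check, a = 2, b = 5)
built_in <- dbeta(x_check, shape1 = 2, shape2 = 5)

rbind(by_hand, built_in)  # the two rows should be identical
```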

For the time being let's play around with the beta distribution for different values of its two parameters. I am using the code below.

```{r echo = F, warning = F, message = F}
# i am defining the alpha and beta first
alpha <- .5
beta <-.5
# and the theta, i.e. the possible values of the toss
theta <- seq(0, 1, length.out = 1000)

prior_dist <- dbeta(theta, alpha, beta)
plot(theta, prior_dist)
```

Let us now return to Nick... Play around with the priors by tuning the $\alpha$ and $\beta$ to various values and see what happens.

```{r echo = F, warning = F, message = F}

# Parameters
alpha_prior <- 1 # play around with this and the beta
beta_prior <- 1
heads <- 271 # remember from above?
tosses <- 630

# Posterior parameters, see formalism above
alpha_post <- alpha_prior + heads
beta_post <- beta_prior + (tosses - heads)

# getting very fine grained with the beta to emulate continuity
theta <- seq(0, 1, length.out = 1000)

# Getting the prior and posterior distribution
prior_dist <- dbeta(theta, alpha_prior, beta_prior)
posterior_dist <- dbeta(theta, alpha_post, beta_post)

# Creating a long df for plotting
df <- data.frame(
  theta = rep(theta, 2),
  density = c(prior_dist, posterior_dist),
  Distribution = rep(c("Prior", 
                       "Posterior"), 
                     each = length(theta))
)


# Plot the prior and posterior distributions
df %>% 
  ggplot(aes(x = theta, y = density, color = Distribution, fill = Distribution)) +
  geom_line() +
  geom_area(alpha = 0.3, position = "identity") +
  scale_y_continuous() +
  labs(title = "Prior and Posterior Distributions",
       x = "Theta (Probability of Heads)",
       y = "Density") +
  theme_minimal() +
  theme(legend.position = "top") +
  annotate("text", x = 0.25, y = 10, label = 
  paste0("Prior parameters:", "\nalpha = ", alpha_prior, ", beta = ", beta_prior))

# and back to our original question, what is the maximum value for the parameter?
theta[which.max(posterior_dist)]


```

Now what do you do with this information? Can you get confidence intervals? Yes!

What can you do with these data? Can you use them to predict future tosses?
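Both questions can be answered directly from the beta posterior. The sketch below is one way of doing it, assuming a flat beta(1, 1) prior and 271 heads in 630 tosses: `qbeta()` gives the quantiles that bound the central 95% of the posterior, and the posterior mean gives the probability that the next toss comes up heads. The variable names are my own.

```{r}
a_post <- 1 + 271          # alpha + k
b_post <- 1 + (630 - 271)  # beta + n - k

# central 95% interval of the posterior
interval_95 <- qbeta(c(0.025, 0.975), a_post, b_post)
interval_95

# probability that the next toss is heads, which for a beta posterior
# is simply the posterior mean
p_next_head <- a_post / (a_post + b_post)
p_next_head
```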

### Exercise 4.2

a\) What is your understanding of what Ipsos, the pollsters, are saying here? Check out page 3: does it make sense?

<https://www.ipsos.com/sites/default/files/2017-03/IpsosPA_CredibilityIntervals.pdf>

b\) Can you see the relationship with this paper?

<https://www.tandfonline.com/doi/pdf/10.1080/01621459.2018.1448823?casa_token=phCtUIGpXcsAAAAA:TnbmBljQ5CaMfCreH_qxMLeIvdEKpJD_tTDPEE0cYK3a_-q0JBHYb3CUqkKHzf2V-gBYW64r5nEfyQ>

# Week 5: Self Study

# Week 6: Inference in Bayes through sampling.

This week we will try to familiarise ourselves with the notion and practicalities of sampling from the posterior. We are going to make small steps into this as it can become quite complex and is probably best revisited as the practical need arises (and it will indeed come up a lot in what we do).

## Why use sampling?

Notice the denominator of Equation 22, also called the evidence or the marginal likelihood. Such an integral can be fairly simple, so that mere mortals like ourselves might stand a chance of solving it analytically. But it can become quite complex, sometimes so complex that even evolved machines need help and special tricks to solve it. But that does not matter, and the reason is that we can make some reasonable assumptions that will help us arrive at knowledge about our parameters of inference in any given model.
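For Nick's coin the integral happens to be one of the simple ones. As a sketch, assuming a flat beta(1, 1) prior and 271 heads in 630 tosses, we can let R integrate the Bayes numerator numerically with `integrate()` and compare the result with the analytic answer, which for a flat prior is 1/(n + 1). The function name `numerator_fun` is mine.

```{r}
# likelihood times prior, as a function of the parameter p
numerator_fun <- function(p) dbinom(271, size = 630, prob = p) * dbeta(p, 1, 1)

# the evidence: integrate the numerator over all possible values of p
evidence_numeric <- integrate(numerator_fun, lower = 0, upper = 1)$value
evidence_numeric

1 / (630 + 1)  # analytic result for a flat prior
```

With one parameter this is easy; with many parameters the integral runs over all of them at once, and that is where things get hard.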

This week, we will discuss one form of sampling, namely sampling from the posterior. We will assume that we have arrived at a posterior and will sample from it. The aim is to demonstrate the principle of sampling. We will only allude to what will become our main engine to support our Bayesian inference, namely Monte Carlo methods and chiefly the Metropolis-Hastings algorithm. We will, however, try to use today's example to fit our first models in Stan, a probabilistic programming language, and in brms, a package in R that talks to Stan.

## Some simple sampling in the case of Nick

The simple sampling from the posterior comes from chapter 3 of McElreath's book, Statistical Rethinking, a great resource by the way.

Let's consider the example of Nick the gambler again. We remember that 271 heads out of 630 tosses was what our detective learnt about Nick's gambling games. In chapters 2 and 3 we looked a lot at the likelihood, which as we recall is represented by the binomial probability mass function. Please do refer to the formalisms above; here I will only remind us that we can use the dbinom function in base R for it. Let's then build a Bayesian model as we did above, in Week 4.

This code below generates the posterior distribution.

```{r }
n_heads <- 271
n_tosses <- 630

possible_prob_values <- seq(from = 0 , to = 1, length.out = 10^3) # these are the values  that our parameter for the probability takes

likelihood <- dbinom(n_heads, n_tosses, prob = possible_prob_values)  # our likelihood

prior <- rep(1, length(possible_prob_values)) # this prior is uniform, corresponds to a beta(1,1)

Bayes_numerator <- likelihood*prior # this is the numerator of Eqs 21 and 22.

posterior <- Bayes_numerator / sum(Bayes_numerator) # the outcome of Eqs 21 and 22.

```

For me, the simplest way of understanding what is going on is to ask: to what value of the parameter (i.e. the probability) does the maximum of the posterior correspond? I plot this here.

```{r}
plot(possible_prob_values, posterior)
```

You can also ask in the way we have before

```{r}
possible_prob_values[which.max(posterior)]
```

This number is very close to what we got when we did the maximum likelihood (which should be no surprise, since we used flat priors).

To illustrate that this can also be achieved through sampling, let's draw 1000 values out of the posterior.

```{r}
sampled_posterior <- sample(possible_prob_values, # the parameter values to sample from
       prob = posterior, # these are the probabilities that we input
       size = 10^3, # the size of it, 
       replace = T # got to do this
       )

plot(density(sampled_posterior))

```

Now this may sound trivial, but you just did your first sampling from the posterior...

### Exercise 6.1

a)  Find the values of mean, median, mode and max of the sampled posterior. What do you observe?
b)  Instead of sampling from the posterior, sample from the Bayes numerator.
c)  Use the sample from (b) to re-run (a), what do you observe?
d)  Do (b) and (c) using the likelihood instead of the Bayes numerator.
e)  Find the 95% intervals–what would you call those intervals?

Here is a little pointer for 6.1.e

![](confidence_intervals_McElreath.jpg)

## Simplifying posterior sampling

The general point though of Exercise 6.1 is to illustrate something that the Farrell & Lewandowsky book mentions in chapter 7, namely equation 7.1. It states that all we need to know to do the sampling is the Bayes numerator, i.e. the likelihood multiplied by the prior. This is because the evidence, i.e. the denominator (i.e. the marginal likelihood) is a constant in relation to this quantity. Therefore, equation 22:

$$
P(\theta\mid y) = \frac {P(y \mid\theta ). P(\theta)}{\int_{\theta} P(\text{y} \mid \theta) \cdot P(\theta)d\theta}
$$

can be reduced to:

$$
P(\theta\mid y) \propto P(y \mid\theta ). P(\theta)
$$ {#eq-25}

Where $\propto$ means proportional to, the posterior is proportional to the Bayes numerator, i.e. the likelihood times the prior.
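Here is a small sketch of what proportionality buys us, again assuming a flat beta(1, 1) prior and 271 heads in 630 tosses. I compute the Bayes numerator on a grid and rescale it so that it integrates to one, without ever touching the analytic denominator. The result lands on the exact conjugate posterior, beta(1 + 271, 1 + 359). The names `grid_p`, `density_grid` and so on are mine.

```{r}
grid_p <- seq(0, 1, length.out = 1000)
step_p <- grid_p[2] - grid_p[1]

# Bayes numerator: likelihood times prior
numerator_grid <- dbinom(271, size = 630, prob = grid_p) * dbeta(grid_p, 1, 1)

# rescale so that the curve integrates to 1 over the grid
density_grid <- numerator_grid / (sum(numerator_grid) * step_p)

# exact posterior from the conjugate formula
density_exact <- dbeta(grid_p, 1 + 271, 1 + 359)

max(abs(density_grid - density_exact))  # should be tiny
```

The shape of the posterior is already fully contained in the numerator; the denominator only sets the scale.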

Therefore, we can use algorithms like the Metropolis-Hastings, which we will talk about in more detail in the next few weeks, to arrive at inferences.

For this week though, let's estimate the Nick problem using hardcore Bayesian models...

First off, with brms, the very useful package *Bayesian Regression Models Using Stan*, an R package that is structured like lme but allows you to fit a wide range of linear and non-linear models. It uses a Hamiltonian Monte Carlo sampler, based on similar principles to the Metropolis-Hastings MCMC that we will be seeing later.

Here is some brms code to run the Nick model.

```{r results='hide', warning=F, message=F}
library(brms)

brm(data = list(nheads = 271),
family = binomial(link = "identity"),
nheads | trials(630) ~ 0 + Intercept,
# we are using a flat prior--like the uniform above. 
prior(beta(1, 1), class = b, lb = 0, ub = 1),
iter = 2000, warmup = 500, # we will discuss this in more detail later.
seed = 3)


```

### Exercise 6.2

a\. go through the code above line by line. What is the model, see "nheads \| trials(630) \~ 0 + Intercept"

b\. run the code above and look at how long it takes, why, what is happening there.

c\. assign the model above to an object called "my_first_brms_model". Use summary to view the output. What do you notice about the estimate?

Now let's push the boundaries and try to do all this in Stan.

```{r results='hide', warning=F, message=F}
library(rstan)

# Data preparation
data_list <- list(nheads = 271, N = 630)

# Stan model code
stan_code <- "
data {
  int<lower=0> nheads;
  int<lower=0> N;
}
parameters {
  real<lower=0, upper=1> p;
}
model {
  p ~ beta(1, 1);
  nheads ~ binomial(N, p);
}
"

# Compile the model
model <- stan_model(model_code = stan_code)

# Fit the model
fit <- sampling(model, 
                data = data_list, 
                iter = 2000, 
                warmup = 500, 
                seed = 3)

# Print the summary of the fit
print(fit)

```

### Exercise 6.3

a\. Do what you did for Exercise 6.2, but this time for the Stan model.

b\. Does the Bayes denominator appear anywhere here?

c\. Whose grave is depicted in the photo below? Hint: it is in East London (Photo courtesy of Dr LA)

![](a_remarkable_grave.jpeg)

### A note on likelihood and notation.

People wondered yesterday about the following.

We have written the likelihood of a model, say the binomial as:

$$
\mathcal{L}(\theta \mid x)
$$

This apparently clashes with the following notation for the likelihood that we see in the numerator of Eq 21 or 22, i.e. the Bayesian formulations:

$$
P(x\mid\theta)
$$

The confusion arises from the following. **The likelihood** $\mathcal{L}(\theta \mid x)$ **denotes the probability with which some value x of the outcome variable *X* is observed when** $\theta$ **assumes a particular value**. Indeed, as we said in our first lesson, in equation 1, taking the probability density function of the normal as an example, we write:

$$
f(x | \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \cdot e^{-\frac{(x - \theta)^2}{2\sigma^2}}
$$

i.e. a function of the value of *x* with $\theta$ fixed. Often you will see the notation $f(x; \theta)$, which avoids the notion that somehow x is conditioned on $\theta$.

I repeat here: **The likelihood** $\mathcal{L}(\theta \mid x)$ **denotes the probability with which some value x of the outcome variable *X* is observed when** $\theta$ **assumes a particular value.** For this reason, the likelihood in probability notation must be written as $P(X=x \mid \theta)$ which is what we have in the Bayes numerator as the likelihood.

To make this clear, let's go to back to the Nick example.

The likelihood is obviously given by the binomial:

$$\mathcal{L}(p|n,k) = \binom{n}{k} p^k (1-p)^{n-k}$$ {#eq-27}

which in Nick's case is:

$$
\mathcal{L}(p \mid n = 630, k = 271) = \frac{630!}{271! \cdot 359!} p^{271} (1 - p)^{359}
$$

When we run this over many different values of *p* , we get the likelihood curve that we have seen above. Now, let's run this for **a particular p value** say *p* = *0.5* i.e. a fair coin.

```{r}
dbinom(271,630, 0.5)
```

This is quite a small number. Let's now do it for *p* = *0.431* i.e. the value we know maximises the likelihood:

```{r}
dbinom(271,630, 0.431)
```

Now this is a much bigger number, as expected.

What is each of these numbers though? They are probabilities of observing value *x* given a value of $\theta$ . Therefore, in the first case we have $P(n = 630, k = 271\mid \theta = 0.5)$ and in the second case $P(n = 630, k = 271\mid \theta = 0.431)$. Each one of them is a probability for observing the outcome given a value of the parameter. It is **clearly not** the opposite, i.e. $P(\theta = 0.5\mid n = 630, k = 271)$ or $P(\theta = 0.431\mid n = 630, k = 271)$.
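Two quick checks may make the distinction more tangible. In the sketch below (variable names are mine), I first fix $\theta$ and sum the binomial over every possible number of heads: as a function of the outcome it is a proper probability distribution and adds up to one. I then fix the outcome and compare two values of $\theta$, which is what we do when we read the same formula as a likelihood.

```{r}
# theta fixed at 0.5, summing over all possible outcomes k = 0, ..., 630
total_prob_over_outcomes <- sum(dbinom(0:630, size = 630, prob = 0.5))
total_prob_over_outcomes

# outcome fixed at k = 271, comparing two candidate values of theta
lik_at_0.431 <- dbinom(271, size = 630, prob = 0.431)
lik_at_0.5   <- dbinom(271, size = 630, prob = 0.5)
likelihood_ratio <- lik_at_0.431 / lik_at_0.5
likelihood_ratio  # how much better theta = 0.431 accounts for the data
```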

I hope this helps.

### A cultural addendum.

The Manhattan Project was a vast operation, funded by the government during the war, leading to huge advances not only in nuclear physics but also in pretty much every aspect of the natural sciences. Some of you will have watched Oppenheimer, the film (I haven't yet), which details a lot about it and the Los Alamos laboratory. Metropolis, the Rosenbluths and Teller co-authored the paper on the Metropolis algorithm in 1953, which has been cited over 50,000 times: *Equation of State Calculations by Fast Computing Machines.*

They were part of the Los Alamos laboratory amongst an incredible array of other scientists. The atmosphere is described by Stanislaw Ulam (Mr Stan of Stan programme fame) himself in this Wikipedia excerpt:

"In his memoirs,^[\[12\]](https://en.wikipedia.org/wiki/Nicholas_Metropolis#cite_note-12)^ [Stanislaw Ulam](https://en.wikipedia.org/wiki/Stanislaw_Ulam "Stanislaw Ulam") remembers that a small group, including himself, Metropolis, [Calkin](https://en.wikipedia.org/wiki/John_Williams_Calkin "John Williams Calkin"), [Konopinski](https://en.wikipedia.org/wiki/Emil_Konopinski "Emil Konopinski"), [Kistiakowsky](https://en.wikipedia.org/wiki/George_Kistiakowsky "George Kistiakowsky"), [Teller](https://en.wikipedia.org/wiki/Edward_Teller "Edward Teller") and [von Neumann](https://en.wikipedia.org/wiki/John_von_Neumann "John von Neumann"), spent several evenings at [Los Alamos](https://en.wikipedia.org/wiki/Los_Alamos,_New_Mexico "Los Alamos, New Mexico") playing poker. They played for very small sums, but: "Metropolis once described what a triumph it was to win ten dollars from John von Neumann, author of a famous treatise on game theory. He then bought his book for five dollars and pasted the other five inside the cover as a symbol of his victory."

It also tells you about where the name Monte Carlo may be coming from for such approaches to inference.

One of the authors of the Metropolis paper is Augusta Teller, a Hungarian scientist and computer programmer. She was married to Edward Teller, the physicist who is most probably depicted as Dr Strangelove in Kubrick's film, which is a critique of nuclear weapons and their destructiveness, made in the 1960s (and a landmark for cinema with Peter Sellers in three different roles!). Teller is said to have been very hurt by Kubrick's character until the end of his life. Here is Sellers as Strangelove:

![](download%20(1).jpeg)

## Week 7: Regression and the deep dive into the loss

The purpose of this week's lesson is to remind ourselves of some fundamental regression knowledge. It is so fundamental that we will need it throughout our journey into computational science.

Specifically, we will

a\) introduce a new dataset, the Health Survey for England.

b\) start by fitting the linear regression model in three ways: with ordinary least squares, and with the two Bayesian approaches, brms and Stan.

c\) understand the relationship between the likelihood and the intercept and slope.

d\) understand gradient descent and loss functions.

### The HSE data.

For the moment, we will use only the 16 and above data from HSE and concern ourselves with the variables Height, Weight (up to 100kg), Sex and Age without any missing data, to make our lives easy for paedagogic purposes.

```{r}
#I got the Health Survey for England data
library(haven) # need it as it is stata
hse16_eul_v5 <- read_dta("hse16_eul_v5.dta")
head(hse16_eul_v5)

turn_num_missing <-  # a little function to deal with missingness
  function(variable){
  
  variable[variable<0] <- NA
  
return(variable)
  
}

var_list <- list(
hse16_eul_v5$Weight, 
hse16_eul_v5$Sex, 
hse16_eul_v5$Height)

# apply the function
hse16_eul_v5[, c("Weight", "Sex", "Height")] <- lapply(var_list, turn_num_missing)


hse16_eul_v5$Sex <- factor(hse16_eul_v5$Sex, # need labels for sex to make things easier
levels = c(1,2),
labels = c("male", "female"))

#describe height
# summary(hse16_eul_v5$Height)
# hist(hse16_eul_v5$Height) # seems skewed probably due to young people

# can see that there are some very young people. Let's keep those >=16
df_hse_16_above <- hse16_eul_v5[hse16_eul_v5$Age35g>=16,]
hist(df_hse_16_above$Height) # and this is now pretty normal.

# because of some non linearities at the edges (to which we will return)
# let's also cap at a weight of 100kg. 
df_hse_16_above_up_to_100kg <- df_hse_16_above[df_hse_16_above$Weight<=100,]

# and finally, let's just sample 1000 rows at random without any missing data on the
# key variables weight, age, sex, height

# select key variables
retain_vars <- c("Weight", "Sex", "Height", "Age35g")

reduced_df_hse_16_above_up_to_100kg <- df_hse_16_above_up_to_100kg[, retain_vars]

# exclude any missing data to make your life easier
no_missing_reduced_df_hse_16_above_up_to_100kg <- na.omit(reduced_df_hse_16_above_up_to_100kg)

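# sample 1000 rows; the seed has to be set separately, because the third
# argument of sample() is `replace`, not a seed
set.seed(1974)
df_hse_wk7 <- no_missing_reduced_df_hse_16_above_up_to_100kg[sample(nrow(no_missing_reduced_df_hse_16_above_up_to_100kg), 1000), ]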

# now plot the main numeric variables
numeric_columns <- sapply(df_hse_wk7, is.numeric)
df_numeric <- df_hse_wk7[, numeric_columns]

par(mfrow = c(ncol(df_numeric), 1), mar = c(4, 4, 2, 1))
for(i in 1: ncol(df_numeric)){
  hist(df_numeric[[i]], main = colnames(df_numeric)[i])
}

# and here are the proportions of females per 1000
table(df_hse_wk7$Sex)


# finally, you need to remove the labels--this happens when you introduce data from stata and labels get in the way

# Convert haven_labelled variables to numeric or factor
df_hse_wk7 <- as.data.frame(lapply(df_hse_wk7, function(x) {
  if (inherits(x, "haven_labelled")) {
    return(as.numeric(as.character(x)))  # Convert to numeric
  } else {
    return(x)
  }
}))

```

### Fitting our first linear regression model

In this chapter we will be trying to **estimate people's height given their weight.**

The basic regression model can be stated in the following way:

$$y_i = \beta_0 + \beta_1 \cdot x_i + \text{error}$$

where $\beta_0$ is called the intercept and $\beta_1$ is the slope. In our case $y_i$ is the height of each person (observation) and $x_i$ is the weight of each person.

But for reasons that will become apparent I will fit this model with an **intercept only** initially, i.e. without taking weight into account.

We will fit the model in three ways: our usual linear regression with ordinary least squares (OLS) in R, brms and Stan. **Do not get too hung up on the fitting of these models in R, that is not the point of this week.**

**OLS intercept only**

```{r}
model_ols_intercept <- lm(Height ~ 1, df_hse_wk7)
summary(model_ols_intercept)
```

**brms intercept only**

```{r}
# Load the brms library
library(brms)

priors <- 
  prior(normal(170, 10), class = "Intercept" # I am giving it here(and below for stan) what I believe to be a reasonable prior. You can play around with others
)

# Fit the model using brms
model_brms_intercept <- brm(
  formula = Height ~ 1,  # intercept only
  data = df_hse_wk7,  
  family = gaussian(),  
  prior = priors,  
  chains = 2,  
  iter = 2000,  
  warmup = 500,  
  seed = 1974 
)

# Print summary of the model
summary(model_brms_intercept)

```

**Stan intercept only**

```{r}
library(rstan)
# Define the Stan model for intercept-only with weakly informative priors
stan_code_intercept <- "
data {
  int<lower=0> N;  // number of observations
  vector[N] y;     // outcome variable
}
parameters {
  real mu;  // intercept (mean of the outcome)
  real<lower=0> sigma;  // standard deviation, constrained to be positive
}
model {
  mu ~ normal(170, 10);  // weakly informative prior for the intercept
  sigma ~ normal(0, 10);  // weakly informative prior for the standard deviation, truncated to be positive
  y ~ normal(mu, sigma);  // likelihood
}
"

# Tell Stan where to find the N and Y 
  stan_data <- list(
  N = nrow(df_hse_wk7),
  y = df_hse_wk7$Height
)

  # and fit the model in stan
  fit_intercept_stan <- stan(
  model_code = stan_code_intercept, 
  data = stan_data,  
  chains = 2,  
  iter = 2000,  
  warmup = 500,  
  seed = 1974  
)

# Print summary of the model
summary(fit_intercept_stan)
```

The bottom line of all this is that in all three cases the estimate for the intercept is about 166.8 cm.

What is this estimate? What does it mean?

### The intercept only is the sample mean

As we will see below, the intercept of an intercept-only linear model is the mean of the outcome variable. Below is some code to see this. Perhaps more important is the insight that this mean value is the most likely value if you use the likelihood function for the normal (go back to Week 1 to remind yourselves; I am using the code below).

```{r}
mean(df_hse_wk7$Height) ## this is the mean

# remember the likelihood function of the normal
log_likelihood_normal <-  function(x, theta, sigma) {
  log_pdf <- log(1 / (sqrt(2 * pi * sigma^2)) * exp(-((x - theta)^2) / (2 * sigma^2)))
  log_likelihood <- sum(log_pdf)  # Log-likelihood for the entire dataset
  return(log_likelihood)
}


# Find the log-likelihood of height in this dataset. 

mu_values <- seq(from = 120, to = 220,by = 0.5) # Range of mu values to test I am using here the range of possible values
sigma = sd(df_hse_wk7$Height) # I am cheating here, giving it the  stdev that I know 
#obviously, I could iterate the function over many values of the sigma too.

log_likelihood_values <- sapply(mu_values, function(mu) log_likelihood_normal(df_hse_wk7$Height, mu, sigma))

# This gives me the MLE
mle <- mu_values[which.max(log_likelihood_values)] # this is the value of mu that maximises the log-likelihood

plot(mu_values,log_likelihood_values) # as you can see the likelihood is at its max at mle
abline(v = mle, col = "red")
mtext(paste0("The maximum likelihood for the height data is ", mle), side=3)


```
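The same point can be made without a grid. The sketch below uses simulated heights rather than the HSE data (the mean of 167 and sd of 9 are made-up values, as are all the variable names) and compares three numbers: the sample mean, the OLS intercept, and the value of mu that a one-dimensional optimiser finds when asked to maximise the normal log-likelihood.

```{r}
set.seed(1974)
sim_height <- rnorm(1000, mean = 167, sd = 9)  # simulated stand-in, not the HSE data

sample_mean   <- mean(sim_height)
ols_intercept <- unname(coef(lm(sim_height ~ 1)))

# negative log-likelihood of the normal as a function of mu (sigma held fixed)
neg_log_lik <- function(mu) {
  -sum(dnorm(sim_height, mean = mu, sd = sd(sim_height), log = TRUE))
}
mle_mu <- optimize(neg_log_lik, interval = c(120, 220))$minimum

c(mean = sample_mean, ols = ols_intercept, mle = mle_mu)
```

The three should agree, the last one to within the optimiser's tolerance.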

### But what is the slope and the new intercept?

For this, I will first fit the linear model in OLS and plot it.

**OLS slope model**

```{r}
#First let's plot

plot(df_hse_wk7$Weight, df_hse_wk7$Height, pch = 19, col = 'blue', 
     xlab = "Weight", ylab = "Height",
     main = "True Heights and Predicted Heights")
model_ols_slope <- lm(Height ~ Weight, data = df_hse_wk7)
summary(model_ols_slope )

```

Now let's fit the OLS, brms, and Stan models for the full model with the slope.

```{r}
model_ols_slope <- lm(Height ~ Weight, df_hse_wk7)
summary(model_ols_slope)
```

**brms model with slope**

```{r}

priors <- c(
  prior(normal(170, 10), class = "Intercept"),  
  prior(normal(0, 10), class = "b")  # Prior for the slope (coefficient for Weight)
)
# Fit the model using brms
model_brms_slope <- brm(
  formula = Height ~ Weight,  # intercept and slope
  data = df_hse_wk7,  
  family = gaussian(),  
  prior = priors,  
  chains = 2,  
  iter = 2000,  
  warmup = 500,  
  seed = 1974 
)

# Print summary of the model
summary(model_brms_slope)
```

**Stan model with slope**

```{r}
stan_code_slope <- "
data {
  int<lower=0> N;  // number of observations
  vector[N] y;     // outcome variable
  vector[N] x;     // predictor variable
}
parameters {
  real alpha;  // intercept
  real beta;   // slope
  real<lower=0> sigma;  // standard deviation, constrained to be positive
}
model {
  alpha ~ normal(170, 10);  // weakly informative prior for the intercept
  beta ~ normal(0, 10);   // weakly informative prior for the slope
  sigma ~ normal(0, 10) T[0,];  // weakly informative prior for the standard deviation, truncated to be positive
  y ~ normal(alpha + beta * x, sigma);  // likelihood
}
"

# Tell Stan where to find the N and Y 
  stan_data_slope <- list(
  N = nrow(df_hse_wk7),
  y = df_hse_wk7$Height,
  x = df_hse_wk7$Weight
)

  # and fit the model in stan
  fit_slope_stan <- stan(
  model_code = stan_code_slope, 
  data = stan_data_slope,  
  chains = 2,  
  iter = 2000,  
  warmup = 500,  
  seed = 1974  
)

# Print summary of the model
summary(fit_slope_stan)
```

Again all three methods converge very well. But now the intercept is different. What does it mean?\
Perhaps more importantly, what is the magic behind the calculations that the three powerful machines do?

I will try to give some intuition before we move on to the analytical solution in OLS (and in the next chapters delve more into Bayes regression).

### The loss function and brute-force slope finding

Let's look again at the graph of Heights and Weights

```{r}
plot(df_hse_wk7$Weight, df_hse_wk7$Height, pch = 19, col = 'blue', 
     xlab = "Weight", ylab = "Height",
     main = "True Heights and Predicted Heights")
```

What you are trying to do with regression (and in some ways, with most of computational modelling) is to fit the best possible curves to observed data. How do you know which curve fits these data best?

Well, intuitively, it is the line that is closest to each point in the dataset. Suppose you only had three points: you would want to find the line that has the smallest total distance (also called error) from each point. To estimate this you would, for several candidate lines, take distance_1, distance_2, distance_3 (the distance from each point to the line) and add them up each time. The line with the smallest total distance would be the winner. Make sense?

But, remember that because points could be above (positive) or below the line (negative), they could cancel each other out. Therefore, you do something else, you add up either their absolute values, or more commonly their squared values, i.e. the squared distances.

The function that helps find such distances is generally called a loss function and ordinary least squares is a prototype of it. For OLS, this sum of **squared errors (SSE)** is the **loss function *L*** and is written as follows:

$$
\begin{aligned}SSE = L(\hat{\beta}_0, \hat{\beta}_1) & = \sum_{i=1}^n(y_i - \hat y_i)^2  \\& = \sum_{i=1}^n(y_i - (\hat \beta_0 + \hat \beta_1.x_i))^2 \\\end{aligned}
$$

where $y_i$ are the observed and $\hat y_i$ the predicted values.
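Written as code, the loss function is a one-liner. The sketch below uses simulated weights and heights rather than the HSE data (the intercept of 143, the slope of 0.33 and the noise sd of 8 are made-up values, and all names are mine). It evaluates the SSE at the OLS solution and at an arbitrary other pair of coefficients.

```{r}
set.seed(1974)
sim_w <- runif(200, min = 45, max = 100)          # simulated weights in kg
sim_h <- 143 + 0.33 * sim_w + rnorm(200, sd = 8)  # simulated heights in cm

# sum of squared errors for a given intercept b0 and slope b1
sse_loss <- function(b0, b1, x, y) {
  y_hat <- b0 + b1 * x
  sum((y - y_hat)^2)
}

fit_sim  <- lm(sim_h ~ sim_w)
ols_coef <- unname(coef(fit_sim))

sse_at_ols    <- sse_loss(ols_coef[1], ols_coef[2], sim_w, sim_h)
sse_elsewhere <- sse_loss(150, 0.2, sim_w, sim_h)  # an arbitrary other line

c(ols = sse_at_ols, elsewhere = sse_elsewhere)
```

The value at the OLS solution is the same number that `deviance(fit_sim)` reports, and no other line can do better.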

Having defined this function, we could simply start fitting lots of curves and see which one fits the data best, i.e. minimises the loss function that we defined above. It would be tedious to do by hand, but you can ask the computer to do it. I have prepared this for you below. As you will see, it fits so many curves on the graph that the dots become invisible: lots of red lines cover it like a carpet!

This approach is akin to the gradient descent method (where partial differentials are used, see below), but in a brute force sort of way.

So I am basically taking the formula $$y_i = \beta_0 + \beta_1.xi$$

and generate a crazy number of curves.

```{r}


intercept <- seq(from = 140 , to = 160, by = .5)
slopes <- seq(from = 0 , to = 1, by = 0.025)
combos_inter_slope <- data.frame(expand.grid(intercept, slopes)) # create the combos
y_pred <- list()

 for(i in 1:nrow(combos_inter_slope)){
  

    y_pred[[i]] <- combos_inter_slope[i,1] +  combos_inter_slope[i,2]*df_hse_wk7$Weight
    
  }
  
#df_pred_heights <-do.call(rbind, y_pred)
#colnames(df_pred_heights) <- "predicted_heights"


# true heights as dots
plot(df_hse_wk7$Weight, df_hse_wk7$Height, pch = 19, col = 'blue', 
     xlab = "Weight", ylab = "Height",
     main = "True Heights and Predicted Heights")

# looping to add predicted heights 
for (i in 1:length(y_pred)) {
  lines(df_hse_wk7$Weight, y_pred[[i]], col = rgb(1, 0, 0, alpha = 0.5))
}

#legend
legend("topright", legend = c("True Heights", "Predicted Heights"),
       col = c("blue", "red"), pch = c(19, NA), lty = c(NA, 1))

SSE <- 0
for(i in 1: length(y_pred)){
  
  SSE[i] <- sum((y_pred[[i]] - df_hse_wk7$Height)^2, na.rm = T)
  
  
}

print(combos_inter_slope[which.min(SSE),1])
print(combos_inter_slope[which.min(SSE),2])


```

The fantastic thing is that I get pretty much the same values for the slope and intercept by using this brute force method, namely 0.325 and 143 respectively.

You can think of minimising the loss function as an attempt to climb down a mountain, or rather to reach its bottom. Indeed, the loss function for two parameters is a surface in three-dimensional space (think about more parameters...). You need to find the position for each of the intercept and slope that minimises it, i.e. their best combination.

Here is the figure of a mountain in relief, for those of you who like mountaineering.

![](advanced_contours_guide_relief%20(1).jpg)

```{r}
library(plotly)

# Reshape SSE values into a matrix for the surface plot.
# plotly expects z[i, j] to be the value at y[i] (slope) and x[j] (intercept);
# expand.grid() varies the intercept fastest, so we fill the matrix row by row
SSE_matrix <- matrix(SSE, nrow = length(slopes), ncol = length(intercept), byrow = TRUE)

# Plot
fig_loss <- plot_ly(x = intercept, y = slopes, z = SSE_matrix, 
               type = "surface", colorscale = "Blues") %>%
  layout(scene = list(
    xaxis = list(title = "Intercept (β0)"),
    yaxis = list(title = "Slope (β1)"),
    zaxis = list(title = "Sum of Squared Errors (SSE)",
                 tickvals = pretty(range(SSE), 10))
  ))

# Show the plot
fig_loss
```

### A principled way of solving OLS: partial derivatives to help navigate rugged landscapes

### Calculate the partial derivatives and use them for regression

As we said above, we need to minimise the sum of squared errors, i.e. find its minimum. This is what differentiation does: in the case of one variable we use a derivative, in the case of two or more a partial derivative. Let's remind ourselves of the problem:

$$
\begin{aligned}
SSE & = \sum_{i=1}^n(y_i - \hat y_i)^2  \\
& = \sum_{i=1}^n(y_i - (\hat \beta_0 + \hat \beta_1.x_i))^2 \\
& = \sum_{i=1}^n(y_i - \hat \beta_0 - \hat \beta_1.x_i)^2
\end{aligned}
$$ {#eq-28}

Now let's take the derivative with respect to the intercept. By doing this, we will arrive at some expression that we will be able to set to zero and solve to find the minima.

$$
\begin{aligned}
{\frac{\partial SSE}{\partial \hat \beta_0}} = \frac{\partial(\sum (y_i - \hat \beta_0 - \hat \beta_1.x_i)^2)}{\partial \hat \beta_0}\\
\end{aligned}
$$

and for the slope

$$
\begin{aligned}
{\frac{\partial SSE}{\partial \hat \beta_1}} = \frac{\partial (\sum(y_i - \hat \beta_0 - \hat \beta_1.x_i)^2)}{\partial \hat \beta_1}\\
\end{aligned}
$$

and then solving each one of them by setting each equal to zero. We can derive this together in class and here is a link where this is done. It is fundamentally simple, but involves a few algebraic tricks to get to the end product which is the following for the intercept:

$$
\hat{\beta}_0 = \overline{y} - \hat{\beta}_1 \overline{x}
$$ {#eq-29}

and for the slope:

$$
\begin{aligned}
\hat \beta_1 = \frac{\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})} {\sum_{i=1}^{n} (x_i - \overline{x})^2} 
& = \frac{SXY}{SXX}
\end{aligned}
$$ {#eq-30}
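For readers who want the intermediate steps, here is a sketch of the standard textbook derivation in the notation used above (all sums run over $i = 1, \dots, n$). Setting the two partial derivatives to zero gives:

$$
\begin{aligned}
\frac{\partial SSE}{\partial \hat \beta_0} &= -2\sum (y_i - \hat \beta_0 - \hat \beta_1 x_i) = 0
\quad\Rightarrow\quad \hat \beta_0 = \overline{y} - \hat \beta_1 \overline{x} \\
\frac{\partial SSE}{\partial \hat \beta_1} &= -2\sum x_i (y_i - \hat \beta_0 - \hat \beta_1 x_i) = 0
\end{aligned}
$$

Substituting the expression for $\hat \beta_0$ into the second equation:

$$
\sum x_i \big( (y_i - \overline{y}) - \hat \beta_1 (x_i - \overline{x}) \big) = 0
\quad\Rightarrow\quad
\hat \beta_1 = \frac{\sum x_i (y_i - \overline{y})}{\sum x_i (x_i - \overline{x})}
= \frac{\sum (x_i - \overline{x})(y_i - \overline{y})}{\sum (x_i - \overline{x})^2}
$$

The last step uses the fact that deviations from the mean sum to zero, so $\sum \overline{x}(y_i - \overline{y}) = 0$ and $\sum \overline{x}(x_i - \overline{x}) = 0$, and subtracting these terms changes nothing.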

There are several things to marvel at here, including the importance of the mean in all this. These formulas are what your statistics programme computes when it fits a simple regression, and they allow you to do the maths yourself.

Indeed, let's see whether we can recover the slope, for example, through our simple formula:

```{r}
# to make your life easier I have created a subset of the data without
# missing values so you don't get hung up on missingness
complete_rows <- complete.cases(df_hse_wk7$Weight, df_hse_wk7$Height)
weight_no_missing <- df_hse_wk7$Weight[complete_rows]
height_no_missing <- df_hse_wk7$Height[complete_rows]

df_no_missing <- data.frame(
  weight_no_missing = weight_no_missing,
  height_no_missing = height_no_missing
)

lm(height_no_missing ~ weight_no_missing, df_no_missing)


mean_height <-     mean(height_no_missing)

mean_weight <- mean(weight_no_missing)

weight_difference <- weight_no_missing -
    mean_weight


height_difference <- height_no_missing - 
  mean_height

SXY <- sum(weight_difference*height_difference)
SXX  <- sum(weight_difference^2)

beta_1 <- SXY/SXX

beta_0 <- mean(height_no_missing) - beta_1*mean(weight_no_missing)

print(paste(beta_1, beta_0))


```
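The brute-force search and the closed-form solution are two ends of a spectrum. Gradient descent, mentioned earlier, sits in between: it uses the same partial derivatives, but takes many small steps downhill rather than solving for zero in one go. Below is a minimal sketch on simulated data. The data, the learning rate and the number of steps are illustrative assumptions of mine, the gradient is taken of the mean (rather than the sum) of squared errors to keep the step size manageable, and the predictor is centred to help convergence.

```{r}
set.seed(1974)
gd_x <- rnorm(200, mean = 75, sd = 12)           # simulated weights (kg)
gd_y <- 143 + 0.32 * gd_x + rnorm(200, sd = 6)   # simulated heights (cm)
gd_xc <- gd_x - mean(gd_x)                       # centred predictor

gd_b0 <- 0       # starting value for the intercept
gd_b1 <- 0       # starting value for the slope
gd_rate <- 0.003 # learning rate (step size), chosen by hand

for (gd_step in 1:5000) {
  gd_resid <- gd_y - (gd_b0 + gd_b1 * gd_xc)   # current errors
  gd_grad_b0 <- -2 * mean(gd_resid)            # partial derivative w.r.t. intercept
  gd_grad_b1 <- -2 * mean(gd_resid * gd_xc)    # partial derivative w.r.t. slope
  gd_b0 <- gd_b0 - gd_rate * gd_grad_b0        # step downhill
  gd_b1 <- gd_b1 - gd_rate * gd_grad_b1
}

c(gd_b0, gd_b1)           # gradient descent estimates
coef(lm(gd_y ~ gd_xc))    # closed-form OLS estimates for comparison
```

If the learning rate is too large the steps overshoot and the estimates diverge; if it is too small you need many more iterations. Try changing `gd_rate` to see both.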

## Week 8: Extending the regression model to Bayes

Last week we found the estimates for the intercept and slope by specifying the OLS model and by fitting thousands of curves using brute force. Here I am going to first show how to obtain the same estimates using maximum likelihood and then using Bayes.

### Using maximum likelihood and Bayes to find the slope and intercept

We can view this as a likelihood problem. The proper likelihood (see the explicit Stan code above) is here based on a normal probability density function. I had fitted something like this above, in Week 1, so let's return to the function I had written. Below you will find extensively commented code for how to find the likelihood and the posterior in Bayes.

```{r}

# this is the simple likelihood function, the one we have created above. 
likelihood_linear_regression <- function(beta0, beta1, x, y) { # the likelihood as above
  mu <- beta0 + beta1 * x  # predicted values
  log_likelihood <- sum(dnorm(y, mean = mu, sd = 1, log = TRUE))  
  return(log_likelihood)
}


prior_beta0 <- function(beta0) { # specifying the prior for the intercept here
  dnorm(beta0, mean = 170, sd = 10, log = TRUE)
}

prior_beta1 <- function(beta1) { # and for the slope
  dnorm(beta1, mean = 100, sd = 500, log = TRUE)
}

# here I am specifying values for the intercept and slope. This is my PARAMETER space
# the THETAS for the P(theta|y) from our first lesson
beta0_values <- seq(130, 170, by = 0.5) # the range for the intercept and slope, i.e. my parameter space. I have kept it intentionally very narrow
beta1_values <- seq(0.1, 0.4, by = 0.005) # kept this very narrow too to make things easier to see

# here all I do is create a grid to accommodate all the unique combinations of intercept and slope. 
grid_of_coefs <- expand.grid(beta0_values, beta1_values)
colnames(grid_of_coefs) <- c("beta0_values", "beta1_values")


# now here is a loop to iterate the above function over the range of parameter combos in the grid
log_likelihood <- 0

for(i in 1: nrow(grid_of_coefs)){
log_likelihood[i] <- likelihood_linear_regression(grid_of_coefs[i,1], 
                                                  grid_of_coefs[i,2], 
                                                  df_hse_wk7$Weight, df_hse_wk7$Height)

}

# attach the likelihood to the grid for ease
grid_of_coefs$log_likelihood <- log_likelihood

likelihood_solution <- grid_of_coefs[which.max(grid_of_coefs$log_likelihood),]# this tells you in which position of the grid you will find the max likelihood and gives you the row with the coefficients. 

###### You have just found the maximum likelihood!!!!!!!!


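### If all this makes sense, then let's go to ****Bayes*****.
## all we have to do is to add the log-priors to the log-likelihood. 


# here you multiply with the priors (on the log scale, multiplying becomes adding); see the functions I have created above to construct the priors
# note: the priors are evaluated at the grid columns, so that every row gets the prior for its own intercept and slope
grid_of_coefs$log_posterior <- grid_of_coefs$log_likelihood + prior_beta0(grid_of_coefs$beta0_values) + prior_beta1(grid_of_coefs$beta1_values)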


# now I need to do a trick to stabilise the numbers here. Not to worry
max_log_posterior <- max(grid_of_coefs$log_posterior) # these two steps prevent numerical underflow. Try them also without
grid_of_coefs$stabilised_log_posterior <- grid_of_coefs$log_posterior - max_log_posterior

# the posterior is the exponential of the log posterior
grid_of_coefs$posterior <- exp(grid_of_coefs$stabilised_log_posterior)

# and here I simply normalise, remember likelihood*prior/sum(likelihood*prior)
grid_of_coefs$posterior <- grid_of_coefs$posterior/sum(grid_of_coefs$posterior)

# now does my Bayes posterior seem plausible
Bayes_solution <- grid_of_coefs[which.max(grid_of_coefs$posterior),]


print(likelihood_solution) 
print(Bayes_solution)
```

Lo and behold! The likelihood function that we developed in Week 1 also approximates the intercept and slope, and so does Bayes.
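The stabilisation step in the chunk above deserves a word. Log-posteriors for thousands of observations are large negative numbers, and exponentiating them directly underflows to zero. Subtracting the maximum before exponentiating leaves the normalised posterior unchanged, because the constant cancels in the ratio. Here is a minimal sketch with made-up log-posterior values (the numbers are arbitrary and only chosen to be on a realistic scale).

```{r}
# made-up log-posterior values, on the scale you get with thousands of data points
demo_log_post <- c(-5000, -5001, -5003)

# naive approach: every term underflows to 0, so the normalisation is 0/0
demo_naive <- exp(demo_log_post) / sum(exp(demo_log_post))
demo_naive   # NaN NaN NaN

# stabilised approach: subtract the maximum first, then exponentiate and normalise
demo_stable <- exp(demo_log_post - max(demo_log_post))
demo_stable <- demo_stable / sum(demo_stable)
demo_stable  # proper probabilities that sum to 1
```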

### Recapping the linear regression model in its more complete Bayesian specification.

#### Specifying once again the Bayes model

Here I specify the two things we need for the Bayes formula, namely the priors and the likelihood. It is a more complete picture than what I simulated out of convenience above.

$$
\begin{aligned}
\text{The prior for the intercept is:} \quad \beta_0 & \sim \mathcal{N}(170, 10^2) \\
\text{The prior for the slope is:} \quad \beta_1 & \sim \mathcal{N}(0, 10^2) \\
\text{The prior for the stdev is:} \quad \sigma & \sim \mathcal{N}(0, 10^2)\; T[0,\,] \\
\text{The likelihood is:} \quad y & \sim \mathcal{N}(\beta_0 + \beta_1 x, \sigma^2)
\end{aligned}
$$

Note that there are other good priors for $\sigma$, including an inverse gamma or an exponential. The key is to use a distribution that does not let it go below zero, which here I achieve through the truncated normal.
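To get a feel for two of these options, the sketch below overlays a half-normal and an exponential density for $\sigma$ using base R. The scale of 10 and the rate of 0.1 are illustrative choices of mine, not recommendations.

```{r}
sigma_grid <- seq(0, 40, by = 0.1)

# half-normal: a normal(0, 10) truncated at zero has twice the normal density for sigma >= 0
dens_half_normal <- 2 * dnorm(sigma_grid, mean = 0, sd = 10)
# exponential with rate 0.1, i.e. with mean 10
dens_exponential <- dexp(sigma_grid, rate = 0.1)

plot(sigma_grid, dens_half_normal, type = "l", col = "blue",
     ylim = range(c(dens_half_normal, dens_exponential)),
     xlab = "sigma", ylab = "Prior density",
     main = "Two priors that keep sigma positive")
lines(sigma_grid, dens_exponential, col = "red")
legend("topright", legend = c("Half-normal(0, 10)", "Exponential(0.1)"),
       col = c("blue", "red"), lty = 1)
```

Both put all of their mass on positive values; they differ mainly in how quickly the density falls away for large $\sigma$.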

Here is the Stan code again

```{r}
stan_code_slope <- "
data {
  int<lower=0> N;  // number of observations
  vector[N] y;     // outcome variable
  vector[N] x;     // predictor variable
}
parameters {
  real beta_0;  // intercept
  real beta_1;   // slope
  real<lower=0> sigma;  // standard deviation, constrained to be positive
}
model {
  beta_0 ~ normal(170, 10);  // weakly informative prior for the intercept
  beta_1 ~ normal(0, 10);   // weakly informative prior for the slope
  sigma ~ normal(0, 10) T[0,];  // weakly informative prior for the standard deviation, truncated to be positive
  y ~ normal(beta_0 + beta_1 * x, sigma);  // likelihood
}
"

# Tell Stan where to find the N and Y 
  stan_data_slope <- list(
  N = nrow(df_hse_wk7),
  y = df_hse_wk7$Height,
  x = df_hse_wk7$Weight
)

  # and fit the model in stan
  fit_slope_stan_uncentered <- stan(
  model_code = stan_code_slope, 
  data = stan_data_slope,  
  chains = 2,  
  iter = 2000,  
  warmup = 500,  
  seed = 1974  
)

# Print summary of the model
summary(fit_slope_stan_uncentered)
```

#### Centering the predictor

As we discussed last time, the intercept is hard to interpret substantively once the predictor is entered into the model: it is the expected height of a person weighing 0 kg. You can overcome this by centering the predictor, i.e. subtracting the mean from each one of its values. Now the intercept is the expected value at the mean weight. Centering has another great advantage: it decorrelates intercept and slope, something which comes in very handy when you have complex models and collinearity is a threat.

Please see in the code below the effects of centering on decorrelation.

```{r}
df_hse_wk7$Weight_centered <- df_hse_wk7$Weight - mean(df_hse_wk7$Weight)

stan_code_slope <- "
data {
  int<lower=0> N;  // number of observations
  vector[N] y;     // outcome variable
  vector[N] x;     // predictor variable
}
parameters {
  real beta_0;  // intercept
  real beta_1;   // slope
  real<lower=0> sigma;  // standard deviation, constrained to be positive
}
model {
  beta_0 ~ normal(170, 10);  // weakly informative prior for the intercept
  beta_1 ~ normal(0, 10);   // weakly informative prior for the slope
  sigma ~ normal(0, 10) T[0,];  // weakly informative prior for the standard deviation, truncated to be positive
  y ~ normal(beta_0 + beta_1 * x, sigma);  // likelihood
}
"

# Tell Stan where to find the N and Y 
  stan_data_slope <- list(
  N = nrow(df_hse_wk7),
  y = df_hse_wk7$Height,
  x = df_hse_wk7$Weight_centered
)

  # and fit the model in stan
  fit_slope_stan <- stan(
  model_code = stan_code_slope, 
  data = stan_data_slope,  
  chains = 2,  
  iter = 2000,  
  warmup = 500,  
  seed = 1974  
)

# Print summary of the model
print(fit_slope_stan)
```

Notice the intercept now. Also, see the code chunk below where I demonstrate the decorrelation due to centering. The first step is to extract the posterior samples. **This is the all-important step in everything you do for inference, too.**

```{r}
# extracting posterior samples
# (rstan:: makes sure we do not accidentally call tidyr's extract(), which the tidyverse also loads)
posterior_samples_uncentered <- rstan::extract(fit_slope_stan_uncentered)
posterior_samples <- rstan::extract(fit_slope_stan)

# getting beta_0 and beta_1 for the centred and uncentred
beta_0_samples_uncentered <- posterior_samples_uncentered$beta_0
beta_1_samples_uncentered <- posterior_samples_uncentered$beta_1

beta_0_samples <- posterior_samples$beta_0
beta_1_samples <- posterior_samples$beta_1

# correlation between intercept and slope for the uncentered model...
cor(beta_0_samples_uncentered, beta_1_samples_uncentered)
# ...and for the centered model
cor(beta_0_samples, beta_1_samples)
#### Notice the huge difference

```
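The same decorrelation can be seen without Stan. The sketch below uses simulated data and `lm()` as a quick stand-in (both are my own assumptions for illustration) and looks at the correlation between the intercept and slope estimates, which `vcov()` and `cov2cor()` give us directly.

```{r}
set.seed(1974)
cen_x <- rnorm(500, mean = 75, sd = 12)            # simulated weights (kg)
cen_y <- 143 + 0.32 * cen_x + rnorm(500, sd = 6)   # simulated heights (cm)
cen_xc <- cen_x - mean(cen_x)                      # centred weights

cen_fit_raw <- lm(cen_y ~ cen_x)
cen_fit_centred <- lm(cen_y ~ cen_xc)

# correlation between the intercept and slope estimates
cov2cor(vcov(cen_fit_raw))[1, 2]      # strongly negative
cov2cor(vcov(cen_fit_centred))[1, 2]  # essentially zero

# the slope is unchanged; only the meaning of the intercept changes
coef(cen_fit_raw)
coef(cen_fit_centred)
```

Without centering, any line that is a bit too steep has to start a bit lower to still pass through the cloud of points, hence the negative correlation. With centering, the line pivots around the middle of the data and the two estimates no longer trade off.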

#### Obtaining credible intervals for the coefficients in the Bayesian model

How do you do inference on the coefficient? By sampling from the posterior. Here is how to do this and plot. Having extracted the posterior above, I can do some fun things.

Note: Argyris to compare this to the frequentist inference that requires the standard error.

```{r}


# credible intervals at 95%
cred_int_beta_0 <- quantile(beta_0_samples, probs = c(0.025, 0.975)) 
cred_int_beta_1 <- quantile(beta_1_samples, probs = c(0.025, 0.975))

# Plot distribution for beta_0
hist(beta_0_samples, breaks = 30, main = "Posterior distribution of beta_0", xlab = "beta_0", probability = TRUE)
abline(v = cred_int_beta_0, col = "red")


# Plot distribution for beta_1
hist(beta_1_samples, breaks = 30, main = "Posterior distribution of beta_1", xlab = "beta_1", probability = TRUE)
abline(v = cred_int_beta_1, col = "red", lwd = 2, lty = 2)


```

#### A quick sanity check for the Bayesian regression model

And here is a sanity check. We will do a lot more of these later when we delve into the Markov chains and MH sampling quite a bit more.

```{r}
# extract() shuffles the draws and merges the chains by default, so plotting the
# extracted samples against their index is not a true trace plot. rstan's
# traceplot() keeps the iteration order and shows each chain separately
rstan::traceplot(fit_slope_stan, pars = c("beta_0", "beta_1"))
```

### Interpretation of the outcome of the linear regression.

How do you interpret the output of the regression model (assuming for the moment that it ran correctly!)? Here are some questions to help.

-   Is there a quick way to test whether your model is a good representation of the data?

-   What are the units of the slope and what are those of the intercept? What are the units for your statement?

-   What do the credible intervals above mean?

-   How tall would you expect a person to be who is 90 kg of weight? How would you find that? Is this a prediction? How certain can you be about your finding?

### Is this a good model? Simulating from the data.

There are many ways to answer this question, including postulating some alternative model, e.g. a null. We will come to that. For the moment, how well does the model you generated compare to the observed data?

To do this, I will simulate new heights from the posterior draws of the model and compare them with the observed heights.

```{r}
beta_0_samples <- posterior_samples$beta_0
beta_1_samples <- posterior_samples$beta_1
sigma_samples <- posterior_samples$sigma

# for ease
x <- df_hse_wk7$Weight_centered
y <- df_hse_wk7$Height


# Simulate new Heights
set.seed(1974) 
num_samples <- length(beta_0_samples)
y_simulated <- matrix(NA, nrow = length(x), ncol = num_samples)

for (i in 1:num_samples) {
  y_simulated[, i] <- rnorm(length(x), mean = beta_0_samples[i] + beta_1_samples[i] * x, sd = sigma_samples[i]) ### please notice here the likelihood function!
}

# Calculate the mean and 95% credible intervals for the simulated y values
y_simulated_mean <- apply(y_simulated, 1, mean)
y_simulated_lower <- apply(y_simulated, 1, quantile, probs = 0.025)
y_simulated_upper <- apply(y_simulated, 1, quantile, probs = 0.975)
```

```{r}
# Plot the observed y values and the simulated y values with credible intervals
# sort by x first so that the lines are drawn from left to right
plot_order <- order(x)
plot(x, y, main = "Observed vs Simulated Heights", xlab = "Centered Weight", ylab = "Height", pch = 19, col = "blue")
lines(x[plot_order], y_simulated_mean[plot_order], col = "red", lwd = 2)
lines(x[plot_order], y_simulated_lower[plot_order], col = "red", lty = 2)
lines(x[plot_order], y_simulated_upper[plot_order], col = "red", lty = 2)

# Add legend
legend("topright", legend = c("Observed Heights", "Simulated Heights (mean)", "95% credible interval"), 
       col = c("blue", "red", "red"), lwd = c(1, 2, 1), lty = c(NA, 1, 2), pch = c(19, NA, NA))

```

This actually looks very pretty: our simulation shows that we can largely recover the data.

#### What are the units of the slope and what are those of the intercept? What are the units for your statement?

Well, height will be in cm and weight in kg, but what is the slope?

#### How tall do you expect a person weighing 90kg to be?

There is a simple answer to this:

```{r}
Height_90_kg <- 143.061 + 0.322*90 # note I have used here the coefficients from the first model without centering
Height_90_kg 

# if you want to do it with the centered data, centre the new value in the same way, i.e. subtract the sample mean weight

166.57 + 0.322*(90 - mean(df_hse_wk7$Weight))
```

But what would you say if I asked you how certain you are about this prediction?

For this you will need a predictive interval. In the Bayesian context this is achieved by plugging 90 kg into the various slopes and intercepts that you have obtained from the posterior, and then adding the residual noise given by $\sigma$.

```{r}
# get the posterior draws of sigma
sigma_samples <- posterior_samples$sigma

# Define the new x value, centered in the same way as the predictor in the model
weight_90 <- 90 - mean(df_hse_wk7$Weight) # see previous chunk


# Calculate the predicted y values for the new x
predicted_y <- beta_0_samples + beta_1_samples * weight_90

# Add noise according to the posterior sigma for predictive distribution
predicted_y_simulated <- rnorm(length(predicted_y), mean = predicted_y, sd = sigma_samples)

# Calculate the predictive mean and 95% predictive interval
predicted_mean <- mean(predicted_y_simulated)
predicted_lower <- quantile(predicted_y_simulated, probs = 0.025)
predicted_upper <- quantile(predicted_y_simulated, probs = 0.975)


# Plot the predictive distribution
hist(predicted_y_simulated, breaks = 30, main = paste("Predictive distribution for x =", 90), xlab = "Predicted y", probability = TRUE)
abline(v = predicted_lower, col = "red", lwd = 2, lty = 2)
abline(v = predicted_upper, col = "red", lwd = 2, lty = 2)
abline(v = predicted_mean, col = "blue", lwd = 2, lty = 1)
legend("topright", legend = c("Predictive mean Height", "95% predictive interval"), col = c("blue", "red"), lwd = 2, lty = c(1, 2))
```
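It helps to separate two kinds of uncertainty here: uncertainty about the *average* height of people weighing 90 kg (which comes from the intercept and slope only) and uncertainty about the height of *one new person* (which adds $\sigma$ on top). The sketch below uses made-up stand-ins for posterior draws, generated with `rnorm()`, purely so that it runs on its own. All the numbers are hypothetical; with the real model you would use the draws extracted from the Stan fit.

```{r}
set.seed(1974)
fake_n_draws <- 3000
# hypothetical stand-ins for posterior draws (NOT the output of the Stan model)
fake_beta_0 <- rnorm(fake_n_draws, mean = 166.5, sd = 0.2)
fake_beta_1 <- rnorm(fake_n_draws, mean = 0.32, sd = 0.01)
fake_sigma  <- abs(rnorm(fake_n_draws, mean = 8, sd = 0.15))
fake_weight_c <- 15   # e.g. 90 kg minus a hypothetical mean weight of 75 kg

# uncertainty about the mean height at this weight
fake_mu <- fake_beta_0 + fake_beta_1 * fake_weight_c
fake_interval_mean <- quantile(fake_mu, probs = c(0.025, 0.975))

# uncertainty about one new person's height at this weight
fake_new_person <- rnorm(fake_n_draws, mean = fake_mu, sd = fake_sigma)
fake_interval_pred <- quantile(fake_new_person, probs = c(0.025, 0.975))

fake_interval_mean   # narrow: we know the average quite well
fake_interval_pred   # much wider: individuals vary a lot around the average
```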

#### Let's try a different model: include a quadratic term

Why not specify a model with a quadratic term included? You often have to do this, particularly when you model processes over time. Let's see how to specify this in Stan.

```{r}

df_hse_wk7$wt_sq <- df_hse_wk7$Weight^2
df_hse_wk7$Weight_centered <- df_hse_wk7$Weight - mean(df_hse_wk7$Weight)
  df_hse_wk7$wt_sq_centered <- df_hse_wk7$Weight_centered^2


stan_code_slope_quad <- "
data {
  int<lower=0> N;  // number of observations
  vector[N] Height;     // outcome 
  vector[N] Weight_centered;     // predictor
  vector[N] wt_sq_centered; // quadratic predictor 
}

parameters {
  real beta_0;  // intercept
  real beta_1;   // coef x
  real beta_2; // coef x^2
  real<lower=0> sigma;  // standard deviation, constrained to be positive
}

model {
  beta_0 ~ normal(140, 10);  // weakly informative prior for the intercept
  beta_1 ~ normal(0, 10);   // weakly informative prior for the slope
  beta_2 ~ normal(0, 10); //
  sigma ~ normal(0, 10) T[0,];  // weakly informative prior for the standard deviation, truncated to be positive
  Height ~ normal(beta_0 + beta_1 * Weight_centered + beta_2*wt_sq_centered, sigma);  // likelihood
}

"
stan_data_slope_quad <- list(
  N = nrow(df_hse_wk7),
  Height = df_hse_wk7$Height,
  Weight_centered = df_hse_wk7$Weight_centered,
  wt_sq_centered = df_hse_wk7$wt_sq_centered
)


  # and fit the model in stan
  fit_slope_stan_quad <- stan(
  model_code = stan_code_slope_quad, 
  data = stan_data_slope_quad,  
  chains = 2,  
  iter = 2000,  
  warmup = 500,  
  seed = 1974  
)

# Print summary of the model
print(fit_slope_stan_quad)

# print lm
summary(lm(Height ~ Weight_centered + wt_sq_centered, df_hse_wk7))
```

#### Plot from the posterior for the quadratic

```{r}
posterior_samples_q <- rstan::extract(fit_slope_stan_quad)

# getting beta_0, beta_1, beta_2 and sigma
beta_0_samples_q <- posterior_samples_q$beta_0
beta_1_samples_q <- posterior_samples_q$beta_1
beta_2_samples_q <- posterior_samples_q$beta_2
sigma_samples_q <- posterior_samples_q$sigma

# credible intervals at 95%
cred_int_beta_0_q <- quantile(beta_0_samples_q, probs = c(0.025, 0.975)) 
cred_int_beta_1_q <- quantile(beta_1_samples_q, probs = c(0.025, 0.975))
cred_int_beta_2_q <- quantile(beta_2_samples_q, probs = c(0.025, 0.975))

# Plot distribution for beta_0
hist(beta_0_samples_q, breaks = 30, main = "Posterior distribution of beta_0", xlab = "beta_0", probability = TRUE)
abline(v = cred_int_beta_0_q, col = "red")
```

This is now a rough check that the model ran ok.

```{r}
# extract() shuffles the draws and merges the chains, so use rstan's traceplot()
# on the fitted object to see each chain in its iteration order
rstan::traceplot(fit_slope_stan_quad, pars = c("beta_0", "beta_1", "beta_2"))
```

#### Simulate the quadratic model

This is what we did above for the linear model. It is not too bad as you can see. Note the curvature.

```{r}

x_centered <- df_hse_wk7$Weight_centered # for ease--don't forget to use the centred values
y <- df_hse_wk7$Height

# Number of posterior samples
num_samples_q <- length(beta_0_samples_q)

# create a matrix to store simulated y values
y_simulated_q <- matrix(NA, nrow = length(x_centered), ncol = num_samples_q)

# Simulate y values from the posterior 
for (i in 1:num_samples_q) {
  y_simulated_q[, i] <- rnorm(length(x_centered), 
                              mean = beta_0_samples_q[i] + beta_1_samples_q[i] * x_centered + beta_2_samples_q[i] * x_centered^2, 
                              sd = sigma_samples_q[i])
}

# extract mean and 95% credible intervals for the simulated y values
y_simulated_mean_q <- apply(y_simulated_q, 1, mean)
y_simulated_lower_q <- apply(y_simulated_q, 1, quantile, probs = 0.025)
y_simulated_upper_q <- apply(y_simulated_q, 1, quantile, probs = 0.975)

# better to sort x values and corresponding simulated values for smooth plotting
sorted_indices <- order(x_centered)
x_sorted <- x_centered[sorted_indices]
y_simulated_mean_sorted <- y_simulated_mean_q[sorted_indices]
y_simulated_lower_sorted <- y_simulated_lower_q[sorted_indices]
y_simulated_upper_sorted <- y_simulated_upper_q[sorted_indices]

# observed y values and the simulated y values with credible intervals
plot(x_centered, y, main = "Observed vs Simulated Heights (Quadratic Model)", 
     xlab = "Centered Weight", ylab = "Height", pch = 19, col = "blue")
lines(x_sorted, y_simulated_mean_sorted, col = "red", lwd = 2)
lines(x_sorted, y_simulated_lower_sorted, col = "red", lty = 2)
lines(x_sorted, y_simulated_upper_sorted, col = "red", lty = 2)

legend("topright", legend = c("Observed Heights", "Predicted Heights (mean)", "95% Credible Interval"), 
       col = c("blue", "red", "red"), lwd = c(1, 2, 1), lty = c(NA, 1, 2), pch = c(19, NA, NA))

```

#### Compare the linear and the quadratic models

We are going to use three main metrics for this purpose.

First the **Mean Squared Error**. We encountered the sum of squared errors last week. This is simply its mean. Formally it is:

$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

Remember, the angular hat over the y in $\hat{y}$ denotes the predicted value.

Second the Mean Absolute Error. Remember, last week we spoke about the possibility of using absolute rather than squared values. Here it is as a metric:

$$
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
$$

The final one is the proportion of variance explained, expressed as $R^2$:

$$
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
$$

where $\bar{y}$ (the y with a straight bar on top) is the mean of the observed values. The denominator is the famous Total Sum of Squares, a key quantity in ANOVAs, for example. The numerator is well known to us already: it is the sum of squared errors.
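As a quick numerical check of the three formulas, here they are computed by hand on four made-up observed and predicted heights (the numbers are purely illustrative).

```{r}
# made-up observed and predicted values
toy_obs  <- c(160, 170, 180, 165)
toy_pred <- c(162, 168, 177, 168)

toy_mse <- mean((toy_obs - toy_pred)^2)     # (4 + 4 + 9 + 9) / 4 = 6.5
toy_mae <- mean(abs(toy_obs - toy_pred))    # (2 + 2 + 3 + 3) / 4 = 2.5
toy_r2  <- 1 - sum((toy_obs - toy_pred)^2) / sum((toy_obs - mean(toy_obs))^2)

c(MSE = toy_mse, MAE = toy_mae, R_squared = toy_r2)
```

Note that the MSE is in squared units (cm squared for heights), the MAE is in the original units, and $R^2$ has no units at all.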

Below I write a function to estimate these seamlessly.

```{r}

compare_models <- function(y_observed, x_centered, posterior_samples, posterior_samples_q) {
  
  # Predict y values for the linear model: one prediction per posterior draw,
  # then average over the draws for each person
  y_pred_linear <- apply(sapply(1:length(posterior_samples$beta_0), function(i) {
    posterior_samples$beta_0[i] + posterior_samples$beta_1[i] * x_centered
  }), 1, mean)
  
  # Predict y values for quadratic model
  y_pred_quad <- apply(sapply(1:length(posterior_samples_q$beta_0), function(i) {
    posterior_samples_q$beta_0[i] + posterior_samples_q$beta_1[i] * x_centered + posterior_samples_q$beta_2[i] * x_centered^2
  }), 1, mean)
  
   # MSE
  mse_linear <- mean((y_observed - y_pred_linear)^2)
  mse_quad <- mean((y_observed - y_pred_quad)^2)
  
  # MAE
  mae_linear <- mean(abs(y_observed - y_pred_linear))
  mae_quad <- mean(abs(y_observed - y_pred_quad))
  
  # R-squared
  ss_total <- sum((y_observed - mean(y_observed))^2)
  ss_res_linear <- sum((y_observed - y_pred_linear)^2)
  ss_res_quad <- sum((y_observed - y_pred_quad)^2)
  
  r_squared_linear <- 1 - (ss_res_linear / ss_total)
  r_squared_quad <- 1 - (ss_res_quad / ss_total)
  
  # Create a comparison data frame
  comparison <- data.frame(
    Metric = c("MSE", "MAE", "R-squared"),
    Linear_Model = c(mse_linear, mae_linear, r_squared_linear),
    Quadratic_Model = c(mse_quad, mae_quad, r_squared_quad)
  )
  
  return(comparison)
}

y <- df_hse_wk7$Height
x_centered <- df_hse_wk7$Weight_centered

# Compare models
comparison_result <- compare_models(y, x_centered, posterior_samples, posterior_samples_q)


print(comparison_result)
```

What do you think? Is either of the two better than the other? Have you heard of Occam's razor?

![](occam.jpeg)

#### Exercise 8.1

a\) Take the whole of the HSE sample, except for the data you have used already. It is simple to subset.

b\) Use the linear and quadratic models from above to predict the new data in two ways: i) by constructing predictive values out of the standard equation for each; ii) by incorporating the uncertainty (of sampling and of the posterior) into it, as I have shown above for a single value.

c\) Compare the performance of the two models on the basis of the MSE, MAE and R-squared values.
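As a starting point for parts a) to c), here is a minimal sketch of the hold-out logic on simulated data. It uses `lm()` as a quick stand-in for the Stan models, and the data, the split size and all the `ho_` names are my own assumptions; in the exercise you would use the HSE data and the posterior draws instead.

```{r}
set.seed(1974)
ho_n <- 1000
ho_weight <- rnorm(ho_n, mean = 75, sd = 12)               # simulated weights (kg)
ho_height <- 143 + 0.32 * ho_weight + rnorm(ho_n, sd = 6)  # simulated heights (cm)
ho_data <- data.frame(weight = ho_weight, height = ho_height)

# a) split into the rows used for fitting and the rest
ho_train_rows <- sample(ho_n, size = 300)
ho_train <- ho_data[ho_train_rows, ]
ho_test  <- ho_data[-ho_train_rows, ]

# b-i) predict the held-out heights from the standard equation of each model
ho_fit_lin  <- lm(height ~ weight, data = ho_train)
ho_fit_quad <- lm(height ~ weight + I(weight^2), data = ho_train)
ho_pred_lin  <- predict(ho_fit_lin,  newdata = ho_test)
ho_pred_quad <- predict(ho_fit_quad, newdata = ho_test)

# c) out-of-sample MSE for each model
ho_mse <- c(linear = mean((ho_test$height - ho_pred_lin)^2),
            quadratic = mean((ho_test$height - ho_pred_quad)^2))
ho_mse
```

The key point is that the models never see the held-out rows while being fitted, so the errors on those rows tell you how well each model generalises.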

```{r}

# this is the simple likelihood function, the one we have created above. 
likelihood_linear_regression <- function(beta0, beta1, x, y) { # the likelihood as above
  mu <- beta0 + beta1 * x  # predicted values
  log_likelihood <- sum(dnorm(y, mean = mu, sd = 1, log = TRUE))  
  return(log_likelihood)
}


prior_beta0 <- function(beta0) { # specifying the prior for the intercept here
  dnorm(beta0, mean = 170, sd = 10, log = TRUE)
}

prior_beta1 <- function(beta1) { # and for the slope
  dnorm(beta1, mean = 100, sd = 500, log = TRUE)
}

# here I am specifying values for the intercept and slope. This is my PARAMETER space
# the THETAS for the P(theta|y) from our first lesson
beta0_values <- seq(130, 170, by = 0.5) # the range for the intercept and slope, i.e. my parameter space. I have kept it intentionally very narrow
beta1_values <- seq(0.1, 0.4, by = 0.005) # kept this very narrow too to make things easier to see

# here all I do is create a grid to accommodate all the unique combinations of intercept and slope. 
grid_of_coefs <- expand.grid(beta0_values, beta1_values)
colnames(grid_of_coefs) <- c("beta0_values", "beta1_values")


# now here is a loop to iterate the above function over the range of parameter combos in the grid
log_likelihood <- 0

for(i in 1: nrow(grid_of_coefs)){
log_likelihood[i] <- likelihood_linear_regression(grid_of_coefs[i,1], 
                                                  grid_of_coefs[i,2], 
                                                  df_hse_wk7$Weight, df_hse_wk7$Height)

}

# attach the likelihood to the grid for ease
grid_of_coefs$log_likelihood <- log_likelihood

likelihood_solution <- grid_of_coefs[which.max(grid_of_coefs$log_likelihood),]# this tells you in which position of the grid you will find the max likelihood and gives you the row with the coefficients. 

# You have just found the maximum likelihood. 


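### If all this makes sense, then let's go to Bayes.
## all we have to do is to add the log-priors to the log-likelihood. 


# here you multiply with the priors (on the log scale, multiplying becomes adding); see the functions I have created above to construct the priors
# note: the priors are evaluated at the grid columns, so that every row gets the prior for its own intercept and slope
grid_of_coefs$log_posterior <- grid_of_coefs$log_likelihood + prior_beta0(grid_of_coefs$beta0_values) + prior_beta1(grid_of_coefs$beta1_values)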


# now I need to do a trick to stabilise the numbers here. Not to worry
max_log_posterior <- max(grid_of_coefs$log_posterior) # these two steps prevent numerical underflow. Try them also without
grid_of_coefs$stabilised_log_posterior <- grid_of_coefs$log_posterior - max_log_posterior

# the posterior is the exponential of the log posterior
grid_of_coefs$posterior <- exp(grid_of_coefs$stabilised_log_posterior)

# and here I simply normalise, remember likelihood*prior/sum(likelihood*prior)
grid_of_coefs$posterior <- grid_of_coefs$posterior/sum(grid_of_coefs$posterior)

# now does my Bayes posterior seem plausible
Bayes_solution <- grid_of_coefs[which.max(grid_of_coefs$posterior),]


print(likelihood_solution) 
print(Bayes_solution)


```

## Week 9: Logistic Regression

Now imagine that instead of wanting to predict height, you wanted to predict whether someone will be "tall" or not. We will define "tall" in the HSE dataset to be anything at or above the median height (see code chunk below).

```{r}
df_hse_wk7$Height_dich <- ifelse (df_hse_wk7$Height <median(df_hse_wk7$Height) , 0, 1)

df_hse_wk7$Height_dich <- factor(df_hse_wk7$Height_dich, 
levels = c(0,1), 
labels = c("not_tall", "tall"))
```

As you recall, we had fitted an OLS model to predict the Height from Weight. The model had the form:

$$y_i = \beta_0 + \beta_1.x_i + \text{error}$$

What is different now is that $y_i$ is no longer a "continuous" quantity; it is binary, in this case 0 or 1, "not tall" vs "tall".

In order to decide how to handle this situation, let's remind ourselves of the graph of height and weight.

```{r}
plot(df_hse_wk7$Weight, df_hse_wk7$Height, pch = 19, col = 'blue', 
     xlab = "Weight", ylab = "Height",
     main = "Heights vs Weights and a regression prediction")
abline(lm(Height ~ Weight, data = df_hse_wk7), col = "red")
```

Now what happens if we have a binary outcome?

```{r}
plot(df_hse_wk7$Weight, df_hse_wk7$Height_dich, pch = 19, col = 'blue', 
     xlab = "Weight", ylab = "Tallness",
     main = "Tallness and Weight")

```
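One way to see why a straight line is a poor fit for a 0/1 outcome is to fit one anyway and look at its predictions. The sketch below does this on simulated data (the data and the `lpm_` names are illustrative assumptions of mine).

```{r}
set.seed(1974)
lpm_weight <- rnorm(500, mean = 75, sd = 12)                # simulated weights (kg)
lpm_height <- 143 + 0.32 * lpm_weight + rnorm(500, sd = 6)  # simulated heights (cm)
lpm_tall <- as.numeric(lpm_height >= median(lpm_height))    # 0 = not tall, 1 = tall

# an ordinary straight-line regression on the 0/1 outcome
lpm_fit <- lm(lpm_tall ~ lpm_weight)

# predictions for a very light, an average and a very heavy person
lpm_new <- data.frame(lpm_weight = c(30, 75, 130))
lpm_pred <- predict(lpm_fit, newdata = lpm_new)
lpm_pred
```

The prediction for the very light person is below 0 and the one for the very heavy person is above 1, neither of which makes sense if we want to read the predictions as probabilities.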

For the moment, let's run the logistic regression

```{r}
glm(Height_dich ~ Weight, data = df_hse_wk7, family = binomial)
```

But what does this mean? We need to understand this below! We will then return to this model.

Our task is to answer the question: how likely is a person of weight = X to be "tall"? This is a probability question and probabilities are constrained between 0 and 1 or 0% and 100%. The question is then, how do we turn a predictive model into one that uses probabilities?

Enter the logistic function, a wondrous device that can squeeze any real value into a predefined range.

$$\sigma(t) = \frac{1}{1 + \exp(-a \cdot t)}$$ {#eq-9.1}

```{r}
t <- -10:10 # the x variable equivalent
a <- 1 # a slope coefficient
probs <- 1/(1 + exp(-a*t)) # the logistic function
plot(probs)
```

### Exercise 9.1

Play around with the code chunk above. a) vary the slope coefficient.

```{r}
t <- -10:10 # the x variable equivalent
a <- seq(0.1, 3, by = 0.5) # a slope coefficient
probs <- list() 
for(i in 1:length(a)){
  
  probs[[i]] <- 1/(1 + exp(-a[i]*t)) # the logistic function
  plot(probs[[i]])
  title(a[i])
  
}


```

b)  vary the maximum range, or carrying capacity.

```{r}
t <- -10:10 # the x variable equivalent
a <- 1 # a slope coefficient
numerator <- 1:10
probs <- list()
for(i in 1: length(numerator)){
probs[[i]] <- numerator[i]/(1 + exp(-a*t)) # the logistic function
plot(probs[[i]])
}
```

We can therefore use the logistic function to run our regression. All we have to do to obtain probabilities for each value of x is the following:

$$ P(x) = \frac{1}{1 + \exp(-(\beta_0 + \beta_1 \cdot x))}$$ {#eq-9.2}
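
To make equation 9.2 concrete, here is a minimal sketch that evaluates it for a few values of x. The coefficients and x-values are hypothetical numbers that I picked for illustration; they are not estimates from the HSE data. The sketch also shows that base R's `plogis()` computes the same logistic function.

```{r}
# Hypothetical coefficients, chosen only for illustration (not estimated from the HSE data)
demo_beta_0 <- -4
demo_beta_1 <- 0.05
demo_x <- seq(40, 120, by = 20)  # hypothetical predictor values

# Equation 9.2 written out by hand
demo_p_manual <- 1 / (1 + exp(-(demo_beta_0 + demo_beta_1 * demo_x)))

# The same quantity via base R's logistic distribution function
demo_p_plogis <- plogis(demo_beta_0 + demo_beta_1 * demo_x)

data.frame(x = demo_x, p_manual = demo_p_manual, p_plogis = demo_p_plogis)
```

Whatever value the linear part takes, the result always lies between 0 and 1, which is exactly what we need for a probability.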

Now you will ask two important questions, of course. The first one is: where does the logistic function even come from? The second is, what am I supposed to do with this?

### Population Growth and the logistic equation

The logistic is one form of sigmoid equation. Many real-life phenomena are best described using logistic equations, including the catalytic activity of enzymes. Population growth is arguably the most famous such area. In the 18th century people started becoming preoccupied with scarce resources and their allocation. It was Thomas Robert Malthus who wrote his famous Essay on the Principle of Population. He wrote that:

*Through the animal and vegetable kingdoms, nature has scattered the seeds of life abroad with the most profuse and liberal hand. ... The germs of existence contained in this spot of earth, with ample food, and ample room to expand in, would fill millions of worlds in the course of a few thousand years. Necessity, that imperious all pervading law of nature, restrains them within the prescribed bounds. The race of plants, and the race of animals shrink under this great restrictive law. And the race of man cannot, by any efforts of reason, escape from it. Among plants and animals its effects are waste of seed, sickness, and premature death. Among mankind, misery and vice.*

This seems to suggest that populations can grow when there is plenty but at some point population growth exceeds the capacity of production of goods and that's when the miserable bound is reached.

Mathematically, it seems that he only half formulated his idea, as the formalism that came out of his writings was exponential growth:

$$\frac {dP}{dt} = rP$$ {#eq-9.3}

where *P* is the population, *t* is time and *r* is a growth rate, or expressed differently, after solving the above differential equation:

$$P(t) = P_{0} e^{rt}$$ {#eq-9.4}

You may want to flex your calculus muscles and go ahead and solve the differential equation above using integration.

However, in a way, this equation is unsatisfactory because it does not include the key bit, which is the flattening of the population curve. This is where the Belgian mathematician Pierre François Verhulst comes in. Having read Malthus, he proposed:

$$\frac{dP}{dt} = rP \left(1 - \frac{P}{K}\right)$$ {#eq-9.5}

where K is the carrying capacity. See what happens when P approaches K: the term in brackets goes to zero and growth stops.

Here is the form that arises when solving the dynamic equation above:

$$ P(t) = \frac{K}{1 + CKe^{-rt}} = \frac{K}{1 + \left(\frac{K - P_{0}}{P_{0}}\right)e^{-rt}}
 $$ {#eq-9.6}

Now this looks very much like our sigmoid equation for the logistic regression, when K = 1!
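
If you would rather check the solution numerically than with calculus, the following sketch integrates equation 9.5 with small Euler steps and compares the result with the closed form in equation 9.6. The growth rate, carrying capacity, starting population and step size are all hypothetical values that I chose for illustration.

```{r}
# Hypothetical values, chosen only for illustration
verhulst_r  <- 0.5    # growth rate
verhulst_K  <- 100    # carrying capacity
verhulst_P0 <- 5      # starting population
verhulst_dt <- 0.01   # Euler step size
verhulst_time <- seq(0, 30, by = verhulst_dt)

# Euler integration of dP/dt = r * P * (1 - P / K)
verhulst_euler <- numeric(length(verhulst_time))
verhulst_euler[1] <- verhulst_P0
for (step in 2:length(verhulst_time)) {
  P_prev <- verhulst_euler[step - 1]
  verhulst_euler[step] <- P_prev + verhulst_dt * verhulst_r * P_prev * (1 - P_prev / verhulst_K)
}

# Closed-form solution (second form of equation 9.6)
verhulst_exact <- verhulst_K /
  (1 + ((verhulst_K - verhulst_P0) / verhulst_P0) * exp(-verhulst_r * verhulst_time))

plot(verhulst_time, verhulst_exact, type = "l", lwd = 2,
     xlab = "Time", ylab = "Population")
lines(verhulst_time, verhulst_euler, col = "red", lty = "dashed")
abline(h = verhulst_K, lty = "dotted")  # the carrying capacity

max(abs(verhulst_euler - verhulst_exact))  # largest gap between the two curves
```

The two curves should lie almost on top of each other and flatten out at K. Euler's method is the crudest possible integrator, so the small remaining gap shrinks further if you reduce the step size.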

This should answer the first question, i.e. where does the logistic equation come from.

Now to the second question: what are we supposed to do with the logistic equation.

### Probabilities, odds and odds ratios

Let the probability of an event be p = 3/10. Then the odds for the event happening are $O_f = 3/7$ and the odds against the event happening are $O_a = 7/3$.

More formally, the odds can be defined as:

$$ O = \frac {p}{1-p} $$ {#eq-9.7}

Verify this yourself:

$$ O_f = \frac{\frac{3}{10}}{\frac{7}{10}} = \frac{3}{7} $$
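
The same check can be done in R. This is a minimal sketch; the two helper functions `prob_to_odds()` and `odds_to_prob()` are names I made up for this example and do not come from any package.

```{r}
# Helper names below are my own, introduced only for illustration
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(o) o / (1 + o)

example_p <- 3 / 10
odds_for     <- prob_to_odds(example_p)      # odds for the event: 3/7
odds_against <- prob_to_odds(1 - example_p)  # odds against the event: 7/3

c(odds_for = odds_for, odds_against = odds_against)
odds_to_prob(odds_for)  # back to the probability we started with
```

Note that the odds for and the odds against are reciprocals of each other, and that going from odds back to probability recovers the 3/10.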

Now let's return to the logistic equation above:

$$ P(x) = \frac{1}{1 + \exp(-(\beta_0 + \beta_1 \cdot x))}$$

Below, I am showing you that you can arrive at a formulation that links the linear equation with the odds. Here it goes:

![](images/logistic_odds.jpg)
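
In case the image does not render, here are the usual steps written out. This is a sketch of the standard algebra rather than a transcription of the image; it starts from equation 9.2 and uses the shorthand $z = \beta_0 + \beta_1 x$, which I introduce only for brevity.

$$
\begin{aligned}
P &= \frac{1}{1 + e^{-z}} \\
1 - P &= \frac{e^{-z}}{1 + e^{-z}} \\
\frac{P}{1-P} &= \frac{1}{e^{-z}} = e^{z} \\
\ln\frac{P}{1-P} &= z = \beta_0 + \beta_1 x
\end{aligned}
$$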

I am rewriting the last few bits here as they are extremely important:

$$
\exp(\beta_0 + \beta_1 x) = \frac{P}{1-P} = \text{odds}
$$ {#eq-9.8}

This is the connection between the linear equation and the odds

and

$$
\beta_0 + \beta_1 x = \ln\left(\frac{P}{1-P}\right) = \text{log odds}
$$ {#eq-9.9}

and this is the logit, the link function, which connects the probability $P$ to the linear predictor through the log odds.

For a continuous variable $x$, the odds ratio (OR) can be interpreted as follows:

$$
OR = \frac{O(x+1)}{O(x)} = \frac{\frac{P(x+1)}{1-P(x+1)}}{\frac{P(x)}{1-P(x)}}= \frac{\exp(\beta_0 + \beta_1 (x+1))}{\exp(\beta_0 + \beta_1 x)} = \exp(\beta_1)
$$ {#eq-9.10}

In other words, the exponentiated coefficient $\exp(\beta_1)$ of the regression model can be interpreted as an odds ratio: the factor by which the odds are multiplied (increasing them if it is above 1, decreasing them if it is below 1) for a one-unit increase in the continuous predictor variable.

There is also the geometric interpretation due to Agresti, which we will also explain below.

## Week 10: Logistic Regression II

### Interpretation of the logistic output

We will now fit a logistic regression model to the data. We will initially fit it using R's inbuilt functions to estimate the parameters; next time we will show how to use maximum likelihood to estimate the parameters ourselves.

We will use this to aid the interpretation of the outputs of the model.

Remember what we said above: we fit the logistic model in analogy to the OLS model,

$$y_i = \beta_0 + \beta_1 x_i + \text{error}_i$$

Here is how to fit it.

```{r}
# run the logistic regression using inbuilt R funcs
logistic_model <-glm(Height_dich ~ Weight, data = df_hse_wk7, family = binomial)

summary(logistic_model)


```

We now have an estimate for the intercept and one for the slope. Let's see what these mean. I will extract the parameters below and then use them.

```{r}
logistic_model_params <- broom:: tidy(logistic_model)
beta_0 <- logistic_model_params[logistic_model_params$term=="(Intercept)",]$estimate  # the intercept
beta_1 <- logistic_model_params[logistic_model_params$term=="Weight",]$estimate


# probs_at_95kg <- 1/(1 + exp(-(beta_0 + beta_1*95))) # the logistic function
# probs_at_95kg
# 
# #How well does it agree with what we get from the data?
# #Read it from the data above 
# plot(sorted_weights, sorted_probabilities, col = "red", lwd = 2)
# abline(v=95, lty = "dashed" )
# # now an interpolation trick to read off the y-value at the intersection
# y_value_at_95 <- approx(sorted_weights, sorted_probabilities, xout = 95)$y
# abline(h = y_value_at_95, lty = "dashed" )

```

Once we have the parameters, we can use them to derive probabilities for each Weight value using equations 9.1 and 9.2 from last week, i.e. the sigmoid function with a carrying capacity of 1. Remember that equation 9.2 defines the probabilities in the regression model:

$$ P(x) = \frac{1}{1 + \exp(-(\beta_0 + \beta_1 \cdot x))}$$

so we can now substitute for each value of x and recover a sigmoid curve.

```{r}
# using the parameters from above
logistic_model_probabilities<- 1/(1 + exp(-(beta_0 + beta_1*df_hse_wk7$Weight)))


# Plot the probability to weight curve
plot(df_hse_wk7$Weight, logistic_model_probabilities, col = "blue", lwd = 2,
     ylab = "Probabilities", xlab = "Weights")

# the probabilities at which to mark the curve (25%, 50% and 75%)
y_values <- c(0.25, 0.5, 0.75)

# Loop through each y-value to calculate corresponding x-value and plot the lines
for (i in y_values) {
  
  # use this to interpolate the x-values
  x_value <- approx(logistic_model_probabilities, df_hse_wk7$Weight, xout = i)$y
  
  #  the horizontal line
  segments(x0 = min(df_hse_wk7$Weight), y0 = i, x1 = x_value, y1 = i, col = "red", lty = "dashed")
  
  # vertical line from the x-axis to the curve
  segments(x0 = x_value, y0 = min(logistic_model_probabilities), x1 = x_value, y1 = i, col = "red", lty = "dashed")
  
  # Mark the intersection point
  points(x_value, i, col = "red", pch = 19)
}


```

I have added the red lines to indicate the 25, 50 and 75% probabilities. The 50% line is particularly important in many applications of the sigmoid function, such as in toxicology, where it is used to describe how poisonous a substance is: it marks the dose at which there is a 50% probability that mice (or whatever experimental animal or cell line is used for this purpose) die after receiving the substance. This helps compare substances to each other. Here it denotes the weight at which there is a 50% probability that one will be tall.

As predicted, the slope of our tallness-probability-to-weight curve is relatively flat and the "sigmoidicity" of the curve is hardly recognisable.

I suggest you play around with this curve. Increase the $\beta_1$ parameter, for example.

```{r}
# increase the beta_1 by 50%
new_beta_1 <- 1.5*beta_1
new_beta_1_logistic_model_probabilities<- 1/(1 + exp(-(beta_0 + new_beta_1*df_hse_wk7$Weight)))

# plot the curve for the new slope
plot(df_hse_wk7$Weight, new_beta_1_logistic_model_probabilities, col = "blue", lwd = 2)

# the probabilities at which to mark the curve (25%, 50% and 75%)
y_values <- c(0.25, 0.5, 0.75)

# Loop through each y-value to calculate corresponding x-value and plot the lines
for (i in y_values) {
  
  # use this to interpolate the x-values
  x_value <- approx(new_beta_1_logistic_model_probabilities, df_hse_wk7$Weight, xout = i)$y
  
  #  the horizontal line
  segments(x0 = min(df_hse_wk7$Weight), y0 = i, x1 = x_value, y1 = i, col = "red", lty = "dashed")
  
  # vertical line from the x-axis to the curve
  segments(x0 = x_value, y0 = min(new_beta_1_logistic_model_probabilities), x1 = x_value, y1 = i, col = "red", lty = "dashed")
  
  # Mark the intersection point
  points(x_value, i, col = "red", pch = 19)
}

```

Notice the shift. The curve (due to the limited range of the x-values) has lost its sigmoid shape. Also, the x-values corresponding to our three landmark probabilities have shifted to the left on the x-axis.

You may want to play some more with this curve and have it shift to the left or right by adding to or subtracting from the x-values.

Now, you will wonder what the intercept means. Formally, it is the log odds of being tall at a weight of 0 kg. As nobody weighs 0 kg, it has no useful interpretation here; it is simply a parameter needed to fit our equation. However, let's see what happens when we centre our x-variable.

```{r}
# remember from our Week 8 lesson
df_hse_wk7$Weight_centred <- df_hse_wk7$Weight - mean(df_hse_wk7$Weight)

logistic_model_centred <-glm(Height_dich ~ Weight_centred, data = df_hse_wk7, family = binomial)

summary(logistic_model_centred)


```

Now the centred intercept looks different. For a start, it is a positive number. What does it correspond to?

Well, let's check what happens when we plot against centred weight.

```{r}
logistic_model_params_centred <- broom:: tidy(logistic_model_centred)
beta_0_centred <- logistic_model_params_centred[logistic_model_params_centred$term=="(Intercept)",]$estimate  # the intercept of the centred model
beta_1_centred <- logistic_model_params_centred[logistic_model_params_centred$term=="Weight_centred",]$estimate


# using the parameters from above
logistic_model_probabilities_centred <- 1/(1 + exp(-(beta_0_centred + beta_1_centred*df_hse_wk7$Weight_centred)))


# Plot the probability to weight curve
plot(df_hse_wk7$Weight_centred, logistic_model_probabilities_centred, col = "blue", lwd = 2,
     ylab = "Probabilities", xlab = "Centred weights")


```

We see that the intercept parameter now describes the person of average weight: it is the logit (the log odds) of being tall at the mean weight, and passing it through the logistic function gives the probability at the mean. As with the linear regression, think about what happens when the centred variable takes the value 0, i.e. where x is equal to the mean (in this case the mean weight). The linear predictor $\beta_0 + \beta_1 \cdot 0$ equals $\beta_0$, and therefore, because of equation 9.9, the intercept reported in the model output is the log odds at the mean of x (here at the mean weight).
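
Here is a small self-contained sketch of the same idea on simulated data, so that you can run it without the HSE dataset. All the numbers used to simulate the data (sample size, mean weight, the "true" coefficients) are made up for illustration, and the variable names are my own.

```{r}
set.seed(2024)

# Simulated data with made-up parameters, used only for illustration
sim_n <- 2000
sim_weight <- rnorm(sim_n, mean = 75, sd = 12)
sim_tall <- rbinom(sim_n, size = 1, prob = plogis(-4 + 0.055 * sim_weight))
sim_weight_centred <- sim_weight - mean(sim_weight)

sim_model_raw <- glm(sim_tall ~ sim_weight, family = binomial)
sim_model_centred <- glm(sim_tall ~ sim_weight_centred, family = binomial)

# Centring changes the intercept but leaves the slope untouched
coef(sim_model_raw)
coef(sim_model_centred)

# Probability of being tall at the mean weight, obtained in two ways
p_at_mean_from_intercept <- plogis(unname(coef(sim_model_centred)[1]))
p_at_mean_from_raw_model <- unname(predict(sim_model_raw,
  newdata = data.frame(sim_weight = mean(sim_weight)), type = "response"))

c(from_centred_intercept = p_at_mean_from_intercept,
  from_raw_model = p_at_mean_from_raw_model)
```

The two probabilities agree: the centred intercept, once passed through the logistic function, is simply the model's predicted probability for a person of mean weight.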

OK, we can now use the equations we have derived above to obtain the odds, the log odds and the odds ratios from the probabilities.

Let's get the odds first. As we recall from equations 9.7 and 9.8, we can obtain the odds from the probabilities (or by exponentiating the linear equation):

$$
\exp(\beta_0 + \beta_1 x) = \frac{P}{1-P} = \text{odds}
$$

```{r}
# getting the odds
logistic_model_odds <- exp(beta_0 + beta_1*df_hse_wk7$Weight)
```

Let's now get the log odds.

```{r}
# getting the log odds

logistic_model_log_odds <- beta_0 + beta_1*df_hse_wk7$Weight
```

Let's now put all these in a dataframe and play with them.

```{r}

df_probs_odds_log_odds <- data.frame(probs = logistic_model_probabilities,
odds = logistic_model_odds, 
log_odds = logistic_model_log_odds,
weight = df_hse_wk7$Weight)

head(df_probs_odds_log_odds)


```

Play around with this table. You should be able to get the odds from the probabilities and vice versa. For example, taking the first row, 0.58847/(1-0.58847) gives you \~ 1.43, which is the odds, as you can see next to it. Take its log and you get the log odds, or exponentiate the log odds (exp(0.3576623)) and you get the odds for a weight of 80kg.

OK, you will ask, what about the odds ratio? We have defined it in equation 9.10 as

$$
OR = \frac{O(x+1)}{O(x)} = \frac{\frac{P(x+1)}{1-P(x+1)}}{\frac{P(x)}{1-P(x)}}= \frac{\exp(\beta_0 + \beta_1 (x+1))}{\exp(\beta_0 + \beta_1 x)} = \exp(\beta_1)
$$

So, you should be able to get the odds ratio by simply exponentiating the coefficient $\beta_1$.

Let's do it:

```{r}
odds_ratio <- exp(beta_1)
odds_ratio
```

The interpretation of the odds ratio is notoriously difficult when it comes to continuous variables. It denotes the factor by which the odds change when x (here weight) increases by one unit or, as the formula above states, the odds at x+1 over the odds at x. Let's see whether that's true.

Below, I have kept the rows for weights of 70 and 71 kg, and of 80 and 81 kg, extracted the odds and built their ratios (obviously you can try it out for any weight change).

```{r}

odds_80 <- unique(df_probs_odds_log_odds[df_probs_odds_log_odds$weight==80,]$odds)

odds_81 <- unique(df_probs_odds_log_odds[df_probs_odds_log_odds$weight==81,]$odds)

odds_70 <- unique(df_probs_odds_log_odds[df_probs_odds_log_odds$weight==70,]$odds)

odds_71 <- unique(df_probs_odds_log_odds[df_probs_odds_log_odds$weight==71,]$odds)

round(odds_81/odds_80, 3) == round(odds_ratio, 3)
round(odds_71/odds_70, 3) == round(odds_ratio, 3)
```

And you can see that indeed, the odds ratio agrees with our exponentiation of the coefficient we got from the logistic equation for both cases.

We can now state that the odds at every one-step increase in x are the odds at the previous value scaled by the exponentiated coefficient, or that an **increase of one point on x multiplies the odds at the previous value by the odds ratio.**

In our case, the odds ratio is what you get for an increase of one kilogram: the odds of being tall increase by roughly 6%.

**Please note:** the 6% increase is relative to the previous value, so the increase over 10 kilograms, for example, is not 60%! Instead, if you wanted to find out the odds for an additional 10kg (and couldn't be bothered to read them from the table, or if you had to extrapolate), you would have to do the following:

$$
\text{odds}_n = \text{odds}_{\text{base}} \cdot OR^n
$$

where n is in this case the number of kilograms, hence for 10 kg

```{r}
calculated_odds_at_80 <- odds_70*odds_ratio^10
round(calculated_odds_at_80, 3) == round(odds_80, 3)

# and the rate increase is 
perc_increase_70_80 <- 100*((odds_80 - odds_70)/odds_70)
perc_increase_70_80
```
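
To see why the change over 10 kg is not simply ten times the change over 1 kg, here is a small sketch with a hypothetical odds ratio of 1.06. This is a round number standing in for the estimate above, not the fitted value itself.

```{r}
# Hypothetical odds ratio per kilogram, for illustration only
demo_or <- 1.06
kg <- 1:10

# Percentage change in odds if the 6% simply added up (wrong) ...
additive_change <- 100 * (demo_or - 1) * kg
# ... and when it compounds, as the formula odds_n = odds_base * OR^n says
compound_change <- 100 * (demo_or^kg - 1)

data.frame(kg, additive_change, compound_change)
```

With this odds ratio the odds increase by roughly 79% over 10 kg rather than 60%, for the same reason that compound interest outgrows simple interest.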

One last thing is the **geometric interpretation** of the change in probabilities, which is due to Alan Agresti in his book.

If you look at the probability curve, you can see that the change in probability differs at every weight. This rate of change is $\frac{\partial P}{\partial x}$; in other words, we are trying to find the slope of the tangent at every point of this curve. Differentiating the logistic function with respect to x gives the following expression:

$$
\frac{\partial P}{\partial x} = \beta_1 P(1-P)
$$

Here is the curve again.

```{r}
plot(df_hse_wk7$Weight, logistic_model_probabilities, col = "blue", lwd = 2,
      ylab = "Probabilities", xlab = "Weights")
```

And here is some playing around with the linear approximation.

```{r}
linear_approximation <- beta_1*logistic_model_probabilities*(1-logistic_model_probabilities)

df_probs_odds_log_odds$linear_approximation <- linear_approximation

# at what weight is the maximum change in probabilities
df_probs_odds_log_odds[which.max(df_probs_odds_log_odds$linear_approximation),]$weight


# the maximum is around 50% probability, as you might expect
df_probs_odds_log_odds[which.max(df_probs_odds_log_odds$linear_approximation),]$probs

#at what weight is the minimum change in probabilities
df_probs_odds_log_odds[which.min(df_probs_odds_log_odds$linear_approximation),]$weight
```
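
If you do not want to take the derivative formula on trust, you can check it numerically. The sketch below compares $\beta_1 P(1-P)$ with a finite-difference slope of the logistic curve. The coefficients and the grid of x-values are hypothetical and chosen only for illustration.

```{r}
# Hypothetical coefficients, for illustration only
tangent_beta_0 <- -4
tangent_beta_1 <- 0.05
tangent_x <- seq(40, 120, by = 1)

tangent_p <- plogis(tangent_beta_0 + tangent_beta_1 * tangent_x)

# Slope of the tangent according to the formula beta_1 * P * (1 - P)
slope_formula <- tangent_beta_1 * tangent_p * (1 - tangent_p)

# Slope of the tangent from a central finite difference
h <- 1e-5
slope_numeric <- (plogis(tangent_beta_0 + tangent_beta_1 * (tangent_x + h)) -
                  plogis(tangent_beta_0 + tangent_beta_1 * (tangent_x - h))) / (2 * h)

max(abs(slope_formula - slope_numeric))  # should be very close to zero

# The steepest point is where P = 0.5, i.e. at x = -beta_0 / beta_1
tangent_x[which.max(slope_formula)]
```

Because $P(1-P)$ is largest at $P = 0.5$, the slope there is $\beta_1/4$, which gives a handy rule of thumb for the largest change in probability per unit of x.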


## Week 11: Pairwise comparisons and the Bradley Terry model

Logistic regression is related to the Bradley-Terry model, which is a very useful way to analyse rankings, and therefore also choices. We are having this lesson because I think that this approach can be fruitful for us in terms of getting at people's choices and their latent preferences.

Economists and people in advertising use forced pairwise choices to assess people's preferences. I have created an example here and many of you have responded by choosing your favourite Cranach portrait out of each pair of portraits from a sample of five: <https://transatlantic-comppsych.github.io/cranach_comparison/cranach_pairwise_comparison.html>

When people have to choose between $n$ items (let's assume you don't allow ties), every item gets compared to every other item, with the total number of comparisons between items being given by the formula:

$$n_{comparisons} = \frac{n_{items} \cdot (n_{items}-1)}{2}$$
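
The formula is the binomial coefficient "n choose 2", which you can verify in R. In this sketch the item labels A to E are placeholders that I use instead of the portrait names.

```{r}
n_items_demo <- 5

# Number of pairings from the formula and from the binomial coefficient
n_items_demo * (n_items_demo - 1) / 2
choose(n_items_demo, 2)

# Enumerate the pairings themselves; the labels A to E are placeholders
demo_pairs <- t(combn(LETTERS[1:n_items_demo], 2))
colnames(demo_pairs) <- c("item_1", "item_2")
demo_pairs
```
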

So, for the 5 portraits we had 10 pairs presented. This results in a ranking of the same type as the one you see in sports leagues where every team gets to play every other team. Below I show such a ranking for the portraits:

```{r}

list_results_cranach <- list.files("/Users/argyris/Downloads/Cranach_results", all.files = T,
                                   pattern = ".csv", full.names = T)
list_df_results <- list()

for(i in 1: length(list_results_cranach)){
  
list_df_results[[i]] <- read.csv(list_results_cranach[i])

list_df_results[[i]]$id <- rep(i, nrow(list_df_results[[i]]))
  
}

df_results_cranach <- do.call(rbind, list_df_results)

# simplify name
df_results_cranach$Choice <- gsub("images/", "",
                            gsub(".jpg", "", df_results_cranach$Choice))

df_results_cranach$Pair <- gsub("images/",  "", 
                                gsub(".jpg","", df_results_cranach$Pair))

library(dplyr)
library(tidyr)
df_results_cranach <- df_results_cranach %>% 
  separate(Pair, into = c("Portrait_1", "Portrait_2"), sep = " vs ")

# table(df_results_cranach$Choice)


library(ggplot2)
plot_portraits <- df_results_cranach %>% 
  count(Choice) %>% 
  arrange(desc(n)) %>%  
  mutate(Choice = factor(Choice, levels = Choice)) %>%  
  ggplot(aes(x = Choice, y = n)) +
  geom_bar(stat = "identity") +  
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +  
  labs(x = "", y = "Number of Times Chosen", title = "Number of Wins for each Portrait")  


plot_portraits_perc <- df_results_cranach %>% 
  count(Choice) %>%  
  mutate(percentage = n / sum(n) * 100) %>%  
  arrange(desc(percentage)) %>%  
  mutate(Choice = factor(Choice, levels = Choice)) %>%  
  ggplot(aes(x = Choice, y = percentage)) +
  geom_bar(stat = "identity") +  
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  labs(x = "", y = "Percentage of Times Chosen", title = "Percentage of Wins for Each Portrait") 


```

First, let's look at the data overall:

```{r}
head(df_results_cranach)
```

I have wrangled the data so as to have each portrait and then a column with the winner (Choice). I have also recorded the position at which the chosen portrait appeared (to check for people who simply click one side across all comparisons), and also have the latencies, the time it took people to do each comparison (I should probably remove the first trial from all subjects as people were still orienting and my instructions weren't the best).

Since we have $n_{participants} =$ `r length(unique(df_results_cranach$id))`, we end up with $n_{participants}$ times 10 comparisons. We can see in the graph below that the Portrait of a Lady has the most wins across participants, i.e. it has the greatest number of "wins" in the league of five portraits. We can see that in this example female portraits fared better overall and that poor Johann Friedrich's portrait was the least preferred.

```{r}
plot_portraits_perc
```

We can also look at the data in a different way, namely by looking at the ranks within each participant (this is important, because given a sufficient number of choices, each person has their own league). How many times did each portrait top the league across people?

```{r}
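wins_per_person <- df_results_cranach %>% 
  group_by(id) %>% 
  count(Choice)  %>% 
  filter(n == max(n)) %>%   
  group_by(Choice) %>% 
    count(Choice)

# show how often each portrait topped a participant's league (ties count for each tied portrait)
wins_per_person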


```

We see that, as one would have expected, Portrait of a Lady tops the preferences if we look at each individual.

Now we will want to model these data. The Bradley-Terry model asks a simple question: how much more likely is it for item $i$ to be preferred over item $j$? Say, how much more likely is it for Portrait of a Lady to be preferred over Portrait of Martin Luther whenever they are presented next to each other? Formally, this can be expressed as:

$$
P(i>j) = \frac{P(i)}{P(i) + P(j)}
$$ {#eq-11.1}

So, let's look at our example of how many times Portrait of a Lady was preferred over Portrait of Martin Luther.

```{r}
portrait_L_ML <- c("Portrait_of_a_Lady" ,"Portrait_of_Martin_Luther")
portrait_L_ML_counts <- df_results_cranach %>% 
  filter(Portrait_1 %in% portrait_L_ML & Portrait_2 %in% portrait_L_ML) %>% 
  count(Choice)
knitr:: kable(portrait_L_ML_counts)
```

which means that Portrait of a Lady is preferred 15/17 times, that is, $P_{i>j} =$ 0.882.

One could do inference on this through a simple test of proportions (see also our first few lessons), which tests whether $P_i = P_j =0.5$ against the alternative of it not being equal to 0.5.

```{r}
n_lady <- portrait_L_ML_counts[portrait_L_ML_counts$Choice == "Portrait_of_a_Lady", ]$n
n_luther <- portrait_L_ML_counts[portrait_L_ML_counts$Choice == "Portrait_of_Martin_Luther", ]$n

n_lady_luther <- n_lady + n_luther  

binom.test(n_lady, n_lady_luther)
```

In a simple case of two choices without covariates and without any other considerations, the BT model will lead to a similar conclusion.

But let's run a simple BT model on all our data.

Let's reformulate the model as log odds (remember, odds is the probability of an event happening over its complement, i.e. the event not happening):

$$
\text{logit}\,P(i>j) = \log\frac{P(i>j)}{1-P(i>j)} = \log\frac{P(i>j)}{P(j>i)}
$$ {#eq-11.2}

The last step holds because, without ties, $1 - P(i>j) = P(j>i)$. The formulation of the BT model uses the log to find the coefficients, i.e.

$$
\text{logit}\,P(i>j) = \log\frac{P(i>j)}{P(j>i)}
$$

which because of 11.1 reduces to

$$
\text{logit}\,P(i>j) = \log\frac{P(i)}{P(j)}
$$

Now, the BT is formulated by expressing the log of the P(i) as a coefficient $\beta$

$$
\text{logit}\,P(i>j) = \log\frac{P(i)}{P(j)} = \log\frac{e^{\beta_i}}{e^{\beta_j}} = \beta_i - \beta_j
$$

If the middle part is difficult to follow, think of the following (which we know to be true because that's how $\log$ works):

$$
\log\frac{P(i)}{P(j)} = \log(P(i)) - \log(P(j)) = \beta_i - \beta_j
$$

Why use a $\beta$ to express this? There are two inter-related reasons. First, remember that we want to express all these models with exponential-family distributions as a generalised linear model, i.e. as something that involves the simple linear equation. The related reason is that the log probability is easier to handle in a model, not least because it can take on negative numbers and because we turn multiplicative relationships into additive ones (as shown above). For example, here we can say that the logit of the probability that item $i$ beats item $j$ is the difference between these two coefficients.

Also, we can derive an odds ratio (OR) when we exponentiate. To do this, start with the original equation:

$$
P(i>j) = \frac{P(i)}{P(i) + P(j)}
$$

we can simply substitute and get:

$$
P(i>j) = \frac{e^{\beta_i}}{e^{\beta_i} + e^{\beta_j}}
$$

So to get the OR (also see equation 11.2 above for the intermediate step):

$$
OR(i>j) = \frac{P(i>j)}{1 - P(i>j)} = \frac{P(i>j)}{P(j>i)} = \frac{\frac{e^{\beta_i}}{e^{\beta_i} + e^{\beta_j}}} {\frac{e^{\beta_j}}{e^{\beta_i} + e^{\beta_j}}} = \frac{e^{\beta_i}} {e^{\beta_j}} = e^{(\beta_i - \beta_j)} = \frac{P(i)}{P(j)}
$$

This can be interpreted as a regular odds ratio about how much more likely it is for $i$ to be chosen over $j$.

Finally, there is another important connection, that to the logistic model:

$$ P(i>j) = \frac{1}{1 + e^{-(\beta_i - \beta_j)}}
$$

And if we test this out below, we can see that the two approaches converge.

```{r}
p_i <- n_lady/(n_lady + n_luther) # the value for p_i from the data
exponent <- log(p_i) - log(1 -p_i) # the exponent of the denominator of the sigmoid


P_i_greater_j_sigmoid <- 1/(1 + exp(-exponent)) # the sigmoid

# test if equal to the simple probability, allowing for floating-point rounding
isTRUE(all.equal(P_i_greater_j_sigmoid, n_lady/(n_lady + n_luther)))


```

Now this may seem awfully trivial, but it allows us to estimate things nicely. Before we fit a BT model, let's plot how the difference in log strengths maps onto the probability of being preferred.

```{r}
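p_i <- seq(from = 0.05,  to = 0.95, by = 0.05) # stay away from 0 and 1, where the log odds are infinite
exponent <- log(p_i) - log(1 -p_i) # the log odds
Probabilities <- 1/(1 + exp(-exponent)) # the sigmoid recovers p_i

plot(exponent, Probabilities, xlab = "Log odds")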
```

I have done some simple data wrangling here which should facilitate our BT model running.

```{r}
df_results_cranach$merged_portraits <- paste(df_results_cranach$Portrait_1, df_results_cranach$Portrait_2)

df_results_cranach_BT <- df_results_cranach %>% 
  group_by(merged_portraits) %>% 
  count(Choice)

df_results_cranach_BT <- df_results_cranach_BT[duplicated(df_results_cranach_BT$merged_portraits),]

df_results_cranach_BT <- df_results_cranach_BT %>% 
  separate(merged_portraits, into = c("Portrait_1", "Portrait_2"), sep = " ")

total_n <- length(unique(df_results_cranach$id))

df_results_cranach_BT$win_1 <- ifelse(
     df_results_cranach_BT$Portrait_1  == df_results_cranach_BT$Choice, df_results_cranach_BT$n, total_n-df_results_cranach_BT$n)

df_results_cranach_BT$win_2 <- total_n - df_results_cranach_BT$win_1

df_results_cranach_BT <- df_results_cranach_BT %>% 
  dplyr::select(!c(Choice, n))


```

```{r}
library(BradleyTerry2) # needed for BTm()

# both player columns must be factors with the same set of levels
all_portraits <- unique(c(df_results_cranach_BT$Portrait_1, df_results_cranach_BT$Portrait_2))

df_results_cranach_BT$Portrait_1 <- factor(df_results_cranach_BT$Portrait_1, levels = all_portraits)
df_results_cranach_BT$Portrait_2 <- factor(df_results_cranach_BT$Portrait_2, levels = all_portraits)

# fit the Bradley-Terry model to the win counts of each pair
bt_model <- BTm(
  cbind(win_1, win_2), 
  player1 = Portrait_1, 
  player2 = Portrait_2, 
  data = df_results_cranach_BT
)

summary(bt_model)

# or, if you want a different reference category
summary(update(bt_model, refcat = "Portrait_of_a_Lady"))
```

```{r}
# Install the BradleyTerry2 package only if it is missing, then load it
if (!requireNamespace("BradleyTerry2", quietly = TRUE)) {
  install.packages("BradleyTerry2")
}
library(BradleyTerry2)

# Prepare the data
data <- data.frame(
  Portrait_1 = c("Portrait_of_Johann_Friedrich", "Portrait_of_Johann_Friedrich", 
                 "Portrait_of_Johann_Friedrich", "Portrait_of_Martin_Luther", 
                 "Portrait_of_Martin_Luther", "Portrait_of_a_Lady", 
                 "Portrait_of_a_Lady", "Portrait_of_a_Lady", 
                 "Portrait_of_a_Lady", "Portrait_of_a_Young_Girl"),
  Portrait_2 = c("Portrait_of_Martin_Luther", "Portrait_of_a_Saxon_Princess", 
                 "Portrait_of_a_Young_Girl", "Portrait_of_a_Saxon_Princess", 
                 "Portrait_of_a_Young_Girl", "Portrait_of_Johann_Friedrich", 
                 "Portrait_of_Martin_Luther", "Portrait_of_a_Saxon_Princess", 
                 "Portrait_of_a_Young_Girl", "Portrait_of_a_Saxon_Princess"),
  win_1 = c(8, 8, 1, 10, 4, 17, 16, 11, 10, 14),
  win_2 = c(10, 10, 17, 8, 14, 1, 2, 7, 8, 4)
)

# Combine all portrait names into a single vector to ensure factor levels are consistent
all_portraits <- unique(c(data$Portrait_1, data$Portrait_2))

# Convert Portrait_1 and Portrait_2 to factors with identical levels
data$Portrait_1 <- factor(data$Portrait_1, levels = all_portraits)
data$Portrait_2 <- factor(data$Portrait_2, levels = all_portraits)

# Verify that both columns are factors with identical levels
stopifnot(identical(levels(data$Portrait_1), levels(data$Portrait_2)))

# Fit the Bradley-Terry model
bt_model <- BTm(
  cbind(win_1, win_2), 
  player1 = Portrait_1, 
  player2 = Portrait_2, 
  data = data
)

# Summarize the model results
summary(bt_model)


```
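
The connection to logistic regression can also be made concrete. The sketch below refits the same win counts with base R's `glm()` instead of `BTm()`. It assumes a design matrix with +1 for the first portrait of a pair and -1 for the second, and it takes the first portrait in sorted order as the reference with ability 0. The object names (`bt_demo`, `bt_design` and so on) are my own, and I have not compared the output with `BTm()` here; in principle the two should agree up to the choice of reference category.

```{r}
# Win counts copied from the chunk above
bt_demo <- data.frame(
  Portrait_1 = c("Portrait_of_Johann_Friedrich", "Portrait_of_Johann_Friedrich",
                 "Portrait_of_Johann_Friedrich", "Portrait_of_Martin_Luther",
                 "Portrait_of_Martin_Luther", "Portrait_of_a_Lady",
                 "Portrait_of_a_Lady", "Portrait_of_a_Lady",
                 "Portrait_of_a_Lady", "Portrait_of_a_Young_Girl"),
  Portrait_2 = c("Portrait_of_Martin_Luther", "Portrait_of_a_Saxon_Princess",
                 "Portrait_of_a_Young_Girl", "Portrait_of_a_Saxon_Princess",
                 "Portrait_of_a_Young_Girl", "Portrait_of_Johann_Friedrich",
                 "Portrait_of_Martin_Luther", "Portrait_of_a_Saxon_Princess",
                 "Portrait_of_a_Young_Girl", "Portrait_of_a_Saxon_Princess"),
  win_1 = c(8, 8, 1, 10, 4, 17, 16, 11, 10, 14),
  win_2 = c(10, 10, 17, 8, 14, 1, 2, 7, 8, 4)
)

bt_items <- sort(unique(c(bt_demo$Portrait_1, bt_demo$Portrait_2)))

# One column per portrait: +1 if it is the first of the pair, -1 if the second, 0 otherwise
bt_design <- sapply(bt_items, function(item) {
  (bt_demo$Portrait_1 == item) - (bt_demo$Portrait_2 == item)
})

# Drop the first column so that this portrait is the reference (ability fixed at 0)
bt_X <- bt_design[, -1]
bt_glm <- glm(cbind(bt_demo$win_1, bt_demo$win_2) ~ bt_X - 1, family = binomial)

# Abilities (log strengths) for all portraits, reference included
bt_abilities <- c(0, coef(bt_glm))
names(bt_abilities) <- bt_items
bt_abilities

# Probability that i is preferred over j: the logistic function of beta_i - beta_j
p_lady_over_luther <- plogis(bt_abilities["Portrait_of_a_Lady"] -
                             bt_abilities["Portrait_of_Martin_Luther"])
unname(p_lady_over_luther)
```

Note that this model-based probability uses all ten pairings and will therefore generally differ a little from the raw proportion of wins in the single Lady versus Luther pairing.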

```{r}

# distributions <- function(n_teams, district_name, avg, std){
# 
# distributions_list <- list()
#   
# for(i in 1 :n_teams){
#   
# distributions_list[[i]] <-  
#   
#              data.frame(scores = rnorm(n_teams, avg[i], std ),
#              ids = paste0(district_name, "_", 1:length(n_teams)), 
#              disctrict = rep(district_name,length(n_teams) )
#              
#              
#              
#   )
# }
# }
# test <- distributions(10, "A", seq(from = 5, to = 12, length.out = 10), 1)

n_per_group <- 5
grps <- LETTERS[1:n_per_group]
avgs <- seq(10, 20, length.out = n_per_group)
std <- seq(2, 3, length.out = n_per_group)
distributions <- list()
for(i in 1:length(avgs)){
  
distributions[[i]] <- rnorm(n_per_group, avgs[i], std[i]) # one mean and one sd per group


}

names(distributions) <- grps
test <- data.frame(do.call(rbind, distributions))
test$group <- grps
test_long <- pivot_longer(test, cols = starts_with("X"), names_to = "groups", values_to = "score")
test_long$id <- paste0(test_long$group, "_", test_long$groups)
 
test_expanded <- expand.grid(school_X = test_long$id, school_Y = test_long$id)

test_expanded <- test_expanded %>% 
  filter(school_X != school_Y)

test_winners <- test_expanded %>% 
  left_join(test_long, by = c("school_X" = "id")) %>%
  rename(score_X = score) %>%
  left_join(test_long, by = c("school_Y" = "id")) %>%
  rename(score_Y = score) %>%
  mutate(
    win_1 = ifelse(score_X > score_Y, 1, 0),
    win_2 = ifelse(score_Y > score_X, 1, 0)
  ) %>%
  select(school_X, school_Y, win_1, win_2)

test_winners <- test_winners %>% 
  filter(row_number() %% 2 == 1)

```

<!-- ################### FoR THE MOMENT IRRELEVANT CODE -->

<!-- ```{r{r} -->

<!-- # The log-likelihood for different values of beta0 and beta1 REMIND ME TO TELL YOU ABOUT THE #PLAUSIBLE RANGE! -->

<!-- beta0_values <- seq(130, 160, by = 1)   -->

<!-- beta1_values <- seq(0, 1, by = 0.01)   -->

<!-- log_likelihood_values <- outer(beta0_values, beta1_values, # the outer function allows me to #have a vector o.  the two parameters, which you then pass each through the vectorised #function. This can be mind-boggling and I have created below an example with a simpler input #and function -->

<!--                                Vectorize(function(b0, b1) stan_data_slope(b0, b1, df_hse_wk7$Weight  , df_hse_wk7$Height))) -->

<!-- # The MLE for beta0 and beta1 -->

<!-- max_indices <- which(log_likelihood_values == max(log_likelihood_values), arr.ind = TRUE) -->

<!-- mle_beta0 <- beta0_values[max_indices[1]] -->

<!-- mle_beta1 <- beta1_values[max_indices[2]] -->

<!-- fig_likelihood <- plot_ly(x = beta0_values, y = beta1_values, z = log_likelihood_values,  -->

<!--                type = "surface", colorscale = "Blues") %>% -->

<!--   layout(scene = list( -->

<!--     xaxis = list(title = "Intercept (β0)", titlefont = list(size = 18, color = "black")), -->

<!--     yaxis = list(title = "Slope (β1)", titlefont = list(size = 18, color = "black")), -->

<!--     zaxis = list(title = "Log-Likelihood", titlefont = list(size = 18, color = "black")), -->

<!--     annotations = list( -->

<!--       list(x = 145, y = 0.5, z = max(log_likelihood_values), text = "Optimal Point",  -->

<!--            showarrow = TRUE, arrowhead = 2, ax = 20, ay = -40) -->

<!--     ) -->

<!--   )) -->

<!-- # Show the plot -->

<!-- fig_likelihood -->

<!-- ``` -->

<!-- ### The linear regression model as a Bayes model estimated through grid approximation. -->

<!-- We have run the Bayesian model in Stan and in brms above. We will discuss their specification further down. For the time being, let's fit the model in a simple way "by hand", that is, without the help of a specialised package. This is to get a feel for what is going on under the hood. -->

<!-- As you will have guessed, we use the likelihood function that we have developed and used above. And, in this Bayesian framework we also use two priors, one for the intercept and one for the slope. -->

<!-- ```{r} -->

<!-- likelihood_linear_regression <- function(beta0, beta1, x, y) { # the likelihood as above -->

<!--   mu <- beta0 + beta1 * x  # predicted values -->

<!--   log_likelihood <- sum(dnorm(y, mean = mu, sd = 1, log = TRUE))   -->

<!--   return(log_likelihood) -->

<!-- } -->

<!-- prior_beta0 <- function(beta0) { # specifying the prior for the intercept here -->

<!--   dnorm(beta0, mean = 170, sd = 10, log = TRUE) -->

<!-- } -->

<!-- prior_beta1 <- function(beta1) { # and for the slope -->

<!--   dnorm(beta1, mean = 0.5, sd = 0.1, log = TRUE) -->

<!-- } -->

<!-- beta0_values <- seq(130, 160, by = 1) # the range for the intercept and slope, i.e. my parameter space -->

<!-- beta1_values <- seq(0, 1, by = 0.01) -->

<!-- # Calculate the log-posterior ## see the addendum in week 1 about how to understand Vectorize if this is confusing. It can be for me, and I had to look it up. -->

<!-- log_posterior <- outer(beta0_values, beta1_values, Vectorize(function(b0, b1) { -->

<!--   log_likelihood <- likelihood_linear_regression(b0, b1, df_hse_wk7$Weight, df_hse_wk7$Height) -->

<!--   log_prior_beta0 <- prior_beta0(b0) -->

<!--   log_prior_beta1 <- prior_beta1(b1) -->

<!--   log_posterior <- log_likelihood + log_prior_beta0 + log_prior_beta1 -->

<!--   return(log_posterior) -->

<!-- })) -->

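<!-- # Convert to probability scale (subtracting the maximum first avoids numerical underflow) -->

<!-- posterior <- exp(log_posterior - max(log_posterior)) -->

<!-- # Normalize the posterior -->

<!-- posterior <- posterior / sum(posterior) # probabilities that sum to 1; sample() below treats them as one long vector -->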

<!-- # now I am going to sample from the posterior, like above.  -->

<!-- set.seed(1974) -->

<!-- samples <- sample(seq_along(posterior), size = length(posterior), replace = TRUE, prob = posterior)# this gives you positions on the vector -->

<!-- #now I need to translate the vector positions into something that can find the positions  -->

<!-- # on the specified beta0 and beta1 grid.  -->

<!-- # what I came up with is the following -->

<!-- # create the grid of the intercepts and slopes.  -->

<!-- # the number of rows is the length of the vector -->

<!-- array_to_find_values <- expand.grid(beta0_values, beta1_values) -->

<!-- df_to_find_values <- data.frame(table(samples)) # this tells me how many times each value appears, or rather each "position" -->

<!-- row_nums <- as.numeric(as.character(df_to_find_values$samples)) # easier to get them out of the dataframe. -->

<!-- num_times <- as.numeric(as.character(df_to_find_values$Freq)) -->

<!-- # create a loop to allocate the positions to the values. -->

<!-- sampled_intercepts <- list() -->

<!-- sampled_slopes <- list() -->

<!-- for(i in 1:nrow(df_to_find_values)){ -->

<!--  sampled_intercepts[[i]] <- rep(array_to_find_values[row_nums[i], 1], num_times[i]) -->

<!--  sampled_slopes[[i]] <- rep(array_to_find_values[row_nums[i], 2], num_times[i]) -->

<!-- } -->

<!--  sampled_intercepts <- unlist( sampled_intercepts) -->

<!--  sampled_slopes <- unlist( sampled_slopes) -->

<!-- mean(sampled_intercepts) -->

<!--  mean(sampled_slopes) -->

<!-- # and this is another way. -->

<!--  sampled_beta0 <- beta0_values[(samples - 1) %% length(beta0_values) + 1] -->

<!--  sampled_beta1 <- beta1_values[(samples - 1) %/% length(beta0_values) + 1] -->

<!-- ``` -->