

# Maximum likelihood {#ml}

This chapter deals with maximum likelihood estimation.

The students are expected to acquire the following knowledge:

- How to derive MLE.
- Applying MLE in R.
- Calculating and interpreting Fisher information.
- Practical use of MLE.

<style>
.fold-btn { 
  float: right; 
  margin: 5px 5px 0 0;
}
.fold { 
  border: 1px solid black;
  min-height: 40px;
}
</style>

<script type="text/javascript">
$(document).ready(function() {
  $folds = $(".fold");
  $folds.wrapInner("<div class=\"fold-blck\">"); // wrap a div container around content
  $folds.prepend("<button class=\"fold-btn\">Unfold</button>");  // add a button
  $(".fold-blck").toggle();  // fold all blocks
  $(".fold-btn").on("click", function() {  // add onClick event
    $(this).text($(this).text() === "Fold" ? "Unfold" : "Fold");  // if the text equals "Fold", change it to "Unfold"or else to "Fold" 
    $(this).next(".fold-blck").toggle("linear");  // "swing" is the default easing function. This can be further customized in its speed or the overall animation itself.
  })
});
</script>

```{r, echo = FALSE, warning = FALSE, message = FALSE}
togs <- T
library(ggplot2)
library(dplyr)
library(reshape2)
library(tidyr)
library(MASS)
library(ggforce)
library(mixtools)
library(ade4)
library(ggfortify)
library(numDeriv)

# togs <- FALSE
```

## Deriving MLE
```{exercise}


a. Derive the maximum likelihood estimator of variance for N$(\mu, \sigma^2)$.
b. Compare with results from \@ref(exr:cbest). What does that say about the MLE estimator?

```
<div class="fold">
```{solution, echo = togs}


a. The mean is assumed constant, so we have the likelihood
\begin{align}
  L(\sigma^2; y) &= \prod_{i=1}^n \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(y_i - \mu)^2}{2 \sigma^2}} \\
    &= \frac{1}{\sqrt{2 \pi \sigma^2}^n} e^{\frac{-\sum_{i=1}^n (y_i - \mu)^2}{2 \sigma^2}}
\end{align}
We need to find the maximum of this function. We first observe that we can replace $\frac{-\sum_{i=1}^n (y_i - \mu)^2}{2}$ with a constant $c$, since none of its terms depend on $\sigma^2$. Additionally, the factor $\frac{1}{\sqrt{2 \pi}^n}$ is a positive constant that does not affect the location of the maximum, so we drop it. We are left with maximizing
\begin{align}
  L(\sigma^2; y) &= (\sigma^2)^{-\frac{n}{2}} e^{\frac{c}{\sigma^2}}.
\end{align}
Differentiating we get
\begin{align}
 \frac{d}{d \sigma^2} L(\sigma^2; y) &= (\sigma^2)^{-\frac{n}{2}} \frac{d}{d \sigma^2} e^{\frac{c}{\sigma^2}} +  e^{\frac{c}{\sigma^2}} \frac{d}{d \sigma^2} (\sigma^2)^{-\frac{n}{2}} \\
  &= - (\sigma^2)^{-\frac{n}{2}} e^{\frac{c}{\sigma^2}} \frac{c}{(\sigma^2)^2} - e^{\frac{c}{\sigma^2}} \frac{n}{2} (\sigma^2)^{-\frac{n + 2}{2}} \\
  &= - (\sigma^2)^{-\frac{n + 4}{2}} e^{\frac{c}{\sigma^2}} c - e^{\frac{c}{\sigma^2}} \frac{n}{2} (\sigma^2)^{-\frac{n + 2}{2}} \\
  &= - e^{\frac{c}{\sigma^2}} (\sigma^2)^{-\frac{n + 4}{2}} \Big(c + \frac{n}{2}\sigma^2 \Big).
\end{align}
To get the maximum, this has to equal 0. The factor in front of the parentheses is never zero for $\sigma^2 > 0$, so
\begin{align}
  c + \frac{n}{2}\sigma^2 &= 0 \\
  \sigma^2 &= -\frac{2c}{n} \\
  \hat{\sigma}^2 &= \frac{\sum_{i=1}^n (y_i - \mu)^2}{n}.
\end{align}

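b. In practice $\mu$ is unknown and is replaced by its MLE, the sample mean $\bar{y}$. The resulting estimator $\frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})^2$ has expected value $\frac{n - 1}{n}\sigma^2$, so the MLE of the variance is biased: it underestimates $\sigma^2$ on average, although the bias vanishes as $n$ grows.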

```
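
The closed-form result can be sanity-checked numerically. The sketch below is only an illustration under assumed values (the names `mu_known`, `sigma2_true`, `n_obs` and the chosen numbers are not part of the exercise). It maximizes the normal log likelihood over $\sigma^2$ with _optimize()_, compares the result with the formula above, and then approximates the expected value of the estimator by simulation, once with the known mean and once with the sample mean plugged in.

```{r, echo = togs, message = FALSE, eval = togs, warning=FALSE}
set.seed(1)
mu_known    <- 2    # assumed known mean (illustrative value)
sigma2_true <- 4    # assumed true variance (illustrative value)
n_obs       <- 10

y_sim <- rnorm(n_obs, mean = mu_known, sd = sqrt(sigma2_true))

# negative log likelihood as a function of the variance s2
neg_loglik <- function (s2, y, mu) {
  -sum(dnorm(y, mean = mu, sd = sqrt(s2), log = TRUE))
}

num_mle    <- optimize(neg_loglik, interval = c(1e-6, 100),
                       y = y_sim, mu = mu_known, tol = 1e-8)$minimum
closed_mle <- mean((y_sim - mu_known)^2)
c(numerical = num_mle, closed_form = closed_mle)

# Monte Carlo approximation of the expected value of both estimators
var_sims <- replicate(5000, {
  y_rep <- rnorm(n_obs, mean = mu_known, sd = sqrt(sigma2_true))
  c(known_mean  = mean((y_rep - mu_known)^2),
    sample_mean = mean((y_rep - mean(y_rep))^2))
})
rowMeans(var_sims)
```

With the known mean the average should be close to $\sigma^2$, while plugging in the sample mean should give an average close to $\frac{n - 1}{n}\sigma^2$.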
</div>


```{exercise, name = "Multivariate normal distribution"}


a. Derive the maximum likelihood estimate for the mean and covariance matrix of the multivariate normal.

b. <span style="color:blue">Simulate $n = 40$ samples from a bivariate normal distribution (choose non-trivial parameters, that is, mean $\neq 0$ and covariance $\neq 0$). Compute the MLE for the sample. Overlay the data with an ellipse that is determined by the MLE and an ellipse that is determined by the chosen true parameters.</span>

c. <span style="color:blue">Repeat b. several times and observe how the estimates (ellipses) vary around the true value.</span>

Hint: For the derivation of MLE, these identities will be helpful: $\frac{\partial b^T a}{\partial a} = \frac{\partial a^T b}{\partial a} = b$, $\frac{\partial a^T A a}{\partial a} = (A + A^T)a$, $\frac{\partial \text{tr}(BA)}{\partial A} = B^T$, $\frac{\partial \ln |A|}{\partial A} = (A^{-1})^T$, $a^T A a = \text{tr}(a^T A a) = \text{tr}(a a^T A) = \text{tr}(Aaa^T)$.


```
<div class="fold">
```{solution, echo = togs}
The log likelihood of the MVN distribution is
\begin{align*}
  l(\mu, \Sigma ; x) &= -\frac{1}{2}\sum_{i=1}^n \Big(k\ln(2\pi) + \ln|\Sigma| + (x_i - \mu)^T \Sigma^{-1} (x_i - \mu)\Big) \\
    &= -\frac{n}{2}\ln|\Sigma| - \frac{1}{2}\sum_{i=1}^n(x_i - \mu)^T \Sigma^{-1} (x_i - \mu) + c,
\end{align*}
where $k$ is the dimension of $x_i$ and $c$ is a constant with respect to $\mu$ and $\Sigma$.
To find the MLE we first need to find partial derivatives. Let us start with $\mu$, using the symmetry of $\Sigma^{-1}$.
\begin{align*}
  \frac{\partial}{\partial \mu}l(\mu, \Sigma ; x) &= \frac{\partial}{\partial \mu} \Big(-\frac{1}{2}\sum_{i=1}^n \big(x_i^T \Sigma^{-1} x_i - x_i^T \Sigma^{-1} \mu - \mu^T \Sigma^{-1} x_i + \mu^T \Sigma^{-1} \mu \big)\Big) \\
  &= -\frac{1}{2}\sum_{i=1}^n \big(- \Sigma^{-1} x_i - \Sigma^{-1} x_i + 2 \Sigma^{-1} \mu \big) \\
  &= -\Sigma^{-1}\sum_{i=1}^n (\mu - x_i).
\end{align*}
Equating the above with zero, we get
\begin{align*}
  \sum_{i=1}^n (\mu - x_i) &= 0 \\
  \hat{\mu} &= \frac{1}{n} \sum_{i=1}^n x_i,
\end{align*}
which is the dimension-wise empirical mean. Now for the covariance matrix, where we differentiate with respect to $\Sigma^{-1}$ and use $\ln|\Sigma| = -\ln|\Sigma^{-1}|$:
\begin{align*}
  \frac{\partial}{\partial \Sigma^{-1}}l(\mu, \Sigma ; x) &= \frac{\partial}{\partial \Sigma^{-1}} \Big(-\frac{n}{2}\ln|\Sigma| - \frac{1}{2}\sum_{i=1}^n(x_i - \mu)^T \Sigma^{-1} (x_i - \mu)\Big) \\
  &= \frac{\partial}{\partial \Sigma^{-1}} \Big(\frac{n}{2}\ln|\Sigma^{-1}| - \frac{1}{2}\sum_{i=1}^n \text{tr}\big((x_i - \mu)^T \Sigma^{-1} (x_i - \mu)\big)\Big) \\
  &= \frac{\partial}{\partial \Sigma^{-1}} \Big(\frac{n}{2}\ln|\Sigma^{-1}| - \frac{1}{2}\sum_{i=1}^n \text{tr}\big(\Sigma^{-1} (x_i - \mu) (x_i - \mu)^T \big)\Big) \\
  &= \frac{n}{2}\Sigma - \frac{1}{2}\sum_{i=1}^n (x_i - \mu) (x_i - \mu)^T.
\end{align*}
Equating the above with zero and plugging in $\hat{\mu}$, we get
\begin{align*}
  \hat{\Sigma} = \frac{1}{n}\sum_{i=1}^n (x_i - \hat{\mu}) (x_i - \hat{\mu})^T.
\end{align*}


```
```{r, echo = togs, message = FALSE, eval = togs, warning=FALSE}
set.seed(1)
n     <- 40
mu    <- c(1, -2)
Sigma <- matrix(data = c(2, -1.6, -1.6, 1.8), ncol = 2)
X     <- mvrnorm(n = n, mu = mu, Sigma = Sigma)
colnames(X) <- c("X1", "X2")
X     <- as.data.frame(X)

# plot.new()
tru_ellip <- ellipse(mu, Sigma, draw = FALSE)
colnames(tru_ellip) <- c("X1", "X2")
tru_ellip <- as.data.frame(tru_ellip)

mu_est    <- apply(X, 2, mean)
tmp       <- as.matrix(sweep(X, 2, mu_est))
Sigma_est <- (1 / n) * t(tmp) %*% tmp

est_ellip <- ellipse(mu_est, Sigma_est, draw = FALSE)
colnames(est_ellip) <- c("X1", "X2")
est_ellip <- as.data.frame(est_ellip)

ggplot(data = X, aes(x = X1, y = X2)) +
  geom_point() +
  geom_path(data = tru_ellip, aes(x = X1, y = X2, color = "truth")) +
  geom_path(data = est_ellip, aes(x = X1, y = X2, color = "estimated")) +
  labs(color = "type")
```
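
For part c., one possible sketch is to repeat the simulation a few times and overlay the estimated ellipses. The helper `ellipse_points()` and the number of repetitions `n_rep` are hypothetical names introduced here. The helper computes the 95% probability contour from the Cholesky factor of the covariance matrix and is used instead of the `ellipse()` call above so that the block only needs **MASS** and **ggplot2**.

```{r, echo = togs, message = FALSE, eval = togs, warning=FALSE}
library(MASS)
library(ggplot2)

set.seed(2)
n_rep      <- 8     # number of repetitions (arbitrary choice)
n_samp     <- 40
mu_true    <- c(1, -2)
Sigma_true <- matrix(data = c(2, -1.6, -1.6, 1.8), ncol = 2)

# points on a probability contour of a bivariate normal (hypothetical helper)
ellipse_points <- function (mu, Sigma, level = 0.95, npoints = 200) {
  theta  <- seq(0, 2 * pi, length.out = npoints)
  circle <- rbind(cos(theta), sin(theta)) * sqrt(qchisq(level, df = 2))
  pts    <- t(mu + t(chol(Sigma)) %*% circle)
  data.frame(X1 = pts[ ,1], X2 = pts[ ,2])
}

# one MLE (mean and covariance) and one ellipse per repetition
rep_ellipses <- lapply(1:n_rep, function (r) {
  X_rep     <- mvrnorm(n = n_samp, mu = mu_true, Sigma = Sigma_true)
  mu_rep    <- colMeans(X_rep)
  X_cent    <- sweep(X_rep, 2, mu_rep)
  Sigma_rep <- (1 / n_samp) * t(X_cent) %*% X_cent
  cbind(ellipse_points(mu_rep, Sigma_rep), repetition = r)
})
rep_ellipses <- do.call(rbind, rep_ellipses)

ggplot() +
  geom_path(data = rep_ellipses,
            aes(x = X1, y = X2, group = repetition, color = "estimated"),
            alpha = 0.6) +
  geom_path(data = ellipse_points(mu_true, Sigma_true),
            aes(x = X1, y = X2, color = "truth")) +
  labs(color = "type")
```

The estimated ellipses scatter around the true one in position, size, and orientation, which reflects the sampling variability of $\hat{\mu}$ and $\hat{\Sigma}$ at $n = 40$.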
</div>


```{exercise, name = "Logistic regression", label = logisticmle}
Logistic regression is a popular discriminative model when our target variable is binary (categorical with 2 values). One way of looking at logistic regression is that it is linear regression, but instead of using the linear term as the mean of a normal RV, we use it as the mean of a Bernoulli RV. Of course, the mean of a Bernoulli is bounded on $[0,1]$, so, to avoid nonsensical values, we squeeze the linear term between 0 and 1 with the inverse logit function inv_logit$(z) =  1 / (1 + e^{-z})$. This leads to the following model: $y_i | \beta, x_i \sim \text{Bernoulli}(\text{inv_logit}(\beta x_i))$.


a. Explicitly write the likelihood function of beta.

b. <span style="color:blue">Implement the likelihood function in R. Use black-box box-constrained optimization (for example, _optim()_ with method L-BFGS-B) to find the maximum likelihood estimate for beta for $x$ and $y$ defined below.</span>

c. <span style="color:blue">Plot the estimated probability as a function of the independent variable. Compare with the truth.</span>

d. Let $y2$ be a response defined below. Will logistic regression work well on this dataset? Why not? How can we still use the model, without changing it?

```
```{r, message = FALSE, warning=FALSE}
inv_log <- function (z) {
  return (1 / (1 + exp(-z)))
}
set.seed(1)
x  <- rnorm(100)
y  <- rbinom(100, size = 1, prob = inv_log(1.2 * x))
y2 <- rbinom(100, size = 1, prob = inv_log(1.2 * x + 1.4 * x^2))


```
<div class="fold">
```{solution, echo = togs}
\begin{align*}
  l(\beta; x, y) &= \ln p(y | x, \beta) \\
    &= \ln\Big(\prod_{i=1}^n \text{inv_logit}(\beta x_i)^{y_i} (1 - \text{inv_logit}(\beta x_i))^{1 - y_i}\Big) \\
    &= \sum_{i=1}^n \Big(y_i \ln(\text{inv_logit}(\beta x_i)) + (1 - y_i) \ln(1 - \text{inv_logit}(\beta x_i))\Big).
\end{align*}

```
```{r, echo = togs, message = FALSE, eval = togs, warning=FALSE}
set.seed(1)
inv_log <- function (z) {
  return (1 / (1 + exp(-z)))
}

x <- rnorm(100)
y <- rbinom(100, size = 1, prob = inv_log(1.2 * x))

# Negative log likelihood (optim() minimizes by default).
# X holds one column per observation, so beta %*% X is the linear predictor.
l_logistic <- function (beta, X, y) {
  p    <- inv_log(as.vector(beta %*% X))
  logl <- -sum(y * log(p) + (1 - y) * log(1 - p))
  return(logl)
}

my_optim <- optim(par = 0.5, fn = l_logistic, method = "L-BFGS-B",
                  lower = 0, upper = 10, X = x, y = y)
my_optim$par

truth_p <- data.frame(x = x, prob = inv_log(1.2 * x), type = "truth")
est_p   <- data.frame(x = x, prob = inv_log(my_optim$par * x), type = "estimated")
plot_df <- rbind(truth_p, est_p)
ggplot(data = plot_df, aes(x = x, y = prob, color = type)) +
  geom_point(alpha = 0.3)

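# Note: 2000 responses are drawn for 100 x values, so prob is recycled and
# each x gets 20 responses; l_logistic() recycles its probabilities the same way.
y2 <- rbinom(2000, size = 1, prob = inv_log(1.2 * x + 1.4 * x^2))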
X2 <- cbind(x, x^2)
my_optim2 <- optim(par = c(0, 0), fn = l_logistic, method = "L-BFGS-B",
                   lower = c(0, 0), upper = c(2, 2), X = t(X2), y = y2)
my_optim2$par

tmp     <- sweep(data.frame(x = x, x2 = x^2), 2, my_optim2$par, FUN = "*")
tmp     <- tmp[ ,1] + tmp[ ,2]
truth_p <- data.frame(x = x, prob = inv_log(1.2 * x + 1.4 * x^2), type = "truth")
est_p   <- data.frame(x = x, prob = inv_log(tmp), type = "estimated")
plot_df <- rbind(truth_p, est_p)
ggplot(data = plot_df, aes(x = x, y = prob, color = type)) +
  geom_point(alpha = 0.3)

```
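
As a sanity check on the hand-written likelihood, the estimate from _optim()_ can be compared with R's built-in _glm()_. This is a sketch on data simulated as in the exercise. The formula `y_chk ~ x_chk - 1` drops the intercept so that the fitted model matches the one-parameter model above, and the `_chk` names are introduced here only to avoid overwriting variables used elsewhere.

```{r, echo = togs, message = FALSE, eval = togs, warning=FALSE}
set.seed(1)
inv_log <- function (z) {
  return (1 / (1 + exp(-z)))
}
x_chk <- rnorm(100)
y_chk <- rbinom(100, size = 1, prob = inv_log(1.2 * x_chk))

# negative log likelihood of the one-parameter logistic model
nll_chk <- function (beta, x, y) {
  p <- inv_log(beta * x)
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

beta_optim <- optim(par = 0.5, fn = nll_chk, method = "L-BFGS-B",
                    lower = 0, upper = 10, x = x_chk, y = y_chk)$par
beta_glm   <- coef(glm(y_chk ~ x_chk - 1, family = binomial))
c(optim = beta_optim, glm = unname(beta_glm))
```

The two estimates should agree up to the tolerance of the optimizer, since both maximize the same likelihood.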
</div>


```{exercise, name = "Linear regression"}
<span style="color:blue">For the data generated below, do the following:</span>
  
  
a. <span style="color:blue">Compute the least squares (MLE) estimate of coefficients beta using the exact matrix solution.</span>
b. <span style="color:blue">Compute the MLE by minimizing the sum of squared residuals using black-box optimization (_optim()_).</span>
c. <span style="color:blue">Compute the MLE by using the built-in linear regression (_lm()_). Compare the results of a-c with the true coefficients.</span>
d. <span style="color:blue">Compute 95% CI on the beta coefficients using the output of built-in linear regression.</span>
e. <span style="color:blue">Compute 95% CI on the beta coefficients by using (a or b) and the bootstrap with percentile method for CI. Compare with d.</span>

```
```{r, message = FALSE, warning=FALSE}
set.seed(1)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)

X <- cbind(x1, x2, x3)
beta <- c(0.2, 0.6, -1.2)

y <- as.vector(t(beta %*% t(X))) + rnorm(n, sd = 0.2)

```
<div class="fold">
```{r, echo = togs, message = FALSE, eval = togs, warning=FALSE}

set.seed(1)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)

X <- cbind(x1, x2, x3)
beta <- c(0.2, 0.6, -1.2)

y <- as.vector(t(beta %*% t(X))) + rnorm(n, sd = 0.2)
LS_fun <- function (beta, X, y) {
  return(sum((y - beta %*% t(X))^2))
}
my_optim <- optim(par = c(0, 0, 0), fn = LS_fun, lower = -5, upper = 5,
                  X = X, y = y, method = "L-BFGS-B")
my_optim$par


df <- data.frame(y = y, x1 = x1, x2 = x2, x3 = x3)
my_lm <- lm(y ~ x1 + x2 + x3 - 1, data = df)
my_lm

# matrix solution
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta_hat

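out <- summary(my_lm)
out$coefficients[ ,2]         # standard errors of the coefficients
confint(my_lm, level = 0.95)  # 95% CI from the built-in regression (part d)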

# bootstrap CI
nboot <- 1000
beta_boot <- matrix(data = NA, ncol = length(beta), nrow = nboot)
for (i in 1:nboot) {
  inds     <- sample(1:n, n, replace = T)
  new_df   <- df[inds, ]
  X_tmp    <- as.matrix(new_df[ ,-1])
  y_tmp    <- new_df[ ,1]
  # print(nrow(new_df))
  tmp_beta <- solve(t(X_tmp) %*% X_tmp) %*% t(X_tmp) %*% y_tmp
  beta_boot[i, ] <- tmp_beta
}
apply(beta_boot, 2, mean)
apply(beta_boot, 2, quantile, probs = c(0.025, 0.975))
confint(my_lm, level = 0.95)  # lm-based 95% CI, for comparison with the bootstrap

```
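
The exercise calls the least squares estimate the MLE. A minimal sketch of why, assuming independent normal errors with unknown standard deviation: maximizing the normal log likelihood over the coefficients and $\ln \sigma$ gives the same coefficients as the matrix solution, and the MLE of $\sigma^2$ is the mean squared residual. The function `nll_gauss` and the `_g` names are introduced here for illustration.

```{r, echo = togs, message = FALSE, eval = togs, warning=FALSE}
set.seed(1)
n_g    <- 100
X_g    <- cbind(x1 = rnorm(n_g), x2 = rnorm(n_g), x3 = rnorm(n_g))
beta_g <- c(0.2, 0.6, -1.2)
y_g    <- as.vector(X_g %*% beta_g) + rnorm(n_g, sd = 0.2)

# negative normal log likelihood; pars = (beta_1, beta_2, beta_3, log(sigma))
nll_gauss <- function (pars, X, y) {
  mu_hat <- as.vector(X %*% pars[1:3])
  -sum(dnorm(y, mean = mu_hat, sd = exp(pars[4]), log = TRUE))
}

fit_g <- optim(par = c(0, 0, 0, 0), fn = nll_gauss, X = X_g, y = y_g,
               method = "BFGS", control = list(maxit = 1000, reltol = 1e-12))

# closed-form least squares solution and mean squared residual
beta_ls   <- as.vector(solve(t(X_g) %*% X_g, t(X_g) %*% y_g))
sigma2_ls <- mean((y_g - as.vector(X_g %*% beta_ls))^2)

rbind(likelihood = fit_g$par[1:3], least_squares = beta_ls)
c(likelihood = exp(2 * fit_g$par[4]), mean_sq_resid = sigma2_ls)
```

Optimizing over $\ln \sigma$ rather than $\sigma$ keeps the standard deviation positive without box constraints.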
</div>


```{exercise, name = "Principal component analysis"}
<span style="color:blue">Load the _olympic_ data set from package _ade4_. The data show decathlon results for 33 men in the 1988 Olympic Games. This data set serves as a great example of finding the latent structure in the data, as there are certain characteristics of the athletes that make them excel at different events. For example, an explosive athlete will do particularly well in sprints and long jumps.</span>

a. <span style="color:blue">Perform PCA (_prcomp_) on the data set and interpret the first 2 latent dimensions. Hint: Standardize the data first to get meaningful results.</span>

b. <span style="color:blue">Use MLE to estimate the covariance of the standardized multivariate distribution.</span>

c. <span style="color:blue">Decompose the estimated covariance matrix with the eigendecomposition. Compare the eigenvectors to the output of PCA.</span>

```
<div class="fold">
```{r, echo = togs, message = FALSE, eval = togs, warning=FALSE}
data(olympic)

X        <- olympic$tab
X_scaled <- scale(X)
my_pca   <- prcomp(X_scaled)
summary(my_pca)

autoplot(my_pca, data = X, loadings = TRUE, loadings.colour = 'blue',
         loadings.label = TRUE, loadings.label.size = 3)

Sigma_est <- (1 / nrow(X_scaled)) * t(X_scaled) %*% X_scaled
Sigma_dec <- eigen(Sigma_est)

Sigma_dec$vectors
my_pca$rotation

```
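
The link between parts b. and c. can be checked on any numeric data set. The sketch below uses the built-in `iris` measurements instead of _olympic_ (an assumption made only to keep the block free of extra packages). The eigenvectors of the MLE covariance of the standardized data agree with the PCA rotation up to sign, and the eigenvalues equal the PCA variances scaled by $\frac{n - 1}{n}$, because _scale()_ and _prcomp()_ divide by $n - 1$ while the MLE divides by $n$.

```{r, echo = togs, message = FALSE, eval = togs, warning=FALSE}
X_iris   <- scale(as.matrix(iris[ ,1:4]))
n_iris   <- nrow(X_iris)
pca_iris <- prcomp(X_iris)

# MLE of the covariance of the standardized data and its eigendecomposition
Sigma_mle <- crossprod(X_iris) / n_iris
eig_mle   <- eigen(Sigma_mle)

# eigenvectors match the rotation up to the sign of each column
max(abs(abs(eig_mle$vectors) - abs(pca_iris$rotation)))

# eigenvalues match the PCA variances up to the factor (n - 1) / n
rbind(mle_eigenvalues = eig_mle$values,
      pca_variances   = pca_iris$sdev^2 * (n_iris - 1) / n_iris)
```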
</div>


## Fisher information
```{exercise}
Let us assume a Poisson likelihood.

a. Derive the MLE estimate of the mean.

b. Derive the Fisher information.

c. <span style="color:blue">For the data below compute the MLE and construct confidence intervals.</span>

d. <span style="color:blue">Use bootstrap to construct the CI for the mean. Compare with c) and discuss.</span>
```
```{r, message = FALSE, warning=FALSE}
x <- c(2, 5, 3, 1, 2, 1, 0, 3, 0, 2)

```
<div class="fold">
```{solution, echo = togs}


a. The log likelihood of the Poisson is
\begin{align*}
  l(\lambda; x) = \sum_{i=1}^n x_i \ln \lambda - n \lambda - \sum_{i=1}^n \ln x_i!
\end{align*}
Taking the derivative and equating with 0 we get
\begin{align*}
  \frac{1}{\hat{\lambda}}\sum_{i=1}^n x_i - n &= 0 \\
  \hat{\lambda} &= \frac{1}{n} \sum_{i=1}^n x_i.
\end{align*}
Since $\lambda$ is the mean parameter, this was expected.


b. For the Fisher information, we first need the second derivative of the log likelihood, which is

\begin{align*}
  \frac{\partial^2}{\partial \lambda^2} l(\lambda; x) = - \lambda^{-2} \sum_{i=1}^n x_i.
\end{align*}
Now taking the expectation of the negative of the above, we get
\begin{align*}
  E[\lambda^{-2} \sum_{i=1}^n x_i] &= \lambda^{-2} E[\sum_{i=1}^n x_i] \\
  &= \lambda^{-2} n \lambda \\
  &= \frac{n}{\lambda}.
\end{align*}

```
```{r, echo = togs, message = FALSE, eval = togs, warning=FALSE}
set.seed(1)
x          <- c(2, 5, 3, 1, 2, 1, 0, 3, 0, 2)
lambda_hat <- mean(x)
finfo      <- length(x) / lambda_hat
mle_CI     <- c(lambda_hat - 1.96 * sqrt(1 / finfo),
                lambda_hat + 1.96 * sqrt(1 / finfo))
boot_lambda <- c()
nboot       <- 1000
for (i in 1:nboot) {
  tmp_x          <- sample(x, length(x), replace = T)
  boot_lambda[i] <- mean(tmp_x)
}
boot_CI <- c(quantile(boot_lambda, 0.025),
             quantile(boot_lambda, 0.975))
mle_CI
boot_CI
```
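
The Fisher information derived above can be checked numerically. A minimal sketch, assuming the same data: the second derivative of the negative log likelihood at $\hat{\lambda}$ is approximated with a central finite difference (the step size `h` is an arbitrary choice) and compared with $\frac{n}{\hat{\lambda}}$. For the Poisson the two agree at the MLE, since $\sum_{i=1}^n x_i = n \hat{\lambda}$.

```{r, echo = togs, message = FALSE, eval = togs, warning=FALSE}
x_pois     <- c(2, 5, 3, 1, 2, 1, 0, 3, 0, 2)
lambda_mle <- mean(x_pois)

# negative log likelihood of the Poisson
nll_pois <- function (lambda, x) {
  -sum(dpois(x, lambda = lambda, log = TRUE))
}

# central finite-difference approximation of the second derivative
h        <- 1e-3
obs_info <- (nll_pois(lambda_mle + h, x_pois) -
             2 * nll_pois(lambda_mle, x_pois) +
             nll_pois(lambda_mle - h, x_pois)) / h^2

c(finite_difference = obs_info, formula = length(x_pois) / lambda_mle)
```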
</div>


```{exercise}


a. Find the Fisher information matrix for the Gamma distribution.

b. <span style="color:blue">Generate 20 samples from a Gamma distribution and plot a confidence ellipse of the inverse of Fisher information matrix around the ML estimates of the parameters. Also plot the theoretical values. Repeat the sampling several times. What do you observe?</span>

c. Discuss what a non-diagonal Fisher matrix implies.

Hint: The digamma function is defined as $\psi(x) = \frac{\frac{d}{dx} \Gamma(x)}{\Gamma(x)}$. Additionally, you do not need to evaluate $\frac{d}{dx} \psi(x)$. To calculate its value in R, use package **numDeriv**.
```
<div class="fold">
```{solution, echo = togs}


a. The log likelihood of the Gamma is
\begin{equation*}
  l(\alpha, \beta; x) = n \alpha \ln \beta - n \ln \Gamma(\alpha) + (\alpha - 1) \sum_{i=1}^n \ln x_i - \beta \sum_{i=1}^n x_i.
\end{equation*}
Let us calculate the derivatives.
\begin{align*}
  \frac{\partial}{\partial \alpha} l(\alpha, \beta; x) &= n \ln \beta - n \psi(\alpha) + \sum_{i=1}^n \ln x_i, \\
  \frac{\partial}{\partial \beta} l(\alpha, \beta; x) &= \frac{n \alpha}{\beta} - \sum_{i=1}^n x_i, \\
  \frac{\partial^2}{\partial \alpha \partial \beta} l(\alpha, \beta; x) &= \frac{n}{\beta}, \\
  \frac{\partial^2}{\partial \alpha^2} l(\alpha, \beta; x) &= - n \frac{\partial}{\partial \alpha} \psi(\alpha), \\
\frac{\partial^2}{\partial \beta^2} l(\alpha, \beta; x) &= - \frac{n \alpha}{\beta^2}.
\end{align*}
The Fisher information matrix is then

\begin{align*}
I(\alpha, \beta) =
  -  E\Bigg[
    \begin{bmatrix}
    - n \psi'(\alpha)       & \frac{n}{\beta} \\
      \frac{n}{\beta} & - \frac{n \alpha}{\beta^2}
  \end{bmatrix}
  \Bigg] =
  \begin{bmatrix}
      n \psi'(\alpha)       & - \frac{n}{\beta} \\
    - \frac{n}{\beta} & \frac{n \alpha}{\beta^2}
    \end{bmatrix}.
\end{align*}

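c. A non-diagonal Fisher information matrix implies that the parameter estimates are correlated. The inverse of the Fisher information matrix approximates the covariance matrix of the MLE, and its off-diagonal elements are non-zero here, so the estimation errors in $\alpha$ and $\beta$ are not independent of each other.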

```
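
To make part c. concrete, the inverse of the Fisher information matrix can be converted to a correlation matrix. The sketch below evaluates the matrix derived above at assumed parameter values ($\alpha = 5$, $\beta = 2$, $n = 20$, the same values as in the simulation that follows) and uses base R's _trigamma()_ for $\psi'(\alpha)$ in place of numerical differentiation.

```{r, echo = togs, message = FALSE, eval = togs, warning=FALSE}
alpha_0 <- 5     # assumed shape
beta_0  <- 2     # assumed rate
n_gam   <- 20

# Fisher information matrix of the Gamma at (alpha_0, beta_0)
fisher_0 <- n_gam * matrix(c(trigamma(alpha_0), -1 / beta_0,
                             -1 / beta_0,       alpha_0 / beta_0^2),
                           nrow = 2, byrow = TRUE)

# approximate covariance matrix of the MLE and the implied correlation
cov_0 <- solve(fisher_0)
cov2cor(cov_0)
```

The off-diagonal entry equals $1 / \sqrt{\alpha \psi'(\alpha)}$, which is positive: overestimating the shape tends to go together with overestimating the rate.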
```{r, echo = togs, message = FALSE, eval = togs, warning=FALSE}
set.seed(1)
n  <- 20
pars_theor <- c(5, 2)
x  <- rgamma(n, 5, 2)


# MLE for alpha and beta
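# Negative log likelihood of the Gamma (shape pars[1], rate pars[2]);
# negated because optim() minimizes. lgamma() avoids overflow in log(gamma()).
log_lik <- function (pars, x) {
  n <- length(x)
  return (- (n * pars[1] * log(pars[2]) -
             n * lgamma(pars[1]) +
             (pars[1] - 1) * sum(log(x)) -
             pars[2] * sum(x)))
}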
my_optim <- optim(par = c(1,1), fn = log_lik, method = "L-BFGS-B",
                  lower = c(0.001, 0.001), upper = c(8, 8), x = x)
pars_mle <- my_optim$par

fish_mat <- matrix(data = NA, nrow = 2, ncol = 2)
fish_mat[1,2] <- - n / pars_mle[2]
fish_mat[2,1] <- - n / pars_mle[2]
fish_mat[2,2] <- (n * pars_mle[1]) / (pars_mle[2]^2)
fish_mat[1,1] <- n * grad(digamma, pars_mle[1])

fish_mat_inv <- solve(fish_mat)

est_ellip <- ellipse(pars_mle, fish_mat_inv, draw = FALSE)
colnames(est_ellip) <- c("X1", "X2")
est_ellip <- as.data.frame(est_ellip)

ggplot() +
  geom_point(data = data.frame(x = pars_mle[1], y = pars_mle[2]), aes(x = x, y = y)) +
  geom_path(data = est_ellip, aes(x = X1, y = X2)) +
  geom_point(aes(x = pars_theor[1], y = pars_theor[2]), color = "red") +
  geom_text(aes(x = pars_theor[1], y = pars_theor[2], label = "Theoretical parameters"), 
            color = "red",
            nudge_y = -0.2)

```
</div>

## The German tank problem
```{exercise, name = "The German tank problem"}
During WWII, Allied intelligence faced the important problem of estimating the total production of certain German tanks, such as the Panther. What turned out to be a successful approach was to estimate the maximum from the serial numbers of a small sample of captured or destroyed tanks (describe the statistical model used).


a. What assumptions were made by using the above model? Do you think they are reasonable assumptions in practice?
b. Show that the plug-in estimate for the maximum (i.e. the maximum of the sample) is a biased estimator.
c. Derive the maximum likelihood estimate of the maximum.
d. Check that the following estimator is not biased: $\hat{n} = \frac{k + 1}{k}m - 1$, where $k$ is the sample size and $m$ is the maximum of the sample.


```
<div class="fold">
```{solution, echo = togs}
The data are the serial numbers of the tanks. The parameter is $n$, the total production of the tank. The serial numbers are assumed to follow a discrete uniform distribution over all serial numbers $1, \ldots, n$, and we observe $k$ of them, sampled without replacement, with sample maximum $m$.

a. One of the assumptions is that the observed serial numbers are a random sample from the whole population, that is, each tank is equally likely to be captured or destroyed. In practice this might not hold: tanks produced later are also sent to the field later, so at the time of estimation some serial numbers could not have been observed at all.


b. To find the expected value we first need to find the distribution of $m$. Let us start with the CDF.
\begin{align*}
  F_m(x) = P(Y_1 \leq x,...,Y_k \leq x).
\end{align*}
If $x < k$ then $F_m(x) = 0$ and if $x \geq n$ then $F_m(x) = 1$. What about the values in between? The probability that the maximum is less than or equal to $x$ is the number of possible draws from $Y$ in which all values are at most $x$, divided by the number of all possible draws. This is $\frac{{x}\choose{k}}{{n}\choose{k}}$. The PMF on the suitable bounds is then
\begin{align*}
  P(m = x) = F_m(x) - F_m(x - 1) = \frac{\binom{x}{k} - \binom{x - 1}{k}}{\binom{n}{k}} = \frac{\binom{x - 1}{k - 1}}{\binom{n}{k}}.
\end{align*}
Now we can calculate the expected value of $m$ using some combinatorial identities.
\begin{align*}
  E[m] &= \sum_{i = k}^n i \frac{{i - 1}\choose{k - 1}}{{n}\choose{k}} \\
       &= \sum_{i = k}^n i \frac{\frac{(i - 1)!}{(k - 1)!(i - k)!}}{{n}\choose{k}} \\
       &= \frac{k}{\binom{n}{k}}\sum_{i = k}^n \binom{i}{k} \\
       &= \frac{k}{\binom{n}{k}} \binom{n + 1}{k + 1} \\
       &= \frac{k(n + 1)}{k + 1}.
\end{align*}
The bias of this estimator is then
\begin{align*}
  E[m] - n = \frac{k(n + 1)}{k + 1} - n = \frac{k - n}{k + 1}.
\end{align*}

c. The probability that we observed our sample $Y = \{Y_1, Y_2,...,Y_k\}$ given $n$ is $\frac{1}{{n}\choose{k}}$. We need to find the $n^*$ that maximizes this function, subject to the constraint $n^* \geq m = \max{(Y)}$. Let us plot this function for $m = 10$ and $k = 4$.
```
```{r, echo = togs, message = FALSE, eval = togs, warning=FALSE}
library(ggplot2)
my_fun <- function (x, m, k) {
  tmp        <-  1 / (choose(x, k))
  tmp[x < m] <- 0
  return (tmp)
}
x  <- 1:20
y  <- my_fun(x, 10, 4)
df <- data.frame(x = x, y = y)
ggplot(data = df, aes(x = x, y = y)) +
  geom_line()
```
```{solution, echo = togs}


c. (continued) We observe that the maximum of this function lies at the maximum value of the sample: $\frac{1}{\binom{n}{k}}$ is decreasing in $n$, so the smallest admissible value maximizes it. Therefore $n^* = m$ and the ML estimate equals the plug-in estimate.

  
d. 
\begin{align*}
  E[\hat{n}] &= \frac{k + 1}{k} E[m] - 1 \\
  &= \frac{k + 1}{k} \frac{k(n + 1)}{k + 1} - 1 \\
  &= n.
\end{align*}
```
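
The results of parts b. and d. can be checked by simulation. A minimal sketch, assuming a true production of $n = 250$ tanks and samples of $k = 4$ serial numbers drawn without replacement (both values are arbitrary choices): the exact expectation is computed from the PMF derived above, and the two estimators are then averaged over repeated samples.

```{r, echo = togs, message = FALSE, eval = togs, warning=FALSE}
set.seed(1)
n_tanks <- 250   # assumed true production
k_samp  <- 4     # assumed sample size

# exact E[m] from the PMF, compared with the closed form k (n + 1) / (k + 1)
i_vals <- k_samp:n_tanks
pmf_m  <- choose(i_vals - 1, k_samp - 1) / choose(n_tanks, k_samp)
c(pmf_sum     = sum(pmf_m),
  E_m_pmf     = sum(i_vals * pmf_m),
  E_m_formula = k_samp * (n_tanks + 1) / (k_samp + 1))

# Monte Carlo comparison of the plug-in and the corrected estimator
m_sims <- replicate(10000, max(sample(1:n_tanks, k_samp)))
c(plug_in   = mean(m_sims),
  corrected = mean((k_samp + 1) / k_samp * m_sims - 1))
```

The average of the plug-in estimate should fall below the assumed $n$, while the average of the corrected estimator should be close to it.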
</div>