From 50860e238c59e80b46e16fcc2a04f902a0a992ee Mon Sep 17 00:00:00 2001 From: LeonHvastja Date: Mon, 2 Oct 2023 16:07:54 +0200 Subject: [PATCH] fixes after review, authors add --- 18-distributions_intuition.Rmd | 44 +++++++++++++----------------- docs/404.html | 4 +-- docs/A1.html | 4 +-- docs/ard.html | 4 +-- docs/bi.html | 4 +-- docs/boot.html | 4 +-- docs/ci.html | 4 +-- docs/condprob.html | 4 +-- docs/crv.html | 4 +-- docs/distributions-intutition.html | 34 ++++++++++------------- docs/distributions.html | 4 +-- docs/eb.html | 4 +-- docs/ev.html | 4 +-- docs/index.html | 8 +++--- docs/integ.html | 4 +-- docs/introduction.html | 4 +-- docs/lt.html | 4 +-- docs/ml.html | 4 +-- docs/mrv.html | 4 +-- docs/mrvs.html | 4 +-- docs/nhst.html | 4 +-- docs/references.html | 4 +-- docs/rvs.html | 4 +-- docs/search_index.json | 2 +- docs/uprobspaces.html | 4 +-- index.Rmd | 2 +- 26 files changed, 82 insertions(+), 92 deletions(-) diff --git a/18-distributions_intuition.Rmd b/18-distributions_intuition.Rmd index 5a57781..0704669 100644 --- a/18-distributions_intuition.Rmd +++ b/18-distributions_intuition.Rmd @@ -41,10 +41,10 @@ $(document).ready(function() { ## Discrete distributions ```{exercise, name = "Bernoulli intuition 1"} -Arguably the simplest distribution you will encounter is the Bernoulli distribution. +The simplest distribution you will encounter is the Bernoulli distribution. It is a discrete probability distribution used to represent the outcome of a yes/no -question. It has one parameter $p$ which is the probability of success. The -probability of failure is $(1-p)$, sometimes denoted as $q$. +question. It has one parameter $0 \leq p \leq 1$, which is the probability of success. The +probability of failure is $q = (1-p)$. A classic way to think about a Bernoulli trial (a yes/no experiment) is a coin flip. Real coins are fair, meaning the probability of either heads (1) @@ -101,7 +101,7 @@ is divisible by 3: $C = \{3, 6\}$ satisfies this condition. ```{exercise, name = "Binomial intuition 1"} The binomial distribution is a generalization of the Bernoulli distribution. -Instead of considering a single Bernoulli trial, we now consider a sequence of $n$ trials, +Instead of considering a single Bernoulli trial, we now consider a sum of a sequence of $n$ trials, which are independent and have the same parameter $p$. So the binomial distribution has two parameters $n$ - the number of trials and $p$ - the probability of success for each trial. @@ -118,7 +118,7 @@ a. Take the [pmf of the binomial distribution](#distributions) and plug in $n=1 check that it is in fact equivalent to a Bernoulli distribution. b. In our examples we show the graph of a binomial distribution over 10 trials with -$p=0.8$. If we take a look at the graph, it appears as though the probabilities of getting 0,1,2 or 3 +$p=0.8$. If we take a look at the graph, it appears as though the probabilities of getting 0,1, 2 or 3 heads in 10 flips are zero. Is it actually zero? Check by plugging in the values into the pmf. ``` @@ -284,8 +284,8 @@ failures** before the first success in a sequence of independent Bernoulli trial It has a single parameter $p$, representing the probability of success and its support is all non-negative integers $\{0,1,2,\dots\}$. -NOTE: There are two forms of this distribution, the one we just described -and another that models the **number of trials** before the first success. 
The
+NOTE: There is an alternative way to think about this distribution, one that models
+the **number of trials** needed to get the first success. The
 difference is subtle yet significant and you are likely to encounter both forms.
 The key to telling them apart is to check their support, since the number of
 trials has to be at least $1$, for this case we have $\{1,2,\dots\}$.
@@ -299,7 +299,7 @@ probability.
 a) Create an equivalent graph that represents the probability of rolling a 6 with a fair 6-sided die.
 b) Use the formula for the [mean](#distributions) of the geometric distribution and determine the average number of **failures** before you roll a 6.
-c) Look up the alternative form of the geometric distribtuion and again use the formula for the mean to determine the average number of **trials** before you roll a 6.
+c) Look up the alternative form of the geometric distribution and again use the formula for the mean to determine the average number of **trials** up to and including rolling a 6.
 ```
@@ -371,7 +371,7 @@ The need for a randomness is a common problem. A practical solution are so-calle
 It has two parameters $a$ and $b$, which define the beginning and end of its support respectively.
-a) Let's think of the mean intuitively. The expected value or mean of a distribution is the pivot point on our x-axis, which "balances" the graph. Given parameters $a$ and $b$ what is your intuitive guess of the mean for this distribution?
+a) Let's think about the mean intuitively. Think of the area under the graph as a geometric shape. The expected value or mean of a distribution is the x-axis value of its center of mass. Given parameters $a$ and $b$ what is your intuitive guess of the mean for the uniform distribution?
 b) A special case of the uniform distribution is the **standard uniform distribution** with $a=0$ and $b=1$. Write the pdf $f(x)$ of this particular distribution.
 ```
 ```{r, fig.width=5, fig.height=3, echo=FALSE, warning=FALSE, message=FALSE}
@@ -398,7 +398,7 @@ print(p)
 ```
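Before opening the solution, the intuition in (a) can be checked empirically. The following is a minimal R sketch, where `a = 2` and `b = 7` are arbitrary example values rather than anything prescribed by the exercise:

```{r}
# Empirical check of (a): the average of uniform samples should sit at the
# midpoint (a + b) / 2. The endpoints 2 and 7 are arbitrary example values.
set.seed(1)
a <- 2
b <- 7
x <- runif(100000, min = a, max = b)
mean(x)      # close to 4.5
(a + b) / 2  # the conjectured mean
```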
```{solution, echo = togs}
-a. It's the midpoint between $a$ and $b$, so $\frac{a+b}{2}$
+a. The center of mass is the center of the rectangle from $a$ to $b$ and from 0 to $\frac{1}{b-a}$. Its value on the x-axis is the midpoint between $a$ and $b$, so $\frac{a+b}{2}$
 b. Inserting the parameter values we get:$$f(x) =
 \begin{cases}
 1 & \text{if } 0 \leq x \leq 1 \\
@@ -517,11 +517,9 @@ Below you've been provided with some code that you can copy into Rstudio. Once y
 Play around with the parameters to get:
-a) A straight line from (0,0) to (1,2)
-b) A straight line from (0,2) to (1,0)
-c) A symmetric bell curve
-d) A bowl-shaped curve
-e) The standard uniform distribution is actually a special case of the beta distribution. Find the exact parameters $\alpha$ and $\beta$. Once you do, prove the equality by inserting the values into our pdf.
+a) A symmetric bell curve
+b) A bowl-shaped curve
+c) The standard uniform distribution is actually a special case of the beta distribution. Find the exact parameters $\alpha$ and $\beta$. Once you do, prove the equality by inserting the values into our pdf.
 *Hint*: The beta function is evaluated as $\text{B}(a,b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$, the gamma function for **positive integers** $n$ is evaluated as $\Gamma(n)= (n-1)!$
@@ -566,15 +564,11 @@ shinyApp(ui = ui, server = server)
 ```
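For readers who would rather not run the Shiny app, a static sketch of the same exploration is possible with base R; the parameter pairs below are illustrative choices, not the only possible answers, and `dunif` is used only to compare against the flat case:

```{r}
# Plot the beta pdf for a few example parameter pairs and compare the
# flat case with the standard uniform density.
x <- seq(0.01, 0.99, by = 0.01)
plot(x, dbeta(x, 5, 5), type = "l", ylim = c(0, 3.5), ylab = "density")  # bell shape
lines(x, dbeta(x, 0.5, 0.5), lty = 2)                                    # bowl shape
lines(x, dbeta(x, 1, 1), lty = 3)                                        # flat
all.equal(dbeta(x, 1, 1), dunif(x))  # the flat case coincides with U(0,1)
```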
```{solution, echo = togs}
- a) $\alpha = 2, \beta=1$
+ a) Possible solution $\alpha = \beta = 5$
- b) $\alpha = 1, \beta=2$
- 
- c) Possible solution $\alpha = \beta= 5$
- 
- d) Possible solution $\alpha = \beta= 0.5$
+ b) Possible solution $\alpha = \beta = 0.5$
- e) The correct parameters are $\alpha = 1, \beta=1$, to prove the equality we insert them into the beta pdf:
+ c) The correct parameters are $\alpha = 1, \beta=1$; to prove the equality we insert them into the beta pdf:
$$\frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{\text{B}(\alpha, \beta)} =
\frac{x^{1 - 1} (1 - x)^{1 - 1}}{\text{B}(1, 1)} =
\frac{1}{\frac{\Gamma(1)\Gamma(1)}{\Gamma(1+1)}}=
@@ -593,7 +587,7 @@ a) What is the mean time between phone calls?
 The cdf $F(x)$ tells us what percentage of calls occur within x amount of time of each other.
-b) You want to take an hour long lunch break but are worried about missing calls. Calculate the percentage of calls you are likely to miss if you're gone for an hour. Hint: The cdf is $F(x) = \int_{-\infty}^{x} f(x) dx$
+b) You want to take an hour long lunch break but are worried about missing calls. Calculate the probability of missing at least one call if you're gone for an hour. Hint: The cdf is $F(x) = \int_{-\infty}^{x} f(x) dx$
 ```
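The hint ties the cdf to the pdf through integration, and this link can also be seen numerically before doing the calculation by hand. A minimal R sketch, using an arbitrary example rate of 0.5 (not the rate implied by the exercise):

```{r}
# dexp() is the exponential pdf and pexp() its cdf, so integrating the pdf
# from 0 up to some point should reproduce the cdf at that point.
rate <- 0.5  # arbitrary example value
integrate(dexp, lower = 0, upper = 2, rate = rate)$value
pexp(2, rate = rate)  # should agree with the integral above
```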
@@ -612,12 +606,12 @@ b. First we derive the CDF, we can integrate from 0 instead of $-\infty$, since Then we just evaluate it for a time of 1 hour: $$F(1 \text{ hour}) = 1 - e^{-\frac{1 \text{ call}}{3.2 \text{ hours}} \cdot 1 \text{ hour}}= 1 - e^{-\frac{1 \text{ call}}{3.2 \text{ hours}}} \approx 0.268$$ - So we have about a 27% chance of missing a call if we're gone for an hour. + So we have about a 27% chance of missing at least one call if we're gone for an hour. ```
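The closed-form result can be cross-checked in R with the built-in exponential cdf; a short sketch using the same rate of one call per 3.2 hours as in the solution:

```{r}
# Probability that at least one call arrives during a 1-hour break.
pexp(1, rate = 1 / 3.2)                 # ~0.268, matches the derivation
mean(rexp(100000, rate = 1 / 3.2) < 1)  # the same probability, by simulation
```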
```{exercise, name = "Gamma intuition 1"}
-The gamma distribution is a continuous distribution characterized by two parameters, $\alpha$ and $\beta$, both greater than 0. These parameters afford the distribution a broad range of shapes, leading to it being commonly referred to as a *family of distributions*. Given its support over the positive real numbers, it is well suited for modeling a diverse range of positive-valued phenomena.
+The gamma distribution is a continuous distribution with two parameters, $\alpha$ and $\beta$, both greater than 0. These parameters afford the distribution a broad range of shapes, leading to it being commonly referred to as a *family of distributions*. Given its support over the positive real numbers, it is well suited for modeling a diverse range of positive-valued phenomena.
 a) The exponential distribution is actually just a particular form of the gamma distribution. What are the values of $\alpha$ and $\beta$?
 b) Copy the code from our beta distribution Shiny app and modify it to simulate the gamma distribution. Then get it to show the exponential.
diff --git a/docs/404.html b/docs/404.html
index fd15a7d..16745c2 100644
--- a/docs/404.html
+++ b/docs/404.html
@@ -20,10 +20,10 @@
- + - +
diff --git a/docs/A1.html b/docs/A1.html
index f3b6fd8..80d8875 100644
--- a/docs/A1.html
+++ b/docs/A1.html
@@ -20,10 +20,10 @@
- + - +
diff --git a/docs/ard.html b/docs/ard.html
index 82eb67b..26a3e7c 100644
--- a/docs/ard.html
+++ b/docs/ard.html
@@ -20,10 +20,10 @@
- + - +
diff --git a/docs/bi.html b/docs/bi.html
index a164a4e..1597c8d 100644
--- a/docs/bi.html
+++ b/docs/bi.html
@@ -20,10 +20,10 @@
- + - +
diff --git a/docs/boot.html b/docs/boot.html
index 80f30f8..0a6b41e 100644
--- a/docs/boot.html
+++ b/docs/boot.html
@@ -20,10 +20,10 @@
- + - +
diff --git a/docs/ci.html b/docs/ci.html
index 4c0e9ac..10b0f07 100644
--- a/docs/ci.html
+++ b/docs/ci.html
@@ -20,10 +20,10 @@
- + - +
diff --git a/docs/condprob.html b/docs/condprob.html
index a3e195d..f8b7026 100644
--- a/docs/condprob.html
+++ b/docs/condprob.html
@@ -20,10 +20,10 @@
- + - +
diff --git a/docs/crv.html b/docs/crv.html
index 4d043c0..8db20b8 100644
--- a/docs/crv.html
+++ b/docs/crv.html
@@ -20,10 +20,10 @@
- + - +
diff --git a/docs/distributions-intutition.html b/docs/distributions-intutition.html
index da87a6e..6467605 100644
--- a/docs/distributions-intutition.html
+++ b/docs/distributions-intutition.html
@@ -20,10 +20,10 @@
- + - +
@@ -327,10 +327,10 @@

Chapter 18 Distributions intutiti

18.1 Discrete distributions

-

Exercise 18.1 (Bernoulli intuition 1) Arguably the simplest distribution you will encounter is the Bernoulli distribution. +

Exercise 18.1 (Bernoulli intuition 1) The simplest distribution you will encounter is the Bernoulli distribution. It is a discrete probability distribution used to represent the outcome of a yes/no -question. It has one parameter \(p\) which is the probability of success. The -probability of failure is \((1-p)\), sometimes denoted as \(q\).

+question. It has one parameter \(0 \leq p \leq 1\), which is the probability of success. The +probability of failure is \(q = (1-p)\).

A classic way to think about a Bernoulli trial (a yes/no experiment) is a coin flip. Real coins are fair, meaning the probability of either heads (1) or tails (0) are the same, so \(p=0.5\) as shown below in figure a. Alternatively @@ -364,7 +364,7 @@

18.1 Discrete distributions

Exercise 18.2 (Binomial intuition 1) The binomial distribution is a generalization of the Bernoulli distribution. -Instead of considering a single Bernoulli trial, we now consider a sequence of \(n\) trials, +Instead of considering a single Bernoulli trial, we now consider a sum of a sequence of \(n\) trials, which are independent and have the same parameter \(p\). So the binomial distribution has two parameters \(n\) - the number of trials and \(p\) - the probability of success for each trial.

@@ -378,7 +378,7 @@

18.1 Discrete distributions

Take the pmf of the binomial distribution and plug in \(n=1\), check that it is in fact equivalent to a Bernoulli distribution.

  • In our examples we show the graph of a binomial distribution over 10 trials with -\(p=0.8\). If we take a look at the graph, it appears as though the probabilities of getting 0,1,2 or 3 +\(p=0.8\). If we take a look at the graph, it appears as though the probabilities of getting 0,1, 2 or 3 heads in 10 flips are zero. Is it actually zero? Check by plugging in the values into the pmf.

  • @@ -478,8 +478,8 @@

    18.1 Discrete distributions before the first success in a sequence of independent Bernoulli trials. It has a single parameter \(p\), representing the probability of success and its support is all non-negative integers \(\{0,1,2,\dots\}\).

    -

    NOTE: There are two forms of this distribution, the one we just described -and another that models the number of trials before the first success. The +

    NOTE: There is an alternative way to think about this distribution, one that models +the number of trials before the first success. The difference is subtle yet significant and you are likely to encounter both forms. The key to telling them apart is to check their support, since the number of trials has to be at least \(1\), for this case we have \(\{1,2,\dots\}\).

    @@ -492,7 +492,7 @@

    18.1 Discrete distributions
  • Create an equivalent graph that represents the probability of rolling a 6 with a fair 6-sided die.
  • Use the formula for the mean of the geometric distribution and determine the average number of failures before you roll a 6.
  • -
  • Look up the alternative form of the geometric distribtuion and again use the formula for the mean to determine the average number of trials before you roll a 6.
  • +
  • Look up the alternative form of the geometric distribtuion and again use the formula for the mean to determine the average number of trials up to and including rolling a 6.
  • @@ -539,7 +539,7 @@

    18.2 Continuous distributionsExercise 18.6 (Uniform intuition 1) The need for a randomness is a common problem. A practical solution are so-called random number generators (RNGs). The simplest RNG one would think of is choosing a set of numbers and having the generator return a number at random, where the probability of returning any number from this set is the same. If this set is an interval of real numbers, then we’ve basically described the continuous uniform distribution.

    It has two parameters \(a\) and \(b\), which define the beginning and end of its support respectively.

      -
    1. Let’s think of the mean intuitively. The expected value or mean of a distribution is the pivot point on our x-axis, which “balances” the graph. Given parameters \(a\) and \(b\) what is your intuitive guess of the mean for this distribution?
    2. +
    3. Let’s think about the mean intuitively. Think of the area under the graph as a geometric shape. The expected value or mean of a distribution is the x-axis value of its center of mass. Given parameters \(a\) and \(b\) what is your intuitive guess of the mean for the uniform distribution?
    4. A special case of the uniform distribution is the standard uniform distribution with \(a=0\) and \(b=1\). Write the pdf \(f(x)\) of this particular distribution.

    @@ -548,7 +548,7 @@

    18.2 Continuous distributions

    Solution.

      -
    1. It’s the midpoint between \(a\) and \(b\), so \(\frac{a+b}{2}\)
    2. +
    3. The center of mass is the center of the rectangle from \(a\) to \(b\) and from 0 to \(\frac{1}{b-a}\). Its value on the x-axis is the midpoint between \(a\) and \(b\), so \(\frac{a+b}{2}\)
    4. Inserting the parameter values we get:\[f(x) = \begin{cases} 1 & \text{if } 0 \leq x \leq 1 \\ @@ -630,8 +630,6 @@

      18.2 Continuous distributionsBelow you’ve been provided with some code that you can copy into Rstudio. Once you run the code, an interactive Shiny app will appear and you will be able to manipulate the graph of the beta distribution.

      Play around with the parameters to get:

        -
      1. A straight line from (0,0) to (1,2)
      2. -
      3. A straight line from (0,2) to (1,0)
      4. A symmetric bell curve
      5. A bowl-shaped curve
      6. The standard uniform distribution is actually a special case of the beta distribution. Find the exact parameters \(\alpha\) and \(\beta\). Once you do, prove the equality by inserting the values into our pdf.
      7. @@ -678,8 +676,6 @@

        18.2 Continuous distributions

        Solution.

          -
        1. \(\alpha = 2, \beta=1\)

        2. -
        3. \(\alpha = 1, \beta=2\)

        4. Possible solution \(\alpha = \beta= 5\)

        5. Possible solution \(\alpha = \beta= 0.5\)

        6. The correct parameters are \(\alpha = 1, \beta=1\), to prove the equality we insert them into the beta pdf: @@ -699,7 +695,7 @@

          18.2 Continuous distributions

          The cdf \(F(x)\) tells us what percentage of calls occur within x amount of time of each other.

            -
          1. You want to take an hour long lunch break but are worried about missing calls. Calculate the percentage of calls you are likely to miss if you’re gone for an hour. Hint: The cdf is \(F(x) = \int_{-\infty}^{x} f(x) dx\)
          2. +
          3. You want to take an hour long lunch break but are worried about missing calls. Calculate the probability of missing at least one call if you’re gone for an hour. Hint: The cdf is \(F(x) = \int_{-\infty}^{x} f(x) dx\)

    -

    Exercise 18.10 (Gamma intuition 1) The gamma distribution is a continuous distribution characterized by two parameters, \(\alpha\) and \(\beta\), both greater than 0. These parameters afford the distribution a broad range of shapes, leading to it being commonly referred to as a family of distributions. Given its support over the positive real numbers, it is well suited for modeling a diverse range of positive-valued phenomena.

    +

    Exercise 18.10 (Gamma intuition 1) The gamma distribution is a continuous distribution with two parameters, \(\alpha\) and \(\beta\), both greater than 0. These parameters afford the distribution a broad range of shapes, leading to it being commonly referred to as a family of distributions. Given its support over the positive real numbers, it is well suited for modeling a diverse range of positive-valued phenomena.

    1. The exponential distribution is actually just a particular form of the gamma distribution. What are the values of \(\alpha\) and \(\beta\)?
    2. Copy the code from our beta distribution Shiny app and modify it to simulate the gamma distribution. Then get it to show the exponential.
    3. diff --git a/docs/distributions.html b/docs/distributions.html index 24ad8ae..16bfc91 100644 --- a/docs/distributions.html +++ b/docs/distributions.html @@ -20,10 +20,10 @@ - + - + diff --git a/docs/eb.html b/docs/eb.html index 09cd76b..bb372f7 100644 --- a/docs/eb.html +++ b/docs/eb.html @@ -20,10 +20,10 @@ - + - + diff --git a/docs/ev.html b/docs/ev.html index c3e2dec..8938503 100644 --- a/docs/ev.html +++ b/docs/ev.html @@ -20,10 +20,10 @@ - + - + diff --git a/docs/index.html b/docs/index.html index 589e5e1..de01699 100644 --- a/docs/index.html +++ b/docs/index.html @@ -20,10 +20,10 @@ - + - + @@ -299,8 +299,8 @@

      Preface

      diff --git a/docs/integ.html b/docs/integ.html index 2d3e14f..2d51736 100644 --- a/docs/integ.html +++ b/docs/integ.html @@ -20,10 +20,10 @@ - + - + diff --git a/docs/introduction.html b/docs/introduction.html index f472b1d..468df61 100644 --- a/docs/introduction.html +++ b/docs/introduction.html @@ -20,10 +20,10 @@ - + - + diff --git a/docs/lt.html b/docs/lt.html index f70caad..10f550d 100644 --- a/docs/lt.html +++ b/docs/lt.html @@ -20,10 +20,10 @@ - + - + diff --git a/docs/ml.html b/docs/ml.html index 01a8d4c..b30e532 100644 --- a/docs/ml.html +++ b/docs/ml.html @@ -20,10 +20,10 @@ - + - + diff --git a/docs/mrv.html b/docs/mrv.html index e52e8ce..0ed2084 100644 --- a/docs/mrv.html +++ b/docs/mrv.html @@ -20,10 +20,10 @@ - + - + diff --git a/docs/mrvs.html b/docs/mrvs.html index d08c574..498fbea 100644 --- a/docs/mrvs.html +++ b/docs/mrvs.html @@ -20,10 +20,10 @@ - + - + diff --git a/docs/nhst.html b/docs/nhst.html index bc849af..7256cbc 100644 --- a/docs/nhst.html +++ b/docs/nhst.html @@ -20,10 +20,10 @@ - + - + diff --git a/docs/references.html b/docs/references.html index 5415593..0e6bfb9 100644 --- a/docs/references.html +++ b/docs/references.html @@ -20,10 +20,10 @@ - + - + diff --git a/docs/rvs.html b/docs/rvs.html index a9abfa1..73bf8f3 100644 --- a/docs/rvs.html +++ b/docs/rvs.html @@ -20,10 +20,10 @@ - + - + diff --git a/docs/search_index.json b/docs/search_index.json index 1068bd3..dae68a3 100644 --- a/docs/search_index.json +++ b/docs/search_index.json @@ -1 +1 @@ -[["index.html", "Principles of Uncertainty – exercises Preface", " Principles of Uncertainty – exercises Gregor Pirš and Erik Štrumbelj 2023-10-01 Preface These are the exercises for the Principles of Uncertainty course of the Data Science Master’s at University of Ljubljana, Faculty of Computer and Information Science. This document will be extended each week as the course progresses. At the end of each exercise session, we will post the solutions to the exercises worked in class and select exercises for homework. Students are also encouraged to solve the remaining exercises to further extend their knowledge. Some exercises require the use of R. Those exercises (or parts of) are coloured blue. Students that are not familiar with R programming language should study A to learn the basics. As the course progresses, we will cover more relevant uses of R for data science. "],["introduction.html", "Chapter 1 Probability spaces 1.1 Measure and probability spaces 1.2 Properties of probability measures 1.3 Discrete probability spaces", " Chapter 1 Probability spaces This chapter deals with measures and probability spaces. At the end of the chapter, we look more closely at discrete probability spaces. The students are expected to acquire the following knowledge: Theoretical Use properties of probability to calculate probabilities. Combinatorics. Understanding of continuity of probability. R Vectors and vector operations. For loop. Estimating probability with simulation. sample function. Matrices and matrix operations. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } 1.1 Measure and probability spaces Exercise 1.1 (Completing a set to a sigma algebra) Let \\(\\Omega = \\{1,2,...,10\\}\\) and let \\(A = \\{\\emptyset, \\{1\\}, \\{2\\}, \\Omega \\}\\). Show that \\(A\\) is not a sigma algebra of \\(\\Omega\\). Find the minimum number of elements to complete A to a sigma algebra of \\(\\Omega\\). Solution. 
\\(1^c = \\{2,3,...,10\\} \\notin A \\implies\\) \\(A\\) is not sigma algebra. First we need the complements of all elements, so we need to add sets \\(\\{2,3,...,10\\}\\) and \\(\\{1,3,4,...,10\\}\\). Next we need unions of all sets – we add the set \\(\\{1,2\\}\\). Again we need the complement of this set, so we add \\(\\{3,4,...,10\\}\\). So the minimum number of elements we need to add is 4. Exercise 1.2 (Diversity of sigma algebras) Let \\(\\Omega\\) be a set. Find the smallest sigma algebra of \\(\\Omega\\). Find the largest sigma algebra of \\(\\Omega\\). Solution. \\(A = \\{\\emptyset, \\Omega\\}\\) \\(2^{\\Omega}\\) Exercise 1.3 Find all sigma algebras for \\(\\Omega = \\{0, 1, 2\\}\\). Solution. \\(A = \\{\\emptyset, \\Omega\\}\\) \\(A = 2^{\\Omega}\\) \\(A = \\{\\emptyset, \\{0\\}, \\{1,2\\}, \\Omega\\}\\) \\(A = \\{\\emptyset, \\{1\\}, \\{0,2\\}, \\Omega\\}\\) \\(A = \\{\\emptyset, \\{2\\}, \\{0,1\\}, \\Omega\\}\\) Exercise 1.4 (Difference between algebra and sigma algebra) Let \\(\\Omega = \\mathbb{N}\\) and \\(\\mathcal{A} = \\{A \\subseteq \\mathbb{N}: A \\text{ is finite or } A^c \\text{ is finite.} \\}\\). Show that \\(\\mathcal{A}\\) is an algebra but not a sigma algebra. Solution. \\(\\emptyset\\) is finite so \\(\\emptyset \\in \\mathcal{A}\\). Let \\(A \\in \\mathcal{A}\\) and \\(B \\in \\mathcal{A}\\). If both are finite, then their union is also finite and therefore in \\(\\mathcal{A}\\). Let at least one of them not be finite. Then their union is not finite. But \\((A \\cup B)^c = A^c \\cap B^c\\). And since at least one is infinite, then its complement is finite and the intersection is too. So finite unions are in \\(\\mathcal{A}\\). Let us look at numbers \\(2n\\). For any \\(n\\), \\(2n \\in \\mathcal{A}\\) as it is finite. But \\(\\bigcup_{k = 1}^{\\infty} 2n \\notin \\mathcal{A}\\). Exercise 1.5 We define \\(\\sigma(X) = \\cap_{\\lambda \\in I} S_\\lambda\\) to be a sigma algebra, generated by the set \\(X\\), where \\(S_\\lambda\\) are all sigma algebras such that \\(X \\subseteq S_\\lambda\\). \\(S_\\lambda\\) are indexed by \\(\\lambda \\in I\\). Let \\(A, B \\subseteq 2^{\\Omega}\\). Prove that \\(\\sigma(A) = \\sigma(B) \\iff A \\subseteq \\sigma(B) \\land B \\subseteq \\sigma(A)\\). Solution. To prove the equivalence, we need to prove that the left hand side implies the right hand side and vice versa. Proving \\(\\sigma(A) = \\sigma(B) \\Rightarrow A \\subseteq \\sigma(B) \\land B \\subseteq \\sigma(A)\\): we know \\(A \\subseteq \\sigma(A)\\) is always true, so by substituting in \\(\\sigma(B)\\) from the left hand side equality we obtain \\(A \\subseteq \\sigma(B)\\). We obtain \\(B \\subseteq \\sigma(A)\\) by symmetry. This proves the implication. Proving \\(A \\subseteq \\sigma(B) \\land B \\subseteq \\sigma(A) \\Rightarrow \\sigma(A) = \\sigma(B)\\): by definition of a sigma algebra, generated by a set, we have \\(\\sigma(B) = \\cap_{\\lambda \\in I} S_\\lambda\\) where \\(S_\\lambda\\) are all sigma algebras where \\(B \\subseteq S_\\lambda\\). But \\(\\sigma(A)\\) is one of \\(S_\\lambda\\), so we can write \\(\\sigma(B) = \\sigma(A) \\cap \\left(\\cap_{\\lambda \\in I} S_\\lambda \\right)\\), which implies \\(\\sigma(B) \\subseteq \\sigma(A)\\). By symmetry, we have \\(\\sigma(A) \\subseteq \\sigma(B)\\). Since \\(\\sigma(A) \\subseteq \\sigma(B)\\) and \\(\\sigma(B) \\subseteq \\sigma(A)\\), we obtain \\(\\sigma(A) = \\sigma(B)\\), which proves the implication and completes the equivalence proof. 
Exercise 1.6 (Intro to measure) Take the measurable space \\(\\Omega = \\{1,2\\}\\), \\(F = 2^{\\Omega}\\). Which of the following is a measure? Which is a probability measure? \\(\\mu(\\emptyset) = 0\\), \\(\\mu(\\{1\\}) = 5\\), \\(\\mu(\\{2\\}) = 6\\), \\(\\mu(\\{1,2\\}) = 11\\) \\(\\mu(\\emptyset) = 0\\), \\(\\mu(\\{1\\}) = 0\\), \\(\\mu(\\{2\\}) = 0\\), \\(\\mu(\\{1,2\\}) = 1\\) \\(\\mu(\\emptyset) = 0\\), \\(\\mu(\\{1\\}) = 0\\), \\(\\mu(\\{2\\}) = 0\\), \\(\\mu(\\{1,2\\}) = 0\\) \\(\\mu(\\emptyset) = 0\\), \\(\\mu(\\{1\\}) = 0\\), \\(\\mu(\\{2\\}) = 1\\), \\(\\mu(\\{1,2\\}) = 1\\) \\(\\mu(\\emptyset)=0\\), \\(\\mu(\\{1\\})=0\\), \\(\\mu(\\{2\\})=\\infty\\), \\(\\mu(\\{1,2\\})=\\infty\\) Solution. Measure. Not probability measure since \\(\\mu(\\Omega) > 1\\). Neither due to countable additivity. Measure. Not probability measure since \\(\\mu(\\Omega) = 0\\). Probability measure. Measure. Not probability measure since \\(\\mu(\\Omega) > 1\\). Exercise 1.7 Define a probability space that could be used to model the outcome of throwing two fair 6-sided dice. Solution. \\(\\Omega = \\{\\{i,j\\}, i = 1,...,6, j = 1,...,6\\}\\) \\(F = 2^{\\Omega}\\) \\(\\forall \\omega \\in \\Omega\\), \\(P(\\omega) = \\frac{1}{6} \\times \\frac{1}{6} = \\frac{1}{36}\\) 1.2 Properties of probability measures Exercise 1.8 A standard deck (52 cards) is distributed to two persons: 26 cards to each person. All partitions are equally likely. Find the probability that: The first person gets 4 Queens. The first person gets at least 2 Queens. R: Use simulation (sample) to check the above answers. Solution. \\(\\frac{\\binom{48}{22}}{\\binom{52}{26}}\\) 1 - \\(\\frac{\\binom{48}{26} + 4 \\times \\binom{48}{25}}{\\binom{52}{26}}\\) For the simulation, let us represent cards with numbers from 1 to 52, and let 1 through 4 represent Queens. set.seed(1) cards <- 1:52 n <- 10000 q4 <- vector(mode = "logical", length = n) q2 <- vector(mode = "logical", length = n) tmp <- vector(mode = "logical", length = n) for (i in 1:n) { p1 <- sample(1:52, 26) q4[i] <- sum(1:4 %in% p1) == 4 q2[i] <- sum(1:4 %in% p1) >= 2 } sum(q4) / n ## [1] 0.0572 sum(q2) / n ## [1] 0.6894 Exercise 1.9 Let \\(A\\) and \\(B\\) be events with probabilities \\(P(A) = \\frac{2}{3}\\) and \\(P(B) = \\frac{1}{2}\\). Show that \\(\\frac{1}{6} \\leq P(A\\cap B) \\leq \\frac{1}{2}\\), and give examples to show that both extremes are possible. Find corresponding bounds for \\(P(A\\cup B)\\). R: Draw samples from the examples and show the probability bounds of \\(P(A \\cap B)\\) . Solution. From the properties of probability we have \\[\\begin{equation} P(A \\cup B) = P(A) + P(B) - P(A \\cap B) \\leq 1. \\end{equation}\\] From this follows \\[\\begin{align} P(A \\cap B) &\\geq P(A) + P(B) - 1 \\\\ &= \\frac{2}{3} + \\frac{1}{2} - 1 \\\\ &= \\frac{1}{6}, \\end{align}\\] which is the lower bound for the intersection. Conversely, we have \\[\\begin{equation} P(A \\cup B) = P(A) + P(B) - P(A \\cap B) \\geq P(A). \\end{equation}\\] From this follows \\[\\begin{align} P(A \\cap B) &\\leq P(B) \\\\ &= \\frac{1}{2}, \\end{align}\\] which is the upper bound for the intersection. For an example take a fair die. To achieve the lower bound let \\(A = \\{3,4,5,6\\}\\) and \\(B = \\{1,2,3\\}\\), then their intersection is \\(A \\cap B = \\{3\\}\\). To achieve the upper bound take \\(A = \\{1,2,3,4\\}\\) and $B = {1,2,3} $. For the bounds of the union we will use the results from the first part. 
Again from the properties of probability we have \\[\\begin{align} P(A \\cup B) &= P(A) + P(B) - P(A \\cap B) \\\\ &\\geq P(A) + P(B) - \\frac{1}{2} \\\\ &= \\frac{2}{3}. \\end{align}\\] Conversely \\[\\begin{align} P(A \\cup B) &= P(A) + P(B) - P(A \\cap B) \\\\ &\\leq P(A) + P(B) - \\frac{1}{6} \\\\ &= 1. \\end{align}\\] Therefore \\(\\frac{2}{3} \\leq P(A \\cup B) \\leq 1\\). We use sample in R: set.seed(1) n <- 10000 samps <- sample(1:6, n, replace = TRUE) # lower bound lb <- vector(mode = "logical", length = n) A <- c(1,2,3) B <- c(3,4,5,6) for (i in 1:n) { lb[i] <- samps[i] %in% A & samps[i] %in% B } sum(lb) / n ## [1] 0.1605 # upper bound ub <- vector(mode = "logical", length = n) A <- c(1,2,3) B <- c(1,2,3,4) for (i in 1:n) { ub[i] <- samps[i] %in% A & samps[i] %in% B } sum(ub) / n ## [1] 0.4913 Exercise 1.10 A fair coin is tossed repeatedly. Show that, with probability one, a head turns up sooner or later. Show similarly that any given finite sequence of heads and tails occurs eventually with probability one. Solution. \\[\\begin{align} P(\\text{no heads}) &= \\lim_{n \\rightarrow \\infty} P(\\text{no heads in first }n \\text{ tosses}) \\\\ &= \\lim_{n \\rightarrow \\infty} \\frac{1}{2^n} \\\\ &= 0. \\end{align}\\] For the second part, let us fix the given sequence of heads and tails of length \\(k\\) as \\(s\\). A probability that this happens in \\(k\\) tosses is \\(\\frac{1}{2^k}\\). \\[\\begin{align} P(s \\text{ occurs}) &= \\lim_{n \\rightarrow \\infty} P(s \\text{ occurs in first } nk \\text{ tosses}) \\end{align}\\] The right part of the upper equation is greater than if \\(s\\) occurs either in the first \\(k\\) tosses, second \\(k\\) tosses,…, \\(n\\)-th \\(k\\) tosses. Therefore \\[\\begin{align} P(s \\text{ occurs}) &\\geq \\lim_{n \\rightarrow \\infty} P(s \\text{ occurs in first } n \\text{ disjoint sequences of length } k) \\\\ &= \\lim_{n \\rightarrow \\infty} (1 - P(s \\text{ does not occur in first } n \\text{ disjoint sequences})) \\\\ &= 1 - \\lim_{n \\rightarrow \\infty} P(s \\text{ does not occur in first } n \\text{ disjoint sequences}) \\\\ &= 1 - \\lim_{n \\rightarrow \\infty} (1 - \\frac{1}{2^k})^n \\\\ &= 1. \\end{align}\\] Exercise 1.11 An Erdos-Renyi random graph \\(G(n,p)\\) is a model with \\(n\\) nodes, where each pair of nodes is connected with probability \\(p\\). Calculate the probability that there exists a node that is not connected to any other node in \\(G(4,0.6)\\). Show that the upper bound for the probability that there exist 2 nodes that are not connected to any other node for an arbitrary \\(G(n,p)\\) is \\(\\binom{n}{2} (1-p)^{2n - 3}\\). R: Estimate the probability from the first point using simulation. Solution. Let \\(A_i\\) be the event that the \\(i\\)-th node is not connected to any other node. Then our goal is to calculate \\(P(\\cup_{i=1}^n A_i)\\). Using the inclusion-exclusion principle, we get \\[\\begin{align} P(\\cup_{i=1}^n A_i) &= \\sum_i A_i - \\sum_{i<j} P(A_i \\cap A_j) + \\sum_{i<j<k} P(A_i \\cap A_j \\cap A_k) - P(A_1 \\cap A_2 \\cap A_3 \\cap A_4) \\\\ &=4 (1 - p)^3 - \\binom{4}{2} (1 - p)^5 + \\binom{4}{3} (1 - p)^6 - (1 - p)^6 \\\\ &\\approx 0.21. \\end{align}\\] Let \\(A_{ij}\\) be the event that nodes \\(i\\) and \\(j\\) are not connected to any other node. We are interested in \\(P(\\cup_{i<j}A_{ij})\\). By using Boole`s inequality, we get \\[\\begin{align} P(\\cup_{i<j}A_{ij}) \\leq \\sum_{i<j} P(A_{ij}). \\end{align}\\] What is the probability of \\(A_{ij}\\)? 
There need to be no connections to the \\(i\\)-th node to the remaining nodes (excluding \\(j\\)), the same for the \\(j\\)-th node, and there can be no connection between them. Therefore \\[\\begin{align} P(\\cup_{i<j}A_{ij}) &\\leq \\sum_{i<j} (1 - p)^{2(n-2) + 1} \\\\ &= \\binom{n}{2} (1 - p)^{2n - 3}. \\end{align}\\] set.seed(1) n_samp <- 100000 n <- 4 p <- 0.6 conn_samp <- vector(mode = "logical", length = n_samp) for (i in 1:n_samp) { tmp_mat <- matrix(data = 0, nrow = n, ncol = n) samp_conn <- sample(c(0,1), choose(4,2), replace = TRUE, prob = c(1 - p, p)) tmp_mat[lower.tri(tmp_mat)] <- samp_conn tmp_mat[upper.tri(tmp_mat)] <- t(tmp_mat)[upper.tri(t(tmp_mat))] not_conn <- apply(tmp_mat, 1, sum) if (any(not_conn == 0)) { conn_samp[i] <- TRUE } else { conn_samp[i] <- FALSE } } sum(conn_samp) / n_samp ## [1] 0.20565 1.3 Discrete probability spaces Exercise 1.12 Show that the standard measurable space on \\(\\Omega = \\{0,1,...,n\\}\\) equipped with binomial measure is a discrete probability space. Define another probability measure on this measurable space. Show that for \\(n=1\\) the binomial measure is the same as the Bernoulli measure. R: Draw 1000 samples from the binomial distribution \\(p=0.5\\), \\(n=20\\) (rbinom) and compare relative frequencies with theoretical probability measure. Solution. We need to show that the terms of \\(\\sum_{k=0}^n \\binom{n}{k} p^k (1 - p)^{n - k}\\) sum to 1. For that we use the binomial theorem \\(\\sum_{k=0}^n \\binom{n}{k} x^k y^{n-k} = (x + y)^n\\). So \\[\\begin{equation} \\sum_{k=0}^n \\binom{n}{k} p^k (1 - p)^{n - k} = (p + 1 - p)^n = 1. \\end{equation}\\] \\(P(\\{k\\}) = \\frac{1}{n + 1}\\). When \\(n=1\\) then \\(k \\in \\{0,1\\}\\). Inserting \\(n=1\\) into the binomial measure, we get \\(\\binom{1}{k}p^k (1-p)^{1 - k}\\). Now \\(\\binom{1}{1} = \\binom{1}{0} = 1\\), so the measure is \\(p^k (1-p)^{1 - k}\\), which is the Bernoulli measure. set.seed(1) library(ggplot2) library(dplyr) bin_samp <- rbinom(n = 1000, size = 20, prob = 0.5) bin_samp <- data.frame(x = bin_samp) %>% count(x) %>% mutate(n = n / 1000, type = "empirical_frequencies") %>% bind_rows(data.frame(x = 0:20, n = dbinom(0:20, size = 20, prob = 0.5), type = "theoretical_measure")) bin_plot <- ggplot(data = bin_samp, aes(x = x, y = n, fill = type)) + geom_bar(stat="identity", position = "dodge") plot(bin_plot) Exercise 1.13 Show that the standard measurable space on \\(\\Omega = \\{0,1,...,\\infty\\}\\) equipped with geometric measure is a discrete probability space, equipped with Poisson measure is a discrete probability space. Define another probability measure on this measurable space. R: Draw 1000 samples from the Poisson distribution \\(\\lambda = 10\\) (rpois) and compare relative frequencies with theoretical probability measure. Solution. \\(\\sum_{k = 0}^{\\infty} p(1 - p)^k = p \\sum_{k = 0}^{\\infty} (1 - p)^k = p \\frac{1}{1 - 1 + p} = 1\\). We used the formula for geometric series. \\(\\sum_{k = 0}^{\\infty} \\frac{\\lambda^k e^{-\\lambda}}{k!} = e^{-\\lambda} \\sum_{k = 0}^{\\infty} \\frac{\\lambda^k}{k!} = e^{-\\lambda} e^{\\lambda} = 1.\\) We used the Taylor expansion of the exponential function. Since we only have to define a probability measure, we could only assign probabilities that sum to one to a finite number of events in \\(\\Omega\\), and probability zero to the other infinite number of events. However to make this solution more educational, we will try to find a measure that assigns a non-zero probability to all events in \\(\\Omega\\). 
A good start for this would be to find a converging infinite series, as the probabilities will have to sum to one. One simple converging series is the geometric series \\(\\sum_{k=0}^{\\infty} p^k\\) for \\(|p| < 1\\). Let us choose an arbitrary \\(p = 0.5\\). Then \\(\\sum_{k=0}^{\\infty} p^k = \\frac{1}{1 - 0.5} = 2\\). To complete the measure, we have to normalize it, so it sums to one, therefore \\(P(\\{k\\}) = \\frac{0.5^k}{2}\\) is a probability measure on \\(\\Omega\\). We could make it even more difficult by making this measure dependent on some parameter \\(\\alpha\\), but this is out of the scope of this introductory chapter. set.seed(1) pois_samp <- rpois(n = 1000, lambda = 10) pois_samp <- data.frame(x = pois_samp) %>% count(x) %>% mutate(n = n / 1000, type = "empirical_frequencies") %>% bind_rows(data.frame(x = 0:25, n = dpois(0:25, lambda = 10), type = "theoretical_measure")) pois_plot <- ggplot(data = pois_samp, aes(x = x, y = n, fill = type)) + geom_bar(stat="identity", position = "dodge") plot(pois_plot) Exercise 1.14 Define a probability measure on \\((\\Omega = \\mathbb{Z}, 2^{\\mathbb{Z}})\\). Define a probability measure such that \\(P(\\omega) > 0, \\forall \\omega \\in \\Omega\\). R: Implement a random generator that will generate samples with the relative frequency that corresponds to your probability measure. Compare relative frequencies with theoretical probability measure . Solution. \\(P(0) = 1, P(\\omega) = 0, \\forall \\omega \\neq 0\\). \\(P(\\{k\\}) = \\sum_{k = -\\infty}^{\\infty} \\frac{p(1 - p)^{|k|}}{2^{1 - 1_0(k)}}\\), where \\(1_0(k)\\) is the indicator function, which equals to one if \\(k\\) is 0, and equals to zero in every other case. n <- 1000 geom_samps <- rgeom(n, prob = 0.5) sign_samps <- sample(c(FALSE, TRUE), size = n, replace = TRUE) geom_samps[sign_samps] <- -geom_samps[sign_samps] my_pmf <- function (k, p) { indic <- rep(1, length(k)) indic[k == 0] <- 0 return ((p * (1 - p)^(abs(k))) / 2^indic) } geom_samps <- data.frame(x = geom_samps) %>% count(x) %>% mutate(n = n / 1000, type = "empirical_frequencies") %>% bind_rows(data.frame(x = -10:10, n = my_pmf(-10:10, 0.5), type = "theoretical_measure")) geom_plot <- ggplot(data = geom_samps, aes(x = x, y = n, fill = type)) + geom_bar(stat="identity", position = "dodge") plot(geom_plot) Exercise 1.15 Define a probability measure on \\(\\Omega = \\{1,2,3,4,5,6\\}\\) with parameter \\(m \\in \\{1,2,3,4,5,6\\}\\), so that the probability of outcome at distance \\(1\\) from \\(m\\) is half of the probability at distance \\(0\\), at distance \\(2\\) is half of the probability at distance \\(1\\), etc. R: Implement a random generator that will generate samples with the relative frequency that corresponds to your probability measure. Compare relative frequencies with theoretical probability measure . Solution. 
\\(P(\\{k\\}) = \\frac{\\frac{1}{2}^{|m - k|}}{\\sum_{i=1}^6 \\frac{1}{2}^{|m - i|}}\\) n <- 10000 m <- 4 my_pmf <- function (k, m) { denom <- sum(0.5^abs(m - 1:6)) return (0.5^abs(m - k) / denom) } samps <- c() for (i in 1:n) { a <- sample(1:6, 1) a_val <- my_pmf(a, m) prob <- runif(1) if (prob < a_val) { samps <- c(samps, a) } } samps <- data.frame(x = samps) %>% count(x) %>% mutate(n = n / length(samps), type = "empirical_frequencies") %>% bind_rows(data.frame(x = 1:6, n = my_pmf(1:6, m), type = "theoretical_measure")) my_plot <- ggplot(data = samps, aes(x = x, y = n, fill = type)) + geom_bar(stat="identity", position = "dodge") plot(my_plot) "],["uprobspaces.html", "Chapter 2 Uncountable probability spaces 2.1 Borel sets 2.2 Lebesgue measure", " Chapter 2 Uncountable probability spaces This chapter deals with uncountable probability spaces. The students are expected to acquire the following knowledge: Theoretical Understand Borel sets and identify them. Estimate Lebesgue measure for different sets. Know when sets are Borel-measurable. Understanding of countable and uncountable sets. R Uniform sampling. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } 2.1 Borel sets Exercise 2.1 Prove that the intersection of two sigma algebras on \\(\\Omega\\) is a sigma algebra. Prove that the collection of all open subsets \\((a,b)\\) on \\((0,1]\\) is not a sigma algebra of \\((0,1]\\). Solution. Empty set: \\[\\begin{equation} \\emptyset \\in \\mathcal{A} \\wedge \\emptyset \\in \\mathcal{B} \\Rightarrow \\emptyset \\in \\mathcal{A} \\cap \\mathcal{B} \\end{equation}\\] Complement: \\[\\begin{equation} \\text{Let } A \\in \\mathcal{A} \\cap \\mathcal{B} \\Rightarrow A \\in \\mathcal{A} \\wedge A \\in \\mathcal{B} \\Rightarrow A^c \\in \\mathcal{A} \\wedge A^c \\in \\mathcal{B} \\Rightarrow A^c \\in \\mathcal{A} \\cap \\mathcal{B} \\end{equation}\\] Countable additivity: Let \\(\\{A_i\\}\\) be a countable sequence of subsets in \\(\\mathcal{A} \\cap \\mathcal{B}\\). \\[\\begin{equation} \\forall i: A_i \\in \\mathcal{A} \\cap \\mathcal{B} \\Rightarrow A_i \\in \\mathcal{A} \\wedge A_i \\in \\mathcal{B} \\Rightarrow \\cup A_i \\in \\mathcal{A} \\wedge \\cup A_i \\in \\mathcal{B} \\Rightarrow \\cup A_i \\in \\mathcal{A} \\cap \\mathcal{B} \\end{equation}\\] Let \\(A\\) denote the collection of all open subsets \\((a,b)\\) on \\((0,1]\\). Then \\((0,1) \\in A\\). But \\((0,1)^c = 1 \\notin A\\). Exercise 2.2 Show that \\(\\mathcal{C} = \\sigma(\\mathcal{C})\\) if and only if \\(\\mathcal{C}\\) is a sigma algebra. Solution. “\\(\\Rightarrow\\)” This follows from the definition of a generated sigma algebra. “\\(\\Leftarrow\\)” Let \\(\\mathcal{F} = \\cap_i F_i\\) be the intersection of all sigma algebras that contain \\(\\mathcal{C}\\). Then \\(\\sigma(\\mathcal{C}) = \\mathcal{F}\\). Additionally, \\(\\forall i: \\mathcal{C} \\in F_i\\). So each \\(F_i\\) can be written as \\(F_i = \\mathcal{C} \\cup D\\), where \\(D\\) are the rest of the elements in the sigma algebra. In other words, each sigma algebra in the collection contains at least \\(\\mathcal{C}\\), but can contain other elements. Now for some \\(j\\), \\(F_j = \\mathcal{C}\\) as \\(\\{F_i\\}\\) contains all sigma algebras that contain \\(\\mathcal{C}\\) and \\(\\mathcal{C}\\) is such a sigma algebra. Since this is the smallest subset in the intersection it follows that \\(\\sigma(\\mathcal{C}) = \\mathcal{F} = \\mathcal{C}\\). 
Exercise 2.3 Let \\(\\mathcal{C}\\) and \\(\\mathcal{D}\\) be two collections of subsets on \\(\\Omega\\) such that \\(\\mathcal{C} \\subset \\mathcal{D}\\). Prove that \\(\\sigma(\\mathcal{C}) \\subseteq \\sigma(\\mathcal{D})\\). Solution. \\(\\sigma(\\mathcal{D})\\) is a sigma algebra that contains \\(\\mathcal{D}\\). It follows that \\(\\sigma(\\mathcal{D})\\) is a sigma algebra that contains \\(\\mathcal{C}\\). Let us write \\(\\sigma(\\mathcal{C}) = \\cap_i F_i\\), where \\(\\{F_i\\}\\) is the collection of all sigma algebras that contain \\(\\mathcal{C}\\). Since \\(\\sigma(\\mathcal{D})\\) is such a sigma algebra, there exists an index \\(j\\), so that \\(F_j = \\sigma(\\mathcal{D})\\). Then we can write \\[\\begin{align} \\sigma(\\mathcal{C}) &= (\\cap_{i \\neq j} F_i) \\cap \\sigma(\\mathcal{D}) \\\\ &\\subseteq \\sigma(\\mathcal{D}). \\end{align}\\] Exercise 2.4 Prove that the following subsets of \\((0,1]\\) are Borel-measurable by finding their measure. Any countable set. The set of numbers in (0,1] whose decimal expansion does not contain 7. Solution. This follows directly from the fact that every countable set is a union of singletons, whose measure is 0. Let us first look at numbers which have a 7 as the first decimal numbers. Their measure is 0.1. Then we take all the numbers with a 7 as the second decimal number (excluding those who already have it as the first). These have the measure 0.01, and there are 9 of them, so their total measure is 0.09. We can continue to do so infinitely many times. At each \\(n\\), we have the measure of the intervals which is \\(10^n\\) and the number of those intervals is \\(9^{n-1}\\). Now \\[\\begin{align} \\lambda(A) &= 1 - \\sum_{n = 0}^{\\infty} \\frac{9^n}{10^{n+1}} \\\\ &= 1 - \\frac{1}{10} \\sum_{n = 0}^{\\infty} (\\frac{9}{10})^n \\\\ &= 1 - \\frac{1}{10} \\frac{10}{1} \\\\ &= 0. \\end{align}\\] Since we have shown that the measure of the set is \\(0\\), we have also shown that the set is measurable. Exercise 2.5 Let \\(\\Omega = [0,1]\\), and let \\(\\mathcal{F}_3\\) consist of all countable subsets of \\(\\Omega\\), and all subsets of \\(\\Omega\\) having a countable complement. Show that \\(\\mathcal{F}_3\\) is a sigma algebra. Let us define \\(P(A)=0\\) if \\(A\\) is countable, and \\(P(A) = 1\\) if \\(A\\) has a countable complement. Is \\((\\Omega, \\mathcal{F}_3, P)\\) a legitimate probability space? Solution. The empty set is countable, therefore it is in \\(\\mathcal{F}_3\\). For any \\(A \\in \\mathcal{F}_3\\). If \\(A\\) is countable, then \\(A^c\\) has a countable complement and is in \\(\\mathcal{F}_3\\). If \\(A\\) is uncountable, then it has a countable complement \\(A^c\\) which is therefore also in \\(\\mathcal{F}_3\\). We are left with showing countable additivity. Let \\(\\{A_i\\}\\) be an arbitrary collection of sets in \\(\\mathcal{F}_3\\). We will look at two possibilities. First let all \\(A_i\\) be countable. A countable union of countable sets is countable, and therefore in \\(\\mathcal{F}_3\\). Second, let at least one \\(A_i\\) be uncountable. It follows that it has a countable complement. We can write \\[\\begin{equation} (\\cup_{i=1}^{\\infty} A_i)^c = \\cap_{i=1}^{\\infty} A_i^c. \\end{equation}\\] Since at least one \\(A_i^c\\) on the right side is countable, the whole intersection is countable, and therefore the union has a countable complement. It follows that the union is in \\(\\mathcal{F}_3\\). The tuple \\((\\Omega, \\mathcal{F}_3)\\) is a measurable space. 
Therefore, we only need to check whether \\(P\\) is a probability measure. The measure of the empty set is zero as it is countable. We have to check for countable additivity. Let us look at three situations. Let \\(A_i\\) be disjoint sets. First, let all \\(A_i\\) be countable. \\[\\begin{equation} P(\\cup_{i=1}^{\\infty} A_i) = \\sum_{i=1}^{\\infty}P( A_i)) = 0. \\end{equation}\\] Since the union is countable, the above equation holds. Second, let exactly one \\(A_i\\) be uncountable. W.L.O.G. let that be \\(A_1\\). Then \\[\\begin{equation} P(\\cup_{i=1}^{\\infty} A_i) = 1 + \\sum_{i=2}^{\\infty}P( A_i)) = 1. \\end{equation}\\] Since the union is uncountable, the above equation holds. Third, let at least two \\(A_i\\) be uncountable. We have to check whether it is possible for two uncountable sets in \\(\\mathcal{F}_3\\) to be disjoint. If that is possible, then their measures would sum to more than one and \\(P\\) would not be a probability measure. W.L.O.G. let \\(A_1\\) and \\(A_2\\) be uncountable. Then we have \\[\\begin{equation} A_1 \\cap A_2 = (A_1^c \\cup A_2^c)^c. \\end{equation}\\] Now \\(A_1^c\\) and \\(A_2^c\\) are countable and their union is therefore countable. Let \\(B = A_1^c \\cup A_2^c\\). So the intersection of \\(A_1\\) and \\(A_2\\) equals the complement of \\(B\\), which is countable. For the intersection to be the empty set, \\(B\\) would have to equal to \\(\\Omega\\). But \\(\\Omega\\) is uncountable and therefore \\(B\\) can not equal to \\(\\Omega\\). It follows that two uncountable sets in \\(\\mathcal{F}_3\\) can not have an empty intersection. Therefore the tuple is a legitimate probability space. 2.2 Lebesgue measure Exercise 2.6 Show that the Lebesgue measure of rational numbers on \\([0,1]\\) is 0. R: Implement a random number generator, which generates uniform samples of irrational numbers in \\([0,1]\\) by uniformly sampling from \\([0,1]\\) and rejecting a sample if it is rational. Solution. There are a countable number of rational numbers. Therefore, we can write \\[\\begin{align} \\lambda(\\mathbb{Q}) &= \\lambda(\\cup_{i = 1}^{\\infty} q_i) &\\\\ &= \\sum_{i = 1}^{\\infty} \\lambda(q_i) &\\text{ (countable additivity)} \\\\ &= \\sum_{i = 1}^{\\infty} 0 &\\text{ (Lebesgue measure of a singleton)} \\\\ &= 0. \\end{align}\\] Exercise 2.7 Prove that the Lebesgue measure of \\(\\mathbb{R}\\) is infinity. Paradox. Show that the cardinality of \\(\\mathbb{R}\\) and \\((0,1)\\) is the same, while their Lebesgue measures are infinity and one respectively. Solution. Let \\(a_i\\) be the \\(i\\)-th integer for \\(i \\in \\mathbb{Z}\\). We can write \\(\\mathbb{R} = \\cup_{-\\infty}^{\\infty} (a_i, a_{i + 1}]\\). \\[\\begin{align} \\lambda(\\mathbb{R}) &= \\lambda(\\cup_{i = -\\infty}^{\\infty} (a_i, a_{i + 1}]) \\\\ &= \\lambda(\\lim_{n \\rightarrow \\infty} \\cup_{i = -n}^{n} (a_i, a_{i + 1}]) \\\\ &= \\lim_{n \\rightarrow \\infty} \\lambda(\\cup_{i = -n}^{n} (a_i, a_{i + 1}]) \\\\ &= \\lim_{n \\rightarrow \\infty} \\sum_{i = -n}^{n} \\lambda((a_i, a_{i + 1}]) \\\\ &= \\lim_{n \\rightarrow \\infty} \\sum_{i = -n}^{n} 1 \\\\ &= \\lim_{n \\rightarrow \\infty} 2n \\\\ &= \\infty. \\end{align}\\] We need to find a bijection between \\(\\mathbb{R}\\) and \\((0,1)\\). A well-known function that maps from a bounded interval to \\(\\mathbb{R}\\) is the tangent. To make the bijection easier to achieve, we will take the inverse, which maps from \\(\\mathbb{R}\\) to \\((-\\frac{\\pi}{2}, \\frac{\\pi}{2})\\). 
However, we need to change the function so it maps to \\((0,1)\\). First we add \\(\\frac{\\pi}{2}\\), so that we move the function above zero. Then we only have to divide by the max value, which in this case is \\(\\pi\\). So our bijection is \\[\\begin{equation} f(x) = \\frac{\\tan^{-1}(x) + \\frac{\\pi}{2}}{\\pi}. \\end{equation}\\] Exercise 2.8 Take the measure space \\((\\Omega_1 = (0,1], B_{(0,1]}, \\lambda)\\) (we know that this is a probability space on \\((0,1]\\)). Define a map (function) from \\(\\Omega_1\\) to \\(\\Omega_2 = \\{1,2,3,4,5,6\\}\\) such that the measure space \\((\\Omega_2, 2^{\\Omega_2}, \\lambda(f^{-1}()))\\) will be a discrete probability space with uniform probabilities (\\(P(\\omega) = \\frac{1}{6}, \\forall \\omega \\in \\Omega_2)\\). Is the map that you defined in (a) the only such map? How would you in the same fashion define a map that would result in a probability space that can be interpreted as a coin toss with probability \\(p\\) of heads? R: Use the map in (a) as a basis for a random generator for this fair die. Solution. In other words, we have to assign disjunct intervals of the same size to each element of \\(\\Omega_2\\). Therefore \\[\\begin{equation} f(x) = \\lceil 6x \\rceil. \\end{equation}\\] No, we could for example rearrange the order in which the intervals are mapped to integers. Additionally, we could have several disjoint intervals that mapped to the same integer, as long as the Lebesgue measure of their union would be \\(\\frac{1}{6}\\) and the function would remain injective. We have \\(\\Omega_3 = \\{0,1\\}\\), where zero represents heads and one represents tails. Then \\[\\begin{equation} f(x) = 0^{I_{A}(x)}, \\end{equation}\\] where \\(A = \\{y \\in (0,1] : y < p\\}\\). set.seed(1) unif_s <- runif(1000) die_s <- ceiling(6 * unif_s) summary(as.factor(die_s)) ## 1 2 3 4 5 6 ## 166 154 200 146 166 168 "],["condprob.html", "Chapter 3 Conditional probability 3.1 Calculating conditional probabilities 3.2 Conditional independence 3.3 Monty Hall problem", " Chapter 3 Conditional probability This chapter deals with conditional probability. The students are expected to acquire the following knowledge: Theoretical Identify whether variables are independent. Calculation of conditional probabilities. Understanding of conditional dependence and independence. How to apply Bayes’ theorem to solve difficult probabilistic questions. R Simulating conditional probabilities. cumsum. apply. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } 3.1 Calculating conditional probabilities Exercise 3.1 A military officer is in charge of identifying enemy aircraft and shooting them down. He is able to positively identify an enemy airplane 95% of the time and positively identify a friendly airplane 90% of the time. Furthermore, 99% of the airplanes are friendly. When the officer identifies an airplane as an enemy airplane, what is the probability that it is not and they will shoot at a friendly airplane? Solution. Let \\(E = 0\\) denote that the observed plane is friendly and \\(E=1\\) that it is an enemy. Let \\(I = 0\\) denote that the officer identified it as friendly and \\(I = 1\\) as enemy. Then \\[\\begin{align} P(E = 0 | I = 1) &= \\frac{P(I = 1 | E = 0)P(E = 0)}{P(I = 1)} \\\\ &= \\frac{P(I = 1 | E = 0)P(E = 0)}{P(I = 1 | E = 0)P(E = 0) + P(I = 1 | E = 1)P(E = 1)} \\\\ &= \\frac{0.1 \\times 0.99}{0.1 \\times 0.99 + 0.95 \\times 0.01} \\\\ &= 0.91. 
\\end{align}\\] Exercise 3.2 R: Consider tossing a fair die. Let \\(A = \\{2,4,6\\}\\) and \\(B = \\{1,2,3,4\\}\\). Then \\(P(A) = \\frac{1}{2}\\), \\(P(B) = \\frac{2}{3}\\) and \\(P(AB) = \\frac{1}{3}\\). Since \\(P(AB) = P(A)P(B)\\), the events \\(A\\) and \\(B\\) are independent. Simulate draws from the sample space and verify that the proportions are the same. Then find two events \\(C\\) and \\(D\\) that are not independent and repeat the simulation. set.seed(1) nsamps <- 10000 tosses <- sample(1:6, nsamps, replace = TRUE) PA <- sum(tosses %in% c(2,4,6)) / nsamps PB <- sum(tosses %in% c(1,2,3,4)) / nsamps PA * PB ## [1] 0.3295095 sum(tosses %in% c(2,4)) / nsamps ## [1] 0.3323 # Let C = {1,2} and D = {2,3} PC <- sum(tosses %in% c(1,2)) / nsamps PD <- sum(tosses %in% c(2,3)) / nsamps PC * PD ## [1] 0.1067492 sum(tosses %in% c(2)) / nsamps ## [1] 0.1622 Exercise 3.3 A machine reports the true value of a thrown 12-sided die 5 out of 6 times. If the machine reports a 1 has been tossed, what is the probability that it is actually a 1? Now let the machine only report whether a 1 has been tossed or not. Does the probability change? R: Use simulation to check your answers to a) and b). Solution. Let \\(T = 1\\) denote that the toss is 1 and \\(M = 1\\) that the machine reports a 1. \\[\\begin{align} P(T = 1 | M = 1) &= \\frac{P(M = 1 | T = 1)P(T = 1)}{P(M = 1)} \\\\ &= \\frac{P(M = 1 | T = 1)P(T = 1)}{\\sum_{k=1}^{12} P(M = 1 | T = k)P(T = k)} \\\\ &= \\frac{\\frac{5}{6}\\frac{1}{12}}{\\frac{5}{6}\\frac{1}{12} + 11 \\frac{1}{6} \\frac{1}{11} \\frac{1}{12}} \\\\ &= \\frac{5}{6}. \\end{align}\\] Yes. \\[\\begin{align} P(T = 1 | M = 1) &= \\frac{P(M = 1 | T = 1)P(T = 1)}{P(M = 1)} \\\\ &= \\frac{P(M = 1 | T = 1)P(T = 1)}{\\sum_{k=1}^{12} P(M = 1 | T = k)P(T = k)} \\\\ &= \\frac{\\frac{5}{6}\\frac{1}{12}}{\\frac{5}{6}\\frac{1}{12} + 11 \\frac{1}{6} \\frac{1}{12}} \\\\ &= \\frac{5}{16}. \\end{align}\\] set.seed(1) nsamps <- 10000 report_a <- vector(mode = "numeric", length = nsamps) report_b <- vector(mode = "logical", length = nsamps) truths <- vector(mode = "logical", length = nsamps) for (i in 1:10000) { toss <- sample(1:12, size = 1) truth <- sample(c(TRUE, FALSE), size = 1, prob = c(5/6, 1/6)) truths[i] <- truth if (truth) { report_a[i] <- toss report_b[i] <- toss == 1 } else { remaining <- (1:12)[1:12 != toss] report_a[i] <- sample(remaining, size = 1) report_b[i] <- toss != 1 } } truth_a1 <- truths[report_a == 1] sum(truth_a1) / length(truth_a1) ## [1] 0.8300733 truth_b1 <- truths[report_b] sum(truth_b1) / length(truth_b1) ## [1] 0.3046209 Exercise 3.4 A coin is tossed independently \\(n\\) times. The probability of heads at each toss is \\(p\\). At each time \\(k\\), \\((k = 2,3,...,n)\\) we get a reward at time \\(k+1\\) if \\(k\\)-th toss was a head and the previous toss was a tail. Let \\(A_k\\) be the event that a reward is obtained at time \\(k\\). Are events \\(A_k\\) and \\(A_{k+1}\\) independent? Are events \\(A_k\\) and \\(A_{k+2}\\) independent? R: simulate 10 tosses 10000 times, where \\(p = 0.7\\). Check your answers to a) and b) by counting the frequencies of the events \\(A_5\\), \\(A_6\\), and \\(A_7\\). Solution. For \\(A_k\\) to happen, we need the tosses \\(k-2\\) and \\(k-1\\) be tails and heads respectively. For \\(A_{k+1}\\) to happen, we need tosses \\(k-1\\) and \\(k\\) be tails and heads respectively. As the toss \\(k-1\\) need to be heads for one and tails for the other, these two events can not happen simultaneously. 
Therefore the probability of their intersection is 0. But the probability of each of them separately is \\(p(1-p) > 0\\). Therefore, they are not independent. For \\(A_k\\) to happen, we need the tosses \\(k-2\\) and \\(k-1\\) be tails and heads respectively. For \\(A_{k+2}\\) to happen, we need tosses \\(k\\) and \\(k+1\\) be tails and heads respectively. So the probability of intersection is \\(p^2(1-p)^2\\). And the probability of each separately is again \\(p(1-p)\\). Therefore, they are independent. set.seed(1) nsamps <- 10000 p <- 0.7 rewardA_5 <- vector(mode = "logical", length = nsamps) rewardA_6 <- vector(mode = "logical", length = nsamps) rewardA_7 <- vector(mode = "logical", length = nsamps) rewardA_56 <- vector(mode = "logical", length = nsamps) rewardA_57 <- vector(mode = "logical", length = nsamps) for (i in 1:nsamps) { samps <- sample(c(0,1), size = 10, replace = TRUE, prob = c(0.7, 0.3)) rewardA_5[i] <- (samps[4] == 0 & samps[3] == 1) rewardA_6[i] <- (samps[5] == 0 & samps[4] == 1) rewardA_7[i] <- (samps[6] == 0 & samps[5] == 1) rewardA_56[i] <- (rewardA_5[i] & rewardA_6[i]) rewardA_57[i] <- (rewardA_5[i] & rewardA_7[i]) } sum(rewardA_5) / nsamps ## [1] 0.2141 sum(rewardA_6) / nsamps ## [1] 0.2122 sum(rewardA_7) / nsamps ## [1] 0.2107 sum(rewardA_56) / nsamps ## [1] 0 sum(rewardA_57) / nsamps ## [1] 0.0454 Exercise 3.5 A drawer contains two coins. One is an unbiased coin, the other is a biased coin, which will turn up heads with probability \\(p\\) and tails with probability \\(1-p\\). One coin is selected uniformly at random. The selected coin is tossed \\(n\\) times. The coin turns up heads \\(k\\) times and tails \\(n-k\\) times. What is the probability that the coin is biased? The selected coin is tossed repeatedly until it turns up heads \\(k\\) times. Given that it is tossed \\(n\\) times in total, what is the probability that the coin is biased? Solution. Let \\(B = 1\\) denote that the coin is biased and let \\(H = k\\) denote that we’ve seen \\(k\\) heads. \\[\\begin{align} P(B = 1 | H = k) &= \\frac{P(H = k | B = 1)P(B = 1)}{P(H = k)} \\\\ &= \\frac{P(H = k | B = 1)P(B = 1)}{P(H = k | B = 1)P(B = 1) + P(H = k | B = 0)P(B = 0)} \\\\ &= \\frac{p^k(1-p)^{n-k} 0.5}{p^k(1-p)^{n-k} 0.5 + 0.5^{n+1}} \\\\ &= \\frac{p^k(1-p)^{n-k}}{p^k(1-p)^{n-k} + 0.5^n}. \\end{align}\\] The same results as in a). The only difference between these two scenarios is that in b) the last throw must be heads. However, this holds for the biased and the unbiased coin and therefore does not affect the probability of the coin being biased. Exercise 3.6 Judy goes around the company for Women’s day and shares flowers. In every office she leaves a flower, if there is at least one woman inside. The probability that there’s a woman in the office is \\(\\frac{3}{5}\\). What is the probability that Judy leaves her first flower in the fourth office? Given that she has given away exactly three flowers in the first four offices, what is the probability that she gives her fourth flower in the eighth office? What is the probability that she leaves the second flower in the fifth office? What is the probability that she leaves the second flower in the fifth office, given that she did not leave the second flower in the second office? Judy needs a new supply of flowers immediately after the office, where she gives away her last flower. What is the probability that she visits at least five offices, if she starts with two flowers? R: simulate Judy’s walk 10000 times to check your answers a) - e). Solution. 
Let \\(X_i = k\\) denote the event that … \\(i\\)-th sample on the \\(k\\)-th run. Since the events are independent, we can multiply their probabilities to get \\[\\begin{equation} P(X_1 = 4) = 0.4^3 \\times 0.6 = 0.0384. \\end{equation}\\] Same as in a) as we have a fresh start after first four offices. For this to be possible, she had to leave the first flower in one of the first four offices. Therefore there are four possibilities, and for each of those the probability is \\(0.4^3 \\times 0.6\\). Additionally, the probability that she leaves a flower in the fifth office is \\(0.6\\). So \\[\\begin{equation} P(X_2 = 5) = \\binom{4}{1} \\times 0.4^3 \\times 0.6^2 = 0.09216. \\end{equation}\\] We use Bayes’ theorem. \\[\\begin{align} P(X_2 = 5 | X_2 \\neq 2) &= \\frac{P(X_2 \\neq 2 | X_2 = 5)P(X_2 = 5)}{P(X_2 \\neq 2)} \\\\ &= \\frac{0.09216}{0.64} \\\\ &= 0.144. \\end{align}\\] The denominator in the second equation can be calculated as follows. One of three things has to happen for the second not to be dealt in the second round. First, both are zero, so \\(0.4^2\\). Second, first is zero, and second is one, so \\(0.4 \\times 0.6\\). Third, the first is one and the second one zero, so \\(0.6 \\times 0.4\\). Summing these values we get \\(0.64\\). We will look at the complement, so the events that she gave away exactly two flowers after two, three and four offices. \\[\\begin{equation} P(X_2 \\geq 5) = 1 - 0.6^2 - 2 \\times 0.4 \\times 0.6^2 - 3 \\times 0.4^2 \\times 0.6^2 = 0.1792. \\end{equation}\\] The multiplying parts represent the possibilities of the first flower. set.seed(1) nsamps <- 100000 Judyswalks <- matrix(data = NA, nrow = nsamps, ncol = 8) for (i in 1:nsamps) { thiswalk <- sample(c(0,1), size = 8, replace = TRUE, prob = c(0.4, 0.6)) Judyswalks[i, ] <- thiswalk } csJudy <- t(apply(Judyswalks, 1, cumsum)) # a sum(csJudy[ ,4] == 1 & csJudy[ ,3] == 0) / nsamps ## [1] 0.03848 # b csJsubset <- csJudy[csJudy[ ,4] == 3 & csJudy[ ,3] == 2, ] sum(csJsubset[ ,8] == 4 & csJsubset[ ,7] == 3) / nrow(csJsubset) ## [1] 0.03665893 # c sum(csJudy[ ,5] == 2 & csJudy[ ,4] == 1) / nsamps ## [1] 0.09117 # d sum(csJudy[ ,5] == 2 & csJudy[ ,4] == 1) / sum(csJudy[ ,2] != 2) ## [1] 0.1422398 # e sum(csJudy[ ,4] < 2) / nsamps ## [1] 0.17818 3.2 Conditional independence Exercise 3.7 Describe: A real-world example of two events \\(A\\) and \\(B\\) that are dependent but become conditionally independent if conditioned on a third event \\(C\\). A real-world example of two events \\(A\\) and \\(B\\) that are independent, but become dependent if conditioned on some third event \\(C\\). Solution. Let \\(A\\) be the height of a person and let \\(B\\) be the person’s knowledge of the Dutch language. These events are dependent since the Dutch are known to be taller than average. However if \\(C\\) is the nationality of the person, then \\(A\\) and \\(B\\) are independent given \\(C\\). Let \\(A\\) be the event that Mary passes the exam and let \\(B\\) be the event that John passes the exam. These events are independent. However, if the event \\(C\\) is that Mary and John studied together, then \\(A\\) and \\(B\\) are conditionally dependent given \\(C\\). Exercise 3.8 We have two coins of identical appearance. We know that one is a fair coin and the other flips heads 80% of the time. We choose one of the two coins uniformly at random. We discard the coin that was not chosen. We now flip the chosen coin independently 10 times, producing a sequence \\(Y_1 = y_1\\), \\(Y_2 = y_2\\), …, \\(Y_{10} = y_{10}\\). 
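Before tackling the questions below, it may help to simulate the setup. The following is a minimal sketch (ours, not part of the original exercise); it also estimates \\(P(Y_1 = 1)\\), which you can later compare with your answer to b).
set.seed(1)
n_reps <- 100000
first_flip <- numeric(n_reps)
for (i in 1:n_reps) {
  p <- sample(c(0.5, 0.8), size = 1)      # choose one of the two coins uniformly at random
  flips <- rbinom(10, size = 1, prob = p) # flip the chosen coin 10 times
  first_flip[i] <- flips[1]
}
mean(first_flip)                          # empirical estimate of P(Y_1 = 1)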
Intuitively, without doing and computation, are these random variables independent? Compute the probability \\(P(Y_1 = 1)\\). Compute the probabilities \\(P(Y_2 = 1 | Y_1 = 1)\\) and \\(P(Y_{10} = 1 | Y_1 = 1,...,Y_9 = 1)\\). Given your answers to b) and c), would you now change your answer to a)? If so, discuss why your intuition had failed. Solution. \\(P(Y_1 = 1) = 0.5 * 0.8 + 0.5 * 0.5 = 0.65\\). Since we know that \\(Y_1 = 1\\) this should change our view of the probability of the coin being biased or not. Let \\(B = 1\\) denote the event that the coin is biased and let \\(B = 0\\) denote that the coin is unbiased. By using marginal probability, we can write \\[\\begin{align} P(Y_2 = 1 | Y_1 = 1) &= P(Y_2 = 1, B = 1 | Y_1 = 1) + P(Y_2 = 1, B = 0 | Y_1 = 1) \\\\ &= \\sum_{k=1}^2 P(Y_2 = 1 | B = k, Y_1 = 1)P(B = k | Y_1 = 1) \\\\ &= 0.8 \\frac{P(Y_1 = 1 | B = 1)P(B = 1)}{P(Y_1 = 1)} + 0.5 \\frac{P(Y_1 = 1 | B = 0)P(B = 0)}{P(Y_1 = 1)} \\\\ &= 0.8 \\frac{0.8 \\times 0.5}{0.65} + 0.5 \\frac{0.5 \\times 0.5}{0.65} \\\\ &\\approx 0.68. \\end{align}\\] For the other calculation we follow the same procedure. Let \\(X = 1\\) denote that first nine tosses are all heads (equivalent to \\(Y_1 = 1\\),…, \\(Y_9 = 1\\)). \\[\\begin{align} P(Y_{10} = 1 | X = 1) &= P(Y_2 = 1, B = 1 | X = 1) + P(Y_2 = 1, B = 0 | X = 1) \\\\ &= \\sum_{k=1}^2 P(Y_2 = 1 | B = k, X = 1)P(B = k | X = 1) \\\\ &= 0.8 \\frac{P(X = 1 | B = 1)P(B = 1)}{P(X = 1)} + 0.5 \\frac{P(X = 1 | B = 0)P(B = 0)}{P(X = 1)} \\\\ &= 0.8 \\frac{0.8^9 \\times 0.5}{0.5 \\times 0.8^9 + 0.5 \\times 0.5^9} + 0.5 \\frac{0.5^9 \\times 0.5}{0.5 \\times 0.8^9 + 0.5 \\times 0.5^9} \\\\ &\\approx 0.8. \\end{align}\\] 3.3 Monty Hall problem The Monty Hall problem is a famous probability puzzle with non-intuitive outcome. Many established mathematicians and statisticians had problems solving it and many even disregarded the correct solution until they’ve seen the proof by simulation. Here we will show how it can be solved relatively simply with the use of Bayes’ theorem if we select the variables in a smart way. Exercise 3.9 (Monty Hall problem) A prize is placed at random behind one of three doors. You pick a door. Now Monty Hall chooses one of the other two doors, opens it and shows you that it is empty. He then gives you the opportunity to keep your door or switch to the other unopened door. Should you stay or switch? Use Bayes’ theorem to calculate the probability of winning if you switch and if you do not. R: Check your answers in R. Solution. W.L.O.G. assume we always pick the first door. The host can only open door 2 or door 3, as he can not open the door we picked. Let \\(k \\in \\{2,3\\}\\). Let us first look at what happens if we do not change. Then we have \\[\\begin{align} P(\\text{car in 1} | \\text{open $k$}) &= \\frac{P(\\text{open $k$} | \\text{car in 1})P(\\text{car in 1})}{P(\\text{open $k$})} \\\\ &= \\frac{P(\\text{open $k$} | \\text{car in 1})P(\\text{car in 1})}{\\sum_{n=1}^3 P(\\text{open $k$} | \\text{car in $n$})P(\\text{car in $n$)}}. \\end{align}\\] The probability that he opened \\(k\\) if the car is in 1 is \\(\\frac{1}{2}\\), as he can choose between door 2 and 3 as both have a goat behind it. Let us look at the normalization constant. When \\(n = 1\\) we get the value in the nominator. When \\(n=k\\), we get 0, as he will not open the door if there’s a prize behind. The remaining option is that we select 1, the car is behind \\(k\\) and he opens the only door left. 
Since he can’t open 1 due to it being our pick and \\(k\\) due to having the prize, the probability of opening the remaining door is 1, and the prior probability of the car being behind this door is \\(\\frac{1}{3}\\). So we have \\[\\begin{align} P(\\text{car in 1} | \\text{open $k$}) &= \\frac{\\frac{1}{2}\\frac{1}{3}}{\\frac{1}{2}\\frac{1}{3} + \\frac{1}{3}} \\\\ &= \\frac{1}{3}. \\end{align}\\] Now let us look at what happens if we do change. Let \\(k' \\in \\{2,3\\}\\) be the door that is not opened. If we change, we select this door, so we have \\[\\begin{align} P(\\text{car in $k'$} | \\text{open $k$}) &= \\frac{P(\\text{open $k$} | \\text{car in $k'$})P(\\text{car in $k'$})}{P(\\text{open $k$})} \\\\ &= \\frac{P(\\text{open $k$} | \\text{car in $k'$})P(\\text{car in $k'$})}{\\sum_{n=1}^3 P(\\text{open $k$} | \\text{car in $n$})P(\\text{car in $n$)}}. \\end{align}\\] The denominator stays the same, the only thing that is different from before is \\(P(\\text{open $k$} | \\text{car in $k'$})\\). We have a situation where we initially selected door 1 and the car is in door \\(k'\\). The probability that the host will open door \\(k\\) is then 1, as he can not pick any other door. So we have \\[\\begin{align} P(\\text{car in $k'$} | \\text{open $k$}) &= \\frac{\\frac{1}{3}}{\\frac{1}{2}\\frac{1}{3} + \\frac{1}{3}} \\\\ &= \\frac{2}{3}. \\end{align}\\] Therefore it makes sense to change the door. set.seed(1) nsamps <- 1000 ifchange <- vector(mode = "logical", length = nsamps) ifstay <- vector(mode = "logical", length = nsamps) for (i in 1:nsamps) { where_car <- sample(c(1:3), 1) where_player <- sample(c(1:3), 1) open_samp <- (1:3)[where_car != (1:3) & where_player != (1:3)] if (length(open_samp) == 1) { where_open <- open_samp } else { where_open <- sample(open_samp, 1) } ifstay[i] <- where_car == where_player where_ifchange <- (1:3)[where_open != (1:3) & where_player != (1:3)] ifchange[i] <- where_ifchange == where_car } sum(ifstay) / nsamps ## [1] 0.328 sum(ifchange) / nsamps ## [1] 0.672 "],["rvs.html", "Chapter 4 Random variables 4.1 General properties and calculations 4.2 Discrete random variables 4.3 Continuous random variables 4.4 Singular random variables 4.5 Transformations", " Chapter 4 Random variables This chapter deals with random variables and their distributions. The students are expected to acquire the following knowledge: Theoretical Identification of random variables. Convolutions of random variables. Derivation of PDF, PMF, CDF, and quantile function. Definitions and properties of common discrete random variables. Definitions and properties of common continuous random variables. Transforming univariate random variables. R Familiarize with PDF, PMF, CDF, and quantile functions for several distributions. Visual inspection of probability distributions. Analytical and empirical calculation of probabilities based on distributions. New R functions for plotting (for example, facet_wrap). Creating random number generators based on the Uniform distribution. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } 4.1 General properties and calculations Exercise 4.1 Which of the functions below are valid CDFs? Find their respective densities. R: Plot the three functions. \\[\\begin{equation} F(x) = \\begin{cases} 1 - e^{-x^2} & x \\geq 0 \\\\ 0 & x < 0. \\end{cases} \\end{equation}\\] \\[\\begin{equation} F(x) = \\begin{cases} e^{-\\frac{1}{x}} & x > 0 \\\\ 0 & x \\leq 0. 
\\end{cases} \\end{equation}\\] \\[\\begin{equation} F(x) = \\begin{cases} 0 & x \\leq 0 \\\\ \\frac{1}{3} & 0 < x \\leq \\frac{1}{2} \\\\ 1 & x > \\frac{1}{2}. \\end{cases} \\end{equation}\\] Solution. Yes. First, let us check the limits. \\(\\lim_{x \\rightarrow -\\infty} (0) = 0\\). \\(\\lim_{x \\rightarrow \\infty} (1 - e^{-x^2}) = 1 - \\lim_{x \\rightarrow \\infty} e^{-x^2} = 1 - 0 = 1\\). Second, let us check whether the function is increasing. Let \\(x > y \\geq 0\\). Then \\(1 - e^{-x^2} \\geq 1 - e^{-y^2}\\). We only have to check right continuity for the point zero. \\(F(0) = 0\\) and \\(\\lim_{\\epsilon \\downarrow 0}F (0 + \\epsilon) = \\lim_{\\epsilon \\downarrow 0} 1 - e^{-\\epsilon^2} = 1 - \\lim_{\\epsilon \\downarrow 0} e^{-\\epsilon^2} = 1 - 1 = 0\\). We get the density by differentiating the CDF. \\(p(x) = \\frac{d}{dx} 1 - e^{-x^2} = 2xe^{-x^2}.\\) Students are encouraged to check that this is a proper PDF. Yes. First, let us check the limits. $_{x -} (0) = 0 and \\(\\lim_{x \\rightarrow \\infty} (e^{-\\frac{1}{x}}) = 1\\). Second, let us check whether the function is increasing. Let \\(x > y \\geq 0\\). Then \\(e^{-\\frac{1}{x}} \\geq e^{-\\frac{1}{y}}\\). We only have to check right continuity for the point zero. \\(F(0) = 0\\) and \\(\\lim_{\\epsilon \\downarrow 0}F (0 + \\epsilon) = \\lim_{\\epsilon \\downarrow 0} e^{-\\frac{1}{\\epsilon}} = 0\\). We get the density by differentiating the CDF. \\(p(x) = \\frac{d}{dx} e^{-\\frac{1}{x}} = \\frac{1}{x^2}e^{-\\frac{1}{x}}.\\) Students are encouraged to check that this is a proper PDF. No. The function is not right continuous as \\(F(\\frac{1}{2}) = \\frac{1}{3}\\), but \\(\\lim_{\\epsilon \\downarrow 0} F(\\frac{1}{2} + \\epsilon) = 1\\). f1 <- function (x) { tmp <- 1 - exp(-x^2) tmp[x < 0] <- 0 return(tmp) } f2 <- function (x) { tmp <- exp(-(1 / x)) tmp[x <= 0] <- 0 return(tmp) } f3 <- function (x) { tmp <- x tmp[x == x] <- 1 tmp[x <= 0.5] <- 1/3 tmp[x <= 0] <- 0 return(tmp) } cdf_data <- tibble(x = seq(-1, 20, by = 0.001), f1 = f1(x), f2 = f2(x), f3 = f3(x)) %>% melt(id.vars = "x") cdf_plot <- ggplot(data = cdf_data, aes(x = x, y = value, color = variable)) + geom_hline(yintercept = 1) + geom_line() plot(cdf_plot) Exercise 4.2 Let \\(X\\) be a random variable with CDF \\[\\begin{equation} F(x) = \\begin{cases} 0 & x < 0 \\\\ \\frac{x^2}{2} & 0 \\leq x < 1 \\\\ \\frac{1}{2} + \\frac{p}{2} & 1 \\leq x < 2 \\\\ \\frac{1}{2} + \\frac{p}{2} + \\frac{1 - p}{2} & x \\geq 2 \\end{cases} \\end{equation}\\] R: Plot this CDF for \\(p = 0.3\\). Is it a discrete, continuous, or mixed random varible? Find the probability density/mass of \\(X\\). f1 <- function (x, p) { tmp <- x tmp[x >= 2] <- 0.5 + (p * 0.5) + ((1-p) * 0.5) tmp[x < 2] <- 0.5 + (p * 0.5) tmp[x < 1] <- (x[x < 1])^2 / 2 tmp[x < 0] <- 0 return(tmp) } cdf_data <- tibble(x = seq(-1, 5, by = 0.001), y = f1(x, 0.3)) cdf_plot <- ggplot(data = cdf_data, aes(x = x, y = y)) + geom_hline(yintercept = 1) + geom_line(color = "blue") plot(cdf_plot) ::: {.solution} \\(X\\) is a mixed random variable. Since \\(X\\) is a mixed random variable, we have to find the PDF of the continuous part and the PMF of the discrete part. We get the continuous part by differentiating the corresponding CDF, \\(\\frac{d}{dx}\\frac{x^2}{2} = x\\). So the PDF, when \\(0 \\leq x < 1\\), is \\(p(x) = x\\). Let us look at the discrete part now. It has two steps, so this is a discrete distribution with two outcomes – numbers 1 and 2. 
The first happens with probability \\(\\frac{p}{2}\\), and the second with probability \\(\\frac{1 - p}{2}\\). This reminds us of the Bernoulli distribution. The PMF for the discrete part is \\(P(X = x) = (\\frac{p}{2})^{2 - x} (\\frac{1 - p}{2})^{x - 1}\\). ::: Exercise 4.3 (Convolutions) Convolutions are probability distributions that correspond to sums of independent random variables. Let \\(X\\) and \\(Y\\) be independent discrete variables. Find the PMF of \\(Z = X + Y\\). Hint: Use the law of total probability. Let \\(X\\) and \\(Y\\) be independent continuous variables. Find the PDF of \\(Z = X + Y\\). Hint: Start with the CDF. Solution. \\[\\begin{align} P(Z = z) &= P(X + Y = z) & \\\\ &= \\sum_{k = -\\infty}^\\infty P(X + Y = z | Y = k) P(Y = k) & \\text{ (law of total probability)} \\\\ &= \\sum_{k = -\\infty}^\\infty P(X + k = z | Y = k) P(Y = k) & \\\\ &= \\sum_{k = -\\infty}^\\infty P(X + k = z) P(Y = k) & \\text{ (independence of $X$ and $Y$)} \\\\ &= \\sum_{k = -\\infty}^\\infty P(X = z - k) P(Y = k). & \\end{align}\\] Let \\(f\\) and \\(g\\) be the PDFs of \\(X\\) and \\(Y\\) respectively. \\[\\begin{align} F(z) &= P(Z < z) \\\\ &= P(X + Y < z) \\\\ &= \\int_{-\\infty}^{\\infty} P(X + Y < z | Y = y)P(Y = y)dy \\\\ &= \\int_{-\\infty}^{\\infty} P(X + y < z | Y = y)P(Y = y)dy \\\\ &= \\int_{-\\infty}^{\\infty} P(X + y < z)P(Y = y)dy \\\\ &= \\int_{-\\infty}^{\\infty} P(X < z - y)P(Y = y)dy \\\\ &= \\int_{-\\infty}^{\\infty} (\\int_{-\\infty}^{z - y} f(x) dx) g(y) dy \\end{align}\\] Now \\[\\begin{align} p(z) &= \\frac{d}{dz} F(z) & \\\\ &= \\int_{-\\infty}^{\\infty} (\\frac{d}{dz}\\int_{-\\infty}^{z - y} f(x) dx) g(y) dy & \\\\ &= \\int_{-\\infty}^{\\infty} f(z - y) g(y) dy & \\text{ (fundamental theorem of calculus)}. \\end{align}\\] 4.2 Discrete random variables Exercise 4.4 (Binomial random variable) Let \\(X_k\\), \\(k = 1,...,n\\), be random variables with the Bernoulli measure as the PMF. Let \\(X = \\sum_{k=1}^n X_k\\). We call \\(X_k\\) a Bernoulli random variable with parameter \\(p \\in (0,1)\\). Find the CDF of \\(X_k\\). Find PMF of \\(X\\). This is a Binomial random variable with support in \\(\\{0,1,2,...,n\\}\\) and parameters \\(p \\in (0,1)\\) and \\(n \\in \\mathbb{N}_0\\). We denote \\[\\begin{equation} X | n,p \\sim \\text{binomial}(n,p). \\end{equation}\\] Find CDF of \\(X\\). R: Simulate from the binomial distribution with \\(n = 10\\) and \\(p = 0.5\\), and from \\(n\\) Bernoulli distributions with \\(p = 0.5\\). Visually compare the sum of Bernoullis and the binomial. Hint: there is no standard function like rpois for a Bernoulli random variable. Check exercise 1.12 to find out how to sample from a Bernoulli distribution. Solution. There are two outcomes – zero and one. Zero happens with probability \\(1 - p\\). Therefore \\[\\begin{equation} F(k) = \\begin{cases} 0 & k < 0 \\\\ 1 - p & 0 \\leq k < 1 \\\\ 1 & k \\geq 1. \\end{cases} \\end{equation}\\] For the probability of \\(X\\) to be equal to some \\(k \\leq n\\), exactly \\(k\\) Bernoulli variables need to be one, and the others zero. So \\(p^k(1-p)^{n-k}\\). There are \\(\\binom{n}{k}\\) such possible arrangements. Therefore \\[\\begin{align} P(X = k) = \\binom{n}{k} p^k (1 - p)^{n-k}. 
\\end{align}\\] \\[\\begin{equation} F(k) = \\sum_{i = 0}^{\\lfloor k \\rfloor} \\binom{n}{i} p^i (1 - p)^{n - i} \\end{equation}\\] set.seed(1) nsamps <- 10000 binom_samp <- rbinom(nsamps, size = 10, prob = 0.5) bernoulli_mat <- matrix(data = NA, nrow = nsamps, ncol = 10) for (i in 1:nsamps) { bernoulli_mat[i, ] <- rbinom(10, size = 1, prob = 0.5) } bern_samp <- apply(bernoulli_mat, 1, sum) b_data <- tibble(x = c(binom_samp, bern_samp), type = c(rep("binomial", 10000), rep("Bernoulli_sum", 10000))) b_plot <- ggplot(data = b_data, aes(x = x, fill = type)) + geom_bar(position = "dodge") plot(b_plot) Exercise 4.5 (Geometric random variable) A variable with PMF \\[\\begin{equation} P(k) = p(1-p)^k \\end{equation}\\] is a geometric random variable with support in non-negative integers. It has one parameter \\(p \\in (0,1]\\). We denote \\[\\begin{equation} X | p \\sim \\text{geometric}(p) \\end{equation}\\] Derive the CDF of a geometric random variable. R: Draw 1000 samples from the geometric distribution with \\(p = 0.3\\) and compare their frequencies to theoretical values. Solution. \\[\\begin{align} P(X \\leq k) &= \\sum_{i = 0}^k p(1-p)^i \\\\ &= p \\sum_{i = 0}^k (1-p)^i \\\\ &= p \\frac{1 - (1-p)^{k+1}}{1 - (1 - p)} \\\\ &= 1 - (1-p)^{k + 1} \\end{align}\\] set.seed(1) geo_samp <- rgeom(n = 1000, prob = 0.3) geo_samp <- data.frame(x = geo_samp) %>% count(x) %>% mutate(n = n / 1000, type = "empirical_frequencies") %>% bind_rows(data.frame(x = 0:20, n = dgeom(0:20, prob = 0.3), type = "theoretical_measure")) geo_plot <- ggplot(data = geo_samp, aes(x = x, y = n, fill = type)) + geom_bar(stat="identity", position = "dodge") plot(geo_plot) Exercise 4.6 (Poisson random variable) A variable with PMF \\[\\begin{equation} P(k) = \\frac{\\lambda^k e^{-\\lambda}}{k!} \\end{equation}\\] is a Poisson random variable with support in non-negative integers. It has one positive parameter \\(\\lambda\\), which also represents its mean value and variance (a measure of the deviation of the values from the mean – more on mean and variance in the next chapter). We denote \\[\\begin{equation} X | \\lambda \\sim \\text{Poisson}(\\lambda). \\end{equation}\\] This distribution is usually the default choice for modeling counts. We have already encountered a Poisson random variable in exercise 1.13, where we also sampled from this distribution. The CDF of a Poisson random variable is \\(P(X <= x) = e^{-\\lambda} \\sum_{i=0}^x \\frac{\\lambda^{i}}{i!}\\). R: Draw 1000 samples from the Poisson distribution with \\(\\lambda = 5\\) and compare their empirical cumulative distribution function with the theoretical CDF. set.seed(1) pois_samp <- rpois(n = 1000, lambda = 5) pois_samp <- data.frame(x = pois_samp) pois_plot <- ggplot(data = pois_samp, aes(x = x, colour = "ECDF")) + stat_ecdf(geom = "step") + geom_step(data = tibble(x = 0:17, y = ppois(x, 5)), aes(x = x, y = y, colour = "CDF")) + scale_colour_manual("Lgend title", values = c("black", "red")) plot(pois_plot) Exercise 4.7 (Negative binomial random variable) A variable with PMF \\[\\begin{equation} p(k) = \\binom{k + r - 1}{k}(1-p)^r p^k \\end{equation}\\] is a negative binomial random variable with support in non-negative integers. It has two parameters \\(r > 0\\) and \\(p \\in (0,1)\\). We denote \\[\\begin{equation} X | r,p \\sim \\text{NB}(r,p). \\end{equation}\\] Let us reparameterize the negative binomial distribution with \\(q = 1 - p\\). Find the PMF of \\(X \\sim \\text{NB}(1, q)\\). Do you recognize this distribution? 
Show that the sum of two negative binomial random variables with the same \\(p\\) is also a negative binomial random variable. Hint: Use the fact that the number of ways to place \\(n\\) indistinct balls into \\(k\\) boxes is \\(\\binom{n + k - 1}{n}\\). R: Draw samples from \\(X \\sim \\text{NB}(5, 0.4)\\) and \\(Y \\sim \\text{NB}(3, 0.4)\\). Draw samples from \\(Z = X + Y\\), where you use the parameters calculated in b). Plot both distributions, their sum, and \\(Z\\) using facet_wrap. Be careful, as R uses a different parameterization size=\\(r\\) and prob=\\(1 - p\\). Solution. \\[\\begin{align} P(X = k) &= \\binom{k + 1 - 1}{k}q^1 (1-q)^k \\\\ &= q(1-q)^k. \\end{align}\\] This is the geometric distribution. Let \\(X \\sim \\text{NB}(r_1, p)\\) and \\(Y \\sim \\text{NB}(r_2, p)\\). Let \\(Z = X + Y\\). \\[\\begin{align} P(Z = z) &= \\sum_{k = 0}^{\\infty} P(X = z - k)P(Y = k), \\text{ if k < 0, then the probabilities are 0} \\\\ &= \\sum_{k = 0}^{z} P(X = z - k)P(Y = k), \\text{ if k > z, then the probabilities are 0} \\\\ &= \\sum_{k = 0}^{z} \\binom{z - k + r_1 - 1}{z - k}(1 - p)^{r_1} p^{z - k} \\binom{k + r_2 - 1}{k}(1 - p)^{r_2} p^{k} & \\\\ &= \\sum_{k = 0}^{z} \\binom{z - k + r_1 - 1}{z - k} \\binom{k + r_2 - 1}{k}(1 - p)^{r_1 + r_2} p^{z} & \\\\ &= (1 - p)^{r_1 + r_2} p^{z} \\sum_{k = 0}^{z} \\binom{z - k + r_1 - 1}{z - k} \\binom{k + r_2 - 1}{k}& \\end{align}\\] The part before the sum reminds us of the negative binomial distribution with parameters \\(r_1 + r_2\\) and \\(p\\). To complete this term to the negative binomial PMF we need \\(\\binom{z + r_1 + r_2 -1}{z}\\). So the only thing we need to prove is that the sum equals this term. Both terms in the sum can be interpreted as numbers of ways to place a number of balls into boxes. For the left term it is \\(z-k\\) balls into \\(r_1\\) boxes, and for the right \\(k\\) balls into \\(r_2\\) boxes. For each \\(k\\) we are distributing \\(z\\) balls in total. By summing over all \\(k\\), we actually get all the possible placements of \\(z\\) balls into \\(r_1 + r_2\\) boxes. Therefore \\[\\begin{align} P(Z = z) &= (1 - p)^{r_1 + r_2} p^{z} \\sum_{k = 0}^{z} \\binom{z - k + r_1 - 1}{z - k} \\binom{k + r_2 - 1}{k}& \\\\ &= \\binom{z + r_1 + r_2 -1}{z} (1 - p)^{r_1 + r_2} p^{z}. \\end{align}\\] From this it also follows that the sum of geometric distributions with the same parameter is a negative binomial distribution. \\(Z \\sim \\text{NB}(8, 0.4)\\). set.seed(1) nsamps <- 10000 x <- rnbinom(nsamps, size = 5, prob = 0.6) y <- rnbinom(nsamps, size = 3, prob = 0.6) xpy <- x + y z <- rnbinom(nsamps, size = 8, prob = 0.6) samps <- tibble(x, y, xpy, z) samps <- melt(samps) ggplot(data = samps, aes(x = value)) + geom_bar() + facet_wrap(~ variable) 4.3 Continuous random variables Exercise 4.8 (Exponential random variable) A variable \\(X\\) with PDF \\(\\lambda e^{-\\lambda x}\\) is an exponential random variable with support in non-negative real numbers. It has one positive parameter \\(\\lambda\\). We denote \\[\\begin{equation} X | \\lambda \\sim \\text{Exp}(\\lambda). \\end{equation}\\] Find the CDF of an exponential random variable. Find the quantile function of an exponential random variable. Calculate the probability \\(P(1 \\leq X \\leq 3)\\), where \\(X \\sim \\text{Exp(1.5)}\\). R: Check your answer to c) with a simulation (rexp). Plot the probability in a meaningful way. R: Implement PDF, CDF, and the quantile function and compare their values with corresponding R functions visually. 
Hint: use the size parameter to make one of the curves wider. Solution. \\[\\begin{align} F(x) &= \\int_{0}^{x} \\lambda e^{-\\lambda t} dt \\\\ &= \\lambda \\int_{0}^{x} e^{-\\lambda t} dt \\\\ &= \\lambda (\\frac{1}{-\\lambda}e^{-\\lambda t} |_{0}^{x}) \\\\ &= \\lambda(\\frac{1}{\\lambda} - \\frac{1}{\\lambda} e^{-\\lambda x}) \\\\ &= 1 - e^{-\\lambda x}. \\end{align}\\] \\[\\begin{align} F(F^{-1}(x)) &= x \\\\ 1 - e^{-\\lambda F^{-1}(x)} &= x \\\\ e^{-\\lambda F^{-1}(x)} &= 1 - x \\\\ -\\lambda F^{-1}(x) &= \\ln(1 - x) \\\\ F^{-1}(x) &= - \\frac{ln(1 - x)}{\\lambda}. \\end{align}\\] \\[\\begin{align} P(1 \\leq X \\leq 3) &= P(X \\leq 3) - P(X \\leq 1) \\\\ &= P(X \\leq 3) - P(X \\leq 1) \\\\ &= 1 - e^{-1.5 \\times 3} - 1 + e^{-1.5 \\times 1} \\\\ &\\approx 0.212. \\end{align}\\] set.seed(1) nsamps <- 1000 samps <- rexp(nsamps, rate = 1.5) sum(samps >= 1 & samps <= 3) / nsamps ## [1] 0.212 exp_plot <- ggplot(data.frame(x = seq(0, 5, by = 0.01)), aes(x = x)) + stat_function(fun = dexp, args = list(rate = 1.5)) + stat_function(fun = dexp, args = list(rate = 1.5), xlim = c(1,3), geom = "area", fill = "red") plot(exp_plot) exp_pdf <- function(x, lambda) { return (lambda * exp(-lambda * x)) } exp_cdf <- function(x, lambda) { return (1 - exp(-lambda * x)) } exp_quant <- function(q, lambda) { return (-(log(1 - q) / lambda)) } ggplot(data = data.frame(x = seq(0, 5, by = 0.01)), aes(x = x)) + stat_function(fun = dexp, args = list(rate = 1.5), aes(color = "R"), size = 2.5) + stat_function(fun = exp_pdf, args = list(lambda = 1.5), aes(color = "Mine"), size = 1.2) + scale_color_manual(values = c("red", "black")) ggplot(data = data.frame(x = seq(0, 5, by = 0.01)), aes(x = x)) + stat_function(fun = pexp, args = list(rate = 1.5), aes(color = "R"), size = 2.5) + stat_function(fun = exp_cdf, args = list(lambda = 1.5), aes(color = "Mine"), size = 1.2) + scale_color_manual(values = c("red", "black")) ggplot(data = data.frame(x = seq(0, 1, by = 0.01)), aes(x = x)) + stat_function(fun = qexp, args = list(rate = 1.5), aes(color = "R"), size = 2.5) + stat_function(fun = exp_quant, args = list(lambda = 1.5), aes(color = "Mine"), size = 1.2) + scale_color_manual(values = c("red", "black")) Exercise 4.9 (Uniform random variable) Continuous uniform random variable with parameters \\(a\\) and \\(b\\) has the PDF \\[\\begin{equation} p(x) = \\begin{cases} \\frac{1}{b - a} & x \\in [a,b] \\\\ 0 & \\text{otherwise}. \\end{cases} \\end{equation}\\] Find the CDF of the uniform random variable. Find the quantile function of the uniform random variable. Let \\(X \\sim \\text{Uniform}(a,b)\\). Find the CDF of the variable \\(Y = \\frac{X - a}{b - a}\\). This is the standard uniform random variable. Let \\(X \\sim \\text{Uniform}(-1, 3)\\). Find such \\(z\\) that \\(P(X < z + \\mu_x) = \\frac{1}{5}\\). R: Check your result from d) using simulation. Solution. \\[\\begin{align} F(x) &= \\int_{a}^x \\frac{1}{b - a} dt \\\\ &= \\frac{1}{b - a} \\int_{a}^x dt \\\\ &= \\frac{x - a}{b - a}. \\end{align}\\] \\[\\begin{align} F(F^{-1}(p)) &= p \\\\ \\frac{F^{-1}(p) - a}{b - a} &= p \\\\ F^{-1}(p) &= p(b - a) + a. \\end{align}\\] \\[\\begin{align} F_Y(y) &= P(Y < y) \\\\ &= P(\\frac{X - a}{b - a} < y) \\\\ &= P(X < y(b - a) + a) \\\\ &= F_X(y(b - a) + a) \\\\ &= \\frac{(y(b - a) + a) - a}{b - a} \\\\ &= y. \\end{align}\\] \\[\\begin{align} P(X < z + 1) &= \\frac{1}{5} \\\\ F(z + 1) &= \\frac{1}{5} \\\\ z + 1 &= F^{-1}(\\frac{1}{5}) \\\\ z &= \\frac{1}{5}4 - 1 - 1 \\\\ z &= -1.2. 
\\end{align}\\] set.seed(1) a <- -1 b <- 3 nsamps <- 10000 unif_samp <- runif(nsamps, a, b) mu_x <- mean(unif_samp) new_samp <- unif_samp - mu_x quantile(new_samp, probs = 1/5) ## 20% ## -1.203192 punif(-0.2, -1, 3) ## [1] 0.2 Exercise 4.10 (Beta random variable) A variable \\(X\\) with PDF \\[\\begin{equation} p(x) = \\frac{x^{\\alpha - 1} (1 - x)^{\\beta - 1}}{\\text{B}(\\alpha, \\beta)}, \\end{equation}\\] where \\(\\text{B}(\\alpha, \\beta) = \\frac{\\Gamma(\\alpha) \\Gamma(\\beta)}{\\Gamma(\\alpha + \\beta)}\\) and \\(\\Gamma(x) = \\int_0^{\\infty} x^{z - 1} e^{-x} dx\\) is a Beta random variable with support on \\([0,1]\\). It has two positive parameters \\(\\alpha\\) and \\(\\beta\\). Notation: \\[\\begin{equation} X | \\alpha, \\beta \\sim \\text{Beta}(\\alpha, \\beta) \\end{equation}\\] It is often used in modeling rates. Calculate the PDF for \\(\\alpha = 1\\) and \\(\\beta = 1\\). What do you notice? R: Plot densities of the beta distribution for parameter pairs (2, 2), (4, 1), (1, 4), (2, 5), and (0.1, 0.1). R: Sample from \\(X \\sim \\text{Beta}(2, 5)\\) and compare the histogram with Beta PDF. Solution. \\[\\begin{equation} p(x) = \\frac{x^{1 - 1} (1 - x)^{1 - 1}}{\\text{B}(1, 1)} = 1. \\end{equation}\\] This is the standard uniform distribution. set.seed(1) ggplot(data = data.frame(x = seq(0, 1, by = 0.01)), aes(x = x)) + stat_function(fun = dbeta, args = list(shape1 = 2, shape2 = 2), aes(color = "alpha = 0.5")) + stat_function(fun = dbeta, args = list(shape1 = 4, shape2 = 1), aes(color = "alpha = 4")) + stat_function(fun = dbeta, args = list(shape1 = 1, shape2 = 4), aes(color = "alpha = 1")) + stat_function(fun = dbeta, args = list(shape1 = 2, shape2 = 5), aes(color = "alpha = 25")) + stat_function(fun = dbeta, args = list(shape1 = 0.1, shape2 = 0.1), aes(color = "alpha = 0.1")) set.seed(1) nsamps <- 1000 samps <- rbeta(nsamps, 2, 5) ggplot(data = data.frame(x = samps), aes(x = x)) + geom_histogram(aes(y = ..density..), color = "black") + stat_function(data = data.frame(x = seq(0, 1, by = 0.01)), aes(x = x), fun = dbeta, args = list(shape1 = 2, shape2 = 5), color = "red", size = 1.2) Exercise 4.11 (Gamma random variable) A random variable with PDF \\[\\begin{equation} p(x) = \\frac{\\beta^\\alpha}{\\Gamma(\\alpha)} x^{\\alpha - 1}e^{-\\beta x} \\end{equation}\\] is a Gamma random variable with support on the positive numbers and parameters shape \\(\\alpha > 0\\) and rate \\(\\beta > 0\\). We write \\[\\begin{equation} X | \\alpha, \\beta \\sim \\text{Gamma}(\\alpha, \\beta) \\end{equation}\\] and it’s CDF is \\[\\begin{equation} \\frac{\\gamma(\\alpha, \\beta x)}{\\Gamma(\\alpha)}, \\end{equation}\\] where \\(\\gamma(s, x) = \\int_0^x t^{s-1} e^{-t} dt\\). It is usually used in modeling positive phenomena (for example insurance claims and rainfalls). Let \\(X \\sim \\text{Gamma}(1, \\beta)\\). Find the PDF of \\(X\\). Do you recognize this PDF? Let \\(k = \\alpha\\) and \\(\\theta = \\frac{1}{\\beta}\\). Find the PDF of \\(X | k, \\theta \\sim \\text{Gamma}(k, \\theta)\\). Random variables can be reparameterized, and sometimes a reparameterized distribution is more suitable for certain calculations. The first parameterization is for example usually used in Bayesian statistics, while this parameterization is more common in econometrics and some other applied fields. Note that you also need to pay attention to the parameters in statistical software, so diligently read the help files when using functions like rgamma to see how the function is parameterized. 
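For instance (an added illustration, not from the original text), R's dgamma accepts either a rate argument or a scale argument (scale = 1 / rate), and mixing the two up silently changes the distribution:
dgamma(2, shape = 3, rate = 0.5) # rate parameterization
dgamma(2, shape = 3, scale = 2)  # same density, since scale = 1 / rate
dgamma(2, shape = 3, rate = 2)   # a different distribution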
R: Plot gamma CDF for random variables with shape and rate parameters (1,1), (10,1), (1,10). Solution. \\[\\begin{align} p(x) &= \\frac{\\beta^1}{\\Gamma(1)} x^{1 - 1}e^{-\\beta x} \\\\ &= \\beta e^{-\\beta x} \\end{align}\\] This is the PDF of the exponential distribution with parameter \\(\\beta\\). \\[\\begin{align} p(x) &= \\frac{1}{\\Gamma(k)\\beta^k} x^{k - 1}e^{-\\frac{x}{\\theta}}. \\end{align}\\] set.seed(1) ggplot(data = data.frame(x = seq(0, 25, by = 0.01)), aes(x = x)) + stat_function(fun = pgamma, args = list(shape = 1, rate = 1), aes(color = "Gamma(1,1)")) + stat_function(fun = pgamma, args = list(shape = 10, rate = 1), aes(color = "Gamma(10,1)")) + stat_function(fun = pgamma, args = list(shape = 1, rate = 10), aes(color = "Gamma(1,10)")) Exercise 4.12 (Normal random variable) A random variable with PDF \\[\\begin{equation} p(x) = \\frac{1}{\\sqrt{2 \\pi \\sigma^2}} e^{-\\frac{(x - \\mu)^2}{2 \\sigma^2}} \\end{equation}\\] is a normal random variable with support on the real axis and parameters \\(\\mu\\) in reals and \\(\\sigma^2 > 0\\). The first is the mean parameter and the second is the variance parameter. Many statistical methods assume a normal distribution. We denote \\[\\begin{equation} X | \\mu, \\sigma \\sim \\text{N}(\\mu, \\sigma^2), \\end{equation}\\] and it’s CDF is \\[\\begin{equation} F(x) = \\int_{-\\infty}^x \\frac{1}{\\sqrt{2 \\pi \\sigma^2}} e^{-\\frac{(t - \\mu)^2}{2 \\sigma^2}} dt, \\end{equation}\\] which is intractable and is usually approximated. Due to its flexibility it is also one of the most researched distributions. For that reason statisticians often use transformations of variables or approximate distributions with the normal distribution. Show that a variable \\(\\frac{X - \\mu}{\\sigma} \\sim \\text{N}(0,1)\\). This transformation is called standardization, and \\(\\text{N}(0,1)\\) is a standard normal distribution. R: Plot the normal distribution with \\(\\mu = 0\\) and different values for the \\(\\sigma\\) parameter. R: The normal distribution provides a good approximation for the Poisson distribution with a large \\(\\lambda\\). Let \\(X \\sim \\text{Poisson}(50)\\). Approximate \\(X\\) with the normal distribution and compare its density with the Poisson histogram. What are the values of \\(\\mu\\) and \\(\\sigma^2\\) that should provide the best approximation? Note that R function rnorm takes standard deviation (\\(\\sigma\\)) as a parameter and not variance. Solution. \\[\\begin{align} P(\\frac{X - \\mu}{\\sigma} < x) &= P(X < \\sigma x + \\mu) \\\\ &= F(\\sigma x + \\mu) \\\\ &= \\int_{-\\infty}^{\\sigma x + \\mu} \\frac{1}{\\sqrt{2 \\pi \\sigma^2}} e^{-\\frac{(t - \\mu)^2}{2\\sigma^2}} dt \\end{align}\\] Now let \\(s = f(t) = \\frac{t - \\mu}{\\sigma}\\), then \\(ds = \\frac{dt}{\\sigma}\\) and \\(f(\\sigma x + \\mu) = x\\), so \\[\\begin{align} P(\\frac{X - \\mu}{\\sigma} < x) &= \\int_{-\\infty}^{x} \\frac{1}{\\sqrt{2 \\pi}} e^{-\\frac{s^2}{2}} ds. \\end{align}\\] There is no need to evaluate this integral, as we recognize it as the CDF of a normal distribution with \\(\\mu = 0\\) and \\(\\sigma^2 = 1\\). 
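As an optional numerical check (ours, not part of the original solution), standardizing draws from an arbitrary normal distribution should produce draws that behave like \\(\\text{N}(0,1)\\); the parameter values below are chosen arbitrarily.
set.seed(2)
mu <- 3
sigma <- 2
z <- (rnorm(100000, mean = mu, sd = sigma) - mu) / sigma
c(mean(z), sd(z))                    # should be close to 0 and 1
c(quantile(z, 0.975), qnorm(0.975))  # empirical vs. theoretical 97.5% quantile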
set.seed(1) # b ggplot(data = data.frame(x = seq(-15, 15, by = 0.01)), aes(x = x)) + stat_function(fun = dnorm, args = list(mean = 0, sd = 1), aes(color = "sd = 1")) + stat_function(fun = dnorm, args = list(mean = 0, sd = 0.4), aes(color = "sd = 0.1")) + stat_function(fun = dnorm, args = list(mean = 0, sd = 2), aes(color = "sd = 2")) + stat_function(fun = dnorm, args = list(mean = 0, sd = 5), aes(color = "sd = 5")) # c mean_par <- 50 nsamps <- 100000 pois_samps <- rpois(nsamps, lambda = mean_par) norm_samps <- rnorm(nsamps, mean = mean_par, sd = sqrt(mean_par)) my_plot <- ggplot() + geom_bar(data = tibble(x = pois_samps), aes(x = x, y = (..count..)/sum(..count..))) + geom_density(data = tibble(x = norm_samps), aes(x = x), color = "red") plot(my_plot) Exercise 4.13 (Logistic random variable) A logistic random variable has CDF \\[\\begin{equation} F(x) = \\frac{1}{1 + e^{-\\frac{x - \\mu}{s}}}, \\end{equation}\\] where \\(\\mu\\) is real and \\(s > 0\\). The support is on the real axis. We denote \\[\\begin{equation} X | \\mu, s \\sim \\text{Logistic}(\\mu, s). \\end{equation}\\] The distribution of the logistic random variable resembles a normal random variable, however it has heavier tails. Find the PDF of a logistic random variable. R: Implement logistic PDF and CDF and visually compare both for \\(X \\sim \\text{N}(0, 1)\\) and \\(Y \\sim \\text{logit}(0, \\sqrt{\\frac{3}{\\pi^2}})\\). These distributions have the same mean and variance. Additionally, plot the same plot on the interval [5,10], to better see the difference in the tails. R: For the distributions in b) find the probability \\(P(|X| > 4)\\) and interpret the result. Solution. \\[\\begin{align} p(x) &= \\frac{d}{dx} \\frac{1}{1 + e^{-\\frac{x - \\mu}{s}}} \\\\ &= \\frac{- \\frac{d}{dx} (1 + e^{-\\frac{x - \\mu}{s}})}{(1 + e{-\\frac{x - \\mu}{s}})^2} \\\\ &= \\frac{e^{-\\frac{x - \\mu}{s}}}{s(1 + e{-\\frac{x - \\mu}{s}})^2}. \\end{align}\\] # b set.seed(1) logit_pdf <- function (x, mu, s) { return ((exp(-(x - mu)/(s))) / (s * (1 + exp(-(x - mu)/(s)))^2)) } nl_plot <- ggplot(data = data.frame(x = seq(-12, 12, by = 0.01)), aes(x = x)) + stat_function(fun = dnorm, args = list(mean = 0, sd = 2), aes(color = "normal")) + stat_function(fun = logit_pdf, args = list(mu = 0, s = sqrt(12/pi^2)), aes(color = "logit")) plot(nl_plot) nl_plot <- ggplot(data = data.frame(x = seq(5, 10, by = 0.01)), aes(x = x)) + stat_function(fun = dnorm, args = list(mean = 0, sd = 2), aes(color = "normal")) + stat_function(fun = logit_pdf, args = list(mu = 0, s = sqrt(12/pi^2)), aes(color = "logit")) plot(nl_plot) # c logit_cdf <- function (x, mu, s) { return (1 / (1 + exp(-(x - mu) / s))) } p_logistic <- 1 - logit_cdf(4, 0, sqrt(12/pi^2)) + logit_cdf(-4, 0, sqrt(12/pi^2)) p_norm <- 1 - pnorm(4, 0, 2) + pnorm(-4, 0, 2) p_logistic ## [1] 0.05178347 p_norm ## [1] 0.04550026 # Logistic distribution has wider tails, therefore the probability of larger # absolute values is higher. 4.4 Singular random variables Exercise 4.14 (Cantor distribution) The Cantor set is a subset of \\([0,1]\\), which we create by iteratively deleting the middle third of the interval. For example, in the first iteration, we get the sets \\([0,\\frac{1}{3}]\\) and \\([\\frac{2}{3},1]\\). In the second iteration, we get \\([0,\\frac{1}{9}]\\), \\([\\frac{2}{9},\\frac{1}{3}]\\), \\([\\frac{2}{3}, \\frac{7}{9}]\\), and \\([\\frac{8}{9}, 1]\\). 
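These sets can also be generated programmatically. Below is a rough sketch (the helper cantor_step is ours, not part of the original exercise) that splits each interval into thirds and keeps the outer two.
cantor_step <- function (intervals) {
  out <- NULL
  for (i in seq_len(nrow(intervals))) {
    a <- intervals[i, 1]
    b <- intervals[i, 2]
    w <- (b - a) / 3
    out <- rbind(out, c(a, a + w), c(b - w, b)) # keep the outer thirds
  }
  return (out)
}
C1 <- cantor_step(matrix(c(0, 1), nrow = 1))
C2 <- cantor_step(C1)
C2 # the four intervals listed above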
On the \\(n\\)-th iteration, we have \\[\\begin{equation} C_n = \\frac{C_{n-1}}{3} \\cup \\bigg(\\frac{2}{3} + \\frac{C_{n-1}}{3} \\bigg), \\end{equation}\\] where \\(C_0 = [0,1]\\). The Cantor set is then defined as the intersection of these sets \\[\\begin{equation} C = \\cap_{n=1}^{\\infty} C_n. \\end{equation}\\] It has the same cardinality as \\([0,1]\\). Another way to define the Cantor set is the set of all numbers on \\([0,1]\\), that do not have a 1 in the ternary representation \\(x = \\sum_{n=1}^\\infty \\frac{x_i}{3^i}, x_i \\in \\{0,1,2\\}\\). A random variable follows the Cantor distribution, if its CDF is the Cantor function (below). You can find the implementations of random number generator, CDF, and quantile functions for the Cantor distributions at https://github.com/Henrygb/CantorDist.R. Show that the Lebesgue measure of the Cantor set is 0. (Jagannathan) Let us look at an infinite sequence of independent fair-coin tosses. If the outcome is heads, let \\(x_i = 2\\) and \\(x_i = 0\\), when tails. Then use these to create \\(x = \\sum_{n=1}^\\infty \\frac{x_i}{3^i}\\). This is a random variable with the Cantor distribution. Show that \\(X\\) has a singular distribution. Solution. \\[\\begin{align} \\lambda(C) &= 1 - \\lambda(C^c) \\\\ &= 1 - \\frac{1}{3}\\sum_{k = 0}^\\infty (\\frac{2}{3})^k \\\\ &= 1 - \\frac{\\frac{1}{3}}{1 - \\frac{2}{3}} \\\\ &= 0. \\end{align}\\] First, for every \\(x\\), the probability of observing it is \\(\\lim_{n \\rightarrow \\infty} \\frac{1}{2^n} = 0\\). Second, the probability that we observe one of all the possible sequences is 1. Therefore \\(P(C) = 1\\). So this is a singular variable. The CDF only increments on the elements of the Cantor set. 4.5 Transformations Exercise 4.15 Let \\(X\\) be a random variable that is uniformly distributed on \\(\\{-2, -1, 0, 1, 2\\}\\). Find the PMF of \\(Y = X^2\\). Solution. \\[\\begin{align} P_Y(y) = \\sum_{x \\in \\sqrt(y)} P_X(x) = \\begin{cases} 0 & y \\notin \\{0,1,4\\} \\\\ \\frac{1}{5} & y = 0 \\\\ \\frac{2}{5} & y \\in \\{1,4\\} \\end{cases} \\end{align}\\] Exercise 4.16 (Lognormal random variable) A lognormal random variable is a variable whose logarithm is normally distributed. In practice, we often encounter skewed data. Usually using a log transformation on such data makes it more symmetric and therefore more suitable for modeling with the normal distribution (more on why we wish to model data with the normal distribution in the following chapters). Let \\(X \\sim \\text{N}(\\mu,\\sigma)\\). Find the PDF of \\(Y: \\log(Y) = X\\). R: Sample from the lognormal distribution with parameters \\(\\mu = 5\\) and \\(\\sigma = 2\\). Plot a histogram of the samples. Then log-transform the samples and plot a histogram along with the theoretical normal PDF. Solution. \\[\\begin{align} p_Y(y) &= p_X(\\log(y)) \\frac{d}{dy} \\log(y) \\\\ &= \\frac{1}{\\sqrt{2 \\pi \\sigma^2}} e^{\\frac{(\\log(y) - \\mu)^2}{2 \\sigma^2}} \\frac{1}{y} \\\\ &= \\frac{1}{y \\sqrt{2 \\pi \\sigma^2}} e^{\\frac{(\\log(y) - \\mu)^2}{2 \\sigma^2}}. 
\\end{align}\\] set.seed(1) nsamps <- 10000 mu <- 0.5 sigma <- 0.4 ln_samps <- rlnorm(nsamps, mu, sigma) ln_plot <- ggplot(data = data.frame(x = ln_samps), aes(x = x)) + geom_histogram(color = "black") plot(ln_plot) norm_samps <- log(ln_samps) n_plot <- ggplot(data = data.frame(x = norm_samps), aes(x = x)) + geom_histogram(aes(y = ..density..), color = "black") + stat_function(fun = dnorm, args = list(mean = mu, sd = sigma), color = "red") plot(n_plot) Exercise 4.17 (Probability integral transform) This exercise is borrowed from Wasserman. Let \\(X\\) have a continuous, strictly increasing CDF \\(F\\). Let \\(Y = F(X)\\). Find the density of \\(Y\\). This is called the probability integral transform. Let \\(U \\sim \\text{Uniform}(0,1)\\) and let \\(X = F^{-1}(U)\\). Show that \\(X \\sim F\\). R: Implement a program that takes Uniform(0,1) random variables and generates random variables from an exponential(\\(\\beta\\)) distribution. Compare your implemented function with function rexp in R. Solution. \\[\\begin{align} F_Y(y) &= P(Y < y) \\\\ &= P(F(X) < y) \\\\ &= P(X < F_X^{-1}(y)) \\\\ &= F_X(F_X^{-1}(y)) \\\\ &= y. \\end{align}\\] From the above it follows that \\(p(y) = 1\\). Note that we need to know the inverse CDF to be able to apply this procedure. \\[\\begin{align} P(X < x) &= P(F^{-1}(U) < x) \\\\ &= P(U < F(x)) \\\\ &= F_U(F(x)) \\\\ &= F(x). \\end{align}\\] set.seed(1) nsamps <- 10000 beta <- 4 generate_exp <- function (n, beta) { tmp <- runif(n) X <- qexp(tmp, beta) return (X) } df <- tibble("R" = rexp(nsamps, beta), "myGenerator" = generate_exp(nsamps, beta)) %>% gather() ggplot(data = df, aes(x = value, fill = key)) + geom_histogram(position = "dodge") "],["mrvs.html", "Chapter 5 Multiple random variables 5.1 General 5.2 Bivariate distribution examples 5.3 Transformations", " Chapter 5 Multiple random variables This chapter deals with multiple random variables and their distributions. The students are expected to acquire the following knowledge: Theoretical Calculation of PDF of transformed multiple random variables. Finding marginal and conditional distributions. R Scatterplots of bivariate random variables. New R functions (for example, expand.grid). .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } 5.1 General Exercise 5.1 Let \\(X \\sim \\text{N}(0,1)\\) and \\(Y \\sim \\text{N}(0,1)\\) be independent random variables. Draw 1000 samples from \\((X,Y)\\) and plot a scatterplot. Now let \\(X \\sim \\text{N}(0,1)\\) and \\(Y | X = x \\sim N(ax, 1)\\). Draw 1000 samples from \\((X,Y)\\) for \\(a = 1\\), \\(a=0\\), and \\(a=-0.5\\). Plot the scatterplots. How would you interpret parameter \\(a\\)? Plot the marginal distribution of \\(Y\\) for cases \\(a=1\\), \\(a=0\\), and \\(a=-0.5\\). Can you guess which distribution it is? set.seed(1) nsamps <- 1000 x <- rnorm(nsamps) y <- rnorm(nsamps) ggplot(data.frame(x, y), aes(x = x, y = y)) + geom_point() y1 <- rnorm(nsamps, mean = 1 * x) y2 <- rnorm(nsamps, mean = 0 * x) y3 <- rnorm(nsamps, mean = -0.5 * x) df <- tibble(x = c(x,x,x), y = c(y1,y2,y3), a = c(rep(1, nsamps), rep(0, nsamps), rep(-0.5, nsamps))) ggplot(df, aes(x = x, y = y)) + geom_point() + facet_wrap(~a) + coord_equal(ratio=1) # Parameter a controls the scale of linear dependency between X and Y. 
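# (Added check, not part of the original chunk.) With X ~ N(0,1) and Y | X = x ~ N(ax, 1),
# the marginal of Y is normal with mean 0 and variance a^2 + 1, so the empirical
# variances below should be close to 2, 1, and 1.25 for a = 1, 0, and -0.5.
c(var(y1), var(y2), var(y3))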
ggplot(df, aes(x = y)) + geom_density() + facet_wrap(~a) 5.2 Bivariate distribution examples Exercise 5.2 (Discrete bivariate random variable) Let \\(X\\) represent the event that a die rolls an even number and let \\(Y\\) represent the event that a die rolls one, two, or a three. Find the marginal distributions of \\(X\\) and \\(Y\\). Find the PMF of \\((X,Y)\\). Find the CDF of \\((X,Y)\\). Find \\(P(X = 1 | Y = 1)\\). Solution. \\[\\begin{align} P(X = 1) = \\frac{1}{2} \\text{ and } P(X = 0) = \\frac{1}{2} \\\\ P(Y = 1) = \\frac{1}{2} \\text{ and } P(Y = 0) = \\frac{1}{2} \\\\ \\end{align}\\] \\[\\begin{align} P(X = 1, Y = 1) = \\frac{1}{6} \\\\ P(X = 1, Y = 0) = \\frac{2}{6} \\\\ P(X = 0, Y = 1) = \\frac{2}{6} \\\\ P(X = 0, Y = 0) = \\frac{1}{6} \\end{align}\\] \\[\\begin{align} P(X \\leq x, Y \\leq y) = \\begin{cases} \\frac{1}{6} & x = 0, y = 0 \\\\ \\frac{3}{6} & x \\neq y \\\\ 1 & x = 1, y = 1 \\end{cases} \\end{align}\\] \\[\\begin{align} P(X = 1 | Y = 1) = \\frac{1}{3} \\end{align}\\] Exercise 5.3 (Continuous bivariate random variable) Let \\(p(x,y) = 6 (x - y)^2\\) be the PDF of a bivariate random variable \\((X,Y)\\), where both variables range from zero to one. Find CDF. Find marginal distributions. Find conditional distributions. R: Plot a grid of points and colour them by value – this can help us visualize the PDF. R: Implement a random number generator, which will generate numbers from \\((X,Y)\\) and visually check the results. R: Plot the marginal distribution of \\(Y\\) and the conditional distributions of \\(X | Y = y\\), where \\(y \\in \\{0, 0.1, 0.5\\}\\). Solution. \\[\\begin{align} F(x,y) &= \\int_0^{x} \\int_0^{y} 6 (t - s)^2 ds dt\\\\ &= 6 \\int_0^{x} \\int_0^{y} t^2 - 2ts + s^2 ds dt\\\\ &= 6 \\int_0^{x} t^2y - ty^2 + \\frac{y^3}{3} dt \\\\ &= 6 (\\frac{x^3 y}{3} - \\frac{x^2y^2}{2} + \\frac{x y^3}{3}) \\\\ &= 2 x^3 y - 3 t^2y^2 + 2 x y^3 \\end{align}\\] \\[\\begin{align} p(x) &= \\int_0^{1} 6 (x - y)^2 dy\\\\ &= 6 (x^2 - x + \\frac{1}{3}) \\\\ &= 6x^2 - 6x + 2 \\end{align}\\] \\[\\begin{align} p(y) &= \\int_0^{1} 6 (x - y)^2 dx\\\\ &= 6 (y^2 - y + \\frac{1}{3}) \\\\ &= 6y^2 - 6y + 2 \\end{align}\\] \\[\\begin{align} p(x|y) &= \\frac{p(xy)}{p(y)} \\\\ &= \\frac{6 (x - y)^2}{6 (y^2 - y + \\frac{1}{3})} \\\\ &= \\frac{(x - y)^2}{y^2 - y + \\frac{1}{3}} \\end{align}\\] \\[\\begin{align} p(y|x) &= \\frac{p(xy)}{p(x)} \\\\ &= \\frac{6 (x - y)^2}{6 (x^2 - x + \\frac{1}{3})} \\\\ &= \\frac{(x - y)^2}{x^2 - x + \\frac{1}{3}} \\end{align}\\] set.seed(1) # d pxy <- function (x, y) { return ((x - y)^2) } x_axis <- seq(0, 1, length.out = 100) y_axis <- seq(0, 1, length.out = 100) df <- expand.grid(x_axis, y_axis) colnames(df) <- c("x", "y") df <- cbind(df, pdf = pxy(df$x, df$y)) ggplot(data = df, aes(x = x, y = y, color = pdf)) + geom_point() # e samps <- NULL for (i in 1:10000) { xt <- runif(1, 0, 1) yt <- runif(1, 0, 1) pdft <- pxy(xt, yt) acc <- runif(1, 0, 6) if (acc <= pdft) { samps <- rbind(samps, c(xt, yt)) } } colnames(samps) <- c("x", "y") ggplot(data = as.data.frame(samps), aes(x = x, y = y)) + geom_point() # f mar_pdf <- function (x) { return (6 * x^2 - 6 * x + 2) } cond_pdf <- function (x, y) { return (((x - y)^2) / (y^2 - y + 1/3)) } df <- tibble(x = x_axis, mar = mar_pdf(x), y0 = cond_pdf(x, 0), y0.1 = cond_pdf(x, 0.1), y0.5 = cond_pdf(x, 0.5)) %>% gather(dist, value, -x) ggplot(df, aes(x = x, y = value, color = dist)) + geom_line() Exercise 5.4 (Mixed bivariate random variable) Let \\(f(x,y) = \\frac{\\beta^{\\alpha}}{\\Gamma(\\alpha)y!} x^{y+ \\alpha -1} 
e^{-x(1 + \\beta)}\\) be the PDF of a bivariate random variable, where \\(x \\in (0, \\infty)\\) and \\(y \\in \\mathbb{N}_0\\). Find the marginal distribution of \\(X\\). Do you recognize this distribution? Find the conditional distribution of \\(Y | X\\). Do you recognize this distribution? Calculate the probability \\(P(Y = 2 | X = 2.5)\\) for \\((X,Y)\\). Find the marginal distribution of \\(Y\\). Do you recognize this distribution? R: Take 1000 random samples from \\((X,Y)\\) with parameters \\(\\beta = 1\\) and \\(\\alpha = 1\\). Plot a scatterplot. Plot a bar plot of the marginal distribution of \\(Y\\), and the theoretical PMF calculated from d) on the range from 0 to 10. Hint: Use the gamma function in R.? Solution. \\[\\begin{align} p(x) &= \\sum_{k = 0}^{\\infty} \\frac{\\beta^{\\alpha}}{\\Gamma(\\alpha)k!} x^{k + \\alpha -1} e^{-x(1 + \\beta)} & \\\\ &= \\sum_{k = 0}^{\\infty} \\frac{\\beta^{\\alpha}}{\\Gamma(\\alpha)k!} x^{k} x^{\\alpha -1} e^{-x} e^{-\\beta x} & \\\\ &= \\frac{\\beta^{\\alpha}}{\\Gamma(\\alpha)} x^{\\alpha -1} e^{-\\beta x} \\sum_{k = 0}^{\\infty} \\frac{1}{k!} x^{k} e^{-x} & \\\\ &= \\frac{\\beta^{\\alpha}}{\\Gamma(\\alpha)} x^{\\alpha -1} e^{-\\beta x} & \\text{the last term above sums to one} \\end{align}\\] This is the Gamma PDF. \\[\\begin{align} p(y|x) &= \\frac{p(x,y)}{p(x)} \\\\ &= \\frac{\\frac{\\beta^{\\alpha}}{\\Gamma(\\alpha)y!} x^{y+ \\alpha -1} e^{-x(1 + \\beta)}}{\\frac{\\beta^{\\alpha}}{\\Gamma(\\alpha)} x^{\\alpha -1} e^{-\\beta x}} \\\\ &= \\frac{x^y e^{-x}}{y!}. \\end{align}\\] This is the Poisson PMF. \\[\\begin{align} P(Y = 2 | X = 2.5) = \\frac{2.5^2 e^{-2.5}}{2!} \\approx 0.26. \\end{align}\\] \\[\\begin{align} p(y) &= \\int_{0}^{\\infty} \\frac{\\beta^{\\alpha}}{\\Gamma(\\alpha)y!} x^{y + \\alpha -1} e^{-x(1 + \\beta)} dx & \\\\ &= \\frac{1}{y!} \\int_{0}^{\\infty} \\frac{\\beta^{\\alpha}}{\\Gamma(\\alpha)} x^{(y + \\alpha) -1} e^{-(1 + \\beta)x} dx & \\\\ &= \\frac{1}{y!} \\frac{\\beta^{\\alpha}}{\\Gamma(\\alpha)} \\int_{0}^{\\infty} \\frac{\\Gamma(y + \\alpha)}{(1 + \\beta)^{y + \\alpha}} \\frac{(1 + \\beta)^{y + \\alpha}}{\\Gamma(y + \\alpha)} x^{(y + \\alpha) -1} e^{-(1 + \\beta)x} dx & \\text{complete to Gamma PDF} \\\\ &= \\frac{1}{y!} \\frac{\\beta^{\\alpha}}{\\Gamma(\\alpha)} \\frac{\\Gamma(y + \\alpha)}{(1 + \\beta)^{y + \\alpha}}. \\end{align}\\] We add the terms in the third equality to get a Gamma PDF inside the integral, which then integrates to one. We do not recognize this distribution. set.seed(1) px <- function (x, alpha, beta) { return((1 / factorial(x)) * (beta^alpha / gamma(alpha)) * (gamma(x + alpha) / (1 + beta)^(x + alpha))) } nsamps <- 1000 rx <- rgamma(nsamps, 1, 1) ryx <- rpois(nsamps, rx) ggplot(data = data.frame(x = rx, y = ryx), aes(x = x, y = y)) + geom_point() ggplot(data = data.frame(x = rx, y = ryx), aes(x = y)) + geom_bar(aes(y = (..count..)/sum(..count..))) + stat_function(fun = px, args = list(alpha = 1, beta = 1), color = "red") Exercise 5.5 Let \\(f(x,y) = cx^2y\\) for \\(x^2 \\leq y \\leq 1\\) and zero otherwise. Find such \\(c\\) that \\(f\\) is a PDF of a bivariate random variable. This exercise is borrowed from Wasserman. Solution. \\[\\begin{align} 1 &= \\int_{-1}^{1} \\int_{x^2}^1 cx^2y dy dx \\\\ &= \\int_{-1}^{1} cx^2 (\\frac{1}{2} - \\frac{x^4}{2}) dx \\\\ &= \\frac{c}{2} \\int_{-1}^{1} x^2 - x^6 dx \\\\ &= \\frac{c}{2} (\\frac{1}{3} + \\frac{1}{3} - \\frac{1}{7} - \\frac{1}{7}) \\\\ &= \\frac{c}{2} \\frac{8}{21} \\\\ &= \\frac{4c}{21} \\end{align}\\] It follows \\(c = \\frac{21}{4}\\). 
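A quick Monte Carlo check of this constant (an added sketch, not part of the original solution): with \\(c = \\frac{21}{4}\\), the integral of \\(f\\) over its support should be close to one.
set.seed(1)
nsamps <- 100000
x <- runif(nsamps, -1, 1)
y <- runif(nsamps, 0, 1)
f <- function (x, y) {
  return (ifelse(x^2 <= y, (21 / 4) * x^2 * y, 0))
}
# The sampling rectangle [-1,1] x [0,1] has area 2, so the integral of f is
# approximately 2 times the average of f over uniform draws from the rectangle.
2 * mean(f(x, y))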
5.3 Transformations Exercise 5.6 Let \\((X,Y)\\) be uniformly distributed on the unit ball \\(\\{(x,y,z) : x^2 + y^2 + z^2 \\leq 1\\}\\). Let \\(R = \\sqrt{X^2 + Y^2 + Z^2}\\). Find the CDF and PDF of \\(R\\). Solution. \\[\\begin{align} P(R < r) &= P(\\sqrt{X^2 + Y^2 + Z^2} < r) \\\\ &= P(X^2 + Y^2 + Z^2 < r^2) \\\\ &= \\frac{\\frac{4}{3} \\pi r^3}{\\frac{4}{3}\\pi} \\\\ &= r^3. \\end{align}\\] The second line shows us that we are looking at the probability which is represented by a smaller ball with radius \\(r\\). To get the probability, we divide it by the radius of the whole ball. We get the PDF by differentiating the CDF, so \\(p(r) = 3r^2\\). "],["integ.html", "Chapter 6 Integration 6.1 Monte Carlo integration 6.2 Lebesgue integrals", " Chapter 6 Integration This chapter deals with abstract and Monte Carlo integration. The students are expected to acquire the following knowledge: Theoretical How to calculate Lebesgue integrals for non-simple functions. R Monte Carlo integration. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } 6.1 Monte Carlo integration Exercise 6.1 Let \\(X\\) and \\(Y\\) be continuous random variables on the unit interval and \\(p(x,y) = 6(x - y)^2\\). Use Monte Carlo integration to estimate the probability \\(P(0.2 \\leq X \\leq 0.5, \\: 0.1 \\leq Y \\leq 0.2)\\). Can you find the exact value? set.seed(1) nsamps <- 1000 V <- (0.5 - 0.2) * (0.2 - 0.1) x1 <- runif(nsamps, 0.2, 0.5) x2 <- runif(nsamps, 0.1, 0.2) f_1 <- function (x, y) { return (6 * (x - y)^2) } mcint <- V * (1 / nsamps) * sum(f_1(x1, x2)) sdm <- sqrt((V^2 / nsamps) * var(f_1(x1, x2))) mcint ## [1] 0.008793445 sdm ## [1] 0.0002197686 F_1 <- function (x, y) { return (2 * x^3 * y - 3 * x^2 * y^2 + 2 * x * y^3) } F_1(0.5, 0.2) - F_1(0.2, 0.2) - F_1(0.5, 0.1) + F_1(0.2, 0.1) ## [1] 0.0087 6.2 Lebesgue integrals Exercise 6.2 (borrowed from Jagannathan) Find the Lebesgue integral of the following functions on (\\(\\mathbb{R}\\), \\(\\mathcal{B}(\\mathbb{R})\\), \\(\\lambda\\)). \\[\\begin{align} f(\\omega) = \\begin{cases} \\omega, & \\text{for } \\omega = 0,1,...,n \\\\ 0, & \\text{elsewhere} \\end{cases} \\end{align}\\] \\[\\begin{align} f(\\omega) = \\begin{cases} 1, & \\text{for } \\omega = \\mathbb{Q}^c \\cap [0,1] \\\\ 0, & \\text{elsewhere} \\end{cases} \\end{align}\\] \\[\\begin{align} f(\\omega) = \\begin{cases} n, & \\text{for } \\omega = \\mathbb{Q}^c \\cap [0,n] \\\\ 0, & \\text{elsewhere} \\end{cases} \\end{align}\\] Solution. \\[\\begin{align} \\int f(\\omega) d\\lambda = \\sum_{\\omega = 0}^n \\omega \\lambda(\\omega) = 0. \\end{align}\\] \\[\\begin{align} \\int f(\\omega) d\\lambda = 1 \\times \\lambda(\\mathbb{Q}^c \\cap [0,1]) = 1. \\end{align}\\] \\[\\begin{align} \\int f(\\omega) d\\lambda = n \\times \\lambda(\\mathbb{Q}^c \\cap [0,n]) = n^2. \\end{align}\\] Exercise 6.3 (borrowed from Jagannathan) Let \\(c \\in \\mathbb{R}\\) be fixed and (\\(\\mathbb{R}\\), \\(\\mathcal{B}(\\mathbb{R})\\)) a measurable space. If for any Borel set \\(A\\), \\(\\delta_c (A) = 1\\) if \\(c \\in A\\), and \\(\\delta_c (A) = 0\\) otherwise, then \\(\\delta_c\\) is called a Dirac measure. Let \\(g\\) be a non-negative, measurable function. Show that \\(\\int g d \\delta_c = g(c)\\). Solution. 
\\[\\begin{align} \\int g d \\delta_c &= \\sup_{q \\in S(g)} \\int q d \\delta_c \\\\ &= \\sup_{q \\in S(g)} \\sum_{i = 1}^n a_i \\delta_c(A_i) \\\\ &= \\sup_{q \\in S(g)} \\sum_{i = 1}^n a_i \\text{I}_{A_i}(c) \\\\ &= \\sup_{q \\in S(g)} q(c) \\\\ &= g(c) \\end{align}\\] "],["ev.html", "Chapter 7 Expected value 7.1 Discrete random variables 7.2 Continuous random variables 7.3 Sums, functions, conditional expectations 7.4 Covariance", " Chapter 7 Expected value This chapter deals with expected values of random variables. The students are expected to acquire the following knowledge: Theoretical Calculation of the expected value. Calculation of variance and covariance. Cauchy distribution. R Estimation of expected value. Estimation of variance and covariance. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } 7.1 Discrete random variables Exercise 7.1 (Bernoulli) Let \\(X \\sim \\text{Bernoulli}(p)\\). Find \\(E[X]\\). Find \\(Var[X]\\). R: Let \\(p = 0.4\\). Check your answers to a) and b) with a simulation. Solution. \\[\\begin{align*} E[X] = \\sum_{k=0}^1 p^k (1-p)^{1-k} k = p. \\end{align*}\\] \\[\\begin{align*} Var[X] = E[X^2] - E[X]^2 = \\sum_{k=0}^1 (p^k (1-p)^{1-k} k^2) - p^2 = p(1-p). \\end{align*}\\] set.seed(1) nsamps <- 1000 x <- rbinom(nsamps, 1, 0.4) mean(x) ## [1] 0.394 var(x) ## [1] 0.239003 0.4 * (1 - 0.4) ## [1] 0.24 Exercise 7.2 (Binomial) Let \\(X \\sim \\text{Binomial}(n,p)\\). Find \\(E[X]\\). Find \\(Var[X]\\). Solution. Let \\(X = \\sum_{i=0}^n X_i\\), where \\(X_i \\sim \\text{Bernoulli}(p)\\). Then, due to linearity of expectation \\[\\begin{align*} E[X] = E[\\sum_{i=0}^n X_i] = \\sum_{i=0}^n E[X_i] = np. \\end{align*}\\] Again let \\(X = \\sum_{i=0}^n X_i\\), where \\(X_i \\sim \\text{Bernoulli}(p)\\). Since the Bernoulli variables \\(X_i\\) are independent we have \\[\\begin{align*} Var[X] = Var[\\sum_{i=0}^n X_i] = \\sum_{i=0}^n Var[X_i] = np(1-p). \\end{align*}\\] Exercise 7.3 (Poisson) Let \\(X \\sim \\text{Poisson}(\\lambda)\\). Find \\(E[X]\\). Find \\(Var[X]\\). Solution. \\[\\begin{align*} E[X] &= \\sum_{k=0}^\\infty \\frac{\\lambda^k e^{-\\lambda}}{k!} k & \\\\ &= \\sum_{k=1}^\\infty \\frac{\\lambda^k e^{-\\lambda}}{k!} k & \\text{term at $k=0$ is 0} \\\\ &= e^{-\\lambda} \\lambda \\sum_{k=1}^\\infty \\frac{\\lambda^{k-1}}{(k - 1)!} & \\\\ &= e^{-\\lambda} \\lambda \\sum_{k=0}^\\infty \\frac{\\lambda^{k}}{k!} & \\\\ &= e^{-\\lambda} \\lambda e^\\lambda & \\\\ &= \\lambda. \\end{align*}\\] \\[\\begin{align*} Var[X] &= E[X^2] - E[X]^2 & \\\\ &= e^{-\\lambda} \\lambda \\sum_{k=1}^\\infty k \\frac{\\lambda^{k-1}}{(k - 1)!} - \\lambda^2 & \\\\ &= e^{-\\lambda} \\lambda \\sum_{k=1}^\\infty (k - 1) + 1) \\frac{\\lambda^{k-1}}{(k - 1)!} - \\lambda^2 & \\\\ &= e^{-\\lambda} \\lambda \\big(\\sum_{k=1}^\\infty (k - 1) \\frac{\\lambda^{k-1}}{(k - 1)!} + \\sum_{k=1}^\\infty \\frac{\\lambda^{k-1}}{(k - 1)!}\\Big) - \\lambda^2 & \\\\ &= e^{-\\lambda} \\lambda \\big(\\lambda\\sum_{k=2}^\\infty \\frac{\\lambda^{k-2}}{(k - 2)!} + e^\\lambda\\Big) - \\lambda^2 & \\\\ &= e^{-\\lambda} \\lambda \\big(\\lambda e^\\lambda + e^\\lambda\\Big) - \\lambda^2 & \\\\ &= \\lambda^2 + \\lambda - \\lambda^2 & \\\\ &= \\lambda. \\end{align*}\\] Exercise 7.4 (Geometric) Let \\(X \\sim \\text{Geometric}(p)\\). Find \\(E[X]\\). Hint: \\(\\frac{d}{dx} x^k = k x^{(k - 1)}\\). Solution. 
\\[\\begin{align*} E[X] &= \\sum_{k=0}^\\infty (1 - p)^k p k & \\\\ &= p (1 - p) \\sum_{k=0}^\\infty (1 - p)^{k-1} k & \\\\ &= p (1 - p) \\sum_{k=0}^\\infty -\\frac{d}{dp}(1 - p)^k & \\\\ &= p (1 - p) \\Big(-\\frac{d}{dp}\\Big) \\sum_{k=0}^\\infty (1 - p)^k & \\\\ &= p (1 - p) \\Big(-\\frac{d}{dp}\\Big) \\frac{1}{1 - (1 - p)} & \\text{geometric series} \\\\ &= \\frac{1 - p}{p} \\end{align*}\\] 7.2 Continuous random variables Exercise 7.5 (Gamma) Let \\(X \\sim \\text{Gamma}(\\alpha, \\beta)\\). Hint: \\(\\Gamma(z) = \\int_0^\\infty t^{z-1}e^{-t} dt\\) and \\(\\Gamma(z + 1) = z \\Gamma(z)\\). Find \\(E[X]\\). Find \\(Var[X]\\). R: Let \\(\\alpha = 10\\) and \\(\\beta = 2\\). Plot the density of \\(X\\). Add a horizontal line at the expected value that touches the density curve (geom_segment). Shade the area within a standard deviation of the expected value. Solution. \\[\\begin{align*} E[X] &= \\int_0^\\infty \\frac{\\beta^\\alpha}{\\Gamma(\\alpha)}x^\\alpha e^{-\\beta x} dx & \\\\ &= \\frac{\\beta^\\alpha}{\\Gamma(\\alpha)} \\int_0^\\infty x^\\alpha e^{-\\beta x} dx & \\text{ (let $t = \\beta x$)} \\\\ &= \\frac{\\beta^\\alpha}{\\Gamma(\\alpha) }\\int_0^\\infty \\frac{t^\\alpha}{\\beta^\\alpha} e^{-t} \\frac{dt}{\\beta} & \\\\ &= \\frac{1}{\\beta \\Gamma(\\alpha) }\\int_0^\\infty t^\\alpha e^{-t} dt & \\\\ &= \\frac{\\Gamma(\\alpha + 1)}{\\beta \\Gamma(\\alpha)} & \\\\ &= \\frac{\\alpha \\Gamma(\\alpha)}{\\beta \\Gamma(\\alpha)} & \\\\ &= \\frac{\\alpha}{\\beta}. & \\end{align*}\\] \\[\\begin{align*} Var[X] &= E[X^2] - E[X]^2 \\\\ &= \\int_0^\\infty \\frac{\\beta^\\alpha}{\\Gamma(\\alpha)}x^{\\alpha+1} e^{-\\beta x} dx - \\frac{\\alpha^2}{\\beta^2} \\\\ &= \\frac{\\Gamma(\\alpha + 2)}{\\beta^2 \\Gamma(\\alpha)} - \\frac{\\alpha^2}{\\beta^2} \\\\ &= \\frac{(\\alpha + 1)\\alpha\\Gamma(\\alpha)}{\\beta^2 \\Gamma(\\alpha)} - \\frac{\\alpha^2}{\\beta^2} \\\\ &= \\frac{\\alpha^2 + \\alpha}{\\beta^2} - \\frac{\\alpha^2}{\\beta^2} \\\\ &= \\frac{\\alpha}{\\beta^2}. \\end{align*}\\] set.seed(1) x <- seq(0, 25, by = 0.01) y <- dgamma(x, shape = 10, rate = 2) df <- data.frame(x = x, y = y) ggplot(df, aes(x = x, y = y)) + geom_line() + geom_segment(aes(x = 5, y = 0, xend = 5, yend = dgamma(5, shape = 10, rate = 2)), color = "red") + stat_function(fun = dgamma, args = list(shape = 10, rate = 2), xlim = c(5 - sqrt(10/4), 5 + sqrt(10/4)), geom = "area", fill = "gray", alpha = 0.4) Exercise 7.6 (Beta) Let \\(X \\sim \\text{Beta}(\\alpha, \\beta)\\). Find \\(E[X]\\). Hint 1: \\(\\text{B}(x,y) = \\int_0^1 t^{x-1} (1 - t)^{y-1} dt\\). Hint 2: \\(\\text{B}(x + 1, y) = \\text{B}(x,y)\\frac{x}{x + y}\\). Find \\(Var[X]\\). Solution. \\[\\begin{align*} E[X] &= \\int_0^1 \\frac{x^{\\alpha - 1} (1 - x)^{\\beta - 1}}{\\text{B}(\\alpha, \\beta)} x dx \\\\ &= \\frac{1}{\\text{B}(\\alpha, \\beta)}\\int_0^1 x^{\\alpha} (1 - x)^{\\beta - 1} dx \\\\ &= \\frac{1}{\\text{B}(\\alpha, \\beta)} \\text{B}(\\alpha + 1, \\beta) \\\\ &= \\frac{1}{\\text{B}(\\alpha, \\beta)} \\text{B}(\\alpha, \\beta) \\frac{\\alpha}{\\alpha + \\beta} \\\\ &= \\frac{\\alpha}{\\alpha + \\beta}. 
\\\\ \\end{align*}\\] \\[\\begin{align*} Var[X] &= E[X^2] - E[X]^2 \\\\ &= \\int_0^1 \\frac{x^{\\alpha - 1} (1 - x)^{\\beta - 1}}{\\text{B}(\\alpha, \\beta)} x^2 dx - \\frac{\\alpha^2}{(\\alpha + \\beta)^2} \\\\ &= \\frac{1}{\\text{B}(\\alpha, \\beta)}\\int_0^1 x^{\\alpha + 1} (1 - x)^{\\beta - 1} dx - \\frac{\\alpha^2}{(\\alpha + \\beta)^2} \\\\ &= \\frac{1}{\\text{B}(\\alpha, \\beta)} \\text{B}(\\alpha + 2, \\beta) - \\frac{\\alpha^2}{(\\alpha + \\beta)^2} \\\\ &= \\frac{1}{\\text{B}(\\alpha, \\beta)} \\text{B}(\\alpha + 1, \\beta) \\frac{\\alpha + 1}{\\alpha + \\beta + 1} - \\frac{\\alpha^2}{(\\alpha + \\beta)^2} \\\\ &= \\frac{\\alpha + 1}{\\alpha + \\beta + 1} \\frac{\\alpha}{\\alpha + \\beta} - \\frac{\\alpha^2}{(\\alpha + \\beta)^2}\\\\ &= \\frac{\\alpha \\beta}{(\\alpha + \\beta)^2(\\alpha + \\beta + 1)}. \\end{align*}\\] Exercise 7.7 (Exponential) Let \\(X \\sim \\text{Exp}(\\lambda)\\). Find \\(E[X]\\). Hint: \\(\\Gamma(z + 1) = z\\Gamma(z)\\) and \\(\\Gamma(1) = 1\\). Find \\(Var[X]\\). Solution. \\[\\begin{align*} E[X] &= \\int_0^\\infty \\lambda e^{-\\lambda x} x dx & \\\\ &= \\lambda \\int_0^\\infty x e^{-\\lambda x} dx & \\\\ &= \\lambda \\int_0^\\infty \\frac{t}{\\lambda} e^{-t} \\frac{dt}{\\lambda} & \\text{$t = \\lambda x$}\\\\ &= \\lambda \\lambda^{-2} \\Gamma(2) & \\text{definition of gamma function} \\\\ &= \\lambda^{-1}. \\end{align*}\\] \\[\\begin{align*} Var[X] &= E[X^2] - E[X]^2 & \\\\ &= \\int_0^\\infty \\lambda e^{-\\lambda x} x^2 dx - \\lambda^{-2} & \\\\ &= \\lambda \\int_0^\\infty \\frac{t^2}{\\lambda^2} e^{-t} \\frac{dt}{\\lambda} - \\lambda^{-2} & \\text{$t = \\lambda x$} \\\\ &= \\lambda \\lambda^{-3} \\Gamma(3) - \\lambda^{-2} & \\text{definition of gamma function} & \\\\ &= \\lambda^{-2} 2 \\Gamma(2) - \\lambda^{-2} & \\\\ &= 2 \\lambda^{-2} - \\lambda^{-2} & \\\\ &= \\lambda^{-2}. & \\\\ \\end{align*}\\] Exercise 7.8 (Normal) Let \\(X \\sim \\text{N}(\\mu, \\sigma)\\). Show that \\(E[X] = \\mu\\). Hint: Use the error function \\(\\text{erf}(x) = \\frac{1}{\\sqrt(\\pi)} \\int_{-x}^x e^{-t^2} dt\\). The statistical interpretation of this function is that if \\(Y \\sim \\text{N}(0, 0.5)\\), then the error function describes the probability of \\(Y\\) falling between \\(-x\\) and \\(x\\). Also, \\(\\text{erf}(\\infty) = 1\\). Show that \\(Var[X] = \\sigma^2\\). Hint: Start with the definition of variance. Solution. \\[\\begin{align*} E[X] &= \\int_{-\\infty}^\\infty \\frac{1}{\\sqrt{2\\pi \\sigma^2}} e^{-\\frac{(x - \\mu)^2}{2\\sigma^2}} x dx & \\\\ &= \\frac{1}{\\sqrt{2\\pi \\sigma^2}} \\int_{-\\infty}^\\infty x e^{-\\frac{(x - \\mu)^2}{2\\sigma^2}} dx & \\\\ &= \\frac{1}{\\sqrt{2\\pi \\sigma^2}} \\int_{-\\infty}^\\infty \\Big(t \\sqrt{2\\sigma^2} + \\mu\\Big)e^{-t^2} \\sqrt{2 \\sigma^2} dt & t = \\frac{x - \\mu}{\\sqrt{2}\\sigma} \\\\ &= \\frac{\\sqrt{2\\sigma^2}}{\\sqrt{\\pi}} \\int_{-\\infty}^\\infty t e^{-t^2} dt + \\frac{1}{\\sqrt{\\pi}} \\int_{-\\infty}^\\infty \\mu e^{-t^2} dt & \\\\ \\end{align*}\\] Let us calculate these integrals separately. \\[\\begin{align*} \\int t e^{-t^2} dt &= -\\frac{1}{2}\\int e^{s} ds & s = -t^2 \\\\ &= -\\frac{e^s}{2} + C \\\\ &= -\\frac{e^{-t^2}}{2} + C & \\text{undoing substitution}. \\end{align*}\\] Inserting the integration limits we get \\[\\begin{align*} \\int_{-\\infty}^\\infty t e^{-t^2} dt &= 0, \\end{align*}\\] due to the integrated function being symmetric. 
Reordering the second integral we get \\[\\begin{align*} \\mu \\frac{1}{\\sqrt{\\pi}} \\int_{-\\infty}^\\infty e^{-t^2} dt &= \\mu \\text{erf}(\\infty) & \\text{definition of error function} \\\\ &= \\mu & \\text{probability of $Y$ falling between $-\\infty$ and $\\infty$}. \\end{align*}\\] Combining all of the above we get \\[\\begin{align*} E[X] &= \\frac{\\sqrt{2\\sigma^2}}{\\sqrt{\\pi}} \\times 0 + \\mu &= \\mu.\\\\ \\end{align*}\\] \\[\\begin{align*} Var[X] &= E[(X - E[X])^2] \\\\ &= E[(X - \\mu)^2] \\\\ &= \\frac{1}{\\sqrt{2\\pi \\sigma^2}} \\int_{-\\infty}^\\infty (x - \\mu)^2 e^{-\\frac{(x - \\mu)^2}{2\\sigma^2}} dx \\\\ &= \\frac{\\sigma^2}{\\sqrt{2\\pi}} \\int_{-\\infty}^\\infty t^2 e^{-\\frac{t^2}{2}} dt \\\\ &= \\frac{\\sigma^2}{\\sqrt{2\\pi}} \\bigg(\\Big(- t e^{-\\frac{t^2}{2}} |_{-\\infty}^\\infty \\Big) + \\int_{-\\infty}^\\infty e^{-\\frac{t^2}{2}} \\bigg) dt & \\text{integration by parts} \\\\ &= \\frac{\\sigma^2}{\\sqrt{2\\pi}} \\sqrt{2 \\pi} \\int_{-\\infty}^\\infty \\frac{1}{\\sqrt(\\pi)}e^{-s^2} \\bigg) & s = \\frac{t}{\\sqrt{2}} \\text{ and evaluating the left expression at the bounds} \\\\ &= \\frac{\\sigma^2}{\\sqrt{2\\pi}} \\sqrt{2 \\pi} \\Big(\\text{erf}(\\infty) & \\text{definition of error function} \\\\ &= \\sigma^2. \\end{align*}\\] 7.3 Sums, functions, conditional expectations Exercise 7.9 (Expectation of transformations) Let \\(X\\) follow a normal distribution with mean \\(\\mu\\) and variance \\(\\sigma^2\\). Find \\(E[2X + 4]\\). Find \\(E[X^2]\\). Find \\(E[\\exp(X)]\\). Hint: Use the error function \\(\\text{erf}(x) = \\frac{1}{\\sqrt(\\pi)} \\int_{-x}^x e^{-t^2} dt\\). Also, \\(\\text{erf}(\\infty) = 1\\). R: Check your results numerically for \\(\\mu = 0.4\\) and \\(\\sigma^2 = 0.25\\) and plot the densities of all four distributions. Solution. \\[\\begin{align} E[2X + 4] &= 2E[X] + 4 & \\text{linearity of expectation} \\\\ &= 2\\mu + 4. \\\\ \\end{align}\\] \\[\\begin{align} E[X^2] &= E[X]^2 + Var[X] & \\text{definition of variance} \\\\ &= \\mu^2 + \\sigma^2. \\end{align}\\] \\[\\begin{align} E[\\exp(X)] &= \\int_{-\\infty}^\\infty \\frac{1}{\\sqrt{2\\pi \\sigma^2}} e^{-\\frac{(x - \\mu)^2}{2\\sigma^2}} e^x dx & \\\\ &= \\frac{1}{\\sqrt{2\\pi \\sigma^2}} \\int_{-\\infty}^\\infty e^{\\frac{2 \\sigma^2 x}{2\\sigma^2} -\\frac{(x - \\mu)^2}{2\\sigma^2}} dx & \\\\ &= \\frac{1}{\\sqrt{2\\pi \\sigma^2}} \\int_{-\\infty}^\\infty e^{-\\frac{x^2 - 2x(\\mu + \\sigma^2) + \\mu^2}{2\\sigma^2}} dx & \\\\ &= \\frac{1}{\\sqrt{2\\pi \\sigma^2}} \\int_{-\\infty}^\\infty e^{-\\frac{(x - (\\mu + \\sigma^2))^2 + \\mu^2 - (\\mu + \\sigma^2)^2}{2\\sigma^2}} dx & \\text{complete the square} \\\\ &= \\frac{1}{\\sqrt{2\\pi \\sigma^2}} e^{\\frac{- \\mu^2 + (\\mu + \\sigma^2)^2}{2\\sigma^2}} \\int_{-\\infty}^\\infty e^{-\\frac{(x - (\\mu + \\sigma^2))^2}{2\\sigma^2}} dx & \\\\ &= \\frac{1}{\\sqrt{2\\pi \\sigma^2}} e^{\\frac{- \\mu^2 + (\\mu + \\sigma^2)^2}{2\\sigma^2}} \\sigma \\sqrt{2 \\pi} \\text{erf}(\\infty) & \\\\ &= e^{\\frac{2\\mu + \\sigma^2}{2}}. \\end{align}\\] set.seed(1) mu <- 0.4 sigma <- 0.5 x <- rnorm(100000, mean = mu, sd = sigma) mean(2*x + 4) ## [1] 4.797756 2 * mu + 4 ## [1] 4.8 mean(x^2) ## [1] 0.4108658 mu^2 + sigma^2 ## [1] 0.41 mean(exp(x)) ## [1] 1.689794 exp((2 * mu + sigma^2) / 2) ## [1] 1.690459 Exercise 7.10 (Sum of independent random variables) Borrowed from Wasserman. Let \\(X_1, X_2,...,X_n\\) be IID random variables with expected value \\(E[X_i] = \\mu\\) and variance \\(Var[X_i] = \\sigma^2\\). 
Find the expected value and variance of \\(\\bar{X} = \\frac{1}{n} \\sum_{i=1}^n X_i\\). \\(\\bar{X}\\) is called a statistic (a function of the values in a sample). It is itself a random variable and its distribution is called a sampling distribution. R: Take \\(n = 5, 10, 100, 1000\\) samples from the N(\\(2\\), \\(6\\)) distribution 10000 times. Plot the theoretical density and the densities of \\(\\bar{X}\\) statistic for each \\(n\\). Intuitively, are the results in correspondence with your calculations? Check them numerically. Solution. Let us start with the expectation of \\(\\bar{X}\\). \\[\\begin{align} E[\\bar{X}] &= E[\\frac{1}{n} \\sum_{i=1}^n X_i] & \\\\ &= \\frac{1}{n} E[\\sum_{i=1}^n X_i] & \\text{ (multiplication with a scalar)} \\\\ &= \\frac{1}{n} \\sum_{i=1}^n E[X_i] & \\text{ (linearity)} \\\\ &= \\frac{1}{n} n \\mu & \\\\ &= \\mu. \\end{align}\\] Now the variance \\[\\begin{align} Var[\\bar{X}] &= Var[\\frac{1}{n} \\sum_{i=1}^n X_i] & \\\\ &= \\frac{1}{n^2} Var[\\sum_{i=1}^n X_i] & \\text{ (multiplication with a scalar)} \\\\ &= \\frac{1}{n^2} \\sum_{i=1}^n Var[X_i] & \\text{ (independence of samples)} \\\\ &= \\frac{1}{n^2} n \\sigma^2 & \\\\ &= \\frac{1}{n} \\sigma^2. \\end{align}\\] set.seed(1) nsamps <- 10000 mu <- 2 sigma <- sqrt(6) N <- c(5, 10, 100, 500) X <- matrix(data = NA, nrow = nsamps, ncol = length(N)) ind <- 1 for (n in N) { for (i in 1:nsamps) { X[i,ind] <- mean(rnorm(n, mu, sigma)) } ind <- ind + 1 } colnames(X) <- N X <- melt(as.data.frame(X)) ggplot(data = X, aes(x = value, colour = variable)) + geom_density() + stat_function(data = data.frame(x = seq(-2, 6, by = 0.01)), aes(x = x), fun = dnorm, args = list(mean = mu, sd = sigma), color = "black") Exercise 7.11 (Conditional expectation) Let \\(X \\in \\mathbb{R}_0^+\\) and \\(Y \\in \\mathbb{N}_0\\) be random variables with joint distribution \\(p_{XY}(X,Y) = \\frac{1}{y + 1} e^{-\\frac{x}{y + 1}} 0.5^{y + 1}\\). Find \\(E[X | Y = y]\\) by first finding \\(p_Y\\) and then \\(p_{X|Y}\\). Find \\(E[X]\\). R: check your answers to a) and b) by drawing 10000 samples from \\(p_Y\\) and \\(p_{X|Y}\\). Solution. \\[\\begin{align} p(y) &= \\int_0^\\infty \\frac{1}{y + 1} e^{-\\frac{x}{y + 1}} 0.5^{y + 1} dx \\\\ &= \\frac{0.5^{y + 1}}{y + 1} \\int_0^\\infty e^{-\\frac{x}{y + 1}} dx \\\\ &= \\frac{0.5^{y + 1}}{y + 1} (y + 1) \\\\ &= 0.5^{y + 1} \\\\ &= 0.5(1 - 0.5)^y. \\end{align}\\] We recognize this as the geometric distribution. \\[\\begin{align} p(x|y) &= \\frac{p(x,y)}{p(y)} \\\\ &= \\frac{1}{y + 1} e^{-\\frac{x}{y + 1}}. \\end{align}\\] We recognize this as the exponential distribution. \\[\\begin{align} E[X | Y = y] &= \\int_0^\\infty x \\frac{1}{y + 1} e^{-\\frac{x}{y + 1}} dx \\\\ &= y + 1 & \\text{expected value of the exponential distribution} \\end{align}\\] Use the law of iterated expectation. \\[\\begin{align} E[X] &= E[E[X | Y]] \\\\ &= E[Y + 1] \\\\ &= E[Y] + 1 \\\\ &= \\frac{1 - 0.5}{0.5} + 1 \\\\ &= 2. \\end{align}\\] set.seed(1) y <- rgeom(100000, 0.5) x <- rexp(100000, rate = 1 / (y + 1)) x2 <- x[y == 3] mean(x2) ## [1] 4.048501 3 + 1 ## [1] 4 mean(x) ## [1] 2.007639 (1 - 0.5) / 0.5 + 1 ## [1] 2 Exercise 7.12 (Cauchy distribution) Let \\(p(x | x_0, \\gamma) = \\frac{1}{\\pi \\gamma \\Big(1 + \\big(\\frac{x - x_0}{\\gamma}\\big)^2\\Big)}\\). A random variable with this PDF follows a Cauchy distribution. This distribution is symmetric and has wider tails than the normal distribution. R: Draw \\(n = 1,...,1000\\) samples from a standard normal and \\(\\text{Cauchy}(0, 1)\\). 
For each \\(n\\) plot the mean and the median of the sample using facets. Interpret the results. To get a mathematical explanation of the results in a), evaluate the integral \\(\\int_0^\\infty \\frac{x}{1 + x^2} dx\\) and consider that \\(E[X] = \\int_{-\\infty}^\\infty \\frac{x}{1 + x^2}dx\\). set.seed(1) n <- 1000 means_n <- vector(mode = "numeric", length = n) means_c <- vector(mode = "numeric", length = n) medians_n <- vector(mode = "numeric", length = n) medians_c <- vector(mode = "numeric", length = n) for (i in 1:n) { tmp_n <- rnorm(i) tmp_c <- rcauchy(i) means_n[i] <- mean(tmp_n) means_c[i] <- mean(tmp_c) medians_n[i] <- median(tmp_n) medians_c[i] <- median(tmp_c) } df <- data.frame("distribution" = c(rep("normal", 2 * n), rep("Cauchy", 2 * n)), "type" = c(rep("mean", n), rep("median", n), rep("mean", n), rep("median", n)), "value" = c(means_n, medians_n, means_c, medians_c), "n" = rep(1:n, times = 4)) ggplot(df, aes(x = n, y = value)) + geom_line(alpha = 0.5) + facet_wrap(~ type + distribution , scales = "free") Solution. \\[\\begin{align} \\int_0^\\infty \\frac{x}{1 + x^2} dx &= \\frac{1}{2} \\int_1^\\infty \\frac{1}{u} du & u = 1 + x^2 \\\\ &= \\frac{1}{2} \\ln(x) |_0^\\infty. \\end{align}\\] This integral is not finite. The same holds for the negative part. Therefore, the expectation is undefined, as \\(E[|X|] = \\infty\\). Why can we not just claim that \\(f(x) = x / (1 + x^2)\\) is odd and \\(\\int_{-\\infty}^\\infty f(x) = 0\\)? By definition of the Lebesgue integral \\(\\int_{-\\infty}^{\\infty} f= \\int_{-\\infty}^{\\infty} f_+-\\int_{-\\infty}^{\\infty} f_-\\). At least one of the two integrals needs to be finite for \\(\\int_{-\\infty}^{\\infty} f\\) to be well-defined. However \\(\\int_{-\\infty}^{\\infty} f_+=\\int_0^{\\infty} x/(1+x^2)\\) and \\(\\int_{-\\infty}^{\\infty} f_-=\\int_{-\\infty}^{0} |x|/(1+x^2)\\). We have just shown that both of these integrals are infinite, which implies that their sum is also infinite. 7.4 Covariance Exercise 7.13 Below is a table of values for random variables \\(X\\) and \\(Y\\). X Y 2.1 8 -0.5 11 1 10 -2 12 4 9 Find sample covariance of \\(X\\) and \\(Y\\). Find sample variances of \\(X\\) and \\(Y\\). Find sample correlation of \\(X\\) and \\(Y\\). Find sample variance of \\(Z = 2X - 3Y\\). Solution. \\(\\bar{X} = 0.92\\) and \\(\\bar{Y} = 10\\). \\[\\begin{align} s(X, Y) &= \\frac{1}{n - 1} \\sum_{i=1}^5 (X_i - 0.92) (Y_i - 10) \\\\ &= -3.175. \\end{align}\\] \\[\\begin{align} s(X) &= \\frac{\\sum_{i=1}^5(X_i - 0.92)^2}{5 - 1} \\\\ &= 5.357. \\end{align}\\] \\[\\begin{align} s(Y) &= \\frac{\\sum_{i=1}^5(Y_i - 10)^2}{5 - 1} \\\\ &= 2.5. \\end{align}\\] \\[\\begin{align} r(X,Y) &= \\frac{Cov(X,Y)}{\\sqrt{Var[X]Var[Y]}} \\\\ &= \\frac{-3.175}{\\sqrt{5.357 \\times 2.5}} \\\\ &= -8.68. \\end{align}\\] \\[\\begin{align} s(Z) &= 2^2 s(X) + 3^2 s(Y) + 2 \\times 2 \\times 3 s(X, Y) \\\\ &= 4 \\times 5.357 + 9 \\times 2.5 + 12 \\times 3.175 \\\\ &= 82.028. \\end{align}\\] Exercise 7.14 Let \\(X \\sim \\text{Uniform}(0,1)\\) and \\(Y | X = x \\sim \\text{Uniform(0,x)}\\). Find the covariance of \\(X\\) and \\(Y\\). Find the correlation of \\(X\\) and \\(Y\\). R: check your answers to a) and b) with simulation. Plot \\(X\\) against \\(Y\\) on a scatterplot. Solution. The joint PDF is \\(p(x,y) = p(x)p(y|x) = \\frac{1}{x}\\). 
\\[\\begin{align} Cov(X,Y) &= E[XY] - E[X]E[Y] \\\\ \\end{align}\\] Let us first evaluate the first term: \\[\\begin{align} E[XY] &= \\int_0^1 \\int_0^x x y \\frac{1}{x} dy dx \\\\ &= \\int_0^1 \\int_0^x y dy dx \\\\ &= \\int_0^1 \\frac{x^2}{2} dx \\\\ &= \\frac{1}{6}. \\end{align}\\] Now let us find \\(E[Y]\\), \\(E[X]\\) is trivial. \\[\\begin{align} E[Y] = E[E[Y | X]] = E[\\frac{X}{2}] = \\frac{1}{2} \\int_0^1 x dx = \\frac{1}{4}. \\end{align}\\] Combining all: \\[\\begin{align} Cov(X,Y) &= \\frac{1}{6} - \\frac{1}{2} \\frac{1}{4} = \\frac{1}{24}. \\end{align}\\] \\[\\begin{align} \\rho(X,Y) &= \\frac{Cov(X,Y)}{\\sqrt{Var[X]Var[Y]}} \\\\ \\end{align}\\] Let us calculate \\(Var[X]\\). \\[\\begin{align} Var[X] &= E[X^2] - \\frac{1}{4} \\\\ &= \\int_0^1 x^2 - \\frac{1}{4} \\\\ &= \\frac{1}{3} - \\frac{1}{4} \\\\ &= \\frac{1}{12}. \\end{align}\\] Let us calculate \\(E[E[Y^2|X]]\\). \\[\\begin{align} E[E[Y^2|X]] &= E[\\frac{x^2}{3}] \\\\ &= \\frac{1}{9}. \\end{align}\\] Then \\(Var[Y] = \\frac{1}{9} - \\frac{1}{16} = \\frac{7}{144}\\). Combining all \\[\\begin{align} \\rho(X,Y) &= \\frac{\\frac{1}{24}}{\\sqrt{\\frac{1}{12}\\frac{7}{144}}} \\\\ &= 0.65. \\end{align}\\] set.seed(1) nsamps <- 10000 x <- runif(nsamps) y <- runif(nsamps, 0, x) cov(x, y) ## [1] 0.04274061 1/24 ## [1] 0.04166667 cor(x, y) ## [1] 0.6629567 (1 / 24) / (sqrt(7 / (12 * 144))) ## [1] 0.6546537 ggplot(data.frame(x = x, y = y), aes(x = x, y = y)) + geom_point(alpha = 0.2) + geom_smooth(method = "lm") "],["mrv.html", "Chapter 8 Multivariate random variables 8.1 Multinomial random variables 8.2 Multivariate normal random variables 8.3 Transformations", " Chapter 8 Multivariate random variables This chapter deals with multivariate random variables. The students are expected to acquire the following knowledge: Theoretical Multinomial distribution. Multivariate normal distribution. Cholesky decomposition. Eigendecomposition. R Sampling from the multinomial distribution. Sampling from the multivariate normal distribution. Matrix decompositions. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } 8.1 Multinomial random variables Exercise 8.1 Let \\(X_i\\), \\(i = 1,...,k\\) represent \\(k\\) events, and \\(p_i\\) the probabilities of these events happening in a trial. Let \\(n\\) be the number of trials, and \\(X\\) a multivariate random variable, the collection of \\(X_i\\). Then \\(p(x) = \\frac{n!}{x_1!x_2!...x_k!} p_1^{x_1} p_2^{x_2}...p_k^{x_k}\\) is the PMF of a multinomial distribution, where \\(n = \\sum_{i = 1}^k x_i\\). Show that the marginal distribution of \\(X_i\\) is a binomial distribution. Take 1000 samples from the multinomial distribution with \\(n=4\\) and probabilities \\(p = (0.2, 0.2, 0.5, 0.1)\\). Then take 1000 samples from four binomial distributions with the same parameters. Inspect the results visually. Solution. We will approach this proof from the probabilistic point of view. W.L.O.G. let \\(x_1\\) be the marginal distribution we are interested in. The term \\(p^{x_1}\\) denotes the probability that event 1 happened \\(x_1\\) times. For this event not to happen, one of the other events needs to happen. So for each of the remaining trials, the probability of another event is \\(\\sum_{i=2}^k p_i = 1 - p_1\\), and there were \\(n - x_1\\) such trials. What is left to do is to calculate the number of permutations of event 1 happening and event 1 not happening. We choose \\(x_1\\) trials, from \\(n\\) trials. 
Therefore \\(p(x_1) = \\binom{n}{x_1} p_1^{x_1} (1 - p_1)^{n - x_1}\\), which is the binomial PMF. Interested students are encouraged to prove this mathematically. set.seed(1) nsamps <- 1000 samps_mult <- rmultinom(nsamps, 4, prob = c(0.2, 0.2, 0.5, 0.1)) samps_mult <- as_tibble(t(samps_mult)) %>% gather() samps <- tibble( V1 = rbinom(nsamps, 4, 0.2), V2 = rbinom(nsamps, 4, 0.2), V3 = rbinom(nsamps, 4, 0.5), V4 = rbinom(nsamps, 4, 0.1) ) %>% gather() %>% bind_rows(samps_mult) %>% bind_cols("dist" = c(rep("binomial", 4*nsamps), rep("multinomial", 4*nsamps))) ggplot(samps, aes(x = value, fill = dist)) + geom_bar(position = "dodge") + facet_wrap(~ key) Exercise 8.2 (Multinomial expected value) Find the expected value, variance and covariance of the multinomial distribution. Hint: First find the expected value for \\(n = 1\\) and then use the fact that the trials are independent. Solution. Let us first calculate the expected value of \\(X_1\\), when \\(n = 1\\). \\[\\begin{align} E[X_1] &= \\sum_{n_1 = 0}^1 \\sum_{n_2 = 0}^1 ... \\sum_{n_k = 0}^1 \\frac{1}{n_1!n_2!...n_k!}p_1^{n_1}p_2^{n_2}...p_k^{n_k}n_1 \\\\ &= \\sum_{n_1 = 0}^1 \\frac{p_1^{n_1} n_1}{n_1!} \\sum_{n_2 = 0}^1 ... \\sum_{n_k = 0}^1 \\frac{1}{n_2!...n_k!}p_2^{n_2}...p_k^{n_k} \\end{align}\\] When \\(n_1 = 0\\) then the whole terms is zero, so we do not need to evaluate other sums. When \\(n_1 = 1\\), all other \\(n_i\\) must be zero, as we have \\(1 = \\sum_{i=1}^k n_i\\). Therefore the other sums equal \\(1\\). So \\(E[X_1] = p_1\\) and \\(E[X_i] = p_i\\) for \\(i = 1,...,k\\). Now let \\(Y_j\\), \\(j = 1,...,n\\), have a multinomial distribution with \\(n = 1\\), and let \\(X\\) have a multinomial distribution with an arbitrary \\(n\\). Then we can write \\(X = \\sum_{j=1}^n Y_j\\). And due to independence \\[\\begin{align} E[X] &= E[\\sum_{j=1}^n Y_j] \\\\ &= \\sum_{j=1}^n E[Y_j] \\\\ &= np. \\end{align}\\] For the variance, we need \\(E[X^2]\\). Let us follow the same procedure as above and first calculate \\(E[X_i]\\) for \\(n = 1\\). The only thing that changes is that the term \\(n_i\\) becomes \\(n_i^2\\). Since we only have \\(0\\) and \\(1\\) this does not change the outcome. So \\[\\begin{align} Var[X_i] &= E[X_i^2] - E[X_i]^2\\\\ &= p_i(1 - p_i). \\end{align}\\] Analogous to above for arbitrary \\(n\\) \\[\\begin{align} Var[X] &= E[X^2] - E[X]^2 \\\\ &= \\sum_{j=1}^n E[Y_j^2] - \\sum_{j=1}^n E[Y_j]^2 \\\\ &= \\sum_{j=1}^n E[Y_j^2] - E[Y_j]^2 \\\\ &= \\sum_{j=1}^n p(1-p) \\\\ &= np(1-p). \\end{align}\\] To calculate the covariance, we need \\(E[X_i X_j]\\). Again, let us start with \\(n = 1\\). Without loss of generality, let us assume \\(i = 1\\) and \\(j = 2\\). \\[\\begin{align} E[X_1 X_2] = \\sum_{n_1 = 0}^1 \\sum_{n_2 = 0}^1 \\frac{p_1^{n_1} n_1}{n_1!} \\frac{p_2^{n_2} n_2}{n_2!} \\sum_{n_3 = 0}^1 ... \\sum_{n_k = 0}^1 \\frac{1}{n_3!...n_k!}p_3^{n_3}...p_k^{n_k}. \\end{align}\\] In the above expression, at each iteration we multiply with \\(n_1\\) and \\(n_2\\). Since \\(n = 1\\), one of these always has to be zero. Therefore \\(E[X_1 X_2] = 0\\) and \\[\\begin{align} Cov(X_i, X_j) &= E[X_i X_j] - E[X_i]E[X_j] \\\\ &= - p_i p_j. \\end{align}\\] For arbitrary \\(n\\), let \\(X = \\sum_{t = 1}^n Y_t\\) be the sum of independent multinomial random variables \\(Y_t = [X_{1t}, X_{2t},...,X_{kt}]^T\\) with \\(n=1\\). Then \\(X_1 = \\sum_{t = 1}^n X_{1t}\\) and \\(X_2 = \\sum_{l = 1}^n X_{2l}\\). 
\\[\\begin{align} Cov(X_1, X_2) &= E[X_1 X_2] - E[X_1] E[X_2] \\\\ &= E[\\sum_{t = 1}^n X_{1t} \\sum_{l = 1}^n X_{2l}] - n^2 p_1 p_2 \\\\ &= \\sum_{t = 1}^n \\sum_{l = 1}^n E[X_{1t} X_{2l}] - n^2 p_1 p_2. \\end{align}\\] For \\(X_{1t}\\) and \\(X_{2l}\\) the expected value is zero when \\(t = l\\). When \\(t \\neq l\\) then they are independent, so the expected value is the product \\(p_1 p_2\\). There are \\(n^2\\) total terms, and for \\(n\\) of them \\(t = l\\) holds. So \\(E[X_1 X_2] = (n^2 - n) p_1 p_2\\). Inserting into the above \\[\\begin{align} Cov(X_1, X_2) &= (n^2 - n) p_1 p_2 - n^2 p_1 p_2 \\\\ &= - n p_1 p_2. \\end{align}\\] 8.2 Multivariate normal random variables Exercise 8.3 (Cholesky decomposition) Let \\(X\\) be a random vector of length \\(k\\) with \\(X_i \\sim \\text{N}(0, 1)\\) and \\(LL^*\\) the Cholesky decomposition of a Hermitian positive-definite matrix \\(A\\). Let \\(\\mu\\) be a vector of length \\(k\\). Find the distribution of the random vector \\(Y = \\mu + L X\\). Find the Cholesky decomposition of \\(A = \\begin{bmatrix} 2 & 1.2 \\\\ 1.2 & 1 \\end{bmatrix}\\). R: Use the results from a) and b) to sample from the MVN distribution \\(\\text{N}(\\mu, A)\\), where \\(\\mu = [1.5, -1]^T\\). Plot a scatterplot and compare it to direct samples from the multivariate normal distribution (rmvnorm). Solution. \\(X\\) has an independent normal distribution of dimension \\(k\\). Then \\[\\begin{align} Y = \\mu + L X &\\sim \\text{N}(\\mu, LL^T) \\\\ &\\sim \\text{N}(\\mu, A). \\end{align}\\] Solve \\[\\begin{align} \\begin{bmatrix} a & 0 \\\\ b & c \\end{bmatrix} \\begin{bmatrix} a & b \\\\ 0 & c \\end{bmatrix} = \\begin{bmatrix} 2 & 1.2 \\\\ 1.2 & 1 \\end{bmatrix} \\end{align}\\] # a set.seed(1) nsamps <- 1000 X <- matrix(data = rnorm(nsamps * 2), ncol = 2) mu <- c(1.5, -1) L <- matrix(data = c(sqrt(2), 0, 1.2 / sqrt(2), sqrt(1 - 1.2^2/2)), ncol = 2, byrow = TRUE) Y <- t(mu + L %*% t(X)) plot_df <- data.frame(rbind(X, Y), c(rep("X", nsamps), rep("Y", nsamps))) colnames(plot_df) <- c("D1", "D2", "var") ggplot(data = plot_df, aes(x = D1, y = D2, colour = as.factor(var))) + geom_point() Exercise 8.4 (Eigendecomposition) R: Let \\(\\Sigma = U \\Lambda U^T\\) be the eigendecomposition of covariance matrix \\(\\Sigma\\). Follow the procedure below, to sample from a multivariate normal with \\(\\mu = [-2, 1]^T\\) and \\(\\Sigma = \\begin{bmatrix} 0.3, -0.5 \\\\ -0.5, 1.6 \\end{bmatrix}\\): Sample from two independent standardized normal distributions to get \\(X\\). Find the eigendecomposition of \\(X\\) (eigen). Multiply \\(X\\) by \\(\\Lambda^{\\frac{1}{2}}\\) to get \\(X2\\). Consider how the eigendecomposition for \\(X2\\) changes compared to \\(X\\). Multiply \\(X2\\) by \\(U\\) to get \\(X3\\). Consider how the eigendecomposition for \\(X3\\) changes compared to \\(X2\\). Add \\(\\mu\\) to \\(X3\\). Consider how the eigendecomposition for \\(X4\\) changes compared to \\(X3\\). Plot the data and the eigenvectors (scaled with \\(\\Lambda^{\\frac{1}{2}}\\)) at each step. Hint: Use geom_segment for the eigenvectors. 
# a set.seed(1) sigma <- matrix(data = c(0.3, -0.5, -0.5, 1.6), nrow = 2, byrow = TRUE) ed <- eigen(sigma) e_val <- ed$values e_vec <- ed$vectors # b set.seed(1) nsamps <- 1000 X <- matrix(data = rnorm(nsamps * 2), ncol = 2) vec1 <- matrix(c(1,0,0,1), nrow = 2) X2 <- t(sqrt(diag(e_val)) %*% t(X)) vec2 <- sqrt(diag(e_val)) %*% vec1 X3 <- t(e_vec %*% t(X2)) vec3 <- e_vec %*% vec2 X4 <- t(c(-2, 1) + t(X3)) vec4 <- c(-2, 1) + vec3 vec_mat <- data.frame(matrix(c(0,0,0,0,0,0,0,0,0,0,0,0,-2,1,-2,1), ncol = 2, byrow = TRUE), t(cbind(vec1, vec2, vec3, vec4)), c(1,1,2,2,3,3,4,4)) df <- data.frame(rbind(X, X2, X3, X4), c(rep(1, nsamps), rep(2, nsamps), rep(3, nsamps), rep(4, nsamps))) colnames(df) <- c("D1", "D2", "wh") colnames(vec_mat) <- c("D1", "D2", "E1", "E2", "wh") ggplot(data = df, aes(x = D1, y = D2)) + geom_point() + geom_segment(data = vec_mat, aes(xend = E1, yend = E2), color = "red") + facet_wrap(~ wh) + coord_fixed() Exercise 8.5 (Marginal and conditional distributions) Let \\(X \\sim \\text{N}(\\mu, \\Sigma)\\), where \\(\\mu = [2, 0, -1]^T\\) and \\(\\Sigma = \\begin{bmatrix} 1 & -0.2 & 0.5 \\\\ -0.2 & 1.4 & -1.2 \\\\ 0.5 & -1.2 & 2 \\\\ \\end{bmatrix}\\). Let \\(A\\) represent the first two random variables and \\(B\\) the third random variable. R: For the calculation in the following points, you can use R. Find the marginal distribution of \\(B\\). Find the conditional distribution of \\(B | A = [a_1, a_2]^T\\). Find the marginal distribution of \\(A\\). Find the conditional distribution of \\(A | B = b\\). R: Visually compare the distributions of a) and b), and c) and d) at three different conditional values. mu <- c(2, 0, -1) Sigma <- matrix(c(1, -0.2, 0.5, -0.2, 1.4, -1.2, 0.5, -1.2, 2), nrow = 3, byrow = TRUE) mu_A <- c(2, 0) mu_B <- -1 Sigma_A <- Sigma[1:2, 1:2] Sigma_B <- Sigma[3, 3] Sigma_AB <- Sigma[1:2, 3] # b tmp_b <- t(Sigma_AB) %*% solve(Sigma_A) mu_b <- mu_B - tmp_b %*% mu_A Sigma_b <- Sigma_B - t(Sigma_AB) %*% solve(Sigma_A) %*% Sigma_AB mu_b ## [,1] ## [1,] -1.676471 tmp_b ## [,1] [,2] ## [1,] 0.3382353 -0.8088235 Sigma_b ## [,1] ## [1,] 0.8602941 # d tmp_a <- Sigma_AB * (1 / Sigma_B) mu_a <- mu_A - tmp_a * mu_B Sigma_d <- Sigma_A - (Sigma_AB * (1 / Sigma_B)) %*% t(Sigma_AB) mu_a ## [1] 2.25 -0.60 tmp_a ## [1] 0.25 -0.60 Sigma_d ## [,1] [,2] ## [1,] 0.875 0.10 ## [2,] 0.100 0.68 Solution. \\(B \\sim \\text{N}(-1, 2)\\). \\(B | A = a \\sim \\text{N}(-1.68 + [0.34, -0.81] a, 0.86)\\). \\(\\mu_A = [2, 0]^T\\) and \\(\\Sigma_A = \\begin{bmatrix} 1 & -0.2 & \\\\ -0.2 & 1.4 \\\\ \\end{bmatrix}\\). 
\\[\\begin{align} A | B = b &\\sim \\text{N}(\\mu_t, \\Sigma_t), \\\\ \\mu_t &= [2.25, -0.6]^T + [0.25, -0.6]^T b, \\\\ \\Sigma_t &= \\begin{bmatrix} 0.875 & 0.1 \\\\ 0.1 & 0.68 \\\\ \\end{bmatrix} \\end{align}\\] library(mvtnorm) set.seed(1) nsamps <- 1000 # a and b samps <- as.data.frame(matrix(data = NA, nrow = 4 * nsamps, ncol = 2)) samps[1:nsamps,1] <- rnorm(nsamps, mu_B, Sigma_B) samps[1:nsamps,2] <- "marginal" for (i in 1:3) { a <- rmvnorm(1, mu_A, Sigma_A) samps[(i*nsamps + 1):((i + 1) * nsamps), 1] <- rnorm(nsamps, mu_b + tmp_b %*% t(a), Sigma_b) samps[(i*nsamps + 1):((i + 1) * nsamps), 2] <- paste0(# "cond", round(a, digits = 2), collapse = "-") } colnames(samps) <- c("x", "dist") ggplot(samps, aes(x = x)) + geom_density() + facet_wrap(~ dist) # c and d samps <- as.data.frame(matrix(data = NA, nrow = 4 * nsamps, ncol = 3)) samps[1:nsamps,1:2] <- rmvnorm(nsamps, mu_A, Sigma_A) samps[1:nsamps,3] <- "marginal" for (i in 1:3) { b <- rnorm(1, mu_B, Sigma_B) samps[(i*nsamps + 1):((i + 1) * nsamps), 1:2] <- rmvnorm(nsamps, mu_a + tmp_a * b, Sigma_d) samps[(i*nsamps + 1):((i + 1) * nsamps), 3] <- b } colnames(samps) <- c("x", "y", "dist") ggplot(samps, aes(x = x, y = y)) + geom_point() + geom_smooth(method = "lm") + facet_wrap(~ dist) 8.3 Transformations Exercise 8.6 Let \\((U,V)\\) be a random variable with PDF \\(p(u,v) = \\frac{1}{4 \\sqrt{u}}\\), \\(U \\in [0,4]\\) and \\(V \\in [\\sqrt{U}, \\sqrt{U} + 1]\\). Let \\(X = \\sqrt{U}\\) and \\(Y = V - \\sqrt{U}\\). Find PDF of \\((X,Y)\\). What can you tell about distributions of \\(X\\) and \\(Y\\)? This exercise shows how we can simplify a probabilistic problem with a clever use of transformations. R: Take 1000 samples from \\((X,Y)\\) and transform them with inverses of the above functions to get samples from \\((U,V)\\). Plot both sets of samples. Solution. First we need to find the inverse functions. Since \\(x = \\sqrt{u}\\) it follows that \\(u = x^2\\), and that \\(x \\in [0,2]\\). Similarly \\(v = y + x\\) and \\(y \\in [0,1]\\). Let us first find the Jacobian. \\[\\renewcommand\\arraystretch{1.6} J(x,y) = \\begin{bmatrix} \\frac{\\partial u}{\\partial x} & \\frac{\\partial v}{\\partial x} \\\\%[1ex] % <-- 1ex more space between rows of matrix \\frac{\\partial u}{\\partial y} & \\frac{\\partial v}{\\partial y} \\end{bmatrix} = \\begin{bmatrix} 2x & 1 \\\\%[1ex] % <-- 1ex more space between rows of matrix 0 & 1 \\end{bmatrix}, \\] and the determinant is \\(|J(x,y)| = 2x\\). Putting everything together, we get \\[\\begin{align} p_{X,Y}(x,y) = p_{U,V}(x^2, y + x) |J(x,y)| = \\frac{1}{4 \\sqrt{x^2}} 2x = \\frac{1}{2}. \\end{align}\\] This reminds us of the Uniform distribution. Indeed we can see that \\(p_X(x) = \\frac{1}{2}\\) and \\(p_Y(y) = 1\\). So instead of dealing with an awkward PDF of \\((U,V)\\) and the corresponding dynamic bounds, we are now looking at two independent Uniform random variables. In practice, this could make modeling much easier. set.seed(1) nsamps <- 2000 x <- runif(nsamps, min = 0, max = 2) y <- runif(nsamps) orig <- tibble(x = x, y = y, vrs = "original") u <- x^2 v <- y + x transf <- tibble(x = u, y = v, vrs = "transformed") df <- bind_rows(orig, transf) ggplot(df, aes(x = x, y = y, color = vrs)) + geom_point(alpha = 0.3) Exercise 8.7 R: Write a function that will calculate the probability density of an arbitraty multivariate normal distribution, based on independent standardized normal PDFs. Compare with dmvnorm from the mvtnorm package. 
library(mvtnorm) set.seed(1) mvn_dens <- function (y, mu, Sigma) { L <- chol(Sigma) L_inv <- solve(t(L)) g_inv <- L_inv %*% t(y - mu) J <- L_inv J_det <- det(J) return(prod(dnorm(g_inv)) * J_det) } mu_v <- c(-2, 0, 1) cov_m <- matrix(c(1, -0.2, 0.5, -0.2, 2, 0.3, 0.5, 0.3, 1.6), ncol = 3, byrow = TRUE) n_comp <- 20 for (i in 1:n_comp) { x <- rmvnorm(1, mean = mu_v, sigma = cov_m) print(paste0("My function: ", mvn_dens(x, mu_v, cov_m), ", dmvnorm: ", dmvnorm(x, mu_v, cov_m))) } ## [1] "My function: 0.0229514237156383, dmvnorm: 0.0229514237156383" ## [1] "My function: 0.00763138915406231, dmvnorm: 0.00763138915406231" ## [1] "My function: 0.0230688881105741, dmvnorm: 0.0230688881105741" ## [1] "My function: 0.0113616213114731, dmvnorm: 0.0113616213114731" ## [1] "My function: 0.00151808500121907, dmvnorm: 0.00151808500121907" ## [1] "My function: 0.0257658045974509, dmvnorm: 0.0257658045974509" ## [1] "My function: 0.0157963825730805, dmvnorm: 0.0157963825730805" ## [1] "My function: 0.00408856287529248, dmvnorm: 0.00408856287529248" ## [1] "My function: 0.0327793540101256, dmvnorm: 0.0327793540101256" ## [1] "My function: 0.0111606542967978, dmvnorm: 0.0111606542967978" ## [1] "My function: 0.0147636757585684, dmvnorm: 0.0147636757585684" ## [1] "My function: 0.0142948300412207, dmvnorm: 0.0142948300412207" ## [1] "My function: 0.0203093820657542, dmvnorm: 0.0203093820657542" ## [1] "My function: 0.0287533273357481, dmvnorm: 0.0287533273357481" ## [1] "My function: 0.0213402305128623, dmvnorm: 0.0213402305128623" ## [1] "My function: 0.0218356957993885, dmvnorm: 0.0218356957993885" ## [1] "My function: 0.0250750113961771, dmvnorm: 0.0250750113961771" ## [1] "My function: 0.0166498666348048, dmvnorm: 0.0166498666348048" ## [1] "My function: 0.00189725106874659, dmvnorm: 0.00189725106874659" ## [1] "My function: 0.0196697814975113, dmvnorm: 0.0196697814975113" "],["ard.html", "Chapter 9 Alternative representation of distributions 9.1 Probability generating functions (PGFs) 9.2 Moment generating functions (MGFs)", " Chapter 9 Alternative representation of distributions This chapter deals with alternative representation of distributions. The students are expected to acquire the following knowledge: Theoretical Probability generating functions. Moment generating functions. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } 9.1 Probability generating functions (PGFs) Exercise 9.1 Show that the sum of independent Poisson random variables is itself a Poisson random variable. R: Let \\(X\\) be a sum of three Poisson distributions with \\(\\lambda_i \\in \\{2, 5.2, 10\\}\\). Take 1000 samples and plot the three distributions and the sum. Then take 1000 samples from the theoretical distribution of \\(X\\) and compare them to the sum. Solution. Let \\(X_i \\sim \\text{Poisson}(\\lambda_i)\\) for \\(i = 1,...,n\\), and let \\(X = \\sum_{i=1}^n X_i\\). 
\\[\\begin{align} \\alpha_X(t) &= \\prod_{i=1}^n \\alpha_{X_i}(t) \\\\ &= \\prod_{i=1}^n \\bigg( \\sum_{j=0}^\\infty t^j \\frac{\\lambda_i^j e^{-\\lambda_i}}{j!} \\bigg) \\\\ &= \\prod_{i=1}^n \\bigg( e^{-\\lambda_i} \\sum_{j=0}^\\infty \\frac{(t\\lambda_i)^j }{j!} \\bigg) \\\\ &= \\prod_{i=1}^n \\bigg( e^{-\\lambda_i} e^{t \\lambda_i} \\bigg) & \\text{power series} \\\\ &= \\prod_{i=1}^n \\bigg( e^{\\lambda_i(t - 1)} \\bigg) \\\\ &= e^{\\sum_{i=1}^n \\lambda_i(t - 1)} \\\\ &= e^{t \\sum_{i=1}^n \\lambda_i - \\sum_{i=1}^n \\lambda_i} \\\\ &= e^{-\\sum_{i=1}^n \\lambda_i} \\sum_{j=0}^\\infty \\frac{(t \\sum_{i=1}^n \\lambda_i)^j}{j!}\\\\ &= \\sum_{j=0}^\\infty \\frac{e^{-\\sum_{i=1}^n \\lambda_i} (t \\sum_{i=1}^n \\lambda_i)^j}{j!}\\\\ \\end{align}\\] The last term is the PGF of a Poisson random variable with parameter \\(\\sum_{i=1}^n \\lambda_i\\). Because the PGF is unique, \\(X\\) is a Poisson random variable. set.seed(1) library(tidyr) nsamps <- 1000 samps <- matrix(data = NA, nrow = nsamps, ncol = 4) samps[ ,1] <- rpois(nsamps, 2) samps[ ,2] <- rpois(nsamps, 5.2) samps[ ,3] <- rpois(nsamps, 10) samps[ ,4] <- samps[ ,1] + samps[ ,2] + samps[ ,3] colnames(samps) <- c(2, 2.5, 10, "sum") gsamps <- as_tibble(samps) gsamps <- gather(gsamps, key = "dist", value = "value") ggplot(gsamps, aes(x = value)) + geom_bar() + facet_wrap(~ dist) samps <- cbind(samps, "theoretical" = rpois(nsamps, 2 + 5.2 + 10)) gsamps <- as_tibble(samps[ ,4:5]) gsamps <- gather(gsamps, key = "dist", value = "value") ggplot(gsamps, aes(x = value, fill = dist)) + geom_bar(position = "dodge") Exercise 9.2 Find the expected value and variance of the negative binomial distribution. Hint: Find the Taylor series of \\((1 - y)^{-r}\\) at point 0. Solution. Let \\(X \\sim \\text{NB}(r, p)\\). \\[\\begin{align} \\alpha_X(t) &= E[t^X] \\\\ &= \\sum_{j=0}^\\infty t^j \\binom{j + r - 1}{j} (1 - p)^r p^j \\\\ &= (1 - p)^r \\sum_{j=0}^\\infty \\binom{j + r - 1}{j} (tp)^j \\\\ &= (1 - p)^r \\sum_{j=0}^\\infty \\frac{(j + r - 1)(j + r - 2)...r}{j!} (tp)^j. \\\\ \\end{align}\\] Let us look at the Taylor series of \\((1 - y)^{-r}\\) at 0 \\[\\begin{align} (1 - y)^{-r} = &1 + \\frac{-r(-1)}{1!}y + \\frac{-r(-r - 1)(-1)^2}{2!}y^2 + \\\\ &\\frac{-r(-r - 1)(-r - 2)(-1)^3}{3!}y^3 + ... \\\\ \\end{align}\\] How does the \\(k\\)-th term look like? We have \\(k\\) derivatives of our function so \\[\\begin{align} \\frac{d^k}{d^k y} (1 - y)^{-r} &= \\frac{-r(-r - 1)...(-r - k + 1)(-1)^k}{k!}y^k \\\\ &= \\frac{r(r + 1)...(r + k - 1)}{k!}y^k. \\end{align}\\] We observe that this equals to the \\(j\\)-th term in the sum of NB PGF. Therefore \\[\\begin{align} \\alpha_X(t) &= (1 - p)^r (1 - tp)^{-r} \\\\ &= \\Big(\\frac{1 - p}{1 - tp}\\Big)^r \\end{align}\\] To find the expected value, we need to differentiate \\[\\begin{align} \\frac{d}{dt} \\Big(\\frac{1 - p}{1 - tp}\\Big)^r &= r \\Big(\\frac{1 - p}{1 - tp}\\Big)^{r-1} \\frac{d}{dt} \\frac{1 - p}{1 - tp} \\\\ &= r \\Big(\\frac{1 - p}{1 - tp}\\Big)^{r-1} \\frac{p(1 - p)}{(1 - tp)^2}. \\\\ \\end{align}\\] Evaluating this at 1, we get: \\[\\begin{align} E[X] = \\frac{rp}{1 - p}. \\end{align}\\] For the variance we need the second derivative. 
\\[\\begin{align} \\frac{d^2}{d^2t} \\Big(\\frac{1 - p}{1 - tp}\\Big)^r &= \\frac{p^2 r (r + 1) (\\frac{1 - p}{1 - tp})^r}{(tp - 1)^2} \\end{align}\\] Evaluating this at 1 and inserting the first derivatives, we get: \\[\\begin{align} Var[X] &= \\frac{d^2}{dt^2} \\alpha_X(1) + \\frac{d}{dt}\\alpha_X(1) - \\Big(\\frac{d}{dt}\\alpha_X(t) \\Big)^2 \\\\ &= \\frac{p^2 r (r + 1)}{(1 - p)^2} + \\frac{rp}{1 - p} - \\frac{r^2p^2}{(1 - p)^2} \\\\ &= \\frac{rp}{(1 - p)^2}. \\end{align}\\] library(tidyr) set.seed(1) nsamps <- 100000 find_p <- function (mu, r) { return (10 / (r + 10)) } r <- c(1,2,10,20) p <- find_p(10, r) sigma <- rep(sqrt(p*r / (1 - p)^2), each = nsamps) samps <- cbind("r=1" = rnbinom(nsamps, size = r[1], prob = 1 - p[1]), "r=2" = rnbinom(nsamps, size = r[2], prob = 1 - p[2]), "r=4" = rnbinom(nsamps, size = r[3], prob = 1 - p[3]), "r=20" = rnbinom(nsamps, size = r[4], prob = 1 - p[4])) gsamps <- gather(as.data.frame(samps)) iw <- (gsamps$value > sigma + 10) | (gsamps$value < sigma - 10) ggplot(gsamps, aes(x = value, fill = iw)) + geom_bar() + # geom_density() + facet_wrap(~ key) 9.2 Moment generating functions (MGFs) Exercise 9.3 Find the variance of the geometric distribution. Solution. Let \\(X \\sim \\text{Geometric}(p)\\). The MGF of the geometric distribution is \\[\\begin{align} M_X(t) &= E[e^{tX}] \\\\ &= \\sum_{k=0}^\\infty p(1 - p)^k e^{tk} \\\\ &= p \\sum_{k=0}^\\infty ((1 - p)e^t)^k. \\end{align}\\] Let us assume that \\((1 - p)e^t < 1\\). Then, by using the geometric series we get \\[\\begin{align} M_X(t) &= \\frac{p}{1 - e^t + pe^t}. \\end{align}\\] The first derivative of the above expression is \\[\\begin{align} \\frac{d}{dt}M_X(t) &= \\frac{-p(-e^t + pe^t)}{(1 - e^t + pe^t)^2}, \\end{align}\\] and evaluating at \\(t = 0\\), we get \\(\\frac{1 - p}{p}\\), which we already recognize as the expected value of the geometric distribution. The second derivative is \\[\\begin{align} \\frac{d^2}{dt^2}M_X(t) &= \\frac{(p-1)pe^t((p-1)e^t - 1)}{((p - 1)e^t + 1)^3}, \\end{align}\\] and evaluating at \\(t = 0\\), we get \\(\\frac{(p - 1)(p - 2)}{p^2}\\). Combining we get the variance \\[\\begin{align} Var(X) &= \\frac{(p - 1)(p - 2)}{p^2} - \\frac{(1 - p)^2}{p^2} \\\\ &= \\frac{(p-1)(p-2) - (1-p)^2}{p^2} \\\\ &= \\frac{1 - p}{p^2}. \\end{align}\\] Exercise 9.4 Find the distribution of sum of two normal random variables \\(X\\) and \\(Y\\), by comparing \\(M_{X+Y}(t)\\) to \\(M_X(t)\\). R: To illustrate the result draw random samples from N\\((-3, 1)\\) and N\\((5, 1.2)\\) and calculate the empirical mean and variance of \\(X+Y\\). Plot all three histograms in one plot. Solution. Let \\(X \\sim \\text{N}(\\mu_X, 1)\\) and \\(Y \\sim \\text{N}(\\mu_Y, 1)\\). The MGF of the sum is \\[\\begin{align} M_{X+Y}(t) &= M_X(t) M_Y(t). \\end{align}\\] Let us calculate \\(M_X(t)\\), the MGF for \\(Y\\) then follows analogously. 
\\[\\begin{align} M_X(t) &= \\int_{-\\infty}^\\infty e^{tx} \\frac{1}{\\sqrt{2 \\pi \\sigma_X^2}} e^{-\\frac{(x - mu_X)^2}{2\\sigma_X^2}} dx \\\\ &= \\int_{-\\infty}^\\infty \\frac{1}{\\sqrt{2 \\pi \\sigma_X^2}} e^{-\\frac{(x - mu_X)^2 - 2\\sigma_X tx}{2\\sigma_X^2}} dx \\\\ &= \\int_{-\\infty}^\\infty \\frac{1}{\\sqrt{2 \\pi \\sigma_X^2}} e^{-\\frac{x^2 - 2\\mu_X x + \\mu_X^2 - 2\\sigma_X tx}{2\\sigma_X^2}} dx \\\\ &= \\int_{-\\infty}^\\infty \\frac{1}{\\sqrt{2 \\pi \\sigma_X^2}} e^{-\\frac{(x - (\\mu_X + \\sigma_X^2 t))^2 + \\mu_X^2 - (\\mu_X + \\sigma_X^2 t)^2}{2\\sigma_X^2}} dx & \\text{complete the square}\\\\ &= e^{-\\frac{\\mu_X^2 - (\\mu_X + \\sigma_X^2 t)^2}{2\\sigma_X^2}} \\int_{-\\infty}^\\infty \\frac{1}{\\sqrt{2 \\pi \\sigma_X^2}} e^{-\\frac{(x - (\\mu_X + \\sigma_X^2 t))^2}{2\\sigma_X^2}} dx & \\\\ &= e^{-\\frac{\\mu_X^2 - (\\mu_X + \\sigma_X^2 t)^2}{2\\sigma_X^2}} & \\text{normal PDF} \\\\ &= e^{-\\frac{\\mu_X^2 - \\mu_X^2 - \\mu_X \\sigma_X^2 t - 2 \\sigma_X^4 t^2}{2\\sigma_X^2}} \\\\ &= e^{\\sigma_X^2 t^2 + \\frac{\\mu_X t}{2}}. \\\\ \\end{align}\\] The MGF of the sum is then \\[\\begin{align} M_{X+Y}(t) &= e^{\\sigma_X^2 t^2 + 0.5\\mu_X t} e^{\\sigma_Y^2 t^2 + 0.5\\mu_Y t} \\\\ &= e^{t^2(\\sigma_X^2 + \\sigma_Y^2) + 0.5 t(\\mu_X + \\mu_Y)}. \\end{align}\\] By comparing \\(M_{X+Y}(t)\\) and \\(M_X(t)\\) we observe that both have two terms. The first is \\(2t^2\\) multiplied by the variance, and the second is \\(2t\\) multiplied by the mean. Since MGFs are unique, we conclude that \\(Z = X + Y \\sim \\text{N}(\\mu_X + \\mu_Y, \\sigma_X^2 + \\sigma_Y^2)\\). library(tidyr) library(ggplot2) set.seed(1) nsamps <- 1000 x <- rnorm(nsamps, -3, 1) y <- rnorm(nsamps, 5, 1.2) z <- x + y mean(z) ## [1] 1.968838 var(z) ## [1] 2.645034 df <- data.frame(x = x, y = y, z = z) %>% gather() ggplot(df, aes(x = value, fill = key)) + geom_histogram(position = "dodge") "],["ci.html", "Chapter 10 Concentration inequalities 10.1 Comparison 10.2 Practical", " Chapter 10 Concentration inequalities This chapter deals with concentration inequalities. The students are expected to acquire the following knowledge: Theoretical More assumptions produce closer bounds. R Optimization. Estimating probability inequalities. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } 10.1 Comparison Exercise 10.1 R: Let \\(X\\) be geometric random variable with \\(p = 0.7\\). Visually compare the Markov bound, Chernoff bound, and the theoretical probabilities for \\(x = 1,...,12\\). To get the best fitting Chernoff bound, you will need to optimize the bound depending on \\(t\\). Use either analytical or numerical optimization. 
bound_chernoff <- function (t, p, a) { return ((p / (1 - exp(t) + p * exp(t))) / exp(a * t)) } set.seed(1) p <- 0.7 a <- seq(1, 12, by = 1) ci_markov <- (1 - p) / p / a t <- vector(mode = "numeric", length = length(a)) for (i in 1:length(t)) { t[i] <- optimize(bound_chernoff, interval = c(0, log(1 / (1 - p))), p = p, a = a[i])$minimum } t ## [1] 0.5108267 0.7984981 0.9162927 0.9808238 1.0216635 1.0498233 1.0704327 ## [8] 1.0861944 1.0986159 1.1086800 1.1169653 1.1239426 ci_chernoff <- (p / (1 - exp(t) + p * exp(t))) / exp(a * t) actual <- 1 - pgeom(a, 0.7) plot_df <- rbind( data.frame(x = a, y = ci_markov, type = "Markov"), data.frame(x = a, y = ci_chernoff, type = "Chernoff"), data.frame(x = a, y = actual, type = "Actual") ) ggplot(plot_df, aes(x = x, y = y, color = type)) + geom_line() Exercise 10.2 R: Let \\(X\\) be a sum of 100 Beta distributions with random parameters. Take 1000 samples and plot the Chebyshev bound, Hoeffding bound, and the empirical probabilities. set.seed(1) nvars <- 100 nsamps <- 1000 samps <- matrix(data = NA, nrow = nsamps, ncol = nvars) Sn_mean <- 0 Sn_var <- 0 for (i in 1:nvars) { alpha1 <- rgamma(1, 10, 1) beta1 <- rgamma(1, 10, 1) X <- rbeta(nsamps, alpha1, beta1) Sn_mean <- Sn_mean + alpha1 / (alpha1 + beta1) Sn_var <- Sn_var + alpha1 * beta1 / ((alpha1 + beta1)^2 * (alpha1 + beta1 + 1)) samps[ ,i] <- X } mean(apply(samps, 1, sum)) ## [1] 51.12511 Sn_mean ## [1] 51.15723 var(apply(samps, 1, sum)) ## [1] 1.170652 Sn_var ## [1] 1.166183 a <- 1:30 b <- a / sqrt(Sn_var) ci_chebyshev <- 1 / b^2 ci_hoeffding <- 2 * exp(- 2 * a^2 / nvars) empirical <- NULL for (i in 1:length(a)) { empirical[i] <- sum(abs((apply(samps, 1, sum)) - Sn_mean) >= a[i])/ nsamps } plot_df <- rbind( data.frame(x = a, y = ci_chebyshev, type = "Chebyshev"), data.frame(x = a, y = ci_hoeffding, type = "Hoeffding"), data.frame(x = a, y = empirical, type = "Empirical") ) ggplot(plot_df, aes(x = x, y = y, color = type)) + geom_line() ggplot(plot_df, aes(x = x, y = y, color = type)) + geom_line() + coord_cartesian(xlim = c(15, 25), ylim = c(0, 0.05)) 10.2 Practical Exercise 10.3 From Jagannathan. Let \\(X_i\\), \\(i = 1,...n\\), be a random sample of size \\(n\\) of a random variable \\(X\\). Let \\(X\\) have mean \\(\\mu\\) and variance \\(\\sigma^2\\). Find the size of the sample \\(n\\) required so that the probability that the difference between sample mean and true mean is smaller than \\(\\frac{\\sigma}{10}\\) is at least 0.95. Hint: Derive a version of the Chebyshev inequality for \\(P(|X - \\mu| \\geq a)\\) using Markov inequality. Solution. Let \\(\\bar{X} = \\sum_{i=1}^n X_i\\). Then \\(E[\\bar{X}] = \\mu\\) and \\(Var[\\bar{X}] = \\frac{\\sigma^2}{n}\\). Let us first derive another representation of Chebyshev inequality. \\[\\begin{align} P(|X - \\mu| \\geq a) = P(|X - \\mu|^2 \\geq a^2) \\leq \\frac{E[|X - \\mu|^2]}{a^2} = \\frac{Var[X]}{a^2}. \\end{align}\\] Let us use that on our sampling distribution: \\[\\begin{align} P(|\\bar{X} - \\mu| \\geq \\frac{\\sigma}{10}) \\leq \\frac{100 Var[\\bar{X}]}{\\sigma^2} = \\frac{100 Var[X]}{n \\sigma^2} = \\frac{100}{n}. \\end{align}\\] We are interested in the difference being smaller, therefore \\[\\begin{align} P(|\\bar{X} - \\mu| < \\frac{\\sigma}{10}) = 1 - P(|\\bar{X} - \\mu| \\geq \\frac{\\sigma}{10}) \\geq 1 - \\frac{100}{n} \\geq 0.95. \\end{align}\\] It follows that we need a sample size of \\(n \\geq \\frac{100}{0.05} = 2000\\). 
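A short simulation, assuming for illustration that \\(X\\) is standard normal (so \\(\\mu = 0\\) and \\(\\sigma = 1\\); the exercise itself makes no distributional assumption), suggests how conservative this Chebyshev-based sample size is: with \\(n = 2000\\) the empirical probability that \\(|\\bar{X} - \\mu| < \\frac{\\sigma}{10}\\) is essentially one, comfortably above the guaranteed 0.95.
set.seed(1)
n <- 2000
nreps <- 10000
hits <- replicate(nreps, abs(mean(rnorm(n))) < 1 / 10)   # event |Xbar - mu| < sigma / 10
mean(hits)                                               # empirical probability; close to 1, so well above 0.95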
"],["crv.html", "Chapter 11 Convergence of random variables", " Chapter 11 Convergence of random variables This chapter deals with convergence of random variables. The students are expected to acquire the following knowledge: Theoretical Finding convergences of random variables. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } Exercise 11.1 Let \\(X_1\\), \\(X_2\\),…, \\(X_n\\) be a sequence of Bernoulli random variables. Let \\(Y_n = \\frac{X_1 + X_2 + ... + X_n}{n^2}\\). Show that this sequence converges point-wise to the zero random variable. R: Use a simulation to check your answer. Solution. Let \\(\\epsilon\\) be arbitrary. We need to find such \\(n_0\\), that for every \\(n\\) greater than \\(n_0\\) \\(|Y_n| < \\epsilon\\) holds. \\[\\begin{align} |Y_n| &= |\\frac{X_1 + X_2 + ... + X_n}{n^2}| \\\\ &\\leq |\\frac{n}{n^2}| \\\\ &= \\frac{1}{n}. \\end{align}\\] So we need to find such \\(n_0\\), that for every \\(n > n_0\\) we will have \\(\\frac{1}{n} < \\epsilon\\). So \\(n_0 > \\frac{1}{\\epsilon}\\). x <- 1:1000 X <- matrix(data = NA, nrow = length(x), ncol = 100) y <- vector(mode = "numeric", length = length(x)) for (i in 1:length(x)) { X[i, ] <- rbinom(100, size = 1, prob = 0.5) } X <- apply(X, 2, cumsum) tmp_mat <- matrix(data = (1:1000)^2, nrow = 1000, ncol = 100) X <- X / tmp_mat y <- apply(X, 1, mean) ggplot(data.frame(x = x, y = y), aes(x = x, y = y)) + geom_line() Exercise 11.2 Let \\(\\Omega = [0,1]\\) and let \\(X_n\\) be a sequence of random variables, defined as \\[\\begin{align} X_n(\\omega) = \\begin{cases} \\omega^3, &\\omega = \\frac{i}{n}, &0 \\leq i \\leq 1 \\\\ 1, & \\text{otherwise.} \\end{cases} \\end{align}\\] Show that \\(X_n\\) converges almost surely to \\(X \\sim \\text{Uniform}(0,1)\\). Solution. We need to show \\(P(\\{\\omega: X_n(\\omega) \\rightarrow X(\\omega)\\}) = 1\\). Let \\(\\omega \\neq \\frac{i}{n}\\). Then for any \\(\\omega\\), \\(X_n\\) converges pointwise to \\(X\\): \\[\\begin{align} X_n(\\omega) = 1 \\implies |X_n(\\omega) - X(s)| = |1 - 1| < \\epsilon. \\end{align}\\] The above is independent of \\(n\\). Since there are countably infinite number of elements in the complement (\\(\\frac{i}{n}\\)), the probability of this set is 1. Exercise 11.3 Borrowed from Wasserman. Let \\(X_n \\sim \\text{N}(0, \\frac{1}{n})\\) and let \\(X\\) be a random variable with CDF \\[\\begin{align} F_X(x) = \\begin{cases} 0, &x < 0 \\\\ 1, &x \\geq 0. \\end{cases} \\end{align}\\] Does \\(X_n\\) converge to \\(X\\) in distribution? How about in probability? Prove or disprove these statement. R: Plot the CDF of \\(X_n\\) for \\(n = 1, 2, 5, 10, 100, 1000\\). Solution. Let us first check convergence in distribution. \\[\\begin{align} \\lim_{n \\rightarrow \\infty} F_{X_n}(x) &= \\lim_{n \\rightarrow \\infty} \\phi (\\sqrt(n) x). \\end{align}\\] We have two cases, for \\(x < 0\\) and \\(x > 0\\). We do not need to check for \\(x = 0\\), since \\(F_X\\) is not continuous in that point. \\[\\begin{align} \\lim_{n \\rightarrow \\infty} \\phi (\\sqrt(n) x) = \\begin{cases} 0, & x < 0 \\\\ 1, & x > 0. \\end{cases} \\end{align}\\] This is the same as \\(F_X\\). Let us now check convergence in probability. 
Since \\(X\\) is a point-mass distribution at zero, we have \\[\\begin{align} \\lim_{n \\rightarrow \\infty} P(|X_n| > \\epsilon) &= \\lim_{n \\rightarrow \\infty} (P(X_n > \\epsilon) + P(X_n < -\\epsilon)) \\\\ &= \\lim_{n \\rightarrow \\infty} (1 - P(X_n < \\epsilon) + P(X_n < -\\epsilon)) \\\\ &= \\lim_{n \\rightarrow \\infty} (1 - \\phi(\\sqrt{n} \\epsilon) + \\phi(- \\sqrt{n} \\epsilon)) \\\\ &= 0. \\end{align}\\] n <- c(1,2,5,10,100,1000) ggplot(data = data.frame(x = seq(-5, 5, by = 0.01)), aes(x = x)) + stat_function(fun = pnorm, args = list(mean = 0, sd = 1/1), aes(color = "sd = 1/1")) + stat_function(fun = pnorm, args = list(mean = 0, sd = 1/2), aes(color = "sd = 1/2")) + stat_function(fun = pnorm, args = list(mean = 0, sd = 1/5), aes(color = "sd = 1/5")) + stat_function(fun = pnorm, args = list(mean = 0, sd = 1/10), aes(color = "sd = 1/10")) + stat_function(fun = pnorm, args = list(mean = 0, sd = 1/100), aes(color = "sd = 1/100")) + stat_function(fun = pnorm, args = list(mean = 0, sd = 1/1000), aes(color = "sd = 1/1000")) + stat_function(fun = pnorm, args = list(mean = 0, sd = 1/10000), aes(color = "sd = 1/10000")) Exercise 11.4 Let \\(X_i\\) be i.i.d. and \\(\\mu = E(X_1)\\). Let variance of \\(X_1\\) be finite. Show that the mean of \\(X_i\\), \\(\\bar{X}_n = \\frac{1}{n}\\sum_{i=1}^n X_i\\) converges in quadratic mean to \\(\\mu\\). Solution. \\[\\begin{align} \\lim_{n \\rightarrow \\infty} E(|\\bar{X_n} - \\mu|^2) &= \\lim_{n \\rightarrow \\infty} E(\\bar{X_n}^2 - 2 \\bar{X_n} \\mu + \\mu^2) \\\\ &= \\lim_{n \\rightarrow \\infty} (E(\\bar{X_n}^2) - 2 \\mu E(\\frac{\\sum_{i=1}^n X_i}{n}) + \\mu^2) \\\\ &= \\lim_{n \\rightarrow \\infty} E(\\bar{X_n})^2 + \\lim_{n \\rightarrow \\infty} Var(\\bar{X_n}) - 2 \\mu^2 + \\mu^2 \\\\ &= \\lim_{n \\rightarrow \\infty} \\frac{n^2 \\mu^2}{n^2} + \\lim_{n \\rightarrow \\infty} \\frac{\\sigma^2}{n} - \\mu^2 \\\\ &= \\mu^2 - \\mu^2 + \\lim_{n \\rightarrow \\infty} \\frac{\\sigma^2}{n} \\\\ &= 0. \\end{align}\\] "],["lt.html", "Chapter 12 Limit theorems", " Chapter 12 Limit theorems This chapter deals with limit theorems. The students are expected to acquire the following knowledge: Theoretical Monte Carlo integration convergence. Difference between weak and strong law of large numbers. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } Exercise 12.1 Show that Monte Carlo integration converges almost surely to the true integral of a bounded function. Solution. Let \\(g\\) be a function defined on \\(\\Omega\\). Let \\(X_i\\), \\(i = 1,...,n\\) be i.i.d. (multivariate) uniform random variables with bounds defined on \\(\\Omega\\). Let \\(Y_i\\) = \\(g(X_i)\\). Then it follows that \\(Y_i\\) are also i.i.d. random variables and their expected value is \\(E[g(X)] = \\int_{\\Omega} g(x) f_X(x) dx = \\frac{1}{V_{\\Omega}} \\int_{\\Omega} g(x) dx\\). By the strong law of large numbers, we have \\[\\begin{equation} \\frac{1}{n}\\sum_{i=1}^n Y_i \\xrightarrow{\\text{a.s.}} E[g(X)]. \\end{equation}\\] It follows that \\[\\begin{equation} V_{\\Omega} \\frac{1}{n}\\sum_{i=1}^n Y_i \\xrightarrow{\\text{a.s.}} \\int_{\\Omega} g(x) dx. \\end{equation}\\] Exercise 12.2 Let \\(X\\) be a geometric random variable with probability 0.5 and support in positive integers. Let \\(Y = 2^X (-1)^X X^{-1}\\). Find the expected value of \\(Y\\) by using conditional convergence (this variable does not have an expected value in the conventional sense – the series is not absolutely convergent). 
R: Draw \\(10000\\) samples from a geometric distribution with probability 0.5 and support in positive integers to get \\(X\\). Then calculate \\(Y\\) and plot the means at each iteration (sample). Additionally, plot the expected value calculated in a. Try it with different seeds. What do you notice? Solution. \\[\\begin{align*} E[Y] &= \\sum_{x=1}^{\\infty} \\frac{2^x (-1)^x}{x} 0.5^x \\\\ &= \\sum_{x=1}^{\\infty} \\frac{(-1)^x}{x} \\\\ &= - \\sum_{x=1}^{\\infty} \\frac{(-1)^{x+1}}{x} \\\\ &= - \\ln(2) \\end{align*}\\] set.seed(3) x <- rgeom(100000, prob = 0.5) + 1 y <- 2^x * (-1)^x * x^{-1} y_means <- cumsum(y) / seq_along(y) df <- data.frame(x = 1:length(y_means), y = y_means) ggplot(data = df, aes(x = x, y = y)) + geom_line() + geom_hline(yintercept = -log(2)) "],["eb.html", "Chapter 13 Estimation basics 13.1 ECDF 13.2 Properties of estimators", " Chapter 13 Estimation basics This chapter deals with estimation basics. The students are expected to acquire the following knowledge: Biased and unbiased estimators. Consistent estimators. Empirical cumulative distribution function. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } 13.1 ECDF Exercise 13.1 (ECDF intuition) Take any univariate continuous distribution that is readily available in R and plot its CDF (\\(F\\)). Draw one sample (\\(n = 1\\)) from the chosen distribution and draw the ECDF (\\(F_n\\)) of that one sample. Use the definition of the ECDF, not an existing function in R. Implementation hint: ECDFs are always piecewise constant - they only jump at the sampled values and by \\(1/n\\). Repeat (b) for \\(n = 5, 10, 100, 1000...\\) Theory says that \\(F_n\\) should converge to \\(F\\). Can you observe that? For \\(n = 100\\) repeat the process \\(m = 20\\) times and plot every \\(F_n^{(m)}\\). Theory says that \\(F_n\\) will converge to \\(F\\) the slowest where \\(F\\) is close to 0.5 (where the variance is largest). Can you observe that? library(ggplot2) set.seed(1) ggplot(data = data.frame(x = seq(-5, 5, by = 0.01))) + # stat_function(aes(x = x), fun = pbeta, args = list(shape1 = 1, shape2 = 2)) stat_function(aes(x = x), fun = pnorm, args = list(mean = 0, sd = 1)) one_samp <- rnorm(1) X <- data.frame(x = c(-5, one_samp, 5), y = c(0,1,1)) ggplot(data = data.frame(x = seq(-5, 5, by = 0.01))) + # stat_function(aes(x = x), fun = pbeta, args = list(shape1 = 1, shape2 = 2)) stat_function(aes(x = x), fun = pnorm, args = list(mean = 0, sd = 1)) + geom_step(data = X, aes(x = x, y = y)) N <- c(5, 10, 100, 1000) X <- NULL for (n in N) { tmp <- rnorm(n) tmp_X <- data.frame(x = c(-5, sort(tmp), 5), y = c(0, seq(1/n, 1, by = 1/n), 1), n = n) X <- rbind(X, tmp_X) } ggplot(data = data.frame(x = seq(-5, 5, by = 0.01))) + # stat_function(aes(x = x), fun = pbeta, args = list(shape1 = 1, shape2 = 2)) stat_function(aes(x = x), fun = pnorm, args = list(mean = 0, sd = 1)) + geom_step(data = X, aes(x = x, y = y, color = as.factor(n))) + labs(color = "N") 13.2 Properties of estimators Exercise 13.2 Show that the sample average is, as an estimator of the mean: unbiased, consistent, asymptotically normal. Solution. \\[\\begin{align*} E[\\frac{1}{n} \\sum_{i=1}^n X_i] &= \\frac{1}{n} \\sum_{i=i}^n E[X_i] \\\\ &= E[X]. 
\\end{align*}\\] \\[\\begin{align*} \\lim_{n \\rightarrow \\infty} P(|\\frac{1}{n} \\sum_{i=1}^n X_i - E[X]| > \\epsilon) &= \\lim_{n \\rightarrow \\infty} P((\\frac{1}{n} \\sum_{i=1}^n X_i - E[X])^2 > \\epsilon^2) \\\\ & \\leq \\lim_{n \\rightarrow \\infty} \\frac{E[(\\frac{1}{n} \\sum_{i=1}^n X_i - E[X])^2]}{\\epsilon^2} & \\text{Markov inequality} \\\\ & = \\lim_{n \\rightarrow \\infty} \\frac{E[(\\frac{1}{n} \\sum_{i=1}^n X_i)^2 - 2 \\frac{1}{n} \\sum_{i=1}^n X_i E[X] + E[X]^2]}{\\epsilon^2} \\\\ & = \\lim_{n \\rightarrow \\infty} \\frac{E[(\\frac{1}{n} \\sum_{i=1}^n X_i)^2] - 2 E[X]^2 + E[X]^2}{\\epsilon^2} \\\\ &= 0 \\end{align*}\\] For the last equality see the solution to ??. Follows directly from the CLT. Exercise 13.3 (Consistent but biased estimator) Show that sample variance (the plug-in estimator of variance) is a biased estimator of variance. Show that sample variance is a consistent estimator of variance. Show that the estimator with (\\(N-1\\)) (Bessel correction) is unbiased. Solution. \\[\\begin{align*} E[\\frac{1}{n} \\sum_{i=1}^n (Y_i - \\bar{Y})^2] &= \\frac{1}{n} \\sum_{i=1}^n E[(Y_i - \\bar{Y})^2] \\\\ &= \\frac{1}{n} \\sum_{i=1}^n E[Y_i^2] - 2 E[Y_i \\bar{Y}] + \\bar{Y}^2)] \\\\ &= \\frac{1}{n} \\sum_{i=1}^n E[Y_i^2 - 2 Y_i \\bar{Y} + \\bar{Y}^2] \\\\ &= \\frac{1}{n} \\sum_{i=1}^n E[Y_i^2 - \\frac{2}{n} Y_i^2 - \\frac{2}{n} \\sum_{i \\neq j} Y_i Y_j + \\frac{1}{n^2}\\sum_j \\sum_{k \\neq j} Y_j Y_k + \\frac{1}{n^2} \\sum_j Y_j^2] \\\\ &= \\frac{1}{n} \\sum_{i=1}^n \\frac{n - 2}{n} (\\sigma^2 + \\mu^2) - \\frac{2}{n} (n - 1) \\mu^2 + \\frac{1}{n^2}n(n-1)\\mu^2 + \\frac{1}{n^2}n(\\sigma^2 + \\mu^2) \\\\ &= \\frac{n-1}{n}\\sigma^2 \\\\ < \\sigma^2. \\end{align*}\\] Let \\(S_n\\) denote the sample variance. Then we can write it as \\[\\begin{align*} S_n &= \\frac{1}{n} \\sum_{i=1}^n (X_i - \\bar{X})^2 = \\frac{1}{n} \\sum_{i=1}^n (X_i - \\mu)^2 + 2(X_i - \\mu)(\\mu - \\bar{X}) + (\\mu - \\bar{X})^2. \\end{align*}\\] Now \\(\\bar{X}\\) converges in probability (by WLLN) to \\(\\mu\\) therefore the right terms converge in probability to zero. The left term converges in probability to \\(\\sigma^2\\), also by WLLN. Therefore the sample variance is a consistent estimatior of the variance. The denominator changes in the second-to-last line of a., therefore the last line is now equality. Exercise 13.4 (Estimating the median) Show that the sample median is an unbiased estimator of the median for N\\((\\mu, \\sigma^2)\\). Show that the sample median is an unbiased estimator of the mean for any distribution with symmetric density. Hint 1: The pdf of an order statistic is \\(f_{X_{(k)}}(x) = \\frac{n!}{(n - k)!(k - 1)!}f_X(x)\\Big(F_X(x)^{k-1} (1 - F_X(x)^{n - k}) \\Big)\\). Hint 2: A distribution is symmetric when \\(X\\) and \\(2a - X\\) have the same distribution for some \\(a\\). Solution. Let \\(Z_i\\), \\(i = 1,...,n\\) be i.i.d. variables with a symmetric distribution and let \\(Z_{k:n}\\) denote the \\(k\\)-th order statistic. We will distinguish two cases, when \\(n\\) is odd and when \\(n\\) is even. Let first \\(n = 2m + 1\\) be odd. Then the sample median is \\(M = Z_{m+1:2m+1}\\). Its PDF is \\[\\begin{align*} f_M(x) = (m+1)\\binom{2m + 1}{m}f_Z(x)\\Big(F_Z(x)^m (1 - F_Z(x)^m) \\Big). \\end{align*}\\] For every symmetric distribution, it holds that \\(F_X(x) = 1 - F(2a - x)\\). Let \\(a = \\mu\\), the population mean. Plugging this into the PDF, we get that \\(f_M(x) = f_M(2\\mu -x)\\). 
It follows that \\[\\begin{align*} E[M] &= E[2\\mu - M] \\\\ 2E[M] &= 2\\mu \\\\ E[M] &= \\mu. \\end{align*}\\] Now let \\(n = 2m\\) be even. Then the sample median is \\(M = \\frac{Z_{m:2m} + Z_{m+1:2m}}{2}\\). It can be shown, that the joint PDF of these terms is also symmetric. Therefore, similar to the above \\[\\begin{align*} E[M] &= E[\\frac{Z_{m:2m} + Z_{m+1:2m}}{2}] \\\\ &= E[\\frac{2\\mu - M + 2\\mu - M}{2}] \\\\ &= E[2\\mu - M]. \\end{align*}\\] The above also proves point a. as the median and the mean are the same in normal distribution. Exercise 13.5 (Matrix trace estimation) The Hutchinson trace estimator [1] is an estimator of the trace of a symmetric positive semidefinite matrix A that relies on Monte Carlo sampling. The estimator is defined as \\[\\begin{align*} \\textrm{tr}(A) \\approx \\frac{1}{n} \\Sigma_{i=1}^n z_i^T A z_i, &\\\\ z_i \\sim_{\\mathrm{IID}} \\textrm{Uniform}(\\{-1, 1\\}^m), & \\end{align*}\\] where \\(A \\in \\mathbb{R}^{m \\times m}\\) is a symmetric positive semidefinite matrix. Elements of each vector \\(z_i\\) are either \\(-1\\) or \\(1\\) with equal probability. This is also called a Rademacher distribution. Data scientists often want the trace of a Hessian to obtain valuable curvature information for a loss function. Per [2], an example is classifying ten digits based on \\((28,28)\\) grayscale images (i.e. MNIST data) using logistic regression. The number of parameters is \\(m = 28^2 \\cdot 10 = 7840\\) and the size of the Hessian is \\(m^2\\), roughly \\(6 \\cdot 10^6\\). The diagonal average is equal to the average eigenvalue, which may be useful for optimization; in MCMC contexts, this would be useful for preconditioners and step size optimization. Computing Hessians (as a means of getting eigenvalue information) is often intractable, but Hessian-vector products can be computed faster by autodifferentiation (with e.g. Tensorflow, Pytorch, Jax). This is one motivation for the use of a stochastic trace estimator as outlined above. References: A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines (Hutchinson, 1990) A Modern Analysis of Hutchinson’s Trace Estimator (Skorski, 2020) Prove that the Hutchinson trace estimator is an unbiased estimator of the trace. Solution. We first simplify our task: \\[\\begin{align} \\mathbb{E}\\left[\\frac{1}{n} \\Sigma_{i=1}^n z_i^T A z_i \\right] &= \\frac{1}{n} \\Sigma_{i=1}^n \\mathbb{E}\\left[z_i^T A z_i \\right] \\\\ &= \\mathbb{E}\\left[z_i^T A z_i \\right], \\end{align}\\] where the second equality is due to having \\(n\\) IID vectors \\(z_i\\). We now only need to show that \\(\\mathbb{E}\\left[z^T A z \\right] = \\mathrm{tr}(A)\\). We omit the index due to all vectors being IID: \\[\\begin{align} \\mathrm{tr}(A) &= \\mathrm{tr}(AI) \\\\ &= \\mathrm{tr}(A\\mathbb{E}[zz^T]) \\\\ &= \\mathbb{E}[\\mathrm{tr}(Azz^T)] \\\\ &= \\mathbb{E}[\\mathrm{tr}(z^TAz)] \\\\ &= \\mathbb{E}[z^TAz]. \\end{align}\\] This concludes the proof. We clarify some equalities below. The second equality assumes that \\(\\mathbb{E}[zz^T] = I\\). By noting that the mean of the Rademacher distribution is 0, we have \\[\\begin{align} \\mathrm{Cov}[z, z] &= \\mathbb{E}[(z - \\mathbb{E}[z])(z - \\mathbb{E}[z])^T] \\\\ &= \\mathbb{E}[zz^T]. \\end{align}\\] Dimensions of \\(z\\) are independent, so \\(\\mathrm{Cov}[z, z]_{ij} = 0\\) for \\(i \\neq j\\). 
The diagonal will contain variances, which are equal to \\(1\\) for all dimensions \\(k = 1 \\dots m\\): \\(\\mathrm{Var}[z^{(k)}] = \\mathbb{E}[z^{(k)}z^{(k)}] - \\mathbb{E}[z^{(k)}]^2 = 1 - 0 = 1\\). It follows that the covariance is an identity matrix. Note that this is a general result for vectors with IID dimensions sampled from a distribution with mean 0 and variance 1. We could therefore use something else instead of the Rademacher, e.g. \\(z ~ N(0, I)\\). The third equality uses the fact that the expectation of a trace equals the trace of an expectation. If \\(X\\) is a random matrix, then \\(\\mathbb{E}[X]_{ij} = \\mathbb{E}[X_{ij}]\\). Therefore: \\[\\begin{align} \\mathrm{tr}(\\mathbb{E}[X]) &= \\Sigma_{i=1}^m(\\mathbb{E}[X]_{ii}) \\\\ &= \\Sigma_{i=1}^m(\\mathbb{E}[X_{ii}]) \\\\ &= \\mathbb{E}[\\Sigma_{i=1}^m(X_{ii})] \\\\ &= \\mathbb{E}[\\mathrm{tr}(X)], \\end{align}\\] where we used the linearity of the expectation in the third step. The fourth equality uses the fact that \\(\\mathrm{tr}(AB) = \\mathrm{tr}(BA)\\) for any matrices \\(A \\in \\mathbb{R}^{n \\times m}, B \\in \\mathbb{R}^{m \\times n}\\). The last inequality uses the fact that the trace of a \\(1 \\times 1\\) matrix is just its element. "],["boot.html", "Chapter 14 Bootstrap", " Chapter 14 Bootstrap This chapter deals with bootstrap. The students are expected to acquire the following knowledge: How to use bootstrap to generate coverage intervals. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } Exercise 14.1 Ideally, a \\(1-\\alpha\\) CI would have \\(1-\\alpha\\) coverage. That is, say a 95% CI should, in the long run, contain the true value of the parameter 95% of the time. In practice, it is impossible to assess the coverage of our CI method, because we rarely know the true parameter. In simulation, however, we can. Let’s assess the coverage of bootstrap percentile intervals. Pick a univariate distribution with readily available mean and one that you can easily sample from. Draw \\(n = 30\\) random samples from the chosen distribution and use the bootstrap (with large enough m) and percentile CI method to construct 95% CI. Repeat the process many times and count how many times the CI contains the true mean. That is, compute the actual coverage probability (don’t forget to include the standard error of the coverage probability!). What can you observe? Try one or two different distributions. What can you observe? Repeat (b) and (c) using BCa intervals (R package boot). How does the coverage compare to percentile intervals? As (d) but using intervals based on asymptotic normality (+/- 1.96 SE). How do results from (b), (d), and (e) change if we increase the sample size to n = 200? What about n = 5? library(boot) set.seed(0) nit <- 1000 # Repeat the process "many times" alpha <- 0.05 # CI parameter nboot <- 100 # m parameter for bootstrap ("large enough m") # f: change this to 200 or 5. nsample <- 30 # n = 30 random samples from the chosen distribution. Comment out BCa code if it breaks. 
covers <- matrix(nrow = nit, ncol = 3) covers_BCa <- matrix(nrow = nit, ncol = 3) covers_asymp_norm <- matrix(nrow = nit, ncol = 3) isin <- function (x, lower, upper) { (x > lower) & (x < upper) } for (j in 1:nit) { # Repeating many times # a: pick a univariate distribution - standard normal x1 <- rnorm(nsample) # c: one or two different distributions - beta and poisson x2 <- rbeta(nsample, 1, 2) x3 <- rpois(nsample, 5) X1 <- matrix(data = NA, nrow = nsample, ncol = nboot) X2 <- matrix(data = NA, nrow = nsample, ncol = nboot) X3 <- matrix(data = NA, nrow = nsample, ncol = nboot) for (i in 1:nboot) { X1[ ,i] <- sample(x1, nsample, replace = T) X2[ ,i] <- sample(x2, nsample, T) X3[ ,i] <- sample(x3, nsample, T) } X1_func <- apply(X1, 2, mean) X2_func <- apply(X2, 2, mean) X3_func <- apply(X3, 2, mean) X1_quant <- quantile(X1_func, probs = c(alpha / 2, 1 - alpha / 2)) X2_quant <- quantile(X2_func, probs = c(alpha / 2, 1 - alpha / 2)) X3_quant <- quantile(X3_func, probs = c(alpha / 2, 1 - alpha / 2)) covers[j,1] <- (0 > X1_quant[1]) & (0 < X1_quant[2]) covers[j,2] <- ((1 / 3) > X2_quant[1]) & ((1 / 3) < X2_quant[2]) covers[j,3] <- (5 > X3_quant[1]) & (5 < X3_quant[2]) mf <- function (x, i) return(mean(x[i])) bootX1 <- boot(x1, statistic = mf, R = nboot) bootX2 <- boot(x2, statistic = mf, R = nboot) bootX3 <- boot(x3, statistic = mf, R = nboot) X1_quant_BCa <- boot.ci(bootX1, type = "bca")$bca X2_quant_BCa <- boot.ci(bootX2, type = "bca")$bca X3_quant_BCa <- boot.ci(bootX3, type = "bca")$bca covers_BCa[j,1] <- (0 > X1_quant_BCa[4]) & (0 < X1_quant_BCa[5]) covers_BCa[j,2] <- ((1 / 3) > X2_quant_BCa[4]) & ((1 / 3) < X2_quant_BCa[5]) covers_BCa[j,3] <- (5 > X3_quant_BCa[4]) & (5 < X3_quant_BCa[5]) # e: estimate mean and standard error # sample mean: x1_bar <- mean(x1) x2_bar <- mean(x2) x3_bar <- mean(x3) # standard error (of the sample mean) estimate: sample standard deviation / sqrt(n) x1_bar_SE <- sd(x1) / sqrt(nsample) x2_bar_SE <- sd(x2) / sqrt(nsample) x3_bar_SE <- sd(x3) / sqrt(nsample) covers_asymp_norm[j,1] <- isin(0, x1_bar - 1.96 * x1_bar_SE, x1_bar + 1.96 * x1_bar_SE) covers_asymp_norm[j,2] <- isin(1/3, x2_bar - 1.96 * x2_bar_SE, x2_bar + 1.96 * x2_bar_SE) covers_asymp_norm[j,3] <- isin(5, x3_bar - 1.96 * x3_bar_SE, x3_bar + 1.96 * x3_bar_SE) } apply(covers, 2, mean) ## [1] 0.918 0.925 0.905 apply(covers, 2, sd) / sqrt(nit) ## [1] 0.008680516 0.008333333 0.009276910 apply(covers_BCa, 2, mean) ## [1] 0.927 0.944 0.927 apply(covers_BCa, 2, sd) / sqrt(nit) ## [1] 0.008230355 0.007274401 0.008230355 apply(covers_asymp_norm, 2, mean) ## [1] 0.939 0.937 0.930 apply(covers_asymp_norm, 2, sd) / sqrt(nit) ## [1] 0.007572076 0.007687008 0.008072494 Exercise 14.2 You are given a sample of independent observations from a process of interest: Index 1 2 3 4 5 6 7 8 X 7 2 4 6 4 5 9 10 Compute the plug-in estimate of mean and 95% symmetric CI based on asymptotic normality. Use the plug-in estimate of SE. Same as (a), but use the unbiased estimate of SE. Apply nonparametric bootstrap with 1000 bootstrap replications and estimate the 95% CI for the mean with percentile-based CI. 
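Before the worked solutions, the percentile-bootstrap recipe that the code below applies repeatedly can be collected into one small helper. This is only an illustrative sketch; the function name boot_percentile_ci and its defaults are ours, not part of the original solution code. # generic percentile bootstrap CI for any statistic of a univariate sample boot_percentile_ci <- function (x, statistic = mean, m = 1000, alpha = 0.05) { n <- length(x) stats <- vapply(1:m, function (i) statistic(sample(x, n, replace = TRUE)), numeric(1)) quantile(stats, probs = c(alpha / 2, 1 - alpha / 2)) } # example: 95% percentile CI for the mean of the sample in Exercise 14.2 boot_percentile_ci(c(7, 2, 4, 6, 4, 5, 9, 10))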
# a x <- c(7, 2, 4, 6, 4, 5, 9, 10) n <- length(x) mu <- mean(x) SE <- sqrt(mean((x - mu)^2)) / sqrt(n) SE ## [1] 0.8915839 z <- qnorm(1 - 0.05 / 2) c(mu - z * SE, mu + z * SE) ## [1] 4.127528 7.622472 # b SE <- sd(x) / sqrt(n) SE ## [1] 0.9531433 c(mu - z * SE, mu + z * SE) ## [1] 4.006873 7.743127 # c set.seed(0) m <- 1000 T_mean <- function(x) {mean(x)} est_boot <- array(NA, m) for (i in 1:m) { x_boot <- x[sample(1:n, n, rep = T)] est_boot[i] <- T_mean(x_boot) } quantile(est_boot, p = c(0.025, 0.975)) ## 2.5% 97.5% ## 4.250 7.625 Exercise 14.3 We are given a sample of 10 independent paired (bivariate) observations: Index 1 2 3 4 5 6 7 8 9 10 X 1.26 -0.33 1.33 1.27 0.41 -1.54 -0.93 -0.29 -0.01 2.40 Y 2.64 0.33 0.48 0.06 -0.88 -2.14 -2.21 0.95 0.83 1.45 Compute Pearson correlation between X and Y. Use the cor.test() from R to estimate a 95% CI for the estimate from (a). Apply nonparametric bootstrap with 1000 bootstrap replications and estimate the 95% CI for the Pearson correlation with percentile-based CI. Compare CI from (b) and (c). Are they similar? How would the bootstrap estimation of CI change if we were interested in Spearman or Kendall correlation instead? x <- c(1.26, -0.33, 1.33, 1.27, 0.41, -1.54, -0.93, -0.29, -0.01, 2.40) y <- c(2.64, 0.33, 0.48, 0.06, -0.88, -2.14, -2.21, 0.95, 0.83, 1.45) # a cor(x, y) ## [1] 0.6991247 # b res <- cor.test(x, y) res$conf.int[1:2] ## [1] 0.1241458 0.9226238 # c set.seed(0) m <- 1000 n <- length(x) T_cor <- function(x, y) {cor(x, y)} est_boot <- array(NA, m) for (i in 1:m) { idx <- sample(1:n, n, rep = T) # !!! important to use same indices to keep dependency between x and y est_boot[i] <- T_cor(x[idx], y[idx]) } quantile(est_boot, p = c(0.025, 0.975)) ## 2.5% 97.5% ## 0.2565537 0.9057664 # d # Yes, but the bootstrap CI is more narrow. # e # We just use the functions for Kendall/Spearman coefficients instead: T_kendall <- function(x, y) {cor(x, y, method = "kendall")} T_spearman <- function(x, y) {cor(x, y, method = "spearman")} # Put this in a function that returns the CI bootstrap_95_ci <- function(x, y, t, m = 1000) { n <- length(x) est_boot <- array(NA, m) for (i in 1:m) { idx <- sample(1:n, n, rep = T) # !!! important to use same indices to keep dependency between x and y est_boot[i] <- t(x[idx], y[idx]) } quantile(est_boot, p = c(0.025, 0.975)) } bootstrap_95_ci(x, y, T_kendall) ## 2.5% 97.5% ## -0.08108108 0.78378378 bootstrap_95_ci(x, y, T_spearman) ## 2.5% 97.5% ## -0.1701115 0.8867925 Exercise 14.4 In this problem we will illustrate the use of the nonparametric bootstrap for estimating CIs of regression model coefficients. Load the longley dataset from base R with data(longley). Use lm() to apply linear regression using “Employed” as the target (dependent) variable and all other variables as the predictors (independent). Using lm() results, print the estimated regression coefficients and standard errors. Estimate 95% CI for the coefficients using +/- 1.96 * SE. Use nonparametric bootstrap with 100 replications to estimate the SE of the coefficients from (b). Compare the SE from (c) with those from (b). # a data(longley) # b res <- lm(Employed ~ . 
, longley) tmp <- data.frame(summary(res)$coefficients[,1:2]) tmp$LB <- tmp[,1] - 1.96 * tmp[,2] tmp$UB <- tmp[,1] + 1.96 * tmp[,2] tmp ## Estimate Std..Error LB UB ## (Intercept) -3.482259e+03 8.904204e+02 -5.227483e+03 -1.737035e+03 ## GNP.deflator 1.506187e-02 8.491493e-02 -1.513714e-01 1.814951e-01 ## GNP -3.581918e-02 3.349101e-02 -1.014616e-01 2.982320e-02 ## Unemployed -2.020230e-02 4.883997e-03 -2.977493e-02 -1.062966e-02 ## Armed.Forces -1.033227e-02 2.142742e-03 -1.453204e-02 -6.132495e-03 ## Population -5.110411e-02 2.260732e-01 -4.942076e-01 3.919994e-01 ## Year 1.829151e+00 4.554785e-01 9.364136e-01 2.721889e+00 # c set.seed(0) m <- 100 n <- nrow(longley) T_coef <- function(x) { lm(Employed ~ . , x)$coefficients } est_boot <- array(NA, c(m, ncol(longley))) for (i in 1:m) { idx <- sample(1:n, n, rep = T) est_boot[i,] <- T_coef(longley[idx,]) } SE <- apply(est_boot, 2, sd) SE ## [1] 1.826011e+03 1.605981e-01 5.693746e-02 8.204892e-03 3.802225e-03 ## [6] 3.907527e-01 9.414436e-01 # Show the standard errors around coefficients library(ggplot2) library(reshape2) df <- data.frame(index = 1:7, bootstrap_SE = SE, lm_SE = tmp$Std..Error) melted_df <- melt(df[2:nrow(df), ], id.vars = "index") # Ignore bias which has a really large magnitude ggplot(melted_df, aes(x = index, y = value, fill = variable)) + geom_bar(stat="identity", position="dodge") + xlab("Coefficient") + ylab("Standard error") # + scale_y_continuous(trans = "log") # If you want to also plot bias Exercise 14.5 This exercise shows a shortcoming of the bootstrap method when using the plug in estimator for the maximum. Compute the 95% bootstrap CI for the maximum of a standard normal distribution. Compute the 95% bootstrap CI for the maximum of a binomial distribution with n = 15 and p = 0.2. Repeat (b) using p = 0.9. Why is the result different? # bootstrap CI for maximum alpha <- 0.05 T_max <- function(x) {max(x)} # Equal to T_max = max bootstrap <- function(x, t, m = 1000) { n <- length(x) values <- rep(0, m) for (i in 1:m) { values[i] <- t(sample(x, n, replace = T)) } quantile(values, probs = c(alpha / 2, 1 - alpha / 2)) } # a # Meaningless, as the normal distribution can yield arbitrarily large values. x <- rnorm(100) bootstrap(x, T_max) ## 2.5% 97.5% ## 1.819425 2.961743 # b x <- rbinom(100, size = 15, prob = 0.2) # min = 0, max = 15 bootstrap(x, T_max) ## 2.5% 97.5% ## 6 7 # c x <- rbinom(100, size = 15, prob = 0.9) # min = 0, max = 15 bootstrap(x, T_max) ## 2.5% 97.5% ## 15 15 # Observation: to estimate the maximum, we need sufficient probability mass near the maximum value the distribution can yield. # Using bootstrap is pointless when there is too little mass near the true maximum. # In general, bootstrap will fail when estimating the CI for the maximum. "],["ml.html", "Chapter 15 Maximum likelihood 15.1 Deriving MLE 15.2 Fisher information 15.3 The German tank problem", " Chapter 15 Maximum likelihood This chapter deals with maximum likelihood estimation. The students are expected to acquire the following knowledge: How to derive MLE. Applying MLE in R. Calculating and interpreting Fisher information. Practical use of MLE. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } 15.1 Deriving MLE Exercise 15.1 Derive the maximum likelihood estimator of variance for N\\((\\mu, \\sigma^2)\\). Compare with results from 13.3. What does that say about the MLE estimator? Solution. 
The mean is assumed constant, so we have the likelihood \\[\\begin{align} L(\\sigma^2; y) &= \\prod_{i=1}^n \\frac{1}{\\sqrt{2 \\pi \\sigma^2}} e^{-\\frac{(y_i - \\mu)^2}{2 \\sigma^2}} \\\\ &= \\frac{1}{\\sqrt{2 \\pi \\sigma^2}^n} e^{\\frac{-\\sum_{i=1}^n (y_i - \\mu)^2}{2 \\sigma^2}} \\end{align}\\] We need to find the maximum of this function. We first observe that we can replace \\(\\frac{-\\sum_{i=1}^n (y_i - \\mu)^2}{2}\\) with a constant \\(c\\), since none of the terms are dependent on \\(\\sigma^2\\). Additionally, the term \\(\\frac{1}{\\sqrt{2 \\pi}^n}\\) does not affect the calculation of the maximum. So now we have \\[\\begin{align} L(\\sigma^2; y) &= (\\sigma^2)^{-\\frac{n}{2}} e^{\\frac{c}{\\sigma^2}}. \\end{align}\\] Differentiating we get \\[\\begin{align} \\frac{d}{d \\sigma^2} L(\\sigma^2; y) &= (\\sigma^2)^{-\\frac{n}{2}} \\frac{d}{d \\sigma^2} e^{\\frac{c}{\\sigma^2}} + e^{\\frac{c}{\\sigma^2}} \\frac{d}{d \\sigma^2} (\\sigma^2)^{-\\frac{n}{2}} \\\\ &= - (\\sigma^2)^{-\\frac{n}{2}} e^{\\frac{c}{\\sigma^2}} \\frac{c}{(\\sigma^2)^2} - e^{\\frac{c}{\\sigma^2}} \\frac{n}{2} (\\sigma^2)^{-\\frac{n + 2}{2}} \\\\ &= - (\\sigma^2)^{-\\frac{n + 4}{2}} e^{\\frac{c}{\\sigma^2}} c - e^{\\frac{c}{\\sigma^2}} \\frac{n}{2} (\\sigma^2)^{-\\frac{n + 2}{2}} \\\\ &= - e^{\\frac{c}{\\sigma^2}} (\\sigma^2)^{-\\frac{n + 4}{2}} \\Big(c + \\frac{n}{2}\\sigma^2 \\Big). \\end{align}\\] To get the maximum, this has to equal to 0, so \\[\\begin{align} c + \\frac{n}{2}\\sigma^2 &= 0 \\\\ \\sigma^2 &= -\\frac{2c}{n} \\\\ \\sigma^2 &= \\frac{\\sum_{i=1}^n (Y_i - \\mu)^2}{n}. \\end{align}\\] The MLE estimator is biased. Exercise 15.2 (Multivariate normal distribution) Derive the maximum likelihood estimate for the mean and covariance matrix of the multivariate normal. Simulate \\(n = 40\\) samples from a bivariate normal distribution (choose non-trivial parameters, that is, mean \\(\\neq 0\\) and covariance \\(\\neq 0\\)). Compute the MLE for the sample. Overlay the data with an ellipse that is determined by the MLE and an ellipse that is determined by the chosen true parameters. Repeat b. several times and observe how the estimates (ellipses) vary around the true value. Hint: For the derivation of MLE, these identities will be helpful: \\(\\frac{\\partial b^T a}{\\partial a} = \\frac{\\partial a^T b}{\\partial a} = b\\), \\(\\frac{\\partial a^T A a}{\\partial a} = (A + A^T)a\\), \\(\\frac{\\partial \\text{tr}(BA)}{\\partial A} = B^T\\), \\(\\frac{\\partial \\ln |A|}{\\partial A} = (A^{-1})^T\\), \\(a^T A a = \\text{tr}(a^T A a) = \\text{tr}(a a^T A) = \\text{tr}(Aaa^T)\\). Solution. The log likelihood of the MVN distribution is \\[\\begin{align*} l(\\mu, \\Sigma ; x) &= -\\frac{1}{2}\\Big(\\sum_{i=1}^n k\\ln(2\\pi) + |\\Sigma| + (x_i - \\mu)^T \\Sigma^{-1} (x_i - \\mu)\\Big) \\\\ &= -\\frac{n}{2}\\ln|\\Sigma| + -\\frac{1}{2}\\Big(\\sum_{i=1}^n(x_i - \\mu)^T \\Sigma^{-1} (x_i - \\mu)\\Big) + c, \\end{align*}\\] where \\(c\\) is a constant with respect to \\(\\mu\\) and \\(\\Sigma\\). To find the MLE we first need to find partial derivatives. Let us start with \\(\\mu\\). \\[\\begin{align*} \\frac{\\partial}{\\partial \\mu}l(\\mu, \\Sigma ; x) &= \\frac{\\partial}{\\partial \\mu} -\\frac{1}{2}\\Big(\\sum_{i=1}^n x_i^T \\Sigma^{-1} x_i - x_i^T \\Sigma^{-1} \\mu - \\mu^T \\Sigma^{-1} x_i + \\mu^T \\Sigma^{-1} \\mu \\Big) \\\\ &= -\\frac{1}{2}\\Big(\\sum_{i=1}^n - \\Sigma^{-1} x_i - \\Sigma^{-1} x_i + 2 \\Sigma^{-1} \\mu \\Big) \\\\ &= -\\Sigma^{-1}\\Big(\\sum_{i=1}^n - x_i + \\mu \\Big). 
\\end{align*}\\] Equating above with zero, we get \\[\\begin{align*} \\sum_{i=1}^n - x_i + \\mu &= 0 \\\\ \\hat{\\mu} = \\frac{1}{n} \\sum_{i=1}^n x_i, \\end{align*}\\] which is the dimension-wise empirical mean. Now for the covariance matrix \\[\\begin{align*} \\frac{\\partial}{\\partial \\Sigma^{-1}}l(\\mu, \\Sigma ; x) &= \\frac{\\partial}{\\partial \\Sigma^{-1}} -\\frac{n}{2}\\ln|\\Sigma| + -\\frac{1}{2}\\Big(\\sum_{i=1}^n(x_i - \\mu)^T \\Sigma^{-1} (x_i - \\mu)\\Big) \\\\ &= \\frac{\\partial}{\\partial \\Sigma^{-1}} -\\frac{n}{2}\\ln|\\Sigma| + -\\frac{1}{2}\\Big(\\sum_{i=1}^n \\text{tr}((x_i - \\mu)^T \\Sigma^{-1} (x_i - \\mu))\\Big) \\\\ &= \\frac{\\partial}{\\partial \\Sigma^{-1}} -\\frac{n}{2}\\ln|\\Sigma| + -\\frac{1}{2}\\Big(\\sum_{i=1}^n \\text{tr}((\\Sigma^{-1} (x_i - \\mu) (x_i - \\mu)^T )\\Big) \\\\ &= \\frac{n}{2}\\Sigma + -\\frac{1}{2}\\Big(\\sum_{i=1}^n (x_i - \\mu) (x_i - \\mu)^T \\Big). \\end{align*}\\] Equating above with zero, we get \\[\\begin{align*} \\hat{\\Sigma} = \\frac{1}{n}\\sum_{i=1}^n (x_i - \\mu) (x_i - \\mu)^T. \\end{align*}\\] set.seed(1) n <- 40 mu <- c(1, -2) Sigma <- matrix(data = c(2, -1.6, -1.6, 1.8), ncol = 2) X <- mvrnorm(n = n, mu = mu, Sigma = Sigma) colnames(X) <- c("X1", "X2") X <- as.data.frame(X) # plot.new() tru_ellip <- ellipse(mu, Sigma, draw = FALSE) colnames(tru_ellip) <- c("X1", "X2") tru_ellip <- as.data.frame(tru_ellip) mu_est <- apply(X, 2, mean) tmp <- as.matrix(sweep(X, 2, mu_est)) Sigma_est <- (1 / n) * t(tmp) %*% tmp est_ellip <- ellipse(mu_est, Sigma_est, draw = FALSE) colnames(est_ellip) <- c("X1", "X2") est_ellip <- as.data.frame(est_ellip) ggplot(data = X, aes(x = X1, y = X2)) + geom_point() + geom_path(data = tru_ellip, aes(x = X1, y = X2, color = "truth")) + geom_path(data = est_ellip, aes(x = X1, y = X2, color = "estimated")) + labs(color = "type") Exercise 15.3 (Logistic regression) Logistic regression is a popular discriminative model when our target variable is binary (categorical with 2 values). One of the ways of looking at logistic regression is that it is linear regression but instead of using the linear term as the mean of a normal RV, we use it as the mean of a Bernoulli RV. Of course, the mean of a Bernoulli is bounded on \\([0,1]\\), so, to avoid non-sensical values, we squeeze the linear between 0 and 1 with the inverse logit function inv_logit\\((z) = 1 / (1 + e^{-z})\\). This leads to the following model: \\(y_i | \\beta, x_i \\sim \\text{Bernoulli}(\\text{inv_logit}(\\beta x_i))\\). Explicitly write the likelihood function of beta. Implement the likelihood function in R. Use black-box box-constraint optimization (for example, optim() with L-BFGS) to find the maximum likelihood estimate for beta for \\(x\\) and \\(y\\) defined below. Plot the estimated probability as a function of the independent variable. Compare with the truth. Let \\(y2\\) be a response defined below. Will logistic regression work well on this dataset? Why not? How can we still use the model, without changing it? inv_log <- function (z) { return (1 / (1 + exp(-z))) } set.seed(1) x <- rnorm(100) y <- rbinom(100, size = 1, prob = inv_log(1.2 * x)) y2 <- rbinom(100, size = 1, prob = inv_log(1.2 * x + 1.4 * x^2)) Solution. \\[\\begin{align*} l(\\beta; x, y) &= p(y | x, \\beta) \\\\ &= \\ln(\\prod_{i=1}^n \\text{inv_logit}(\\beta x_i)^{y_i} (1 - \\text{inv_logit}(\\beta x_i))^{1 - y_i}) \\\\ &= \\sum_{i=1}^n y_i \\ln(\\text{inv_logit}(\\beta x_i)) + (1 - y_i) \\ln(1 - \\text{inv_logit}(\\beta x_i)). 
\\end{align*}\\] set.seed(1) inv_log <- function (z) { return (1 / (1 + exp(-z))) } x <- rnorm(100) y <- x y <- rbinom(100, size = 1, prob = inv_log(1.2 * x)) l_logistic <- function (beta, X, y) { logl <- -sum(y * log(inv_log(as.vector(beta %*% X))) + (1 - y) * log((1 - inv_log(as.vector(beta %*% X))))) return(logl) } my_optim <- optim(par = 0.5, fn = l_logistic, method = "L-BFGS-B", lower = 0, upper = 10, X = x, y = y) my_optim$par ## [1] 1.166558 truth_p <- data.frame(x = x, prob = inv_log(1.2 * x), type = "truth") est_p <- data.frame(x = x, prob = inv_log(my_optim$par * x), type = "estimated") plot_df <- rbind(truth_p, est_p) ggplot(data = plot_df, aes(x = x, y = prob, color = type)) + geom_point(alpha = 0.3) y2 <- rbinom(2000, size = 1, prob = inv_log(1.2 * x + 1.4 * x^2)) X2 <- cbind(x, x^2) my_optim2 <- optim(par = c(0, 0), fn = l_logistic, method = "L-BFGS-B", lower = c(0, 0), upper = c(2, 2), X = t(X2), y = y2) my_optim2$par ## [1] 1.153656 1.257649 tmp <- sweep(data.frame(x = x, x2 = x^2), 2, my_optim2$par, FUN = "*") tmp <- tmp[ ,1] + tmp[ ,2] truth_p <- data.frame(x = x, prob = inv_log(1.2 * x + 1.4 * x^2), type = "truth") est_p <- data.frame(x = x, prob = inv_log(tmp), type = "estimated") plot_df <- rbind(truth_p, est_p) ggplot(data = plot_df, aes(x = x, y = prob, color = type)) + geom_point(alpha = 0.3) Exercise 15.4 (Linear regression) For the data generated below, do the following: Compute the least squares (MLE) estimate of coefficients beta using the matrix exact solution. Compute the MLE by minimizing the sum of squared residuals using black-box optimization (optim()). Compute the MLE by using the output built-in linear regression (lm() ). Compare (a-c and the true coefficients). Compute 95% CI on the beta coefficients using the output of built-in linear regression. Compute 95% CI on the beta coefficients by using (a or b) and the bootstrap with percentile method for CI. Compare with d. 
set.seed(1) n <- 100 x1 <- rnorm(n) x2 <- rnorm(n) x3 <- rnorm(n) X <- cbind(x1, x2, x3) beta <- c(0.2, 0.6, -1.2) y <- as.vector(t(beta %*% t(X))) + rnorm(n, sd = 0.2) set.seed(1) n <- 100 x1 <- rnorm(n) x2 <- rnorm(n) x3 <- rnorm(n) X <- cbind(x1, x2, x3) beta <- c(0.2, 0.6, -1.2) y <- as.vector(t(beta %*% t(X))) + rnorm(n, sd = 0.2) LS_fun <- function (beta, X, y) { return(sum((y - beta %*% t(X))^2)) } my_optim <- optim(par = c(0, 0, 0), fn = LS_fun, lower = -5, upper = 5, X = X, y = y, method = "L-BFGS-B") my_optim$par ## [1] 0.1898162 0.5885946 -1.1788264 df <- data.frame(y = y, x1 = x1, x2 = x2, x3 = x3) my_lm <- lm(y ~ x1 + x2 + x3 - 1, data = df) my_lm ## ## Call: ## lm(formula = y ~ x1 + x2 + x3 - 1, data = df) ## ## Coefficients: ## x1 x2 x3 ## 0.1898 0.5886 -1.1788 # matrix solution beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y beta_hat ## [,1] ## x1 0.1898162 ## x2 0.5885946 ## x3 -1.1788264 out <- summary(my_lm) out$coefficients[ ,2] ## x1 x2 x3 ## 0.02209328 0.02087542 0.01934506 # bootstrap CI nboot <- 1000 beta_boot <- matrix(data = NA, ncol = length(beta), nrow = nboot) for (i in 1:nboot) { inds <- sample(1:n, n, replace = T) new_df <- df[inds, ] X_tmp <- as.matrix(new_df[ ,-1]) y_tmp <- new_df[ ,1] # print(nrow(new_df)) tmp_beta <- solve(t(X_tmp) %*% X_tmp) %*% t(X_tmp) %*% y_tmp beta_boot[i, ] <- tmp_beta } apply(beta_boot, 2, mean) ## [1] 0.1893281 0.5887068 -1.1800738 apply(beta_boot, 2, quantile, probs = c(0.025, 0.975)) ## [,1] [,2] [,3] ## 2.5% 0.1389441 0.5436911 -1.221560 ## 97.5% 0.2386295 0.6363102 -1.140416 out$coefficients[ ,2] ## x1 x2 x3 ## 0.02209328 0.02087542 0.01934506 Exercise 15.5 (Principal component analysis) Load the olympic data set from package ade4. The data show decathlon results for 33 men in 1988 Olympic Games. This data set serves as a great example of finding the latent structure in the data, as there are certain characteristics of the athletes that make them excel at different events. For example an explosive athlete will do particulary well in sprints and long jumps. Perform PCA (prcomp) on the data set and interpret the first 2 latent dimensions. Hint: Standardize the data first to get meaningful results. Use MLE to estimate the covariance of the standardized multivariate distribution. Decompose the estimated covariance matrix with the eigendecomposition. Compare the eigenvectors to the output of PCA. 
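One detail to keep in mind for part (c): eigenvectors are only identified up to sign, so corresponding columns of the eigen() and prcomp() output may be flipped. A minimal self-contained check (assuming the ade4 package is installed; this sketch is ours, not part of the original solution): data(olympic, package = "ade4") Z <- scale(olympic$tab) ev <- eigen(crossprod(Z) / nrow(Z))$vectors # eigenvectors of the MLE covariance estimate rot <- unname(prcomp(Z)$rotation) # PCA loadings max(abs(abs(ev) - abs(rot))) # close to 0: same directions up to sign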
data(olympic) X <- olympic$tab X_scaled <- scale(X) my_pca <- prcomp(X_scaled) summary(my_pca) ## Importance of components: ## PC1 PC2 PC3 PC4 PC5 PC6 PC7 ## Standard deviation 1.8488 1.6144 0.97123 0.9370 0.74607 0.70088 0.65620 ## Proportion of Variance 0.3418 0.2606 0.09433 0.0878 0.05566 0.04912 0.04306 ## Cumulative Proportion 0.3418 0.6025 0.69679 0.7846 0.84026 0.88938 0.93244 ## PC8 PC9 PC10 ## Standard deviation 0.55389 0.51667 0.31915 ## Proportion of Variance 0.03068 0.02669 0.01019 ## Cumulative Proportion 0.96312 0.98981 1.00000 autoplot(my_pca, data = X, loadings = TRUE, loadings.colour = 'blue', loadings.label = TRUE, loadings.label.size = 3) Sigma_est <- (1 / nrow(X_scaled)) * t(X_scaled) %*% X_scaled Sigma_dec <- eigen(Sigma_est) Sigma_dec$vectors ## [,1] [,2] [,3] [,4] [,5] [,6] ## [1,] 0.4158823 0.1488081 -0.26747198 -0.08833244 -0.442314456 0.03071237 ## [2,] -0.3940515 -0.1520815 -0.16894945 -0.24424963 0.368913901 -0.09378242 ## [3,] -0.2691057 0.4835374 0.09853273 -0.10776276 -0.009754680 0.23002054 ## [4,] -0.2122818 0.0278985 -0.85498656 0.38794393 -0.001876311 0.07454380 ## [5,] 0.3558474 0.3521598 -0.18949642 0.08057457 0.146965351 -0.32692886 ## [6,] 0.4334816 0.0695682 -0.12616012 -0.38229029 -0.088802794 0.21049130 ## [7,] -0.1757923 0.5033347 0.04609969 0.02558404 0.019358607 0.61491241 ## [8,] -0.3840821 0.1495820 0.13687235 0.14396548 -0.716743474 -0.34776037 ## [9,] -0.1799436 0.3719570 -0.19232803 -0.60046566 0.095582043 -0.43744387 ## [10,] 0.1701426 0.4209653 0.22255233 0.48564231 0.339772188 -0.30032419 ## [,7] [,8] [,9] [,10] ## [1,] 0.2543985 0.663712826 -0.10839531 0.10948045 ## [2,] 0.7505343 0.141264141 0.04613910 0.05580431 ## [3,] -0.1106637 0.072505560 0.42247611 0.65073655 ## [4,] -0.1351242 -0.155435871 -0.10206505 0.11941181 ## [5,] 0.1413388 -0.146839303 0.65076229 -0.33681395 ## [6,] 0.2725296 -0.639003579 -0.20723854 0.25971800 ## [7,] 0.1439726 0.009400445 -0.16724055 -0.53450315 ## [8,] 0.2732665 -0.276873049 -0.01766443 -0.06589572 ## [9,] -0.3419099 0.058519366 -0.30619617 -0.13093187 ## [10,] 0.1868704 0.007310045 -0.45688227 0.24311846 my_pca$rotation ## PC1 PC2 PC3 PC4 PC5 PC6 ## 100 -0.4158823 0.1488081 0.26747198 -0.08833244 -0.442314456 0.03071237 ## long 0.3940515 -0.1520815 0.16894945 -0.24424963 0.368913901 -0.09378242 ## poid 0.2691057 0.4835374 -0.09853273 -0.10776276 -0.009754680 0.23002054 ## haut 0.2122818 0.0278985 0.85498656 0.38794393 -0.001876311 0.07454380 ## 400 -0.3558474 0.3521598 0.18949642 0.08057457 0.146965351 -0.32692886 ## 110 -0.4334816 0.0695682 0.12616012 -0.38229029 -0.088802794 0.21049130 ## disq 0.1757923 0.5033347 -0.04609969 0.02558404 0.019358607 0.61491241 ## perc 0.3840821 0.1495820 -0.13687235 0.14396548 -0.716743474 -0.34776037 ## jave 0.1799436 0.3719570 0.19232803 -0.60046566 0.095582043 -0.43744387 ## 1500 -0.1701426 0.4209653 -0.22255233 0.48564231 0.339772188 -0.30032419 ## PC7 PC8 PC9 PC10 ## 100 0.2543985 -0.663712826 0.10839531 -0.10948045 ## long 0.7505343 -0.141264141 -0.04613910 -0.05580431 ## poid -0.1106637 -0.072505560 -0.42247611 -0.65073655 ## haut -0.1351242 0.155435871 0.10206505 -0.11941181 ## 400 0.1413388 0.146839303 -0.65076229 0.33681395 ## 110 0.2725296 0.639003579 0.20723854 -0.25971800 ## disq 0.1439726 -0.009400445 0.16724055 0.53450315 ## perc 0.2732665 0.276873049 0.01766443 0.06589572 ## jave -0.3419099 -0.058519366 0.30619617 0.13093187 ## 1500 0.1868704 -0.007310045 0.45688227 -0.24311846 15.2 Fisher information Exercise 15.6 Let us assume a Poisson likelihood. 
Derive the MLE estimate of the mean. Derive the Fisher information. For the data below compute the MLE and construct confidence intervals. Use bootstrap to construct the CI for the mean. Compare with c) and discuss. x <- c(2, 5, 3, 1, 2, 1, 0, 3, 0, 2) Solution. The log likelihood of the Poisson is \\[\\begin{align*} l(\\lambda; x) = \\sum_{i=1}^n x_i \\ln \\lambda - n \\lambda - \\sum_{i=1}^n \\ln x_i! \\end{align*}\\] Taking the derivative and equating with 0 we get \\[\\begin{align*} \\frac{1}{\\hat{\\lambda}}\\sum_{i=1}^n x_i - n &= 0 \\\\ \\hat{\\lambda} &= \\frac{1}{n} \\sum_{i=1}^n x_i. \\end{align*}\\] Since \\(\\lambda\\) is the mean parameter, this was expected. For the Fischer information, we first need the second derivative, which is \\[\\begin{align*} - \\lambda^{-2} \\sum_{i=1}^n x_i. \\\\ \\end{align*}\\] Now taking the expectation of the negative of the above, we get \\[\\begin{align*} E[\\lambda^{-2} \\sum_{i=1}^n x_i] &= \\lambda^{-2} E[\\sum_{i=1}^n x_i] \\\\ &= \\lambda^{-2} n \\lambda \\\\ &= \\frac{n}{\\lambda}. \\end{align*}\\] set.seed(1) x <- c(2, 5, 3, 1, 2, 1, 0, 3, 0, 2) lambda_hat <- mean(x) finfo <- length(x) / lambda_hat mle_CI <- c(lambda_hat - 1.96 * sqrt(1 / finfo), lambda_hat + 1.96 * sqrt(1 / finfo)) boot_lambda <- c() nboot <- 1000 for (i in 1:nboot) { tmp_x <- sample(x, length(x), replace = T) boot_lambda[i] <- mean(tmp_x) } boot_CI <- c(quantile(boot_lambda, 0.025), quantile(boot_lambda, 0.975)) mle_CI ## [1] 1.045656 2.754344 boot_CI ## 2.5% 97.5% ## 1.0 2.7 Exercise 15.7 Find the Fisher information matrix for the Gamma distribution. Generate 20 samples from a Gamma distribution and plot a confidence ellipse of the inverse of Fisher information matrix around the ML estimates of the parameters. Also plot the theoretical values. Repeat the sampling several times. What do you observe? Discuss what a non-diagonal Fisher matrix implies. Hint: The digamma function is defined as \\(\\psi(x) = \\frac{\\frac{d}{dx} \\Gamma(x)}{\\Gamma(x)}\\). Additionally, you do not need to evaluate \\(\\frac{d}{dx} \\psi(x)\\). To calculate its value in R, use package numDeriv. Solution. The log likelihood of the Gamma is \\[\\begin{equation*} l(\\alpha, \\beta; x) = n \\alpha \\ln \\beta - n \\ln \\Gamma(\\alpha) + (\\alpha - 1) \\sum_{i=1}^n \\ln x_i - \\beta \\sum_{i=1}^n x_i. \\end{equation*}\\] Let us calculate the derivatives. \\[\\begin{align*} \\frac{\\partial}{\\partial \\alpha} l(\\alpha, \\beta; x) &= n \\ln \\beta - n \\psi(\\alpha) + \\sum_{i=1}^n \\ln x_i, \\\\ \\frac{\\partial}{\\partial \\beta} l(\\alpha, \\beta; x) &= \\frac{n \\alpha}{\\beta} - \\sum_{i=1}^n x_i, \\\\ \\frac{\\partial^2}{\\partial \\alpha \\beta} l(\\alpha, \\beta; x) &= \\frac{n}{\\beta}, \\\\ \\frac{\\partial^2}{\\partial \\alpha^2} l(\\alpha, \\beta; x) &= - n \\frac{\\partial}{\\partial \\alpha} \\psi(\\alpha), \\\\ \\frac{\\partial^2}{\\partial \\beta^2} l(\\alpha, \\beta; x) &= - \\frac{n \\alpha}{\\beta^2}. \\end{align*}\\] The Fisher information matrix is then \\[\\begin{align*} I(\\alpha, \\beta) = - E[ \\begin{bmatrix} - n \\psi'(\\alpha) & \\frac{n}{\\beta} \\\\ \\frac{n}{\\beta} & - \\frac{n \\alpha}{\\beta^2} \\end{bmatrix} ] = \\begin{bmatrix} n \\psi'(\\alpha) & - \\frac{n}{\\beta} \\\\ - \\frac{n}{\\beta} & \\frac{n \\alpha}{\\beta^2} \\end{bmatrix} \\end{align*}\\] A non-diagonal Fisher matrix implies that the parameter estimates are linearly dependent. 
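In other words, the estimates of \\(\\alpha\\) and \\(\\beta\\) are strongly (positively) correlated, which is why the confidence ellipse below is tilted rather than axis-aligned. A small numeric sketch of this, evaluating the Fisher matrix at the theoretical parameters used below (we use base R's trigamma() for \\(\\psi'(\\alpha)\\) instead of numDeriv; this sketch is ours, not part of the original solution): n <- 20; alpha <- 5; beta <- 2 I_mat <- matrix(c(n * trigamma(alpha), -n / beta, -n / beta, n * alpha / beta^2), nrow = 2, byrow = TRUE) cov2cor(solve(I_mat)) # off-diagonal entries around 0.95: strong positive correlation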
set.seed(1) n <- 20 pars_theor <- c(5, 2) x <- rgamma(n, 5, 2) # MLE for alpha and beta log_lik <- function (pars, x) { n <- length(x) return (- (n * pars[1] * log(pars[2]) - n * log(gamma(pars[1])) + (pars[1] - 1) * sum(log(x)) - pars[2] * sum(x))) } my_optim <- optim(par = c(1,1), fn = log_lik, method = "L-BFGS-B", lower = c(0.001, 0.001), upper = c(8, 8), x = x) pars_mle <- my_optim$par fish_mat <- matrix(data = NA, nrow = 2, ncol = 2) fish_mat[1,2] <- - n / pars_mle[2] fish_mat[2,1] <- - n / pars_mle[2] fish_mat[2,2] <- (n * pars_mle[1]) / (pars_mle[2]^2) fish_mat[1,1] <- n * grad(digamma, pars_mle[1]) fish_mat_inv <- solve(fish_mat) est_ellip <- ellipse(pars_mle, fish_mat_inv, draw = FALSE) colnames(est_ellip) <- c("X1", "X2") est_ellip <- as.data.frame(est_ellip) ggplot() + geom_point(data = data.frame(x = pars_mle[1], y = pars_mle[2]), aes(x = x, y = y)) + geom_path(data = est_ellip, aes(x = X1, y = X2)) + geom_point(aes(x = pars_theor[1], y = pars_theor[2]), color = "red") + geom_text(aes(x = pars_theor[1], y = pars_theor[2], label = "Theoretical parameters"), color = "red", nudge_y = -0.2) 15.3 The German tank problem Exercise 15.8 (The German tank problem) During WWII the allied intelligence were faced with an important problem of estimating the total production of certain German tanks, such as the Panther. What turned out to be a successful approach was to estimate the maximum from the serial numbers of the small sample of captured or destroyed tanks (describe the statistical model used). What assumptions were made by using the above model? Do you think they are reasonable assumptions in practice? Show that the plug-in estimate for the maximum (i.e. the maximum of the sample) is a biased estimator. Derive the maximum likelihood estimate of the maximum. Check that the following estimator is not biased: \\(\\hat{n} = \\frac{k + 1}{k}m - 1\\). Solution. The data are the serial numbers of the tanks. The parameter is \\(n\\), the total production of the tank. The distribution of the serial numbers is a discrete uniform distribution over all serial numbers. One of the assumptions is that we have i.i.d samples, however in practice this might not be true, as some tanks produced later could be sent to the field later, therefore already in theory we would not be able to recover some values from the population. To find the expected value we first need to find the distribution of \\(m\\). Let us start with the CDF. \\[\\begin{align*} F_m(x) = P(Y_1 < x,...,Y_k < x). \\end{align*}\\] If \\(x < k\\) then \\(F_m(x) = 0\\) and if \\(x \\geq 1\\) then \\(F_m(x) = 1\\). What about between those values. So the probability that the maximum value is less than or equal to \\(m\\) is just the number of possible draws from \\(Y\\) that are all smaller than \\(m\\), divided by all possible draws. This is \\(\\frac{{x}\\choose{k}}{{n}\\choose{k}}\\). The PDF on the suitable bounds is then \\[\\begin{align*} P(m = x) = F_m(x) - F_m(x - 1) = \\frac{\\binom{x}{k} - \\binom{x - 1}{k}}{\\binom{n}{k}} = \\frac{\\binom{x - 1}{k - 1}}{\\binom{n}{k}}. \\end{align*}\\] Now we can calculate the expected value of \\(m\\) using some combinatorial identities. \\[\\begin{align*} E[m] &= \\sum_{i = k}^n i \\frac{{i - 1}\\choose{k - 1}}{{n}\\choose{k}} \\\\ &= \\sum_{i = k}^n i \\frac{\\frac{(i - 1)!}{(k - 1)!(i - k)!}}{{n}\\choose{k}} \\\\ &= \\frac{k}{\\binom{n}{k}}\\sum_{i = k}^n \\binom{i}{k} \\\\ &= \\frac{k}{\\binom{n}{k}} \\binom{n + 1}{k + 1} \\\\ &= \\frac{k(n + 1)}{k + 1}. 
\\end{align*}\\] The bias of this estimator is then \\[\\begin{align*} E[m] - n = \\frac{k(n + 1)}{k + 1} - n = \\frac{k - n}{k + 1}. \\end{align*}\\] The probability that we observed our sample \\(Y = {Y_1, Y_2,...,,Y_k}\\) given \\(n\\) is \\(\\frac{1}{{n}\\choose{k}}\\). We need to find such \\(n^*\\) that this function is maximized. Additionally, we have a constraint that \\(n^* \\geq m = \\max{(Y)}\\). Let us plot this function for \\(m = 10\\) and \\(k = 4\\). library(ggplot2) my_fun <- function (x, m, k) { tmp <- 1 / (choose(x, k)) tmp[x < m] <- 0 return (tmp) } x <- 1:20 y <- my_fun(x, 10, 4) df <- data.frame(x = x, y = y) ggplot(data = df, aes(x = x, y = y)) + geom_line() ::: {.solution} (continued) We observe that the maximum of this function lies at the maximum value of the sample. Therefore \\(n^* = m\\) and ML estimate equals the plug-in estimate. \\[\\begin{align*} E[\\hat{n}] &= \\frac{k + 1}{k} E[m] - 1 \\\\ &= \\frac{k + 1}{k} \\frac{k(n + 1)}{k + 1} - 1 \\\\ &= n. \\end{align*}\\] ::: "],["nhst.html", "Chapter 16 Null hypothesis significance testing", " Chapter 16 Null hypothesis significance testing This chapter deals with null hypothesis significance testing. The students are expected to acquire the following knowledge: Binomial test. t-test. Chi-squared test. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } Exercise 16.1 (Binomial test) We assume \\(y_i \\in \\{0,1\\}\\), \\(i = 1,...,n\\) and \\(y_i | \\theta = 0.5 \\sim i.i.d.\\) Bernoulli\\((\\theta)\\). The test statistic is \\(X = \\sum_{i=1}^n\\) and the rejection region R is defined as the region where the probability of obtaining such or more extreme \\(X\\) given \\(\\theta = 0.5\\) is less than 0.05. Derive and plot the power function of the test for \\(n=100\\). What is the significance level of this test if \\(H0: \\theta = 0.5\\)? At which values of X will we reject the null hypothesis? # a # First we need the rejection region, so we need to find X_min and X_max n <- 100 qbinom(0.025, n, 0.5) ## [1] 40 qbinom(0.975, n, 0.5) ## [1] 60 pbinom(40, n, 0.5) ## [1] 0.02844397 pbinom(60, n, 0.5) ## [1] 0.9823999 X_min <- 39 X_max <- 60 thetas <- seq(0, 1, by = 0.01) beta_t <- 1 - pbinom(X_max, size = n, prob = thetas) + pbinom(X_min, size = n, prob = thetas) plot(beta_t) # b # The significance level is beta_t[51] ## [1] 0.0352002 # We will reject the null hypothesis at X values below X_min and above X_max. Exercise 16.2 (Long-run guarantees of the t-test) Generate a sample of size \\(n = 10\\) from the standard normal. Use the two-sided t-test with \\(H0: \\mu = 0\\) and record the p-value. Can you reject H0 at 0.05 significance level? (before simulating) If we repeated (b) many times, what would be the relative frequency of false positives/Type I errors (rejecting the null that is true)? What would be the relative frequency of false negatives /Type II errors (retaining the null when the null is false)? (now simulate b and check if the simulation results match your answer in b) Similar to (a-c) but now we generate data from N(-0.5, 1). Similar to (a-c) but now we generate data from N(\\(\\mu\\), 1) where we every time pick a different \\(\\mu < 0\\) and use a one-sided test \\(H0: \\mu <= 0\\). 
set.seed(2) # a x <- rnorm(10) my_test <- t.test(x, alternative = "two.sided", mu = 0) my_test ## ## One Sample t-test ## ## data: x ## t = 0.6779, df = 9, p-value = 0.5149 ## alternative hypothesis: true mean is not equal to 0 ## 95 percent confidence interval: ## -0.4934661 0.9157694 ## sample estimates: ## mean of x ## 0.2111516 # we can not reject the null hypothesis # b # The expected value of false positives would be 0.05. The expected value of # true negatives would be 0, as there are no negatives (the null hypothesis is # always the truth). nit <- 1000 typeIerr <- vector(mode = "logical", length = nit) typeIIerr <- vector(mode = "logical", length = nit) for (i in 1:nit) { x <- rnorm(10) my_test <- t.test(x, alternative = "two.sided", mu = 0) if (my_test$p.value < 0.05) { typeIerr[i] <- T } else { typeIerr[i] <- F } } mean(typeIerr) ## [1] 0.052 sd(typeIerr) / sqrt(nit) ## [1] 0.007024624 # d # We can not estimate the percentage of true negatives, but it will probably be # higher than 0.05. There will be no false positives as the null hypothesis is # always false. typeIIerr <- vector(mode = "logical", length = nit) for (i in 1:nit) { x <- rnorm(10, -0.5) my_test <- t.test(x, alternative = "two.sided", mu = 0) if (my_test$p.value < 0.05) { typeIIerr[i] <- F } else { typeIIerr[i] <- T } } mean(typeIIerr) ## [1] 0.719 sd(typeIIerr) / sqrt(nit) ## [1] 0.01422115 # e # The expected value of false positives would be lower than 0.05. The expected # value of true negatives would be 0, as there are no negatives (the null # hypothesis is always the truth). typeIerr <- vector(mode = "logical", length = nit) for (i in 1:nit) { u <- runif(1, -1, 0) x <- rnorm(10, u) my_test <- t.test(x, alternative = "greater", mu = 0) if (my_test$p.value < 0.05) { typeIerr[i] <- T } else { typeIerr[i] <- F } } mean(typeIerr) ## [1] 0.012 sd(typeIerr) / sqrt(nit) ## [1] 0.003444977 Exercise 16.3 (T-test, confidence intervals, and bootstrap) Sample \\(n=20\\) from a standard normal distribution and calculate the p-value using t-test, confidence intervals based on normal distribution, and bootstrap. Repeat this several times and check how many times we rejected the null hypothesis (made a type I error). Hint: For the confidence intervals you can use function CI from the Rmisc package. set.seed(1) library(Rmisc) nit <- 1000 n_boot <- 100 t_logic <- rep(F, nit) boot_logic <- rep(F, nit) norm_logic <- rep(F, nit) for (i in 1:nit) { x <- rnorm(20) my_test <- t.test(x) my_CI <- CI(x) if (my_test$p.value <= 0.05) t_logic[i] <- T boot_tmp <- vector(mode = "numeric", length = n_boot) for (j in 1:n_boot) { tmp_samp <- sample(x, size = 20, replace = T) boot_tmp[j] <- mean(tmp_samp) } if ((quantile(boot_tmp, 0.025) >= 0) | (quantile(boot_tmp, 0.975) <= 0)) { boot_logic[i] <- T } if ((my_CI[3] >= 0) | (my_CI[1] <= 0)) { norm_logic[i] <- T } } mean(t_logic) ## [1] 0.053 sd(t_logic) / sqrt(nit) ## [1] 0.007088106 mean(boot_logic) ## [1] 0.093 sd(boot_logic) / sqrt(nit) ## [1] 0.009188876 mean(norm_logic) ## [1] 0.053 sd(norm_logic) / sqrt(nit) ## [1] 0.007088106 Exercise 16.4 (Chi-squared test) Show that the \\(\\chi^2 = \\sum_{i=1}^k \\frac{(O_i - E_i)^2}{E_i}\\) test statistic is approximately \\(\\chi^2\\) distributed when we have two categories. Let us look at the US voting data here. Compare the number of voters who voted for Trump or Hillary depending on their income (less or more than 100.000 dollars per year). Manually calculate the chi-squared statistic, compare to the chisq.test in R, and discuss the results. 
Visualize the test. Solution. Let \\(X_i\\) be binary variables, \\(i = 1,...,n\\). We can then express the test statistic as \\[\\begin{align} \\chi^2 = &\\frac{(O_i - np)^2}{np} + \\frac{(n - O_i - n(1 - p))^2}{n(1 - p)} \\\\ &= \\frac{(O_i - np)^2}{np(1 - p)} \\\\ &= (\\frac{O_i - np}{\\sqrt{np(1 - p)}})^2. \\end{align}\\] When \\(n\\) is large, this distrbution is approximately normal with \\(\\mu = np\\) and \\(\\sigma^2 = np(1 - p)\\) (binomial converges in distribution to standard normal). By definition, the chi-squared distribution with \\(k\\) degrees of freedom is a sum of squares of \\(k\\) independent standard normal random variables. n <- 24588 less100 <- round(0.66 * n * c(0.49, 0.45, 0.06)) # some rounding, but it should not affect results more100 <- round(0.34 * n * c(0.47, 0.47, 0.06)) x <- rbind(less100, more100) colnames(x) <- c("Clinton", "Trump", "other/no answer") print(x) ## Clinton Trump other/no answer ## less100 7952 7303 974 ## more100 3929 3929 502 chisq.test(x) ## ## Pearson's Chi-squared test ## ## data: x ## X-squared = 9.3945, df = 2, p-value = 0.00912 x ## Clinton Trump other/no answer ## less100 7952 7303 974 ## more100 3929 3929 502 csum <- apply(x, 2, sum) rsum <- apply(x, 1, sum) chi2 <- (x[1,1] - csum[1] * rsum[1] / sum(x))^2 / (csum[1] * rsum[1] / sum(x)) + (x[1,2] - csum[2] * rsum[1] / sum(x))^2 / (csum[2] * rsum[1] / sum(x)) + (x[1,3] - csum[3] * rsum[1] / sum(x))^2 / (csum[3] * rsum[1] / sum(x)) + (x[2,1] - csum[1] * rsum[2] / sum(x))^2 / (csum[1] * rsum[2] / sum(x)) + (x[2,2] - csum[2] * rsum[2] / sum(x))^2 / (csum[2] * rsum[2] / sum(x)) + (x[2,3] - csum[3] * rsum[2] / sum(x))^2 / (csum[3] * rsum[2] / sum(x)) chi2 ## Clinton ## 9.394536 1 - pchisq(chi2, df = 2) ## Clinton ## 0.009120161 x <- seq(0, 15, by = 0.01) df <- data.frame(x = x) ggplot(data = df, aes(x = x)) + stat_function(fun = dchisq, args = list(df = 2)) + geom_segment(aes(x = chi2, y = 0, xend = chi2, yend = dchisq(chi2, df = 2))) + stat_function(fun = dchisq, args = list(df = 2), xlim = c(chi2, 15), geom = "area", fill = "red") "],["bi.html", "Chapter 17 Bayesian inference 17.1 Conjugate priors 17.2 Posterior sampling", " Chapter 17 Bayesian inference This chapter deals with Bayesian inference. The students are expected to acquire the following knowledge: How to set prior distribution. Compute posterior distribution. Compute posterior predictive distribution. Use sampling for inference. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } 17.1 Conjugate priors Exercise 17.1 (Poisson-gamma model) Let us assume a Poisson likelihood and a gamma prior on the Poisson mean parameter (this is a conjugate prior). Derive posterior Below we have some data, which represents number of goals in a football match. Choose sensible prior for this data (draw the gamma density if necessary), justify it. Compute the posterior. Compute an interval such that the probability that the true mean is in there is 95%. What is the probability that the true mean is greater than 2.5? Back to theory: Compute prior predictive and posterior predictive. Discuss why the posterior predictive is overdispersed and not Poisson? Draw a histogram of the prior predictive and posterior predictive for the data from (b). Discuss. Generate 10 and 100 random samples from a Poisson distribution and compare the posteriors with a flat prior, and a prior concentrated away from the truth. x <- c(3, 2, 1, 1, 5, 4, 0, 0, 4, 3) Solution. 
\\[\\begin{align*} p(\\lambda | X) &= \\frac{p(X | \\lambda) p(\\lambda)}{\\int_0^\\infty p(X | \\lambda) p(\\lambda) d\\lambda} \\\\ &\\propto p(X | \\lambda) p(\\lambda) \\\\ &= \\Big(\\prod_{i=1}^n \\frac{1}{x_i!} \\lambda^{x_i} e^{-\\lambda}\\Big) \\frac{\\beta^\\alpha}{\\Gamma(\\alpha)} \\lambda^{\\alpha - 1} e^{-\\beta \\lambda} \\\\ &\\propto \\lambda^{\\sum_{i=1}^n x_i + \\alpha - 1} e^{- \\lambda (n + \\beta)} \\\\ \\end{align*}\\] We recognize this as the shape of a gamma distribution, therefore \\[\\begin{align*} \\lambda | X \\sim \\text{gamma}(\\alpha + \\sum_{i=1}^n x_i, \\beta + n) \\end{align*}\\] For the prior predictive, we have \\[\\begin{align*} p(x^*) &= \\int_0^\\infty p(x^*, \\lambda) d\\lambda \\\\ &= \\int_0^\\infty p(x^* | \\lambda) p(\\lambda) d\\lambda \\\\ &= \\int_0^\\infty \\frac{1}{x^*!} \\lambda^{x^*} e^{-\\lambda} \\frac{\\beta^\\alpha}{\\Gamma(\\alpha)} \\lambda^{\\alpha - 1} e^{-\\beta \\lambda} d\\lambda \\\\ &= \\frac{\\beta^\\alpha}{\\Gamma(x^* + 1)\\Gamma(\\alpha)} \\int_0^\\infty \\lambda^{x^* + \\alpha - 1} e^{-\\lambda (1 + \\beta)} d\\lambda \\\\ &= \\frac{\\beta^\\alpha}{\\Gamma(x^* + 1)\\Gamma(\\alpha)} \\frac{\\Gamma(x^* + \\alpha)}{(1 + \\beta)^{x^* + \\alpha}} \\int_0^\\infty \\frac{(1 + \\beta)^{x^* + \\alpha}}{\\Gamma(x^* + \\alpha)} \\lambda^{x^* + \\alpha - 1} e^{-\\lambda (1 + \\beta)} d\\lambda \\\\ &= \\frac{\\beta^\\alpha}{\\Gamma(x^* + 1)\\Gamma(\\alpha)} \\frac{\\Gamma(x^* + \\alpha)}{(1 + \\beta)^{x^* + \\alpha}} \\\\ &= \\frac{\\Gamma(x^* + \\alpha)}{\\Gamma(x^* + 1)\\Gamma(\\alpha)} (\\frac{\\beta}{1 + \\beta})^\\alpha (\\frac{1}{1 + \\beta})^{x^*}, \\end{align*}\\] which we recognize as the negative binomial distribution with \\(r = \\alpha\\) and \\(p = \\frac{1}{\\beta + 1}\\). For the posterior predictive, the calculation is the same, only now the parameters are \\(r = \\alpha + \\sum_{i=1}^n x_i\\) and \\(p = \\frac{1}{\\beta + n + 1}\\). There are two sources of uncertainty in the predictive distribution. First is the uncertainty about the population. Second is the variability in sampling from the population. When \\(n\\) is large, the latter is going to be very small. But when \\(n\\) is small, the latter is going to be higher, resulting in an overdispersed predictive distribution. 
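As a concrete instance of the update (with the gamma(1, 1) prior chosen in the code below), the data have \\(\\sum_{i=1}^n x_i = 23\\) and \\(n = 10\\), so \\[\\begin{align*} \\lambda | X \\sim \\text{gamma}(1 + 23, 1 + 10) = \\text{gamma}(24, 11), \\qquad E[\\lambda | X] = \\frac{24}{11} \\approx 2.18, \\end{align*}\\] and the posterior predictive is negative binomial with \\(r = 24\\) and \\(p = \\frac{1}{12}\\).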
x <- c(3, 2, 1, 1, 5, 4, 0, 0, 4, 3) # b # quick visual check of the prior ggplot(data = data.frame(x = seq(0, 5, by = 0.01)), aes(x = x)) + stat_function(fun = dgamma, args = list(shape = 1, rate = 1)) palpha <- 1 pbeta <- 1 alpha_post <- palpha + sum(x) beta_post <- pbeta + length(x) ggplot(data = data.frame(x = seq(0, 5, by = 0.01)), aes(x = x)) + stat_function(fun = dgamma, args = list(shape = alpha_post, rate = beta_post)) # probability of being higher than 2.5 1 - pgamma(2.5, alpha_post, beta_post) ## [1] 0.2267148 # interval qgamma(c(0.025, 0.975), alpha_post, beta_post) ## [1] 1.397932 3.137390 # d prior_pred <- rnbinom(1000, size = palpha, prob = 1 - 1 / (pbeta + 1)) post_pred <- rnbinom(1000, size = palpha + sum(x), prob = 1 - 1 / (pbeta + 10 + 1)) df <- data.frame(prior = prior_pred, posterior = post_pred) df <- gather(df) ggplot(df, aes(x = value, fill = key)) + geom_histogram(position = "dodge") # e set.seed(1) x1 <- rpois(10, 2.5) x2 <- rpois(100, 2.5) alpha_flat <- 1 beta_flat <- 0.1 alpha_conc <- 50 beta_conc <- 10 n <- 10000 df_flat <- data.frame(x1 = rgamma(n, alpha_flat + sum(x1), beta_flat + 10), x2 = rgamma(n, alpha_flat + sum(x2), beta_flat + 100), type = "flat") df_flat <- tidyr::gather(df_flat, key = "key", value = "value", - type) df_conc <- data.frame(x1 = rgamma(n, alpha_conc + sum(x1), beta_conc + 10), x2 = rgamma(n, alpha_conc + sum(x2), beta_conc + 100), type = "conc") df_conc <- tidyr::gather(df_conc, key = "key", value = "value", - type) df <- rbind(df_flat, df_conc) ggplot(data = df, aes(x = value, color = type)) + facet_wrap(~ key) + geom_density() 17.2 Posterior sampling Exercise 17.2 (Bayesian logistic regression) In Chapter 15 we implemented a MLE for logistic regression (see the code below). For this model, conjugate priors do not exist, which complicates the calculation of the posterior. However, we can use sampling from the numerator of the posterior, using rejection sampling. Set a sensible prior distribution on \\(\\beta\\) and use rejection sampling to find the posterior distribution. In a) you will get a distribution of parameter \\(\\beta\\). Plot the probabilities (as in exercise 15.3) for each sample of \\(\\beta\\) and compare to the truth. Hint: We can use rejection sampling even for functions which are not PDFs – they do not have to sum/integrate to 1. We just need to use a suitable envelope that we know how to sample from. For example, here we could use a uniform distribution and scale it suitably. set.seed(1) inv_log <- function (z) { return (1 / (1 + exp(-z))) } x <- rnorm(100) y <- x y <- rbinom(100, size = 1, prob = inv_log(1.2 * x)) l_logistic <- function (beta, X, y) { logl <- -sum(y * log(inv_log(as.vector(beta %*% X))) + (1 - y) * log((1 - inv_log(as.vector(beta %*% X))))) return(logl) } my_optim <- optim(par = 0.5, fn = l_logistic, method = "L-BFGS-B", lower = 0, upper = 10, X = x, y = y) my_optim$par # Let's say we believe that the mean of beta is 0.5. Since we are not very sure # about this, we will give it a relatively high variance. So a normal prior with # mean 0.5 and standard deviation 5. But there is no right solution to this, # this is basically us expressing our prior belief in the parameter values. 
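It may help to first visualize the weakly informative prior described in the comment above; this is only a sketch, assuming the N(0.5, 5) prior mentioned there. library(ggplot2) # density of the assumed N(0.5, 5) prior on beta ggplot(data = data.frame(x = seq(-15, 15, by = 0.1)), aes(x = x)) + stat_function(fun = dnorm, args = list(mean = 0.5, sd = 5)) + labs(x = "beta", y = "prior density")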
set.seed(1) inv_log <- function (z) { return (1 / (1 + exp(-z))) } x <- rnorm(100) y <- x y <- rbinom(100, size = 1, prob = inv_log(1.2 * x)) l_logistic <- function (beta, X, y) { logl <- -sum(y * log(inv_log(as.vector(beta %*% X))) + (1 - y) * log((1 - inv_log(as.vector(beta %*% X))))) if (is.nan(logl)) logl <- Inf return(logl) } my_optim <- optim(par = 0.5, fn = l_logistic, method = "L-BFGS-B", lower = 0, upper = 10, X = x, y = y) my_optim$par ## [1] 1.166558 f_logistic <- function (beta, X, y) { logl <- prod(inv_log(as.vector(beta %*% X))^y * (1 - inv_log(as.vector(beta %*% X)))^(1 - y)) return(logl) } a <- seq(0, 3, by = 0.01) my_l <- c() for (i in a) { my_l <- c(my_l, f_logistic(i, x, y) * dnorm(i, 0.5, 5)) } plot(my_l) envlp <- 10^(-25.8) * dunif(a, -5, 5) # found by trial and error tmp <- data.frame(envel = envlp, l = my_l, t = a) tmp <- gather(tmp, key = "key", value = "value", - t) ggplot(tmp, aes(x = t, y = value, color = key)) + geom_line() # envelope OK set.seed(1) nsamps <- 1000 samps <- c() for (i in 1:nsamps) { tmp <- runif(1, -5, 5) u <- runif(1, 0, 1) if (u < (f_logistic(tmp, x, y) * dnorm(tmp, 0.5, 5)) / (10^(-25.8) * dunif(tmp, -5, 5))) { samps <- c(samps, tmp) } } plot(density(samps)) mean(samps) ## [1] 1.211578 median(samps) ## [1] 1.204279 truth_p <- data.frame(x = x, prob = inv_log(1.2 * x), type = "truth") preds <- inv_log(x %*% t(samps)) preds <- gather(cbind(as.data.frame(preds), x = x), key = "key", "value" = value, - x) ggplot(preds, aes(x = x, y = value)) + geom_line(aes(group = key), color = "gray", alpha = 0.7) + geom_point(data = truth_p, aes(y = prob), color = "red", alpha = 0.7) + theme_bw() "],["distributions-intutition.html", "Chapter 18 Distributions intutition 18.1 Discrete distributions 18.2 Continuous distributions", " Chapter 18 Distributions intutition This chapter is intended to help you familiarize yourself with the different probability distributions you will encounter in this course. You will need to use Appendix B extensively as a reference for the basic properties of distributions, so keep it close! .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } 18.1 Discrete distributions Exercise 18.1 (Bernoulli intuition 1) Arguably the simplest distribution you will encounter is the Bernoulli distribution. It is a discrete probability distribution used to represent the outcome of a yes/no question. It has one parameter \\(p\\) which is the probability of success. The probability of failure is \\((1-p)\\), sometimes denoted as \\(q\\). A classic way to think about a Bernoulli trial (a yes/no experiment) is a coin flip. Real coins are fair, meaning the probability of either heads (1) or tails (0) are the same, so \\(p=0.5\\) as shown below in figure a. Alternatively we may want to represent a process that doesn’t have equal probabilities of outcomes like “Will a throw of a fair die result in a 6?”. In this case \\(p=\\frac{1}{6}\\), shown in figure b. Using your knowledge of the Bernoulli distribution use the throw of a fair die to think of events, such that: \\(p = 0.5\\) \\(p = \\frac{5}{6}\\) \\(q = \\frac{2}{3}\\) Solution. An event that is equally likely to happen or not happen i.e. \\(p = 0.5\\) would be throwing an even number. More formally we can name this event \\(A\\) and write: \\(A = \\{2,4,6\\}\\), its probability being \\(P(A) = 0.5\\) An example of an event with \\(p = \\frac{5}{6}\\) would be throwing a number greater than 1. Defined as \\(B = \\{2,3,4,5,6\\}\\). 
We need an event that fails \\(\\frac{2}{3}\\) of the time. Alternatively we can reverse the problem and find an event that succeeds \\(\\frac{1}{3}\\) of the time, since: \\(q = 1 - p \\implies p = 1 - q = \\frac{1}{3}\\). The event that our outcome is divisible by 3: \\(C = \\{3, 6\\}\\) satisfies this condition. Exercise 18.2 (Binomial intuition 1) The binomial distribution is a generalization of the Bernoulli distribution. Instead of considering a single Bernoulli trial, we now consider a sequence of \\(n\\) trials, which are independent and have the same parameter \\(p\\). So the binomial distribution has two parameters \\(n\\) - the number of trials and \\(p\\) - the probability of success for each trial. If we return to our coin flip representation, we now flip a coin several times. The binomial distribution will give us the probabilities of all possible outcomes. Below we show the distribution for a series of 10 coin flips with a fair coin (left) and a biased coin (right). The numbers on the x axis represent the number of times the coin landed heads. Using your knowledge of the binomial distribution: Take the pmf of the binomial distribution and plug in \\(n=1\\), check that it is in fact equivalent to a Bernoulli distribution. In our examples we show the graph of a binomial distribution over 10 trials with \\(p=0.8\\). If we take a look at the graph, it appears as though the probabilities of getting 0,1,2 or 3 heads in 10 flips are zero. Is it actually zero? Check by plugging in the values into the pmf. Solution. The pmf of a binomial distribution is \\(\\binom{n}{k} p^k (1 - p)^{n - k}\\), now we insert \\(n=1\\) to get: \\[\\binom{1}{k} p^k (1 - p)^{1 - k}\\] Not quite equivalent to a Bernoulli, however note that the support of the binomial distribution is defined as \\(k \\in \\{0,1,\\dots,n\\}\\), so in our case \\(k = \\{0,1\\}\\), then: \\[\\binom{1}{0} = \\binom{1}{1} = 1\\] we get: \\(p^k (1 - p)^{1 - k}\\) ,the Bernoulli distribution. As we already know \\(p=0.8, n=10\\), so: \\[\\binom{10}{0} 0.8^0 (1 - 0.8)^{10 - 0} = 1.024 \\cdot 10^{-7}\\] \\[\\binom{10}{1} 0.8^1 (1 - 0.8)^{10 - 1} = 4.096 \\cdot 10^{-6}\\] \\[\\binom{10}{2} 0.8^2 (1 - 0.8)^{10 - 2} = 7.3728 \\cdot 10^{-5}\\] \\[\\binom{10}{3} 0.8^3 (1 - 0.8)^{10 - 3} = 7.86432\\cdot 10^{-4}\\] So the probabilities are not zero, just very small. Exercise 18.3 (Poisson intuition 1) Below are shown 3 different graphs of the Poisson distribution. Your task is to replicate them on your own in R by varying the \\(\\lambda\\) parameter. Hint: You can use dpois() to get the probabilities. library(ggplot2) library(gridExtra) x = 0:15 # Create Poisson data data1 <- data.frame(x = x, y = dpois(x, lambda = 0.1)) data2 <- data.frame(x = x, y = dpois(x, lambda = 1)) data3 <- data.frame(x = x, y = dpois(x, lambda = 7.5)) # Create individual ggplot objects plot1 <- ggplot(data1, aes(x, y)) + geom_col() + xlab("x") + ylab("Probability") + ylim(0,1) plot2 <- ggplot(data2, aes(x, y)) + geom_col() + xlab("x") + ylab(NULL) + ylim(0,1) plot3 <- ggplot(data3, aes(x, y)) + geom_col() + xlab("x") + ylab(NULL) + ylim(0,1) # Combine the plots grid.arrange(plot1, plot2, plot3, ncol = 3) Exercise 18.4 (Poisson intuition 2) The Poisson distribution is a discrete probability distribution that models the probability of a given number of events occuring within processes where events occur at a constant mean rate and independently of each other - a Poisson process. It has a single parameter \\(\\lambda\\), which represents the constant mean rate. 
A classic example of a scenario that can be modeled using the Poisson distribution is the number of calls received at a call center in a day (or in fact any other time interval). Suppose you work in a call center and have some understanding of probability distributions. You overhear your supervisor mentioning that the call center receives an average of 2.5 calls per day. Using your knowledge of the Poisson distribution, calculate: The probability you will get no calls today. The probability you will get more than 5 calls today. Solution. First recall the Poisson pmf: \\[p(k) = \\frac{\\lambda^k e^{-\\lambda}}{k!}\\] as stated previously our parameter \\(\\lambda = 2.5\\) To get the probability of no calls we simply plug in \\(k = 0\\), so: \\[p(0) = \\frac{2.5^0 e^{-2.5}}{0!} = e^{-2.5} \\approx 0.082\\] The support of the Poisson distribution is non-negative integers. So if we wanted to calculate the probability of getting more than 5 calls we would need to add up the probabilities of getting 6 calls and 7 calls and so on up to infinity. Let us instead remember that the sum of all probabilties will be 1, we will reverse the problem and instead ask “What is the probability we get 5 calls or less?”. We can subtract the probability of the opposite outcome (the complement) from 1 to get the probability of our original question. \\[P(k > 5) = 1 - P(k \\leq 5)\\] \\[P(k \\leq 5) = \\sum_{i=0}^{5} p(i) = p(0) + p(1) + p(2) + p(3) + p(4) + p(5) =\\] \\[= \\frac{2.5^0 e^{-2.5}}{0!} + \\frac{2.5^1 e^{-2.5}}{1!} + \\dots =\\] \\[=0.957979\\] So the probability of geting more than 5 calls will be \\(1 - 0.957979 = 0.042021\\) Exercise 18.5 (Geometric intuition 1) The geometric distribution is a discrete distribution that models the number of failures before the first success in a sequence of independent Bernoulli trials. It has a single parameter \\(p\\), representing the probability of success and its support is all non-negative integers \\(\\{0,1,2,\\dots\\}\\). NOTE: There are two forms of this distribution, the one we just described and another that models the number of trials before the first success. The difference is subtle yet significant and you are likely to encounter both forms. The key to telling them apart is to check their support, since the number of trials has to be at least \\(1\\), for this case we have \\(\\{1,2,\\dots\\}\\). In the graph below we show the pmf of a geometric distribution with \\(p=0.5\\). This can be thought of as the number of successive failures (tails) in the flip of a fair coin. You can see that there’s a 50% chance you will have zero failures i.e. you will flip a heads on your very first attempt. But there is some smaller chance that you will flip a sequence of tails in a row, with longer sequences having ever lower probability. Create an equivalent graph that represents the probability of rolling a 6 with a fair 6-sided die. Use the formula for the mean of the geometric distribution and determine the average number of failures before you roll a 6. Look up the alternative form of the geometric distribtuion and again use the formula for the mean to determine the average number of trials before you roll a 6. Solution. Parameter p (the probability of success) for rolling a 6 is \\(p=\\frac{1}{6}\\). 
library(ggplot2) # Parameters p <- 1/6 x_vals <- 0:9 # Starting from 0 probs <- dgeom(x_vals, p) # Data data <- data.frame(x_vals, probs) # Plot ggplot(data, aes(x=x_vals, y=probs)) + geom_segment(aes(xend=x_vals, yend=0), color="black", size=1) + geom_point(color="red", size=2) + labs(x = "Number of trials", y = "Probability") + theme_minimal() + scale_x_continuous(breaks = x_vals) # This line ensures integer x-axis labels ::: {.solution} b) The expected value of a random variable (the mean) is denoted as \\(E[X]\\). \\[E[X] = \\frac{1-p}{p}= \\frac{1- \\frac{1}{6}}{\\frac{1}{6}} = \\frac{5}{6}\\cdot 6 = 5\\] On average we will fail 5 times before we roll our first 6. The alternative form of this distribution (with support on all positive integers) has a slightly different formula for the mean. This change reflects the difference in the way we posed our question: \\[E[X] = \\frac{1}{p} = \\frac{1}{\\frac{1}{6}} = 6\\] On average we will have to throw the die 6 times before we roll a 6. ::: 18.2 Continuous distributions Exercise 18.6 (Uniform intuition 1) The need for a randomness is a common problem. A practical solution are so-called random number generators (RNGs). The simplest RNG one would think of is choosing a set of numbers and having the generator return a number at random, where the probability of returning any number from this set is the same. If this set is an interval of real numbers, then we’ve basically described the continuous uniform distribution. It has two parameters \\(a\\) and \\(b\\), which define the beginning and end of its support respectively. Let’s think of the mean intuitively. The expected value or mean of a distribution is the pivot point on our x-axis, which “balances” the graph. Given parameters \\(a\\) and \\(b\\) what is your intuitive guess of the mean for this distribution? A special case of the uniform distribution is the standard uniform distribution with \\(a=0\\) and \\(b=1\\). Write the pdf \\(f(x)\\) of this particular distribution. Solution. It’s the midpoint between \\(a\\) and \\(b\\), so \\(\\frac{a+b}{2}\\) Inserting the parameter values we get:\\[f(x) = \\begin{cases} 1 & \\text{if } 0 \\leq x \\leq 1 \\\\ 0 & \\text{otherwise} \\end{cases} \\] Notice how the pdf is just a constant \\(1\\) across all values of \\(x \\in [0,1]\\). Here it is important to distinguish between probability and probability density. The density may be 1, but the probability is not and while discrete distributions never exceed 1 on the y-axis, continuous distributions can go as high as you like. Exercise 18.7 (Normal intuition 1) The normal distribution, also known as the Gaussian distribution, is a continuous distribution that encompasses the entire real number line. It has two parameters: the mean, denoted by \\(\\mu\\), and the variance, represented by \\(\\sigma^2\\). Its shape resembles the iconic bell curve. The position of its peak is determined by the parameter \\(\\mu\\), while the variance determines the spread or width of the curve. A smaller variance results in a sharper, narrower peak, while a larger variance leads to a broader, more spread-out curve. Below, we graph the distribution of IQ scores for two different populations. We aim to identify individuals with an IQ at or above 140 for an experiment. We can identify them reliably; however, we only have time to examine one of the two groups. Which group should we investigate to have the best chance of finding such individuals? 
NOTE: The graph below displays the parameter \\(\\sigma\\), which is the square root of the variance, more commonly referred to as the standard deviation. Keep this in mind when solving the problems. Insert the values of either population into the pdf of a normal distribution and determine which one has a higher density at \\(x=140\\). Generate the graph yourself and zoom into the relevant area to graphically verify your answer. To determine probability density, we can use the pdf. However, if we wish to know the proportion of the population that falls within certain parameters, we would need to integrate the pdf. Fortunately, the integrals of common distributions are well-established. This integral gives us the cumulative distribution function \\(F(x)\\) (CDF). BONUS: Look up the CDF of the normal distribution and input the appropriate values to determine the percentage of each population that comprises individuals with an IQ of 140 or higher. Solution. Group 1: \\(\\mu = 100, \\sigma=10 \\rightarrow \\sigma^2 = 100\\) \\[\\frac{1}{\\sqrt{2 \\pi \\sigma^2}} e^{-\\frac{(x - \\mu)^2}{2 \\sigma^2}} = \\frac{1}{\\sqrt{2 \\pi 100}} e^{-\\frac{(140 - 100)^2}{2 \\cdot 100}} \\approx 1.34e-05\\] Group 2: \\(\\mu = 105, \\sigma=8 \\rightarrow \\sigma^2 = 64\\) \\[\\frac{1}{\\sqrt{2 \\pi \\sigma^2}} e^{-\\frac{(x - \\mu)^2}{2 \\sigma^2}} = \\frac{1}{\\sqrt{2 \\pi 64}} e^{-\\frac{(140 - 105)^2}{2 \\cdot 64}} \\approx 3.48e-06\\] So despite the fact that group 1 has a lower average IQ, we are more likely to find 140 IQ individuals in this group. library(ggplot2) library(tidyr) # Create data x <- seq(135, 145, by = 0.01) # Adjusting the x range to account for the larger standard deviations df <- data.frame(x = x) # Define the IQ distributions df$IQ_mu100_sd10 <- dnorm(df$x, mean = 100, sd = 10) df$IQ_mu105_sd8 <- dnorm(df$x, mean = 105, sd = 8) # Convert from wide to long format for ggplot2 df_long <- gather(df, distribution, density, -x) # Ensure the levels of the 'distribution' factor match our desired order df_long$distribution <- factor(df_long$distribution, levels = c("IQ_mu100_sd10", "IQ_mu105_sd8")) # Plot ggplot(df_long, aes(x = x, y = density, color = distribution)) + geom_line() + labs(x = "IQ Score", y = "Density") + scale_color_manual( name = "IQ Distribution", values = c(IQ_mu100_sd10 = "red", IQ_mu105_sd8 = "blue"), labels = c("Group 1 (µ=100, σ=10)", "Group 2 (µ=105, σ=8)") ) + theme_minimal() ::: {.solution} c. The CDF of the normal distribution is \\(\\Phi(x) = \\frac{1}{2} \\left[ 1 + \\text{erf} \\left( \\frac{x - \\mu}{\\sigma \\sqrt{2}} \\right) \\right]\\). The CDF is defined as the integral of the distribution density up to x. So to get the total percentage of individuals with IQ at 140 or higher we will need to subtract the value from 1. Group 1: \\[1 - \\Phi(140) = \\frac{1}{2} \\left[ 1 + \\text{erf} \\left( \\frac{140 - 100}{10 \\sqrt{2}} \\right) \\right] \\approx 3.17e-05 \\] Group 2 : \\[1 - \\Phi(140) = \\frac{1}{2} \\left[ 1 + \\text{erf} \\left( \\frac{140 - 105}{8 \\sqrt{2}} \\right) \\right] \\approx 6.07e-06 \\] So roughly 0.003% and 0.0006% of individuals in groups 1 and 2 respectively have an IQ at or above 140. ::: Exercise 18.8 (Beta intuition 1) The beta distribution is a continuous distribution defined on the unit interval \\([0,1]\\). It has two strictly positive paramters \\(\\alpha\\) and \\(\\beta\\), which determine its shape. Its support makes it especially suitable to model distribtuions of percentages and proportions. 
Below you’ve been provided with some code that you can copy into Rstudio. Once you run the code, an interactive Shiny app will appear and you will be able to manipulate the graph of the beta distribution. Play around with the parameters to get: A straight line from (0,0) to (1,2) A straight line from (0,2) to (1,0) A symmetric bell curve A bowl-shaped curve The standard uniform distribution is actually a special case of the beta distribution. Find the exact parameters \\(\\alpha\\) and \\(\\beta\\). Once you do, prove the equality by inserting the values into our pdf. Hint: The beta function is evaluated as \\(\\text{B}(a,b) = \\frac{\\Gamma(a)\\Gamma(b)}{\\Gamma(a+b)}\\), the gamma function for positive integers \\(n\\) is evaluated as \\(\\Gamma(n)= (n-1)!\\) # Install and load necessary packages install.packages(c("shiny", "ggplot2")) library(shiny) library(ggplot2) # The Shiny App ui <- fluidPage( titlePanel("Beta Distribution Viewer"), sidebarLayout( sidebarPanel( sliderInput("alpha", "Alpha:", min = 0.1, max = 10, value = 2, step = 0.1), sliderInput("beta", "Beta:", min = 0.1, max = 10, value = 2, step = 0.1) ), mainPanel( plotOutput("betaPlot") ) ) ) server <- function(input, output) { output$betaPlot <- renderPlot({ x <- seq(0, 1, by = 0.01) y <- dbeta(x, shape1 = input$alpha, shape2 = input$beta) ggplot(data.frame(x = x, y = y), aes(x = x, y = y)) + geom_line() + labs(x = "Value", y = "Density") + theme_minimal() }) } shinyApp(ui = ui, server = server) Solution. \\(\\alpha = 2, \\beta=1\\) \\(\\alpha = 1, \\beta=2\\) Possible solution \\(\\alpha = \\beta= 5\\) Possible solution \\(\\alpha = \\beta= 0.5\\) The correct parameters are \\(\\alpha = 1, \\beta=1\\), to prove the equality we insert them into the beta pdf: \\[\\frac{x^{\\alpha - 1} (1 - x)^{\\beta - 1}}{\\text{B}(\\alpha, \\beta)} = \\frac{x^{1 - 1} (1 - x)^{1 - 1}}{\\text{B}(1, 1)} = \\frac{1}{\\frac{\\Gamma(1)\\Gamma(1)}{\\Gamma(1+1)}}= \\frac{1}{\\frac{(1-1)!(1-1)!}{(2-1)!}} = 1\\] Exercise 18.9 (Exponential intuition 1) The exponential distribution represents the distributon of time between events in a Poisson process. It is the continuous analogue of the geometric distribution. It has a single parameter \\(\\lambda\\), which is strictly positive and represents the constant rate of the corresponding Poisson process. The support is all positive reals, since time between events is non-negative, but not bound upwards. Let’s revisit the call center from our Poisson problem. We get 2.5 calls per day on average, this is our rate parameter \\(\\lambda\\). A work day is 8 hours. What is the mean time between phone calls? The cdf \\(F(x)\\) tells us what percentage of calls occur within x amount of time of each other. You want to take an hour long lunch break but are worried about missing calls. Calculate the percentage of calls you are likely to miss if you’re gone for an hour. Hint: The cdf is \\(F(x) = \\int_{-\\infty}^{x} f(x) dx\\) Solution. 
Taking \\(\\lambda = \\frac{2.5 \\text{ calls}}{8 \\text{ hours}} = \\frac{1 \\text{ call}}{3.2 \\text{ hours}}\\) \\[E[X] = \\frac{1}{\\lambda} = \\frac{3.2 \\text{ hours}}{\\text{call}}\\] First we derive the CDF, we can integrate from 0 instead of \\(-\\infty\\), since we have no support in the negatives: \\[\\begin{align} F(x) &= \\int_{0}^{x} \\lambda e^{-\\lambda t} dt \\\\ &= \\lambda \\int_{0}^{x} e^{-\\lambda t} dt \\\\ &= \\lambda (\\frac{1}{-\\lambda}e^{-\\lambda t} |_{0}^{x}) \\\\ &= \\lambda(\\frac{1}{\\lambda} - \\frac{1}{\\lambda} e^{-\\lambda x}) \\\\ &= 1 - e^{-\\lambda x}. \\end{align}\\] Then we just evaluate it for a time of 1 hour: \\[F(1 \\text{ hour}) = 1 - e^{-\\frac{1 \\text{ call}}{3.2 \\text{ hours}} \\cdot 1 \\text{ hour}}= 1 - e^{-\\frac{1 \\text{ call}}{3.2 \\text{ hours}}} \\approx 0.268\\] So we have about a 27% chance of missing a call if we’re gone for an hour. Exercise 18.10 (Gamma intuition 1) The gamma distribution is a continuous distribution characterized by two parameters, \\(\\alpha\\) and \\(\\beta\\), both greater than 0. These parameters afford the distribution a broad range of shapes, leading to it being commonly referred to as a family of distributions. Given its support over the positive real numbers, it is well suited for modeling a diverse range of positive-valued phenomena. The exponential distribution is actually just a particular form of the gamma distribution. What are the values of \\(\\alpha\\) and \\(\\beta\\)? Copy the code from our beta distribution Shiny app and modify it to simulate the gamma distribution. Then get it to show the exponential. Solution. Let’s start by taking a look at the pdfs of the two distributions side by side: \\[\\frac{\\beta^\\alpha}{\\Gamma(\\alpha)} x^{\\alpha - 1}e^{-\\beta x} = \\lambda e^{-\\lambda x}\\] The \\(x^{\\alpha - 1}\\) term is not found anywhere in the pdf of the exponential so we need to eliminate it by setting \\(\\alpha = 1\\). This also makes the fraction evaluate to \\(\\frac{\\beta^1}{\\Gamma(1)} = \\beta\\), which leaves us with \\[\\beta \\cdot e^{-\\beta x}\\] Now we can see that \\(\\beta = \\lambda\\) and \\(\\alpha = 1\\). # Install and load necessary packages install.packages(c("shiny", "ggplot2")) library(shiny) library(ggplot2) # The Shiny App ui <- fluidPage( titlePanel("Gamma Distribution Viewer"), sidebarLayout( sidebarPanel( sliderInput("shape", "Shape (α):", min = 0.1, max = 10, value = 2, step = 0.1), sliderInput("scale", "Scale (β):", min = 0.1, max = 10, value = 2, step = 0.1) ), mainPanel( plotOutput("gammaPlot") ) ) ) server <- function(input, output) { output$gammaPlot <- renderPlot({ x <- seq(0, 25, by = 0.1) y <- dgamma(x, shape = input$shape, scale = input$scale) ggplot(data.frame(x = x, y = y), aes(x = x, y = y)) + geom_line() + labs(x = "Value", y = "Density") + theme_minimal() }) } shinyApp(ui = ui, server = server) "],["A1.html", "A R programming language A.1 Basic characteristics A.2 Why R? A.3 Setting up A.4 R basics A.5 Functions A.6 Other tips A.7 Further reading and references", " A R programming language A.1 Basic characteristics R is free software for statistical computing and graphics. It is widely used by statisticians, scientists, and other professionals for software development and data analysis. It is an interpreted language and therefore the programs do not need compilation. A.2 Why R? R is one of the main two languages used for statistics and machine learning (the other being Python). Pros Libraries. 
Comprehensive collection of statistical and machine learning packages. Easy to code. Open source. Anyone can access R and develop new methods. Additionally, it is relatively simple to get source code of established methods. Large community. The use of R has been rising for some time, in industry and academia. Therefore a large collection of blogs and tutorials exists, along with people offering help on pages like StackExchange and CrossValidated. Integration with other languages and LaTeX. New methods. Many researchers develop R packages based on their research, therefore new methods are available soon after development. Cons Slow. Programs run slower than in other programming languages, however this can be somewhat ammended by effective coding or integration with other languages. Memory intensive. This can become a problem with large data sets, as they need to be stored in the memory, along with all the information the models produce. Some packages are not as good as they should be, or have poor documentation. Object oriented programming in R can be very confusing and complex. A.3 Setting up https://www.r-project.org/. A.3.1 RStudio RStudio is the most widely used IDE for R. It is free, you can download it from https://rstudio.com/. While console R is sufficient for the requirements of this course, we recommend the students install RStudio for its better user interface. A.3.2 Libraries for data science Listed below are some of the more useful libraries (packages) for data science. Students are also encouraged to find other useful packages. dplyr Efficient data manipulation. Part of the wider package collection called tidyverse. ggplot2 Plotting based on grammar of graphics. stats Several statistical models. rstan Bayesian inference using Hamiltonian Monte Carlo. Very flexible model building. MCMCpack Bayesian inference. rmarkdown, knitr, and bookdown Dynamic reports (for example such as this one). devtools Package development. 
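A minimal sketch of how you might install and load a few of the packages listed above. Run install.packages only once per machine; the particular selection here is just an example. install.packages(c("dplyr", "ggplot2", "rmarkdown")) # one-time installation library(dplyr) # data manipulation library(ggplot2) # plotting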
A.4 R basics A.4.1 Variables and types Important information and tips: no type declaration define variables with <- instead of = (although both work, there is a slight difference, additionally most of the packages use the arrow) for strings use \"\" for comments use # change types with as.type() functions no special type for single character like C++ for example n <- 20 x <- 2.7 m <- n # m gets value 20 my_flag <- TRUE student_name <- "Luka" typeof(n) ## [1] "double" typeof(student_name) ## [1] "character" typeof(my_flag) ## [1] "logical" typeof(as.integer(n)) ## [1] "integer" typeof(as.character(n)) ## [1] "character" A.4.2 Basic operations n + x ## [1] 22.7 n - x ## [1] 17.3 diff <- n - x # variable diff gets the difference between n and x diff ## [1] 17.3 n * x ## [1] 54 n / x ## [1] 7.407407 x^2 ## [1] 7.29 sqrt(x) ## [1] 1.643168 n > 2 * n ## [1] FALSE n == n ## [1] TRUE n == 2 * n ## [1] FALSE n != n ## [1] FALSE paste(student_name, "is", n, "years old") ## [1] "Luka is 20 years old" A.4.3 Vectors use c() to combine elements into vectors can only contain one type of variable if different types are provided, all are transformed to the most basic type in the vector access elements by indexes or logical vectors of the same length a scalar value is regarded as a vector of length 1 1:4 # creates a vector of integers from 1 to 4 ## [1] 1 2 3 4 student_ages <- c(20, 23, 21) student_names <- c("Luke", "Jen", "Mike") passed <- c(TRUE, TRUE, FALSE) length(student_ages) ## [1] 3 # access by index student_ages[2] ## [1] 23 student_ages[1:2] ## [1] 20 23 student_ages[2] <- 24 # change values # access by logical vectors student_ages[passed == TRUE] # same as student_ages[passed] ## [1] 20 24 student_ages[student_names %in% c("Luke", "Mike")] ## [1] 20 21 student_names[student_ages > 20] ## [1] "Jen" "Mike" A.4.3.1 Operations with vectors most operations are element-wise if we operate on vectors of different lengths, the shorter vector periodically repeats its elements until it reaches the length of the longer one a <- c(1, 3, 5) b <- c(2, 2, 1) d <- c(6, 7) a + b ## [1] 3 5 6 a * b ## [1] 2 6 5 a + d ## Warning in a + d: longer object length is not a multiple of shorter object ## length ## [1] 7 10 11 a + 2 * b ## [1] 5 7 7 a > b ## [1] FALSE TRUE TRUE b == a ## [1] FALSE FALSE FALSE a %*% b # vector multiplication, not element-wise ## [,1] ## [1,] 13 A.4.4 Factors vectors of finite predetermined classes suitable for categorical variables ordinal (ordered) or nominal (unordered) car_brand <- factor(c("Audi", "BMW", "Mercedes", "BMW"), ordered = FALSE) car_brand ## [1] Audi BMW Mercedes BMW ## Levels: Audi BMW Mercedes freq <- factor(x = NA, levels = c("never","rarely","sometimes","often","always"), ordered = TRUE) freq[1:3] <- c("rarely", "sometimes", "rarely") freq ## [1] rarely sometimes rarely ## Levels: never < rarely < sometimes < often < always freq[4] <- "quite_often" # non-existing level, returns NA ## Warning in `[<-.factor`(`*tmp*`, 4, value = "quite_often"): invalid factor ## level, NA generated freq ## [1] rarely sometimes rarely <NA> ## Levels: never < rarely < sometimes < often < always A.4.5 Matrices two-dimensional generalizations of vectors my_matrix <- matrix(c(1, 2, 1, 5, 4, 2), nrow = 2, byrow = TRUE) my_matrix ## [,1] [,2] [,3] ## [1,] 1 2 1 ## [2,] 5 4 2 my_square_matrix <- matrix(c(1, 3, 2, 3), nrow = 2) my_square_matrix ## [,1] [,2] ## [1,] 1 2 ## [2,] 3 3 my_matrix[1,2] # first row, second column ## [1] 2 my_matrix[2, ] # second row ## [1] 5 4 2 my_matrix[ ,3] # third 
column ## [1] 1 2 A.4.5.1 Matrix functions and operations most operation element-wise mind the dimensions when using matrix multiplication %*% nrow(my_matrix) # number of matrix rows ## [1] 2 ncol(my_matrix) # number of matrix columns ## [1] 3 dim(my_matrix) # matrix dimension ## [1] 2 3 t(my_matrix) # transpose ## [,1] [,2] ## [1,] 1 5 ## [2,] 2 4 ## [3,] 1 2 diag(my_matrix) # the diagonal of the matrix as vector ## [1] 1 4 diag(1, nrow = 3) # creates a diagonal matrix ## [,1] [,2] [,3] ## [1,] 1 0 0 ## [2,] 0 1 0 ## [3,] 0 0 1 det(my_square_matrix) # matrix determinant ## [1] -3 my_matrix + 2 * my_matrix ## [,1] [,2] [,3] ## [1,] 3 6 3 ## [2,] 15 12 6 my_matrix * my_matrix # element-wise multiplication ## [,1] [,2] [,3] ## [1,] 1 4 1 ## [2,] 25 16 4 my_matrix %*% t(my_matrix) # matrix multiplication ## [,1] [,2] ## [1,] 6 15 ## [2,] 15 45 my_vec <- as.vector(my_matrix) # transform to vector my_vec ## [1] 1 5 2 4 1 2 A.4.6 Arrays multi-dimensional generalizations of matrices my_array <- array(c(1, 2, 3, 4, 5, 6, 7, 8), dim = c(2, 2, 2)) my_array[1, 1, 1] ## [1] 1 my_array[2, 2, 1] ## [1] 4 my_array[1, , ] ## [,1] [,2] ## [1,] 1 5 ## [2,] 3 7 dim(my_array) ## [1] 2 2 2 A.4.7 Data frames basic data structure for analysis differ from matrices as columns can be of different types student_data <- data.frame("Name" = student_names, "Age" = student_ages, "Pass" = passed) student_data ## Name Age Pass ## 1 Luke 20 TRUE ## 2 Jen 24 TRUE ## 3 Mike 21 FALSE colnames(student_data) <- c("name", "age", "pass") # change column names student_data[1, ] ## name age pass ## 1 Luke 20 TRUE student_data[ ,colnames(student_data) %in% c("name", "pass")] ## name pass ## 1 Luke TRUE ## 2 Jen TRUE ## 3 Mike FALSE student_data$pass # access column by name ## [1] TRUE TRUE FALSE student_data[student_data$pass == TRUE, ] ## name age pass ## 1 Luke 20 TRUE ## 2 Jen 24 TRUE A.4.8 Lists useful for storing different data structures access elements with double square brackets elements can be named first_list <- list(student_ages, my_matrix, student_data) second_list <- list(student_ages, my_matrix, student_data, first_list) first_list[[1]] ## [1] 20 24 21 second_list[[4]] ## [[1]] ## [1] 20 24 21 ## ## [[2]] ## [,1] [,2] [,3] ## [1,] 1 2 1 ## [2,] 5 4 2 ## ## [[3]] ## name age pass ## 1 Luke 20 TRUE ## 2 Jen 24 TRUE ## 3 Mike 21 FALSE second_list[[4]][[1]] # first element of the fourth element of second_list ## [1] 20 24 21 length(second_list) ## [1] 4 second_list[[length(second_list) + 1]] <- "add_me" # append an element names(first_list) <- c("Age", "Matrix", "Data") first_list$Age ## [1] 20 24 21 A.4.9 Loops mostly for loop for loop can iterate over an arbitrary vector # iterate over consecutive natural numbers my_sum <- 0 for (i in 1:10) { my_sum <- my_sum + i } my_sum ## [1] 55 # iterate over an arbirary vector my_sum <- 0 some_numbers <- c(2, 3.5, 6, 100) for (i in some_numbers) { my_sum <- my_sum + i } my_sum ## [1] 111.5 A.5 Functions for help use ?function_name A.5.1 Writing functions We can write our own functions with function(). In the brackets, we define the parameters the function gets, and in curly brackets we define what the function does. We use return() to return values. sum_first_n_elements <- function (n) { my_sum <- 0 for (i in 1:n) { my_sum <- my_sum + i } return (my_sum) } sum_first_n_elements(10) ## [1] 55 A.6 Other tips Use set.seed(arbitrary_number) at the beginning of a script to set the seed and ensure replication. 
To dynamically set the working directory in R Studio to the parent folder of a R script use setwd(dirname(rstudioapi::getSourceEditorContext()$path)). To avoid slow R loops use the apply family of functions. See ?apply and ?lapply. To make your data manipulation (and therefore your life) a whole lot easier, use the dplyr package. Use getAnywhere(function_name) to get the source code of any function. Use browser for debugging. See ?browser. A.7 Further reading and references Getting started with R Studio: https://www.youtube.com/watch?v=lVKMsaWju8w Official R manuals: https://cran.r-project.org/manuals.html Cheatsheets: https://www.rstudio.com/resources/cheatsheets/ Workshop on R, dplyr, ggplot2, and R Markdown: https://github.com/bstatcomp/Rworkshop "],["distributions.html", "B Probability distributions", " B Probability distributions Name parameters support pdf/pmf mean variance Bernoulli \\(p \\in [0,1]\\) \\(k \\in \\{0,1\\}\\) \\(p^k (1 - p)^{1 - k}\\) 1.12 \\(p\\) 7.1 \\(p(1-p)\\) 7.1 binomial \\(n \\in \\mathbb{N}\\), \\(p \\in [0,1]\\) \\(k \\in \\{0,1,\\dots,n\\}\\) \\(\\binom{n}{k} p^k (1 - p)^{n - k}\\) 4.4 \\(np\\) 7.2 \\(np(1-p)\\) 7.2 Poisson \\(\\lambda > 0\\) \\(k \\in \\mathbb{N}_0\\) \\(\\frac{\\lambda^k e^{-\\lambda}}{k!}\\) 4.6 \\(\\lambda\\) 7.3 \\(\\lambda\\) 7.3 geometric \\(p \\in (0,1]\\) \\(k \\in \\mathbb{N}_0\\) \\(p(1-p)^k\\) 4.5 \\(\\frac{1 - p}{p}\\) 7.4 \\(\\frac{1 - p}{p^2}\\) 9.3 normal \\(\\mu \\in \\mathbb{R}\\), \\(\\sigma^2 > 0\\) \\(x \\in \\mathbb{R}\\) \\(\\frac{1}{\\sqrt{2 \\pi \\sigma^2}} e^{-\\frac{(x - \\mu)^2}{2 \\sigma^2}}\\) 4.12 \\(\\mu\\) 7.8 \\(\\sigma^2\\) 7.8 uniform \\(a,b \\in \\mathbb{R}\\), \\(a < b\\) \\(x \\in [a,b]\\) \\(\\frac{1}{b-a}\\) 4.9 \\(\\frac{a+b}{2}\\) \\(\\frac{(b-a)^2}{12}\\) beta \\(\\alpha,\\beta > 0\\) \\(x \\in [0,1]\\) \\(\\frac{x^{\\alpha - 1} (1 - x)^{\\beta - 1}}{\\text{B}(\\alpha, \\beta)}\\) 4.10 \\(\\frac{\\alpha}{\\alpha + \\beta}\\) 7.6 \\(\\frac{\\alpha \\beta}{(\\alpha + \\beta)^2(\\alpha + \\beta + 1)}\\) 7.6 gamma \\(\\alpha,\\beta > 0\\) \\(x \\in (0, \\infty)\\) \\(\\frac{\\beta^\\alpha}{\\Gamma(\\alpha)} x^{\\alpha - 1}e^{-\\beta x}\\) 4.11 \\(\\frac{\\alpha}{\\beta}\\) 7.5 \\(\\frac{\\alpha}{\\beta^2}\\) 7.5 exponential \\(\\lambda > 0\\) \\(x \\in [0, \\infty)\\) \\(\\lambda e^{-\\lambda x}\\) 4.8 \\(\\frac{1}{\\lambda}\\) 7.7 \\(\\frac{1}{\\lambda^2}\\) 7.7 logistic \\(\\mu \\in \\mathbb{R}\\), \\(s > 0\\) \\(x \\in \\mathbb{R}\\) \\(\\frac{e^{-\\frac{x - \\mu}{s}}}{s(1 + e^{-\\frac{x - \\mu}{s}})^2}\\) 4.13 \\(\\mu\\) \\(\\frac{s^2 \\pi^2}{3}\\) negative binomial \\(r \\in \\mathbb{N}\\), \\(p \\in [0,1]\\) \\(k \\in \\mathbb{N}_0\\) \\(\\binom{k + r - 1}{k}(1-p)^r p^k\\) 4.7 \\(\\frac{rp}{1 - p}\\) 9.2 \\(\\frac{rp}{(1 - p)^2}\\) 9.2 multinomial \\(n \\in \\mathbb{N}\\), \\(k \\in \\mathbb{N}\\) \\(p_i \\in [0,1]\\), \\(\\sum p_i = 1\\) \\(x_i \\in \\{0,..., n\\}\\), \\(i \\in \\{1,...,k\\}\\), \\(\\sum{x_i} = n\\) \\(\\frac{n!}{x_1!x_2!...x_k!} p_1^{x_1} p_2^{x_2}...p_k^{x_k}\\) 8.1 \\(np_i\\) \\(np_i(1-p_i)\\) "],["references.html", "References", " References "],["404.html", "Page not found", " Page not found The page you requested cannot be found (perhaps it was moved or renamed). You may want to try searching to find the page's new location, or use the table of contents to find the page you are looking for. 
"]] +[["index.html", "Principles of Uncertainty – exercises Preface", " Principles of Uncertainty – exercises Gregor Pirš, Erik Štrumbelj, David Nabergoj and Leon Hvastja 2023-10-02 Preface These are the exercises for the Principles of Uncertainty course of the Data Science Master’s at University of Ljubljana, Faculty of Computer and Information Science. This document will be extended each week as the course progresses. At the end of each exercise session, we will post the solutions to the exercises worked in class and select exercises for homework. Students are also encouraged to solve the remaining exercises to further extend their knowledge. Some exercises require the use of R. Those exercises (or parts of) are coloured blue. Students that are not familiar with R programming language should study A to learn the basics. As the course progresses, we will cover more relevant uses of R for data science. "],["introduction.html", "Chapter 1 Probability spaces 1.1 Measure and probability spaces 1.2 Properties of probability measures 1.3 Discrete probability spaces", " Chapter 1 Probability spaces This chapter deals with measures and probability spaces. At the end of the chapter, we look more closely at discrete probability spaces. The students are expected to acquire the following knowledge: Theoretical Use properties of probability to calculate probabilities. Combinatorics. Understanding of continuity of probability. R Vectors and vector operations. For loop. Estimating probability with simulation. sample function. Matrices and matrix operations. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } 1.1 Measure and probability spaces Exercise 1.1 (Completing a set to a sigma algebra) Let \\(\\Omega = \\{1,2,...,10\\}\\) and let \\(A = \\{\\emptyset, \\{1\\}, \\{2\\}, \\Omega \\}\\). Show that \\(A\\) is not a sigma algebra of \\(\\Omega\\). Find the minimum number of elements to complete A to a sigma algebra of \\(\\Omega\\). Solution. \\(1^c = \\{2,3,...,10\\} \\notin A \\implies\\) \\(A\\) is not sigma algebra. First we need the complements of all elements, so we need to add sets \\(\\{2,3,...,10\\}\\) and \\(\\{1,3,4,...,10\\}\\). Next we need unions of all sets – we add the set \\(\\{1,2\\}\\). Again we need the complement of this set, so we add \\(\\{3,4,...,10\\}\\). So the minimum number of elements we need to add is 4. Exercise 1.2 (Diversity of sigma algebras) Let \\(\\Omega\\) be a set. Find the smallest sigma algebra of \\(\\Omega\\). Find the largest sigma algebra of \\(\\Omega\\). Solution. \\(A = \\{\\emptyset, \\Omega\\}\\) \\(2^{\\Omega}\\) Exercise 1.3 Find all sigma algebras for \\(\\Omega = \\{0, 1, 2\\}\\). Solution. \\(A = \\{\\emptyset, \\Omega\\}\\) \\(A = 2^{\\Omega}\\) \\(A = \\{\\emptyset, \\{0\\}, \\{1,2\\}, \\Omega\\}\\) \\(A = \\{\\emptyset, \\{1\\}, \\{0,2\\}, \\Omega\\}\\) \\(A = \\{\\emptyset, \\{2\\}, \\{0,1\\}, \\Omega\\}\\) Exercise 1.4 (Difference between algebra and sigma algebra) Let \\(\\Omega = \\mathbb{N}\\) and \\(\\mathcal{A} = \\{A \\subseteq \\mathbb{N}: A \\text{ is finite or } A^c \\text{ is finite.} \\}\\). Show that \\(\\mathcal{A}\\) is an algebra but not a sigma algebra. Solution. \\(\\emptyset\\) is finite so \\(\\emptyset \\in \\mathcal{A}\\). Let \\(A \\in \\mathcal{A}\\) and \\(B \\in \\mathcal{A}\\). If both are finite, then their union is also finite and therefore in \\(\\mathcal{A}\\). Let at least one of them not be finite. Then their union is not finite. 
But \\((A \\cup B)^c = A^c \\cap B^c\\). And since at least one is infinite, then its complement is finite and the intersection is too. So finite unions are in \\(\\mathcal{A}\\). Let us look at numbers \\(2n\\). For any \\(n\\), \\(2n \\in \\mathcal{A}\\) as it is finite. But \\(\\bigcup_{k = 1}^{\\infty} 2n \\notin \\mathcal{A}\\). Exercise 1.5 We define \\(\\sigma(X) = \\cap_{\\lambda \\in I} S_\\lambda\\) to be a sigma algebra, generated by the set \\(X\\), where \\(S_\\lambda\\) are all sigma algebras such that \\(X \\subseteq S_\\lambda\\). \\(S_\\lambda\\) are indexed by \\(\\lambda \\in I\\). Let \\(A, B \\subseteq 2^{\\Omega}\\). Prove that \\(\\sigma(A) = \\sigma(B) \\iff A \\subseteq \\sigma(B) \\land B \\subseteq \\sigma(A)\\). Solution. To prove the equivalence, we need to prove that the left hand side implies the right hand side and vice versa. Proving \\(\\sigma(A) = \\sigma(B) \\Rightarrow A \\subseteq \\sigma(B) \\land B \\subseteq \\sigma(A)\\): we know \\(A \\subseteq \\sigma(A)\\) is always true, so by substituting in \\(\\sigma(B)\\) from the left hand side equality we obtain \\(A \\subseteq \\sigma(B)\\). We obtain \\(B \\subseteq \\sigma(A)\\) by symmetry. This proves the implication. Proving \\(A \\subseteq \\sigma(B) \\land B \\subseteq \\sigma(A) \\Rightarrow \\sigma(A) = \\sigma(B)\\): by definition of a sigma algebra, generated by a set, we have \\(\\sigma(B) = \\cap_{\\lambda \\in I} S_\\lambda\\) where \\(S_\\lambda\\) are all sigma algebras where \\(B \\subseteq S_\\lambda\\). But \\(\\sigma(A)\\) is one of \\(S_\\lambda\\), so we can write \\(\\sigma(B) = \\sigma(A) \\cap \\left(\\cap_{\\lambda \\in I} S_\\lambda \\right)\\), which implies \\(\\sigma(B) \\subseteq \\sigma(A)\\). By symmetry, we have \\(\\sigma(A) \\subseteq \\sigma(B)\\). Since \\(\\sigma(A) \\subseteq \\sigma(B)\\) and \\(\\sigma(B) \\subseteq \\sigma(A)\\), we obtain \\(\\sigma(A) = \\sigma(B)\\), which proves the implication and completes the equivalence proof. Exercise 1.6 (Intro to measure) Take the measurable space \\(\\Omega = \\{1,2\\}\\), \\(F = 2^{\\Omega}\\). Which of the following is a measure? Which is a probability measure? \\(\\mu(\\emptyset) = 0\\), \\(\\mu(\\{1\\}) = 5\\), \\(\\mu(\\{2\\}) = 6\\), \\(\\mu(\\{1,2\\}) = 11\\) \\(\\mu(\\emptyset) = 0\\), \\(\\mu(\\{1\\}) = 0\\), \\(\\mu(\\{2\\}) = 0\\), \\(\\mu(\\{1,2\\}) = 1\\) \\(\\mu(\\emptyset) = 0\\), \\(\\mu(\\{1\\}) = 0\\), \\(\\mu(\\{2\\}) = 0\\), \\(\\mu(\\{1,2\\}) = 0\\) \\(\\mu(\\emptyset) = 0\\), \\(\\mu(\\{1\\}) = 0\\), \\(\\mu(\\{2\\}) = 1\\), \\(\\mu(\\{1,2\\}) = 1\\) \\(\\mu(\\emptyset)=0\\), \\(\\mu(\\{1\\})=0\\), \\(\\mu(\\{2\\})=\\infty\\), \\(\\mu(\\{1,2\\})=\\infty\\) Solution. Measure. Not probability measure since \\(\\mu(\\Omega) > 1\\). Neither due to countable additivity. Measure. Not probability measure since \\(\\mu(\\Omega) = 0\\). Probability measure. Measure. Not probability measure since \\(\\mu(\\Omega) > 1\\). Exercise 1.7 Define a probability space that could be used to model the outcome of throwing two fair 6-sided dice. Solution. \\(\\Omega = \\{\\{i,j\\}, i = 1,...,6, j = 1,...,6\\}\\) \\(F = 2^{\\Omega}\\) \\(\\forall \\omega \\in \\Omega\\), \\(P(\\omega) = \\frac{1}{6} \\times \\frac{1}{6} = \\frac{1}{36}\\) 1.2 Properties of probability measures Exercise 1.8 A standard deck (52 cards) is distributed to two persons: 26 cards to each person. All partitions are equally likely. Find the probability that: The first person gets 4 Queens. The first person gets at least 2 Queens. 
R: Use simulation (sample) to check the above answers. Solution. \\(\\frac{\\binom{48}{22}}{\\binom{52}{26}}\\) 1 - \\(\\frac{\\binom{48}{26} + 4 \\times \\binom{48}{25}}{\\binom{52}{26}}\\) For the simulation, let us represent cards with numbers from 1 to 52, and let 1 through 4 represent Queens. set.seed(1) cards <- 1:52 n <- 10000 q4 <- vector(mode = "logical", length = n) q2 <- vector(mode = "logical", length = n) tmp <- vector(mode = "logical", length = n) for (i in 1:n) { p1 <- sample(1:52, 26) q4[i] <- sum(1:4 %in% p1) == 4 q2[i] <- sum(1:4 %in% p1) >= 2 } sum(q4) / n ## [1] 0.0572 sum(q2) / n ## [1] 0.6894 Exercise 1.9 Let \\(A\\) and \\(B\\) be events with probabilities \\(P(A) = \\frac{2}{3}\\) and \\(P(B) = \\frac{1}{2}\\). Show that \\(\\frac{1}{6} \\leq P(A\\cap B) \\leq \\frac{1}{2}\\), and give examples to show that both extremes are possible. Find corresponding bounds for \\(P(A\\cup B)\\). R: Draw samples from the examples and show the probability bounds of \\(P(A \\cap B)\\) . Solution. From the properties of probability we have \\[\\begin{equation} P(A \\cup B) = P(A) + P(B) - P(A \\cap B) \\leq 1. \\end{equation}\\] From this follows \\[\\begin{align} P(A \\cap B) &\\geq P(A) + P(B) - 1 \\\\ &= \\frac{2}{3} + \\frac{1}{2} - 1 \\\\ &= \\frac{1}{6}, \\end{align}\\] which is the lower bound for the intersection. Conversely, we have \\[\\begin{equation} P(A \\cup B) = P(A) + P(B) - P(A \\cap B) \\geq P(A). \\end{equation}\\] From this follows \\[\\begin{align} P(A \\cap B) &\\leq P(B) \\\\ &= \\frac{1}{2}, \\end{align}\\] which is the upper bound for the intersection. For an example take a fair die. To achieve the lower bound let \\(A = \\{3,4,5,6\\}\\) and \\(B = \\{1,2,3\\}\\), then their intersection is \\(A \\cap B = \\{3\\}\\). To achieve the upper bound take \\(A = \\{1,2,3,4\\}\\) and $B = {1,2,3} $. For the bounds of the union we will use the results from the first part. Again from the properties of probability we have \\[\\begin{align} P(A \\cup B) &= P(A) + P(B) - P(A \\cap B) \\\\ &\\geq P(A) + P(B) - \\frac{1}{2} \\\\ &= \\frac{2}{3}. \\end{align}\\] Conversely \\[\\begin{align} P(A \\cup B) &= P(A) + P(B) - P(A \\cap B) \\\\ &\\leq P(A) + P(B) - \\frac{1}{6} \\\\ &= 1. \\end{align}\\] Therefore \\(\\frac{2}{3} \\leq P(A \\cup B) \\leq 1\\). We use sample in R: set.seed(1) n <- 10000 samps <- sample(1:6, n, replace = TRUE) # lower bound lb <- vector(mode = "logical", length = n) A <- c(1,2,3) B <- c(3,4,5,6) for (i in 1:n) { lb[i] <- samps[i] %in% A & samps[i] %in% B } sum(lb) / n ## [1] 0.1605 # upper bound ub <- vector(mode = "logical", length = n) A <- c(1,2,3) B <- c(1,2,3,4) for (i in 1:n) { ub[i] <- samps[i] %in% A & samps[i] %in% B } sum(ub) / n ## [1] 0.4913 Exercise 1.10 A fair coin is tossed repeatedly. Show that, with probability one, a head turns up sooner or later. Show similarly that any given finite sequence of heads and tails occurs eventually with probability one. Solution. \\[\\begin{align} P(\\text{no heads}) &= \\lim_{n \\rightarrow \\infty} P(\\text{no heads in first }n \\text{ tosses}) \\\\ &= \\lim_{n \\rightarrow \\infty} \\frac{1}{2^n} \\\\ &= 0. \\end{align}\\] For the second part, let us fix the given sequence of heads and tails of length \\(k\\) as \\(s\\). A probability that this happens in \\(k\\) tosses is \\(\\frac{1}{2^k}\\). 
\\[\\begin{align} P(s \\text{ occurs}) &= \\lim_{n \\rightarrow \\infty} P(s \\text{ occurs in first } nk \\text{ tosses}) \\end{align}\\] The right part of the upper equation is greater than if \\(s\\) occurs either in the first \\(k\\) tosses, second \\(k\\) tosses,…, \\(n\\)-th \\(k\\) tosses. Therefore \\[\\begin{align} P(s \\text{ occurs}) &\\geq \\lim_{n \\rightarrow \\infty} P(s \\text{ occurs in first } n \\text{ disjoint sequences of length } k) \\\\ &= \\lim_{n \\rightarrow \\infty} (1 - P(s \\text{ does not occur in first } n \\text{ disjoint sequences})) \\\\ &= 1 - \\lim_{n \\rightarrow \\infty} P(s \\text{ does not occur in first } n \\text{ disjoint sequences}) \\\\ &= 1 - \\lim_{n \\rightarrow \\infty} (1 - \\frac{1}{2^k})^n \\\\ &= 1. \\end{align}\\] Exercise 1.11 An Erdos-Renyi random graph \\(G(n,p)\\) is a model with \\(n\\) nodes, where each pair of nodes is connected with probability \\(p\\). Calculate the probability that there exists a node that is not connected to any other node in \\(G(4,0.6)\\). Show that the upper bound for the probability that there exist 2 nodes that are not connected to any other node for an arbitrary \\(G(n,p)\\) is \\(\\binom{n}{2} (1-p)^{2n - 3}\\). R: Estimate the probability from the first point using simulation. Solution. Let \\(A_i\\) be the event that the \\(i\\)-th node is not connected to any other node. Then our goal is to calculate \\(P(\\cup_{i=1}^n A_i)\\). Using the inclusion-exclusion principle, we get \\[\\begin{align} P(\\cup_{i=1}^n A_i) &= \\sum_i A_i - \\sum_{i<j} P(A_i \\cap A_j) + \\sum_{i<j<k} P(A_i \\cap A_j \\cap A_k) - P(A_1 \\cap A_2 \\cap A_3 \\cap A_4) \\\\ &=4 (1 - p)^3 - \\binom{4}{2} (1 - p)^5 + \\binom{4}{3} (1 - p)^6 - (1 - p)^6 \\\\ &\\approx 0.21. \\end{align}\\] Let \\(A_{ij}\\) be the event that nodes \\(i\\) and \\(j\\) are not connected to any other node. We are interested in \\(P(\\cup_{i<j}A_{ij})\\). By using Boole`s inequality, we get \\[\\begin{align} P(\\cup_{i<j}A_{ij}) \\leq \\sum_{i<j} P(A_{ij}). \\end{align}\\] What is the probability of \\(A_{ij}\\)? There need to be no connections to the \\(i\\)-th node to the remaining nodes (excluding \\(j\\)), the same for the \\(j\\)-th node, and there can be no connection between them. Therefore \\[\\begin{align} P(\\cup_{i<j}A_{ij}) &\\leq \\sum_{i<j} (1 - p)^{2(n-2) + 1} \\\\ &= \\binom{n}{2} (1 - p)^{2n - 3}. \\end{align}\\] set.seed(1) n_samp <- 100000 n <- 4 p <- 0.6 conn_samp <- vector(mode = "logical", length = n_samp) for (i in 1:n_samp) { tmp_mat <- matrix(data = 0, nrow = n, ncol = n) samp_conn <- sample(c(0,1), choose(4,2), replace = TRUE, prob = c(1 - p, p)) tmp_mat[lower.tri(tmp_mat)] <- samp_conn tmp_mat[upper.tri(tmp_mat)] <- t(tmp_mat)[upper.tri(t(tmp_mat))] not_conn <- apply(tmp_mat, 1, sum) if (any(not_conn == 0)) { conn_samp[i] <- TRUE } else { conn_samp[i] <- FALSE } } sum(conn_samp) / n_samp ## [1] 0.20565 1.3 Discrete probability spaces Exercise 1.12 Show that the standard measurable space on \\(\\Omega = \\{0,1,...,n\\}\\) equipped with binomial measure is a discrete probability space. Define another probability measure on this measurable space. Show that for \\(n=1\\) the binomial measure is the same as the Bernoulli measure. R: Draw 1000 samples from the binomial distribution \\(p=0.5\\), \\(n=20\\) (rbinom) and compare relative frequencies with theoretical probability measure. Solution. We need to show that the terms of \\(\\sum_{k=0}^n \\binom{n}{k} p^k (1 - p)^{n - k}\\) sum to 1. 
For that we use the binomial theorem \\(\\sum_{k=0}^n \\binom{n}{k} x^k y^{n-k} = (x + y)^n\\). So \\[\\begin{equation} \\sum_{k=0}^n \\binom{n}{k} p^k (1 - p)^{n - k} = (p + 1 - p)^n = 1. \\end{equation}\\] \\(P(\\{k\\}) = \\frac{1}{n + 1}\\). When \\(n=1\\) then \\(k \\in \\{0,1\\}\\). Inserting \\(n=1\\) into the binomial measure, we get \\(\\binom{1}{k}p^k (1-p)^{1 - k}\\). Now \\(\\binom{1}{1} = \\binom{1}{0} = 1\\), so the measure is \\(p^k (1-p)^{1 - k}\\), which is the Bernoulli measure. set.seed(1) library(ggplot2) library(dplyr) bin_samp <- rbinom(n = 1000, size = 20, prob = 0.5) bin_samp <- data.frame(x = bin_samp) %>% count(x) %>% mutate(n = n / 1000, type = "empirical_frequencies") %>% bind_rows(data.frame(x = 0:20, n = dbinom(0:20, size = 20, prob = 0.5), type = "theoretical_measure")) bin_plot <- ggplot(data = bin_samp, aes(x = x, y = n, fill = type)) + geom_bar(stat="identity", position = "dodge") plot(bin_plot) Exercise 1.13 Show that the standard measurable space on \\(\\Omega = \\{0,1,...,\\infty\\}\\) equipped with geometric measure is a discrete probability space, equipped with Poisson measure is a discrete probability space. Define another probability measure on this measurable space. R: Draw 1000 samples from the Poisson distribution \\(\\lambda = 10\\) (rpois) and compare relative frequencies with theoretical probability measure. Solution. \\(\\sum_{k = 0}^{\\infty} p(1 - p)^k = p \\sum_{k = 0}^{\\infty} (1 - p)^k = p \\frac{1}{1 - 1 + p} = 1\\). We used the formula for geometric series. \\(\\sum_{k = 0}^{\\infty} \\frac{\\lambda^k e^{-\\lambda}}{k!} = e^{-\\lambda} \\sum_{k = 0}^{\\infty} \\frac{\\lambda^k}{k!} = e^{-\\lambda} e^{\\lambda} = 1.\\) We used the Taylor expansion of the exponential function. Since we only have to define a probability measure, we could only assign probabilities that sum to one to a finite number of events in \\(\\Omega\\), and probability zero to the other infinite number of events. However to make this solution more educational, we will try to find a measure that assigns a non-zero probability to all events in \\(\\Omega\\). A good start for this would be to find a converging infinite series, as the probabilities will have to sum to one. One simple converging series is the geometric series \\(\\sum_{k=0}^{\\infty} p^k\\) for \\(|p| < 1\\). Let us choose an arbitrary \\(p = 0.5\\). Then \\(\\sum_{k=0}^{\\infty} p^k = \\frac{1}{1 - 0.5} = 2\\). To complete the measure, we have to normalize it, so it sums to one, therefore \\(P(\\{k\\}) = \\frac{0.5^k}{2}\\) is a probability measure on \\(\\Omega\\). We could make it even more difficult by making this measure dependent on some parameter \\(\\alpha\\), but this is out of the scope of this introductory chapter. set.seed(1) pois_samp <- rpois(n = 1000, lambda = 10) pois_samp <- data.frame(x = pois_samp) %>% count(x) %>% mutate(n = n / 1000, type = "empirical_frequencies") %>% bind_rows(data.frame(x = 0:25, n = dpois(0:25, lambda = 10), type = "theoretical_measure")) pois_plot <- ggplot(data = pois_samp, aes(x = x, y = n, fill = type)) + geom_bar(stat="identity", position = "dodge") plot(pois_plot) Exercise 1.14 Define a probability measure on \\((\\Omega = \\mathbb{Z}, 2^{\\mathbb{Z}})\\). Define a probability measure such that \\(P(\\omega) > 0, \\forall \\omega \\in \\Omega\\). R: Implement a random generator that will generate samples with the relative frequency that corresponds to your probability measure. Compare relative frequencies with theoretical probability measure . Solution. 
\\(P(0) = 1, P(\\omega) = 0, \\forall \\omega \\neq 0\\). \\(P(\\{k\\}) = \\sum_{k = -\\infty}^{\\infty} \\frac{p(1 - p)^{|k|}}{2^{1 - 1_0(k)}}\\), where \\(1_0(k)\\) is the indicator function, which equals to one if \\(k\\) is 0, and equals to zero in every other case. n <- 1000 geom_samps <- rgeom(n, prob = 0.5) sign_samps <- sample(c(FALSE, TRUE), size = n, replace = TRUE) geom_samps[sign_samps] <- -geom_samps[sign_samps] my_pmf <- function (k, p) { indic <- rep(1, length(k)) indic[k == 0] <- 0 return ((p * (1 - p)^(abs(k))) / 2^indic) } geom_samps <- data.frame(x = geom_samps) %>% count(x) %>% mutate(n = n / 1000, type = "empirical_frequencies") %>% bind_rows(data.frame(x = -10:10, n = my_pmf(-10:10, 0.5), type = "theoretical_measure")) geom_plot <- ggplot(data = geom_samps, aes(x = x, y = n, fill = type)) + geom_bar(stat="identity", position = "dodge") plot(geom_plot) Exercise 1.15 Define a probability measure on \\(\\Omega = \\{1,2,3,4,5,6\\}\\) with parameter \\(m \\in \\{1,2,3,4,5,6\\}\\), so that the probability of outcome at distance \\(1\\) from \\(m\\) is half of the probability at distance \\(0\\), at distance \\(2\\) is half of the probability at distance \\(1\\), etc. R: Implement a random generator that will generate samples with the relative frequency that corresponds to your probability measure. Compare relative frequencies with theoretical probability measure . Solution. \\(P(\\{k\\}) = \\frac{\\frac{1}{2}^{|m - k|}}{\\sum_{i=1}^6 \\frac{1}{2}^{|m - i|}}\\) n <- 10000 m <- 4 my_pmf <- function (k, m) { denom <- sum(0.5^abs(m - 1:6)) return (0.5^abs(m - k) / denom) } samps <- c() for (i in 1:n) { a <- sample(1:6, 1) a_val <- my_pmf(a, m) prob <- runif(1) if (prob < a_val) { samps <- c(samps, a) } } samps <- data.frame(x = samps) %>% count(x) %>% mutate(n = n / length(samps), type = "empirical_frequencies") %>% bind_rows(data.frame(x = 1:6, n = my_pmf(1:6, m), type = "theoretical_measure")) my_plot <- ggplot(data = samps, aes(x = x, y = n, fill = type)) + geom_bar(stat="identity", position = "dodge") plot(my_plot) "],["uprobspaces.html", "Chapter 2 Uncountable probability spaces 2.1 Borel sets 2.2 Lebesgue measure", " Chapter 2 Uncountable probability spaces This chapter deals with uncountable probability spaces. The students are expected to acquire the following knowledge: Theoretical Understand Borel sets and identify them. Estimate Lebesgue measure for different sets. Know when sets are Borel-measurable. Understanding of countable and uncountable sets. R Uniform sampling. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } 2.1 Borel sets Exercise 2.1 Prove that the intersection of two sigma algebras on \\(\\Omega\\) is a sigma algebra. Prove that the collection of all open subsets \\((a,b)\\) on \\((0,1]\\) is not a sigma algebra of \\((0,1]\\). Solution. Empty set: \\[\\begin{equation} \\emptyset \\in \\mathcal{A} \\wedge \\emptyset \\in \\mathcal{B} \\Rightarrow \\emptyset \\in \\mathcal{A} \\cap \\mathcal{B} \\end{equation}\\] Complement: \\[\\begin{equation} \\text{Let } A \\in \\mathcal{A} \\cap \\mathcal{B} \\Rightarrow A \\in \\mathcal{A} \\wedge A \\in \\mathcal{B} \\Rightarrow A^c \\in \\mathcal{A} \\wedge A^c \\in \\mathcal{B} \\Rightarrow A^c \\in \\mathcal{A} \\cap \\mathcal{B} \\end{equation}\\] Countable additivity: Let \\(\\{A_i\\}\\) be a countable sequence of subsets in \\(\\mathcal{A} \\cap \\mathcal{B}\\). 
\\[\\begin{equation} \\forall i: A_i \\in \\mathcal{A} \\cap \\mathcal{B} \\Rightarrow A_i \\in \\mathcal{A} \\wedge A_i \\in \\mathcal{B} \\Rightarrow \\cup A_i \\in \\mathcal{A} \\wedge \\cup A_i \\in \\mathcal{B} \\Rightarrow \\cup A_i \\in \\mathcal{A} \\cap \\mathcal{B} \\end{equation}\\] Let \\(A\\) denote the collection of all open subsets \\((a,b)\\) on \\((0,1]\\). Then \\((0,1) \\in A\\). But \\((0,1)^c = 1 \\notin A\\). Exercise 2.2 Show that \\(\\mathcal{C} = \\sigma(\\mathcal{C})\\) if and only if \\(\\mathcal{C}\\) is a sigma algebra. Solution. “\\(\\Rightarrow\\)” This follows from the definition of a generated sigma algebra. “\\(\\Leftarrow\\)” Let \\(\\mathcal{F} = \\cap_i F_i\\) be the intersection of all sigma algebras that contain \\(\\mathcal{C}\\). Then \\(\\sigma(\\mathcal{C}) = \\mathcal{F}\\). Additionally, \\(\\forall i: \\mathcal{C} \\in F_i\\). So each \\(F_i\\) can be written as \\(F_i = \\mathcal{C} \\cup D\\), where \\(D\\) are the rest of the elements in the sigma algebra. In other words, each sigma algebra in the collection contains at least \\(\\mathcal{C}\\), but can contain other elements. Now for some \\(j\\), \\(F_j = \\mathcal{C}\\) as \\(\\{F_i\\}\\) contains all sigma algebras that contain \\(\\mathcal{C}\\) and \\(\\mathcal{C}\\) is such a sigma algebra. Since this is the smallest subset in the intersection it follows that \\(\\sigma(\\mathcal{C}) = \\mathcal{F} = \\mathcal{C}\\). Exercise 2.3 Let \\(\\mathcal{C}\\) and \\(\\mathcal{D}\\) be two collections of subsets on \\(\\Omega\\) such that \\(\\mathcal{C} \\subset \\mathcal{D}\\). Prove that \\(\\sigma(\\mathcal{C}) \\subseteq \\sigma(\\mathcal{D})\\). Solution. \\(\\sigma(\\mathcal{D})\\) is a sigma algebra that contains \\(\\mathcal{D}\\). It follows that \\(\\sigma(\\mathcal{D})\\) is a sigma algebra that contains \\(\\mathcal{C}\\). Let us write \\(\\sigma(\\mathcal{C}) = \\cap_i F_i\\), where \\(\\{F_i\\}\\) is the collection of all sigma algebras that contain \\(\\mathcal{C}\\). Since \\(\\sigma(\\mathcal{D})\\) is such a sigma algebra, there exists an index \\(j\\), so that \\(F_j = \\sigma(\\mathcal{D})\\). Then we can write \\[\\begin{align} \\sigma(\\mathcal{C}) &= (\\cap_{i \\neq j} F_i) \\cap \\sigma(\\mathcal{D}) \\\\ &\\subseteq \\sigma(\\mathcal{D}). \\end{align}\\] Exercise 2.4 Prove that the following subsets of \\((0,1]\\) are Borel-measurable by finding their measure. Any countable set. The set of numbers in (0,1] whose decimal expansion does not contain 7. Solution. This follows directly from the fact that every countable set is a union of singletons, whose measure is 0. Let us first look at numbers which have a 7 as the first decimal numbers. Their measure is 0.1. Then we take all the numbers with a 7 as the second decimal number (excluding those who already have it as the first). These have the measure 0.01, and there are 9 of them, so their total measure is 0.09. We can continue to do so infinitely many times. At each \\(n\\), we have the measure of the intervals which is \\(10^n\\) and the number of those intervals is \\(9^{n-1}\\). Now \\[\\begin{align} \\lambda(A) &= 1 - \\sum_{n = 0}^{\\infty} \\frac{9^n}{10^{n+1}} \\\\ &= 1 - \\frac{1}{10} \\sum_{n = 0}^{\\infty} (\\frac{9}{10})^n \\\\ &= 1 - \\frac{1}{10} \\frac{10}{1} \\\\ &= 0. \\end{align}\\] Since we have shown that the measure of the set is \\(0\\), we have also shown that the set is measurable. 
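As an optional illustration (not part of the original exercise), we can sketch a quick Monte Carlo check in R: if we generate the first \\(d\\) decimal digits of uniform draws directly, the fraction of draws containing at least one 7 should be close to \\(1 - 0.9^d\\), which tends to 1 as \\(d\\) grows, in line with the no-7 set having measure zero. The digit counts 5, 20, and 50 below are arbitrary choices.
set.seed(1)
n_samp <- 10000
for (d in c(5, 20, 50)) {
  # simulate the first d decimal digits of each uniform draw directly
  has_seven <- replicate(n_samp, any(sample(0:9, d, replace = TRUE) == 7))
  cat("digits:", d, " empirical:", mean(has_seven), " theoretical:", 1 - 0.9^d, "\n")
}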
Exercise 2.5 Let \\(\\Omega = [0,1]\\), and let \\(\\mathcal{F}_3\\) consist of all countable subsets of \\(\\Omega\\), and all subsets of \\(\\Omega\\) having a countable complement. Show that \\(\\mathcal{F}_3\\) is a sigma algebra. Let us define \\(P(A)=0\\) if \\(A\\) is countable, and \\(P(A) = 1\\) if \\(A\\) has a countable complement. Is \\((\\Omega, \\mathcal{F}_3, P)\\) a legitimate probability space? Solution. The empty set is countable, therefore it is in \\(\\mathcal{F}_3\\). For any \\(A \\in \\mathcal{F}_3\\). If \\(A\\) is countable, then \\(A^c\\) has a countable complement and is in \\(\\mathcal{F}_3\\). If \\(A\\) is uncountable, then it has a countable complement \\(A^c\\) which is therefore also in \\(\\mathcal{F}_3\\). We are left with showing countable additivity. Let \\(\\{A_i\\}\\) be an arbitrary collection of sets in \\(\\mathcal{F}_3\\). We will look at two possibilities. First let all \\(A_i\\) be countable. A countable union of countable sets is countable, and therefore in \\(\\mathcal{F}_3\\). Second, let at least one \\(A_i\\) be uncountable. It follows that it has a countable complement. We can write \\[\\begin{equation} (\\cup_{i=1}^{\\infty} A_i)^c = \\cap_{i=1}^{\\infty} A_i^c. \\end{equation}\\] Since at least one \\(A_i^c\\) on the right side is countable, the whole intersection is countable, and therefore the union has a countable complement. It follows that the union is in \\(\\mathcal{F}_3\\). The tuple \\((\\Omega, \\mathcal{F}_3)\\) is a measurable space. Therefore, we only need to check whether \\(P\\) is a probability measure. The measure of the empty set is zero as it is countable. We have to check for countable additivity. Let us look at three situations. Let \\(A_i\\) be disjoint sets. First, let all \\(A_i\\) be countable. \\[\\begin{equation} P(\\cup_{i=1}^{\\infty} A_i) = \\sum_{i=1}^{\\infty}P( A_i)) = 0. \\end{equation}\\] Since the union is countable, the above equation holds. Second, let exactly one \\(A_i\\) be uncountable. W.L.O.G. let that be \\(A_1\\). Then \\[\\begin{equation} P(\\cup_{i=1}^{\\infty} A_i) = 1 + \\sum_{i=2}^{\\infty}P( A_i)) = 1. \\end{equation}\\] Since the union is uncountable, the above equation holds. Third, let at least two \\(A_i\\) be uncountable. We have to check whether it is possible for two uncountable sets in \\(\\mathcal{F}_3\\) to be disjoint. If that is possible, then their measures would sum to more than one and \\(P\\) would not be a probability measure. W.L.O.G. let \\(A_1\\) and \\(A_2\\) be uncountable. Then we have \\[\\begin{equation} A_1 \\cap A_2 = (A_1^c \\cup A_2^c)^c. \\end{equation}\\] Now \\(A_1^c\\) and \\(A_2^c\\) are countable and their union is therefore countable. Let \\(B = A_1^c \\cup A_2^c\\). So the intersection of \\(A_1\\) and \\(A_2\\) equals the complement of \\(B\\), which is countable. For the intersection to be the empty set, \\(B\\) would have to equal to \\(\\Omega\\). But \\(\\Omega\\) is uncountable and therefore \\(B\\) can not equal to \\(\\Omega\\). It follows that two uncountable sets in \\(\\mathcal{F}_3\\) can not have an empty intersection. Therefore the tuple is a legitimate probability space. 2.2 Lebesgue measure Exercise 2.6 Show that the Lebesgue measure of rational numbers on \\([0,1]\\) is 0. R: Implement a random number generator, which generates uniform samples of irrational numbers in \\([0,1]\\) by uniformly sampling from \\([0,1]\\) and rejecting a sample if it is rational. Solution. There are a countable number of rational numbers. 
Therefore, we can write \\[\\begin{align} \\lambda(\\mathbb{Q}) &= \\lambda(\\cup_{i = 1}^{\\infty} q_i) &\\\\ &= \\sum_{i = 1}^{\\infty} \\lambda(q_i) &\\text{ (countable additivity)} \\\\ &= \\sum_{i = 1}^{\\infty} 0 &\\text{ (Lebesgue measure of a singleton)} \\\\ &= 0. \\end{align}\\] Exercise 2.7 Prove that the Lebesgue measure of \\(\\mathbb{R}\\) is infinity. Paradox. Show that the cardinality of \\(\\mathbb{R}\\) and \\((0,1)\\) is the same, while their Lebesgue measures are infinity and one respectively. Solution. Let \\(a_i\\) be the \\(i\\)-th integer for \\(i \\in \\mathbb{Z}\\). We can write \\(\\mathbb{R} = \\cup_{-\\infty}^{\\infty} (a_i, a_{i + 1}]\\). \\[\\begin{align} \\lambda(\\mathbb{R}) &= \\lambda(\\cup_{i = -\\infty}^{\\infty} (a_i, a_{i + 1}]) \\\\ &= \\lambda(\\lim_{n \\rightarrow \\infty} \\cup_{i = -n}^{n} (a_i, a_{i + 1}]) \\\\ &= \\lim_{n \\rightarrow \\infty} \\lambda(\\cup_{i = -n}^{n} (a_i, a_{i + 1}]) \\\\ &= \\lim_{n \\rightarrow \\infty} \\sum_{i = -n}^{n} \\lambda((a_i, a_{i + 1}]) \\\\ &= \\lim_{n \\rightarrow \\infty} \\sum_{i = -n}^{n} 1 \\\\ &= \\lim_{n \\rightarrow \\infty} 2n \\\\ &= \\infty. \\end{align}\\] We need to find a bijection between \\(\\mathbb{R}\\) and \\((0,1)\\). A well-known function that maps from a bounded interval to \\(\\mathbb{R}\\) is the tangent. To make the bijection easier to achieve, we will take the inverse, which maps from \\(\\mathbb{R}\\) to \\((-\\frac{\\pi}{2}, \\frac{\\pi}{2})\\). However, we need to change the function so it maps to \\((0,1)\\). First we add \\(\\frac{\\pi}{2}\\), so that we move the function above zero. Then we only have to divide by the max value, which in this case is \\(\\pi\\). So our bijection is \\[\\begin{equation} f(x) = \\frac{\\tan^{-1}(x) + \\frac{\\pi}{2}}{\\pi}. \\end{equation}\\] Exercise 2.8 Take the measure space \\((\\Omega_1 = (0,1], B_{(0,1]}, \\lambda)\\) (we know that this is a probability space on \\((0,1]\\)). Define a map (function) from \\(\\Omega_1\\) to \\(\\Omega_2 = \\{1,2,3,4,5,6\\}\\) such that the measure space \\((\\Omega_2, 2^{\\Omega_2}, \\lambda(f^{-1}()))\\) will be a discrete probability space with uniform probabilities (\\(P(\\omega) = \\frac{1}{6}, \\forall \\omega \\in \\Omega_2)\\). Is the map that you defined in (a) the only such map? How would you in the same fashion define a map that would result in a probability space that can be interpreted as a coin toss with probability \\(p\\) of heads? R: Use the map in (a) as a basis for a random generator for this fair die. Solution. In other words, we have to assign disjunct intervals of the same size to each element of \\(\\Omega_2\\). Therefore \\[\\begin{equation} f(x) = \\lceil 6x \\rceil. \\end{equation}\\] No, we could for example rearrange the order in which the intervals are mapped to integers. Additionally, we could have several disjoint intervals that mapped to the same integer, as long as the Lebesgue measure of their union would be \\(\\frac{1}{6}\\) and the function would remain injective. We have \\(\\Omega_3 = \\{0,1\\}\\), where zero represents heads and one represents tails. Then \\[\\begin{equation} f(x) = 0^{I_{A}(x)}, \\end{equation}\\] where \\(A = \\{y \\in (0,1] : y < p\\}\\). 
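As a small addition, here is a sketch of a random generator for the coin toss from the last point, assuming \\(p = 0.7\\) purely for illustration (0 encodes heads, matching the map above); the generator for the fair die from (a) follows below.
set.seed(1)
p <- 0.7
unif_c <- runif(1000)
coin_s <- as.numeric(unif_c >= p)   # 0 = heads (sample falls in A, i.e. below p), 1 = tails
table(coin_s) / length(coin_s)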
set.seed(1) unif_s <- runif(1000) die_s <- ceiling(6 * unif_s) summary(as.factor(die_s)) ## 1 2 3 4 5 6 ## 166 154 200 146 166 168 "],["condprob.html", "Chapter 3 Conditional probability 3.1 Calculating conditional probabilities 3.2 Conditional independence 3.3 Monty Hall problem", " Chapter 3 Conditional probability This chapter deals with conditional probability. The students are expected to acquire the following knowledge: Theoretical Identify whether variables are independent. Calculation of conditional probabilities. Understanding of conditional dependence and independence. How to apply Bayes’ theorem to solve difficult probabilistic questions. R Simulating conditional probabilities. cumsum. apply. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } 3.1 Calculating conditional probabilities Exercise 3.1 A military officer is in charge of identifying enemy aircraft and shooting them down. He is able to positively identify an enemy airplane 95% of the time and positively identify a friendly airplane 90% of the time. Furthermore, 99% of the airplanes are friendly. When the officer identifies an airplane as an enemy airplane, what is the probability that it is not and they will shoot at a friendly airplane? Solution. Let \\(E = 0\\) denote that the observed plane is friendly and \\(E=1\\) that it is an enemy. Let \\(I = 0\\) denote that the officer identified it as friendly and \\(I = 1\\) as enemy. Then \\[\\begin{align} P(E = 0 | I = 1) &= \\frac{P(I = 1 | E = 0)P(E = 0)}{P(I = 1)} \\\\ &= \\frac{P(I = 1 | E = 0)P(E = 0)}{P(I = 1 | E = 0)P(E = 0) + P(I = 1 | E = 1)P(E = 1)} \\\\ &= \\frac{0.1 \\times 0.99}{0.1 \\times 0.99 + 0.95 \\times 0.01} \\\\ &= 0.91. \\end{align}\\] Exercise 3.2 R: Consider tossing a fair die. Let \\(A = \\{2,4,6\\}\\) and \\(B = \\{1,2,3,4\\}\\). Then \\(P(A) = \\frac{1}{2}\\), \\(P(B) = \\frac{2}{3}\\) and \\(P(AB) = \\frac{1}{3}\\). Since \\(P(AB) = P(A)P(B)\\), the events \\(A\\) and \\(B\\) are independent. Simulate draws from the sample space and verify that the proportions are the same. Then find two events \\(C\\) and \\(D\\) that are not independent and repeat the simulation. set.seed(1) nsamps <- 10000 tosses <- sample(1:6, nsamps, replace = TRUE) PA <- sum(tosses %in% c(2,4,6)) / nsamps PB <- sum(tosses %in% c(1,2,3,4)) / nsamps PA * PB ## [1] 0.3295095 sum(tosses %in% c(2,4)) / nsamps ## [1] 0.3323 # Let C = {1,2} and D = {2,3} PC <- sum(tosses %in% c(1,2)) / nsamps PD <- sum(tosses %in% c(2,3)) / nsamps PC * PD ## [1] 0.1067492 sum(tosses %in% c(2)) / nsamps ## [1] 0.1622 Exercise 3.3 A machine reports the true value of a thrown 12-sided die 5 out of 6 times. If the machine reports a 1 has been tossed, what is the probability that it is actually a 1? Now let the machine only report whether a 1 has been tossed or not. Does the probability change? R: Use simulation to check your answers to a) and b). Solution. Let \\(T = 1\\) denote that the toss is 1 and \\(M = 1\\) that the machine reports a 1. \\[\\begin{align} P(T = 1 | M = 1) &= \\frac{P(M = 1 | T = 1)P(T = 1)}{P(M = 1)} \\\\ &= \\frac{P(M = 1 | T = 1)P(T = 1)}{\\sum_{k=1}^{12} P(M = 1 | T = k)P(T = k)} \\\\ &= \\frac{\\frac{5}{6}\\frac{1}{12}}{\\frac{5}{6}\\frac{1}{12} + 11 \\frac{1}{6} \\frac{1}{11} \\frac{1}{12}} \\\\ &= \\frac{5}{6}. \\end{align}\\] Yes. 
\\[\\begin{align} P(T = 1 | M = 1) &= \\frac{P(M = 1 | T = 1)P(T = 1)}{P(M = 1)} \\\\ &= \\frac{P(M = 1 | T = 1)P(T = 1)}{\\sum_{k=1}^{12} P(M = 1 | T = k)P(T = k)} \\\\ &= \\frac{\\frac{5}{6}\\frac{1}{12}}{\\frac{5}{6}\\frac{1}{12} + 11 \\frac{1}{6} \\frac{1}{12}} \\\\ &= \\frac{5}{16}. \\end{align}\\] set.seed(1) nsamps <- 10000 report_a <- vector(mode = "numeric", length = nsamps) report_b <- vector(mode = "logical", length = nsamps) truths <- vector(mode = "logical", length = nsamps) for (i in 1:10000) { toss <- sample(1:12, size = 1) truth <- sample(c(TRUE, FALSE), size = 1, prob = c(5/6, 1/6)) truths[i] <- truth if (truth) { report_a[i] <- toss report_b[i] <- toss == 1 } else { remaining <- (1:12)[1:12 != toss] report_a[i] <- sample(remaining, size = 1) report_b[i] <- toss != 1 } } truth_a1 <- truths[report_a == 1] sum(truth_a1) / length(truth_a1) ## [1] 0.8300733 truth_b1 <- truths[report_b] sum(truth_b1) / length(truth_b1) ## [1] 0.3046209 Exercise 3.4 A coin is tossed independently \\(n\\) times. The probability of heads at each toss is \\(p\\). At each time \\(k\\), \\((k = 2,3,...,n)\\) we get a reward at time \\(k+1\\) if \\(k\\)-th toss was a head and the previous toss was a tail. Let \\(A_k\\) be the event that a reward is obtained at time \\(k\\). Are events \\(A_k\\) and \\(A_{k+1}\\) independent? Are events \\(A_k\\) and \\(A_{k+2}\\) independent? R: simulate 10 tosses 10000 times, where \\(p = 0.7\\). Check your answers to a) and b) by counting the frequencies of the events \\(A_5\\), \\(A_6\\), and \\(A_7\\). Solution. For \\(A_k\\) to happen, we need the tosses \\(k-2\\) and \\(k-1\\) be tails and heads respectively. For \\(A_{k+1}\\) to happen, we need tosses \\(k-1\\) and \\(k\\) be tails and heads respectively. As the toss \\(k-1\\) need to be heads for one and tails for the other, these two events can not happen simultaneously. Therefore the probability of their intersection is 0. But the probability of each of them separately is \\(p(1-p) > 0\\). Therefore, they are not independent. For \\(A_k\\) to happen, we need the tosses \\(k-2\\) and \\(k-1\\) be tails and heads respectively. For \\(A_{k+2}\\) to happen, we need tosses \\(k\\) and \\(k+1\\) be tails and heads respectively. So the probability of intersection is \\(p^2(1-p)^2\\). And the probability of each separately is again \\(p(1-p)\\). Therefore, they are independent. set.seed(1) nsamps <- 10000 p <- 0.7 rewardA_5 <- vector(mode = "logical", length = nsamps) rewardA_6 <- vector(mode = "logical", length = nsamps) rewardA_7 <- vector(mode = "logical", length = nsamps) rewardA_56 <- vector(mode = "logical", length = nsamps) rewardA_57 <- vector(mode = "logical", length = nsamps) for (i in 1:nsamps) { samps <- sample(c(0,1), size = 10, replace = TRUE, prob = c(0.7, 0.3)) rewardA_5[i] <- (samps[4] == 0 & samps[3] == 1) rewardA_6[i] <- (samps[5] == 0 & samps[4] == 1) rewardA_7[i] <- (samps[6] == 0 & samps[5] == 1) rewardA_56[i] <- (rewardA_5[i] & rewardA_6[i]) rewardA_57[i] <- (rewardA_5[i] & rewardA_7[i]) } sum(rewardA_5) / nsamps ## [1] 0.2141 sum(rewardA_6) / nsamps ## [1] 0.2122 sum(rewardA_7) / nsamps ## [1] 0.2107 sum(rewardA_56) / nsamps ## [1] 0 sum(rewardA_57) / nsamps ## [1] 0.0454 Exercise 3.5 A drawer contains two coins. One is an unbiased coin, the other is a biased coin, which will turn up heads with probability \\(p\\) and tails with probability \\(1-p\\). One coin is selected uniformly at random. The selected coin is tossed \\(n\\) times. 
The coin turns up heads \\(k\\) times and tails \\(n-k\\) times. What is the probability that the coin is biased? The selected coin is tossed repeatedly until it turns up heads \\(k\\) times. Given that it is tossed \\(n\\) times in total, what is the probability that the coin is biased? Solution. Let \\(B = 1\\) denote that the coin is biased and let \\(H = k\\) denote that we’ve seen \\(k\\) heads. \\[\\begin{align} P(B = 1 | H = k) &= \\frac{P(H = k | B = 1)P(B = 1)}{P(H = k)} \\\\ &= \\frac{P(H = k | B = 1)P(B = 1)}{P(H = k | B = 1)P(B = 1) + P(H = k | B = 0)P(B = 0)} \\\\ &= \\frac{p^k(1-p)^{n-k} 0.5}{p^k(1-p)^{n-k} 0.5 + 0.5^{n+1}} \\\\ &= \\frac{p^k(1-p)^{n-k}}{p^k(1-p)^{n-k} + 0.5^n}. \\end{align}\\] The same results as in a). The only difference between these two scenarios is that in b) the last throw must be heads. However, this holds for the biased and the unbiased coin and therefore does not affect the probability of the coin being biased. Exercise 3.6 Judy goes around the company for Women’s day and shares flowers. In every office she leaves a flower, if there is at least one woman inside. The probability that there’s a woman in the office is \\(\\frac{3}{5}\\). What is the probability that Judy leaves her first flower in the fourth office? Given that she has given away exactly three flowers in the first four offices, what is the probability that she gives her fourth flower in the eighth office? What is the probability that she leaves the second flower in the fifth office? What is the probability that she leaves the second flower in the fifth office, given that she did not leave the second flower in the second office? Judy needs a new supply of flowers immediately after the office, where she gives away her last flower. What is the probability that she visits at least five offices, if she starts with two flowers? R: simulate Judy’s walk 10000 times to check your answers a) - e). Solution. Let \\(X_i = k\\) denote the event that … \\(i\\)-th sample on the \\(k\\)-th run. Since the events are independent, we can multiply their probabilities to get \\[\\begin{equation} P(X_1 = 4) = 0.4^3 \\times 0.6 = 0.0384. \\end{equation}\\] Same as in a) as we have a fresh start after first four offices. For this to be possible, she had to leave the first flower in one of the first four offices. Therefore there are four possibilities, and for each of those the probability is \\(0.4^3 \\times 0.6\\). Additionally, the probability that she leaves a flower in the fifth office is \\(0.6\\). So \\[\\begin{equation} P(X_2 = 5) = \\binom{4}{1} \\times 0.4^3 \\times 0.6^2 = 0.09216. \\end{equation}\\] We use Bayes’ theorem. \\[\\begin{align} P(X_2 = 5 | X_2 \\neq 2) &= \\frac{P(X_2 \\neq 2 | X_2 = 5)P(X_2 = 5)}{P(X_2 \\neq 2)} \\\\ &= \\frac{0.09216}{0.64} \\\\ &= 0.144. \\end{align}\\] The denominator in the second equation can be calculated as follows. One of three things has to happen for the second not to be dealt in the second round. First, both are zero, so \\(0.4^2\\). Second, first is zero, and second is one, so \\(0.4 \\times 0.6\\). Third, the first is one and the second one zero, so \\(0.6 \\times 0.4\\). Summing these values we get \\(0.64\\). We will look at the complement, so the events that she gave away exactly two flowers after two, three and four offices. \\[\\begin{equation} P(X_2 \\geq 5) = 1 - 0.6^2 - 2 \\times 0.4 \\times 0.6^2 - 3 \\times 0.4^2 \\times 0.6^2 = 0.1792. \\end{equation}\\] The multiplying parts represent the possibilities of the first flower. 
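Before the simulation, a quick numerical check of the analytic answers from a)–e) (a sketch; these values can be compared with the simulated frequencies below):
p_a <- 0.4^3 * 0.6                                      # a) first flower in the fourth office
p_b <- p_a                                              # b) fresh start after the first four offices
p_c <- choose(4, 1) * 0.4^3 * 0.6^2                     # c) second flower in the fifth office
p_d <- p_c / 0.64                                       # d) conditioned on no second flower in office two
p_e <- 1 - 0.6^2 - 2 * 0.4 * 0.6^2 - 3 * 0.4^2 * 0.6^2  # e) at least five offices with two flowers
round(c(a = p_a, b = p_b, c = p_c, d = p_d, e = p_e), 5)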
set.seed(1) nsamps <- 100000 Judyswalks <- matrix(data = NA, nrow = nsamps, ncol = 8) for (i in 1:nsamps) { thiswalk <- sample(c(0,1), size = 8, replace = TRUE, prob = c(0.4, 0.6)) Judyswalks[i, ] <- thiswalk } csJudy <- t(apply(Judyswalks, 1, cumsum)) # a sum(csJudy[ ,4] == 1 & csJudy[ ,3] == 0) / nsamps ## [1] 0.03848 # b csJsubset <- csJudy[csJudy[ ,4] == 3 & csJudy[ ,3] == 2, ] sum(csJsubset[ ,8] == 4 & csJsubset[ ,7] == 3) / nrow(csJsubset) ## [1] 0.03665893 # c sum(csJudy[ ,5] == 2 & csJudy[ ,4] == 1) / nsamps ## [1] 0.09117 # d sum(csJudy[ ,5] == 2 & csJudy[ ,4] == 1) / sum(csJudy[ ,2] != 2) ## [1] 0.1422398 # e sum(csJudy[ ,4] < 2) / nsamps ## [1] 0.17818 3.2 Conditional independence Exercise 3.7 Describe: A real-world example of two events \\(A\\) and \\(B\\) that are dependent but become conditionally independent if conditioned on a third event \\(C\\). A real-world example of two events \\(A\\) and \\(B\\) that are independent, but become dependent if conditioned on some third event \\(C\\). Solution. Let \\(A\\) be the height of a person and let \\(B\\) be the person’s knowledge of the Dutch language. These events are dependent since the Dutch are known to be taller than average. However if \\(C\\) is the nationality of the person, then \\(A\\) and \\(B\\) are independent given \\(C\\). Let \\(A\\) be the event that Mary passes the exam and let \\(B\\) be the event that John passes the exam. These events are independent. However, if the event \\(C\\) is that Mary and John studied together, then \\(A\\) and \\(B\\) are conditionally dependent given \\(C\\). Exercise 3.8 We have two coins of identical appearance. We know that one is a fair coin and the other flips heads 80% of the time. We choose one of the two coins uniformly at random. We discard the coin that was not chosen. We now flip the chosen coin independently 10 times, producing a sequence \\(Y_1 = y_1\\), \\(Y_2 = y_2\\), …, \\(Y_{10} = y_{10}\\). Intuitively, without doing and computation, are these random variables independent? Compute the probability \\(P(Y_1 = 1)\\). Compute the probabilities \\(P(Y_2 = 1 | Y_1 = 1)\\) and \\(P(Y_{10} = 1 | Y_1 = 1,...,Y_9 = 1)\\). Given your answers to b) and c), would you now change your answer to a)? If so, discuss why your intuition had failed. Solution. \\(P(Y_1 = 1) = 0.5 * 0.8 + 0.5 * 0.5 = 0.65\\). Since we know that \\(Y_1 = 1\\) this should change our view of the probability of the coin being biased or not. Let \\(B = 1\\) denote the event that the coin is biased and let \\(B = 0\\) denote that the coin is unbiased. By using marginal probability, we can write \\[\\begin{align} P(Y_2 = 1 | Y_1 = 1) &= P(Y_2 = 1, B = 1 | Y_1 = 1) + P(Y_2 = 1, B = 0 | Y_1 = 1) \\\\ &= \\sum_{k=1}^2 P(Y_2 = 1 | B = k, Y_1 = 1)P(B = k | Y_1 = 1) \\\\ &= 0.8 \\frac{P(Y_1 = 1 | B = 1)P(B = 1)}{P(Y_1 = 1)} + 0.5 \\frac{P(Y_1 = 1 | B = 0)P(B = 0)}{P(Y_1 = 1)} \\\\ &= 0.8 \\frac{0.8 \\times 0.5}{0.65} + 0.5 \\frac{0.5 \\times 0.5}{0.65} \\\\ &\\approx 0.68. \\end{align}\\] For the other calculation we follow the same procedure. Let \\(X = 1\\) denote that first nine tosses are all heads (equivalent to \\(Y_1 = 1\\),…, \\(Y_9 = 1\\)). 
\\[\\begin{align} P(Y_{10} = 1 | X = 1) &= P(Y_2 = 1, B = 1 | X = 1) + P(Y_2 = 1, B = 0 | X = 1) \\\\ &= \\sum_{k=1}^2 P(Y_2 = 1 | B = k, X = 1)P(B = k | X = 1) \\\\ &= 0.8 \\frac{P(X = 1 | B = 1)P(B = 1)}{P(X = 1)} + 0.5 \\frac{P(X = 1 | B = 0)P(B = 0)}{P(X = 1)} \\\\ &= 0.8 \\frac{0.8^9 \\times 0.5}{0.5 \\times 0.8^9 + 0.5 \\times 0.5^9} + 0.5 \\frac{0.5^9 \\times 0.5}{0.5 \\times 0.8^9 + 0.5 \\times 0.5^9} \\\\ &\\approx 0.8. \\end{align}\\] 3.3 Monty Hall problem The Monty Hall problem is a famous probability puzzle with non-intuitive outcome. Many established mathematicians and statisticians had problems solving it and many even disregarded the correct solution until they’ve seen the proof by simulation. Here we will show how it can be solved relatively simply with the use of Bayes’ theorem if we select the variables in a smart way. Exercise 3.9 (Monty Hall problem) A prize is placed at random behind one of three doors. You pick a door. Now Monty Hall chooses one of the other two doors, opens it and shows you that it is empty. He then gives you the opportunity to keep your door or switch to the other unopened door. Should you stay or switch? Use Bayes’ theorem to calculate the probability of winning if you switch and if you do not. R: Check your answers in R. Solution. W.L.O.G. assume we always pick the first door. The host can only open door 2 or door 3, as he can not open the door we picked. Let \\(k \\in \\{2,3\\}\\). Let us first look at what happens if we do not change. Then we have \\[\\begin{align} P(\\text{car in 1} | \\text{open $k$}) &= \\frac{P(\\text{open $k$} | \\text{car in 1})P(\\text{car in 1})}{P(\\text{open $k$})} \\\\ &= \\frac{P(\\text{open $k$} | \\text{car in 1})P(\\text{car in 1})}{\\sum_{n=1}^3 P(\\text{open $k$} | \\text{car in $n$})P(\\text{car in $n$)}}. \\end{align}\\] The probability that he opened \\(k\\) if the car is in 1 is \\(\\frac{1}{2}\\), as he can choose between door 2 and 3 as both have a goat behind it. Let us look at the normalization constant. When \\(n = 1\\) we get the value in the nominator. When \\(n=k\\), we get 0, as he will not open the door if there’s a prize behind. The remaining option is that we select 1, the car is behind \\(k\\) and he opens the only door left. Since he can’t open 1 due to it being our pick and \\(k\\) due to having the prize, the probability of opening the remaining door is 1, and the prior probability of the car being behind this door is \\(\\frac{1}{3}\\). So we have \\[\\begin{align} P(\\text{car in 1} | \\text{open $k$}) &= \\frac{\\frac{1}{2}\\frac{1}{3}}{\\frac{1}{2}\\frac{1}{3} + \\frac{1}{3}} \\\\ &= \\frac{1}{3}. \\end{align}\\] Now let us look at what happens if we do change. Let \\(k' \\in \\{2,3\\}\\) be the door that is not opened. If we change, we select this door, so we have \\[\\begin{align} P(\\text{car in $k'$} | \\text{open $k$}) &= \\frac{P(\\text{open $k$} | \\text{car in $k'$})P(\\text{car in $k'$})}{P(\\text{open $k$})} \\\\ &= \\frac{P(\\text{open $k$} | \\text{car in $k'$})P(\\text{car in $k'$})}{\\sum_{n=1}^3 P(\\text{open $k$} | \\text{car in $n$})P(\\text{car in $n$)}}. \\end{align}\\] The denominator stays the same, the only thing that is different from before is \\(P(\\text{open $k$} | \\text{car in $k'$})\\). We have a situation where we initially selected door 1 and the car is in door \\(k'\\). The probability that the host will open door \\(k\\) is then 1, as he can not pick any other door. 
So we have \\[\\begin{align} P(\\text{car in $k'$} | \\text{open $k$}) &= \\frac{\\frac{1}{3}}{\\frac{1}{2}\\frac{1}{3} + \\frac{1}{3}} \\\\ &= \\frac{2}{3}. \\end{align}\\] Therefore it makes sense to change the door. set.seed(1) nsamps <- 1000 ifchange <- vector(mode = "logical", length = nsamps) ifstay <- vector(mode = "logical", length = nsamps) for (i in 1:nsamps) { where_car <- sample(c(1:3), 1) where_player <- sample(c(1:3), 1) open_samp <- (1:3)[where_car != (1:3) & where_player != (1:3)] if (length(open_samp) == 1) { where_open <- open_samp } else { where_open <- sample(open_samp, 1) } ifstay[i] <- where_car == where_player where_ifchange <- (1:3)[where_open != (1:3) & where_player != (1:3)] ifchange[i] <- where_ifchange == where_car } sum(ifstay) / nsamps ## [1] 0.328 sum(ifchange) / nsamps ## [1] 0.672 "],["rvs.html", "Chapter 4 Random variables 4.1 General properties and calculations 4.2 Discrete random variables 4.3 Continuous random variables 4.4 Singular random variables 4.5 Transformations", " Chapter 4 Random variables This chapter deals with random variables and their distributions. The students are expected to acquire the following knowledge: Theoretical Identification of random variables. Convolutions of random variables. Derivation of PDF, PMF, CDF, and quantile function. Definitions and properties of common discrete random variables. Definitions and properties of common continuous random variables. Transforming univariate random variables. R Familiarize with PDF, PMF, CDF, and quantile functions for several distributions. Visual inspection of probability distributions. Analytical and empirical calculation of probabilities based on distributions. New R functions for plotting (for example, facet_wrap). Creating random number generators based on the Uniform distribution. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } 4.1 General properties and calculations Exercise 4.1 Which of the functions below are valid CDFs? Find their respective densities. R: Plot the three functions. \\[\\begin{equation} F(x) = \\begin{cases} 1 - e^{-x^2} & x \\geq 0 \\\\ 0 & x < 0. \\end{cases} \\end{equation}\\] \\[\\begin{equation} F(x) = \\begin{cases} e^{-\\frac{1}{x}} & x > 0 \\\\ 0 & x \\leq 0. \\end{cases} \\end{equation}\\] \\[\\begin{equation} F(x) = \\begin{cases} 0 & x \\leq 0 \\\\ \\frac{1}{3} & 0 < x \\leq \\frac{1}{2} \\\\ 1 & x > \\frac{1}{2}. \\end{cases} \\end{equation}\\] Solution. Yes. First, let us check the limits. \\(\\lim_{x \\rightarrow -\\infty} (0) = 0\\). \\(\\lim_{x \\rightarrow \\infty} (1 - e^{-x^2}) = 1 - \\lim_{x \\rightarrow \\infty} e^{-x^2} = 1 - 0 = 1\\). Second, let us check whether the function is increasing. Let \\(x > y \\geq 0\\). Then \\(1 - e^{-x^2} \\geq 1 - e^{-y^2}\\). We only have to check right continuity for the point zero. \\(F(0) = 0\\) and \\(\\lim_{\\epsilon \\downarrow 0}F (0 + \\epsilon) = \\lim_{\\epsilon \\downarrow 0} 1 - e^{-\\epsilon^2} = 1 - \\lim_{\\epsilon \\downarrow 0} e^{-\\epsilon^2} = 1 - 1 = 0\\). We get the density by differentiating the CDF. \\(p(x) = \\frac{d}{dx} 1 - e^{-x^2} = 2xe^{-x^2}.\\) Students are encouraged to check that this is a proper PDF. Yes. First, let us check the limits. $_{x -} (0) = 0 and \\(\\lim_{x \\rightarrow \\infty} (e^{-\\frac{1}{x}}) = 1\\). Second, let us check whether the function is increasing. Let \\(x > y \\geq 0\\). Then \\(e^{-\\frac{1}{x}} \\geq e^{-\\frac{1}{y}}\\). We only have to check right continuity for the point zero. 
\\(F(0) = 0\\) and \\(\\lim_{\\epsilon \\downarrow 0}F (0 + \\epsilon) = \\lim_{\\epsilon \\downarrow 0} e^{-\\frac{1}{\\epsilon}} = 0\\). We get the density by differentiating the CDF. \\(p(x) = \\frac{d}{dx} e^{-\\frac{1}{x}} = \\frac{1}{x^2}e^{-\\frac{1}{x}}.\\) Students are encouraged to check that this is a proper PDF. No. The function is not right continuous as \\(F(\\frac{1}{2}) = \\frac{1}{3}\\), but \\(\\lim_{\\epsilon \\downarrow 0} F(\\frac{1}{2} + \\epsilon) = 1\\). f1 <- function (x) { tmp <- 1 - exp(-x^2) tmp[x < 0] <- 0 return(tmp) } f2 <- function (x) { tmp <- exp(-(1 / x)) tmp[x <= 0] <- 0 return(tmp) } f3 <- function (x) { tmp <- x tmp[x == x] <- 1 tmp[x <= 0.5] <- 1/3 tmp[x <= 0] <- 0 return(tmp) } cdf_data <- tibble(x = seq(-1, 20, by = 0.001), f1 = f1(x), f2 = f2(x), f3 = f3(x)) %>% melt(id.vars = "x") cdf_plot <- ggplot(data = cdf_data, aes(x = x, y = value, color = variable)) + geom_hline(yintercept = 1) + geom_line() plot(cdf_plot) Exercise 4.2 Let \\(X\\) be a random variable with CDF \\[\\begin{equation} F(x) = \\begin{cases} 0 & x < 0 \\\\ \\frac{x^2}{2} & 0 \\leq x < 1 \\\\ \\frac{1}{2} + \\frac{p}{2} & 1 \\leq x < 2 \\\\ \\frac{1}{2} + \\frac{p}{2} + \\frac{1 - p}{2} & x \\geq 2 \\end{cases} \\end{equation}\\] R: Plot this CDF for \\(p = 0.3\\). Is it a discrete, continuous, or mixed random varible? Find the probability density/mass of \\(X\\). f1 <- function (x, p) { tmp <- x tmp[x >= 2] <- 0.5 + (p * 0.5) + ((1-p) * 0.5) tmp[x < 2] <- 0.5 + (p * 0.5) tmp[x < 1] <- (x[x < 1])^2 / 2 tmp[x < 0] <- 0 return(tmp) } cdf_data <- tibble(x = seq(-1, 5, by = 0.001), y = f1(x, 0.3)) cdf_plot <- ggplot(data = cdf_data, aes(x = x, y = y)) + geom_hline(yintercept = 1) + geom_line(color = "blue") plot(cdf_plot) ::: {.solution} \\(X\\) is a mixed random variable. Since \\(X\\) is a mixed random variable, we have to find the PDF of the continuous part and the PMF of the discrete part. We get the continuous part by differentiating the corresponding CDF, \\(\\frac{d}{dx}\\frac{x^2}{2} = x\\). So the PDF, when \\(0 \\leq x < 1\\), is \\(p(x) = x\\). Let us look at the discrete part now. It has two steps, so this is a discrete distribution with two outcomes – numbers 1 and 2. The first happens with probability \\(\\frac{p}{2}\\), and the second with probability \\(\\frac{1 - p}{2}\\). This reminds us of the Bernoulli distribution. The PMF for the discrete part is \\(P(X = x) = (\\frac{p}{2})^{2 - x} (\\frac{1 - p}{2})^{x - 1}\\). ::: Exercise 4.3 (Convolutions) Convolutions are probability distributions that correspond to sums of independent random variables. Let \\(X\\) and \\(Y\\) be independent discrete variables. Find the PMF of \\(Z = X + Y\\). Hint: Use the law of total probability. Let \\(X\\) and \\(Y\\) be independent continuous variables. Find the PDF of \\(Z = X + Y\\). Hint: Start with the CDF. Solution. \\[\\begin{align} P(Z = z) &= P(X + Y = z) & \\\\ &= \\sum_{k = -\\infty}^\\infty P(X + Y = z | Y = k) P(Y = k) & \\text{ (law of total probability)} \\\\ &= \\sum_{k = -\\infty}^\\infty P(X + k = z | Y = k) P(Y = k) & \\\\ &= \\sum_{k = -\\infty}^\\infty P(X + k = z) P(Y = k) & \\text{ (independence of $X$ and $Y$)} \\\\ &= \\sum_{k = -\\infty}^\\infty P(X = z - k) P(Y = k). & \\end{align}\\] Let \\(f\\) and \\(g\\) be the PDFs of \\(X\\) and \\(Y\\) respectively. 
\\[\\begin{align} F(z) &= P(Z < z) \\\\ &= P(X + Y < z) \\\\ &= \\int_{-\\infty}^{\\infty} P(X + Y < z | Y = y)P(Y = y)dy \\\\ &= \\int_{-\\infty}^{\\infty} P(X + y < z | Y = y)P(Y = y)dy \\\\ &= \\int_{-\\infty}^{\\infty} P(X + y < z)P(Y = y)dy \\\\ &= \\int_{-\\infty}^{\\infty} P(X < z - y)P(Y = y)dy \\\\ &= \\int_{-\\infty}^{\\infty} (\\int_{-\\infty}^{z - y} f(x) dx) g(y) dy \\end{align}\\] Now \\[\\begin{align} p(z) &= \\frac{d}{dz} F(z) & \\\\ &= \\int_{-\\infty}^{\\infty} (\\frac{d}{dz}\\int_{-\\infty}^{z - y} f(x) dx) g(y) dy & \\\\ &= \\int_{-\\infty}^{\\infty} f(z - y) g(y) dy & \\text{ (fundamental theorem of calculus)}. \\end{align}\\] 4.2 Discrete random variables Exercise 4.4 (Binomial random variable) Let \\(X_k\\), \\(k = 1,...,n\\), be random variables with the Bernoulli measure as the PMF. Let \\(X = \\sum_{k=1}^n X_k\\). We call \\(X_k\\) a Bernoulli random variable with parameter \\(p \\in (0,1)\\). Find the CDF of \\(X_k\\). Find PMF of \\(X\\). This is a Binomial random variable with support in \\(\\{0,1,2,...,n\\}\\) and parameters \\(p \\in (0,1)\\) and \\(n \\in \\mathbb{N}_0\\). We denote \\[\\begin{equation} X | n,p \\sim \\text{binomial}(n,p). \\end{equation}\\] Find CDF of \\(X\\). R: Simulate from the binomial distribution with \\(n = 10\\) and \\(p = 0.5\\), and from \\(n\\) Bernoulli distributions with \\(p = 0.5\\). Visually compare the sum of Bernoullis and the binomial. Hint: there is no standard function like rpois for a Bernoulli random variable. Check exercise 1.12 to find out how to sample from a Bernoulli distribution. Solution. There are two outcomes – zero and one. Zero happens with probability \\(1 - p\\). Therefore \\[\\begin{equation} F(k) = \\begin{cases} 0 & k < 0 \\\\ 1 - p & 0 \\leq k < 1 \\\\ 1 & k \\geq 1. \\end{cases} \\end{equation}\\] For the probability of \\(X\\) to be equal to some \\(k \\leq n\\), exactly \\(k\\) Bernoulli variables need to be one, and the others zero. So \\(p^k(1-p)^{n-k}\\). There are \\(\\binom{n}{k}\\) such possible arrangements. Therefore \\[\\begin{align} P(X = k) = \\binom{n}{k} p^k (1 - p)^{n-k}. \\end{align}\\] \\[\\begin{equation} F(k) = \\sum_{i = 0}^{\\lfloor k \\rfloor} \\binom{n}{i} p^i (1 - p)^{n - i} \\end{equation}\\] set.seed(1) nsamps <- 10000 binom_samp <- rbinom(nsamps, size = 10, prob = 0.5) bernoulli_mat <- matrix(data = NA, nrow = nsamps, ncol = 10) for (i in 1:nsamps) { bernoulli_mat[i, ] <- rbinom(10, size = 1, prob = 0.5) } bern_samp <- apply(bernoulli_mat, 1, sum) b_data <- tibble(x = c(binom_samp, bern_samp), type = c(rep("binomial", 10000), rep("Bernoulli_sum", 10000))) b_plot <- ggplot(data = b_data, aes(x = x, fill = type)) + geom_bar(position = "dodge") plot(b_plot) Exercise 4.5 (Geometric random variable) A variable with PMF \\[\\begin{equation} P(k) = p(1-p)^k \\end{equation}\\] is a geometric random variable with support in non-negative integers. It has one parameter \\(p \\in (0,1]\\). We denote \\[\\begin{equation} X | p \\sim \\text{geometric}(p) \\end{equation}\\] Derive the CDF of a geometric random variable. R: Draw 1000 samples from the geometric distribution with \\(p = 0.3\\) and compare their frequencies to theoretical values. Solution. 
\\[\\begin{align} P(X \\leq k) &= \\sum_{i = 0}^k p(1-p)^i \\\\ &= p \\sum_{i = 0}^k (1-p)^i \\\\ &= p \\frac{1 - (1-p)^{k+1}}{1 - (1 - p)} \\\\ &= 1 - (1-p)^{k + 1} \\end{align}\\] set.seed(1) geo_samp <- rgeom(n = 1000, prob = 0.3) geo_samp <- data.frame(x = geo_samp) %>% count(x) %>% mutate(n = n / 1000, type = "empirical_frequencies") %>% bind_rows(data.frame(x = 0:20, n = dgeom(0:20, prob = 0.3), type = "theoretical_measure")) geo_plot <- ggplot(data = geo_samp, aes(x = x, y = n, fill = type)) + geom_bar(stat="identity", position = "dodge") plot(geo_plot) Exercise 4.6 (Poisson random variable) A variable with PMF \\[\\begin{equation} P(k) = \\frac{\\lambda^k e^{-\\lambda}}{k!} \\end{equation}\\] is a Poisson random variable with support in non-negative integers. It has one positive parameter \\(\\lambda\\), which also represents its mean value and variance (a measure of the deviation of the values from the mean – more on mean and variance in the next chapter). We denote \\[\\begin{equation} X | \\lambda \\sim \\text{Poisson}(\\lambda). \\end{equation}\\] This distribution is usually the default choice for modeling counts. We have already encountered a Poisson random variable in exercise 1.13, where we also sampled from this distribution. The CDF of a Poisson random variable is \\(P(X <= x) = e^{-\\lambda} \\sum_{i=0}^x \\frac{\\lambda^{i}}{i!}\\). R: Draw 1000 samples from the Poisson distribution with \\(\\lambda = 5\\) and compare their empirical cumulative distribution function with the theoretical CDF. set.seed(1) pois_samp <- rpois(n = 1000, lambda = 5) pois_samp <- data.frame(x = pois_samp) pois_plot <- ggplot(data = pois_samp, aes(x = x, colour = "ECDF")) + stat_ecdf(geom = "step") + geom_step(data = tibble(x = 0:17, y = ppois(x, 5)), aes(x = x, y = y, colour = "CDF")) + scale_colour_manual("Lgend title", values = c("black", "red")) plot(pois_plot) Exercise 4.7 (Negative binomial random variable) A variable with PMF \\[\\begin{equation} p(k) = \\binom{k + r - 1}{k}(1-p)^r p^k \\end{equation}\\] is a negative binomial random variable with support in non-negative integers. It has two parameters \\(r > 0\\) and \\(p \\in (0,1)\\). We denote \\[\\begin{equation} X | r,p \\sim \\text{NB}(r,p). \\end{equation}\\] Let us reparameterize the negative binomial distribution with \\(q = 1 - p\\). Find the PMF of \\(X \\sim \\text{NB}(1, q)\\). Do you recognize this distribution? Show that the sum of two negative binomial random variables with the same \\(p\\) is also a negative binomial random variable. Hint: Use the fact that the number of ways to place \\(n\\) indistinct balls into \\(k\\) boxes is \\(\\binom{n + k - 1}{n}\\). R: Draw samples from \\(X \\sim \\text{NB}(5, 0.4)\\) and \\(Y \\sim \\text{NB}(3, 0.4)\\). Draw samples from \\(Z = X + Y\\), where you use the parameters calculated in b). Plot both distributions, their sum, and \\(Z\\) using facet_wrap. Be careful, as R uses a different parameterization size=\\(r\\) and prob=\\(1 - p\\). Solution. \\[\\begin{align} P(X = k) &= \\binom{k + 1 - 1}{k}q^1 (1-q)^k \\\\ &= q(1-q)^k. \\end{align}\\] This is the geometric distribution. Let \\(X \\sim \\text{NB}(r_1, p)\\) and \\(Y \\sim \\text{NB}(r_2, p)\\). Let \\(Z = X + Y\\). 
\\[\\begin{align} P(Z = z) &= \\sum_{k = 0}^{\\infty} P(X = z - k)P(Y = k), \\text{ if k < 0, then the probabilities are 0} \\\\ &= \\sum_{k = 0}^{z} P(X = z - k)P(Y = k), \\text{ if k > z, then the probabilities are 0} \\\\ &= \\sum_{k = 0}^{z} \\binom{z - k + r_1 - 1}{z - k}(1 - p)^{r_1} p^{z - k} \\binom{k + r_2 - 1}{k}(1 - p)^{r_2} p^{k} & \\\\ &= \\sum_{k = 0}^{z} \\binom{z - k + r_1 - 1}{z - k} \\binom{k + r_2 - 1}{k}(1 - p)^{r_1 + r_2} p^{z} & \\\\ &= (1 - p)^{r_1 + r_2} p^{z} \\sum_{k = 0}^{z} \\binom{z - k + r_1 - 1}{z - k} \\binom{k + r_2 - 1}{k}& \\end{align}\\] The part before the sum reminds us of the negative binomial distribution with parameters \\(r_1 + r_2\\) and \\(p\\). To complete this term to the negative binomial PMF we need \\(\\binom{z + r_1 + r_2 -1}{z}\\). So the only thing we need to prove is that the sum equals this term. Both terms in the sum can be interpreted as numbers of ways to place a number of balls into boxes. For the left term it is \\(z-k\\) balls into \\(r_1\\) boxes, and for the right \\(k\\) balls into \\(r_2\\) boxes. For each \\(k\\) we are distributing \\(z\\) balls in total. By summing over all \\(k\\), we actually get all the possible placements of \\(z\\) balls into \\(r_1 + r_2\\) boxes. Therefore \\[\\begin{align} P(Z = z) &= (1 - p)^{r_1 + r_2} p^{z} \\sum_{k = 0}^{z} \\binom{z - k + r_1 - 1}{z - k} \\binom{k + r_2 - 1}{k}& \\\\ &= \\binom{z + r_1 + r_2 -1}{z} (1 - p)^{r_1 + r_2} p^{z}. \\end{align}\\] From this it also follows that the sum of geometric distributions with the same parameter is a negative binomial distribution. \\(Z \\sim \\text{NB}(8, 0.4)\\). set.seed(1) nsamps <- 10000 x <- rnbinom(nsamps, size = 5, prob = 0.6) y <- rnbinom(nsamps, size = 3, prob = 0.6) xpy <- x + y z <- rnbinom(nsamps, size = 8, prob = 0.6) samps <- tibble(x, y, xpy, z) samps <- melt(samps) ggplot(data = samps, aes(x = value)) + geom_bar() + facet_wrap(~ variable) 4.3 Continuous random variables Exercise 4.8 (Exponential random variable) A variable \\(X\\) with PDF \\(\\lambda e^{-\\lambda x}\\) is an exponential random variable with support in non-negative real numbers. It has one positive parameter \\(\\lambda\\). We denote \\[\\begin{equation} X | \\lambda \\sim \\text{Exp}(\\lambda). \\end{equation}\\] Find the CDF of an exponential random variable. Find the quantile function of an exponential random variable. Calculate the probability \\(P(1 \\leq X \\leq 3)\\), where \\(X \\sim \\text{Exp(1.5)}\\). R: Check your answer to c) with a simulation (rexp). Plot the probability in a meaningful way. R: Implement PDF, CDF, and the quantile function and compare their values with corresponding R functions visually. Hint: use the size parameter to make one of the curves wider. Solution. \\[\\begin{align} F(x) &= \\int_{0}^{x} \\lambda e^{-\\lambda t} dt \\\\ &= \\lambda \\int_{0}^{x} e^{-\\lambda t} dt \\\\ &= \\lambda (\\frac{1}{-\\lambda}e^{-\\lambda t} |_{0}^{x}) \\\\ &= \\lambda(\\frac{1}{\\lambda} - \\frac{1}{\\lambda} e^{-\\lambda x}) \\\\ &= 1 - e^{-\\lambda x}. \\end{align}\\] \\[\\begin{align} F(F^{-1}(x)) &= x \\\\ 1 - e^{-\\lambda F^{-1}(x)} &= x \\\\ e^{-\\lambda F^{-1}(x)} &= 1 - x \\\\ -\\lambda F^{-1}(x) &= \\ln(1 - x) \\\\ F^{-1}(x) &= - \\frac{ln(1 - x)}{\\lambda}. \\end{align}\\] \\[\\begin{align} P(1 \\leq X \\leq 3) &= P(X \\leq 3) - P(X \\leq 1) \\\\ &= P(X \\leq 3) - P(X \\leq 1) \\\\ &= 1 - e^{-1.5 \\times 3} - 1 + e^{-1.5 \\times 1} \\\\ &\\approx 0.212. 
\\end{align}\\] set.seed(1) nsamps <- 1000 samps <- rexp(nsamps, rate = 1.5) sum(samps >= 1 & samps <= 3) / nsamps ## [1] 0.212 exp_plot <- ggplot(data.frame(x = seq(0, 5, by = 0.01)), aes(x = x)) + stat_function(fun = dexp, args = list(rate = 1.5)) + stat_function(fun = dexp, args = list(rate = 1.5), xlim = c(1,3), geom = "area", fill = "red") plot(exp_plot) exp_pdf <- function(x, lambda) { return (lambda * exp(-lambda * x)) } exp_cdf <- function(x, lambda) { return (1 - exp(-lambda * x)) } exp_quant <- function(q, lambda) { return (-(log(1 - q) / lambda)) } ggplot(data = data.frame(x = seq(0, 5, by = 0.01)), aes(x = x)) + stat_function(fun = dexp, args = list(rate = 1.5), aes(color = "R"), size = 2.5) + stat_function(fun = exp_pdf, args = list(lambda = 1.5), aes(color = "Mine"), size = 1.2) + scale_color_manual(values = c("red", "black")) ggplot(data = data.frame(x = seq(0, 5, by = 0.01)), aes(x = x)) + stat_function(fun = pexp, args = list(rate = 1.5), aes(color = "R"), size = 2.5) + stat_function(fun = exp_cdf, args = list(lambda = 1.5), aes(color = "Mine"), size = 1.2) + scale_color_manual(values = c("red", "black")) ggplot(data = data.frame(x = seq(0, 1, by = 0.01)), aes(x = x)) + stat_function(fun = qexp, args = list(rate = 1.5), aes(color = "R"), size = 2.5) + stat_function(fun = exp_quant, args = list(lambda = 1.5), aes(color = "Mine"), size = 1.2) + scale_color_manual(values = c("red", "black")) Exercise 4.9 (Uniform random variable) Continuous uniform random variable with parameters \\(a\\) and \\(b\\) has the PDF \\[\\begin{equation} p(x) = \\begin{cases} \\frac{1}{b - a} & x \\in [a,b] \\\\ 0 & \\text{otherwise}. \\end{cases} \\end{equation}\\] Find the CDF of the uniform random variable. Find the quantile function of the uniform random variable. Let \\(X \\sim \\text{Uniform}(a,b)\\). Find the CDF of the variable \\(Y = \\frac{X - a}{b - a}\\). This is the standard uniform random variable. Let \\(X \\sim \\text{Uniform}(-1, 3)\\). Find such \\(z\\) that \\(P(X < z + \\mu_x) = \\frac{1}{5}\\). R: Check your result from d) using simulation. Solution. \\[\\begin{align} F(x) &= \\int_{a}^x \\frac{1}{b - a} dt \\\\ &= \\frac{1}{b - a} \\int_{a}^x dt \\\\ &= \\frac{x - a}{b - a}. \\end{align}\\] \\[\\begin{align} F(F^{-1}(p)) &= p \\\\ \\frac{F^{-1}(p) - a}{b - a} &= p \\\\ F^{-1}(p) &= p(b - a) + a. \\end{align}\\] \\[\\begin{align} F_Y(y) &= P(Y < y) \\\\ &= P(\\frac{X - a}{b - a} < y) \\\\ &= P(X < y(b - a) + a) \\\\ &= F_X(y(b - a) + a) \\\\ &= \\frac{(y(b - a) + a) - a}{b - a} \\\\ &= y. \\end{align}\\] \\[\\begin{align} P(X < z + 1) &= \\frac{1}{5} \\\\ F(z + 1) &= \\frac{1}{5} \\\\ z + 1 &= F^{-1}(\\frac{1}{5}) \\\\ z &= \\frac{1}{5}4 - 1 - 1 \\\\ z &= -1.2. \\end{align}\\] set.seed(1) a <- -1 b <- 3 nsamps <- 10000 unif_samp <- runif(nsamps, a, b) mu_x <- mean(unif_samp) new_samp <- unif_samp - mu_x quantile(new_samp, probs = 1/5) ## 20% ## -1.203192 punif(-0.2, -1, 3) ## [1] 0.2 Exercise 4.10 (Beta random variable) A variable \\(X\\) with PDF \\[\\begin{equation} p(x) = \\frac{x^{\\alpha - 1} (1 - x)^{\\beta - 1}}{\\text{B}(\\alpha, \\beta)}, \\end{equation}\\] where \\(\\text{B}(\\alpha, \\beta) = \\frac{\\Gamma(\\alpha) \\Gamma(\\beta)}{\\Gamma(\\alpha + \\beta)}\\) and \\(\\Gamma(x) = \\int_0^{\\infty} x^{z - 1} e^{-x} dx\\) is a Beta random variable with support on \\([0,1]\\). It has two positive parameters \\(\\alpha\\) and \\(\\beta\\). 
Notation: \\[\\begin{equation} X | \\alpha, \\beta \\sim \\text{Beta}(\\alpha, \\beta) \\end{equation}\\] It is often used in modeling rates. Calculate the PDF for \\(\\alpha = 1\\) and \\(\\beta = 1\\). What do you notice? R: Plot densities of the beta distribution for parameter pairs (2, 2), (4, 1), (1, 4), (2, 5), and (0.1, 0.1). R: Sample from \\(X \\sim \\text{Beta}(2, 5)\\) and compare the histogram with the Beta PDF. Solution. \\[\\begin{equation} p(x) = \\frac{x^{1 - 1} (1 - x)^{1 - 1}}{\\text{B}(1, 1)} = 1. \\end{equation}\\] This is the standard uniform distribution. set.seed(1) ggplot(data = data.frame(x = seq(0, 1, by = 0.01)), aes(x = x)) + stat_function(fun = dbeta, args = list(shape1 = 2, shape2 = 2), aes(color = "Beta(2, 2)")) + stat_function(fun = dbeta, args = list(shape1 = 4, shape2 = 1), aes(color = "Beta(4, 1)")) + stat_function(fun = dbeta, args = list(shape1 = 1, shape2 = 4), aes(color = "Beta(1, 4)")) + stat_function(fun = dbeta, args = list(shape1 = 2, shape2 = 5), aes(color = "Beta(2, 5)")) + stat_function(fun = dbeta, args = list(shape1 = 0.1, shape2 = 0.1), aes(color = "Beta(0.1, 0.1)")) set.seed(1) nsamps <- 1000 samps <- rbeta(nsamps, 2, 5) ggplot(data = data.frame(x = samps), aes(x = x)) + geom_histogram(aes(y = ..density..), color = "black") + stat_function(data = data.frame(x = seq(0, 1, by = 0.01)), aes(x = x), fun = dbeta, args = list(shape1 = 2, shape2 = 5), color = "red", size = 1.2) Exercise 4.11 (Gamma random variable) A random variable with PDF \\[\\begin{equation} p(x) = \\frac{\\beta^\\alpha}{\\Gamma(\\alpha)} x^{\\alpha - 1}e^{-\\beta x} \\end{equation}\\] is a Gamma random variable with support on the positive numbers and parameters shape \\(\\alpha > 0\\) and rate \\(\\beta > 0\\). We write \\[\\begin{equation} X | \\alpha, \\beta \\sim \\text{Gamma}(\\alpha, \\beta) \\end{equation}\\] and its CDF is \\[\\begin{equation} \\frac{\\gamma(\\alpha, \\beta x)}{\\Gamma(\\alpha)}, \\end{equation}\\] where \\(\\gamma(s, x) = \\int_0^x t^{s-1} e^{-t} dt\\). It is usually used in modeling positive phenomena (for example, insurance claims and rainfall). Let \\(X \\sim \\text{Gamma}(1, \\beta)\\). Find the PDF of \\(X\\). Do you recognize this PDF? Let \\(k = \\alpha\\) and \\(\\theta = \\frac{1}{\\beta}\\). Find the PDF of \\(X | k, \\theta \\sim \\text{Gamma}(k, \\theta)\\). Random variables can be reparameterized, and sometimes a reparameterized distribution is more suitable for certain calculations. The first parameterization is, for example, usually used in Bayesian statistics, while this parameterization is more common in econometrics and some other applied fields. Note that you also need to pay attention to the parameters in statistical software, so diligently read the help files when using functions like rgamma to see how the function is parameterized. R: Plot the gamma CDF for random variables with shape and rate parameters (1,1), (10,1), and (1,10). Solution. \\[\\begin{align} p(x) &= \\frac{\\beta^1}{\\Gamma(1)} x^{1 - 1}e^{-\\beta x} \\\\ &= \\beta e^{-\\beta x} \\end{align}\\] This is the PDF of the exponential distribution with parameter \\(\\beta\\). \\[\\begin{align} p(x) &= \\frac{1}{\\Gamma(k)\\theta^k} x^{k - 1}e^{-\\frac{x}{\\theta}}.
\\end{align}\\] set.seed(1) ggplot(data = data.frame(x = seq(0, 25, by = 0.01)), aes(x = x)) + stat_function(fun = pgamma, args = list(shape = 1, rate = 1), aes(color = "Gamma(1,1)")) + stat_function(fun = pgamma, args = list(shape = 10, rate = 1), aes(color = "Gamma(10,1)")) + stat_function(fun = pgamma, args = list(shape = 1, rate = 10), aes(color = "Gamma(1,10)")) Exercise 4.12 (Normal random variable) A random variable with PDF \\[\\begin{equation} p(x) = \\frac{1}{\\sqrt{2 \\pi \\sigma^2}} e^{-\\frac{(x - \\mu)^2}{2 \\sigma^2}} \\end{equation}\\] is a normal random variable with support on the real axis and parameters \\(\\mu\\) in reals and \\(\\sigma^2 > 0\\). The first is the mean parameter and the second is the variance parameter. Many statistical methods assume a normal distribution. We denote \\[\\begin{equation} X | \\mu, \\sigma \\sim \\text{N}(\\mu, \\sigma^2), \\end{equation}\\] and it’s CDF is \\[\\begin{equation} F(x) = \\int_{-\\infty}^x \\frac{1}{\\sqrt{2 \\pi \\sigma^2}} e^{-\\frac{(t - \\mu)^2}{2 \\sigma^2}} dt, \\end{equation}\\] which is intractable and is usually approximated. Due to its flexibility it is also one of the most researched distributions. For that reason statisticians often use transformations of variables or approximate distributions with the normal distribution. Show that a variable \\(\\frac{X - \\mu}{\\sigma} \\sim \\text{N}(0,1)\\). This transformation is called standardization, and \\(\\text{N}(0,1)\\) is a standard normal distribution. R: Plot the normal distribution with \\(\\mu = 0\\) and different values for the \\(\\sigma\\) parameter. R: The normal distribution provides a good approximation for the Poisson distribution with a large \\(\\lambda\\). Let \\(X \\sim \\text{Poisson}(50)\\). Approximate \\(X\\) with the normal distribution and compare its density with the Poisson histogram. What are the values of \\(\\mu\\) and \\(\\sigma^2\\) that should provide the best approximation? Note that R function rnorm takes standard deviation (\\(\\sigma\\)) as a parameter and not variance. Solution. \\[\\begin{align} P(\\frac{X - \\mu}{\\sigma} < x) &= P(X < \\sigma x + \\mu) \\\\ &= F(\\sigma x + \\mu) \\\\ &= \\int_{-\\infty}^{\\sigma x + \\mu} \\frac{1}{\\sqrt{2 \\pi \\sigma^2}} e^{-\\frac{(t - \\mu)^2}{2\\sigma^2}} dt \\end{align}\\] Now let \\(s = f(t) = \\frac{t - \\mu}{\\sigma}\\), then \\(ds = \\frac{dt}{\\sigma}\\) and \\(f(\\sigma x + \\mu) = x\\), so \\[\\begin{align} P(\\frac{X - \\mu}{\\sigma} < x) &= \\int_{-\\infty}^{x} \\frac{1}{\\sqrt{2 \\pi}} e^{-\\frac{s^2}{2}} ds. \\end{align}\\] There is no need to evaluate this integral, as we recognize it as the CDF of a normal distribution with \\(\\mu = 0\\) and \\(\\sigma^2 = 1\\). 
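As an optional check before the plots, a small simulation of the standardization from a) (a sketch with arbitrarily chosen \\(\\mu = 2\\) and \\(\\sigma = 3\\)); the code for b) and c) follows.
set.seed(1)
mu <- 2
sigma <- 3
x <- rnorm(100000, mean = mu, sd = sigma)
z <- (x - mu) / sigma
round(c(mean = mean(z), sd = sd(z)), 3)   # should be close to 0 and 1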
set.seed(1) # b ggplot(data = data.frame(x = seq(-15, 15, by = 0.01)), aes(x = x)) + stat_function(fun = dnorm, args = list(mean = 0, sd = 1), aes(color = "sd = 1")) + stat_function(fun = dnorm, args = list(mean = 0, sd = 0.4), aes(color = "sd = 0.4")) + stat_function(fun = dnorm, args = list(mean = 0, sd = 2), aes(color = "sd = 2")) + stat_function(fun = dnorm, args = list(mean = 0, sd = 5), aes(color = "sd = 5")) # c mean_par <- 50 nsamps <- 100000 pois_samps <- rpois(nsamps, lambda = mean_par) norm_samps <- rnorm(nsamps, mean = mean_par, sd = sqrt(mean_par)) my_plot <- ggplot() + geom_bar(data = tibble(x = pois_samps), aes(x = x, y = (..count..)/sum(..count..))) + geom_density(data = tibble(x = norm_samps), aes(x = x), color = "red") plot(my_plot) Exercise 4.13 (Logistic random variable) A logistic random variable has CDF \\[\\begin{equation} F(x) = \\frac{1}{1 + e^{-\\frac{x - \\mu}{s}}}, \\end{equation}\\] where \\(\\mu\\) is real and \\(s > 0\\). The support is on the real axis. We denote \\[\\begin{equation} X | \\mu, s \\sim \\text{Logistic}(\\mu, s). \\end{equation}\\] The distribution of the logistic random variable resembles a normal random variable; however, it has heavier tails. Find the PDF of a logistic random variable. R: Implement the logistic PDF and CDF and visually compare both for \\(X \\sim \\text{N}(0, 1)\\) and \\(Y \\sim \\text{logit}(0, \\sqrt{\\frac{3}{\\pi^2}})\\). These distributions have the same mean and variance. Additionally, produce the same plot on the interval [5,10] to better see the difference in the tails. R: For the distributions in b) find the probability \\(P(|X| > 4)\\) and interpret the result. Solution. \\[\\begin{align} p(x) &= \\frac{d}{dx} \\frac{1}{1 + e^{-\\frac{x - \\mu}{s}}} \\\\ &= \\frac{- \\frac{d}{dx} (1 + e^{-\\frac{x - \\mu}{s}})}{(1 + e^{-\\frac{x - \\mu}{s}})^2} \\\\ &= \\frac{e^{-\\frac{x - \\mu}{s}}}{s(1 + e^{-\\frac{x - \\mu}{s}})^2}. \\end{align}\\] # b set.seed(1) logit_pdf <- function (x, mu, s) { return ((exp(-(x - mu)/(s))) / (s * (1 + exp(-(x - mu)/(s)))^2)) } nl_plot <- ggplot(data = data.frame(x = seq(-12, 12, by = 0.01)), aes(x = x)) + stat_function(fun = dnorm, args = list(mean = 0, sd = 2), aes(color = "normal")) + stat_function(fun = logit_pdf, args = list(mu = 0, s = sqrt(12/pi^2)), aes(color = "logit")) plot(nl_plot) nl_plot <- ggplot(data = data.frame(x = seq(5, 10, by = 0.01)), aes(x = x)) + stat_function(fun = dnorm, args = list(mean = 0, sd = 2), aes(color = "normal")) + stat_function(fun = logit_pdf, args = list(mu = 0, s = sqrt(12/pi^2)), aes(color = "logit")) plot(nl_plot) # c logit_cdf <- function (x, mu, s) { return (1 / (1 + exp(-(x - mu) / s))) } p_logistic <- 1 - logit_cdf(4, 0, sqrt(12/pi^2)) + logit_cdf(-4, 0, sqrt(12/pi^2)) p_norm <- 1 - pnorm(4, 0, 2) + pnorm(-4, 0, 2) p_logistic ## [1] 0.05178347 p_norm ## [1] 0.04550026 # The logistic distribution has heavier tails, therefore the probability of larger # absolute values is higher. 4.4 Singular random variables Exercise 4.14 (Cantor distribution) The Cantor set is a subset of \\([0,1]\\), which we create by iteratively deleting the middle third of the interval. For example, in the first iteration, we get the sets \\([0,\\frac{1}{3}]\\) and \\([\\frac{2}{3},1]\\). In the second iteration, we get \\([0,\\frac{1}{9}]\\), \\([\\frac{2}{9},\\frac{1}{3}]\\), \\([\\frac{2}{3}, \\frac{7}{9}]\\), and \\([\\frac{8}{9}, 1]\\).
On the \\(n\\)-th iteration, we have \\[\\begin{equation} C_n = \\frac{C_{n-1}}{3} \\cup \\bigg(\\frac{2}{3} + \\frac{C_{n-1}}{3} \\bigg), \\end{equation}\\] where \\(C_0 = [0,1]\\). The Cantor set is then defined as the intersection of these sets \\[\\begin{equation} C = \\cap_{n=1}^{\\infty} C_n. \\end{equation}\\] It has the same cardinality as \\([0,1]\\). Another way to define the Cantor set is as the set of all numbers in \\([0,1]\\) that do not have a 1 in their ternary representation \\(x = \\sum_{i=1}^\\infty \\frac{x_i}{3^i}, x_i \\in \\{0,1,2\\}\\). A random variable follows the Cantor distribution, if its CDF is the Cantor function (below). You can find the implementations of the random number generator, CDF, and quantile function for the Cantor distribution at https://github.com/Henrygb/CantorDist.R. Show that the Lebesgue measure of the Cantor set is 0. (Jagannathan) Let us look at an infinite sequence of independent fair-coin tosses. If the \\(i\\)-th outcome is heads, let \\(x_i = 2\\), and let \\(x_i = 0\\) when it is tails. Then use these to create \\(x = \\sum_{i=1}^\\infty \\frac{x_i}{3^i}\\). This is a random variable with the Cantor distribution. Show that \\(X\\) has a singular distribution. Solution. \\[\\begin{align} \\lambda(C) &= 1 - \\lambda(C^c) \\\\ &= 1 - \\frac{1}{3}\\sum_{k = 0}^\\infty (\\frac{2}{3})^k \\\\ &= 1 - \\frac{\\frac{1}{3}}{1 - \\frac{2}{3}} \\\\ &= 0. \\end{align}\\] First, for every \\(x\\), the probability of observing it is \\(\\lim_{n \\rightarrow \\infty} \\frac{1}{2^n} = 0\\), so the distribution has no discrete component. Second, the probability that we observe one of all the possible sequences is 1, therefore \\(P(X \\in C) = 1\\) even though \\(\\lambda(C) = 0\\), so the distribution has no absolutely continuous component either. So this is a singular random variable. The CDF only increases on the elements of the Cantor set. 4.5 Transformations Exercise 4.15 Let \\(X\\) be a random variable that is uniformly distributed on \\(\\{-2, -1, 0, 1, 2\\}\\). Find the PMF of \\(Y = X^2\\). Solution. \\[\\begin{align} P_Y(y) = \\sum_{x : x^2 = y} P_X(x) = \\begin{cases} 0 & y \\notin \\{0,1,4\\} \\\\ \\frac{1}{5} & y = 0 \\\\ \\frac{2}{5} & y \\in \\{1,4\\} \\end{cases} \\end{align}\\] Exercise 4.16 (Lognormal random variable) A lognormal random variable is a variable whose logarithm is normally distributed. In practice, we often encounter skewed data. Using a log transformation on such data usually makes it more symmetric and therefore more suitable for modeling with the normal distribution (more on why we wish to model data with the normal distribution in the following chapters). Let \\(X \\sim \\text{N}(\\mu,\\sigma)\\). Find the PDF of \\(Y: \\log(Y) = X\\). R: Sample from the lognormal distribution with parameters \\(\\mu = 5\\) and \\(\\sigma = 2\\). Plot a histogram of the samples. Then log-transform the samples and plot a histogram along with the theoretical normal PDF. Solution. \\[\\begin{align} p_Y(y) &= p_X(\\log(y)) \\frac{d}{dy} \\log(y) \\\\ &= \\frac{1}{\\sqrt{2 \\pi \\sigma^2}} e^{-\\frac{(\\log(y) - \\mu)^2}{2 \\sigma^2}} \\frac{1}{y} \\\\ &= \\frac{1}{y \\sqrt{2 \\pi \\sigma^2}} e^{-\\frac{(\\log(y) - \\mu)^2}{2 \\sigma^2}}.
\\end{align}\\] set.seed(1) nsamps <- 10000 mu <- 0.5 sigma <- 0.4 ln_samps <- rlnorm(nsamps, mu, sigma) ln_plot <- ggplot(data = data.frame(x = ln_samps), aes(x = x)) + geom_histogram(color = "black") plot(ln_plot) norm_samps <- log(ln_samps) n_plot <- ggplot(data = data.frame(x = norm_samps), aes(x = x)) + geom_histogram(aes(y = ..density..), color = "black") + stat_function(fun = dnorm, args = list(mean = mu, sd = sigma), color = "red") plot(n_plot) Exercise 4.17 (Probability integral transform) This exercise is borrowed from Wasserman. Let \\(X\\) have a continuous, strictly increasing CDF \\(F\\). Let \\(Y = F(X)\\). Find the density of \\(Y\\). This is called the probability integral transform. Let \\(U \\sim \\text{Uniform}(0,1)\\) and let \\(X = F^{-1}(U)\\). Show that \\(X \\sim F\\). R: Implement a program that takes Uniform(0,1) random variables and generates random variables from an exponential(\\(\\beta\\)) distribution. Compare your implemented function with function rexp in R. Solution. \\[\\begin{align} F_Y(y) &= P(Y < y) \\\\ &= P(F(X) < y) \\\\ &= P(X < F_X^{-1}(y)) \\\\ &= F_X(F_X^{-1}(y)) \\\\ &= y. \\end{align}\\] From the above it follows that \\(p(y) = 1\\). Note that we need to know the inverse CDF to be able to apply this procedure. \\[\\begin{align} P(X < x) &= P(F^{-1}(U) < x) \\\\ &= P(U < F(x)) \\\\ &= F_U(F(x)) \\\\ &= F(x). \\end{align}\\] set.seed(1) nsamps <- 10000 beta <- 4 generate_exp <- function (n, beta) { tmp <- runif(n) X <- qexp(tmp, beta) return (X) } df <- tibble("R" = rexp(nsamps, beta), "myGenerator" = generate_exp(nsamps, beta)) %>% gather() ggplot(data = df, aes(x = value, fill = key)) + geom_histogram(position = "dodge") "],["mrvs.html", "Chapter 5 Multiple random variables 5.1 General 5.2 Bivariate distribution examples 5.3 Transformations", " Chapter 5 Multiple random variables This chapter deals with multiple random variables and their distributions. The students are expected to acquire the following knowledge: Theoretical Calculation of PDF of transformed multiple random variables. Finding marginal and conditional distributions. R Scatterplots of bivariate random variables. New R functions (for example, expand.grid). .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } 5.1 General Exercise 5.1 Let \\(X \\sim \\text{N}(0,1)\\) and \\(Y \\sim \\text{N}(0,1)\\) be independent random variables. Draw 1000 samples from \\((X,Y)\\) and plot a scatterplot. Now let \\(X \\sim \\text{N}(0,1)\\) and \\(Y | X = x \\sim N(ax, 1)\\). Draw 1000 samples from \\((X,Y)\\) for \\(a = 1\\), \\(a=0\\), and \\(a=-0.5\\). Plot the scatterplots. How would you interpret parameter \\(a\\)? Plot the marginal distribution of \\(Y\\) for cases \\(a=1\\), \\(a=0\\), and \\(a=-0.5\\). Can you guess which distribution it is? set.seed(1) nsamps <- 1000 x <- rnorm(nsamps) y <- rnorm(nsamps) ggplot(data.frame(x, y), aes(x = x, y = y)) + geom_point() y1 <- rnorm(nsamps, mean = 1 * x) y2 <- rnorm(nsamps, mean = 0 * x) y3 <- rnorm(nsamps, mean = -0.5 * x) df <- tibble(x = c(x,x,x), y = c(y1,y2,y3), a = c(rep(1, nsamps), rep(0, nsamps), rep(-0.5, nsamps))) ggplot(df, aes(x = x, y = y)) + geom_point() + facet_wrap(~a) + coord_equal(ratio=1) # Parameter a controls the scale of linear dependency between X and Y. 
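# A note on the marginal of Y: writing Y = aX + e with e ~ N(0, 1) independent of X
# gives E[Y] = 0 and Var[Y] = 1 + a^2, so Y ~ N(0, 1 + a^2), i.e. variances 2, 1 and 1.25
# for a = 1, 0 and -0.5. The empirical densities below should match these normal densities.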
ggplot(df, aes(x = y)) + geom_density() + facet_wrap(~a) 5.2 Bivariate distribution examples Exercise 5.2 (Discrete bivariate random variable) Let \\(X\\) represent the event that a die rolls an even number and let \\(Y\\) represent the event that a die rolls one, two, or a three. Find the marginal distributions of \\(X\\) and \\(Y\\). Find the PMF of \\((X,Y)\\). Find the CDF of \\((X,Y)\\). Find \\(P(X = 1 | Y = 1)\\). Solution. \\[\\begin{align} P(X = 1) = \\frac{1}{2} \\text{ and } P(X = 0) = \\frac{1}{2} \\\\ P(Y = 1) = \\frac{1}{2} \\text{ and } P(Y = 0) = \\frac{1}{2} \\\\ \\end{align}\\] \\[\\begin{align} P(X = 1, Y = 1) = \\frac{1}{6} \\\\ P(X = 1, Y = 0) = \\frac{2}{6} \\\\ P(X = 0, Y = 1) = \\frac{2}{6} \\\\ P(X = 0, Y = 0) = \\frac{1}{6} \\end{align}\\] \\[\\begin{align} P(X \\leq x, Y \\leq y) = \\begin{cases} \\frac{1}{6} & x = 0, y = 0 \\\\ \\frac{3}{6} & x \\neq y \\\\ 1 & x = 1, y = 1 \\end{cases} \\end{align}\\] \\[\\begin{align} P(X = 1 | Y = 1) = \\frac{1}{3} \\end{align}\\] Exercise 5.3 (Continuous bivariate random variable) Let \\(p(x,y) = 6 (x - y)^2\\) be the PDF of a bivariate random variable \\((X,Y)\\), where both variables range from zero to one. Find CDF. Find marginal distributions. Find conditional distributions. R: Plot a grid of points and colour them by value – this can help us visualize the PDF. R: Implement a random number generator, which will generate numbers from \\((X,Y)\\) and visually check the results. R: Plot the marginal distribution of \\(Y\\) and the conditional distributions of \\(X | Y = y\\), where \\(y \\in \\{0, 0.1, 0.5\\}\\). Solution. \\[\\begin{align} F(x,y) &= \\int_0^{x} \\int_0^{y} 6 (t - s)^2 ds dt\\\\ &= 6 \\int_0^{x} \\int_0^{y} t^2 - 2ts + s^2 ds dt\\\\ &= 6 \\int_0^{x} t^2y - ty^2 + \\frac{y^3}{3} dt \\\\ &= 6 (\\frac{x^3 y}{3} - \\frac{x^2y^2}{2} + \\frac{x y^3}{3}) \\\\ &= 2 x^3 y - 3 t^2y^2 + 2 x y^3 \\end{align}\\] \\[\\begin{align} p(x) &= \\int_0^{1} 6 (x - y)^2 dy\\\\ &= 6 (x^2 - x + \\frac{1}{3}) \\\\ &= 6x^2 - 6x + 2 \\end{align}\\] \\[\\begin{align} p(y) &= \\int_0^{1} 6 (x - y)^2 dx\\\\ &= 6 (y^2 - y + \\frac{1}{3}) \\\\ &= 6y^2 - 6y + 2 \\end{align}\\] \\[\\begin{align} p(x|y) &= \\frac{p(xy)}{p(y)} \\\\ &= \\frac{6 (x - y)^2}{6 (y^2 - y + \\frac{1}{3})} \\\\ &= \\frac{(x - y)^2}{y^2 - y + \\frac{1}{3}} \\end{align}\\] \\[\\begin{align} p(y|x) &= \\frac{p(xy)}{p(x)} \\\\ &= \\frac{6 (x - y)^2}{6 (x^2 - x + \\frac{1}{3})} \\\\ &= \\frac{(x - y)^2}{x^2 - x + \\frac{1}{3}} \\end{align}\\] set.seed(1) # d pxy <- function (x, y) { return ((x - y)^2) } x_axis <- seq(0, 1, length.out = 100) y_axis <- seq(0, 1, length.out = 100) df <- expand.grid(x_axis, y_axis) colnames(df) <- c("x", "y") df <- cbind(df, pdf = pxy(df$x, df$y)) ggplot(data = df, aes(x = x, y = y, color = pdf)) + geom_point() # e samps <- NULL for (i in 1:10000) { xt <- runif(1, 0, 1) yt <- runif(1, 0, 1) pdft <- pxy(xt, yt) acc <- runif(1, 0, 6) if (acc <= pdft) { samps <- rbind(samps, c(xt, yt)) } } colnames(samps) <- c("x", "y") ggplot(data = as.data.frame(samps), aes(x = x, y = y)) + geom_point() # f mar_pdf <- function (x) { return (6 * x^2 - 6 * x + 2) } cond_pdf <- function (x, y) { return (((x - y)^2) / (y^2 - y + 1/3)) } df <- tibble(x = x_axis, mar = mar_pdf(x), y0 = cond_pdf(x, 0), y0.1 = cond_pdf(x, 0.1), y0.5 = cond_pdf(x, 0.5)) %>% gather(dist, value, -x) ggplot(df, aes(x = x, y = value, color = dist)) + geom_line() Exercise 5.4 (Mixed bivariate random variable) Let \\(f(x,y) = \\frac{\\beta^{\\alpha}}{\\Gamma(\\alpha)y!} x^{y+ \\alpha -1} 
e^{-x(1 + \\beta)}\\) be the PDF of a bivariate random variable, where \\(x \\in (0, \\infty)\\) and \\(y \\in \\mathbb{N}_0\\). Find the marginal distribution of \\(X\\). Do you recognize this distribution? Find the conditional distribution of \\(Y | X\\). Do you recognize this distribution? Calculate the probability \\(P(Y = 2 | X = 2.5)\\) for \\((X,Y)\\). Find the marginal distribution of \\(Y\\). Do you recognize this distribution? R: Take 1000 random samples from \\((X,Y)\\) with parameters \\(\\beta = 1\\) and \\(\\alpha = 1\\). Plot a scatterplot. Plot a bar plot of the marginal distribution of \\(Y\\), and the theoretical PMF calculated from d) on the range from 0 to 10. Hint: Use the gamma function in R.? Solution. \\[\\begin{align} p(x) &= \\sum_{k = 0}^{\\infty} \\frac{\\beta^{\\alpha}}{\\Gamma(\\alpha)k!} x^{k + \\alpha -1} e^{-x(1 + \\beta)} & \\\\ &= \\sum_{k = 0}^{\\infty} \\frac{\\beta^{\\alpha}}{\\Gamma(\\alpha)k!} x^{k} x^{\\alpha -1} e^{-x} e^{-\\beta x} & \\\\ &= \\frac{\\beta^{\\alpha}}{\\Gamma(\\alpha)} x^{\\alpha -1} e^{-\\beta x} \\sum_{k = 0}^{\\infty} \\frac{1}{k!} x^{k} e^{-x} & \\\\ &= \\frac{\\beta^{\\alpha}}{\\Gamma(\\alpha)} x^{\\alpha -1} e^{-\\beta x} & \\text{the last term above sums to one} \\end{align}\\] This is the Gamma PDF. \\[\\begin{align} p(y|x) &= \\frac{p(x,y)}{p(x)} \\\\ &= \\frac{\\frac{\\beta^{\\alpha}}{\\Gamma(\\alpha)y!} x^{y+ \\alpha -1} e^{-x(1 + \\beta)}}{\\frac{\\beta^{\\alpha}}{\\Gamma(\\alpha)} x^{\\alpha -1} e^{-\\beta x}} \\\\ &= \\frac{x^y e^{-x}}{y!}. \\end{align}\\] This is the Poisson PMF. \\[\\begin{align} P(Y = 2 | X = 2.5) = \\frac{2.5^2 e^{-2.5}}{2!} \\approx 0.26. \\end{align}\\] \\[\\begin{align} p(y) &= \\int_{0}^{\\infty} \\frac{\\beta^{\\alpha}}{\\Gamma(\\alpha)y!} x^{y + \\alpha -1} e^{-x(1 + \\beta)} dx & \\\\ &= \\frac{1}{y!} \\int_{0}^{\\infty} \\frac{\\beta^{\\alpha}}{\\Gamma(\\alpha)} x^{(y + \\alpha) -1} e^{-(1 + \\beta)x} dx & \\\\ &= \\frac{1}{y!} \\frac{\\beta^{\\alpha}}{\\Gamma(\\alpha)} \\int_{0}^{\\infty} \\frac{\\Gamma(y + \\alpha)}{(1 + \\beta)^{y + \\alpha}} \\frac{(1 + \\beta)^{y + \\alpha}}{\\Gamma(y + \\alpha)} x^{(y + \\alpha) -1} e^{-(1 + \\beta)x} dx & \\text{complete to Gamma PDF} \\\\ &= \\frac{1}{y!} \\frac{\\beta^{\\alpha}}{\\Gamma(\\alpha)} \\frac{\\Gamma(y + \\alpha)}{(1 + \\beta)^{y + \\alpha}}. \\end{align}\\] We add the terms in the third equality to get a Gamma PDF inside the integral, which then integrates to one. We do not recognize this distribution. set.seed(1) px <- function (x, alpha, beta) { return((1 / factorial(x)) * (beta^alpha / gamma(alpha)) * (gamma(x + alpha) / (1 + beta)^(x + alpha))) } nsamps <- 1000 rx <- rgamma(nsamps, 1, 1) ryx <- rpois(nsamps, rx) ggplot(data = data.frame(x = rx, y = ryx), aes(x = x, y = y)) + geom_point() ggplot(data = data.frame(x = rx, y = ryx), aes(x = y)) + geom_bar(aes(y = (..count..)/sum(..count..))) + stat_function(fun = px, args = list(alpha = 1, beta = 1), color = "red") Exercise 5.5 Let \\(f(x,y) = cx^2y\\) for \\(x^2 \\leq y \\leq 1\\) and zero otherwise. Find such \\(c\\) that \\(f\\) is a PDF of a bivariate random variable. This exercise is borrowed from Wasserman. Solution. \\[\\begin{align} 1 &= \\int_{-1}^{1} \\int_{x^2}^1 cx^2y dy dx \\\\ &= \\int_{-1}^{1} cx^2 (\\frac{1}{2} - \\frac{x^4}{2}) dx \\\\ &= \\frac{c}{2} \\int_{-1}^{1} x^2 - x^6 dx \\\\ &= \\frac{c}{2} (\\frac{1}{3} + \\frac{1}{3} - \\frac{1}{7} - \\frac{1}{7}) \\\\ &= \\frac{c}{2} \\frac{8}{21} \\\\ &= \\frac{4c}{21} \\end{align}\\] It follows \\(c = \\frac{21}{4}\\). 
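To double-check the constant, here is a quick Monte Carlo sketch (added for illustration, not part of the original solution): we sample points uniformly on the box \\([-1,1] \\times [0,1]\\), evaluate the candidate density \\(\\frac{21}{4} x^2 y\\) on the region \\(x^2 \\leq y \\leq 1\\) and zero elsewhere, and multiply the average by the area of the box. The result should be close to 1.
set.seed(1)
nsamps <- 100000
x <- runif(nsamps, -1, 1)
y <- runif(nsamps, 0, 1)
f <- ifelse(x^2 <= y, (21 / 4) * x^2 * y, 0)  # density is zero outside x^2 <= y <= 1
mean(f) * 2                                   # multiply by the area of the sampling box; should be close to 1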
5.3 Transformations Exercise 5.6 Let \\((X,Y)\\) be uniformly distributed on the unit ball \\(\\{(x,y,z) : x^2 + y^2 + z^2 \\leq 1\\}\\). Let \\(R = \\sqrt{X^2 + Y^2 + Z^2}\\). Find the CDF and PDF of \\(R\\). Solution. \\[\\begin{align} P(R < r) &= P(\\sqrt{X^2 + Y^2 + Z^2} < r) \\\\ &= P(X^2 + Y^2 + Z^2 < r^2) \\\\ &= \\frac{\\frac{4}{3} \\pi r^3}{\\frac{4}{3}\\pi} \\\\ &= r^3. \\end{align}\\] The second line shows us that we are looking at the probability which is represented by a smaller ball with radius \\(r\\). To get the probability, we divide it by the radius of the whole ball. We get the PDF by differentiating the CDF, so \\(p(r) = 3r^2\\). "],["integ.html", "Chapter 6 Integration 6.1 Monte Carlo integration 6.2 Lebesgue integrals", " Chapter 6 Integration This chapter deals with abstract and Monte Carlo integration. The students are expected to acquire the following knowledge: Theoretical How to calculate Lebesgue integrals for non-simple functions. R Monte Carlo integration. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } 6.1 Monte Carlo integration Exercise 6.1 Let \\(X\\) and \\(Y\\) be continuous random variables on the unit interval and \\(p(x,y) = 6(x - y)^2\\). Use Monte Carlo integration to estimate the probability \\(P(0.2 \\leq X \\leq 0.5, \\: 0.1 \\leq Y \\leq 0.2)\\). Can you find the exact value? set.seed(1) nsamps <- 1000 V <- (0.5 - 0.2) * (0.2 - 0.1) x1 <- runif(nsamps, 0.2, 0.5) x2 <- runif(nsamps, 0.1, 0.2) f_1 <- function (x, y) { return (6 * (x - y)^2) } mcint <- V * (1 / nsamps) * sum(f_1(x1, x2)) sdm <- sqrt((V^2 / nsamps) * var(f_1(x1, x2))) mcint ## [1] 0.008793445 sdm ## [1] 0.0002197686 F_1 <- function (x, y) { return (2 * x^3 * y - 3 * x^2 * y^2 + 2 * x * y^3) } F_1(0.5, 0.2) - F_1(0.2, 0.2) - F_1(0.5, 0.1) + F_1(0.2, 0.1) ## [1] 0.0087 6.2 Lebesgue integrals Exercise 6.2 (borrowed from Jagannathan) Find the Lebesgue integral of the following functions on (\\(\\mathbb{R}\\), \\(\\mathcal{B}(\\mathbb{R})\\), \\(\\lambda\\)). \\[\\begin{align} f(\\omega) = \\begin{cases} \\omega, & \\text{for } \\omega = 0,1,...,n \\\\ 0, & \\text{elsewhere} \\end{cases} \\end{align}\\] \\[\\begin{align} f(\\omega) = \\begin{cases} 1, & \\text{for } \\omega = \\mathbb{Q}^c \\cap [0,1] \\\\ 0, & \\text{elsewhere} \\end{cases} \\end{align}\\] \\[\\begin{align} f(\\omega) = \\begin{cases} n, & \\text{for } \\omega = \\mathbb{Q}^c \\cap [0,n] \\\\ 0, & \\text{elsewhere} \\end{cases} \\end{align}\\] Solution. \\[\\begin{align} \\int f(\\omega) d\\lambda = \\sum_{\\omega = 0}^n \\omega \\lambda(\\omega) = 0. \\end{align}\\] \\[\\begin{align} \\int f(\\omega) d\\lambda = 1 \\times \\lambda(\\mathbb{Q}^c \\cap [0,1]) = 1. \\end{align}\\] \\[\\begin{align} \\int f(\\omega) d\\lambda = n \\times \\lambda(\\mathbb{Q}^c \\cap [0,n]) = n^2. \\end{align}\\] Exercise 6.3 (borrowed from Jagannathan) Let \\(c \\in \\mathbb{R}\\) be fixed and (\\(\\mathbb{R}\\), \\(\\mathcal{B}(\\mathbb{R})\\)) a measurable space. If for any Borel set \\(A\\), \\(\\delta_c (A) = 1\\) if \\(c \\in A\\), and \\(\\delta_c (A) = 0\\) otherwise, then \\(\\delta_c\\) is called a Dirac measure. Let \\(g\\) be a non-negative, measurable function. Show that \\(\\int g d \\delta_c = g(c)\\). Solution. 
\\[\\begin{align} \\int g d \\delta_c &= \\sup_{q \\in S(g)} \\int q d \\delta_c \\\\ &= \\sup_{q \\in S(g)} \\sum_{i = 1}^n a_i \\delta_c(A_i) \\\\ &= \\sup_{q \\in S(g)} \\sum_{i = 1}^n a_i \\text{I}_{A_i}(c) \\\\ &= \\sup_{q \\in S(g)} q(c) \\\\ &= g(c) \\end{align}\\] "],["ev.html", "Chapter 7 Expected value 7.1 Discrete random variables 7.2 Continuous random variables 7.3 Sums, functions, conditional expectations 7.4 Covariance", " Chapter 7 Expected value This chapter deals with expected values of random variables. The students are expected to acquire the following knowledge: Theoretical Calculation of the expected value. Calculation of variance and covariance. Cauchy distribution. R Estimation of expected value. Estimation of variance and covariance. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } 7.1 Discrete random variables Exercise 7.1 (Bernoulli) Let \\(X \\sim \\text{Bernoulli}(p)\\). Find \\(E[X]\\). Find \\(Var[X]\\). R: Let \\(p = 0.4\\). Check your answers to a) and b) with a simulation. Solution. \\[\\begin{align*} E[X] = \\sum_{k=0}^1 p^k (1-p)^{1-k} k = p. \\end{align*}\\] \\[\\begin{align*} Var[X] = E[X^2] - E[X]^2 = \\sum_{k=0}^1 (p^k (1-p)^{1-k} k^2) - p^2 = p(1-p). \\end{align*}\\] set.seed(1) nsamps <- 1000 x <- rbinom(nsamps, 1, 0.4) mean(x) ## [1] 0.394 var(x) ## [1] 0.239003 0.4 * (1 - 0.4) ## [1] 0.24 Exercise 7.2 (Binomial) Let \\(X \\sim \\text{Binomial}(n,p)\\). Find \\(E[X]\\). Find \\(Var[X]\\). Solution. Let \\(X = \\sum_{i=0}^n X_i\\), where \\(X_i \\sim \\text{Bernoulli}(p)\\). Then, due to linearity of expectation \\[\\begin{align*} E[X] = E[\\sum_{i=0}^n X_i] = \\sum_{i=0}^n E[X_i] = np. \\end{align*}\\] Again let \\(X = \\sum_{i=0}^n X_i\\), where \\(X_i \\sim \\text{Bernoulli}(p)\\). Since the Bernoulli variables \\(X_i\\) are independent we have \\[\\begin{align*} Var[X] = Var[\\sum_{i=0}^n X_i] = \\sum_{i=0}^n Var[X_i] = np(1-p). \\end{align*}\\] Exercise 7.3 (Poisson) Let \\(X \\sim \\text{Poisson}(\\lambda)\\). Find \\(E[X]\\). Find \\(Var[X]\\). Solution. \\[\\begin{align*} E[X] &= \\sum_{k=0}^\\infty \\frac{\\lambda^k e^{-\\lambda}}{k!} k & \\\\ &= \\sum_{k=1}^\\infty \\frac{\\lambda^k e^{-\\lambda}}{k!} k & \\text{term at $k=0$ is 0} \\\\ &= e^{-\\lambda} \\lambda \\sum_{k=1}^\\infty \\frac{\\lambda^{k-1}}{(k - 1)!} & \\\\ &= e^{-\\lambda} \\lambda \\sum_{k=0}^\\infty \\frac{\\lambda^{k}}{k!} & \\\\ &= e^{-\\lambda} \\lambda e^\\lambda & \\\\ &= \\lambda. \\end{align*}\\] \\[\\begin{align*} Var[X] &= E[X^2] - E[X]^2 & \\\\ &= e^{-\\lambda} \\lambda \\sum_{k=1}^\\infty k \\frac{\\lambda^{k-1}}{(k - 1)!} - \\lambda^2 & \\\\ &= e^{-\\lambda} \\lambda \\sum_{k=1}^\\infty (k - 1) + 1) \\frac{\\lambda^{k-1}}{(k - 1)!} - \\lambda^2 & \\\\ &= e^{-\\lambda} \\lambda \\big(\\sum_{k=1}^\\infty (k - 1) \\frac{\\lambda^{k-1}}{(k - 1)!} + \\sum_{k=1}^\\infty \\frac{\\lambda^{k-1}}{(k - 1)!}\\Big) - \\lambda^2 & \\\\ &= e^{-\\lambda} \\lambda \\big(\\lambda\\sum_{k=2}^\\infty \\frac{\\lambda^{k-2}}{(k - 2)!} + e^\\lambda\\Big) - \\lambda^2 & \\\\ &= e^{-\\lambda} \\lambda \\big(\\lambda e^\\lambda + e^\\lambda\\Big) - \\lambda^2 & \\\\ &= \\lambda^2 + \\lambda - \\lambda^2 & \\\\ &= \\lambda. \\end{align*}\\] Exercise 7.4 (Geometric) Let \\(X \\sim \\text{Geometric}(p)\\). Find \\(E[X]\\). Hint: \\(\\frac{d}{dx} x^k = k x^{(k - 1)}\\). Solution. 
\\[\\begin{align*} E[X] &= \\sum_{k=0}^\\infty (1 - p)^k p k & \\\\ &= p (1 - p) \\sum_{k=0}^\\infty (1 - p)^{k-1} k & \\\\ &= p (1 - p) \\sum_{k=0}^\\infty -\\frac{d}{dp}(1 - p)^k & \\\\ &= p (1 - p) \\Big(-\\frac{d}{dp}\\Big) \\sum_{k=0}^\\infty (1 - p)^k & \\\\ &= p (1 - p) \\Big(-\\frac{d}{dp}\\Big) \\frac{1}{1 - (1 - p)} & \\text{geometric series} \\\\ &= \\frac{1 - p}{p} \\end{align*}\\] 7.2 Continuous random variables Exercise 7.5 (Gamma) Let \\(X \\sim \\text{Gamma}(\\alpha, \\beta)\\). Hint: \\(\\Gamma(z) = \\int_0^\\infty t^{z-1}e^{-t} dt\\) and \\(\\Gamma(z + 1) = z \\Gamma(z)\\). Find \\(E[X]\\). Find \\(Var[X]\\). R: Let \\(\\alpha = 10\\) and \\(\\beta = 2\\). Plot the density of \\(X\\). Add a horizontal line at the expected value that touches the density curve (geom_segment). Shade the area within a standard deviation of the expected value. Solution. \\[\\begin{align*} E[X] &= \\int_0^\\infty \\frac{\\beta^\\alpha}{\\Gamma(\\alpha)}x^\\alpha e^{-\\beta x} dx & \\\\ &= \\frac{\\beta^\\alpha}{\\Gamma(\\alpha)} \\int_0^\\infty x^\\alpha e^{-\\beta x} dx & \\text{ (let $t = \\beta x$)} \\\\ &= \\frac{\\beta^\\alpha}{\\Gamma(\\alpha) }\\int_0^\\infty \\frac{t^\\alpha}{\\beta^\\alpha} e^{-t} \\frac{dt}{\\beta} & \\\\ &= \\frac{1}{\\beta \\Gamma(\\alpha) }\\int_0^\\infty t^\\alpha e^{-t} dt & \\\\ &= \\frac{\\Gamma(\\alpha + 1)}{\\beta \\Gamma(\\alpha)} & \\\\ &= \\frac{\\alpha \\Gamma(\\alpha)}{\\beta \\Gamma(\\alpha)} & \\\\ &= \\frac{\\alpha}{\\beta}. & \\end{align*}\\] \\[\\begin{align*} Var[X] &= E[X^2] - E[X]^2 \\\\ &= \\int_0^\\infty \\frac{\\beta^\\alpha}{\\Gamma(\\alpha)}x^{\\alpha+1} e^{-\\beta x} dx - \\frac{\\alpha^2}{\\beta^2} \\\\ &= \\frac{\\Gamma(\\alpha + 2)}{\\beta^2 \\Gamma(\\alpha)} - \\frac{\\alpha^2}{\\beta^2} \\\\ &= \\frac{(\\alpha + 1)\\alpha\\Gamma(\\alpha)}{\\beta^2 \\Gamma(\\alpha)} - \\frac{\\alpha^2}{\\beta^2} \\\\ &= \\frac{\\alpha^2 + \\alpha}{\\beta^2} - \\frac{\\alpha^2}{\\beta^2} \\\\ &= \\frac{\\alpha}{\\beta^2}. \\end{align*}\\] set.seed(1) x <- seq(0, 25, by = 0.01) y <- dgamma(x, shape = 10, rate = 2) df <- data.frame(x = x, y = y) ggplot(df, aes(x = x, y = y)) + geom_line() + geom_segment(aes(x = 5, y = 0, xend = 5, yend = dgamma(5, shape = 10, rate = 2)), color = "red") + stat_function(fun = dgamma, args = list(shape = 10, rate = 2), xlim = c(5 - sqrt(10/4), 5 + sqrt(10/4)), geom = "area", fill = "gray", alpha = 0.4) Exercise 7.6 (Beta) Let \\(X \\sim \\text{Beta}(\\alpha, \\beta)\\). Find \\(E[X]\\). Hint 1: \\(\\text{B}(x,y) = \\int_0^1 t^{x-1} (1 - t)^{y-1} dt\\). Hint 2: \\(\\text{B}(x + 1, y) = \\text{B}(x,y)\\frac{x}{x + y}\\). Find \\(Var[X]\\). Solution. \\[\\begin{align*} E[X] &= \\int_0^1 \\frac{x^{\\alpha - 1} (1 - x)^{\\beta - 1}}{\\text{B}(\\alpha, \\beta)} x dx \\\\ &= \\frac{1}{\\text{B}(\\alpha, \\beta)}\\int_0^1 x^{\\alpha} (1 - x)^{\\beta - 1} dx \\\\ &= \\frac{1}{\\text{B}(\\alpha, \\beta)} \\text{B}(\\alpha + 1, \\beta) \\\\ &= \\frac{1}{\\text{B}(\\alpha, \\beta)} \\text{B}(\\alpha, \\beta) \\frac{\\alpha}{\\alpha + \\beta} \\\\ &= \\frac{\\alpha}{\\alpha + \\beta}. 
\\\\ \\end{align*}\\] \\[\\begin{align*} Var[X] &= E[X^2] - E[X]^2 \\\\ &= \\int_0^1 \\frac{x^{\\alpha - 1} (1 - x)^{\\beta - 1}}{\\text{B}(\\alpha, \\beta)} x^2 dx - \\frac{\\alpha^2}{(\\alpha + \\beta)^2} \\\\ &= \\frac{1}{\\text{B}(\\alpha, \\beta)}\\int_0^1 x^{\\alpha + 1} (1 - x)^{\\beta - 1} dx - \\frac{\\alpha^2}{(\\alpha + \\beta)^2} \\\\ &= \\frac{1}{\\text{B}(\\alpha, \\beta)} \\text{B}(\\alpha + 2, \\beta) - \\frac{\\alpha^2}{(\\alpha + \\beta)^2} \\\\ &= \\frac{1}{\\text{B}(\\alpha, \\beta)} \\text{B}(\\alpha + 1, \\beta) \\frac{\\alpha + 1}{\\alpha + \\beta + 1} - \\frac{\\alpha^2}{(\\alpha + \\beta)^2} \\\\ &= \\frac{\\alpha + 1}{\\alpha + \\beta + 1} \\frac{\\alpha}{\\alpha + \\beta} - \\frac{\\alpha^2}{(\\alpha + \\beta)^2}\\\\ &= \\frac{\\alpha \\beta}{(\\alpha + \\beta)^2(\\alpha + \\beta + 1)}. \\end{align*}\\] Exercise 7.7 (Exponential) Let \\(X \\sim \\text{Exp}(\\lambda)\\). Find \\(E[X]\\). Hint: \\(\\Gamma(z + 1) = z\\Gamma(z)\\) and \\(\\Gamma(1) = 1\\). Find \\(Var[X]\\). Solution. \\[\\begin{align*} E[X] &= \\int_0^\\infty \\lambda e^{-\\lambda x} x dx & \\\\ &= \\lambda \\int_0^\\infty x e^{-\\lambda x} dx & \\\\ &= \\lambda \\int_0^\\infty \\frac{t}{\\lambda} e^{-t} \\frac{dt}{\\lambda} & \\text{$t = \\lambda x$}\\\\ &= \\lambda \\lambda^{-2} \\Gamma(2) & \\text{definition of gamma function} \\\\ &= \\lambda^{-1}. \\end{align*}\\] \\[\\begin{align*} Var[X] &= E[X^2] - E[X]^2 & \\\\ &= \\int_0^\\infty \\lambda e^{-\\lambda x} x^2 dx - \\lambda^{-2} & \\\\ &= \\lambda \\int_0^\\infty \\frac{t^2}{\\lambda^2} e^{-t} \\frac{dt}{\\lambda} - \\lambda^{-2} & \\text{$t = \\lambda x$} \\\\ &= \\lambda \\lambda^{-3} \\Gamma(3) - \\lambda^{-2} & \\text{definition of gamma function} & \\\\ &= \\lambda^{-2} 2 \\Gamma(2) - \\lambda^{-2} & \\\\ &= 2 \\lambda^{-2} - \\lambda^{-2} & \\\\ &= \\lambda^{-2}. & \\\\ \\end{align*}\\] Exercise 7.8 (Normal) Let \\(X \\sim \\text{N}(\\mu, \\sigma)\\). Show that \\(E[X] = \\mu\\). Hint: Use the error function \\(\\text{erf}(x) = \\frac{1}{\\sqrt(\\pi)} \\int_{-x}^x e^{-t^2} dt\\). The statistical interpretation of this function is that if \\(Y \\sim \\text{N}(0, 0.5)\\), then the error function describes the probability of \\(Y\\) falling between \\(-x\\) and \\(x\\). Also, \\(\\text{erf}(\\infty) = 1\\). Show that \\(Var[X] = \\sigma^2\\). Hint: Start with the definition of variance. Solution. \\[\\begin{align*} E[X] &= \\int_{-\\infty}^\\infty \\frac{1}{\\sqrt{2\\pi \\sigma^2}} e^{-\\frac{(x - \\mu)^2}{2\\sigma^2}} x dx & \\\\ &= \\frac{1}{\\sqrt{2\\pi \\sigma^2}} \\int_{-\\infty}^\\infty x e^{-\\frac{(x - \\mu)^2}{2\\sigma^2}} dx & \\\\ &= \\frac{1}{\\sqrt{2\\pi \\sigma^2}} \\int_{-\\infty}^\\infty \\Big(t \\sqrt{2\\sigma^2} + \\mu\\Big)e^{-t^2} \\sqrt{2 \\sigma^2} dt & t = \\frac{x - \\mu}{\\sqrt{2}\\sigma} \\\\ &= \\frac{\\sqrt{2\\sigma^2}}{\\sqrt{\\pi}} \\int_{-\\infty}^\\infty t e^{-t^2} dt + \\frac{1}{\\sqrt{\\pi}} \\int_{-\\infty}^\\infty \\mu e^{-t^2} dt & \\\\ \\end{align*}\\] Let us calculate these integrals separately. \\[\\begin{align*} \\int t e^{-t^2} dt &= -\\frac{1}{2}\\int e^{s} ds & s = -t^2 \\\\ &= -\\frac{e^s}{2} + C \\\\ &= -\\frac{e^{-t^2}}{2} + C & \\text{undoing substitution}. \\end{align*}\\] Inserting the integration limits we get \\[\\begin{align*} \\int_{-\\infty}^\\infty t e^{-t^2} dt &= 0, \\end{align*}\\] due to the integrated function being symmetric. 
Reordering the second integral we get \\[\\begin{align*} \\mu \\frac{1}{\\sqrt{\\pi}} \\int_{-\\infty}^\\infty e^{-t^2} dt &= \\mu \\text{erf}(\\infty) & \\text{definition of error function} \\\\ &= \\mu & \\text{probability of $Y$ falling between $-\\infty$ and $\\infty$}. \\end{align*}\\] Combining all of the above we get \\[\\begin{align*} E[X] &= \\frac{\\sqrt{2\\sigma^2}}{\\sqrt{\\pi}} \\times 0 + \\mu \\\\ &= \\mu. \\end{align*}\\] \\[\\begin{align*} Var[X] &= E[(X - E[X])^2] \\\\ &= E[(X - \\mu)^2] \\\\ &= \\frac{1}{\\sqrt{2\\pi \\sigma^2}} \\int_{-\\infty}^\\infty (x - \\mu)^2 e^{-\\frac{(x - \\mu)^2}{2\\sigma^2}} dx \\\\ &= \\frac{\\sigma^2}{\\sqrt{2\\pi}} \\int_{-\\infty}^\\infty t^2 e^{-\\frac{t^2}{2}} dt \\\\ &= \\frac{\\sigma^2}{\\sqrt{2\\pi}} \\bigg( \\Big(- t e^{-\\frac{t^2}{2}}\\Big) \\Big|_{-\\infty}^\\infty + \\int_{-\\infty}^\\infty e^{-\\frac{t^2}{2}} dt \\bigg) & \\text{integration by parts} \\\\ &= \\frac{\\sigma^2}{\\sqrt{2\\pi}} \\sqrt{2 \\pi} \\int_{-\\infty}^\\infty \\frac{1}{\\sqrt{\\pi}}e^{-s^2} ds & s = \\frac{t}{\\sqrt{2}} \\text{, the evaluated term vanishes at the bounds} \\\\ &= \\frac{\\sigma^2}{\\sqrt{2\\pi}} \\sqrt{2 \\pi} \\, \\text{erf}(\\infty) & \\text{definition of error function} \\\\ &= \\sigma^2. \\end{align*}\\] 7.3 Sums, functions, conditional expectations Exercise 7.9 (Expectation of transformations) Let \\(X\\) follow a normal distribution with mean \\(\\mu\\) and variance \\(\\sigma^2\\). Find \\(E[2X + 4]\\). Find \\(E[X^2]\\). Find \\(E[\\exp(X)]\\). Hint: Use the error function \\(\\text{erf}(x) = \\frac{1}{\\sqrt{\\pi}} \\int_{-x}^x e^{-t^2} dt\\). Also, \\(\\text{erf}(\\infty) = 1\\). R: Check your results numerically for \\(\\mu = 0.4\\) and \\(\\sigma^2 = 0.25\\) and plot the densities of all four distributions. Solution. \\[\\begin{align} E[2X + 4] &= 2E[X] + 4 & \\text{linearity of expectation} \\\\ &= 2\\mu + 4. \\\\ \\end{align}\\] \\[\\begin{align} E[X^2] &= E[X]^2 + Var[X] & \\text{definition of variance} \\\\ &= \\mu^2 + \\sigma^2. \\end{align}\\] \\[\\begin{align} E[\\exp(X)] &= \\int_{-\\infty}^\\infty \\frac{1}{\\sqrt{2\\pi \\sigma^2}} e^{-\\frac{(x - \\mu)^2}{2\\sigma^2}} e^x dx & \\\\ &= \\frac{1}{\\sqrt{2\\pi \\sigma^2}} \\int_{-\\infty}^\\infty e^{\\frac{2 \\sigma^2 x}{2\\sigma^2} -\\frac{(x - \\mu)^2}{2\\sigma^2}} dx & \\\\ &= \\frac{1}{\\sqrt{2\\pi \\sigma^2}} \\int_{-\\infty}^\\infty e^{-\\frac{x^2 - 2x(\\mu + \\sigma^2) + \\mu^2}{2\\sigma^2}} dx & \\\\ &= \\frac{1}{\\sqrt{2\\pi \\sigma^2}} \\int_{-\\infty}^\\infty e^{-\\frac{(x - (\\mu + \\sigma^2))^2 + \\mu^2 - (\\mu + \\sigma^2)^2}{2\\sigma^2}} dx & \\text{complete the square} \\\\ &= \\frac{1}{\\sqrt{2\\pi \\sigma^2}} e^{\\frac{- \\mu^2 + (\\mu + \\sigma^2)^2}{2\\sigma^2}} \\int_{-\\infty}^\\infty e^{-\\frac{(x - (\\mu + \\sigma^2))^2}{2\\sigma^2}} dx & \\\\ &= \\frac{1}{\\sqrt{2\\pi \\sigma^2}} e^{\\frac{- \\mu^2 + (\\mu + \\sigma^2)^2}{2\\sigma^2}} \\sigma \\sqrt{2 \\pi} \\, \\text{erf}(\\infty) & \\\\ &= e^{\\frac{2\\mu + \\sigma^2}{2}}. \\end{align}\\] set.seed(1) mu <- 0.4 sigma <- 0.5 x <- rnorm(100000, mean = mu, sd = sigma) mean(2*x + 4) ## [1] 4.797756 2 * mu + 4 ## [1] 4.8 mean(x^2) ## [1] 0.4108658 mu^2 + sigma^2 ## [1] 0.41 mean(exp(x)) ## [1] 1.689794 exp((2 * mu + sigma^2) / 2) ## [1] 1.690459 Exercise 7.10 (Sum of independent random variables) Borrowed from Wasserman. Let \\(X_1, X_2,...,X_n\\) be IID random variables with expected value \\(E[X_i] = \\mu\\) and variance \\(Var[X_i] = \\sigma^2\\).
Find the expected value and variance of \\(\\bar{X} = \\frac{1}{n} \\sum_{i=1}^n X_i\\). \\(\\bar{X}\\) is called a statistic (a function of the values in a sample). It is itself a random variable and its distribution is called a sampling distribution. R: Take \\(n = 5, 10, 100, 1000\\) samples from the N(\\(2\\), \\(6\\)) distribution 10000 times. Plot the theoretical density and the densities of \\(\\bar{X}\\) statistic for each \\(n\\). Intuitively, are the results in correspondence with your calculations? Check them numerically. Solution. Let us start with the expectation of \\(\\bar{X}\\). \\[\\begin{align} E[\\bar{X}] &= E[\\frac{1}{n} \\sum_{i=1}^n X_i] & \\\\ &= \\frac{1}{n} E[\\sum_{i=1}^n X_i] & \\text{ (multiplication with a scalar)} \\\\ &= \\frac{1}{n} \\sum_{i=1}^n E[X_i] & \\text{ (linearity)} \\\\ &= \\frac{1}{n} n \\mu & \\\\ &= \\mu. \\end{align}\\] Now the variance \\[\\begin{align} Var[\\bar{X}] &= Var[\\frac{1}{n} \\sum_{i=1}^n X_i] & \\\\ &= \\frac{1}{n^2} Var[\\sum_{i=1}^n X_i] & \\text{ (multiplication with a scalar)} \\\\ &= \\frac{1}{n^2} \\sum_{i=1}^n Var[X_i] & \\text{ (independence of samples)} \\\\ &= \\frac{1}{n^2} n \\sigma^2 & \\\\ &= \\frac{1}{n} \\sigma^2. \\end{align}\\] set.seed(1) nsamps <- 10000 mu <- 2 sigma <- sqrt(6) N <- c(5, 10, 100, 500) X <- matrix(data = NA, nrow = nsamps, ncol = length(N)) ind <- 1 for (n in N) { for (i in 1:nsamps) { X[i,ind] <- mean(rnorm(n, mu, sigma)) } ind <- ind + 1 } colnames(X) <- N X <- melt(as.data.frame(X)) ggplot(data = X, aes(x = value, colour = variable)) + geom_density() + stat_function(data = data.frame(x = seq(-2, 6, by = 0.01)), aes(x = x), fun = dnorm, args = list(mean = mu, sd = sigma), color = "black") Exercise 7.11 (Conditional expectation) Let \\(X \\in \\mathbb{R}_0^+\\) and \\(Y \\in \\mathbb{N}_0\\) be random variables with joint distribution \\(p_{XY}(X,Y) = \\frac{1}{y + 1} e^{-\\frac{x}{y + 1}} 0.5^{y + 1}\\). Find \\(E[X | Y = y]\\) by first finding \\(p_Y\\) and then \\(p_{X|Y}\\). Find \\(E[X]\\). R: check your answers to a) and b) by drawing 10000 samples from \\(p_Y\\) and \\(p_{X|Y}\\). Solution. \\[\\begin{align} p(y) &= \\int_0^\\infty \\frac{1}{y + 1} e^{-\\frac{x}{y + 1}} 0.5^{y + 1} dx \\\\ &= \\frac{0.5^{y + 1}}{y + 1} \\int_0^\\infty e^{-\\frac{x}{y + 1}} dx \\\\ &= \\frac{0.5^{y + 1}}{y + 1} (y + 1) \\\\ &= 0.5^{y + 1} \\\\ &= 0.5(1 - 0.5)^y. \\end{align}\\] We recognize this as the geometric distribution. \\[\\begin{align} p(x|y) &= \\frac{p(x,y)}{p(y)} \\\\ &= \\frac{1}{y + 1} e^{-\\frac{x}{y + 1}}. \\end{align}\\] We recognize this as the exponential distribution. \\[\\begin{align} E[X | Y = y] &= \\int_0^\\infty x \\frac{1}{y + 1} e^{-\\frac{x}{y + 1}} dx \\\\ &= y + 1 & \\text{expected value of the exponential distribution} \\end{align}\\] Use the law of iterated expectation. \\[\\begin{align} E[X] &= E[E[X | Y]] \\\\ &= E[Y + 1] \\\\ &= E[Y] + 1 \\\\ &= \\frac{1 - 0.5}{0.5} + 1 \\\\ &= 2. \\end{align}\\] set.seed(1) y <- rgeom(100000, 0.5) x <- rexp(100000, rate = 1 / (y + 1)) x2 <- x[y == 3] mean(x2) ## [1] 4.048501 3 + 1 ## [1] 4 mean(x) ## [1] 2.007639 (1 - 0.5) / 0.5 + 1 ## [1] 2 Exercise 7.12 (Cauchy distribution) Let \\(p(x | x_0, \\gamma) = \\frac{1}{\\pi \\gamma \\Big(1 + \\big(\\frac{x - x_0}{\\gamma}\\big)^2\\Big)}\\). A random variable with this PDF follows a Cauchy distribution. This distribution is symmetric and has wider tails than the normal distribution. R: Draw \\(n = 1,...,1000\\) samples from a standard normal and \\(\\text{Cauchy}(0, 1)\\). 
For each \\(n\\) plot the mean and the median of the sample using facets. Interpret the results. To get a mathematical explanation of the results in a), evaluate the integral \\(\\int_0^\\infty \\frac{x}{1 + x^2} dx\\) and consider that \\(E[X] = \\int_{-\\infty}^\\infty \\frac{x}{1 + x^2}dx\\). set.seed(1) n <- 1000 means_n <- vector(mode = "numeric", length = n) means_c <- vector(mode = "numeric", length = n) medians_n <- vector(mode = "numeric", length = n) medians_c <- vector(mode = "numeric", length = n) for (i in 1:n) { tmp_n <- rnorm(i) tmp_c <- rcauchy(i) means_n[i] <- mean(tmp_n) means_c[i] <- mean(tmp_c) medians_n[i] <- median(tmp_n) medians_c[i] <- median(tmp_c) } df <- data.frame("distribution" = c(rep("normal", 2 * n), rep("Cauchy", 2 * n)), "type" = c(rep("mean", n), rep("median", n), rep("mean", n), rep("median", n)), "value" = c(means_n, medians_n, means_c, medians_c), "n" = rep(1:n, times = 4)) ggplot(df, aes(x = n, y = value)) + geom_line(alpha = 0.5) + facet_wrap(~ type + distribution , scales = "free") Solution. \\[\\begin{align} \\int_0^\\infty \\frac{x}{1 + x^2} dx &= \\frac{1}{2} \\int_1^\\infty \\frac{1}{u} du & u = 1 + x^2 \\\\ &= \\frac{1}{2} \\ln(x) |_0^\\infty. \\end{align}\\] This integral is not finite. The same holds for the negative part. Therefore, the expectation is undefined, as \\(E[|X|] = \\infty\\). Why can we not just claim that \\(f(x) = x / (1 + x^2)\\) is odd and \\(\\int_{-\\infty}^\\infty f(x) = 0\\)? By definition of the Lebesgue integral \\(\\int_{-\\infty}^{\\infty} f= \\int_{-\\infty}^{\\infty} f_+-\\int_{-\\infty}^{\\infty} f_-\\). At least one of the two integrals needs to be finite for \\(\\int_{-\\infty}^{\\infty} f\\) to be well-defined. However \\(\\int_{-\\infty}^{\\infty} f_+=\\int_0^{\\infty} x/(1+x^2)\\) and \\(\\int_{-\\infty}^{\\infty} f_-=\\int_{-\\infty}^{0} |x|/(1+x^2)\\). We have just shown that both of these integrals are infinite, which implies that their sum is also infinite. 7.4 Covariance Exercise 7.13 Below is a table of values for random variables \\(X\\) and \\(Y\\). X Y 2.1 8 -0.5 11 1 10 -2 12 4 9 Find sample covariance of \\(X\\) and \\(Y\\). Find sample variances of \\(X\\) and \\(Y\\). Find sample correlation of \\(X\\) and \\(Y\\). Find sample variance of \\(Z = 2X - 3Y\\). Solution. \\(\\bar{X} = 0.92\\) and \\(\\bar{Y} = 10\\). \\[\\begin{align} s(X, Y) &= \\frac{1}{n - 1} \\sum_{i=1}^5 (X_i - 0.92) (Y_i - 10) \\\\ &= -3.175. \\end{align}\\] \\[\\begin{align} s(X) &= \\frac{\\sum_{i=1}^5(X_i - 0.92)^2}{5 - 1} \\\\ &= 5.357. \\end{align}\\] \\[\\begin{align} s(Y) &= \\frac{\\sum_{i=1}^5(Y_i - 10)^2}{5 - 1} \\\\ &= 2.5. \\end{align}\\] \\[\\begin{align} r(X,Y) &= \\frac{Cov(X,Y)}{\\sqrt{Var[X]Var[Y]}} \\\\ &= \\frac{-3.175}{\\sqrt{5.357 \\times 2.5}} \\\\ &= -8.68. \\end{align}\\] \\[\\begin{align} s(Z) &= 2^2 s(X) + 3^2 s(Y) + 2 \\times 2 \\times 3 s(X, Y) \\\\ &= 4 \\times 5.357 + 9 \\times 2.5 + 12 \\times 3.175 \\\\ &= 82.028. \\end{align}\\] Exercise 7.14 Let \\(X \\sim \\text{Uniform}(0,1)\\) and \\(Y | X = x \\sim \\text{Uniform(0,x)}\\). Find the covariance of \\(X\\) and \\(Y\\). Find the correlation of \\(X\\) and \\(Y\\). R: check your answers to a) and b) with simulation. Plot \\(X\\) against \\(Y\\) on a scatterplot. Solution. The joint PDF is \\(p(x,y) = p(x)p(y|x) = \\frac{1}{x}\\). 
\\[\\begin{align} Cov(X,Y) &= E[XY] - E[X]E[Y] \\\\ \\end{align}\\] Let us first evaluate the first term: \\[\\begin{align} E[XY] &= \\int_0^1 \\int_0^x x y \\frac{1}{x} dy dx \\\\ &= \\int_0^1 \\int_0^x y dy dx \\\\ &= \\int_0^1 \\frac{x^2}{2} dx \\\\ &= \\frac{1}{6}. \\end{align}\\] Now let us find \\(E[Y]\\), \\(E[X]\\) is trivial. \\[\\begin{align} E[Y] = E[E[Y | X]] = E[\\frac{X}{2}] = \\frac{1}{2} \\int_0^1 x dx = \\frac{1}{4}. \\end{align}\\] Combining all: \\[\\begin{align} Cov(X,Y) &= \\frac{1}{6} - \\frac{1}{2} \\frac{1}{4} = \\frac{1}{24}. \\end{align}\\] \\[\\begin{align} \\rho(X,Y) &= \\frac{Cov(X,Y)}{\\sqrt{Var[X]Var[Y]}} \\\\ \\end{align}\\] Let us calculate \\(Var[X]\\). \\[\\begin{align} Var[X] &= E[X^2] - \\frac{1}{4} \\\\ &= \\int_0^1 x^2 - \\frac{1}{4} \\\\ &= \\frac{1}{3} - \\frac{1}{4} \\\\ &= \\frac{1}{12}. \\end{align}\\] Let us calculate \\(E[E[Y^2|X]]\\). \\[\\begin{align} E[E[Y^2|X]] &= E[\\frac{x^2}{3}] \\\\ &= \\frac{1}{9}. \\end{align}\\] Then \\(Var[Y] = \\frac{1}{9} - \\frac{1}{16} = \\frac{7}{144}\\). Combining all \\[\\begin{align} \\rho(X,Y) &= \\frac{\\frac{1}{24}}{\\sqrt{\\frac{1}{12}\\frac{7}{144}}} \\\\ &= 0.65. \\end{align}\\] set.seed(1) nsamps <- 10000 x <- runif(nsamps) y <- runif(nsamps, 0, x) cov(x, y) ## [1] 0.04274061 1/24 ## [1] 0.04166667 cor(x, y) ## [1] 0.6629567 (1 / 24) / (sqrt(7 / (12 * 144))) ## [1] 0.6546537 ggplot(data.frame(x = x, y = y), aes(x = x, y = y)) + geom_point(alpha = 0.2) + geom_smooth(method = "lm") "],["mrv.html", "Chapter 8 Multivariate random variables 8.1 Multinomial random variables 8.2 Multivariate normal random variables 8.3 Transformations", " Chapter 8 Multivariate random variables This chapter deals with multivariate random variables. The students are expected to acquire the following knowledge: Theoretical Multinomial distribution. Multivariate normal distribution. Cholesky decomposition. Eigendecomposition. R Sampling from the multinomial distribution. Sampling from the multivariate normal distribution. Matrix decompositions. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } 8.1 Multinomial random variables Exercise 8.1 Let \\(X_i\\), \\(i = 1,...,k\\) represent \\(k\\) events, and \\(p_i\\) the probabilities of these events happening in a trial. Let \\(n\\) be the number of trials, and \\(X\\) a multivariate random variable, the collection of \\(X_i\\). Then \\(p(x) = \\frac{n!}{x_1!x_2!...x_k!} p_1^{x_1} p_2^{x_2}...p_k^{x_k}\\) is the PMF of a multinomial distribution, where \\(n = \\sum_{i = 1}^k x_i\\). Show that the marginal distribution of \\(X_i\\) is a binomial distribution. Take 1000 samples from the multinomial distribution with \\(n=4\\) and probabilities \\(p = (0.2, 0.2, 0.5, 0.1)\\). Then take 1000 samples from four binomial distributions with the same parameters. Inspect the results visually. Solution. We will approach this proof from the probabilistic point of view. W.L.O.G. let \\(x_1\\) be the marginal distribution we are interested in. The term \\(p^{x_1}\\) denotes the probability that event 1 happened \\(x_1\\) times. For this event not to happen, one of the other events needs to happen. So for each of the remaining trials, the probability of another event is \\(\\sum_{i=2}^k p_i = 1 - p_1\\), and there were \\(n - x_1\\) such trials. What is left to do is to calculate the number of permutations of event 1 happening and event 1 not happening. We choose \\(x_1\\) trials, from \\(n\\) trials. 
Therefore \\(p(x_1) = \\binom{n}{x_1} p_1^{x_1} (1 - p_1)^{n - x_1}\\), which is the binomial PMF. Interested students are encouraged to prove this mathematically. set.seed(1) nsamps <- 1000 samps_mult <- rmultinom(nsamps, 4, prob = c(0.2, 0.2, 0.5, 0.1)) samps_mult <- as_tibble(t(samps_mult)) %>% gather() samps <- tibble( V1 = rbinom(nsamps, 4, 0.2), V2 = rbinom(nsamps, 4, 0.2), V3 = rbinom(nsamps, 4, 0.5), V4 = rbinom(nsamps, 4, 0.1) ) %>% gather() %>% bind_rows(samps_mult) %>% bind_cols("dist" = c(rep("binomial", 4*nsamps), rep("multinomial", 4*nsamps))) ggplot(samps, aes(x = value, fill = dist)) + geom_bar(position = "dodge") + facet_wrap(~ key) Exercise 8.2 (Multinomial expected value) Find the expected value, variance and covariance of the multinomial distribution. Hint: First find the expected value for \\(n = 1\\) and then use the fact that the trials are independent. Solution. Let us first calculate the expected value of \\(X_1\\), when \\(n = 1\\). \\[\\begin{align} E[X_1] &= \\sum_{n_1 = 0}^1 \\sum_{n_2 = 0}^1 ... \\sum_{n_k = 0}^1 \\frac{1}{n_1!n_2!...n_k!}p_1^{n_1}p_2^{n_2}...p_k^{n_k}n_1 \\\\ &= \\sum_{n_1 = 0}^1 \\frac{p_1^{n_1} n_1}{n_1!} \\sum_{n_2 = 0}^1 ... \\sum_{n_k = 0}^1 \\frac{1}{n_2!...n_k!}p_2^{n_2}...p_k^{n_k} \\end{align}\\] When \\(n_1 = 0\\) then the whole terms is zero, so we do not need to evaluate other sums. When \\(n_1 = 1\\), all other \\(n_i\\) must be zero, as we have \\(1 = \\sum_{i=1}^k n_i\\). Therefore the other sums equal \\(1\\). So \\(E[X_1] = p_1\\) and \\(E[X_i] = p_i\\) for \\(i = 1,...,k\\). Now let \\(Y_j\\), \\(j = 1,...,n\\), have a multinomial distribution with \\(n = 1\\), and let \\(X\\) have a multinomial distribution with an arbitrary \\(n\\). Then we can write \\(X = \\sum_{j=1}^n Y_j\\). And due to independence \\[\\begin{align} E[X] &= E[\\sum_{j=1}^n Y_j] \\\\ &= \\sum_{j=1}^n E[Y_j] \\\\ &= np. \\end{align}\\] For the variance, we need \\(E[X^2]\\). Let us follow the same procedure as above and first calculate \\(E[X_i]\\) for \\(n = 1\\). The only thing that changes is that the term \\(n_i\\) becomes \\(n_i^2\\). Since we only have \\(0\\) and \\(1\\) this does not change the outcome. So \\[\\begin{align} Var[X_i] &= E[X_i^2] - E[X_i]^2\\\\ &= p_i(1 - p_i). \\end{align}\\] Analogous to above for arbitrary \\(n\\) \\[\\begin{align} Var[X] &= E[X^2] - E[X]^2 \\\\ &= \\sum_{j=1}^n E[Y_j^2] - \\sum_{j=1}^n E[Y_j]^2 \\\\ &= \\sum_{j=1}^n E[Y_j^2] - E[Y_j]^2 \\\\ &= \\sum_{j=1}^n p(1-p) \\\\ &= np(1-p). \\end{align}\\] To calculate the covariance, we need \\(E[X_i X_j]\\). Again, let us start with \\(n = 1\\). Without loss of generality, let us assume \\(i = 1\\) and \\(j = 2\\). \\[\\begin{align} E[X_1 X_2] = \\sum_{n_1 = 0}^1 \\sum_{n_2 = 0}^1 \\frac{p_1^{n_1} n_1}{n_1!} \\frac{p_2^{n_2} n_2}{n_2!} \\sum_{n_3 = 0}^1 ... \\sum_{n_k = 0}^1 \\frac{1}{n_3!...n_k!}p_3^{n_3}...p_k^{n_k}. \\end{align}\\] In the above expression, at each iteration we multiply with \\(n_1\\) and \\(n_2\\). Since \\(n = 1\\), one of these always has to be zero. Therefore \\(E[X_1 X_2] = 0\\) and \\[\\begin{align} Cov(X_i, X_j) &= E[X_i X_j] - E[X_i]E[X_j] \\\\ &= - p_i p_j. \\end{align}\\] For arbitrary \\(n\\), let \\(X = \\sum_{t = 1}^n Y_t\\) be the sum of independent multinomial random variables \\(Y_t = [X_{1t}, X_{2t},...,X_{kt}]^T\\) with \\(n=1\\). Then \\(X_1 = \\sum_{t = 1}^n X_{1t}\\) and \\(X_2 = \\sum_{l = 1}^n X_{2l}\\). 
\\[\\begin{align} Cov(X_1, X_2) &= E[X_1 X_2] - E[X_1] E[X_2] \\\\ &= E[\\sum_{t = 1}^n X_{1t} \\sum_{l = 1}^n X_{2l}] - n^2 p_1 p_2 \\\\ &= \\sum_{t = 1}^n \\sum_{l = 1}^n E[X_{1t} X_{2l}] - n^2 p_1 p_2. \\end{align}\\] For \\(X_{1t}\\) and \\(X_{2l}\\) the expected value is zero when \\(t = l\\). When \\(t \\neq l\\) then they are independent, so the expected value is the product \\(p_1 p_2\\). There are \\(n^2\\) total terms, and for \\(n\\) of them \\(t = l\\) holds. So \\(E[X_1 X_2] = (n^2 - n) p_1 p_2\\). Inserting into the above \\[\\begin{align} Cov(X_1, X_2) &= (n^2 - n) p_1 p_2 - n^2 p_1 p_2 \\\\ &= - n p_1 p_2. \\end{align}\\] 8.2 Multivariate normal random variables Exercise 8.3 (Cholesky decomposition) Let \\(X\\) be a random vector of length \\(k\\) with \\(X_i \\sim \\text{N}(0, 1)\\) and \\(LL^*\\) the Cholesky decomposition of a Hermitian positive-definite matrix \\(A\\). Let \\(\\mu\\) be a vector of length \\(k\\). Find the distribution of the random vector \\(Y = \\mu + L X\\). Find the Cholesky decomposition of \\(A = \\begin{bmatrix} 2 & 1.2 \\\\ 1.2 & 1 \\end{bmatrix}\\). R: Use the results from a) and b) to sample from the MVN distribution \\(\\text{N}(\\mu, A)\\), where \\(\\mu = [1.5, -1]^T\\). Plot a scatterplot and compare it to direct samples from the multivariate normal distribution (rmvnorm). Solution. \\(X\\) has an independent normal distribution of dimension \\(k\\). Then \\[\\begin{align} Y = \\mu + L X &\\sim \\text{N}(\\mu, LL^T) \\\\ &\\sim \\text{N}(\\mu, A). \\end{align}\\] Solve \\[\\begin{align} \\begin{bmatrix} a & 0 \\\\ b & c \\end{bmatrix} \\begin{bmatrix} a & b \\\\ 0 & c \\end{bmatrix} = \\begin{bmatrix} 2 & 1.2 \\\\ 1.2 & 1 \\end{bmatrix} \\end{align}\\] # a set.seed(1) nsamps <- 1000 X <- matrix(data = rnorm(nsamps * 2), ncol = 2) mu <- c(1.5, -1) L <- matrix(data = c(sqrt(2), 0, 1.2 / sqrt(2), sqrt(1 - 1.2^2/2)), ncol = 2, byrow = TRUE) Y <- t(mu + L %*% t(X)) plot_df <- data.frame(rbind(X, Y), c(rep("X", nsamps), rep("Y", nsamps))) colnames(plot_df) <- c("D1", "D2", "var") ggplot(data = plot_df, aes(x = D1, y = D2, colour = as.factor(var))) + geom_point() Exercise 8.4 (Eigendecomposition) R: Let \\(\\Sigma = U \\Lambda U^T\\) be the eigendecomposition of covariance matrix \\(\\Sigma\\). Follow the procedure below, to sample from a multivariate normal with \\(\\mu = [-2, 1]^T\\) and \\(\\Sigma = \\begin{bmatrix} 0.3, -0.5 \\\\ -0.5, 1.6 \\end{bmatrix}\\): Sample from two independent standardized normal distributions to get \\(X\\). Find the eigendecomposition of \\(X\\) (eigen). Multiply \\(X\\) by \\(\\Lambda^{\\frac{1}{2}}\\) to get \\(X2\\). Consider how the eigendecomposition for \\(X2\\) changes compared to \\(X\\). Multiply \\(X2\\) by \\(U\\) to get \\(X3\\). Consider how the eigendecomposition for \\(X3\\) changes compared to \\(X2\\). Add \\(\\mu\\) to \\(X3\\). Consider how the eigendecomposition for \\(X4\\) changes compared to \\(X3\\). Plot the data and the eigenvectors (scaled with \\(\\Lambda^{\\frac{1}{2}}\\)) at each step. Hint: Use geom_segment for the eigenvectors. 
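A short justification of why this procedure works (added here as a side note, it is not required to solve the exercise): if \\(X\\) has independent standard normal components, then \\(X4 = \\mu + U \\Lambda^{\\frac{1}{2}} X\\) is multivariate normal with \\[\\begin{align} \\text{Cov}[\\mu + U \\Lambda^{\\frac{1}{2}} X] &= U \\Lambda^{\\frac{1}{2}} \\text{Cov}[X] \\Lambda^{\\frac{1}{2}} U^T \\\\ &= U \\Lambda^{\\frac{1}{2}} I \\Lambda^{\\frac{1}{2}} U^T \\\\ &= U \\Lambda U^T \\\\ &= \\Sigma, \\end{align}\\] so the final samples come from \\(\\text{N}(\\mu, \\Sigma)\\); the intermediate steps rescale and rotate the eigenvectors accordingly.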
# a set.seed(1) sigma <- matrix(data = c(0.3, -0.5, -0.5, 1.6), nrow = 2, byrow = TRUE) ed <- eigen(sigma) e_val <- ed$values e_vec <- ed$vectors # b set.seed(1) nsamps <- 1000 X <- matrix(data = rnorm(nsamps * 2), ncol = 2) vec1 <- matrix(c(1,0,0,1), nrow = 2) X2 <- t(sqrt(diag(e_val)) %*% t(X)) vec2 <- sqrt(diag(e_val)) %*% vec1 X3 <- t(e_vec %*% t(X2)) vec3 <- e_vec %*% vec2 X4 <- t(c(-2, 1) + t(X3)) vec4 <- c(-2, 1) + vec3 vec_mat <- data.frame(matrix(c(0,0,0,0,0,0,0,0,0,0,0,0,-2,1,-2,1), ncol = 2, byrow = TRUE), t(cbind(vec1, vec2, vec3, vec4)), c(1,1,2,2,3,3,4,4)) df <- data.frame(rbind(X, X2, X3, X4), c(rep(1, nsamps), rep(2, nsamps), rep(3, nsamps), rep(4, nsamps))) colnames(df) <- c("D1", "D2", "wh") colnames(vec_mat) <- c("D1", "D2", "E1", "E2", "wh") ggplot(data = df, aes(x = D1, y = D2)) + geom_point() + geom_segment(data = vec_mat, aes(xend = E1, yend = E2), color = "red") + facet_wrap(~ wh) + coord_fixed() Exercise 8.5 (Marginal and conditional distributions) Let \\(X \\sim \\text{N}(\\mu, \\Sigma)\\), where \\(\\mu = [2, 0, -1]^T\\) and \\(\\Sigma = \\begin{bmatrix} 1 & -0.2 & 0.5 \\\\ -0.2 & 1.4 & -1.2 \\\\ 0.5 & -1.2 & 2 \\\\ \\end{bmatrix}\\). Let \\(A\\) represent the first two random variables and \\(B\\) the third random variable. R: For the calculation in the following points, you can use R. Find the marginal distribution of \\(B\\). Find the conditional distribution of \\(B | A = [a_1, a_2]^T\\). Find the marginal distribution of \\(A\\). Find the conditional distribution of \\(A | B = b\\). R: Visually compare the distributions of a) and b), and c) and d) at three different conditional values. mu <- c(2, 0, -1) Sigma <- matrix(c(1, -0.2, 0.5, -0.2, 1.4, -1.2, 0.5, -1.2, 2), nrow = 3, byrow = TRUE) mu_A <- c(2, 0) mu_B <- -1 Sigma_A <- Sigma[1:2, 1:2] Sigma_B <- Sigma[3, 3] Sigma_AB <- Sigma[1:2, 3] # b tmp_b <- t(Sigma_AB) %*% solve(Sigma_A) mu_b <- mu_B - tmp_b %*% mu_A Sigma_b <- Sigma_B - t(Sigma_AB) %*% solve(Sigma_A) %*% Sigma_AB mu_b ## [,1] ## [1,] -1.676471 tmp_b ## [,1] [,2] ## [1,] 0.3382353 -0.8088235 Sigma_b ## [,1] ## [1,] 0.8602941 # d tmp_a <- Sigma_AB * (1 / Sigma_B) mu_a <- mu_A - tmp_a * mu_B Sigma_d <- Sigma_A - (Sigma_AB * (1 / Sigma_B)) %*% t(Sigma_AB) mu_a ## [1] 2.25 -0.60 tmp_a ## [1] 0.25 -0.60 Sigma_d ## [,1] [,2] ## [1,] 0.875 0.10 ## [2,] 0.100 0.68 Solution. \\(B \\sim \\text{N}(-1, 2)\\). \\(B | A = a \\sim \\text{N}(-1.68 + [0.34, -0.81] a, 0.86)\\). \\(\\mu_A = [2, 0]^T\\) and \\(\\Sigma_A = \\begin{bmatrix} 1 & -0.2 & \\\\ -0.2 & 1.4 \\\\ \\end{bmatrix}\\). 
\\[\\begin{align} A | B = b &\\sim \\text{N}(\\mu_t, \\Sigma_t), \\\\ \\mu_t &= [2.25, -0.6]^T + [0.25, -0.6]^T b, \\\\ \\Sigma_t &= \\begin{bmatrix} 0.875 & 0.1 \\\\ 0.1 & 0.68 \\\\ \\end{bmatrix} \\end{align}\\] library(mvtnorm) set.seed(1) nsamps <- 1000 # a and b samps <- as.data.frame(matrix(data = NA, nrow = 4 * nsamps, ncol = 2)) samps[1:nsamps,1] <- rnorm(nsamps, mu_B, Sigma_B) samps[1:nsamps,2] <- "marginal" for (i in 1:3) { a <- rmvnorm(1, mu_A, Sigma_A) samps[(i*nsamps + 1):((i + 1) * nsamps), 1] <- rnorm(nsamps, mu_b + tmp_b %*% t(a), Sigma_b) samps[(i*nsamps + 1):((i + 1) * nsamps), 2] <- paste0(# "cond", round(a, digits = 2), collapse = "-") } colnames(samps) <- c("x", "dist") ggplot(samps, aes(x = x)) + geom_density() + facet_wrap(~ dist) # c and d samps <- as.data.frame(matrix(data = NA, nrow = 4 * nsamps, ncol = 3)) samps[1:nsamps,1:2] <- rmvnorm(nsamps, mu_A, Sigma_A) samps[1:nsamps,3] <- "marginal" for (i in 1:3) { b <- rnorm(1, mu_B, Sigma_B) samps[(i*nsamps + 1):((i + 1) * nsamps), 1:2] <- rmvnorm(nsamps, mu_a + tmp_a * b, Sigma_d) samps[(i*nsamps + 1):((i + 1) * nsamps), 3] <- b } colnames(samps) <- c("x", "y", "dist") ggplot(samps, aes(x = x, y = y)) + geom_point() + geom_smooth(method = "lm") + facet_wrap(~ dist) 8.3 Transformations Exercise 8.6 Let \\((U,V)\\) be a random variable with PDF \\(p(u,v) = \\frac{1}{4 \\sqrt{u}}\\), \\(U \\in [0,4]\\) and \\(V \\in [\\sqrt{U}, \\sqrt{U} + 1]\\). Let \\(X = \\sqrt{U}\\) and \\(Y = V - \\sqrt{U}\\). Find PDF of \\((X,Y)\\). What can you tell about distributions of \\(X\\) and \\(Y\\)? This exercise shows how we can simplify a probabilistic problem with a clever use of transformations. R: Take 1000 samples from \\((X,Y)\\) and transform them with inverses of the above functions to get samples from \\((U,V)\\). Plot both sets of samples. Solution. First we need to find the inverse functions. Since \\(x = \\sqrt{u}\\) it follows that \\(u = x^2\\), and that \\(x \\in [0,2]\\). Similarly \\(v = y + x\\) and \\(y \\in [0,1]\\). Let us first find the Jacobian. \\[\\renewcommand\\arraystretch{1.6} J(x,y) = \\begin{bmatrix} \\frac{\\partial u}{\\partial x} & \\frac{\\partial v}{\\partial x} \\\\%[1ex] % <-- 1ex more space between rows of matrix \\frac{\\partial u}{\\partial y} & \\frac{\\partial v}{\\partial y} \\end{bmatrix} = \\begin{bmatrix} 2x & 1 \\\\%[1ex] % <-- 1ex more space between rows of matrix 0 & 1 \\end{bmatrix}, \\] and the determinant is \\(|J(x,y)| = 2x\\). Putting everything together, we get \\[\\begin{align} p_{X,Y}(x,y) = p_{U,V}(x^2, y + x) |J(x,y)| = \\frac{1}{4 \\sqrt{x^2}} 2x = \\frac{1}{2}. \\end{align}\\] This reminds us of the Uniform distribution. Indeed we can see that \\(p_X(x) = \\frac{1}{2}\\) and \\(p_Y(y) = 1\\). So instead of dealing with an awkward PDF of \\((U,V)\\) and the corresponding dynamic bounds, we are now looking at two independent Uniform random variables. In practice, this could make modeling much easier. set.seed(1) nsamps <- 2000 x <- runif(nsamps, min = 0, max = 2) y <- runif(nsamps) orig <- tibble(x = x, y = y, vrs = "original") u <- x^2 v <- y + x transf <- tibble(x = u, y = v, vrs = "transformed") df <- bind_rows(orig, transf) ggplot(df, aes(x = x, y = y, color = vrs)) + geom_point(alpha = 0.3) Exercise 8.7 R: Write a function that will calculate the probability density of an arbitraty multivariate normal distribution, based on independent standardized normal PDFs. Compare with dmvnorm from the mvtnorm package. 
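The solution below relies on the change-of-variables formula; as a brief recap (added here, with notation matching the code that follows): R's chol returns an upper-triangular \\(L\\) with \\(\\Sigma = L^T L\\), so \\(Y = \\mu + L^T Z\\) with \\(Z\\) standard normal has density \\[\\begin{align} p_Y(y) &= p_Z\\big((L^T)^{-1}(y - \\mu)\\big) \\, |\\det\\big((L^T)^{-1}\\big)| \\\\ &= \\prod_{i=1}^k \\phi(z_i) \\frac{1}{\\sqrt{\\det \\Sigma}}, \\end{align}\\] where \\(z = (L^T)^{-1}(y - \\mu)\\) and \\(\\phi\\) is the standard normal PDF.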
library(mvtnorm) set.seed(1) mvn_dens <- function (y, mu, Sigma) { L <- chol(Sigma) L_inv <- solve(t(L)) g_inv <- L_inv %*% t(y - mu) J <- L_inv J_det <- det(J) return(prod(dnorm(g_inv)) * J_det) } mu_v <- c(-2, 0, 1) cov_m <- matrix(c(1, -0.2, 0.5, -0.2, 2, 0.3, 0.5, 0.3, 1.6), ncol = 3, byrow = TRUE) n_comp <- 20 for (i in 1:n_comp) { x <- rmvnorm(1, mean = mu_v, sigma = cov_m) print(paste0("My function: ", mvn_dens(x, mu_v, cov_m), ", dmvnorm: ", dmvnorm(x, mu_v, cov_m))) } ## [1] "My function: 0.0229514237156383, dmvnorm: 0.0229514237156383" ## [1] "My function: 0.00763138915406231, dmvnorm: 0.00763138915406231" ## [1] "My function: 0.0230688881105741, dmvnorm: 0.0230688881105741" ## [1] "My function: 0.0113616213114731, dmvnorm: 0.0113616213114731" ## [1] "My function: 0.00151808500121907, dmvnorm: 0.00151808500121907" ## [1] "My function: 0.0257658045974509, dmvnorm: 0.0257658045974509" ## [1] "My function: 0.0157963825730805, dmvnorm: 0.0157963825730805" ## [1] "My function: 0.00408856287529248, dmvnorm: 0.00408856287529248" ## [1] "My function: 0.0327793540101256, dmvnorm: 0.0327793540101256" ## [1] "My function: 0.0111606542967978, dmvnorm: 0.0111606542967978" ## [1] "My function: 0.0147636757585684, dmvnorm: 0.0147636757585684" ## [1] "My function: 0.0142948300412207, dmvnorm: 0.0142948300412207" ## [1] "My function: 0.0203093820657542, dmvnorm: 0.0203093820657542" ## [1] "My function: 0.0287533273357481, dmvnorm: 0.0287533273357481" ## [1] "My function: 0.0213402305128623, dmvnorm: 0.0213402305128623" ## [1] "My function: 0.0218356957993885, dmvnorm: 0.0218356957993885" ## [1] "My function: 0.0250750113961771, dmvnorm: 0.0250750113961771" ## [1] "My function: 0.0166498666348048, dmvnorm: 0.0166498666348048" ## [1] "My function: 0.00189725106874659, dmvnorm: 0.00189725106874659" ## [1] "My function: 0.0196697814975113, dmvnorm: 0.0196697814975113" "],["ard.html", "Chapter 9 Alternative representation of distributions 9.1 Probability generating functions (PGFs) 9.2 Moment generating functions (MGFs)", " Chapter 9 Alternative representation of distributions This chapter deals with alternative representation of distributions. The students are expected to acquire the following knowledge: Theoretical Probability generating functions. Moment generating functions. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } 9.1 Probability generating functions (PGFs) Exercise 9.1 Show that the sum of independent Poisson random variables is itself a Poisson random variable. R: Let \\(X\\) be a sum of three Poisson distributions with \\(\\lambda_i \\in \\{2, 5.2, 10\\}\\). Take 1000 samples and plot the three distributions and the sum. Then take 1000 samples from the theoretical distribution of \\(X\\) and compare them to the sum. Solution. Let \\(X_i \\sim \\text{Poisson}(\\lambda_i)\\) for \\(i = 1,...,n\\), and let \\(X = \\sum_{i=1}^n X_i\\). 
\\[\\begin{align} \\alpha_X(t) &= \\prod_{i=1}^n \\alpha_{X_i}(t) \\\\ &= \\prod_{i=1}^n \\bigg( \\sum_{j=0}^\\infty t^j \\frac{\\lambda_i^j e^{-\\lambda_i}}{j!} \\bigg) \\\\ &= \\prod_{i=1}^n \\bigg( e^{-\\lambda_i} \\sum_{j=0}^\\infty \\frac{(t\\lambda_i)^j }{j!} \\bigg) \\\\ &= \\prod_{i=1}^n \\bigg( e^{-\\lambda_i} e^{t \\lambda_i} \\bigg) & \\text{power series} \\\\ &= \\prod_{i=1}^n \\bigg( e^{\\lambda_i(t - 1)} \\bigg) \\\\ &= e^{\\sum_{i=1}^n \\lambda_i(t - 1)} \\\\ &= e^{t \\sum_{i=1}^n \\lambda_i - \\sum_{i=1}^n \\lambda_i} \\\\ &= e^{-\\sum_{i=1}^n \\lambda_i} \\sum_{j=0}^\\infty \\frac{(t \\sum_{i=1}^n \\lambda_i)^j}{j!}\\\\ &= \\sum_{j=0}^\\infty \\frac{e^{-\\sum_{i=1}^n \\lambda_i} (t \\sum_{i=1}^n \\lambda_i)^j}{j!}\\\\ \\end{align}\\] The last term is the PGF of a Poisson random variable with parameter \\(\\sum_{i=1}^n \\lambda_i\\). Because the PGF is unique, \\(X\\) is a Poisson random variable. set.seed(1) library(tidyr) nsamps <- 1000 samps <- matrix(data = NA, nrow = nsamps, ncol = 4) samps[ ,1] <- rpois(nsamps, 2) samps[ ,2] <- rpois(nsamps, 5.2) samps[ ,3] <- rpois(nsamps, 10) samps[ ,4] <- samps[ ,1] + samps[ ,2] + samps[ ,3] colnames(samps) <- c(2, 2.5, 10, "sum") gsamps <- as_tibble(samps) gsamps <- gather(gsamps, key = "dist", value = "value") ggplot(gsamps, aes(x = value)) + geom_bar() + facet_wrap(~ dist) samps <- cbind(samps, "theoretical" = rpois(nsamps, 2 + 5.2 + 10)) gsamps <- as_tibble(samps[ ,4:5]) gsamps <- gather(gsamps, key = "dist", value = "value") ggplot(gsamps, aes(x = value, fill = dist)) + geom_bar(position = "dodge") Exercise 9.2 Find the expected value and variance of the negative binomial distribution. Hint: Find the Taylor series of \\((1 - y)^{-r}\\) at point 0. Solution. Let \\(X \\sim \\text{NB}(r, p)\\). \\[\\begin{align} \\alpha_X(t) &= E[t^X] \\\\ &= \\sum_{j=0}^\\infty t^j \\binom{j + r - 1}{j} (1 - p)^r p^j \\\\ &= (1 - p)^r \\sum_{j=0}^\\infty \\binom{j + r - 1}{j} (tp)^j \\\\ &= (1 - p)^r \\sum_{j=0}^\\infty \\frac{(j + r - 1)(j + r - 2)...r}{j!} (tp)^j. \\\\ \\end{align}\\] Let us look at the Taylor series of \\((1 - y)^{-r}\\) at 0 \\[\\begin{align} (1 - y)^{-r} = &1 + \\frac{-r(-1)}{1!}y + \\frac{-r(-r - 1)(-1)^2}{2!}y^2 + \\\\ &\\frac{-r(-r - 1)(-r - 2)(-1)^3}{3!}y^3 + ... \\\\ \\end{align}\\] How does the \\(k\\)-th term look like? We have \\(k\\) derivatives of our function so \\[\\begin{align} \\frac{d^k}{d^k y} (1 - y)^{-r} &= \\frac{-r(-r - 1)...(-r - k + 1)(-1)^k}{k!}y^k \\\\ &= \\frac{r(r + 1)...(r + k - 1)}{k!}y^k. \\end{align}\\] We observe that this equals to the \\(j\\)-th term in the sum of NB PGF. Therefore \\[\\begin{align} \\alpha_X(t) &= (1 - p)^r (1 - tp)^{-r} \\\\ &= \\Big(\\frac{1 - p}{1 - tp}\\Big)^r \\end{align}\\] To find the expected value, we need to differentiate \\[\\begin{align} \\frac{d}{dt} \\Big(\\frac{1 - p}{1 - tp}\\Big)^r &= r \\Big(\\frac{1 - p}{1 - tp}\\Big)^{r-1} \\frac{d}{dt} \\frac{1 - p}{1 - tp} \\\\ &= r \\Big(\\frac{1 - p}{1 - tp}\\Big)^{r-1} \\frac{p(1 - p)}{(1 - tp)^2}. \\\\ \\end{align}\\] Evaluating this at 1, we get: \\[\\begin{align} E[X] = \\frac{rp}{1 - p}. \\end{align}\\] For the variance we need the second derivative. 
\\[\\begin{align} \\frac{d^2}{d^2t} \\Big(\\frac{1 - p}{1 - tp}\\Big)^r &= \\frac{p^2 r (r + 1) (\\frac{1 - p}{1 - tp})^r}{(tp - 1)^2} \\end{align}\\] Evaluating this at 1 and inserting the first derivatives, we get: \\[\\begin{align} Var[X] &= \\frac{d^2}{dt^2} \\alpha_X(1) + \\frac{d}{dt}\\alpha_X(1) - \\Big(\\frac{d}{dt}\\alpha_X(t) \\Big)^2 \\\\ &= \\frac{p^2 r (r + 1)}{(1 - p)^2} + \\frac{rp}{1 - p} - \\frac{r^2p^2}{(1 - p)^2} \\\\ &= \\frac{rp}{(1 - p)^2}. \\end{align}\\] library(tidyr) set.seed(1) nsamps <- 100000 find_p <- function (mu, r) { return (10 / (r + 10)) } r <- c(1,2,10,20) p <- find_p(10, r) sigma <- rep(sqrt(p*r / (1 - p)^2), each = nsamps) samps <- cbind("r=1" = rnbinom(nsamps, size = r[1], prob = 1 - p[1]), "r=2" = rnbinom(nsamps, size = r[2], prob = 1 - p[2]), "r=4" = rnbinom(nsamps, size = r[3], prob = 1 - p[3]), "r=20" = rnbinom(nsamps, size = r[4], prob = 1 - p[4])) gsamps <- gather(as.data.frame(samps)) iw <- (gsamps$value > sigma + 10) | (gsamps$value < sigma - 10) ggplot(gsamps, aes(x = value, fill = iw)) + geom_bar() + # geom_density() + facet_wrap(~ key) 9.2 Moment generating functions (MGFs) Exercise 9.3 Find the variance of the geometric distribution. Solution. Let \\(X \\sim \\text{Geometric}(p)\\). The MGF of the geometric distribution is \\[\\begin{align} M_X(t) &= E[e^{tX}] \\\\ &= \\sum_{k=0}^\\infty p(1 - p)^k e^{tk} \\\\ &= p \\sum_{k=0}^\\infty ((1 - p)e^t)^k. \\end{align}\\] Let us assume that \\((1 - p)e^t < 1\\). Then, by using the geometric series we get \\[\\begin{align} M_X(t) &= \\frac{p}{1 - e^t + pe^t}. \\end{align}\\] The first derivative of the above expression is \\[\\begin{align} \\frac{d}{dt}M_X(t) &= \\frac{-p(-e^t + pe^t)}{(1 - e^t + pe^t)^2}, \\end{align}\\] and evaluating at \\(t = 0\\), we get \\(\\frac{1 - p}{p}\\), which we already recognize as the expected value of the geometric distribution. The second derivative is \\[\\begin{align} \\frac{d^2}{dt^2}M_X(t) &= \\frac{(p-1)pe^t((p-1)e^t - 1)}{((p - 1)e^t + 1)^3}, \\end{align}\\] and evaluating at \\(t = 0\\), we get \\(\\frac{(p - 1)(p - 2)}{p^2}\\). Combining we get the variance \\[\\begin{align} Var(X) &= \\frac{(p - 1)(p - 2)}{p^2} - \\frac{(1 - p)^2}{p^2} \\\\ &= \\frac{(p-1)(p-2) - (1-p)^2}{p^2} \\\\ &= \\frac{1 - p}{p^2}. \\end{align}\\] Exercise 9.4 Find the distribution of sum of two normal random variables \\(X\\) and \\(Y\\), by comparing \\(M_{X+Y}(t)\\) to \\(M_X(t)\\). R: To illustrate the result draw random samples from N\\((-3, 1)\\) and N\\((5, 1.2)\\) and calculate the empirical mean and variance of \\(X+Y\\). Plot all three histograms in one plot. Solution. Let \\(X \\sim \\text{N}(\\mu_X, 1)\\) and \\(Y \\sim \\text{N}(\\mu_Y, 1)\\). The MGF of the sum is \\[\\begin{align} M_{X+Y}(t) &= M_X(t) M_Y(t). \\end{align}\\] Let us calculate \\(M_X(t)\\), the MGF for \\(Y\\) then follows analogously. 
\\[\\begin{align} M_X(t) &= \\int_{-\\infty}^\\infty e^{tx} \\frac{1}{\\sqrt{2 \\pi \\sigma_X^2}} e^{-\\frac{(x - \\mu_X)^2}{2\\sigma_X^2}} dx \\\\ &= \\int_{-\\infty}^\\infty \\frac{1}{\\sqrt{2 \\pi \\sigma_X^2}} e^{-\\frac{(x - \\mu_X)^2 - 2\\sigma_X^2 tx}{2\\sigma_X^2}} dx \\\\ &= \\int_{-\\infty}^\\infty \\frac{1}{\\sqrt{2 \\pi \\sigma_X^2}} e^{-\\frac{x^2 - 2\\mu_X x + \\mu_X^2 - 2\\sigma_X^2 tx}{2\\sigma_X^2}} dx \\\\ &= \\int_{-\\infty}^\\infty \\frac{1}{\\sqrt{2 \\pi \\sigma_X^2}} e^{-\\frac{(x - (\\mu_X + \\sigma_X^2 t))^2 + \\mu_X^2 - (\\mu_X + \\sigma_X^2 t)^2}{2\\sigma_X^2}} dx & \\text{complete the square}\\\\ &= e^{-\\frac{\\mu_X^2 - (\\mu_X + \\sigma_X^2 t)^2}{2\\sigma_X^2}} \\int_{-\\infty}^\\infty \\frac{1}{\\sqrt{2 \\pi \\sigma_X^2}} e^{-\\frac{(x - (\\mu_X + \\sigma_X^2 t))^2}{2\\sigma_X^2}} dx & \\\\ &= e^{-\\frac{\\mu_X^2 - (\\mu_X + \\sigma_X^2 t)^2}{2\\sigma_X^2}} & \\text{normal PDF} \\\\ &= e^{-\\frac{\\mu_X^2 - \\mu_X^2 - 2\\mu_X \\sigma_X^2 t - \\sigma_X^4 t^2}{2\\sigma_X^2}} \\\\ &= e^{\\mu_X t + \\frac{\\sigma_X^2 t^2}{2}}. \\\\ \\end{align}\\] The MGF of the sum is then \\[\\begin{align} M_{X+Y}(t) &= e^{\\mu_X t + 0.5\\sigma_X^2 t^2} e^{\\mu_Y t + 0.5\\sigma_Y^2 t^2} \\\\ &= e^{t(\\mu_X + \\mu_Y) + 0.5 t^2(\\sigma_X^2 + \\sigma_Y^2)}. \\end{align}\\] By comparing \\(M_{X+Y}(t)\\) and \\(M_X(t)\\) we observe that both have the same form: the exponent is \\(t\\) multiplied by the mean plus \\(\\frac{t^2}{2}\\) multiplied by the variance. Since MGFs are unique, we conclude that \\(Z = X + Y \\sim \\text{N}(\\mu_X + \\mu_Y, \\sigma_X^2 + \\sigma_Y^2)\\). library(tidyr) library(ggplot2) set.seed(1) nsamps <- 1000 x <- rnorm(nsamps, -3, 1) y <- rnorm(nsamps, 5, 1.2) z <- x + y mean(z) ## [1] 1.968838 var(z) ## [1] 2.645034 df <- data.frame(x = x, y = y, z = z) %>% gather() ggplot(df, aes(x = value, fill = key)) + geom_histogram(position = "dodge") "],["ci.html", "Chapter 10 Concentration inequalities 10.1 Comparison 10.2 Practical", " Chapter 10 Concentration inequalities This chapter deals with concentration inequalities. The students are expected to acquire the following knowledge: Theoretical More assumptions produce closer bounds. R Optimization. Estimating probability inequalities. 10.1 Comparison Exercise 10.1 R: Let \\(X\\) be a geometric random variable with \\(p = 0.7\\). Visually compare the Markov bound, Chernoff bound, and the theoretical probabilities for \\(x = 1,...,12\\). To get the best fitting Chernoff bound, you will need to optimize the bound depending on \\(t\\). Use either analytical or numerical optimization.
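As a reminder of what is being optimized below: for \\(t > 0\\) the Chernoff bound is \\(P(X \\geq a) \\leq e^{-ta} M_X(t)\\), and for the geometric distribution we derived \\(M_X(t) = \\frac{p}{1 - (1-p)e^t}\\) in Exercise 9.3, valid for \\((1-p)e^t < 1\\). The bound for a given \\(a\\) is therefore \\[\\begin{align} P(X \\geq a) \\leq \\frac{p}{1 - (1-p)e^t} e^{-ta}, \\quad 0 < t < \\ln \\frac{1}{1-p}, \\end{align}\\] which is exactly what the bound_chernoff function in the code below computes and what optimize minimizes over \\(t\\) on the interval \\((0, \\log \\frac{1}{1-p})\\).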
bound_chernoff <- function (t, p, a) { return ((p / (1 - exp(t) + p * exp(t))) / exp(a * t)) } set.seed(1) p <- 0.7 a <- seq(1, 12, by = 1) ci_markov <- (1 - p) / p / a t <- vector(mode = "numeric", length = length(a)) for (i in 1:length(t)) { t[i] <- optimize(bound_chernoff, interval = c(0, log(1 / (1 - p))), p = p, a = a[i])$minimum } t ## [1] 0.5108267 0.7984981 0.9162927 0.9808238 1.0216635 1.0498233 1.0704327 ## [8] 1.0861944 1.0986159 1.1086800 1.1169653 1.1239426 ci_chernoff <- (p / (1 - exp(t) + p * exp(t))) / exp(a * t) actual <- 1 - pgeom(a, 0.7) plot_df <- rbind( data.frame(x = a, y = ci_markov, type = "Markov"), data.frame(x = a, y = ci_chernoff, type = "Chernoff"), data.frame(x = a, y = actual, type = "Actual") ) ggplot(plot_df, aes(x = x, y = y, color = type)) + geom_line() Exercise 10.2 R: Let \\(X\\) be a sum of 100 Beta distributions with random parameters. Take 1000 samples and plot the Chebyshev bound, Hoeffding bound, and the empirical probabilities. set.seed(1) nvars <- 100 nsamps <- 1000 samps <- matrix(data = NA, nrow = nsamps, ncol = nvars) Sn_mean <- 0 Sn_var <- 0 for (i in 1:nvars) { alpha1 <- rgamma(1, 10, 1) beta1 <- rgamma(1, 10, 1) X <- rbeta(nsamps, alpha1, beta1) Sn_mean <- Sn_mean + alpha1 / (alpha1 + beta1) Sn_var <- Sn_var + alpha1 * beta1 / ((alpha1 + beta1)^2 * (alpha1 + beta1 + 1)) samps[ ,i] <- X } mean(apply(samps, 1, sum)) ## [1] 51.12511 Sn_mean ## [1] 51.15723 var(apply(samps, 1, sum)) ## [1] 1.170652 Sn_var ## [1] 1.166183 a <- 1:30 b <- a / sqrt(Sn_var) ci_chebyshev <- 1 / b^2 ci_hoeffding <- 2 * exp(- 2 * a^2 / nvars) empirical <- NULL for (i in 1:length(a)) { empirical[i] <- sum(abs((apply(samps, 1, sum)) - Sn_mean) >= a[i])/ nsamps } plot_df <- rbind( data.frame(x = a, y = ci_chebyshev, type = "Chebyshev"), data.frame(x = a, y = ci_hoeffding, type = "Hoeffding"), data.frame(x = a, y = empirical, type = "Empirical") ) ggplot(plot_df, aes(x = x, y = y, color = type)) + geom_line() ggplot(plot_df, aes(x = x, y = y, color = type)) + geom_line() + coord_cartesian(xlim = c(15, 25), ylim = c(0, 0.05)) 10.2 Practical Exercise 10.3 From Jagannathan. Let \\(X_i\\), \\(i = 1,...n\\), be a random sample of size \\(n\\) of a random variable \\(X\\). Let \\(X\\) have mean \\(\\mu\\) and variance \\(\\sigma^2\\). Find the size of the sample \\(n\\) required so that the probability that the difference between sample mean and true mean is smaller than \\(\\frac{\\sigma}{10}\\) is at least 0.95. Hint: Derive a version of the Chebyshev inequality for \\(P(|X - \\mu| \\geq a)\\) using Markov inequality. Solution. Let \\(\\bar{X} = \\sum_{i=1}^n X_i\\). Then \\(E[\\bar{X}] = \\mu\\) and \\(Var[\\bar{X}] = \\frac{\\sigma^2}{n}\\). Let us first derive another representation of Chebyshev inequality. \\[\\begin{align} P(|X - \\mu| \\geq a) = P(|X - \\mu|^2 \\geq a^2) \\leq \\frac{E[|X - \\mu|^2]}{a^2} = \\frac{Var[X]}{a^2}. \\end{align}\\] Let us use that on our sampling distribution: \\[\\begin{align} P(|\\bar{X} - \\mu| \\geq \\frac{\\sigma}{10}) \\leq \\frac{100 Var[\\bar{X}]}{\\sigma^2} = \\frac{100 Var[X]}{n \\sigma^2} = \\frac{100}{n}. \\end{align}\\] We are interested in the difference being smaller, therefore \\[\\begin{align} P(|\\bar{X} - \\mu| < \\frac{\\sigma}{10}) = 1 - P(|\\bar{X} - \\mu| \\geq \\frac{\\sigma}{10}) \\geq 1 - \\frac{100}{n} \\geq 0.95. \\end{align}\\] It follows that we need a sample size of \\(n \\geq \\frac{100}{0.05} = 2000\\). 
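A quick simulation makes the conservativeness of this bound visible. The sketch below is ours and assumes a standard normal for \\(X\\) (any distribution with finite variance would do); the derivation guarantees at least 0.95 with \\(n = 2000\\), while the observed frequency is much higher, because Chebyshev ignores the shape of the distribution.
set.seed(1)
n    <- 2000
nrep <- 10000
# fraction of samples where the sample mean lies within sigma / 10 = 0.1 of mu = 0
within <- replicate(nrep, abs(mean(rnorm(n))) < 1 / 10)
mean(within)  # very close to 1, comfortably above the guaranteed 0.95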
"],["crv.html", "Chapter 11 Convergence of random variables", " Chapter 11 Convergence of random variables This chapter deals with convergence of random variables. The students are expected to acquire the following knowledge: Theoretical Finding convergences of random variables. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } Exercise 11.1 Let \\(X_1\\), \\(X_2\\),…, \\(X_n\\) be a sequence of Bernoulli random variables. Let \\(Y_n = \\frac{X_1 + X_2 + ... + X_n}{n^2}\\). Show that this sequence converges point-wise to the zero random variable. R: Use a simulation to check your answer. Solution. Let \\(\\epsilon\\) be arbitrary. We need to find such \\(n_0\\), that for every \\(n\\) greater than \\(n_0\\) \\(|Y_n| < \\epsilon\\) holds. \\[\\begin{align} |Y_n| &= |\\frac{X_1 + X_2 + ... + X_n}{n^2}| \\\\ &\\leq |\\frac{n}{n^2}| \\\\ &= \\frac{1}{n}. \\end{align}\\] So we need to find such \\(n_0\\), that for every \\(n > n_0\\) we will have \\(\\frac{1}{n} < \\epsilon\\). So \\(n_0 > \\frac{1}{\\epsilon}\\). x <- 1:1000 X <- matrix(data = NA, nrow = length(x), ncol = 100) y <- vector(mode = "numeric", length = length(x)) for (i in 1:length(x)) { X[i, ] <- rbinom(100, size = 1, prob = 0.5) } X <- apply(X, 2, cumsum) tmp_mat <- matrix(data = (1:1000)^2, nrow = 1000, ncol = 100) X <- X / tmp_mat y <- apply(X, 1, mean) ggplot(data.frame(x = x, y = y), aes(x = x, y = y)) + geom_line() Exercise 11.2 Let \\(\\Omega = [0,1]\\) and let \\(X_n\\) be a sequence of random variables, defined as \\[\\begin{align} X_n(\\omega) = \\begin{cases} \\omega^3, &\\omega = \\frac{i}{n}, &0 \\leq i \\leq 1 \\\\ 1, & \\text{otherwise.} \\end{cases} \\end{align}\\] Show that \\(X_n\\) converges almost surely to \\(X \\sim \\text{Uniform}(0,1)\\). Solution. We need to show \\(P(\\{\\omega: X_n(\\omega) \\rightarrow X(\\omega)\\}) = 1\\). Let \\(\\omega \\neq \\frac{i}{n}\\). Then for any \\(\\omega\\), \\(X_n\\) converges pointwise to \\(X\\): \\[\\begin{align} X_n(\\omega) = 1 \\implies |X_n(\\omega) - X(s)| = |1 - 1| < \\epsilon. \\end{align}\\] The above is independent of \\(n\\). Since there are countably infinite number of elements in the complement (\\(\\frac{i}{n}\\)), the probability of this set is 1. Exercise 11.3 Borrowed from Wasserman. Let \\(X_n \\sim \\text{N}(0, \\frac{1}{n})\\) and let \\(X\\) be a random variable with CDF \\[\\begin{align} F_X(x) = \\begin{cases} 0, &x < 0 \\\\ 1, &x \\geq 0. \\end{cases} \\end{align}\\] Does \\(X_n\\) converge to \\(X\\) in distribution? How about in probability? Prove or disprove these statement. R: Plot the CDF of \\(X_n\\) for \\(n = 1, 2, 5, 10, 100, 1000\\). Solution. Let us first check convergence in distribution. \\[\\begin{align} \\lim_{n \\rightarrow \\infty} F_{X_n}(x) &= \\lim_{n \\rightarrow \\infty} \\phi (\\sqrt(n) x). \\end{align}\\] We have two cases, for \\(x < 0\\) and \\(x > 0\\). We do not need to check for \\(x = 0\\), since \\(F_X\\) is not continuous in that point. \\[\\begin{align} \\lim_{n \\rightarrow \\infty} \\phi (\\sqrt(n) x) = \\begin{cases} 0, & x < 0 \\\\ 1, & x > 0. \\end{cases} \\end{align}\\] This is the same as \\(F_X\\). Let us now check convergence in probability. 
Since \\(X\\) is a point-mass distribution at zero, we have \\[\\begin{align} \\lim_{n \\rightarrow \\infty} P(|X_n| > \\epsilon) &= \\lim_{n \\rightarrow \\infty} (P(X_n > \\epsilon) + P(X_n < -\\epsilon)) \\\\ &= \\lim_{n \\rightarrow \\infty} (1 - P(X_n < \\epsilon) + P(X_n < -\\epsilon)) \\\\ &= \\lim_{n \\rightarrow \\infty} (1 - \\phi(\\sqrt{n} \\epsilon) + \\phi(- \\sqrt{n} \\epsilon)) \\\\ &= 0. \\end{align}\\] n <- c(1,2,5,10,100,1000) ggplot(data = data.frame(x = seq(-5, 5, by = 0.01)), aes(x = x)) + stat_function(fun = pnorm, args = list(mean = 0, sd = 1/1), aes(color = "sd = 1/1")) + stat_function(fun = pnorm, args = list(mean = 0, sd = 1/2), aes(color = "sd = 1/2")) + stat_function(fun = pnorm, args = list(mean = 0, sd = 1/5), aes(color = "sd = 1/5")) + stat_function(fun = pnorm, args = list(mean = 0, sd = 1/10), aes(color = "sd = 1/10")) + stat_function(fun = pnorm, args = list(mean = 0, sd = 1/100), aes(color = "sd = 1/100")) + stat_function(fun = pnorm, args = list(mean = 0, sd = 1/1000), aes(color = "sd = 1/1000")) + stat_function(fun = pnorm, args = list(mean = 0, sd = 1/10000), aes(color = "sd = 1/10000")) Exercise 11.4 Let \\(X_i\\) be i.i.d. and \\(\\mu = E(X_1)\\). Let variance of \\(X_1\\) be finite. Show that the mean of \\(X_i\\), \\(\\bar{X}_n = \\frac{1}{n}\\sum_{i=1}^n X_i\\) converges in quadratic mean to \\(\\mu\\). Solution. \\[\\begin{align} \\lim_{n \\rightarrow \\infty} E(|\\bar{X_n} - \\mu|^2) &= \\lim_{n \\rightarrow \\infty} E(\\bar{X_n}^2 - 2 \\bar{X_n} \\mu + \\mu^2) \\\\ &= \\lim_{n \\rightarrow \\infty} (E(\\bar{X_n}^2) - 2 \\mu E(\\frac{\\sum_{i=1}^n X_i}{n}) + \\mu^2) \\\\ &= \\lim_{n \\rightarrow \\infty} E(\\bar{X_n})^2 + \\lim_{n \\rightarrow \\infty} Var(\\bar{X_n}) - 2 \\mu^2 + \\mu^2 \\\\ &= \\lim_{n \\rightarrow \\infty} \\frac{n^2 \\mu^2}{n^2} + \\lim_{n \\rightarrow \\infty} \\frac{\\sigma^2}{n} - \\mu^2 \\\\ &= \\mu^2 - \\mu^2 + \\lim_{n \\rightarrow \\infty} \\frac{\\sigma^2}{n} \\\\ &= 0. \\end{align}\\] "],["lt.html", "Chapter 12 Limit theorems", " Chapter 12 Limit theorems This chapter deals with limit theorems. The students are expected to acquire the following knowledge: Theoretical Monte Carlo integration convergence. Difference between weak and strong law of large numbers. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } Exercise 12.1 Show that Monte Carlo integration converges almost surely to the true integral of a bounded function. Solution. Let \\(g\\) be a function defined on \\(\\Omega\\). Let \\(X_i\\), \\(i = 1,...,n\\) be i.i.d. (multivariate) uniform random variables with bounds defined on \\(\\Omega\\). Let \\(Y_i\\) = \\(g(X_i)\\). Then it follows that \\(Y_i\\) are also i.i.d. random variables and their expected value is \\(E[g(X)] = \\int_{\\Omega} g(x) f_X(x) dx = \\frac{1}{V_{\\Omega}} \\int_{\\Omega} g(x) dx\\). By the strong law of large numbers, we have \\[\\begin{equation} \\frac{1}{n}\\sum_{i=1}^n Y_i \\xrightarrow{\\text{a.s.}} E[g(X)]. \\end{equation}\\] It follows that \\[\\begin{equation} V_{\\Omega} \\frac{1}{n}\\sum_{i=1}^n Y_i \\xrightarrow{\\text{a.s.}} \\int_{\\Omega} g(x) dx. \\end{equation}\\] Exercise 12.2 Let \\(X\\) be a geometric random variable with probability 0.5 and support in positive integers. Let \\(Y = 2^X (-1)^X X^{-1}\\). Find the expected value of \\(Y\\) by using conditional convergence (this variable does not have an expected value in the conventional sense – the series is not absolutely convergent). 
R: Draw \\(10000\\) samples from a geometric distribution with probability 0.5 and support in positive integers to get \\(X\\). Then calculate \\(Y\\) and plot the means at each iteration (sample). Additionally, plot the expected value calculated in a. Try it with different seeds. What do you notice? Solution. \\[\\begin{align*} E[Y] &= \\sum_{x=1}^{\\infty} \\frac{2^x (-1)^x}{x} 0.5^x \\\\ &= \\sum_{x=1}^{\\infty} \\frac{(-1)^x}{x} \\\\ &= - \\sum_{x=1}^{\\infty} \\frac{(-1)^{x+1}}{x} \\\\ &= - \\ln(2) \\end{align*}\\] set.seed(3) x <- rgeom(100000, prob = 0.5) + 1 y <- 2^x * (-1)^x * x^{-1} y_means <- cumsum(y) / seq_along(y) df <- data.frame(x = 1:length(y_means), y = y_means) ggplot(data = df, aes(x = x, y = y)) + geom_line() + geom_hline(yintercept = -log(2)) "],["eb.html", "Chapter 13 Estimation basics 13.1 ECDF 13.2 Properties of estimators", " Chapter 13 Estimation basics This chapter deals with estimation basics. The students are expected to acquire the following knowledge: Biased and unbiased estimators. Consistent estimators. Empirical cumulative distribution function. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } 13.1 ECDF Exercise 13.1 (ECDF intuition) Take any univariate continuous distribution that is readily available in R and plot its CDF (\\(F\\)). Draw one sample (\\(n = 1\\)) from the chosen distribution and draw the ECDF (\\(F_n\\)) of that one sample. Use the definition of the ECDF, not an existing function in R. Implementation hint: ECDFs are always piecewise constant - they only jump at the sampled values and by \\(1/n\\). Repeat (b) for \\(n = 5, 10, 100, 1000...\\) Theory says that \\(F_n\\) should converge to \\(F\\). Can you observe that? For \\(n = 100\\) repeat the process \\(m = 20\\) times and plot every \\(F_n^{(m)}\\). Theory says that \\(F_n\\) will converge to \\(F\\) the slowest where \\(F\\) is close to 0.5 (where the variance is largest). Can you observe that? library(ggplot2) set.seed(1) ggplot(data = data.frame(x = seq(-5, 5, by = 0.01))) + # stat_function(aes(x = x), fun = pbeta, args = list(shape1 = 1, shape2 = 2)) stat_function(aes(x = x), fun = pnorm, args = list(mean = 0, sd = 1)) one_samp <- rnorm(1) X <- data.frame(x = c(-5, one_samp, 5), y = c(0,1,1)) ggplot(data = data.frame(x = seq(-5, 5, by = 0.01))) + # stat_function(aes(x = x), fun = pbeta, args = list(shape1 = 1, shape2 = 2)) stat_function(aes(x = x), fun = pnorm, args = list(mean = 0, sd = 1)) + geom_step(data = X, aes(x = x, y = y)) N <- c(5, 10, 100, 1000) X <- NULL for (n in N) { tmp <- rnorm(n) tmp_X <- data.frame(x = c(-5, sort(tmp), 5), y = c(0, seq(1/n, 1, by = 1/n), 1), n = n) X <- rbind(X, tmp_X) } ggplot(data = data.frame(x = seq(-5, 5, by = 0.01))) + # stat_function(aes(x = x), fun = pbeta, args = list(shape1 = 1, shape2 = 2)) stat_function(aes(x = x), fun = pnorm, args = list(mean = 0, sd = 1)) + geom_step(data = X, aes(x = x, y = y, color = as.factor(n))) + labs(color = "N") 13.2 Properties of estimators Exercise 13.2 Show that the sample average is, as an estimator of the mean: unbiased, consistent, asymptotically normal. Solution. \\[\\begin{align*} E[\\frac{1}{n} \\sum_{i=1}^n X_i] &= \\frac{1}{n} \\sum_{i=i}^n E[X_i] \\\\ &= E[X]. 
\\end{align*}\\] \\[\\begin{align*} \\lim_{n \\rightarrow \\infty} P(|\\frac{1}{n} \\sum_{i=1}^n X_i - E[X]| > \\epsilon) &= \\lim_{n \\rightarrow \\infty} P((\\frac{1}{n} \\sum_{i=1}^n X_i - E[X])^2 > \\epsilon^2) \\\\ & \\leq \\lim_{n \\rightarrow \\infty} \\frac{E[(\\frac{1}{n} \\sum_{i=1}^n X_i - E[X])^2]}{\\epsilon^2} & \\text{Markov inequality} \\\\ & = \\lim_{n \\rightarrow \\infty} \\frac{E[(\\frac{1}{n} \\sum_{i=1}^n X_i)^2 - 2 \\frac{1}{n} \\sum_{i=1}^n X_i E[X] + E[X]^2]}{\\epsilon^2} \\\\ & = \\lim_{n \\rightarrow \\infty} \\frac{E[(\\frac{1}{n} \\sum_{i=1}^n X_i)^2] - 2 E[X]^2 + E[X]^2}{\\epsilon^2} \\\\ &= 0 \\end{align*}\\] For the last equality see the solution to ??. Follows directly from the CLT. Exercise 13.3 (Consistent but biased estimator) Show that sample variance (the plug-in estimator of variance) is a biased estimator of variance. Show that sample variance is a consistent estimator of variance. Show that the estimator with (\\(N-1\\)) (Bessel correction) is unbiased. Solution. \\[\\begin{align*} E[\\frac{1}{n} \\sum_{i=1}^n (Y_i - \\bar{Y})^2] &= \\frac{1}{n} \\sum_{i=1}^n E[(Y_i - \\bar{Y})^2] \\\\ &= \\frac{1}{n} \\sum_{i=1}^n E[Y_i^2] - 2 E[Y_i \\bar{Y}] + \\bar{Y}^2)] \\\\ &= \\frac{1}{n} \\sum_{i=1}^n E[Y_i^2 - 2 Y_i \\bar{Y} + \\bar{Y}^2] \\\\ &= \\frac{1}{n} \\sum_{i=1}^n E[Y_i^2 - \\frac{2}{n} Y_i^2 - \\frac{2}{n} \\sum_{i \\neq j} Y_i Y_j + \\frac{1}{n^2}\\sum_j \\sum_{k \\neq j} Y_j Y_k + \\frac{1}{n^2} \\sum_j Y_j^2] \\\\ &= \\frac{1}{n} \\sum_{i=1}^n \\frac{n - 2}{n} (\\sigma^2 + \\mu^2) - \\frac{2}{n} (n - 1) \\mu^2 + \\frac{1}{n^2}n(n-1)\\mu^2 + \\frac{1}{n^2}n(\\sigma^2 + \\mu^2) \\\\ &= \\frac{n-1}{n}\\sigma^2 \\\\ < \\sigma^2. \\end{align*}\\] Let \\(S_n\\) denote the sample variance. Then we can write it as \\[\\begin{align*} S_n &= \\frac{1}{n} \\sum_{i=1}^n (X_i - \\bar{X})^2 = \\frac{1}{n} \\sum_{i=1}^n (X_i - \\mu)^2 + 2(X_i - \\mu)(\\mu - \\bar{X}) + (\\mu - \\bar{X})^2. \\end{align*}\\] Now \\(\\bar{X}\\) converges in probability (by WLLN) to \\(\\mu\\) therefore the right terms converge in probability to zero. The left term converges in probability to \\(\\sigma^2\\), also by WLLN. Therefore the sample variance is a consistent estimatior of the variance. The denominator changes in the second-to-last line of a., therefore the last line is now equality. Exercise 13.4 (Estimating the median) Show that the sample median is an unbiased estimator of the median for N\\((\\mu, \\sigma^2)\\). Show that the sample median is an unbiased estimator of the mean for any distribution with symmetric density. Hint 1: The pdf of an order statistic is \\(f_{X_{(k)}}(x) = \\frac{n!}{(n - k)!(k - 1)!}f_X(x)\\Big(F_X(x)^{k-1} (1 - F_X(x)^{n - k}) \\Big)\\). Hint 2: A distribution is symmetric when \\(X\\) and \\(2a - X\\) have the same distribution for some \\(a\\). Solution. Let \\(Z_i\\), \\(i = 1,...,n\\) be i.i.d. variables with a symmetric distribution and let \\(Z_{k:n}\\) denote the \\(k\\)-th order statistic. We will distinguish two cases, when \\(n\\) is odd and when \\(n\\) is even. Let first \\(n = 2m + 1\\) be odd. Then the sample median is \\(M = Z_{m+1:2m+1}\\). Its PDF is \\[\\begin{align*} f_M(x) = (m+1)\\binom{2m + 1}{m}f_Z(x)\\Big(F_Z(x)^m (1 - F_Z(x)^m) \\Big). \\end{align*}\\] For every symmetric distribution, it holds that \\(F_X(x) = 1 - F(2a - x)\\). Let \\(a = \\mu\\), the population mean. Plugging this into the PDF, we get that \\(f_M(x) = f_M(2\\mu -x)\\). 
It follows that \\[\\begin{align*} E[M] &= E[2\\mu - M] \\\\ 2E[M] &= 2\\mu \\\\ E[M] &= \\mu. \\end{align*}\\] Now let \\(n = 2m\\) be even. Then the sample median is \\(M = \\frac{Z_{m:2m} + Z_{m+1:2m}}{2}\\). It can be shown, that the joint PDF of these terms is also symmetric. Therefore, similar to the above \\[\\begin{align*} E[M] &= E[\\frac{Z_{m:2m} + Z_{m+1:2m}}{2}] \\\\ &= E[\\frac{2\\mu - M + 2\\mu - M}{2}] \\\\ &= E[2\\mu - M]. \\end{align*}\\] The above also proves point a. as the median and the mean are the same in normal distribution. Exercise 13.5 (Matrix trace estimation) The Hutchinson trace estimator [1] is an estimator of the trace of a symmetric positive semidefinite matrix A that relies on Monte Carlo sampling. The estimator is defined as \\[\\begin{align*} \\textrm{tr}(A) \\approx \\frac{1}{n} \\Sigma_{i=1}^n z_i^T A z_i, &\\\\ z_i \\sim_{\\mathrm{IID}} \\textrm{Uniform}(\\{-1, 1\\}^m), & \\end{align*}\\] where \\(A \\in \\mathbb{R}^{m \\times m}\\) is a symmetric positive semidefinite matrix. Elements of each vector \\(z_i\\) are either \\(-1\\) or \\(1\\) with equal probability. This is also called a Rademacher distribution. Data scientists often want the trace of a Hessian to obtain valuable curvature information for a loss function. Per [2], an example is classifying ten digits based on \\((28,28)\\) grayscale images (i.e. MNIST data) using logistic regression. The number of parameters is \\(m = 28^2 \\cdot 10 = 7840\\) and the size of the Hessian is \\(m^2\\), roughly \\(6 \\cdot 10^6\\). The diagonal average is equal to the average eigenvalue, which may be useful for optimization; in MCMC contexts, this would be useful for preconditioners and step size optimization. Computing Hessians (as a means of getting eigenvalue information) is often intractable, but Hessian-vector products can be computed faster by autodifferentiation (with e.g. Tensorflow, Pytorch, Jax). This is one motivation for the use of a stochastic trace estimator as outlined above. References: A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines (Hutchinson, 1990) A Modern Analysis of Hutchinson’s Trace Estimator (Skorski, 2020) Prove that the Hutchinson trace estimator is an unbiased estimator of the trace. Solution. We first simplify our task: \\[\\begin{align} \\mathbb{E}\\left[\\frac{1}{n} \\Sigma_{i=1}^n z_i^T A z_i \\right] &= \\frac{1}{n} \\Sigma_{i=1}^n \\mathbb{E}\\left[z_i^T A z_i \\right] \\\\ &= \\mathbb{E}\\left[z_i^T A z_i \\right], \\end{align}\\] where the second equality is due to having \\(n\\) IID vectors \\(z_i\\). We now only need to show that \\(\\mathbb{E}\\left[z^T A z \\right] = \\mathrm{tr}(A)\\). We omit the index due to all vectors being IID: \\[\\begin{align} \\mathrm{tr}(A) &= \\mathrm{tr}(AI) \\\\ &= \\mathrm{tr}(A\\mathbb{E}[zz^T]) \\\\ &= \\mathbb{E}[\\mathrm{tr}(Azz^T)] \\\\ &= \\mathbb{E}[\\mathrm{tr}(z^TAz)] \\\\ &= \\mathbb{E}[z^TAz]. \\end{align}\\] This concludes the proof. We clarify some equalities below. The second equality assumes that \\(\\mathbb{E}[zz^T] = I\\). By noting that the mean of the Rademacher distribution is 0, we have \\[\\begin{align} \\mathrm{Cov}[z, z] &= \\mathbb{E}[(z - \\mathbb{E}[z])(z - \\mathbb{E}[z])^T] \\\\ &= \\mathbb{E}[zz^T]. \\end{align}\\] Dimensions of \\(z\\) are independent, so \\(\\mathrm{Cov}[z, z]_{ij} = 0\\) for \\(i \\neq j\\). 
The diagonal will contain variances, which are equal to \\(1\\) for all dimensions \\(k = 1 \\dots m\\): \\(\\mathrm{Var}[z^{(k)}] = \\mathbb{E}[z^{(k)}z^{(k)}] - \\mathbb{E}[z^{(k)}]^2 = 1 - 0 = 1\\). It follows that the covariance is an identity matrix. Note that this is a general result for vectors with IID dimensions sampled from a distribution with mean 0 and variance 1. We could therefore use something else instead of the Rademacher, e.g. \\(z ~ N(0, I)\\). The third equality uses the fact that the expectation of a trace equals the trace of an expectation. If \\(X\\) is a random matrix, then \\(\\mathbb{E}[X]_{ij} = \\mathbb{E}[X_{ij}]\\). Therefore: \\[\\begin{align} \\mathrm{tr}(\\mathbb{E}[X]) &= \\Sigma_{i=1}^m(\\mathbb{E}[X]_{ii}) \\\\ &= \\Sigma_{i=1}^m(\\mathbb{E}[X_{ii}]) \\\\ &= \\mathbb{E}[\\Sigma_{i=1}^m(X_{ii})] \\\\ &= \\mathbb{E}[\\mathrm{tr}(X)], \\end{align}\\] where we used the linearity of the expectation in the third step. The fourth equality uses the fact that \\(\\mathrm{tr}(AB) = \\mathrm{tr}(BA)\\) for any matrices \\(A \\in \\mathbb{R}^{n \\times m}, B \\in \\mathbb{R}^{m \\times n}\\). The last inequality uses the fact that the trace of a \\(1 \\times 1\\) matrix is just its element. "],["boot.html", "Chapter 14 Bootstrap", " Chapter 14 Bootstrap This chapter deals with bootstrap. The students are expected to acquire the following knowledge: How to use bootstrap to generate coverage intervals. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } Exercise 14.1 Ideally, a \\(1-\\alpha\\) CI would have \\(1-\\alpha\\) coverage. That is, say a 95% CI should, in the long run, contain the true value of the parameter 95% of the time. In practice, it is impossible to assess the coverage of our CI method, because we rarely know the true parameter. In simulation, however, we can. Let’s assess the coverage of bootstrap percentile intervals. Pick a univariate distribution with readily available mean and one that you can easily sample from. Draw \\(n = 30\\) random samples from the chosen distribution and use the bootstrap (with large enough m) and percentile CI method to construct 95% CI. Repeat the process many times and count how many times the CI contains the true mean. That is, compute the actual coverage probability (don’t forget to include the standard error of the coverage probability!). What can you observe? Try one or two different distributions. What can you observe? Repeat (b) and (c) using BCa intervals (R package boot). How does the coverage compare to percentile intervals? As (d) but using intervals based on asymptotic normality (+/- 1.96 SE). How do results from (b), (d), and (e) change if we increase the sample size to n = 200? What about n = 5? library(boot) set.seed(0) nit <- 1000 # Repeat the process "many times" alpha <- 0.05 # CI parameter nboot <- 100 # m parameter for bootstrap ("large enough m") # f: change this to 200 or 5. nsample <- 30 # n = 30 random samples from the chosen distribution. Comment out BCa code if it breaks. 
covers <- matrix(nrow = nit, ncol = 3) covers_BCa <- matrix(nrow = nit, ncol = 3) covers_asymp_norm <- matrix(nrow = nit, ncol = 3) isin <- function (x, lower, upper) { (x > lower) & (x < upper) } for (j in 1:nit) { # Repeating many times # a: pick a univariate distribution - standard normal x1 <- rnorm(nsample) # c: one or two different distributions - beta and poisson x2 <- rbeta(nsample, 1, 2) x3 <- rpois(nsample, 5) X1 <- matrix(data = NA, nrow = nsample, ncol = nboot) X2 <- matrix(data = NA, nrow = nsample, ncol = nboot) X3 <- matrix(data = NA, nrow = nsample, ncol = nboot) for (i in 1:nboot) { X1[ ,i] <- sample(x1, nsample, replace = T) X2[ ,i] <- sample(x2, nsample, T) X3[ ,i] <- sample(x3, nsample, T) } X1_func <- apply(X1, 2, mean) X2_func <- apply(X2, 2, mean) X3_func <- apply(X3, 2, mean) X1_quant <- quantile(X1_func, probs = c(alpha / 2, 1 - alpha / 2)) X2_quant <- quantile(X2_func, probs = c(alpha / 2, 1 - alpha / 2)) X3_quant <- quantile(X3_func, probs = c(alpha / 2, 1 - alpha / 2)) covers[j,1] <- (0 > X1_quant[1]) & (0 < X1_quant[2]) covers[j,2] <- ((1 / 3) > X2_quant[1]) & ((1 / 3) < X2_quant[2]) covers[j,3] <- (5 > X3_quant[1]) & (5 < X3_quant[2]) mf <- function (x, i) return(mean(x[i])) bootX1 <- boot(x1, statistic = mf, R = nboot) bootX2 <- boot(x2, statistic = mf, R = nboot) bootX3 <- boot(x3, statistic = mf, R = nboot) X1_quant_BCa <- boot.ci(bootX1, type = "bca")$bca X2_quant_BCa <- boot.ci(bootX2, type = "bca")$bca X3_quant_BCa <- boot.ci(bootX3, type = "bca")$bca covers_BCa[j,1] <- (0 > X1_quant_BCa[4]) & (0 < X1_quant_BCa[5]) covers_BCa[j,2] <- ((1 / 3) > X2_quant_BCa[4]) & ((1 / 3) < X2_quant_BCa[5]) covers_BCa[j,3] <- (5 > X3_quant_BCa[4]) & (5 < X3_quant_BCa[5]) # e: estimate mean and standard error # sample mean: x1_bar <- mean(x1) x2_bar <- mean(x2) x3_bar <- mean(x3) # standard error (of the sample mean) estimate: sample standard deviation / sqrt(n) x1_bar_SE <- sd(x1) / sqrt(nsample) x2_bar_SE <- sd(x2) / sqrt(nsample) x3_bar_SE <- sd(x3) / sqrt(nsample) covers_asymp_norm[j,1] <- isin(0, x1_bar - 1.96 * x1_bar_SE, x1_bar + 1.96 * x1_bar_SE) covers_asymp_norm[j,2] <- isin(1/3, x2_bar - 1.96 * x2_bar_SE, x2_bar + 1.96 * x2_bar_SE) covers_asymp_norm[j,3] <- isin(5, x3_bar - 1.96 * x3_bar_SE, x3_bar + 1.96 * x3_bar_SE) } apply(covers, 2, mean) ## [1] 0.918 0.925 0.905 apply(covers, 2, sd) / sqrt(nit) ## [1] 0.008680516 0.008333333 0.009276910 apply(covers_BCa, 2, mean) ## [1] 0.927 0.944 0.927 apply(covers_BCa, 2, sd) / sqrt(nit) ## [1] 0.008230355 0.007274401 0.008230355 apply(covers_asymp_norm, 2, mean) ## [1] 0.939 0.937 0.930 apply(covers_asymp_norm, 2, sd) / sqrt(nit) ## [1] 0.007572076 0.007687008 0.008072494 Exercise 14.2 You are given a sample of independent observations from a process of interest: Index 1 2 3 4 5 6 7 8 X 7 2 4 6 4 5 9 10 Compute the plug-in estimate of mean and 95% symmetric CI based on asymptotic normality. Use the plug-in estimate of SE. Same as (a), but use the unbiased estimate of SE. Apply nonparametric bootstrap with 1000 bootstrap replications and estimate the 95% CI for the mean with percentile-based CI. 
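The only difference between (a) and (b) is which estimate of the standard deviation enters \\(SE = \\hat{\\sigma} / \\sqrt{n}\\): \\[\\begin{align} \\hat{\\sigma}_{\\text{plug-in}} = \\sqrt{\\frac{1}{n} \\sum_{i=1}^n (x_i - \\bar{x})^2}, \\quad \\hat{\\sigma}_{\\text{unbiased}} = \\sqrt{\\frac{1}{n-1} \\sum_{i=1}^n (x_i - \\bar{x})^2}, \\end{align}\\] and the second version (what sd in R computes) gives a slightly wider interval, as the output below shows.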
# a x <- c(7, 2, 4, 6, 4, 5, 9, 10) n <- length(x) mu <- mean(x) SE <- sqrt(mean((x - mu)^2)) / sqrt(n) SE ## [1] 0.8915839 z <- qnorm(1 - 0.05 / 2) c(mu - z * SE, mu + z * SE) ## [1] 4.127528 7.622472 # b SE <- sd(x) / sqrt(n) SE ## [1] 0.9531433 c(mu - z * SE, mu + z * SE) ## [1] 4.006873 7.743127 # c set.seed(0) m <- 1000 T_mean <- function(x) {mean(x)} est_boot <- array(NA, m) for (i in 1:m) { x_boot <- x[sample(1:n, n, rep = T)] est_boot[i] <- T_mean(x_boot) } quantile(est_boot, p = c(0.025, 0.975)) ## 2.5% 97.5% ## 4.250 7.625 Exercise 14.3 We are given a sample of 10 independent paired (bivariate) observations: Index 1 2 3 4 5 6 7 8 9 10 X 1.26 -0.33 1.33 1.27 0.41 -1.54 -0.93 -0.29 -0.01 2.40 Y 2.64 0.33 0.48 0.06 -0.88 -2.14 -2.21 0.95 0.83 1.45 Compute Pearson correlation between X and Y. Use the cor.test() from R to estimate a 95% CI for the estimate from (a). Apply nonparametric bootstrap with 1000 bootstrap replications and estimate the 95% CI for the Pearson correlation with percentile-based CI. Compare CI from (b) and (c). Are they similar? How would the bootstrap estimation of CI change if we were interested in Spearman or Kendall correlation instead? x <- c(1.26, -0.33, 1.33, 1.27, 0.41, -1.54, -0.93, -0.29, -0.01, 2.40) y <- c(2.64, 0.33, 0.48, 0.06, -0.88, -2.14, -2.21, 0.95, 0.83, 1.45) # a cor(x, y) ## [1] 0.6991247 # b res <- cor.test(x, y) res$conf.int[1:2] ## [1] 0.1241458 0.9226238 # c set.seed(0) m <- 1000 n <- length(x) T_cor <- function(x, y) {cor(x, y)} est_boot <- array(NA, m) for (i in 1:m) { idx <- sample(1:n, n, rep = T) # !!! important to use same indices to keep dependency between x and y est_boot[i] <- T_cor(x[idx], y[idx]) } quantile(est_boot, p = c(0.025, 0.975)) ## 2.5% 97.5% ## 0.2565537 0.9057664 # d # Yes, but the bootstrap CI is more narrow. # e # We just use the functions for Kendall/Spearman coefficients instead: T_kendall <- function(x, y) {cor(x, y, method = "kendall")} T_spearman <- function(x, y) {cor(x, y, method = "spearman")} # Put this in a function that returns the CI bootstrap_95_ci <- function(x, y, t, m = 1000) { n <- length(x) est_boot <- array(NA, m) for (i in 1:m) { idx <- sample(1:n, n, rep = T) # !!! important to use same indices to keep dependency between x and y est_boot[i] <- t(x[idx], y[idx]) } quantile(est_boot, p = c(0.025, 0.975)) } bootstrap_95_ci(x, y, T_kendall) ## 2.5% 97.5% ## -0.08108108 0.78378378 bootstrap_95_ci(x, y, T_spearman) ## 2.5% 97.5% ## -0.1701115 0.8867925 Exercise 14.4 In this problem we will illustrate the use of the nonparametric bootstrap for estimating CIs of regression model coefficients. Load the longley dataset from base R with data(longley). Use lm() to apply linear regression using “Employed” as the target (dependent) variable and all other variables as the predictors (independent). Using lm() results, print the estimated regression coefficients and standard errors. Estimate 95% CI for the coefficients using +/- 1.96 * SE. Use nonparametric bootstrap with 100 replications to estimate the SE of the coefficients from (b). Compare the SE from (c) with those from (b). # a data(longley) # b res <- lm(Employed ~ . 
, longley) tmp <- data.frame(summary(res)$coefficients[,1:2]) tmp$LB <- tmp[,1] - 1.96 * tmp[,2] tmp$UB <- tmp[,1] + 1.96 * tmp[,2] tmp ## Estimate Std..Error LB UB ## (Intercept) -3.482259e+03 8.904204e+02 -5.227483e+03 -1.737035e+03 ## GNP.deflator 1.506187e-02 8.491493e-02 -1.513714e-01 1.814951e-01 ## GNP -3.581918e-02 3.349101e-02 -1.014616e-01 2.982320e-02 ## Unemployed -2.020230e-02 4.883997e-03 -2.977493e-02 -1.062966e-02 ## Armed.Forces -1.033227e-02 2.142742e-03 -1.453204e-02 -6.132495e-03 ## Population -5.110411e-02 2.260732e-01 -4.942076e-01 3.919994e-01 ## Year 1.829151e+00 4.554785e-01 9.364136e-01 2.721889e+00 # c set.seed(0) m <- 100 n <- nrow(longley) T_coef <- function(x) { lm(Employed ~ . , x)$coefficients } est_boot <- array(NA, c(m, ncol(longley))) for (i in 1:m) { idx <- sample(1:n, n, rep = T) est_boot[i,] <- T_coef(longley[idx,]) } SE <- apply(est_boot, 2, sd) SE ## [1] 1.826011e+03 1.605981e-01 5.693746e-02 8.204892e-03 3.802225e-03 ## [6] 3.907527e-01 9.414436e-01 # Show the standard errors around coefficients library(ggplot2) library(reshape2) df <- data.frame(index = 1:7, bootstrap_SE = SE, lm_SE = tmp$Std..Error) melted_df <- melt(df[2:nrow(df), ], id.vars = "index") # Ignore bias which has a really large magnitude ggplot(melted_df, aes(x = index, y = value, fill = variable)) + geom_bar(stat="identity", position="dodge") + xlab("Coefficient") + ylab("Standard error") # + scale_y_continuous(trans = "log") # If you want to also plot bias Exercise 14.5 This exercise shows a shortcoming of the bootstrap method when using the plug in estimator for the maximum. Compute the 95% bootstrap CI for the maximum of a standard normal distribution. Compute the 95% bootstrap CI for the maximum of a binomial distribution with n = 15 and p = 0.2. Repeat (b) using p = 0.9. Why is the result different? # bootstrap CI for maximum alpha <- 0.05 T_max <- function(x) {max(x)} # Equal to T_max = max bootstrap <- function(x, t, m = 1000) { n <- length(x) values <- rep(0, m) for (i in 1:m) { values[i] <- t(sample(x, n, replace = T)) } quantile(values, probs = c(alpha / 2, 1 - alpha / 2)) } # a # Meaningless, as the normal distribution can yield arbitrarily large values. x <- rnorm(100) bootstrap(x, T_max) ## 2.5% 97.5% ## 1.819425 2.961743 # b x <- rbinom(100, size = 15, prob = 0.2) # min = 0, max = 15 bootstrap(x, T_max) ## 2.5% 97.5% ## 6 7 # c x <- rbinom(100, size = 15, prob = 0.9) # min = 0, max = 15 bootstrap(x, T_max) ## 2.5% 97.5% ## 15 15 # Observation: to estimate the maximum, we need sufficient probability mass near the maximum value the distribution can yield. # Using bootstrap is pointless when there is too little mass near the true maximum. # In general, bootstrap will fail when estimating the CI for the maximum. "],["ml.html", "Chapter 15 Maximum likelihood 15.1 Deriving MLE 15.2 Fisher information 15.3 The German tank problem", " Chapter 15 Maximum likelihood This chapter deals with maximum likelihood estimation. The students are expected to acquire the following knowledge: How to derive MLE. Applying MLE in R. Calculating and interpreting Fisher information. Practical use of MLE. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } 15.1 Deriving MLE Exercise 15.1 Derive the maximum likelihood estimator of variance for N\\((\\mu, \\sigma^2)\\). Compare with results from 13.3. What does that say about the MLE estimator? Solution. 
The mean is assumed constant, so we have the likelihood \\[\\begin{align} L(\\sigma^2; y) &= \\prod_{i=1}^n \\frac{1}{\\sqrt{2 \\pi \\sigma^2}} e^{-\\frac{(y_i - \\mu)^2}{2 \\sigma^2}} \\\\ &= \\frac{1}{\\sqrt{2 \\pi \\sigma^2}^n} e^{\\frac{-\\sum_{i=1}^n (y_i - \\mu)^2}{2 \\sigma^2}} \\end{align}\\] We need to find the maximum of this function. We first observe that we can replace \\(\\frac{-\\sum_{i=1}^n (y_i - \\mu)^2}{2}\\) with a constant \\(c\\), since none of the terms are dependent on \\(\\sigma^2\\). Additionally, the term \\(\\frac{1}{\\sqrt{2 \\pi}^n}\\) does not affect the calculation of the maximum. So now we have \\[\\begin{align} L(\\sigma^2; y) &= (\\sigma^2)^{-\\frac{n}{2}} e^{\\frac{c}{\\sigma^2}}. \\end{align}\\] Differentiating we get \\[\\begin{align} \\frac{d}{d \\sigma^2} L(\\sigma^2; y) &= (\\sigma^2)^{-\\frac{n}{2}} \\frac{d}{d \\sigma^2} e^{\\frac{c}{\\sigma^2}} + e^{\\frac{c}{\\sigma^2}} \\frac{d}{d \\sigma^2} (\\sigma^2)^{-\\frac{n}{2}} \\\\ &= - (\\sigma^2)^{-\\frac{n}{2}} e^{\\frac{c}{\\sigma^2}} \\frac{c}{(\\sigma^2)^2} - e^{\\frac{c}{\\sigma^2}} \\frac{n}{2} (\\sigma^2)^{-\\frac{n + 2}{2}} \\\\ &= - (\\sigma^2)^{-\\frac{n + 4}{2}} e^{\\frac{c}{\\sigma^2}} c - e^{\\frac{c}{\\sigma^2}} \\frac{n}{2} (\\sigma^2)^{-\\frac{n + 2}{2}} \\\\ &= - e^{\\frac{c}{\\sigma^2}} (\\sigma^2)^{-\\frac{n + 4}{2}} \\Big(c + \\frac{n}{2}\\sigma^2 \\Big). \\end{align}\\] To get the maximum, this has to equal to 0, so \\[\\begin{align} c + \\frac{n}{2}\\sigma^2 &= 0 \\\\ \\sigma^2 &= -\\frac{2c}{n} \\\\ \\sigma^2 &= \\frac{\\sum_{i=1}^n (Y_i - \\mu)^2}{n}. \\end{align}\\] The MLE estimator is biased. Exercise 15.2 (Multivariate normal distribution) Derive the maximum likelihood estimate for the mean and covariance matrix of the multivariate normal. Simulate \\(n = 40\\) samples from a bivariate normal distribution (choose non-trivial parameters, that is, mean \\(\\neq 0\\) and covariance \\(\\neq 0\\)). Compute the MLE for the sample. Overlay the data with an ellipse that is determined by the MLE and an ellipse that is determined by the chosen true parameters. Repeat b. several times and observe how the estimates (ellipses) vary around the true value. Hint: For the derivation of MLE, these identities will be helpful: \\(\\frac{\\partial b^T a}{\\partial a} = \\frac{\\partial a^T b}{\\partial a} = b\\), \\(\\frac{\\partial a^T A a}{\\partial a} = (A + A^T)a\\), \\(\\frac{\\partial \\text{tr}(BA)}{\\partial A} = B^T\\), \\(\\frac{\\partial \\ln |A|}{\\partial A} = (A^{-1})^T\\), \\(a^T A a = \\text{tr}(a^T A a) = \\text{tr}(a a^T A) = \\text{tr}(Aaa^T)\\). Solution. The log likelihood of the MVN distribution is \\[\\begin{align*} l(\\mu, \\Sigma ; x) &= -\\frac{1}{2}\\Big(\\sum_{i=1}^n k\\ln(2\\pi) + |\\Sigma| + (x_i - \\mu)^T \\Sigma^{-1} (x_i - \\mu)\\Big) \\\\ &= -\\frac{n}{2}\\ln|\\Sigma| + -\\frac{1}{2}\\Big(\\sum_{i=1}^n(x_i - \\mu)^T \\Sigma^{-1} (x_i - \\mu)\\Big) + c, \\end{align*}\\] where \\(c\\) is a constant with respect to \\(\\mu\\) and \\(\\Sigma\\). To find the MLE we first need to find partial derivatives. Let us start with \\(\\mu\\). \\[\\begin{align*} \\frac{\\partial}{\\partial \\mu}l(\\mu, \\Sigma ; x) &= \\frac{\\partial}{\\partial \\mu} -\\frac{1}{2}\\Big(\\sum_{i=1}^n x_i^T \\Sigma^{-1} x_i - x_i^T \\Sigma^{-1} \\mu - \\mu^T \\Sigma^{-1} x_i + \\mu^T \\Sigma^{-1} \\mu \\Big) \\\\ &= -\\frac{1}{2}\\Big(\\sum_{i=1}^n - \\Sigma^{-1} x_i - \\Sigma^{-1} x_i + 2 \\Sigma^{-1} \\mu \\Big) \\\\ &= -\\Sigma^{-1}\\Big(\\sum_{i=1}^n - x_i + \\mu \\Big). 
\\end{align*}\\] Equating above with zero, we get \\[\\begin{align*} \\sum_{i=1}^n - x_i + \\mu &= 0 \\\\ \\hat{\\mu} = \\frac{1}{n} \\sum_{i=1}^n x_i, \\end{align*}\\] which is the dimension-wise empirical mean. Now for the covariance matrix \\[\\begin{align*} \\frac{\\partial}{\\partial \\Sigma^{-1}}l(\\mu, \\Sigma ; x) &= \\frac{\\partial}{\\partial \\Sigma^{-1}} -\\frac{n}{2}\\ln|\\Sigma| + -\\frac{1}{2}\\Big(\\sum_{i=1}^n(x_i - \\mu)^T \\Sigma^{-1} (x_i - \\mu)\\Big) \\\\ &= \\frac{\\partial}{\\partial \\Sigma^{-1}} -\\frac{n}{2}\\ln|\\Sigma| + -\\frac{1}{2}\\Big(\\sum_{i=1}^n \\text{tr}((x_i - \\mu)^T \\Sigma^{-1} (x_i - \\mu))\\Big) \\\\ &= \\frac{\\partial}{\\partial \\Sigma^{-1}} -\\frac{n}{2}\\ln|\\Sigma| + -\\frac{1}{2}\\Big(\\sum_{i=1}^n \\text{tr}((\\Sigma^{-1} (x_i - \\mu) (x_i - \\mu)^T )\\Big) \\\\ &= \\frac{n}{2}\\Sigma + -\\frac{1}{2}\\Big(\\sum_{i=1}^n (x_i - \\mu) (x_i - \\mu)^T \\Big). \\end{align*}\\] Equating above with zero, we get \\[\\begin{align*} \\hat{\\Sigma} = \\frac{1}{n}\\sum_{i=1}^n (x_i - \\mu) (x_i - \\mu)^T. \\end{align*}\\] set.seed(1) n <- 40 mu <- c(1, -2) Sigma <- matrix(data = c(2, -1.6, -1.6, 1.8), ncol = 2) X <- mvrnorm(n = n, mu = mu, Sigma = Sigma) colnames(X) <- c("X1", "X2") X <- as.data.frame(X) # plot.new() tru_ellip <- ellipse(mu, Sigma, draw = FALSE) colnames(tru_ellip) <- c("X1", "X2") tru_ellip <- as.data.frame(tru_ellip) mu_est <- apply(X, 2, mean) tmp <- as.matrix(sweep(X, 2, mu_est)) Sigma_est <- (1 / n) * t(tmp) %*% tmp est_ellip <- ellipse(mu_est, Sigma_est, draw = FALSE) colnames(est_ellip) <- c("X1", "X2") est_ellip <- as.data.frame(est_ellip) ggplot(data = X, aes(x = X1, y = X2)) + geom_point() + geom_path(data = tru_ellip, aes(x = X1, y = X2, color = "truth")) + geom_path(data = est_ellip, aes(x = X1, y = X2, color = "estimated")) + labs(color = "type") Exercise 15.3 (Logistic regression) Logistic regression is a popular discriminative model when our target variable is binary (categorical with 2 values). One of the ways of looking at logistic regression is that it is linear regression but instead of using the linear term as the mean of a normal RV, we use it as the mean of a Bernoulli RV. Of course, the mean of a Bernoulli is bounded on \\([0,1]\\), so, to avoid non-sensical values, we squeeze the linear between 0 and 1 with the inverse logit function inv_logit\\((z) = 1 / (1 + e^{-z})\\). This leads to the following model: \\(y_i | \\beta, x_i \\sim \\text{Bernoulli}(\\text{inv_logit}(\\beta x_i))\\). Explicitly write the likelihood function of beta. Implement the likelihood function in R. Use black-box box-constraint optimization (for example, optim() with L-BFGS) to find the maximum likelihood estimate for beta for \\(x\\) and \\(y\\) defined below. Plot the estimated probability as a function of the independent variable. Compare with the truth. Let \\(y2\\) be a response defined below. Will logistic regression work well on this dataset? Why not? How can we still use the model, without changing it? inv_log <- function (z) { return (1 / (1 + exp(-z))) } set.seed(1) x <- rnorm(100) y <- rbinom(100, size = 1, prob = inv_log(1.2 * x)) y2 <- rbinom(100, size = 1, prob = inv_log(1.2 * x + 1.4 * x^2)) Solution. \\[\\begin{align*} l(\\beta; x, y) &= p(y | x, \\beta) \\\\ &= \\ln(\\prod_{i=1}^n \\text{inv_logit}(\\beta x_i)^{y_i} (1 - \\text{inv_logit}(\\beta x_i))^{1 - y_i}) \\\\ &= \\sum_{i=1}^n y_i \\ln(\\text{inv_logit}(\\beta x_i)) + (1 - y_i) \\ln(1 - \\text{inv_logit}(\\beta x_i)). 
\\end{align*}\\] set.seed(1) inv_log <- function (z) { return (1 / (1 + exp(-z))) } x <- rnorm(100) y <- x y <- rbinom(100, size = 1, prob = inv_log(1.2 * x)) l_logistic <- function (beta, X, y) { logl <- -sum(y * log(inv_log(as.vector(beta %*% X))) + (1 - y) * log((1 - inv_log(as.vector(beta %*% X))))) return(logl) } my_optim <- optim(par = 0.5, fn = l_logistic, method = "L-BFGS-B", lower = 0, upper = 10, X = x, y = y) my_optim$par ## [1] 1.166558 truth_p <- data.frame(x = x, prob = inv_log(1.2 * x), type = "truth") est_p <- data.frame(x = x, prob = inv_log(my_optim$par * x), type = "estimated") plot_df <- rbind(truth_p, est_p) ggplot(data = plot_df, aes(x = x, y = prob, color = type)) + geom_point(alpha = 0.3) y2 <- rbinom(2000, size = 1, prob = inv_log(1.2 * x + 1.4 * x^2)) X2 <- cbind(x, x^2) my_optim2 <- optim(par = c(0, 0), fn = l_logistic, method = "L-BFGS-B", lower = c(0, 0), upper = c(2, 2), X = t(X2), y = y2) my_optim2$par ## [1] 1.153656 1.257649 tmp <- sweep(data.frame(x = x, x2 = x^2), 2, my_optim2$par, FUN = "*") tmp <- tmp[ ,1] + tmp[ ,2] truth_p <- data.frame(x = x, prob = inv_log(1.2 * x + 1.4 * x^2), type = "truth") est_p <- data.frame(x = x, prob = inv_log(tmp), type = "estimated") plot_df <- rbind(truth_p, est_p) ggplot(data = plot_df, aes(x = x, y = prob, color = type)) + geom_point(alpha = 0.3) Exercise 15.4 (Linear regression) For the data generated below, do the following: Compute the least squares (MLE) estimate of coefficients beta using the matrix exact solution. Compute the MLE by minimizing the sum of squared residuals using black-box optimization (optim()). Compute the MLE by using the output built-in linear regression (lm() ). Compare (a-c and the true coefficients). Compute 95% CI on the beta coefficients using the output of built-in linear regression. Compute 95% CI on the beta coefficients by using (a or b) and the bootstrap with percentile method for CI. Compare with d. 
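For reference, the "matrix exact solution" in (a) is the usual least-squares estimator \\[\\begin{align} \\hat{\\beta} = (X^T X)^{-1} X^T y, \\end{align}\\] which coincides with the MLE when the errors are i.i.d. normal, since maximizing the likelihood is then equivalent to minimizing the sum of squared residuals.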
set.seed(1) n <- 100 x1 <- rnorm(n) x2 <- rnorm(n) x3 <- rnorm(n) X <- cbind(x1, x2, x3) beta <- c(0.2, 0.6, -1.2) y <- as.vector(t(beta %*% t(X))) + rnorm(n, sd = 0.2) set.seed(1) n <- 100 x1 <- rnorm(n) x2 <- rnorm(n) x3 <- rnorm(n) X <- cbind(x1, x2, x3) beta <- c(0.2, 0.6, -1.2) y <- as.vector(t(beta %*% t(X))) + rnorm(n, sd = 0.2) LS_fun <- function (beta, X, y) { return(sum((y - beta %*% t(X))^2)) } my_optim <- optim(par = c(0, 0, 0), fn = LS_fun, lower = -5, upper = 5, X = X, y = y, method = "L-BFGS-B") my_optim$par ## [1] 0.1898162 0.5885946 -1.1788264 df <- data.frame(y = y, x1 = x1, x2 = x2, x3 = x3) my_lm <- lm(y ~ x1 + x2 + x3 - 1, data = df) my_lm ## ## Call: ## lm(formula = y ~ x1 + x2 + x3 - 1, data = df) ## ## Coefficients: ## x1 x2 x3 ## 0.1898 0.5886 -1.1788 # matrix solution beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y beta_hat ## [,1] ## x1 0.1898162 ## x2 0.5885946 ## x3 -1.1788264 out <- summary(my_lm) out$coefficients[ ,2] ## x1 x2 x3 ## 0.02209328 0.02087542 0.01934506 # bootstrap CI nboot <- 1000 beta_boot <- matrix(data = NA, ncol = length(beta), nrow = nboot) for (i in 1:nboot) { inds <- sample(1:n, n, replace = T) new_df <- df[inds, ] X_tmp <- as.matrix(new_df[ ,-1]) y_tmp <- new_df[ ,1] # print(nrow(new_df)) tmp_beta <- solve(t(X_tmp) %*% X_tmp) %*% t(X_tmp) %*% y_tmp beta_boot[i, ] <- tmp_beta } apply(beta_boot, 2, mean) ## [1] 0.1893281 0.5887068 -1.1800738 apply(beta_boot, 2, quantile, probs = c(0.025, 0.975)) ## [,1] [,2] [,3] ## 2.5% 0.1389441 0.5436911 -1.221560 ## 97.5% 0.2386295 0.6363102 -1.140416 out$coefficients[ ,2] ## x1 x2 x3 ## 0.02209328 0.02087542 0.01934506 Exercise 15.5 (Principal component analysis) Load the olympic data set from package ade4. The data show decathlon results for 33 men in 1988 Olympic Games. This data set serves as a great example of finding the latent structure in the data, as there are certain characteristics of the athletes that make them excel at different events. For example an explosive athlete will do particulary well in sprints and long jumps. Perform PCA (prcomp) on the data set and interpret the first 2 latent dimensions. Hint: Standardize the data first to get meaningful results. Use MLE to estimate the covariance of the standardized multivariate distribution. Decompose the estimated covariance matrix with the eigendecomposition. Compare the eigenvectors to the output of PCA. 
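When comparing the two in (c), note that prcomp works with the sample covariance of the centered (here also standardized) data, which uses the divisor \\(n - 1\\), while the MLE uses \\(\\frac{1}{n}\\), so \\[\\begin{align} \\hat{\\Sigma}_{\\text{MLE}} = \\frac{n-1}{n} S. \\end{align}\\] The two matrices differ only by a positive scalar factor and therefore share the same eigenvectors; only the eigenvalues are rescaled. In addition, eigenvectors are determined only up to sign, so some columns may come out with flipped signs, as in the output below.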
data(olympic) X <- olympic$tab X_scaled <- scale(X) my_pca <- prcomp(X_scaled) summary(my_pca) ## Importance of components: ## PC1 PC2 PC3 PC4 PC5 PC6 PC7 ## Standard deviation 1.8488 1.6144 0.97123 0.9370 0.74607 0.70088 0.65620 ## Proportion of Variance 0.3418 0.2606 0.09433 0.0878 0.05566 0.04912 0.04306 ## Cumulative Proportion 0.3418 0.6025 0.69679 0.7846 0.84026 0.88938 0.93244 ## PC8 PC9 PC10 ## Standard deviation 0.55389 0.51667 0.31915 ## Proportion of Variance 0.03068 0.02669 0.01019 ## Cumulative Proportion 0.96312 0.98981 1.00000 autoplot(my_pca, data = X, loadings = TRUE, loadings.colour = 'blue', loadings.label = TRUE, loadings.label.size = 3) Sigma_est <- (1 / nrow(X_scaled)) * t(X_scaled) %*% X_scaled Sigma_dec <- eigen(Sigma_est) Sigma_dec$vectors ## [,1] [,2] [,3] [,4] [,5] [,6] ## [1,] 0.4158823 0.1488081 -0.26747198 -0.08833244 -0.442314456 0.03071237 ## [2,] -0.3940515 -0.1520815 -0.16894945 -0.24424963 0.368913901 -0.09378242 ## [3,] -0.2691057 0.4835374 0.09853273 -0.10776276 -0.009754680 0.23002054 ## [4,] -0.2122818 0.0278985 -0.85498656 0.38794393 -0.001876311 0.07454380 ## [5,] 0.3558474 0.3521598 -0.18949642 0.08057457 0.146965351 -0.32692886 ## [6,] 0.4334816 0.0695682 -0.12616012 -0.38229029 -0.088802794 0.21049130 ## [7,] -0.1757923 0.5033347 0.04609969 0.02558404 0.019358607 0.61491241 ## [8,] -0.3840821 0.1495820 0.13687235 0.14396548 -0.716743474 -0.34776037 ## [9,] -0.1799436 0.3719570 -0.19232803 -0.60046566 0.095582043 -0.43744387 ## [10,] 0.1701426 0.4209653 0.22255233 0.48564231 0.339772188 -0.30032419 ## [,7] [,8] [,9] [,10] ## [1,] 0.2543985 0.663712826 -0.10839531 0.10948045 ## [2,] 0.7505343 0.141264141 0.04613910 0.05580431 ## [3,] -0.1106637 0.072505560 0.42247611 0.65073655 ## [4,] -0.1351242 -0.155435871 -0.10206505 0.11941181 ## [5,] 0.1413388 -0.146839303 0.65076229 -0.33681395 ## [6,] 0.2725296 -0.639003579 -0.20723854 0.25971800 ## [7,] 0.1439726 0.009400445 -0.16724055 -0.53450315 ## [8,] 0.2732665 -0.276873049 -0.01766443 -0.06589572 ## [9,] -0.3419099 0.058519366 -0.30619617 -0.13093187 ## [10,] 0.1868704 0.007310045 -0.45688227 0.24311846 my_pca$rotation ## PC1 PC2 PC3 PC4 PC5 PC6 ## 100 -0.4158823 0.1488081 0.26747198 -0.08833244 -0.442314456 0.03071237 ## long 0.3940515 -0.1520815 0.16894945 -0.24424963 0.368913901 -0.09378242 ## poid 0.2691057 0.4835374 -0.09853273 -0.10776276 -0.009754680 0.23002054 ## haut 0.2122818 0.0278985 0.85498656 0.38794393 -0.001876311 0.07454380 ## 400 -0.3558474 0.3521598 0.18949642 0.08057457 0.146965351 -0.32692886 ## 110 -0.4334816 0.0695682 0.12616012 -0.38229029 -0.088802794 0.21049130 ## disq 0.1757923 0.5033347 -0.04609969 0.02558404 0.019358607 0.61491241 ## perc 0.3840821 0.1495820 -0.13687235 0.14396548 -0.716743474 -0.34776037 ## jave 0.1799436 0.3719570 0.19232803 -0.60046566 0.095582043 -0.43744387 ## 1500 -0.1701426 0.4209653 -0.22255233 0.48564231 0.339772188 -0.30032419 ## PC7 PC8 PC9 PC10 ## 100 0.2543985 -0.663712826 0.10839531 -0.10948045 ## long 0.7505343 -0.141264141 -0.04613910 -0.05580431 ## poid -0.1106637 -0.072505560 -0.42247611 -0.65073655 ## haut -0.1351242 0.155435871 0.10206505 -0.11941181 ## 400 0.1413388 0.146839303 -0.65076229 0.33681395 ## 110 0.2725296 0.639003579 0.20723854 -0.25971800 ## disq 0.1439726 -0.009400445 0.16724055 0.53450315 ## perc 0.2732665 0.276873049 0.01766443 0.06589572 ## jave -0.3419099 -0.058519366 0.30619617 0.13093187 ## 1500 0.1868704 -0.007310045 0.45688227 -0.24311846 15.2 Fisher information Exercise 15.6 Let us assume a Poisson likelihood. 
Derive the MLE estimate of the mean. Derive the Fisher information. For the data below compute the MLE and construct confidence intervals. Use bootstrap to construct the CI for the mean. Compare with c) and discuss. x <- c(2, 5, 3, 1, 2, 1, 0, 3, 0, 2) Solution. The log likelihood of the Poisson is \\[\\begin{align*} l(\\lambda; x) = \\sum_{i=1}^n x_i \\ln \\lambda - n \\lambda - \\sum_{i=1}^n \\ln x_i! \\end{align*}\\] Taking the derivative and equating with 0 we get \\[\\begin{align*} \\frac{1}{\\hat{\\lambda}}\\sum_{i=1}^n x_i - n &= 0 \\\\ \\hat{\\lambda} &= \\frac{1}{n} \\sum_{i=1}^n x_i. \\end{align*}\\] Since \\(\\lambda\\) is the mean parameter, this was expected. For the Fischer information, we first need the second derivative, which is \\[\\begin{align*} - \\lambda^{-2} \\sum_{i=1}^n x_i. \\\\ \\end{align*}\\] Now taking the expectation of the negative of the above, we get \\[\\begin{align*} E[\\lambda^{-2} \\sum_{i=1}^n x_i] &= \\lambda^{-2} E[\\sum_{i=1}^n x_i] \\\\ &= \\lambda^{-2} n \\lambda \\\\ &= \\frac{n}{\\lambda}. \\end{align*}\\] set.seed(1) x <- c(2, 5, 3, 1, 2, 1, 0, 3, 0, 2) lambda_hat <- mean(x) finfo <- length(x) / lambda_hat mle_CI <- c(lambda_hat - 1.96 * sqrt(1 / finfo), lambda_hat + 1.96 * sqrt(1 / finfo)) boot_lambda <- c() nboot <- 1000 for (i in 1:nboot) { tmp_x <- sample(x, length(x), replace = T) boot_lambda[i] <- mean(tmp_x) } boot_CI <- c(quantile(boot_lambda, 0.025), quantile(boot_lambda, 0.975)) mle_CI ## [1] 1.045656 2.754344 boot_CI ## 2.5% 97.5% ## 1.0 2.7 Exercise 15.7 Find the Fisher information matrix for the Gamma distribution. Generate 20 samples from a Gamma distribution and plot a confidence ellipse of the inverse of Fisher information matrix around the ML estimates of the parameters. Also plot the theoretical values. Repeat the sampling several times. What do you observe? Discuss what a non-diagonal Fisher matrix implies. Hint: The digamma function is defined as \\(\\psi(x) = \\frac{\\frac{d}{dx} \\Gamma(x)}{\\Gamma(x)}\\). Additionally, you do not need to evaluate \\(\\frac{d}{dx} \\psi(x)\\). To calculate its value in R, use package numDeriv. Solution. The log likelihood of the Gamma is \\[\\begin{equation*} l(\\alpha, \\beta; x) = n \\alpha \\ln \\beta - n \\ln \\Gamma(\\alpha) + (\\alpha - 1) \\sum_{i=1}^n \\ln x_i - \\beta \\sum_{i=1}^n x_i. \\end{equation*}\\] Let us calculate the derivatives. \\[\\begin{align*} \\frac{\\partial}{\\partial \\alpha} l(\\alpha, \\beta; x) &= n \\ln \\beta - n \\psi(\\alpha) + \\sum_{i=1}^n \\ln x_i, \\\\ \\frac{\\partial}{\\partial \\beta} l(\\alpha, \\beta; x) &= \\frac{n \\alpha}{\\beta} - \\sum_{i=1}^n x_i, \\\\ \\frac{\\partial^2}{\\partial \\alpha \\beta} l(\\alpha, \\beta; x) &= \\frac{n}{\\beta}, \\\\ \\frac{\\partial^2}{\\partial \\alpha^2} l(\\alpha, \\beta; x) &= - n \\frac{\\partial}{\\partial \\alpha} \\psi(\\alpha), \\\\ \\frac{\\partial^2}{\\partial \\beta^2} l(\\alpha, \\beta; x) &= - \\frac{n \\alpha}{\\beta^2}. \\end{align*}\\] The Fisher information matrix is then \\[\\begin{align*} I(\\alpha, \\beta) = - E[ \\begin{bmatrix} - n \\psi'(\\alpha) & \\frac{n}{\\beta} \\\\ \\frac{n}{\\beta} & - \\frac{n \\alpha}{\\beta^2} \\end{bmatrix} ] = \\begin{bmatrix} n \\psi'(\\alpha) & - \\frac{n}{\\beta} \\\\ - \\frac{n}{\\beta} & \\frac{n \\alpha}{\\beta^2} \\end{bmatrix} \\end{align*}\\] A non-diagonal Fisher matrix implies that the parameter estimates are linearly dependent. 
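The reason the ellipse below is drawn from the inverse of the Fisher information is the asymptotic normality of the MLE: \\[\\begin{align} (\\hat{\\alpha}, \\hat{\\beta}) \\overset{\\text{approx.}}{\\sim} \\text{N}\\big( (\\alpha, \\beta), I(\\alpha, \\beta)^{-1} \\big), \\end{align}\\] where \\(I\\) is the Fisher information of the whole sample, as derived above. The off-diagonal entries of \\(I^{-1}\\) are the asymptotic covariances between \\(\\hat{\\alpha}\\) and \\(\\hat{\\beta}\\), which is why a non-diagonal Fisher matrix tilts the ellipse.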
set.seed(1) n <- 20 pars_theor <- c(5, 2) x <- rgamma(n, 5, 2) # MLE for alpha and beta log_lik <- function (pars, x) { n <- length(x) return (- (n * pars[1] * log(pars[2]) - n * log(gamma(pars[1])) + (pars[1] - 1) * sum(log(x)) - pars[2] * sum(x))) } my_optim <- optim(par = c(1,1), fn = log_lik, method = "L-BFGS-B", lower = c(0.001, 0.001), upper = c(8, 8), x = x) pars_mle <- my_optim$par fish_mat <- matrix(data = NA, nrow = 2, ncol = 2) fish_mat[1,2] <- - n / pars_mle[2] fish_mat[2,1] <- - n / pars_mle[2] fish_mat[2,2] <- (n * pars_mle[1]) / (pars_mle[2]^2) fish_mat[1,1] <- n * grad(digamma, pars_mle[1]) fish_mat_inv <- solve(fish_mat) est_ellip <- ellipse(pars_mle, fish_mat_inv, draw = FALSE) colnames(est_ellip) <- c("X1", "X2") est_ellip <- as.data.frame(est_ellip) ggplot() + geom_point(data = data.frame(x = pars_mle[1], y = pars_mle[2]), aes(x = x, y = y)) + geom_path(data = est_ellip, aes(x = X1, y = X2)) + geom_point(aes(x = pars_theor[1], y = pars_theor[2]), color = "red") + geom_text(aes(x = pars_theor[1], y = pars_theor[2], label = "Theoretical parameters"), color = "red", nudge_y = -0.2) 15.3 The German tank problem Exercise 15.8 (The German tank problem) During WWII the allied intelligence were faced with an important problem of estimating the total production of certain German tanks, such as the Panther. What turned out to be a successful approach was to estimate the maximum from the serial numbers of the small sample of captured or destroyed tanks (describe the statistical model used). What assumptions were made by using the above model? Do you think they are reasonable assumptions in practice? Show that the plug-in estimate for the maximum (i.e. the maximum of the sample) is a biased estimator. Derive the maximum likelihood estimate of the maximum. Check that the following estimator is not biased: \\(\\hat{n} = \\frac{k + 1}{k}m - 1\\). Solution. The data are the serial numbers of the tanks. The parameter is \\(n\\), the total production of the tank. The distribution of the serial numbers is a discrete uniform distribution over all serial numbers. One of the assumptions is that we have i.i.d samples, however in practice this might not be true, as some tanks produced later could be sent to the field later, therefore already in theory we would not be able to recover some values from the population. To find the expected value we first need to find the distribution of \\(m\\). Let us start with the CDF. \\[\\begin{align*} F_m(x) = P(Y_1 < x,...,Y_k < x). \\end{align*}\\] If \\(x < k\\) then \\(F_m(x) = 0\\) and if \\(x \\geq 1\\) then \\(F_m(x) = 1\\). What about between those values. So the probability that the maximum value is less than or equal to \\(m\\) is just the number of possible draws from \\(Y\\) that are all smaller than \\(m\\), divided by all possible draws. This is \\(\\frac{{x}\\choose{k}}{{n}\\choose{k}}\\). The PDF on the suitable bounds is then \\[\\begin{align*} P(m = x) = F_m(x) - F_m(x - 1) = \\frac{\\binom{x}{k} - \\binom{x - 1}{k}}{\\binom{n}{k}} = \\frac{\\binom{x - 1}{k - 1}}{\\binom{n}{k}}. \\end{align*}\\] Now we can calculate the expected value of \\(m\\) using some combinatorial identities. \\[\\begin{align*} E[m] &= \\sum_{i = k}^n i \\frac{{i - 1}\\choose{k - 1}}{{n}\\choose{k}} \\\\ &= \\sum_{i = k}^n i \\frac{\\frac{(i - 1)!}{(k - 1)!(i - k)!}}{{n}\\choose{k}} \\\\ &= \\frac{k}{\\binom{n}{k}}\\sum_{i = k}^n \\binom{i}{k} \\\\ &= \\frac{k}{\\binom{n}{k}} \\binom{n + 1}{k + 1} \\\\ &= \\frac{k(n + 1)}{k + 1}. 
\\end{align*}\\] The bias of this estimator is then \\[\\begin{align*} E[m] - n = \\frac{k(n + 1)}{k + 1} - n = \\frac{k - n}{k + 1}. \\end{align*}\\] The probability that we observed our sample \\(Y = {Y_1, Y_2,...,,Y_k}\\) given \\(n\\) is \\(\\frac{1}{{n}\\choose{k}}\\). We need to find such \\(n^*\\) that this function is maximized. Additionally, we have a constraint that \\(n^* \\geq m = \\max{(Y)}\\). Let us plot this function for \\(m = 10\\) and \\(k = 4\\). library(ggplot2) my_fun <- function (x, m, k) { tmp <- 1 / (choose(x, k)) tmp[x < m] <- 0 return (tmp) } x <- 1:20 y <- my_fun(x, 10, 4) df <- data.frame(x = x, y = y) ggplot(data = df, aes(x = x, y = y)) + geom_line() ::: {.solution} (continued) We observe that the maximum of this function lies at the maximum value of the sample. Therefore \\(n^* = m\\) and ML estimate equals the plug-in estimate. \\[\\begin{align*} E[\\hat{n}] &= \\frac{k + 1}{k} E[m] - 1 \\\\ &= \\frac{k + 1}{k} \\frac{k(n + 1)}{k + 1} - 1 \\\\ &= n. \\end{align*}\\] ::: "],["nhst.html", "Chapter 16 Null hypothesis significance testing", " Chapter 16 Null hypothesis significance testing This chapter deals with null hypothesis significance testing. The students are expected to acquire the following knowledge: Binomial test. t-test. Chi-squared test. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } Exercise 16.1 (Binomial test) We assume \\(y_i \\in \\{0,1\\}\\), \\(i = 1,...,n\\) and \\(y_i | \\theta = 0.5 \\sim i.i.d.\\) Bernoulli\\((\\theta)\\). The test statistic is \\(X = \\sum_{i=1}^n\\) and the rejection region R is defined as the region where the probability of obtaining such or more extreme \\(X\\) given \\(\\theta = 0.5\\) is less than 0.05. Derive and plot the power function of the test for \\(n=100\\). What is the significance level of this test if \\(H0: \\theta = 0.5\\)? At which values of X will we reject the null hypothesis? # a # First we need the rejection region, so we need to find X_min and X_max n <- 100 qbinom(0.025, n, 0.5) ## [1] 40 qbinom(0.975, n, 0.5) ## [1] 60 pbinom(40, n, 0.5) ## [1] 0.02844397 pbinom(60, n, 0.5) ## [1] 0.9823999 X_min <- 39 X_max <- 60 thetas <- seq(0, 1, by = 0.01) beta_t <- 1 - pbinom(X_max, size = n, prob = thetas) + pbinom(X_min, size = n, prob = thetas) plot(beta_t) # b # The significance level is beta_t[51] ## [1] 0.0352002 # We will reject the null hypothesis at X values below X_min and above X_max. Exercise 16.2 (Long-run guarantees of the t-test) Generate a sample of size \\(n = 10\\) from the standard normal. Use the two-sided t-test with \\(H0: \\mu = 0\\) and record the p-value. Can you reject H0 at 0.05 significance level? (before simulating) If we repeated (b) many times, what would be the relative frequency of false positives/Type I errors (rejecting the null that is true)? What would be the relative frequency of false negatives /Type II errors (retaining the null when the null is false)? (now simulate b and check if the simulation results match your answer in b) Similar to (a-c) but now we generate data from N(-0.5, 1). Similar to (a-c) but now we generate data from N(\\(\\mu\\), 1) where we every time pick a different \\(\\mu < 0\\) and use a one-sided test \\(H0: \\mu <= 0\\). 
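Before running the simulation below, a hedged cross-check for part (d): with data from N(-0.5, 1) and \\(n = 10\\), the exact power of the two-sided one-sample t-test can be obtained with power.t.test from the stats package (loaded by default); the simulated Type II error rate should land close to one minus this value.
# an added sketch: exact power of the two-sided one-sample t-test in part (d)
pw <- power.t.test(n = 10, delta = 0.5, sd = 1, sig.level = 0.05,
                   type = "one.sample", alternative = "two.sided")
pw$power       # roughly 0.28
1 - pw$power   # expected Type II error rate, to be compared with the simulation below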
set.seed(2) # a x <- rnorm(10) my_test <- t.test(x, alternative = "two.sided", mu = 0) my_test ## ## One Sample t-test ## ## data: x ## t = 0.6779, df = 9, p-value = 0.5149 ## alternative hypothesis: true mean is not equal to 0 ## 95 percent confidence interval: ## -0.4934661 0.9157694 ## sample estimates: ## mean of x ## 0.2111516 # we can not reject the null hypothesis # b # The expected value of false positives would be 0.05. The expected value of # true negatives would be 0, as there are no negatives (the null hypothesis is # always the truth). nit <- 1000 typeIerr <- vector(mode = "logical", length = nit) typeIIerr <- vector(mode = "logical", length = nit) for (i in 1:nit) { x <- rnorm(10) my_test <- t.test(x, alternative = "two.sided", mu = 0) if (my_test$p.value < 0.05) { typeIerr[i] <- T } else { typeIerr[i] <- F } } mean(typeIerr) ## [1] 0.052 sd(typeIerr) / sqrt(nit) ## [1] 0.007024624 # d # We can not estimate the percentage of true negatives, but it will probably be # higher than 0.05. There will be no false positives as the null hypothesis is # always false. typeIIerr <- vector(mode = "logical", length = nit) for (i in 1:nit) { x <- rnorm(10, -0.5) my_test <- t.test(x, alternative = "two.sided", mu = 0) if (my_test$p.value < 0.05) { typeIIerr[i] <- F } else { typeIIerr[i] <- T } } mean(typeIIerr) ## [1] 0.719 sd(typeIIerr) / sqrt(nit) ## [1] 0.01422115 # e # The expected value of false positives would be lower than 0.05. The expected # value of true negatives would be 0, as there are no negatives (the null # hypothesis is always the truth). typeIerr <- vector(mode = "logical", length = nit) for (i in 1:nit) { u <- runif(1, -1, 0) x <- rnorm(10, u) my_test <- t.test(x, alternative = "greater", mu = 0) if (my_test$p.value < 0.05) { typeIerr[i] <- T } else { typeIerr[i] <- F } } mean(typeIerr) ## [1] 0.012 sd(typeIerr) / sqrt(nit) ## [1] 0.003444977 Exercise 16.3 (T-test, confidence intervals, and bootstrap) Sample \\(n=20\\) from a standard normal distribution and calculate the p-value using t-test, confidence intervals based on normal distribution, and bootstrap. Repeat this several times and check how many times we rejected the null hypothesis (made a type I error). Hint: For the confidence intervals you can use function CI from the Rmisc package. set.seed(1) library(Rmisc) nit <- 1000 n_boot <- 100 t_logic <- rep(F, nit) boot_logic <- rep(F, nit) norm_logic <- rep(F, nit) for (i in 1:nit) { x <- rnorm(20) my_test <- t.test(x) my_CI <- CI(x) if (my_test$p.value <= 0.05) t_logic[i] <- T boot_tmp <- vector(mode = "numeric", length = n_boot) for (j in 1:n_boot) { tmp_samp <- sample(x, size = 20, replace = T) boot_tmp[j] <- mean(tmp_samp) } if ((quantile(boot_tmp, 0.025) >= 0) | (quantile(boot_tmp, 0.975) <= 0)) { boot_logic[i] <- T } if ((my_CI[3] >= 0) | (my_CI[1] <= 0)) { norm_logic[i] <- T } } mean(t_logic) ## [1] 0.053 sd(t_logic) / sqrt(nit) ## [1] 0.007088106 mean(boot_logic) ## [1] 0.093 sd(boot_logic) / sqrt(nit) ## [1] 0.009188876 mean(norm_logic) ## [1] 0.053 sd(norm_logic) / sqrt(nit) ## [1] 0.007088106 Exercise 16.4 (Chi-squared test) Show that the \\(\\chi^2 = \\sum_{i=1}^k \\frac{(O_i - E_i)^2}{E_i}\\) test statistic is approximately \\(\\chi^2\\) distributed when we have two categories. Let us look at the US voting data here. Compare the number of voters who voted for Trump or Hillary depending on their income (less or more than 100.000 dollars per year). Manually calculate the chi-squared statistic, compare to the chisq.test in R, and discuss the results. 
Visualize the test. Solution. Let \\(X_i\\) be binary variables, \\(i = 1,...,n\\). We can then express the test statistic as \\[\\begin{align} \\chi^2 = &\\frac{(O_i - np)^2}{np} + \\frac{(n - O_i - n(1 - p))^2}{n(1 - p)} \\\\ &= \\frac{(O_i - np)^2}{np(1 - p)} \\\\ &= (\\frac{O_i - np}{\\sqrt{np(1 - p)}})^2. \\end{align}\\] When \\(n\\) is large, this distrbution is approximately normal with \\(\\mu = np\\) and \\(\\sigma^2 = np(1 - p)\\) (binomial converges in distribution to standard normal). By definition, the chi-squared distribution with \\(k\\) degrees of freedom is a sum of squares of \\(k\\) independent standard normal random variables. n <- 24588 less100 <- round(0.66 * n * c(0.49, 0.45, 0.06)) # some rounding, but it should not affect results more100 <- round(0.34 * n * c(0.47, 0.47, 0.06)) x <- rbind(less100, more100) colnames(x) <- c("Clinton", "Trump", "other/no answer") print(x) ## Clinton Trump other/no answer ## less100 7952 7303 974 ## more100 3929 3929 502 chisq.test(x) ## ## Pearson's Chi-squared test ## ## data: x ## X-squared = 9.3945, df = 2, p-value = 0.00912 x ## Clinton Trump other/no answer ## less100 7952 7303 974 ## more100 3929 3929 502 csum <- apply(x, 2, sum) rsum <- apply(x, 1, sum) chi2 <- (x[1,1] - csum[1] * rsum[1] / sum(x))^2 / (csum[1] * rsum[1] / sum(x)) + (x[1,2] - csum[2] * rsum[1] / sum(x))^2 / (csum[2] * rsum[1] / sum(x)) + (x[1,3] - csum[3] * rsum[1] / sum(x))^2 / (csum[3] * rsum[1] / sum(x)) + (x[2,1] - csum[1] * rsum[2] / sum(x))^2 / (csum[1] * rsum[2] / sum(x)) + (x[2,2] - csum[2] * rsum[2] / sum(x))^2 / (csum[2] * rsum[2] / sum(x)) + (x[2,3] - csum[3] * rsum[2] / sum(x))^2 / (csum[3] * rsum[2] / sum(x)) chi2 ## Clinton ## 9.394536 1 - pchisq(chi2, df = 2) ## Clinton ## 0.009120161 x <- seq(0, 15, by = 0.01) df <- data.frame(x = x) ggplot(data = df, aes(x = x)) + stat_function(fun = dchisq, args = list(df = 2)) + geom_segment(aes(x = chi2, y = 0, xend = chi2, yend = dchisq(chi2, df = 2))) + stat_function(fun = dchisq, args = list(df = 2), xlim = c(chi2, 15), geom = "area", fill = "red") "],["bi.html", "Chapter 17 Bayesian inference 17.1 Conjugate priors 17.2 Posterior sampling", " Chapter 17 Bayesian inference This chapter deals with Bayesian inference. The students are expected to acquire the following knowledge: How to set prior distribution. Compute posterior distribution. Compute posterior predictive distribution. Use sampling for inference. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } 17.1 Conjugate priors Exercise 17.1 (Poisson-gamma model) Let us assume a Poisson likelihood and a gamma prior on the Poisson mean parameter (this is a conjugate prior). Derive posterior Below we have some data, which represents number of goals in a football match. Choose sensible prior for this data (draw the gamma density if necessary), justify it. Compute the posterior. Compute an interval such that the probability that the true mean is in there is 95%. What is the probability that the true mean is greater than 2.5? Back to theory: Compute prior predictive and posterior predictive. Discuss why the posterior predictive is overdispersed and not Poisson? Draw a histogram of the prior predictive and posterior predictive for the data from (b). Discuss. Generate 10 and 100 random samples from a Poisson distribution and compare the posteriors with a flat prior, and a prior concentrated away from the truth. x <- c(3, 2, 1, 1, 5, 4, 0, 0, 4, 3) Solution. 
\\[\\begin{align*} p(\\lambda | X) &= \\frac{p(X | \\lambda) p(\\lambda)}{\\int_0^\\infty p(X | \\lambda) p(\\lambda) d\\lambda} \\\\ &\\propto p(X | \\lambda) p(\\lambda) \\\\ &= \\Big(\\prod_{i=1}^n \\frac{1}{x_i!} \\lambda^{x_i} e^{-\\lambda}\\Big) \\frac{\\beta^\\alpha}{\\Gamma(\\alpha)} \\lambda^{\\alpha - 1} e^{-\\beta \\lambda} \\\\ &\\propto \\lambda^{\\sum_{i=1}^n x_i + \\alpha - 1} e^{- \\lambda (n + \\beta)} \\\\ \\end{align*}\\] We recognize this as the shape of a gamma distribution, therefore \\[\\begin{align*} \\lambda | X \\sim \\text{gamma}(\\alpha + \\sum_{i=1}^n x_i, \\beta + n) \\end{align*}\\] For the prior predictive, we have \\[\\begin{align*} p(x^*) &= \\int_0^\\infty p(x^*, \\lambda) d\\lambda \\\\ &= \\int_0^\\infty p(x^* | \\lambda) p(\\lambda) d\\lambda \\\\ &= \\int_0^\\infty \\frac{1}{x^*!} \\lambda^{x^*} e^{-\\lambda} \\frac{\\beta^\\alpha}{\\Gamma(\\alpha)} \\lambda^{\\alpha - 1} e^{-\\beta \\lambda} d\\lambda \\\\ &= \\frac{\\beta^\\alpha}{\\Gamma(x^* + 1)\\Gamma(\\alpha)} \\int_0^\\infty \\lambda^{x^* + \\alpha - 1} e^{-\\lambda (1 + \\beta)} d\\lambda \\\\ &= \\frac{\\beta^\\alpha}{\\Gamma(x^* + 1)\\Gamma(\\alpha)} \\frac{\\Gamma(x^* + \\alpha)}{(1 + \\beta)^{x^* + \\alpha}} \\int_0^\\infty \\frac{(1 + \\beta)^{x^* + \\alpha}}{\\Gamma(x^* + \\alpha)} \\lambda^{x^* + \\alpha - 1} e^{-\\lambda (1 + \\beta)} d\\lambda \\\\ &= \\frac{\\beta^\\alpha}{\\Gamma(x^* + 1)\\Gamma(\\alpha)} \\frac{\\Gamma(x^* + \\alpha)}{(1 + \\beta)^{x^* + \\alpha}} \\\\ &= \\frac{\\Gamma(x^* + \\alpha)}{\\Gamma(x^* + 1)\\Gamma(\\alpha)} (\\frac{\\beta}{1 + \\beta})^\\alpha (\\frac{1}{1 + \\beta})^{x^*}, \\end{align*}\\] which we recognize as the negative binomial distribution with \\(r = \\alpha\\) and \\(p = \\frac{1}{\\beta + 1}\\). For the posterior predictive, the calculation is the same, only now the parameters are \\(r = \\alpha + \\sum_{i=1}^n x_i\\) and \\(p = \\frac{1}{\\beta + n + 1}\\). There are two sources of uncertainty in the predictive distribution. First is the uncertainty about the population. Second is the variability in sampling from the population. When \\(n\\) is large, the latter is going to be very small. But when \\(n\\) is small, the latter is going to be higher, resulting in an overdispersed predictive distribution. 
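An added sketch that cross-checks this derivation numerically, assuming the same prior \\(\alpha = \beta = 1\\) used in the code below: integrating the Poisson likelihood against the gamma prior should reproduce the negative binomial prior predictive with \\(r = \alpha\\) and \\(p = \frac{1}{\beta + 1}\\) (in R's parametrization this corresponds to prob \\(= \frac{\beta}{\beta + 1}\\)).
# an added sketch: prior predictive by numerical integration vs. the derived negative binomial
alpha <- 1
beta <- 1
k <- 0:5
num_int <- sapply(k, function(x)
  integrate(function(l) dpois(x, l) * dgamma(l, shape = alpha, rate = beta),
            lower = 0, upper = Inf)$value)
analytic <- dnbinom(k, size = alpha, prob = beta / (beta + 1))
round(cbind(k, num_int, analytic), 4)  # the two columns agree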
x <- c(3, 2, 1, 1, 5, 4, 0, 0, 4, 3) # b # quick visual check of the prior ggplot(data = data.frame(x = seq(0, 5, by = 0.01)), aes(x = x)) + stat_function(fun = dgamma, args = list(shape = 1, rate = 1)) palpha <- 1 pbeta <- 1 alpha_post <- palpha + sum(x) beta_post <- pbeta + length(x) ggplot(data = data.frame(x = seq(0, 5, by = 0.01)), aes(x = x)) + stat_function(fun = dgamma, args = list(shape = alpha_post, rate = beta_post)) # probability of being higher than 2.5 1 - pgamma(2.5, alpha_post, beta_post) ## [1] 0.2267148 # interval qgamma(c(0.025, 0.975), alpha_post, beta_post) ## [1] 1.397932 3.137390 # d prior_pred <- rnbinom(1000, size = palpha, prob = 1 - 1 / (pbeta + 1)) post_pred <- rnbinom(1000, size = palpha + sum(x), prob = 1 - 1 / (pbeta + 10 + 1)) df <- data.frame(prior = prior_pred, posterior = post_pred) df <- gather(df) ggplot(df, aes(x = value, fill = key)) + geom_histogram(position = "dodge") # e set.seed(1) x1 <- rpois(10, 2.5) x2 <- rpois(100, 2.5) alpha_flat <- 1 beta_flat <- 0.1 alpha_conc <- 50 beta_conc <- 10 n <- 10000 df_flat <- data.frame(x1 = rgamma(n, alpha_flat + sum(x1), beta_flat + 10), x2 = rgamma(n, alpha_flat + sum(x2), beta_flat + 100), type = "flat") df_flat <- tidyr::gather(df_flat, key = "key", value = "value", - type) df_conc <- data.frame(x1 = rgamma(n, alpha_conc + sum(x1), beta_conc + 10), x2 = rgamma(n, alpha_conc + sum(x2), beta_conc + 100), type = "conc") df_conc <- tidyr::gather(df_conc, key = "key", value = "value", - type) df <- rbind(df_flat, df_conc) ggplot(data = df, aes(x = value, color = type)) + facet_wrap(~ key) + geom_density() 17.2 Posterior sampling Exercise 17.2 (Bayesian logistic regression) In Chapter 15 we implemented a MLE for logistic regression (see the code below). For this model, conjugate priors do not exist, which complicates the calculation of the posterior. However, we can use sampling from the numerator of the posterior, using rejection sampling. Set a sensible prior distribution on \\(\\beta\\) and use rejection sampling to find the posterior distribution. In a) you will get a distribution of parameter \\(\\beta\\). Plot the probabilities (as in exercise 15.3) for each sample of \\(\\beta\\) and compare to the truth. Hint: We can use rejection sampling even for functions which are not PDFs – they do not have to sum/integrate to 1. We just need to use a suitable envelope that we know how to sample from. For example, here we could use a uniform distribution and scale it suitably. set.seed(1) inv_log <- function (z) { return (1 / (1 + exp(-z))) } x <- rnorm(100) y <- x y <- rbinom(100, size = 1, prob = inv_log(1.2 * x)) l_logistic <- function (beta, X, y) { logl <- -sum(y * log(inv_log(as.vector(beta %*% X))) + (1 - y) * log((1 - inv_log(as.vector(beta %*% X))))) return(logl) } my_optim <- optim(par = 0.5, fn = l_logistic, method = "L-BFGS-B", lower = 0, upper = 10, X = x, y = y) my_optim$par # Let's say we believe that the mean of beta is 0.5. Since we are not very sure # about this, we will give it a relatively high variance. So a normal prior with # mean 0.5 and standard deviation 5. But there is no right solution to this, # this is basically us expressing our prior belief in the parameter values. 
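# (An added sketch, not from the original solution.) One way to gauge how weakly
# informative the assumed N(0.5, 5) prior on beta is: push prior draws through
# the inverse logit at a fixed input, say x = 1.
beta_prior <- rnorm(1000, mean = 0.5, sd = 5)
hist(1 / (1 + exp(-beta_prior * 1)), breaks = 30,
     main = "Implied prior on P(y = 1 | x = 1)", xlab = "probability")
# most of the mass sits near 0 and 1, so the prior says little about the
# success probability itself -- the data will dominate the posterior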
set.seed(1) inv_log <- function (z) { return (1 / (1 + exp(-z))) } x <- rnorm(100) y <- x y <- rbinom(100, size = 1, prob = inv_log(1.2 * x)) l_logistic <- function (beta, X, y) { logl <- -sum(y * log(inv_log(as.vector(beta %*% X))) + (1 - y) * log((1 - inv_log(as.vector(beta %*% X))))) if (is.nan(logl)) logl <- Inf return(logl) } my_optim <- optim(par = 0.5, fn = l_logistic, method = "L-BFGS-B", lower = 0, upper = 10, X = x, y = y) my_optim$par ## [1] 1.166558 f_logistic <- function (beta, X, y) { logl <- prod(inv_log(as.vector(beta %*% X))^y * (1 - inv_log(as.vector(beta %*% X)))^(1 - y)) return(logl) } a <- seq(0, 3, by = 0.01) my_l <- c() for (i in a) { my_l <- c(my_l, f_logistic(i, x, y) * dnorm(i, 0.5, 5)) } plot(my_l) envlp <- 10^(-25.8) * dunif(a, -5, 5) # found by trial and error tmp <- data.frame(envel = envlp, l = my_l, t = a) tmp <- gather(tmp, key = "key", value = "value", - t) ggplot(tmp, aes(x = t, y = value, color = key)) + geom_line() # envelope OK set.seed(1) nsamps <- 1000 samps <- c() for (i in 1:nsamps) { tmp <- runif(1, -5, 5) u <- runif(1, 0, 1) if (u < (f_logistic(tmp, x, y) * dnorm(tmp, 0.5, 5)) / (10^(-25.8) * dunif(tmp, -5, 5))) { samps <- c(samps, tmp) } } plot(density(samps)) mean(samps) ## [1] 1.211578 median(samps) ## [1] 1.204279 truth_p <- data.frame(x = x, prob = inv_log(1.2 * x), type = "truth") preds <- inv_log(x %*% t(samps)) preds <- gather(cbind(as.data.frame(preds), x = x), key = "key", "value" = value, - x) ggplot(preds, aes(x = x, y = value)) + geom_line(aes(group = key), color = "gray", alpha = 0.7) + geom_point(data = truth_p, aes(y = prob), color = "red", alpha = 0.7) + theme_bw() "],["distributions-intutition.html", "Chapter 18 Distributions intutition 18.1 Discrete distributions 18.2 Continuous distributions", " Chapter 18 Distributions intutition This chapter is intended to help you familiarize yourself with the different probability distributions you will encounter in this course. You will need to use Appendix B extensively as a reference for the basic properties of distributions, so keep it close! .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } 18.1 Discrete distributions Exercise 18.1 (Bernoulli intuition 1) The simplest distribution you will encounter is the Bernoulli distribution. It is a discrete probability distribution used to represent the outcome of a yes/no question. It has one parameter \\(0 \\leq p \\leq 1\\), which is the probability of success. The probability of failure is \\(q = (1-p)\\). A classic way to think about a Bernoulli trial (a yes/no experiment) is a coin flip. Real coins are fair, meaning the probability of either heads (1) or tails (0) are the same, so \\(p=0.5\\) as shown below in figure a. Alternatively we may want to represent a process that doesn’t have equal probabilities of outcomes like “Will a throw of a fair die result in a 6?”. In this case \\(p=\\frac{1}{6}\\), shown in figure b. Using your knowledge of the Bernoulli distribution use the throw of a fair die to think of events, such that: \\(p = 0.5\\) \\(p = \\frac{5}{6}\\) \\(q = \\frac{2}{3}\\) Solution. An event that is equally likely to happen or not happen i.e. \\(p = 0.5\\) would be throwing an even number. More formally we can name this event \\(A\\) and write: \\(A = \\{2,4,6\\}\\), its probability being \\(P(A) = 0.5\\) An example of an event with \\(p = \\frac{5}{6}\\) would be throwing a number greater than 1. Defined as \\(B = \\{2,3,4,5,6\\}\\). 
We need an event that fails \\(\\frac{2}{3}\\) of the time. Alternatively we can reverse the problem and find an event that succeeds \\(\\frac{1}{3}\\) of the time, since: \\(q = 1 - p \\implies p = 1 - q = \\frac{1}{3}\\). The event that our outcome is divisible by 3: \\(C = \\{3, 6\\}\\) satisfies this condition. Exercise 18.2 (Binomial intuition 1) The binomial distribution is a generalization of the Bernoulli distribution. Instead of considering a single Bernoulli trial, we now consider a sum of a sequence of \\(n\\) trials, which are independent and have the same parameter \\(p\\). So the binomial distribution has two parameters \\(n\\) - the number of trials and \\(p\\) - the probability of success for each trial. If we return to our coin flip representation, we now flip a coin several times. The binomial distribution will give us the probabilities of all possible outcomes. Below we show the distribution for a series of 10 coin flips with a fair coin (left) and a biased coin (right). The numbers on the x axis represent the number of times the coin landed heads. Using your knowledge of the binomial distribution: Take the pmf of the binomial distribution and plug in \\(n=1\\), check that it is in fact equivalent to a Bernoulli distribution. In our examples we show the graph of a binomial distribution over 10 trials with \\(p=0.8\\). If we take a look at the graph, it appears as though the probabilities of getting 0,1, 2 or 3 heads in 10 flips are zero. Is it actually zero? Check by plugging in the values into the pmf. Solution. The pmf of a binomial distribution is \\(\\binom{n}{k} p^k (1 - p)^{n - k}\\), now we insert \\(n=1\\) to get: \\[\\binom{1}{k} p^k (1 - p)^{1 - k}\\] Not quite equivalent to a Bernoulli, however note that the support of the binomial distribution is defined as \\(k \\in \\{0,1,\\dots,n\\}\\), so in our case \\(k = \\{0,1\\}\\), then: \\[\\binom{1}{0} = \\binom{1}{1} = 1\\] we get: \\(p^k (1 - p)^{1 - k}\\) ,the Bernoulli distribution. As we already know \\(p=0.8, n=10\\), so: \\[\\binom{10}{0} 0.8^0 (1 - 0.8)^{10 - 0} = 1.024 \\cdot 10^{-7}\\] \\[\\binom{10}{1} 0.8^1 (1 - 0.8)^{10 - 1} = 4.096 \\cdot 10^{-6}\\] \\[\\binom{10}{2} 0.8^2 (1 - 0.8)^{10 - 2} = 7.3728 \\cdot 10^{-5}\\] \\[\\binom{10}{3} 0.8^3 (1 - 0.8)^{10 - 3} = 7.86432\\cdot 10^{-4}\\] So the probabilities are not zero, just very small. Exercise 18.3 (Poisson intuition 1) Below are shown 3 different graphs of the Poisson distribution. Your task is to replicate them on your own in R by varying the \\(\\lambda\\) parameter. Hint: You can use dpois() to get the probabilities. library(ggplot2) library(gridExtra) x = 0:15 # Create Poisson data data1 <- data.frame(x = x, y = dpois(x, lambda = 0.1)) data2 <- data.frame(x = x, y = dpois(x, lambda = 1)) data3 <- data.frame(x = x, y = dpois(x, lambda = 7.5)) # Create individual ggplot objects plot1 <- ggplot(data1, aes(x, y)) + geom_col() + xlab("x") + ylab("Probability") + ylim(0,1) plot2 <- ggplot(data2, aes(x, y)) + geom_col() + xlab("x") + ylab(NULL) + ylim(0,1) plot3 <- ggplot(data3, aes(x, y)) + geom_col() + xlab("x") + ylab(NULL) + ylim(0,1) # Combine the plots grid.arrange(plot1, plot2, plot3, ncol = 3) Exercise 18.4 (Poisson intuition 2) The Poisson distribution is a discrete probability distribution that models the probability of a given number of events occuring within processes where events occur at a constant mean rate and independently of each other - a Poisson process. 
It has a single parameter \\(\\lambda\\), which represents the constant mean rate. A classic example of a scenario that can be modeled using the Poisson distribution is the number of calls received at a call center in a day (or in fact any other time interval). Suppose you work in a call center and have some understanding of probability distributions. You overhear your supervisor mentioning that the call center receives an average of 2.5 calls per day. Using your knowledge of the Poisson distribution, calculate: The probability you will get no calls today. The probability you will get more than 5 calls today. Solution. First recall the Poisson pmf: \\[p(k) = \\frac{\\lambda^k e^{-\\lambda}}{k!}\\] as stated previously our parameter \\(\\lambda = 2.5\\) To get the probability of no calls we simply plug in \\(k = 0\\), so: \\[p(0) = \\frac{2.5^0 e^{-2.5}}{0!} = e^{-2.5} \\approx 0.082\\] The support of the Poisson distribution is non-negative integers. So if we wanted to calculate the probability of getting more than 5 calls we would need to add up the probabilities of getting 6 calls and 7 calls and so on up to infinity. Let us instead remember that the sum of all probabilties will be 1, we will reverse the problem and instead ask “What is the probability we get 5 calls or less?”. We can subtract the probability of the opposite outcome (the complement) from 1 to get the probability of our original question. \\[P(k > 5) = 1 - P(k \\leq 5)\\] \\[P(k \\leq 5) = \\sum_{i=0}^{5} p(i) = p(0) + p(1) + p(2) + p(3) + p(4) + p(5) =\\] \\[= \\frac{2.5^0 e^{-2.5}}{0!} + \\frac{2.5^1 e^{-2.5}}{1!} + \\dots =\\] \\[=0.957979\\] So the probability of geting more than 5 calls will be \\(1 - 0.957979 = 0.042021\\) Exercise 18.5 (Geometric intuition 1) The geometric distribution is a discrete distribution that models the number of failures before the first success in a sequence of independent Bernoulli trials. It has a single parameter \\(p\\), representing the probability of success and its support is all non-negative integers \\(\\{0,1,2,\\dots\\}\\). NOTE: There is an alternative way to think about this distribution, one that models the number of trials before the first success. The difference is subtle yet significant and you are likely to encounter both forms. The key to telling them apart is to check their support, since the number of trials has to be at least \\(1\\), for this case we have \\(\\{1,2,\\dots\\}\\). In the graph below we show the pmf of a geometric distribution with \\(p=0.5\\). This can be thought of as the number of successive failures (tails) in the flip of a fair coin. You can see that there’s a 50% chance you will have zero failures i.e. you will flip a heads on your very first attempt. But there is some smaller chance that you will flip a sequence of tails in a row, with longer sequences having ever lower probability. Create an equivalent graph that represents the probability of rolling a 6 with a fair 6-sided die. Use the formula for the mean of the geometric distribution and determine the average number of failures before you roll a 6. Look up the alternative form of the geometric distribtuion and again use the formula for the mean to determine the average number of trials up to and including rolling a 6. Solution. Parameter p (the probability of success) for rolling a 6 is \\(p=\\frac{1}{6}\\). 
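(An added sketch.) Before plotting the pmf for part (a), the means asked for in parts (b) and (c) can also be previewed numerically with a truncated sum, assuming \\(p = \frac{1}{6}\\):
# an added sketch: numerical check of the means asked for in (b) and (c)
p <- 1 / 6
k <- 0:2000                      # truncate the infinite sum; the remaining tail is negligible
sum(k * dgeom(k, prob = p))      # about 5 expected failures before the first 6
sum(k * dgeom(k, prob = p)) + 1  # about 6 trials once the successful roll is counted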
library(ggplot2) # Parameters p <- 1/6 x_vals <- 0:9 # Starting from 0 probs <- dgeom(x_vals, p) # Data data <- data.frame(x_vals, probs) # Plot ggplot(data, aes(x=x_vals, y=probs)) + geom_segment(aes(xend=x_vals, yend=0), color="black", size=1) + geom_point(color="red", size=2) + labs(x = "Number of trials", y = "Probability") + theme_minimal() + scale_x_continuous(breaks = x_vals) # This line ensures integer x-axis labels ::: {.solution} b) The expected value of a random variable (the mean) is denoted as \\(E[X]\\). \\[E[X] = \\frac{1-p}{p}= \\frac{1- \\frac{1}{6}}{\\frac{1}{6}} = \\frac{5}{6}\\cdot 6 = 5\\] On average we will fail 5 times before we roll our first 6. The alternative form of this distribution (with support on all positive integers) has a slightly different formula for the mean. This change reflects the difference in the way we posed our question: \\[E[X] = \\frac{1}{p} = \\frac{1}{\\frac{1}{6}} = 6\\] On average we will have to throw the die 6 times before we roll a 6. ::: 18.2 Continuous distributions Exercise 18.6 (Uniform intuition 1) The need for a randomness is a common problem. A practical solution are so-called random number generators (RNGs). The simplest RNG one would think of is choosing a set of numbers and having the generator return a number at random, where the probability of returning any number from this set is the same. If this set is an interval of real numbers, then we’ve basically described the continuous uniform distribution. It has two parameters \\(a\\) and \\(b\\), which define the beginning and end of its support respectively. Let’s think about the mean intuitively. Think of the area under the graph as a geometric shape. The expected value or mean of a distribution is the x-axis value of its center of mass. Given parameters \\(a\\) and \\(b\\) what is your intuitive guess of the mean for the uniform distribution? A special case of the uniform distribution is the standard uniform distribution with \\(a=0\\) and \\(b=1\\). Write the pdf \\(f(x)\\) of this particular distribution. Solution. The center of mass is the center of the square from \\(a\\) to \\(b\\) and from 0 to \\(\\frac{1}{b-a}\\). Its value on the x-axis is the midpoint between \\(a\\) and \\(b\\), so \\(\\frac{a+b}{2}\\) Inserting the parameter values we get:\\[f(x) = \\begin{cases} 1 & \\text{if } 0 \\leq x \\leq 1 \\\\ 0 & \\text{otherwise} \\end{cases} \\] Notice how the pdf is just a constant \\(1\\) across all values of \\(x \\in [0,1]\\). Here it is important to distinguish between probability and probability density. The density may be 1, but the probability is not and while discrete distributions never exceed 1 on the y-axis, continuous distributions can go as high as you like. Exercise 18.7 (Normal intuition 1) The normal distribution, also known as the Gaussian distribution, is a continuous distribution that encompasses the entire real number line. It has two parameters: the mean, denoted by \\(\\mu\\), and the variance, represented by \\(\\sigma^2\\). Its shape resembles the iconic bell curve. The position of its peak is determined by the parameter \\(\\mu\\), while the variance determines the spread or width of the curve. A smaller variance results in a sharper, narrower peak, while a larger variance leads to a broader, more spread-out curve. Below, we graph the distribution of IQ scores for two different populations. We aim to identify individuals with an IQ at or above 140 for an experiment. 
We can identify them reliably; however, we only have time to examine one of the two groups. Which group should we investigate to have the best chance of finding such individuals? NOTE: The graph below displays the parameter \\(\\sigma\\), which is the square root of the variance, more commonly referred to as the standard deviation. Keep this in mind when solving the problems. Insert the values of either population into the pdf of a normal distribution and determine which one has a higher density at \\(x=140\\). Generate the graph yourself and zoom into the relevant area to graphically verify your answer. To determine probability density, we can use the pdf. However, if we wish to know the proportion of the population that falls within certain parameters, we would need to integrate the pdf. Fortunately, the integrals of common distributions are well-established. This integral gives us the cumulative distribution function \\(F(x)\\) (CDF). BONUS: Look up the CDF of the normal distribution and input the appropriate values to determine the percentage of each population that comprises individuals with an IQ of 140 or higher. Solution. Group 1: \\(\\mu = 100, \\sigma=10 \\rightarrow \\sigma^2 = 100\\) \\[\\frac{1}{\\sqrt{2 \\pi \\sigma^2}} e^{-\\frac{(x - \\mu)^2}{2 \\sigma^2}} = \\frac{1}{\\sqrt{2 \\pi 100}} e^{-\\frac{(140 - 100)^2}{2 \\cdot 100}} \\approx 1.34e-05\\] Group 2: \\(\\mu = 105, \\sigma=8 \\rightarrow \\sigma^2 = 64\\) \\[\\frac{1}{\\sqrt{2 \\pi \\sigma^2}} e^{-\\frac{(x - \\mu)^2}{2 \\sigma^2}} = \\frac{1}{\\sqrt{2 \\pi 64}} e^{-\\frac{(140 - 105)^2}{2 \\cdot 64}} \\approx 3.48e-06\\] So despite the fact that group 1 has a lower average IQ, we are more likely to find 140 IQ individuals in this group. library(ggplot2) library(tidyr) # Create data x <- seq(135, 145, by = 0.01) # Adjusting the x range to account for the larger standard deviations df <- data.frame(x = x) # Define the IQ distributions df$IQ_mu100_sd10 <- dnorm(df$x, mean = 100, sd = 10) df$IQ_mu105_sd8 <- dnorm(df$x, mean = 105, sd = 8) # Convert from wide to long format for ggplot2 df_long <- gather(df, distribution, density, -x) # Ensure the levels of the 'distribution' factor match our desired order df_long$distribution <- factor(df_long$distribution, levels = c("IQ_mu100_sd10", "IQ_mu105_sd8")) # Plot ggplot(df_long, aes(x = x, y = density, color = distribution)) + geom_line() + labs(x = "IQ Score", y = "Density") + scale_color_manual( name = "IQ Distribution", values = c(IQ_mu100_sd10 = "red", IQ_mu105_sd8 = "blue"), labels = c("Group 1 (µ=100, σ=10)", "Group 2 (µ=105, σ=8)") ) + theme_minimal() ::: {.solution} c. The CDF of the normal distribution is \\(\\Phi(x) = \\frac{1}{2} \\left[ 1 + \\text{erf} \\left( \\frac{x - \\mu}{\\sigma \\sqrt{2}} \\right) \\right]\\). The CDF is defined as the integral of the distribution density up to x. So to get the total percentage of individuals with IQ at 140 or higher we will need to subtract the value from 1. Group 1: \\[1 - \\Phi(140) = \\frac{1}{2} \\left[ 1 + \\text{erf} \\left( \\frac{140 - 100}{10 \\sqrt{2}} \\right) \\right] \\approx 3.17e-05 \\] Group 2 : \\[1 - \\Phi(140) = \\frac{1}{2} \\left[ 1 + \\text{erf} \\left( \\frac{140 - 105}{8 \\sqrt{2}} \\right) \\right] \\approx 6.07e-06 \\] So roughly 0.003% and 0.0006% of individuals in groups 1 and 2 respectively have an IQ at or above 140. ::: Exercise 18.8 (Beta intuition 1) The beta distribution is a continuous distribution defined on the unit interval \\([0,1]\\). 
It has two strictly positive paramters \\(\\alpha\\) and \\(\\beta\\), which determine its shape. Its support makes it especially suitable to model distribtuions of percentages and proportions. Below you’ve been provided with some code that you can copy into Rstudio. Once you run the code, an interactive Shiny app will appear and you will be able to manipulate the graph of the beta distribution. Play around with the parameters to get: A symmetric bell curve A bowl-shaped curve The standard uniform distribution is actually a special case of the beta distribution. Find the exact parameters \\(\\alpha\\) and \\(\\beta\\). Once you do, prove the equality by inserting the values into our pdf. Hint: The beta function is evaluated as \\(\\text{B}(a,b) = \\frac{\\Gamma(a)\\Gamma(b)}{\\Gamma(a+b)}\\), the gamma function for positive integers \\(n\\) is evaluated as \\(\\Gamma(n)= (n-1)!\\) # Install and load necessary packages install.packages(c("shiny", "ggplot2")) library(shiny) library(ggplot2) # The Shiny App ui <- fluidPage( titlePanel("Beta Distribution Viewer"), sidebarLayout( sidebarPanel( sliderInput("alpha", "Alpha:", min = 0.1, max = 10, value = 2, step = 0.1), sliderInput("beta", "Beta:", min = 0.1, max = 10, value = 2, step = 0.1) ), mainPanel( plotOutput("betaPlot") ) ) ) server <- function(input, output) { output$betaPlot <- renderPlot({ x <- seq(0, 1, by = 0.01) y <- dbeta(x, shape1 = input$alpha, shape2 = input$beta) ggplot(data.frame(x = x, y = y), aes(x = x, y = y)) + geom_line() + labs(x = "Value", y = "Density") + theme_minimal() }) } shinyApp(ui = ui, server = server) Solution. Possible solution \\(\\alpha = \\beta= 5\\) Possible solution \\(\\alpha = \\beta= 0.5\\) The correct parameters are \\(\\alpha = 1, \\beta=1\\), to prove the equality we insert them into the beta pdf: \\[\\frac{x^{\\alpha - 1} (1 - x)^{\\beta - 1}}{\\text{B}(\\alpha, \\beta)} = \\frac{x^{1 - 1} (1 - x)^{1 - 1}}{\\text{B}(1, 1)} = \\frac{1}{\\frac{\\Gamma(1)\\Gamma(1)}{\\Gamma(1+1)}}= \\frac{1}{\\frac{(1-1)!(1-1)!}{(2-1)!}} = 1\\] Exercise 18.9 (Exponential intuition 1) The exponential distribution represents the distributon of time between events in a Poisson process. It is the continuous analogue of the geometric distribution. It has a single parameter \\(\\lambda\\), which is strictly positive and represents the constant rate of the corresponding Poisson process. The support is all positive reals, since time between events is non-negative, but not bound upwards. Let’s revisit the call center from our Poisson problem. We get 2.5 calls per day on average, this is our rate parameter \\(\\lambda\\). A work day is 8 hours. What is the mean time between phone calls? The cdf \\(F(x)\\) tells us what percentage of calls occur within x amount of time of each other. You want to take an hour long lunch break but are worried about missing calls. Calculate the probability of missing at least one call if you’re gone for an hour. Hint: The cdf is \\(F(x) = \\int_{-\\infty}^{x} f(x) dx\\) Solution. 
Taking \\(\\lambda = \\frac{2.5 \\text{ calls}}{8 \\text{ hours}} = \\frac{1 \\text{ call}}{3.2 \\text{ hours}}\\) \\[E[X] = \\frac{1}{\\lambda} = \\frac{3.2 \\text{ hours}}{\\text{call}}\\] First we derive the CDF, we can integrate from 0 instead of \\(-\\infty\\), since we have no support in the negatives: \\[\\begin{align} F(x) &= \\int_{0}^{x} \\lambda e^{-\\lambda t} dt \\\\ &= \\lambda \\int_{0}^{x} e^{-\\lambda t} dt \\\\ &= \\lambda (\\frac{1}{-\\lambda}e^{-\\lambda t} |_{0}^{x}) \\\\ &= \\lambda(\\frac{1}{\\lambda} - \\frac{1}{\\lambda} e^{-\\lambda x}) \\\\ &= 1 - e^{-\\lambda x}. \\end{align}\\] Then we just evaluate it for a time of 1 hour: \\[F(1 \\text{ hour}) = 1 - e^{-\\frac{1 \\text{ call}}{3.2 \\text{ hours}} \\cdot 1 \\text{ hour}}= 1 - e^{-\\frac{1 \\text{ call}}{3.2 \\text{ hours}}} \\approx 0.268\\] So we have about a 27% chance of missing at least one call if we’re gone for an hour. Exercise 18.10 (Gamma intuition 1) The gamma distribution is a continuous distribution with by two parameters, \\(\\alpha\\) and \\(\\beta\\), both greater than 0. These parameters afford the distribution a broad range of shapes, leading to it being commonly referred to as a family of distributions. Given its support over the positive real numbers, it is well suited for modeling a diverse range of positive-valued phenomena. The exponential distribution is actually just a particular form of the gamma distribution. What are the values of \\(\\alpha\\) and \\(\\beta\\)? Copy the code from our beta distribution Shiny app and modify it to simulate the gamma distribution. Then get it to show the exponential. Solution. Let’s start by taking a look at the pdfs of the two distributions side by side: \\[\\frac{\\beta^\\alpha}{\\Gamma(\\alpha)} x^{\\alpha - 1}e^{-\\beta x} = \\lambda e^{-\\lambda x}\\] The \\(x^{\\alpha - 1}\\) term is not found anywhere in the pdf of the exponential so we need to eliminate it by setting \\(\\alpha = 1\\). This also makes the fraction evaluate to \\(\\frac{\\beta^1}{\\Gamma(1)} = \\beta\\), which leaves us with \\[\\beta \\cdot e^{-\\beta x}\\] Now we can see that \\(\\beta = \\lambda\\) and \\(\\alpha = 1\\). # Install and load necessary packages install.packages(c("shiny", "ggplot2")) library(shiny) library(ggplot2) # The Shiny App ui <- fluidPage( titlePanel("Gamma Distribution Viewer"), sidebarLayout( sidebarPanel( sliderInput("shape", "Shape (α):", min = 0.1, max = 10, value = 2, step = 0.1), sliderInput("scale", "Scale (β):", min = 0.1, max = 10, value = 2, step = 0.1) ), mainPanel( plotOutput("gammaPlot") ) ) ) server <- function(input, output) { output$gammaPlot <- renderPlot({ x <- seq(0, 25, by = 0.1) y <- dgamma(x, shape = input$shape, scale = input$scale) ggplot(data.frame(x = x, y = y), aes(x = x, y = y)) + geom_line() + labs(x = "Value", y = "Density") + theme_minimal() }) } shinyApp(ui = ui, server = server) "],["A1.html", "A R programming language A.1 Basic characteristics A.2 Why R? A.3 Setting up A.4 R basics A.5 Functions A.6 Other tips A.7 Further reading and references", " A R programming language A.1 Basic characteristics R is free software for statistical computing and graphics. It is widely used by statisticians, scientists, and other professionals for software development and data analysis. It is an interpreted language and therefore the programs do not need compilation. A.2 Why R? R is one of the main two languages used for statistics and machine learning (the other being Python). Pros Libraries. 
Comprehensive collection of statistical and machine learning packages. Easy to code. Open source. Anyone can access R and develop new methods. Additionally, it is relatively simple to get source code of established methods. Large community. The use of R has been rising for some time, in industry and academia. Therefore a large collection of blogs and tutorials exists, along with people offering help on pages like StackExchange and CrossValidated. Integration with other languages and LaTeX. New methods. Many researchers develop R packages based on their research, therefore new methods are available soon after development. Cons Slow. Programs run slower than in other programming languages, however this can be somewhat ammended by effective coding or integration with other languages. Memory intensive. This can become a problem with large data sets, as they need to be stored in the memory, along with all the information the models produce. Some packages are not as good as they should be, or have poor documentation. Object oriented programming in R can be very confusing and complex. A.3 Setting up https://www.r-project.org/. A.3.1 RStudio RStudio is the most widely used IDE for R. It is free, you can download it from https://rstudio.com/. While console R is sufficient for the requirements of this course, we recommend the students install RStudio for its better user interface. A.3.2 Libraries for data science Listed below are some of the more useful libraries (packages) for data science. Students are also encouraged to find other useful packages. dplyr Efficient data manipulation. Part of the wider package collection called tidyverse. ggplot2 Plotting based on grammar of graphics. stats Several statistical models. rstan Bayesian inference using Hamiltonian Monte Carlo. Very flexible model building. MCMCpack Bayesian inference. rmarkdown, knitr, and bookdown Dynamic reports (for example such as this one). devtools Package development. 
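As a minimal, hedged illustration of how the first two packages on this list fit together (using the built-in iris data set; any small data frame would do):
# an added sketch: dplyr for manipulation, ggplot2 for plotting
library(dplyr)
library(ggplot2)
iris %>%
  group_by(Species) %>%
  summarise(mean_petal_length = mean(Petal.Length)) %>%
  ggplot(aes(x = Species, y = mean_petal_length)) +
  geom_col()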
A.4 R basics A.4.1 Variables and types Important information and tips: no type declaration define variables with <- instead of = (although both work, there is a slight difference, additionally most of the packages use the arrow) for strings use \"\" for comments use # change types with as.type() functions no special type for single character like C++ for example n <- 20 x <- 2.7 m <- n # m gets value 20 my_flag <- TRUE student_name <- "Luka" typeof(n) ## [1] "double" typeof(student_name) ## [1] "character" typeof(my_flag) ## [1] "logical" typeof(as.integer(n)) ## [1] "integer" typeof(as.character(n)) ## [1] "character" A.4.2 Basic operations n + x ## [1] 22.7 n - x ## [1] 17.3 diff <- n - x # variable diff gets the difference between n and x diff ## [1] 17.3 n * x ## [1] 54 n / x ## [1] 7.407407 x^2 ## [1] 7.29 sqrt(x) ## [1] 1.643168 n > 2 * n ## [1] FALSE n == n ## [1] TRUE n == 2 * n ## [1] FALSE n != n ## [1] FALSE paste(student_name, "is", n, "years old") ## [1] "Luka is 20 years old" A.4.3 Vectors use c() to combine elements into vectors can only contain one type of variable if different types are provided, all are transformed to the most basic type in the vector access elements by indexes or logical vectors of the same length a scalar value is regarded as a vector of length 1 1:4 # creates a vector of integers from 1 to 4 ## [1] 1 2 3 4 student_ages <- c(20, 23, 21) student_names <- c("Luke", "Jen", "Mike") passed <- c(TRUE, TRUE, FALSE) length(student_ages) ## [1] 3 # access by index student_ages[2] ## [1] 23 student_ages[1:2] ## [1] 20 23 student_ages[2] <- 24 # change values # access by logical vectors student_ages[passed == TRUE] # same as student_ages[passed] ## [1] 20 24 student_ages[student_names %in% c("Luke", "Mike")] ## [1] 20 21 student_names[student_ages > 20] ## [1] "Jen" "Mike" A.4.3.1 Operations with vectors most operations are element-wise if we operate on vectors of different lengths, the shorter vector periodically repeats its elements until it reaches the length of the longer one a <- c(1, 3, 5) b <- c(2, 2, 1) d <- c(6, 7) a + b ## [1] 3 5 6 a * b ## [1] 2 6 5 a + d ## Warning in a + d: longer object length is not a multiple of shorter object ## length ## [1] 7 10 11 a + 2 * b ## [1] 5 7 7 a > b ## [1] FALSE TRUE TRUE b == a ## [1] FALSE FALSE FALSE a %*% b # vector multiplication, not element-wise ## [,1] ## [1,] 13 A.4.4 Factors vectors of finite predetermined classes suitable for categorical variables ordinal (ordered) or nominal (unordered) car_brand <- factor(c("Audi", "BMW", "Mercedes", "BMW"), ordered = FALSE) car_brand ## [1] Audi BMW Mercedes BMW ## Levels: Audi BMW Mercedes freq <- factor(x = NA, levels = c("never","rarely","sometimes","often","always"), ordered = TRUE) freq[1:3] <- c("rarely", "sometimes", "rarely") freq ## [1] rarely sometimes rarely ## Levels: never < rarely < sometimes < often < always freq[4] <- "quite_often" # non-existing level, returns NA ## Warning in `[<-.factor`(`*tmp*`, 4, value = "quite_often"): invalid factor ## level, NA generated freq ## [1] rarely sometimes rarely <NA> ## Levels: never < rarely < sometimes < often < always A.4.5 Matrices two-dimensional generalizations of vectors my_matrix <- matrix(c(1, 2, 1, 5, 4, 2), nrow = 2, byrow = TRUE) my_matrix ## [,1] [,2] [,3] ## [1,] 1 2 1 ## [2,] 5 4 2 my_square_matrix <- matrix(c(1, 3, 2, 3), nrow = 2) my_square_matrix ## [,1] [,2] ## [1,] 1 2 ## [2,] 3 3 my_matrix[1,2] # first row, second column ## [1] 2 my_matrix[2, ] # second row ## [1] 5 4 2 my_matrix[ ,3] # third 
column ## [1] 1 2 A.4.5.1 Matrix functions and operations most operation element-wise mind the dimensions when using matrix multiplication %*% nrow(my_matrix) # number of matrix rows ## [1] 2 ncol(my_matrix) # number of matrix columns ## [1] 3 dim(my_matrix) # matrix dimension ## [1] 2 3 t(my_matrix) # transpose ## [,1] [,2] ## [1,] 1 5 ## [2,] 2 4 ## [3,] 1 2 diag(my_matrix) # the diagonal of the matrix as vector ## [1] 1 4 diag(1, nrow = 3) # creates a diagonal matrix ## [,1] [,2] [,3] ## [1,] 1 0 0 ## [2,] 0 1 0 ## [3,] 0 0 1 det(my_square_matrix) # matrix determinant ## [1] -3 my_matrix + 2 * my_matrix ## [,1] [,2] [,3] ## [1,] 3 6 3 ## [2,] 15 12 6 my_matrix * my_matrix # element-wise multiplication ## [,1] [,2] [,3] ## [1,] 1 4 1 ## [2,] 25 16 4 my_matrix %*% t(my_matrix) # matrix multiplication ## [,1] [,2] ## [1,] 6 15 ## [2,] 15 45 my_vec <- as.vector(my_matrix) # transform to vector my_vec ## [1] 1 5 2 4 1 2 A.4.6 Arrays multi-dimensional generalizations of matrices my_array <- array(c(1, 2, 3, 4, 5, 6, 7, 8), dim = c(2, 2, 2)) my_array[1, 1, 1] ## [1] 1 my_array[2, 2, 1] ## [1] 4 my_array[1, , ] ## [,1] [,2] ## [1,] 1 5 ## [2,] 3 7 dim(my_array) ## [1] 2 2 2 A.4.7 Data frames basic data structure for analysis differ from matrices as columns can be of different types student_data <- data.frame("Name" = student_names, "Age" = student_ages, "Pass" = passed) student_data ## Name Age Pass ## 1 Luke 20 TRUE ## 2 Jen 24 TRUE ## 3 Mike 21 FALSE colnames(student_data) <- c("name", "age", "pass") # change column names student_data[1, ] ## name age pass ## 1 Luke 20 TRUE student_data[ ,colnames(student_data) %in% c("name", "pass")] ## name pass ## 1 Luke TRUE ## 2 Jen TRUE ## 3 Mike FALSE student_data$pass # access column by name ## [1] TRUE TRUE FALSE student_data[student_data$pass == TRUE, ] ## name age pass ## 1 Luke 20 TRUE ## 2 Jen 24 TRUE A.4.8 Lists useful for storing different data structures access elements with double square brackets elements can be named first_list <- list(student_ages, my_matrix, student_data) second_list <- list(student_ages, my_matrix, student_data, first_list) first_list[[1]] ## [1] 20 24 21 second_list[[4]] ## [[1]] ## [1] 20 24 21 ## ## [[2]] ## [,1] [,2] [,3] ## [1,] 1 2 1 ## [2,] 5 4 2 ## ## [[3]] ## name age pass ## 1 Luke 20 TRUE ## 2 Jen 24 TRUE ## 3 Mike 21 FALSE second_list[[4]][[1]] # first element of the fourth element of second_list ## [1] 20 24 21 length(second_list) ## [1] 4 second_list[[length(second_list) + 1]] <- "add_me" # append an element names(first_list) <- c("Age", "Matrix", "Data") first_list$Age ## [1] 20 24 21 A.4.9 Loops mostly for loop for loop can iterate over an arbitrary vector # iterate over consecutive natural numbers my_sum <- 0 for (i in 1:10) { my_sum <- my_sum + i } my_sum ## [1] 55 # iterate over an arbirary vector my_sum <- 0 some_numbers <- c(2, 3.5, 6, 100) for (i in some_numbers) { my_sum <- my_sum + i } my_sum ## [1] 111.5 A.5 Functions for help use ?function_name A.5.1 Writing functions We can write our own functions with function(). In the brackets, we define the parameters the function gets, and in curly brackets we define what the function does. We use return() to return values. sum_first_n_elements <- function (n) { my_sum <- 0 for (i in 1:n) { my_sum <- my_sum + i } return (my_sum) } sum_first_n_elements(10) ## [1] 55 A.6 Other tips Use set.seed(arbitrary_number) at the beginning of a script to set the seed and ensure replication. 
To dynamically set the working directory in R Studio to the parent folder of a R script use setwd(dirname(rstudioapi::getSourceEditorContext()$path)). To avoid slow R loops use the apply family of functions. See ?apply and ?lapply. To make your data manipulation (and therefore your life) a whole lot easier, use the dplyr package. Use getAnywhere(function_name) to get the source code of any function. Use browser for debugging. See ?browser. A.7 Further reading and references Getting started with R Studio: https://www.youtube.com/watch?v=lVKMsaWju8w Official R manuals: https://cran.r-project.org/manuals.html Cheatsheets: https://www.rstudio.com/resources/cheatsheets/ Workshop on R, dplyr, ggplot2, and R Markdown: https://github.com/bstatcomp/Rworkshop "],["distributions.html", "B Probability distributions", " B Probability distributions Name parameters support pdf/pmf mean variance Bernoulli \\(p \\in [0,1]\\) \\(k \\in \\{0,1\\}\\) \\(p^k (1 - p)^{1 - k}\\) 1.12 \\(p\\) 7.1 \\(p(1-p)\\) 7.1 binomial \\(n \\in \\mathbb{N}\\), \\(p \\in [0,1]\\) \\(k \\in \\{0,1,\\dots,n\\}\\) \\(\\binom{n}{k} p^k (1 - p)^{n - k}\\) 4.4 \\(np\\) 7.2 \\(np(1-p)\\) 7.2 Poisson \\(\\lambda > 0\\) \\(k \\in \\mathbb{N}_0\\) \\(\\frac{\\lambda^k e^{-\\lambda}}{k!}\\) 4.6 \\(\\lambda\\) 7.3 \\(\\lambda\\) 7.3 geometric \\(p \\in (0,1]\\) \\(k \\in \\mathbb{N}_0\\) \\(p(1-p)^k\\) 4.5 \\(\\frac{1 - p}{p}\\) 7.4 \\(\\frac{1 - p}{p^2}\\) 9.3 normal \\(\\mu \\in \\mathbb{R}\\), \\(\\sigma^2 > 0\\) \\(x \\in \\mathbb{R}\\) \\(\\frac{1}{\\sqrt{2 \\pi \\sigma^2}} e^{-\\frac{(x - \\mu)^2}{2 \\sigma^2}}\\) 4.12 \\(\\mu\\) 7.8 \\(\\sigma^2\\) 7.8 uniform \\(a,b \\in \\mathbb{R}\\), \\(a < b\\) \\(x \\in [a,b]\\) \\(\\frac{1}{b-a}\\) 4.9 \\(\\frac{a+b}{2}\\) \\(\\frac{(b-a)^2}{12}\\) beta \\(\\alpha,\\beta > 0\\) \\(x \\in [0,1]\\) \\(\\frac{x^{\\alpha - 1} (1 - x)^{\\beta - 1}}{\\text{B}(\\alpha, \\beta)}\\) 4.10 \\(\\frac{\\alpha}{\\alpha + \\beta}\\) 7.6 \\(\\frac{\\alpha \\beta}{(\\alpha + \\beta)^2(\\alpha + \\beta + 1)}\\) 7.6 gamma \\(\\alpha,\\beta > 0\\) \\(x \\in (0, \\infty)\\) \\(\\frac{\\beta^\\alpha}{\\Gamma(\\alpha)} x^{\\alpha - 1}e^{-\\beta x}\\) 4.11 \\(\\frac{\\alpha}{\\beta}\\) 7.5 \\(\\frac{\\alpha}{\\beta^2}\\) 7.5 exponential \\(\\lambda > 0\\) \\(x \\in [0, \\infty)\\) \\(\\lambda e^{-\\lambda x}\\) 4.8 \\(\\frac{1}{\\lambda}\\) 7.7 \\(\\frac{1}{\\lambda^2}\\) 7.7 logistic \\(\\mu \\in \\mathbb{R}\\), \\(s > 0\\) \\(x \\in \\mathbb{R}\\) \\(\\frac{e^{-\\frac{x - \\mu}{s}}}{s(1 + e^{-\\frac{x - \\mu}{s}})^2}\\) 4.13 \\(\\mu\\) \\(\\frac{s^2 \\pi^2}{3}\\) negative binomial \\(r \\in \\mathbb{N}\\), \\(p \\in [0,1]\\) \\(k \\in \\mathbb{N}_0\\) \\(\\binom{k + r - 1}{k}(1-p)^r p^k\\) 4.7 \\(\\frac{rp}{1 - p}\\) 9.2 \\(\\frac{rp}{(1 - p)^2}\\) 9.2 multinomial \\(n \\in \\mathbb{N}\\), \\(k \\in \\mathbb{N}\\) \\(p_i \\in [0,1]\\), \\(\\sum p_i = 1\\) \\(x_i \\in \\{0,..., n\\}\\), \\(i \\in \\{1,...,k\\}\\), \\(\\sum{x_i} = n\\) \\(\\frac{n!}{x_1!x_2!...x_k!} p_1^{x_1} p_2^{x_2}...p_k^{x_k}\\) 8.1 \\(np_i\\) \\(np_i(1-p_i)\\) "],["references.html", "References", " References "],["404.html", "Page not found", " Page not found The page you requested cannot be found (perhaps it was moved or renamed). You may want to try searching to find the page's new location, or use the table of contents to find the page you are looking for. 
"]] diff --git a/docs/uprobspaces.html b/docs/uprobspaces.html index 4989568..53ba810 100644 --- a/docs/uprobspaces.html +++ b/docs/uprobspaces.html @@ -20,10 +20,10 @@ - + - + diff --git a/index.Rmd b/index.Rmd index 09b1aca..41c8501 100644 --- a/index.Rmd +++ b/index.Rmd @@ -1,6 +1,6 @@ --- title: "Principles of Uncertainty -- exercises" -author: "Gregor Pirš and Erik Štrumbelj" +author: "Gregor Pirš, Erik Štrumbelj, David Nabergoj and Leon Hvastja" date: "`r Sys.Date()`" site: bookdown::bookdown_site documentclass: book