Merge pull request #1 from UBC-STAT/2024-site

2024 site
UBC-STAT · Jan 9, 2024 · ce55229 · ce55229
2 parents ebbaf58 + aa9ec63
commit ce55229
Show file tree

Hide file tree

Showing 175 changed files with 20,513 additions and 1,648 deletions.
diff --git a/.Rprofile b/.Rprofile
diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml
@@ -17,20 +17,6 @@ jobs:
       - name: Set up Quarto
         uses: quarto-dev/quarto-actions/setup@v2
 
-      - name: Install libcurl on Linux
-        if: runner.os == 'Linux'
-        run: sudo apt-get update -y && sudo apt-get install -y libcurl4-openssl-dev
-
-      - name: Install R
-        uses: r-lib/actions/setup-r@v2
-        with:
-          r-version: '4.2.0'
-
-      - name: Install R Dependencies
-        uses: r-lib/actions/setup-renv@v2
-        with:
-          cache-version: 1
-
       - name: Render and Publish
         uses: quarto-dev/quarto-actions/publish@v2
         with:

diff --git a/_freeze/schedule/index/execute-results/html.json b/_freeze/schedule/index/execute-results/html.json
@@ -0,0 +1,16 @@
+{
+  "hash": "c2b921b8b26d6d86e1b90497439df844",
+  "result": {
+    "markdown": "---\ntitle: \"<i class='bi bi-calendar3'></i> Lecture slides and handouts\"\n---\n\n\n\n::: .callout-note\nTerm 2 runs 8 January to 13 April. \n:::\n\n::: {.callout-warning}\nSchedule is on Canvas. All dates subject to change.\n:::\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n\n## Lecture slides\n\n* [Bootstrap](slides/bootstrap.qmd)\n* [Cluster computing](slides/cluster-computing.qmd)\n* [Git and GitHub](slides/git.qmd)\n* [Skills for graduate students](slides/grad-school.qmd)\n* [Tips for organization](slides/organization.qmd)\n* [Giving presentations](slides/presentations.qmd)\n* [Time series](slides/time-series.qmd)\n* [Unit tests](slides/unit-tests.qmd)\n\n## Handouts\n\n* [Recommended books and sources](handouts/reference-books.qmd)\n* [Formatting consulting reports](handouts/report-formatting.qmd)\n\n\n<!--\n\n\n## Guest lecturers\n\n-->\n",
+    "supporting": [
+      "index_files"
+    ],
+    "filters": [
+      "rmarkdown/pagebreak.lua"
+    ],
+    "includes": {},
+    "engineDependencies": {},
+    "preserve": {},
+    "postProcess": true
+  }
+}
diff --git a/_freeze/schedule/slides/bootstrap/execute-results/html.json b/_freeze/schedule/slides/bootstrap/execute-results/html.json
@@ -0,0 +1,20 @@
+{
+  "hash": "d2869649cdb959b50f0e8ab08e8f9e05",
+  "result": {
+    "markdown": "---\nlecture: \"The bootstrap\"\nformat: revealjs\nmetadata-files: \n  - _metadata.yml\n---\n---\n---\n\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 09 January 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n\n\n\n## {background-image=\"https://www.azquotes.com/picture-quotes/quote-i-believe-in-pulling-yourself-up-by-your-own-bootstraps-i-believe-it-is-possible-i-saw-stephen-colbert-62-38-03.jpg\" background-size=\"contain\"}\n\n\n## {background-image=\"http://rackjite.com/wp-content/uploads/rr11014aa.jpg\" background-size=\"contain\"}\n\n\n## In statistics...\n\nThe \"bootstrap\" works. And well.\n\nIt's good for \"second-level\" analysis.\n\n* \"First-level\" analyses are things like $\\hat\\beta$, $\\hat y$, an estimator of the center (a median), etc.\n\n* \"Second-level\" are things like $\\Var{\\hat\\beta}$, a confidence interval for $\\hat y$, or a median, etc.\n\nYou usually get these \"second-level\" properties from \"the sampling distribution of an estimator\"\n\n. . .\n\nBut what if you don't know the sampling distribution? Or you're skeptical of the CLT argument?\n\n\n## Refresher on sampling distributions\n\n1. If $X_i$ are iid Normal $(0,\\sigma^2)$, then $\\Var{\\bar{X}} = \\sigma^2 / n$.\n1. If $X_i$ are iid and $n$ is big, then $\\Var{\\bar{X}} \\approx \\Var{X_1} / n$.\n1. If $X_i$ are iid Binomial $(m, p)$, then $\\Var{\\bar{X}} = mp(1-p) / n$\n\n\n\n## Example of unknown sampling distribution\n\nI estimate a LDA on some data.\n\nI get a new $x_0$ and produce $\\hat{Pr}(y_0 =1 \\given x_0)$.\n\nCan I get a 95% confidence interval for $Pr(y_0=1 \\given x_0)$?\n\n. . .\n\nThe bootstrap gives this to you.\n\n\n\n\n## Procedure\n\n1. Resample your training data w/ replacement.\n2. Calculate a LDA on this sample.\n3. Produce a new prediction, call it $\\widehat{Pr}_b(y_0 =1 \\given x_0)$.\n4. Repeat 1-3 $b = 1,\\ldots,B$ times.\n5. CI: $\\left[2\\widehat{Pr}(y_0 =1 \\given x_0) - \\widehat{F}_{boot}(1-\\alpha/2),\\ 2\\widehat{Pr}(y_0 =1 \\given x_0) - \\widehat{F}_{boot}(\\alpha/2)\\right]$\n\n\n\n$\\hat{F}$ is the \"empirical\" distribution of the bootstraps. \n\n\n## Very basic example\n\n* Let $X_i\\sim Exponential(1/5)$. The pdf is $f(x) = \\frac{1}{5}e^{-x/5}$\n\n\n* I know if I estimate the mean with $\\bar{X}$, then by the CLT (if $n$ is big), \n\n$$\\frac{\\sqrt{n}(\\bar{X}-E[X])}{s} \\approx N(0, 1).$$\n\n\n* This gives me a 95% confidence interval like\n$$\\bar{X} \\pm 2 \\frac{s}{\\sqrt{n}}$$\n\n\n* But I don't want to estimate the mean, I want to estimate the median.\n\n\n---\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nggplot(data.frame(x = c(0, 12)), aes(x)) + \n  stat_function(fun = function(x) dexp(x, 1 / 5), \n                color = \"orange\", linewidth = 2) +\n  theme_bw(base_size = 24) +\n  geom_vline(xintercept = 5, color = \"darkblue\", linewidth = 2) + # mean\n  geom_vline(xintercept = qexp(.5,1/5), color = \"red\", linewidth = 2) + # median\n  ylab(\"f(x)\") +\n  annotate(\"label\", x = c(2.5, 5.5, 10), y = c(.15, .15, .05), \n           label = c(\"median\", \"bar(x)\", \"pdf\"), parse = TRUE,\n           color = c(\"red\", \"darkblue\", \"orange\"), size = 10)\n```\n\n::: {.cell-output-display}\n![](bootstrap_files/figure-revealjs/unnamed-chunk-1-1.svg){fig-align='center'}\n:::\n:::\n\n\n## Now what\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\n* I give you a sample of size 500, you give me the sample median.\n\n* How do you get a CI?\n\n* You can use the bootstrap!\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nset.seed(2022-11-01)\nx <- rexp(n, 1 / 5)\n(med <- median(x)) # sample median\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3.669627\n```\n:::\n\n```{.r .cell-code}\nB <- 100\nalpha <- 0.05\nbootMed <- function() median(sample(x, replace = TRUE)) # resample, and get the median\nFhat <- replicate(B, bootMed()) # repeat B times, \"empirical distribution\"\nCI <- 2 * med - quantile(Fhat, probs = c(1 - alpha / 2, alpha / 2))\n```\n:::\n\n\n---\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](bootstrap_files/figure-revealjs/unnamed-chunk-4-1.svg){fig-align='center'}\n:::\n:::\n\n\n## {background-image=\"gfx/boot1.png\" background-size=\"contain\"}\n\n## {background-image=\"gfx/boot2.png\" background-size=\"contain\"}\n\n## Slightly harder example\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(fatcats, aes(Bwt, Hwt)) + \n  geom_point(color = \"darkblue\") + \n  xlab(\"Cat body weight (Kg)\") + \n  theme_bw(base_size = 18) +\n  ylab(\"Cat heart weight (g)\")\n```\n\n::: {.cell-output-display}\n![](bootstrap_files/figure-revealjs/unnamed-chunk-6-1.svg){fig-align='center'}\n:::\n:::\n\n\n:::\n\n::: {.column width=\"50%\"}\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ncats.lm <- lm(Hwt ~ 0 + Bwt, data = fatcats)\nsummary(cats.lm)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = Hwt ~ 0 + Bwt, data = fatcats)\n\nResiduals:\n    Min      1Q  Median      3Q     Max \n-6.9293 -1.0460 -0.1407  0.8298 16.2536 \n\nCoefficients:\n    Estimate Std. Error t value Pr(>|t|)    \nBwt  3.81895    0.07678   49.74   <2e-16 ***\n---\nSignif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 2.549 on 143 degrees of freedom\nMultiple R-squared:  0.9454,\tAdjusted R-squared:  0.945 \nF-statistic:  2474 on 1 and 143 DF,  p-value: < 2.2e-16\n```\n:::\n\n```{.r .cell-code}\nconfint(cats.lm)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n       2.5 %  97.5 %\nBwt 3.667178 3.97073\n```\n:::\n:::\n\n:::\n::::\n\n\n## When we fit models, we examine diagnostics\n\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nqqnorm(residuals(cats.lm))\nqqline(residuals(cats.lm))\n```\n\n::: {.cell-output-display}\n![](bootstrap_files/figure-revealjs/unnamed-chunk-8-1.svg){fig-align='center'}\n:::\n:::\n\n\n\nThe tails are too fat, I don't believe that CI...\n:::\n\n::: {.column width=\"50%\"}\n\n\nWe bootstrap\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nB <- 500\nbhats <- double(B)\nalpha <- .05\nfor (b in 1:B) {\n  samp <- sample(1:nrow(fatcats), replace = TRUE)\n  newcats <- fatcats[samp, ] # new data\n  bhats[b] <- coef(lm(Hwt ~ 0 + Bwt, data = newcats)) \n}\n\n2 * coef(cats.lm) - # Bootstrap CI\n  quantile(bhats, probs = c(1 - alpha / 2, alpha / 2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n   97.5%     2.5% \n3.654977 3.955927 \n```\n:::\n\n```{.r .cell-code}\nconfint(cats.lm) # Original CI\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n       2.5 %  97.5 %\nBwt 3.667178 3.97073\n```\n:::\n:::\n\n:::\n::::\n\n\n## An alternative\n\n* So far, I didn't use any information about the data-generating process. \n\n* We've done the [non-parametric bootstrap]{.secondary}\n\n* This is easiest, and most common for most cases.\n\n. . .\n\n[But there's another version]{.secondary}\n\n* You could try a \"parametric bootstrap\"\n\n* This assumes knowledge about the DGP\n\n## Same data\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n\n[Non-parametric bootstrap]{.secondary}\n\nSame as before\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nB <- 500\nbhats <- double(B)\nalpha <- .05\nfor (b in 1:B) {\n  samp <- sample(1:nrow(fatcats), replace = TRUE)\n  newcats <- fatcats[samp, ] # new data\n  bhats[b] <- coef(lm(Hwt ~ 0 + Bwt, data = newcats)) \n}\n\n2 * coef(cats.lm) - # NP Bootstrap CI\n  quantile(bhats, probs = c(1-alpha/2, alpha/2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n   97.5%     2.5% \n3.673559 3.970251 \n```\n:::\n\n```{.r .cell-code}\nconfint(cats.lm) # Original CI\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n       2.5 %  97.5 %\nBwt 3.667178 3.97073\n```\n:::\n:::\n\n:::\n\n::: {.column width=\"50%\"}\n[Parametric bootstrap]{.secondary}\n\n1. Assume that the linear model is TRUE.\n2. Then, $\\texttt{Hwt}_i = \\widehat{\\beta}\\times \\texttt{Bwt}_i + \\widehat{e}_i$, $\\widehat{e}_i \\approx \\epsilon_i$\n3. The $\\epsilon_i$ is random $\\longrightarrow$ just resample $\\widehat{e}_i$.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nB <- 500\nbhats <- double(B)\nalpha <- .05\ncats.lm <- lm(Hwt ~ 0 + Bwt, data = fatcats)\nnewcats <- fatcats\nfor (b in 1:B) {\n  samp <- sample(residuals(cats.lm), replace = TRUE)\n  newcats$Hwt <- predict(cats.lm) + samp # new data\n  bhats[b] <- coef(lm(Hwt ~ 0 + Bwt, data = newcats)) \n}\n\n2 * coef(cats.lm) - # Parametric Bootstrap CI\n  quantile(bhats, probs = c(1 - alpha/2, alpha/2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n   97.5%     2.5% \n3.665531 3.961896 \n```\n:::\n:::\n\n\n:::\n::::\n\n## Bootstrap error sources\n\n\n[Simulation error]{.secondary}:\n\nusing only $B$ samples to estimate $F$ with $\\hat{F}$.\n\n[Statistical error]{.secondary}:\n\nour data depended on a sample from the population. We don't have the whole population so we make an error by using a sample \n\n(Note: this part is what __always__ happens with data, and what the science of statistics analyzes.)\n\n[Specification error]{.secondary}:\n\nIf we use the parametric bootstrap, and our model is wrong, then we are overconfident.\n\n\n\n",
+    "supporting": [
+      "bootstrap_files"
+    ],
+    "filters": [
+      "rmarkdown/pagebreak.lua"
+    ],
+    "includes": {
+      "include-after-body": [
+        "\n<script>\n  // htmlwidgets need to know to resize themselves when slides are shown/hidden.\n  // Fire the \"slideenter\" event (handled by htmlwidgets.js) when the current\n  // slide changes (different for each slide format).\n  (function () {\n    // dispatch for htmlwidgets\n    function fireSlideEnter() {\n      const event = window.document.createEvent(\"Event\");\n      event.initEvent(\"slideenter\", true, true);\n      window.document.dispatchEvent(event);\n    }\n\n    function fireSlideChanged(previousSlide, currentSlide) {\n      fireSlideEnter();\n\n      // dispatch for shiny\n      if (window.jQuery) {\n        if (previousSlide) {\n          window.jQuery(previousSlide).trigger(\"hidden\");\n        }\n        if (currentSlide) {\n          window.jQuery(currentSlide).trigger(\"shown\");\n        }\n      }\n    }\n\n    // hookup for slidy\n    if (window.w3c_slidy) {\n      window.w3c_slidy.add_observer(function (slide_num) {\n        // slide_num starts at position 1\n        fireSlideChanged(null, w3c_slidy.slides[slide_num - 1]);\n      });\n    }\n\n  })();\n</script>\n\n"
+      ]
+    },
+    "engineDependencies": {},
+    "preserve": {},
+    "postProcess": true
+  }
+}