From e4e5f33583a9ff436482f5ce5f3a6156402fb604 Mon Sep 17 00:00:00 2001
From: David Nabergoj
Date: Thu, 11 Jan 2024 17:07:03 +0100
Subject: [PATCH] Confidence -> coverage

---
 14-bootstrap.Rmd       | 2 +-
 docs/boot.html         | 2 +-
 docs/search_index.json | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/14-bootstrap.Rmd b/14-bootstrap.Rmd
index c979f5a..8be3af7 100644
--- a/14-bootstrap.Rmd
+++ b/14-bootstrap.Rmd
@@ -380,7 +380,7 @@ bootstrap(x, T_max)
 ```
 
-```{exercise, name = "Practical - and fictional - confidence interval comparison"}
+```{exercise, name = "Practical - and fictional - coverage interval comparison"}
 In this exercise, we investigate how different kinds of CI's behave as we vary the number of measurements.
 
 The story behind the data: it's 2025 and we've discovered that Slovenia has rich deposits of a rare mineral called Moustachium, which can be used to accelerate moustache growth. This mineral is highly sought, so the government has decided to contract two different companies to provide information on where to begin mining. Both companies investigated mining sites in each statistical region and gave their best estimate of the average Moustachium concentration in tonnes per square kilometer. The Data Science team has been called to estimate the uncertainty in these estimates and help avoid mining in the wrong region.

diff --git a/docs/boot.html b/docs/boot.html
index 1c24040..b0e77d4 100644
--- a/docs/boot.html
+++ b/docs/boot.html
@@ -717,7 +717,7 @@

 Chapter 14 Bootstrap
 # In general, bootstrap will fail when estimating the CI for the maximum.
-Exercise 14.6 (Practical - and fictional - confidence interval comparison) In this exercise, we investigate how different kinds of CI’s behave as we vary the number of measurements.
+Exercise 14.6 (Practical - and fictional - coverage interval comparison) In this exercise, we investigate how different kinds of CI’s behave as we vary the number of measurements.
 The story behind the data: it’s 2025 and we’ve discovered that Slovenia has rich deposits of a rare mineral called Moustachium, which can be used to accelerate moustache growth. This mineral is highly sought, so the government has decided to contract two different companies to provide information on where to begin mining. Both companies investigated mining sites in each statistical region and gave their best estimate of the average Moustachium concentration in tonnes per square kilometer. The Data Science team has been called to estimate the uncertainty in these estimates and help avoid mining in the wrong region.
 Generate synthetic data with the script below:
 set.seed(0)
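
For context while reviewing the rename: the exercise compares interval constructions such as the asymptotic-normal CI and the bootstrap percentile CI. Below is a minimal R sketch of those two flavours on a single region's gamma-distributed measurements; the sample size, shape, and rate are illustrative assumptions, not values taken from the generated dataset.

```r
# Minimal sketch (illustrative, not part of the patch): the two CI flavours the
# renamed exercise compares, computed for one region's simulated measurements.
# Sample size, shape, and rate below are assumptions for illustration only.
set.seed(0)
x <- rgamma(10, shape = 5.0, rate = 1.8)   # one region, 10 "contractor" measurements

# 95% CI from asymptotic normality: mean +/- 1.96 * SE
se <- sd(x) / sqrt(length(x))
ci_normal <- mean(x) + c(-1.96, 1.96) * se

# 95% bootstrap percentile CI: empirical quantiles of resampled means
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))
ci_percentile <- quantile(boot_means, probs = c(0.025, 0.975))

ci_normal
ci_percentile
```

With only a handful of measurements the two intervals can differ noticeably, which is the behaviour the exercise asks students to track as the number of contractors grows.
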
diff --git a/docs/search_index.json b/docs/search_index.json
index 97c659d..6f2fbb8 100644
--- a/docs/search_index.json
+++ b/docs/search_index.json
@@ -1 +1 @@
-[["boot.html", "Chapter 14 Bootstrap", " Chapter 14 Bootstrap This chapter deals with bootstrap. The students are expected to acquire the following knowledge: How to use bootstrap to generate coverage intervals. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } Exercise 14.1 Ideally, a \\(1-\\alpha\\) CI would have \\(1-\\alpha\\) coverage. That is, say a 95% CI should, in the long run, contain the true value of the parameter 95% of the time. In practice, it is impossible to assess the coverage of our CI method, because we rarely know the true parameter. In simulation, however, we can. Let’s assess the coverage of bootstrap percentile intervals. Pick a univariate distribution with readily available mean and one that you can easily sample from. Draw \\(n = 30\\) random samples from the chosen distribution and use the bootstrap (with large enough m) and percentile CI method to construct 95% CI. Repeat the process many times and count how many times the CI contains the true mean. That is, compute the actual coverage probability (don’t forget to include the standard error of the coverage probability!). What can you observe? Try one or two different distributions. What can you observe? Repeat (b) and (c) using BCa intervals (R package boot). How does the coverage compare to percentile intervals? As (d) but using intervals based on asymptotic normality (+/- 1.96 SE). How do results from (b), (d), and (e) change if we increase the sample size to n = 200? What about n = 5? library(boot) set.seed(0) nit <- 1000 # Repeat the process "many times" alpha <- 0.05 # CI parameter nboot <- 100 # m parameter for bootstrap ("large enough m") # f: change this to 200 or 5. nsample <- 30 # n = 30 random samples from the chosen distribution. Comment out BCa code if it breaks. 
covers <- matrix(nrow = nit, ncol = 3) covers_BCa <- matrix(nrow = nit, ncol = 3) covers_asymp_norm <- matrix(nrow = nit, ncol = 3) isin <- function (x, lower, upper) { (x > lower) & (x < upper) } for (j in 1:nit) { # Repeating many times # a: pick a univariate distribution - standard normal x1 <- rnorm(nsample) # c: one or two different distributions - beta and poisson x2 <- rbeta(nsample, 1, 2) x3 <- rpois(nsample, 5) X1 <- matrix(data = NA, nrow = nsample, ncol = nboot) X2 <- matrix(data = NA, nrow = nsample, ncol = nboot) X3 <- matrix(data = NA, nrow = nsample, ncol = nboot) for (i in 1:nboot) { X1[ ,i] <- sample(x1, nsample, replace = T) X2[ ,i] <- sample(x2, nsample, T) X3[ ,i] <- sample(x3, nsample, T) } X1_func <- apply(X1, 2, mean) X2_func <- apply(X2, 2, mean) X3_func <- apply(X3, 2, mean) X1_quant <- quantile(X1_func, probs = c(alpha / 2, 1 - alpha / 2)) X2_quant <- quantile(X2_func, probs = c(alpha / 2, 1 - alpha / 2)) X3_quant <- quantile(X3_func, probs = c(alpha / 2, 1 - alpha / 2)) covers[j,1] <- (0 > X1_quant[1]) & (0 < X1_quant[2]) covers[j,2] <- ((1 / 3) > X2_quant[1]) & ((1 / 3) < X2_quant[2]) covers[j,3] <- (5 > X3_quant[1]) & (5 < X3_quant[2]) mf <- function (x, i) return(mean(x[i])) bootX1 <- boot(x1, statistic = mf, R = nboot) bootX2 <- boot(x2, statistic = mf, R = nboot) bootX3 <- boot(x3, statistic = mf, R = nboot) X1_quant_BCa <- boot.ci(bootX1, type = "bca")$bca X2_quant_BCa <- boot.ci(bootX2, type = "bca")$bca X3_quant_BCa <- boot.ci(bootX3, type = "bca")$bca covers_BCa[j,1] <- (0 > X1_quant_BCa[4]) & (0 < X1_quant_BCa[5]) covers_BCa[j,2] <- ((1 / 3) > X2_quant_BCa[4]) & ((1 / 3) < X2_quant_BCa[5]) covers_BCa[j,3] <- (5 > X3_quant_BCa[4]) & (5 < X3_quant_BCa[5]) # e: estimate mean and standard error # sample mean: x1_bar <- mean(x1) x2_bar <- mean(x2) x3_bar <- mean(x3) # standard error (of the sample mean) estimate: sample standard deviation / sqrt(n) x1_bar_SE <- sd(x1) / sqrt(nsample) x2_bar_SE <- sd(x2) / sqrt(nsample) x3_bar_SE <- sd(x3) / sqrt(nsample) covers_asymp_norm[j,1] <- isin(0, x1_bar - 1.96 * x1_bar_SE, x1_bar + 1.96 * x1_bar_SE) covers_asymp_norm[j,2] <- isin(1/3, x2_bar - 1.96 * x2_bar_SE, x2_bar + 1.96 * x2_bar_SE) covers_asymp_norm[j,3] <- isin(5, x3_bar - 1.96 * x3_bar_SE, x3_bar + 1.96 * x3_bar_SE) } apply(covers, 2, mean) ## [1] 0.918 0.925 0.905 apply(covers, 2, sd) / sqrt(nit) ## [1] 0.008680516 0.008333333 0.009276910 apply(covers_BCa, 2, mean) ## [1] 0.927 0.944 0.927 apply(covers_BCa, 2, sd) / sqrt(nit) ## [1] 0.008230355 0.007274401 0.008230355 apply(covers_asymp_norm, 2, mean) ## [1] 0.939 0.937 0.930 apply(covers_asymp_norm, 2, sd) / sqrt(nit) ## [1] 0.007572076 0.007687008 0.008072494 Exercise 14.2 You are given a sample of independent observations from a process of interest: Index 1 2 3 4 5 6 7 8 X 7 2 4 6 4 5 9 10 Compute the plug-in estimate of mean and 95% symmetric CI based on asymptotic normality. Use the plug-in estimate of SE. Same as (a), but use the unbiased estimate of SE. Apply nonparametric bootstrap with 1000 bootstrap replications and estimate the 95% CI for the mean with percentile-based CI. 
# a x <- c(7, 2, 4, 6, 4, 5, 9, 10) n <- length(x) mu <- mean(x) SE <- sqrt(mean((x - mu)^2)) / sqrt(n) SE ## [1] 0.8915839 z <- qnorm(1 - 0.05 / 2) c(mu - z * SE, mu + z * SE) ## [1] 4.127528 7.622472 # b SE <- sd(x) / sqrt(n) SE ## [1] 0.9531433 c(mu - z * SE, mu + z * SE) ## [1] 4.006873 7.743127 # c set.seed(0) m <- 1000 T_mean <- function(x) {mean(x)} est_boot <- array(NA, m) for (i in 1:m) { x_boot <- x[sample(1:n, n, rep = T)] est_boot[i] <- T_mean(x_boot) } quantile(est_boot, p = c(0.025, 0.975)) ## 2.5% 97.5% ## 4.250 7.625 Exercise 14.3 We are given a sample of 10 independent paired (bivariate) observations: Index 1 2 3 4 5 6 7 8 9 10 X 1.26 -0.33 1.33 1.27 0.41 -1.54 -0.93 -0.29 -0.01 2.40 Y 2.64 0.33 0.48 0.06 -0.88 -2.14 -2.21 0.95 0.83 1.45 Compute Pearson correlation between X and Y. Use the cor.test() from R to estimate a 95% CI for the estimate from (a). Apply nonparametric bootstrap with 1000 bootstrap replications and estimate the 95% CI for the Pearson correlation with percentile-based CI. Compare CI from (b) and (c). Are they similar? How would the bootstrap estimation of CI change if we were interested in Spearman or Kendall correlation instead? x <- c(1.26, -0.33, 1.33, 1.27, 0.41, -1.54, -0.93, -0.29, -0.01, 2.40) y <- c(2.64, 0.33, 0.48, 0.06, -0.88, -2.14, -2.21, 0.95, 0.83, 1.45) # a cor(x, y) ## [1] 0.6991247 # b res <- cor.test(x, y) res$conf.int[1:2] ## [1] 0.1241458 0.9226238 # c set.seed(0) m <- 1000 n <- length(x) T_cor <- function(x, y) {cor(x, y)} est_boot <- array(NA, m) for (i in 1:m) { idx <- sample(1:n, n, rep = T) # !!! important to use same indices to keep dependency between x and y est_boot[i] <- T_cor(x[idx], y[idx]) } quantile(est_boot, p = c(0.025, 0.975)) ## 2.5% 97.5% ## 0.2565537 0.9057664 # d # Yes, but the bootstrap CI is more narrow. # e # We just use the functions for Kendall/Spearman coefficients instead: T_kendall <- function(x, y) {cor(x, y, method = "kendall")} T_spearman <- function(x, y) {cor(x, y, method = "spearman")} # Put this in a function that returns the CI bootstrap_95_ci <- function(x, y, t, m = 1000) { n <- length(x) est_boot <- array(NA, m) for (i in 1:m) { idx <- sample(1:n, n, rep = T) # !!! important to use same indices to keep dependency between x and y est_boot[i] <- t(x[idx], y[idx]) } quantile(est_boot, p = c(0.025, 0.975)) } bootstrap_95_ci(x, y, T_kendall) ## 2.5% 97.5% ## -0.08108108 0.78378378 bootstrap_95_ci(x, y, T_spearman) ## 2.5% 97.5% ## -0.1701115 0.8867925 Exercise 14.4 In this problem we will illustrate the use of the nonparametric bootstrap for estimating CIs of regression model coefficients. Load the longley dataset from base R with data(longley). Use lm() to apply linear regression using “Employed” as the target (dependent) variable and all other variables as the predictors (independent). Using lm() results, print the estimated regression coefficients and standard errors. Estimate 95% CI for the coefficients using +/- 1.96 * SE. Use nonparametric bootstrap with 100 replications to estimate the SE of the coefficients from (b). Compare the SE from (c) with those from (b). # a data(longley) # b res <- lm(Employed ~ . 
, longley) tmp <- data.frame(summary(res)$coefficients[,1:2]) tmp$LB <- tmp[,1] - 1.96 * tmp[,2] tmp$UB <- tmp[,1] + 1.96 * tmp[,2] tmp ## Estimate Std..Error LB UB ## (Intercept) -3.482259e+03 8.904204e+02 -5.227483e+03 -1.737035e+03 ## GNP.deflator 1.506187e-02 8.491493e-02 -1.513714e-01 1.814951e-01 ## GNP -3.581918e-02 3.349101e-02 -1.014616e-01 2.982320e-02 ## Unemployed -2.020230e-02 4.883997e-03 -2.977493e-02 -1.062966e-02 ## Armed.Forces -1.033227e-02 2.142742e-03 -1.453204e-02 -6.132495e-03 ## Population -5.110411e-02 2.260732e-01 -4.942076e-01 3.919994e-01 ## Year 1.829151e+00 4.554785e-01 9.364136e-01 2.721889e+00 # c set.seed(0) m <- 100 n <- nrow(longley) T_coef <- function(x) { lm(Employed ~ . , x)$coefficients } est_boot <- array(NA, c(m, ncol(longley))) for (i in 1:m) { idx <- sample(1:n, n, rep = T) est_boot[i,] <- T_coef(longley[idx,]) } SE <- apply(est_boot, 2, sd) SE ## [1] 1.826011e+03 1.605981e-01 5.693746e-02 8.204892e-03 3.802225e-03 ## [6] 3.907527e-01 9.414436e-01 # Show the standard errors around coefficients library(ggplot2) library(reshape2) df <- data.frame(index = 1:7, bootstrap_SE = SE, lm_SE = tmp$Std..Error) melted_df <- melt(df[2:nrow(df), ], id.vars = "index") # Ignore bias which has a really large magnitude ggplot(melted_df, aes(x = index, y = value, fill = variable)) + geom_bar(stat="identity", position="dodge") + xlab("Coefficient") + ylab("Standard error") # + scale_y_continuous(trans = "log") # If you want to also plot bias Exercise 14.5 This exercise shows a shortcoming of the bootstrap method when using the plug in estimator for the maximum. Compute the 95% bootstrap CI for the maximum of a standard normal distribution. Compute the 95% bootstrap CI for the maximum of a binomial distribution with n = 15 and p = 0.2. Repeat (b) using p = 0.9. Why is the result different? # bootstrap CI for maximum alpha <- 0.05 T_max <- function(x) {max(x)} # Equal to T_max = max bootstrap <- function(x, t, m = 1000) { n <- length(x) values <- rep(0, m) for (i in 1:m) { values[i] <- t(sample(x, n, replace = T)) } quantile(values, probs = c(alpha / 2, 1 - alpha / 2)) } # a # Meaningless, as the normal distribution can yield arbitrarily large values. x <- rnorm(100) bootstrap(x, T_max) ## 2.5% 97.5% ## 1.819425 2.961743 # b x <- rbinom(100, size = 15, prob = 0.2) # min = 0, max = 15 bootstrap(x, T_max) ## 2.5% 97.5% ## 6 7 # c x <- rbinom(100, size = 15, prob = 0.9) # min = 0, max = 15 bootstrap(x, T_max) ## 2.5% 97.5% ## 15 15 # Observation: to estimate the maximum, we need sufficient probability mass near the maximum value the distribution can yield. # Using bootstrap is pointless when there is too little mass near the true maximum. # In general, bootstrap will fail when estimating the CI for the maximum. Exercise 14.6 (Practical - and fictional - confidence interval comparison) In this exercise, we investigate how different kinds of CI’s behave as we vary the number of measurements. The story behind the data: it’s 2025 and we’ve discovered that Slovenia has rich deposits of a rare mineral called Moustachium, which can be used to accelerate moustache growth. This mineral is highly sought, so the government has decided to contract two different companies to provide information on where to begin mining. Both companies investigated mining sites in each statistical region and gave their best estimate of the average Moustachium concentration in tonnes per square kilometer. 
The Data Science team has been called to estimate the uncertainty in these estimates and help avoid mining in the wrong region. Generate synthetic data with the script below: set.seed(0) library(comprehenr) regions <- c("pomurska", "podravska", "koroska", "savinjska", "zasavska", "posavska", "JV Slovenija", "primorsko-notranjska", "osrednjeslovenska", "gorenjska", "goriska", "obalno-kraska") region_rates <- seq(1.3, 2.3, length.out=length(regions)) region_rates <- region_rates[sample.int(length(regions), length(regions))] make_dataset <- function(n_contractors) { measurements <- matrix(nrow=length(regions), ncol=n_contractors) for (i in 1:length(regions)) { measurements[i,] <- rgamma(n_contractors, 5.0, region_rates[i]) } df <- data.frame(measurements) row.names(df) <- regions names(df) <- to_vec(for(i in 1:n_contractors) paste("Contractor", i)) return(df) } set.seed(0) df_2025 <- make_dataset(2) set.seed(0) df_2027 <- make_dataset(10) set.seed(0) df_2028 <- make_dataset(100) set.seed(0) df_2029 <- make_dataset(1000) saveRDS(df_2025, file="moustachium_2025.Rda") saveRDS(df_2027, file="moustachium_2027.Rda") saveRDS(df_2028, file="moustachium_2028.Rda") saveRDS(df_2029, file="moustachium_2029.Rda") Estimate the average concentration for different regions. Estimate the average concentration uncertainty using 95% CI’s (asymptotic normality with biased and unbiased standard error, standard bootstrap CI, bootstrap percentile CI). Visualize uncertainties with a histogram and discuss the best location to start mining. The year is 2027 and the government has decided to contract 10 companies. Rerun the code with new measurements and discuss how CI’s change. Technological advancements in robotics have enabled site surveys on a massive scale. Repeat the last point for 100 surveyor robots in 2028 and 1000 surveyor robots in 2029. library(ggplot2) library(dplyr) library(data.table) set.seed(0) input_dataset_path = "moustachium_2025.Rda" # Change this for points d and e output_plot_path = "moustachium_2025.pdf" # Change this for points d and e df <- readRDS(input_dataset_path) # Data comes from here n_contractors <- ncol(df) results_df <- data.frame(region=row.names(df)) # Store CI bounds here # 1. 
average concentration for different mining sites results_df$average_concetration <- rowMeans(df) # CI for the mean based on asymptotic normality (biased SE estimate) biased_SE <- sqrt(apply(df, 1, function(vec) {sum((vec - mean(vec))^2) / length(vec)}) / n_contractors) results_df$ci95.an.biased_var.low <- results_df$average_concetration - 1.96 * biased_SE results_df$ci95.an.biased_var.high <- results_df$average_concetration + 1.96 * biased_SE # CI for the mean based on asymptotic normality (unbiased SE estimate) unbiased_SE <- sqrt(apply(df, 1, var) / n_contractors) results_df$ci95.an.unbiased_var.low <- results_df$average_concetration - 1.96 * unbiased_SE results_df$ci95.an.unbiased_var.high <- results_df$average_concetration + 1.96 * unbiased_SE # Standard bootstrap CI with 1000 samples bootstrap_variance <- function(data, n_samples) { # n_samples is m in pseudocode output <- numeric(n_samples) for (i in 1:n_samples) { index <- sample(1:length(data), length(data), rep = TRUE) resampled_data <- data[index] output[i] <- mean(resampled_data) } return(var(output)) } bootstrap_1000_sd <- sqrt(apply(df, 1, function(vec){bootstrap_variance(vec, 1000)})) results_df$ci95.bootstrap.standard.1000.low <- results_df$average_concetration - 1.96 * bootstrap_1000_sd results_df$ci95.bootstrap.standard.1000.high <- results_df$average_concetration + 1.96 * bootstrap_1000_sd # Bootstrap percentile CI with 1000 samples bootstrap_quantile <- function(data, functional, n_samples, probs) { # n_samples is m in pseudocode output <- numeric(n_samples) for (i in 1:n_samples) { index <- sample(1:length(data), length(data), rep = TRUE) resampled_data <- data[index] output[i] <- functional(resampled_data) } return(quantile(output, probs=probs)) } results_df$ci95.bootstrap.percentile.1000.low <- apply(df, 1, function(vec){bootstrap_quantile(vec, mean, 1000, 0.025)}) results_df$ci95.bootstrap.percentile.1000.high <- apply(df, 1, function(vec){bootstrap_quantile(vec, mean, 1000, 0.975)}) results_df ## region average_concetration ci95.an.biased_var.low ## 1 pomurska 2.814731 1.5351811 ## 2 podravska 2.646518 1.5358919 ## 3 koroska 2.010216 0.5956186 ## 4 savinjska 4.618001 4.4057369 ## 5 zasavska 2.458873 2.0050840 ## 6 posavska 2.153802 1.9001244 ## 7 JV Slovenija 2.433503 1.6860397 ## 8 primorsko-notranjska 3.165394 2.9640430 ## 9 osrednjeslovenska 3.696875 3.5592419 ## 10 gorenjska 1.341931 0.2784547 ## 11 goriska 2.767328 2.3255569 ## 12 obalno-kraska 1.580711 1.4533751 ## ci95.an.biased_var.high ci95.an.unbiased_var.low ci95.an.unbiased_var.high ## 1 4.094281 1.005174095 4.624288 ## 2 3.757144 1.075855548 4.217180 ## 3 3.424813 0.009673183 4.010759 ## 4 4.830264 4.317814385 4.918187 ## 5 2.912662 1.817118318 3.100628 ## 6 2.407479 1.795047746 2.512556 ## 7 3.180965 1.376430415 3.490575 ## 8 3.366746 2.880640556 3.450148 ## 9 3.834508 3.502232367 3.891518 ## 10 2.405407 -0.162051549 2.845913 ## 11 3.209099 2.142569481 3.392086 ## 12 1.708047 1.400630772 1.760792 ## ci95.bootstrap.standard.1000.low ci95.bootstrap.standard.1000.high ## 1 1.5397542 4.089708 ## 2 1.5388631 3.754173 ## 3 0.5492603 3.471171 ## 4 4.4062860 4.829715 ## 5 1.9938049 2.923942 ## 6 1.9010514 2.406552 ## 7 1.6932573 3.173748 ## 8 2.9670216 3.363767 ## 9 3.5602064 3.833544 ## 10 0.2845999 2.399262 ## 11 2.3293359 3.205320 ## 12 1.4543352 1.707087 ## ci95.bootstrap.percentile.1000.low ci95.bootstrap.percentile.1000.high ## 1 1.8914878 3.737975 ## 2 1.8451596 3.447876 ## 3 0.9895308 3.030901 ## 4 4.4648444 4.771157 ## 5 2.1314473 2.786299 ## 6 
1.9707640 2.336840 ## 7 1.8941800 2.972825 ## 8 3.0201118 3.310677 ## 9 3.5975676 3.796183 ## 10 0.5745928 2.109269 ## 11 2.4485735 3.086082 ## 12 1.4888334 1.672589 # Visualization: we use a bar chart with uncertainty bands plot_moustachium_per_region <- function(region_names, average, ci_low, ci_high) { df_visualization <- data.frame(region=region_names, average=average, low=ci_low, high=ci_high) ggplot(df_visualization, aes(x=region, y=average)) + geom_bar(stat="identity") } mask <- endsWith(colnames(results_df), "low") mask[c(1, 2)] <- T results_df_low <- results_df[, mask] colnames(results_df_low) <- gsub('.low','', colnames(results_df_low)) mask <- endsWith(colnames(results_df), "high") mask[c(1, 2)] <- T results_df_high <- results_df[, mask] colnames(results_df_high) <- gsub('.high','', colnames(results_df_high)) long_results_df_low <- melt(setDT(results_df_low), id.vars=c("region", "average_concetration")) names(long_results_df_low) <- c("region", "average_concentration", "variable", "low") long_results_df_high <- melt(setDT(results_df_high), id.vars=c("region", "average_concetration")) names(long_results_df_high) <- c("region", "average_concentration", "variable", "high") long_results_df <- merge(long_results_df_low, long_results_df_high, by=c("region", "variable", "average_concentration"), all=T) moustachium_plot <- ggplot(long_results_df, aes(x=region, y=average_concentration)) + geom_bar(stat="identity", position="dodge", alpha=0.2) + geom_errorbar(aes(ymin=low, ymax=high, color=variable), width=0.2, position=position_dodge(0.9)) + scale_x_discrete(guide = guide_axis(angle = 45)) + ylim(-1, 8) # ggsave(plot=moustachium_plot, width=12, height=8, dpi=300, filename=output_plot_path) moustachium_plot # Visualization: we can also use a map. Circle size denotes concentration in region, low transparency denotes high uncertainty. library(maps) map_data_slo <- map_data('world')[map_data('world')$region == "Slovenia",] map_df <- long_results_df[long_results_df$variable == "ci95.an.biased_var", ] # VERY approximate longitudes and latitudes for different regions. map_df$long <- rep(0, nrow(map_df)) map_df$lat <- rep(0, nrow(map_df)) map_df[map_df$region == "gorenjska"]$long <- 14.2 map_df[map_df$region == "gorenjska"]$lat <- 46.3 map_df[map_df$region == "goriska"]$long <- 13.85 map_df[map_df$region == "goriska"]$lat <- 46.0 map_df[map_df$region == "obalno-kraska"]$long <- 13.9 map_df[map_df$region == "obalno-kraska"]$lat <- 45.65 map_df[map_df$region == "osrednjeslovenska"]$long <- 14.5 map_df[map_df$region == "osrednjeslovenska"]$lat <- 46. map_df[map_df$region == "primorsko-notranjska"]$long <- 14.3 map_df[map_df$region == "primorsko-notranjska"]$lat <- 45.7 map_df[map_df$region == "zasavska"]$long <- 15 map_df[map_df$region == "zasavska"]$lat <- 46.1 map_df[map_df$region == "savinjska"]$long <- 15.2 map_df[map_df$region == "savinjska"]$lat <- 46.25 map_df[map_df$region == "posavska"]$long <- 15.4 map_df[map_df$region == "posavska"]$lat <- 46 map_df[map_df$region == "koroska"]$long <- 15.1 map_df[map_df$region == "koroska"]$lat <- 46.5 map_df[map_df$region == "podravska"]$long <- 15.7 map_df[map_df$region == "podravska"]$lat <- 46.45 map_df[map_df$region == "pomurska"]$long <- 16.2 map_df[map_df$region == "pomurska"]$lat <- 46.65 map_df[map_df$region == "JV Slovenija"]$long <- 15. 
map_df[map_df$region == "JV Slovenija"]$lat <- 45.7 map_df$ci_size <- (map_df$high - map_df$low) map_df$ci_y <- map_df$lat - 0.05 map_df$ci_label <- sprintf("(%.2f, %.2f)", map_df$low, map_df$high) map_df$avg_label <- sprintf("%.2f", map_df$average_concentration) country_plot <- ggplot() + # First layer: worldwide map geom_polygon(data = map_data("world"), aes(x=long, y=lat, group = group), color = '#9c9c9c', fill = '#f3f3f3') + # Second layer: Country map geom_polygon( data = map_data_slo, aes(x=long, y=lat, group = group), color='darkgreen', fill='green', alpha=0.2 ) + geom_point(data=map_df, aes(x=long, y=lat, fill=region, size=average_concentration, alpha=ci_size), color="black", pch=21) + geom_text(data=map_df, aes(x=long, y=ci_y, label=ci_label), size=3) + geom_text(data=map_df, aes(x=long, y=lat, label=avg_label), size=3) + scale_size_continuous(range = c(3, 12), trans = "exp") + scale_alpha_continuous(range = c(0.15, 0.75), trans = "reverse") + ggtitle("Estimated average Moustachium concentration with 95% CI") + coord_cartesian(xlim=c(13.2, 16.7), ylim=c(45.4, 47.)) # ggsave(plot=country_plot, width=18, height=12, dpi=300, filename="country.pdf") country_plot "],["404.html", "Page not found", " Page not found The page you requested cannot be found (perhaps it was moved or renamed). You may want to try searching to find the page's new location, or use the table of contents to find the page you are looking for. "]]
+[["boot.html", "Chapter 14 Bootstrap", " Chapter 14 Bootstrap This chapter deals with bootstrap. The students are expected to acquire the following knowledge: How to use bootstrap to generate coverage intervals. .fold-btn { float: right; margin: 5px 5px 0 0; } .fold { border: 1px solid black; min-height: 40px; } Exercise 14.1 Ideally, a \\(1-\\alpha\\) CI would have \\(1-\\alpha\\) coverage. That is, say a 95% CI should, in the long run, contain the true value of the parameter 95% of the time. In practice, it is impossible to assess the coverage of our CI method, because we rarely know the true parameter. In simulation, however, we can. Let’s assess the coverage of bootstrap percentile intervals. Pick a univariate distribution with readily available mean and one that you can easily sample from. Draw \\(n = 30\\) random samples from the chosen distribution and use the bootstrap (with large enough m) and percentile CI method to construct 95% CI. Repeat the process many times and count how many times the CI contains the true mean. That is, compute the actual coverage probability (don’t forget to include the standard error of the coverage probability!). What can you observe? Try one or two different distributions. What can you observe? Repeat (b) and (c) using BCa intervals (R package boot). How does the coverage compare to percentile intervals? As (d) but using intervals based on asymptotic normality (+/- 1.96 SE). How do results from (b), (d), and (e) change if we increase the sample size to n = 200? What about n = 5? library(boot) set.seed(0) nit <- 1000 # Repeat the process "many times" alpha <- 0.05 # CI parameter nboot <- 100 # m parameter for bootstrap ("large enough m") # f: change this to 200 or 5. nsample <- 30 # n = 30 random samples from the chosen distribution. Comment out BCa code if it breaks. 
covers <- matrix(nrow = nit, ncol = 3) covers_BCa <- matrix(nrow = nit, ncol = 3) covers_asymp_norm <- matrix(nrow = nit, ncol = 3) isin <- function (x, lower, upper) { (x > lower) & (x < upper) } for (j in 1:nit) { # Repeating many times # a: pick a univariate distribution - standard normal x1 <- rnorm(nsample) # c: one or two different distributions - beta and poisson x2 <- rbeta(nsample, 1, 2) x3 <- rpois(nsample, 5) X1 <- matrix(data = NA, nrow = nsample, ncol = nboot) X2 <- matrix(data = NA, nrow = nsample, ncol = nboot) X3 <- matrix(data = NA, nrow = nsample, ncol = nboot) for (i in 1:nboot) { X1[ ,i] <- sample(x1, nsample, replace = T) X2[ ,i] <- sample(x2, nsample, T) X3[ ,i] <- sample(x3, nsample, T) } X1_func <- apply(X1, 2, mean) X2_func <- apply(X2, 2, mean) X3_func <- apply(X3, 2, mean) X1_quant <- quantile(X1_func, probs = c(alpha / 2, 1 - alpha / 2)) X2_quant <- quantile(X2_func, probs = c(alpha / 2, 1 - alpha / 2)) X3_quant <- quantile(X3_func, probs = c(alpha / 2, 1 - alpha / 2)) covers[j,1] <- (0 > X1_quant[1]) & (0 < X1_quant[2]) covers[j,2] <- ((1 / 3) > X2_quant[1]) & ((1 / 3) < X2_quant[2]) covers[j,3] <- (5 > X3_quant[1]) & (5 < X3_quant[2]) mf <- function (x, i) return(mean(x[i])) bootX1 <- boot(x1, statistic = mf, R = nboot) bootX2 <- boot(x2, statistic = mf, R = nboot) bootX3 <- boot(x3, statistic = mf, R = nboot) X1_quant_BCa <- boot.ci(bootX1, type = "bca")$bca X2_quant_BCa <- boot.ci(bootX2, type = "bca")$bca X3_quant_BCa <- boot.ci(bootX3, type = "bca")$bca covers_BCa[j,1] <- (0 > X1_quant_BCa[4]) & (0 < X1_quant_BCa[5]) covers_BCa[j,2] <- ((1 / 3) > X2_quant_BCa[4]) & ((1 / 3) < X2_quant_BCa[5]) covers_BCa[j,3] <- (5 > X3_quant_BCa[4]) & (5 < X3_quant_BCa[5]) # e: estimate mean and standard error # sample mean: x1_bar <- mean(x1) x2_bar <- mean(x2) x3_bar <- mean(x3) # standard error (of the sample mean) estimate: sample standard deviation / sqrt(n) x1_bar_SE <- sd(x1) / sqrt(nsample) x2_bar_SE <- sd(x2) / sqrt(nsample) x3_bar_SE <- sd(x3) / sqrt(nsample) covers_asymp_norm[j,1] <- isin(0, x1_bar - 1.96 * x1_bar_SE, x1_bar + 1.96 * x1_bar_SE) covers_asymp_norm[j,2] <- isin(1/3, x2_bar - 1.96 * x2_bar_SE, x2_bar + 1.96 * x2_bar_SE) covers_asymp_norm[j,3] <- isin(5, x3_bar - 1.96 * x3_bar_SE, x3_bar + 1.96 * x3_bar_SE) } apply(covers, 2, mean) ## [1] 0.918 0.925 0.905 apply(covers, 2, sd) / sqrt(nit) ## [1] 0.008680516 0.008333333 0.009276910 apply(covers_BCa, 2, mean) ## [1] 0.927 0.944 0.927 apply(covers_BCa, 2, sd) / sqrt(nit) ## [1] 0.008230355 0.007274401 0.008230355 apply(covers_asymp_norm, 2, mean) ## [1] 0.939 0.937 0.930 apply(covers_asymp_norm, 2, sd) / sqrt(nit) ## [1] 0.007572076 0.007687008 0.008072494 Exercise 14.2 You are given a sample of independent observations from a process of interest: Index 1 2 3 4 5 6 7 8 X 7 2 4 6 4 5 9 10 Compute the plug-in estimate of mean and 95% symmetric CI based on asymptotic normality. Use the plug-in estimate of SE. Same as (a), but use the unbiased estimate of SE. Apply nonparametric bootstrap with 1000 bootstrap replications and estimate the 95% CI for the mean with percentile-based CI. 
# a x <- c(7, 2, 4, 6, 4, 5, 9, 10) n <- length(x) mu <- mean(x) SE <- sqrt(mean((x - mu)^2)) / sqrt(n) SE ## [1] 0.8915839 z <- qnorm(1 - 0.05 / 2) c(mu - z * SE, mu + z * SE) ## [1] 4.127528 7.622472 # b SE <- sd(x) / sqrt(n) SE ## [1] 0.9531433 c(mu - z * SE, mu + z * SE) ## [1] 4.006873 7.743127 # c set.seed(0) m <- 1000 T_mean <- function(x) {mean(x)} est_boot <- array(NA, m) for (i in 1:m) { x_boot <- x[sample(1:n, n, rep = T)] est_boot[i] <- T_mean(x_boot) } quantile(est_boot, p = c(0.025, 0.975)) ## 2.5% 97.5% ## 4.250 7.625 Exercise 14.3 We are given a sample of 10 independent paired (bivariate) observations: Index 1 2 3 4 5 6 7 8 9 10 X 1.26 -0.33 1.33 1.27 0.41 -1.54 -0.93 -0.29 -0.01 2.40 Y 2.64 0.33 0.48 0.06 -0.88 -2.14 -2.21 0.95 0.83 1.45 Compute Pearson correlation between X and Y. Use the cor.test() from R to estimate a 95% CI for the estimate from (a). Apply nonparametric bootstrap with 1000 bootstrap replications and estimate the 95% CI for the Pearson correlation with percentile-based CI. Compare CI from (b) and (c). Are they similar? How would the bootstrap estimation of CI change if we were interested in Spearman or Kendall correlation instead? x <- c(1.26, -0.33, 1.33, 1.27, 0.41, -1.54, -0.93, -0.29, -0.01, 2.40) y <- c(2.64, 0.33, 0.48, 0.06, -0.88, -2.14, -2.21, 0.95, 0.83, 1.45) # a cor(x, y) ## [1] 0.6991247 # b res <- cor.test(x, y) res$conf.int[1:2] ## [1] 0.1241458 0.9226238 # c set.seed(0) m <- 1000 n <- length(x) T_cor <- function(x, y) {cor(x, y)} est_boot <- array(NA, m) for (i in 1:m) { idx <- sample(1:n, n, rep = T) # !!! important to use same indices to keep dependency between x and y est_boot[i] <- T_cor(x[idx], y[idx]) } quantile(est_boot, p = c(0.025, 0.975)) ## 2.5% 97.5% ## 0.2565537 0.9057664 # d # Yes, but the bootstrap CI is more narrow. # e # We just use the functions for Kendall/Spearman coefficients instead: T_kendall <- function(x, y) {cor(x, y, method = "kendall")} T_spearman <- function(x, y) {cor(x, y, method = "spearman")} # Put this in a function that returns the CI bootstrap_95_ci <- function(x, y, t, m = 1000) { n <- length(x) est_boot <- array(NA, m) for (i in 1:m) { idx <- sample(1:n, n, rep = T) # !!! important to use same indices to keep dependency between x and y est_boot[i] <- t(x[idx], y[idx]) } quantile(est_boot, p = c(0.025, 0.975)) } bootstrap_95_ci(x, y, T_kendall) ## 2.5% 97.5% ## -0.08108108 0.78378378 bootstrap_95_ci(x, y, T_spearman) ## 2.5% 97.5% ## -0.1701115 0.8867925 Exercise 14.4 In this problem we will illustrate the use of the nonparametric bootstrap for estimating CIs of regression model coefficients. Load the longley dataset from base R with data(longley). Use lm() to apply linear regression using “Employed” as the target (dependent) variable and all other variables as the predictors (independent). Using lm() results, print the estimated regression coefficients and standard errors. Estimate 95% CI for the coefficients using +/- 1.96 * SE. Use nonparametric bootstrap with 100 replications to estimate the SE of the coefficients from (b). Compare the SE from (c) with those from (b). # a data(longley) # b res <- lm(Employed ~ . 
, longley) tmp <- data.frame(summary(res)$coefficients[,1:2]) tmp$LB <- tmp[,1] - 1.96 * tmp[,2] tmp$UB <- tmp[,1] + 1.96 * tmp[,2] tmp ## Estimate Std..Error LB UB ## (Intercept) -3.482259e+03 8.904204e+02 -5.227483e+03 -1.737035e+03 ## GNP.deflator 1.506187e-02 8.491493e-02 -1.513714e-01 1.814951e-01 ## GNP -3.581918e-02 3.349101e-02 -1.014616e-01 2.982320e-02 ## Unemployed -2.020230e-02 4.883997e-03 -2.977493e-02 -1.062966e-02 ## Armed.Forces -1.033227e-02 2.142742e-03 -1.453204e-02 -6.132495e-03 ## Population -5.110411e-02 2.260732e-01 -4.942076e-01 3.919994e-01 ## Year 1.829151e+00 4.554785e-01 9.364136e-01 2.721889e+00 # c set.seed(0) m <- 100 n <- nrow(longley) T_coef <- function(x) { lm(Employed ~ . , x)$coefficients } est_boot <- array(NA, c(m, ncol(longley))) for (i in 1:m) { idx <- sample(1:n, n, rep = T) est_boot[i,] <- T_coef(longley[idx,]) } SE <- apply(est_boot, 2, sd) SE ## [1] 1.826011e+03 1.605981e-01 5.693746e-02 8.204892e-03 3.802225e-03 ## [6] 3.907527e-01 9.414436e-01 # Show the standard errors around coefficients library(ggplot2) library(reshape2) df <- data.frame(index = 1:7, bootstrap_SE = SE, lm_SE = tmp$Std..Error) melted_df <- melt(df[2:nrow(df), ], id.vars = "index") # Ignore bias which has a really large magnitude ggplot(melted_df, aes(x = index, y = value, fill = variable)) + geom_bar(stat="identity", position="dodge") + xlab("Coefficient") + ylab("Standard error") # + scale_y_continuous(trans = "log") # If you want to also plot bias Exercise 14.5 This exercise shows a shortcoming of the bootstrap method when using the plug in estimator for the maximum. Compute the 95% bootstrap CI for the maximum of a standard normal distribution. Compute the 95% bootstrap CI for the maximum of a binomial distribution with n = 15 and p = 0.2. Repeat (b) using p = 0.9. Why is the result different? # bootstrap CI for maximum alpha <- 0.05 T_max <- function(x) {max(x)} # Equal to T_max = max bootstrap <- function(x, t, m = 1000) { n <- length(x) values <- rep(0, m) for (i in 1:m) { values[i] <- t(sample(x, n, replace = T)) } quantile(values, probs = c(alpha / 2, 1 - alpha / 2)) } # a # Meaningless, as the normal distribution can yield arbitrarily large values. x <- rnorm(100) bootstrap(x, T_max) ## 2.5% 97.5% ## 1.819425 2.961743 # b x <- rbinom(100, size = 15, prob = 0.2) # min = 0, max = 15 bootstrap(x, T_max) ## 2.5% 97.5% ## 6 7 # c x <- rbinom(100, size = 15, prob = 0.9) # min = 0, max = 15 bootstrap(x, T_max) ## 2.5% 97.5% ## 15 15 # Observation: to estimate the maximum, we need sufficient probability mass near the maximum value the distribution can yield. # Using bootstrap is pointless when there is too little mass near the true maximum. # In general, bootstrap will fail when estimating the CI for the maximum. Exercise 14.6 (Practical - and fictional - coverage interval comparison) In this exercise, we investigate how different kinds of CI’s behave as we vary the number of measurements. The story behind the data: it’s 2025 and we’ve discovered that Slovenia has rich deposits of a rare mineral called Moustachium, which can be used to accelerate moustache growth. This mineral is highly sought, so the government has decided to contract two different companies to provide information on where to begin mining. Both companies investigated mining sites in each statistical region and gave their best estimate of the average Moustachium concentration in tonnes per square kilometer. 
The Data Science team has been called to estimate the uncertainty in these estimates and help avoid mining in the wrong region. Generate synthetic data with the script below: set.seed(0) library(comprehenr) regions <- c("pomurska", "podravska", "koroska", "savinjska", "zasavska", "posavska", "JV Slovenija", "primorsko-notranjska", "osrednjeslovenska", "gorenjska", "goriska", "obalno-kraska") region_rates <- seq(1.3, 2.3, length.out=length(regions)) region_rates <- region_rates[sample.int(length(regions), length(regions))] make_dataset <- function(n_contractors) { measurements <- matrix(nrow=length(regions), ncol=n_contractors) for (i in 1:length(regions)) { measurements[i,] <- rgamma(n_contractors, 5.0, region_rates[i]) } df <- data.frame(measurements) row.names(df) <- regions names(df) <- to_vec(for(i in 1:n_contractors) paste("Contractor", i)) return(df) } set.seed(0) df_2025 <- make_dataset(2) set.seed(0) df_2027 <- make_dataset(10) set.seed(0) df_2028 <- make_dataset(100) set.seed(0) df_2029 <- make_dataset(1000) saveRDS(df_2025, file="moustachium_2025.Rda") saveRDS(df_2027, file="moustachium_2027.Rda") saveRDS(df_2028, file="moustachium_2028.Rda") saveRDS(df_2029, file="moustachium_2029.Rda") Estimate the average concentration for different regions. Estimate the average concentration uncertainty using 95% CI’s (asymptotic normality with biased and unbiased standard error, standard bootstrap CI, bootstrap percentile CI). Visualize uncertainties with a histogram and discuss the best location to start mining. The year is 2027 and the government has decided to contract 10 companies. Rerun the code with new measurements and discuss how CI’s change. Technological advancements in robotics have enabled site surveys on a massive scale. Repeat the last point for 100 surveyor robots in 2028 and 1000 surveyor robots in 2029. library(ggplot2) library(dplyr) library(data.table) set.seed(0) input_dataset_path = "moustachium_2025.Rda" # Change this for points d and e output_plot_path = "moustachium_2025.pdf" # Change this for points d and e df <- readRDS(input_dataset_path) # Data comes from here n_contractors <- ncol(df) results_df <- data.frame(region=row.names(df)) # Store CI bounds here # 1. 
average concentration for different mining sites results_df$average_concetration <- rowMeans(df) # CI for the mean based on asymptotic normality (biased SE estimate) biased_SE <- sqrt(apply(df, 1, function(vec) {sum((vec - mean(vec))^2) / length(vec)}) / n_contractors) results_df$ci95.an.biased_var.low <- results_df$average_concetration - 1.96 * biased_SE results_df$ci95.an.biased_var.high <- results_df$average_concetration + 1.96 * biased_SE # CI for the mean based on asymptotic normality (unbiased SE estimate) unbiased_SE <- sqrt(apply(df, 1, var) / n_contractors) results_df$ci95.an.unbiased_var.low <- results_df$average_concetration - 1.96 * unbiased_SE results_df$ci95.an.unbiased_var.high <- results_df$average_concetration + 1.96 * unbiased_SE # Standard bootstrap CI with 1000 samples bootstrap_variance <- function(data, n_samples) { # n_samples is m in pseudocode output <- numeric(n_samples) for (i in 1:n_samples) { index <- sample(1:length(data), length(data), rep = TRUE) resampled_data <- data[index] output[i] <- mean(resampled_data) } return(var(output)) } bootstrap_1000_sd <- sqrt(apply(df, 1, function(vec){bootstrap_variance(vec, 1000)})) results_df$ci95.bootstrap.standard.1000.low <- results_df$average_concetration - 1.96 * bootstrap_1000_sd results_df$ci95.bootstrap.standard.1000.high <- results_df$average_concetration + 1.96 * bootstrap_1000_sd # Bootstrap percentile CI with 1000 samples bootstrap_quantile <- function(data, functional, n_samples, probs) { # n_samples is m in pseudocode output <- numeric(n_samples) for (i in 1:n_samples) { index <- sample(1:length(data), length(data), rep = TRUE) resampled_data <- data[index] output[i] <- functional(resampled_data) } return(quantile(output, probs=probs)) } results_df$ci95.bootstrap.percentile.1000.low <- apply(df, 1, function(vec){bootstrap_quantile(vec, mean, 1000, 0.025)}) results_df$ci95.bootstrap.percentile.1000.high <- apply(df, 1, function(vec){bootstrap_quantile(vec, mean, 1000, 0.975)}) results_df ## region average_concetration ci95.an.biased_var.low ## 1 pomurska 2.814731 1.5351811 ## 2 podravska 2.646518 1.5358919 ## 3 koroska 2.010216 0.5956186 ## 4 savinjska 4.618001 4.4057369 ## 5 zasavska 2.458873 2.0050840 ## 6 posavska 2.153802 1.9001244 ## 7 JV Slovenija 2.433503 1.6860397 ## 8 primorsko-notranjska 3.165394 2.9640430 ## 9 osrednjeslovenska 3.696875 3.5592419 ## 10 gorenjska 1.341931 0.2784547 ## 11 goriska 2.767328 2.3255569 ## 12 obalno-kraska 1.580711 1.4533751 ## ci95.an.biased_var.high ci95.an.unbiased_var.low ci95.an.unbiased_var.high ## 1 4.094281 1.005174095 4.624288 ## 2 3.757144 1.075855548 4.217180 ## 3 3.424813 0.009673183 4.010759 ## 4 4.830264 4.317814385 4.918187 ## 5 2.912662 1.817118318 3.100628 ## 6 2.407479 1.795047746 2.512556 ## 7 3.180965 1.376430415 3.490575 ## 8 3.366746 2.880640556 3.450148 ## 9 3.834508 3.502232367 3.891518 ## 10 2.405407 -0.162051549 2.845913 ## 11 3.209099 2.142569481 3.392086 ## 12 1.708047 1.400630772 1.760792 ## ci95.bootstrap.standard.1000.low ci95.bootstrap.standard.1000.high ## 1 1.5397542 4.089708 ## 2 1.5388631 3.754173 ## 3 0.5492603 3.471171 ## 4 4.4062860 4.829715 ## 5 1.9938049 2.923942 ## 6 1.9010514 2.406552 ## 7 1.6932573 3.173748 ## 8 2.9670216 3.363767 ## 9 3.5602064 3.833544 ## 10 0.2845999 2.399262 ## 11 2.3293359 3.205320 ## 12 1.4543352 1.707087 ## ci95.bootstrap.percentile.1000.low ci95.bootstrap.percentile.1000.high ## 1 1.8914878 3.737975 ## 2 1.8451596 3.447876 ## 3 0.9895308 3.030901 ## 4 4.4648444 4.771157 ## 5 2.1314473 2.786299 ## 6 
1.9707640 2.336840 ## 7 1.8941800 2.972825 ## 8 3.0201118 3.310677 ## 9 3.5975676 3.796183 ## 10 0.5745928 2.109269 ## 11 2.4485735 3.086082 ## 12 1.4888334 1.672589 # Visualization: we use a bar chart with uncertainty bands plot_moustachium_per_region <- function(region_names, average, ci_low, ci_high) { df_visualization <- data.frame(region=region_names, average=average, low=ci_low, high=ci_high) ggplot(df_visualization, aes(x=region, y=average)) + geom_bar(stat="identity") } mask <- endsWith(colnames(results_df), "low") mask[c(1, 2)] <- T results_df_low <- results_df[, mask] colnames(results_df_low) <- gsub('.low','', colnames(results_df_low)) mask <- endsWith(colnames(results_df), "high") mask[c(1, 2)] <- T results_df_high <- results_df[, mask] colnames(results_df_high) <- gsub('.high','', colnames(results_df_high)) long_results_df_low <- melt(setDT(results_df_low), id.vars=c("region", "average_concetration")) names(long_results_df_low) <- c("region", "average_concentration", "variable", "low") long_results_df_high <- melt(setDT(results_df_high), id.vars=c("region", "average_concetration")) names(long_results_df_high) <- c("region", "average_concentration", "variable", "high") long_results_df <- merge(long_results_df_low, long_results_df_high, by=c("region", "variable", "average_concentration"), all=T) moustachium_plot <- ggplot(long_results_df, aes(x=region, y=average_concentration)) + geom_bar(stat="identity", position="dodge", alpha=0.2) + geom_errorbar(aes(ymin=low, ymax=high, color=variable), width=0.2, position=position_dodge(0.9)) + scale_x_discrete(guide = guide_axis(angle = 45)) + ylim(-1, 8) # ggsave(plot=moustachium_plot, width=12, height=8, dpi=300, filename=output_plot_path) moustachium_plot # Visualization: we can also use a map. Circle size denotes concentration in region, low transparency denotes high uncertainty. library(maps) map_data_slo <- map_data('world')[map_data('world')$region == "Slovenia",] map_df <- long_results_df[long_results_df$variable == "ci95.an.biased_var", ] # VERY approximate longitudes and latitudes for different regions. map_df$long <- rep(0, nrow(map_df)) map_df$lat <- rep(0, nrow(map_df)) map_df[map_df$region == "gorenjska"]$long <- 14.2 map_df[map_df$region == "gorenjska"]$lat <- 46.3 map_df[map_df$region == "goriska"]$long <- 13.85 map_df[map_df$region == "goriska"]$lat <- 46.0 map_df[map_df$region == "obalno-kraska"]$long <- 13.9 map_df[map_df$region == "obalno-kraska"]$lat <- 45.65 map_df[map_df$region == "osrednjeslovenska"]$long <- 14.5 map_df[map_df$region == "osrednjeslovenska"]$lat <- 46. map_df[map_df$region == "primorsko-notranjska"]$long <- 14.3 map_df[map_df$region == "primorsko-notranjska"]$lat <- 45.7 map_df[map_df$region == "zasavska"]$long <- 15 map_df[map_df$region == "zasavska"]$lat <- 46.1 map_df[map_df$region == "savinjska"]$long <- 15.2 map_df[map_df$region == "savinjska"]$lat <- 46.25 map_df[map_df$region == "posavska"]$long <- 15.4 map_df[map_df$region == "posavska"]$lat <- 46 map_df[map_df$region == "koroska"]$long <- 15.1 map_df[map_df$region == "koroska"]$lat <- 46.5 map_df[map_df$region == "podravska"]$long <- 15.7 map_df[map_df$region == "podravska"]$lat <- 46.45 map_df[map_df$region == "pomurska"]$long <- 16.2 map_df[map_df$region == "pomurska"]$lat <- 46.65 map_df[map_df$region == "JV Slovenija"]$long <- 15. 
map_df[map_df$region == "JV Slovenija"]$lat <- 45.7 map_df$ci_size <- (map_df$high - map_df$low) map_df$ci_y <- map_df$lat - 0.05 map_df$ci_label <- sprintf("(%.2f, %.2f)", map_df$low, map_df$high) map_df$avg_label <- sprintf("%.2f", map_df$average_concentration) country_plot <- ggplot() + # First layer: worldwide map geom_polygon(data = map_data("world"), aes(x=long, y=lat, group = group), color = '#9c9c9c', fill = '#f3f3f3') + # Second layer: Country map geom_polygon( data = map_data_slo, aes(x=long, y=lat, group = group), color='darkgreen', fill='green', alpha=0.2 ) + geom_point(data=map_df, aes(x=long, y=lat, fill=region, size=average_concentration, alpha=ci_size), color="black", pch=21) + geom_text(data=map_df, aes(x=long, y=ci_y, label=ci_label), size=3) + geom_text(data=map_df, aes(x=long, y=lat, label=avg_label), size=3) + scale_size_continuous(range = c(3, 12), trans = "exp") + scale_alpha_continuous(range = c(0.15, 0.75), trans = "reverse") + ggtitle("Estimated average Moustachium concentration with 95% CI") + coord_cartesian(xlim=c(13.2, 16.7), ylim=c(45.4, 47.)) # ggsave(plot=country_plot, width=18, height=12, dpi=300, filename="country.pdf") country_plot "],["404.html", "Page not found", " Page not found The page you requested cannot be found (perhaps it was moved or renamed). You may want to try searching to find the page's new location, or use the table of contents to find the page you are looking for. "]]
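
A usage note on the chapter code embedded in the search index above: the BCa intervals in Exercise 14.1 (d) come from the boot package, whose statistic must accept the data and a vector of resample indices. The sketch below follows the chapter's own `mf` helper and `$bca[4]`/`$bca[5]` bound extraction; `R = 1000` is an illustrative replication count (the chapter itself uses `nboot = 100`).

```r
# Minimal BCa sketch following the chapter's Exercise 14.1 (d); R = 1000 is an
# illustrative replication count (the chapter itself uses nboot = 100).
library(boot)
set.seed(0)

x <- rnorm(30)                       # n = 30 draws, as in the exercise
mf <- function(x, i) mean(x[i])      # statistic(data, resample indices)
b <- boot(x, statistic = mf, R = 1000)

# boot.ci stores the BCa lower/upper bounds in positions 4 and 5 of $bca
ci <- boot.ci(b, type = "bca")$bca[4:5]
ci
```

The index-based statistic is what lets boot resample rows internally; the same pattern generalizes to the paired-correlation and regression exercises, where resampling indices (rather than values) preserves the dependence between variables.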