diff --git a/04-set-up.Rmd b/04-set-up.Rmd
index 0019aa12..18aec830 100644
--- a/04-set-up.Rmd
+++ b/04-set-up.Rmd
@@ -2,28 +2,34 @@
# Setup {#c04-set-up}
-This chapter provides an overview of the packages, datasets, and design objects used throughout this book. We recommend taking the time to walk through the code provided here and make sure you have everything installed to ensure a smoother learning experience. Additionally, as mentioned in Chapter \@ref(c02-overview-surveys), researchers and analysts need to understand how the survey was conducted to better understand the results and interpret findings. Therefore, we provide some overview information about the datasets used throughout this book in examples and exercises. If you have questions or run into issues with the code provided, please visit the GitHub repository for this book, [https://github.com/tidy-survey-r/tidy-survey-book](https://github.com/tidy-survey-r/tidy-survey-book).
+```{r}
+#| echo: false
+knitr::opts_chunk$set(tidy = 'styler')
+```
+
+This chapter provides an overview of the packages, data, and design objects we use throughout this book. For a streamlined learning experience, we recommend taking the time to walk through the code provided and making sure everything is installed. As mentioned in Chapter \@ref(c02-overview-surveys), understanding how a survey was conducted helps us make sense of the results and interpret findings. So, we provide background on the datasets used in examples and exercises. Finally, we walk through how to create the survey design objects necessary to begin analysis. If you have questions or face issues while going through the book, please report them in the book's GitHub repository: [https://github.com/tidy-survey-r/tidy-survey-book](https://github.com/tidy-survey-r/tidy-survey-book).
## Packages
-Most functions in this book are from three main packages: {tidyverse}, {survey}, and {srvyr}. If you have not installed these packages, use the following code. We use the development version of srvyr from GitHub because of its additional functionality compared to the one on CRAN.
+We use several packages throughout the book, but let's install and load specific ones for this chapter. Many functions in the examples and exercises are from three packages: {tidyverse}, {survey}, and {srvyr}. If they are not already installed, use the code below. The {tidyverse} and {survey} packages can both be installed from the Comprehensive R Archive Network (CRAN). We use the GitHub development version of {srvyr} because of its additional functionality compared to the one on CRAN. Install it directly from GitHub using the {remotes} package:
```{r}
#| label: setup-install-core1
#| eval: FALSE
-install.packages(c("tidyverse","survey"))
+install.packages(c("tidyverse", "survey"))
remotes::install_github("https://github.com/gergness/srvyr")
```
-Additionally, to ensure that all readers can follow the examples throughout the book, we provide the analytic datasets in an R package, {srvyrexploR}. Readers can install the package directly from GitHub using the {remotes} package.
+We bundled the datasets used in the book in an R package, {srvyrexploR}. Install it directly from GitHub using the {remotes} package:
```{r}
#| label: setup-install-core2
#| eval: FALSE
+#| warning: FALSE
remotes::install_github("https://github.com/tidy-survey-r/srvyrexploR")
```
-Once these packages are installed, load these packages using the `library()` function:
+After installing these packages, load them using the `library()` function:
```{r}
#| label: setup-pkgs-core
@@ -36,15 +42,16 @@ library(srvyr)
library(srvyrexploR)
```
-In addition to these four packages, we want to highlight three others: {broom}, {gt}, and {gtsummary}. These help display output and create formatted tables that are easy to read and interpret. You will also want to make sure you install these packages using the following code^[Note that {broom} is part of the tidyverse, so it does not need to be installed separately]:
+The packages {broom}, {gt}, and {gtsummary} help display output and create formatted tables that are easy to read and interpret. Install them with the provided code^[Note: {broom} is already included in the tidyverse, so no separate installation is required]:
```{r}
#| label: setup-install-extra
#| eval: FALSE
-install.packages(c("gt","gtsummary"))
+install.packages(c("gt", "gtsummary"))
```
-Once these packages are installed, load these packages using the `library()` function:
+After installing these packages, load them using the `library()` function:
+
```{r}
#| label: setup-pkgs-extra
#| error: FALSE
@@ -55,7 +62,7 @@ library(gt)
library(gtsummary)
```
-In addition to the packages above, this setup chapter requires that you install and load the {censusapi} package. This will be used to assist with correctly weighting one of our key datasets used in this book. If you do not already have this package installed, run this code:
+Install and load the {censusapi} package to access the Current Population Survey (CPS), which we use to ensure accurate weighting of a key dataset in the book. Run the code below to install {censusapi}:
```{r}
#| label: setup-install-census
@@ -63,7 +70,8 @@ In addition to the packages above, this setup chapter requires that you install
install.packages("censusapi")
```
-Once this package is installed, load it using the `library()` function:
+After installing this package, load it using the `library()` function:
+
```{r}
#| label: setup-pkgs-census
#| error: FALSE
@@ -72,19 +80,29 @@ Once this package is installed, load it using the `library()` function:
library(censusapi)
```
-Additional packages are used in the Real Life Data and Vignettes sections of the book. We list them in the Prerequisite boxes at the beginning of the chapters. When working through those chapters, please ensure you pay attention to this prerequisite box at the beginning of the chapter and load all necessary packages and data.
+Note that the {censusapi} package requires a Census API key, available for free from the U.S. Census Bureau website (refer to the package documentation for more information). We recommend storing the Census API key in the R environment rather than directly in the code. After obtaining the API key, save it in your R environment by running `Sys.setenv()`:
+
+```{r}
+#| label: setup-census-api-setup
+#| eval: FALSE
+Sys.setenv(CENSUS_KEY = "YOUR_API_KEY_HERE")
+```
+
+`Sys.setenv()` makes the key available for the current R session; to make it persist across sessions, save it on its own line in your `.Renviron` file as `CENSUS_KEY=yourkeyhere` and restart R. Once the Census API key is stored, we can retrieve it in our R code with `Sys.getenv("CENSUS_KEY")`.
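+
+A quick way to confirm the key is available in the current session (this check is our own addition, not part of the {censusapi} workflow; `Sys.getenv()` returns an empty string when the variable is not set):
+
+```{r}
+#| label: setup-census-api-check
+#| eval: FALSE
+# Returns TRUE if a Census API key is set in the environment
+nchar(Sys.getenv("CENSUS_KEY")) > 0
+```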
+
+There are other packages used throughout the book. We list them in the Prerequisite boxes at the beginning of each chapter. As we work through the book, make sure to check the Prerequisite box and install any missing packages before proceeding.
## Data
-As mentioned above, the {srvyrexploR} package includes the datasets used throughout the book. Once installed and loaded using the code above, readers can explore the documentation using the `help()` function and read the descriptions for the datasets provided:
+As mentioned above, the {srvyrexploR} package contains the datasets used in the book. Once installed and loaded, explore the documentation using the `help()` function. Read the descriptions of the datasets to understand what they contain:
```{r}
#| label: setup-datapkg-help
#| eval: FALSE
-help(package="srvyrexploR")
+help(package = "srvyrexploR")
```
-The book provides examples and exercises using two key datasets: the American National Election Studies [ANES -- @debell] and the Residential Energy Consumption Survey [RECS -- @recs-2020-tech]. Use the `data()` command to load the datasets from the {srvyrexploR} package into the environment. You can either load all datasets by using the `data()` function without any arguments, or you can include the specific datasets (e.g.,`recs_2020`) as an argument. In the code chunk below, we are loading the `anes_2020` and `recs_2020` datasets into objects with their respective names:
+This book uses two main datasets: the American National Election Studies [ANES -- @debell] and the Residential Energy Consumption Survey [RECS -- @recs-2020-tech]. We can load these datasets individually with the `data()` function by specifying the dataset name as an argument. In the code below, we load the `anes_2020` and `recs_2020` datasets into objects with their respective names:
```{r}
#| label: setup-data-readin
@@ -96,25 +114,35 @@ data(anes_2020)
data(recs_2020)
```
-### Data: American National Election Studies (ANES)
+### American National Election Studies (ANES) Data
+
+The ANES is a study that collects data from election surveys dating back to 1948. These surveys contain information on public opinion and voting behavior in U.S. presidential elections. They cover topics such as party affiliation, voting choice, and level of trust in the government. The 2020 survey, the data we use in the book, was fielded online, through live video interviews, or via computer-assisted telephone interviews (CATI).
-The ANES is a series study that has collected data from election surveys since 1948. These surveys contain data on public opinion and voting behavior in U.S. presidential elections. The 2020 survey (the data we use in the book) was fielded to individuals over the web, through live video interviewing, or with computer-assisted telephone interviewing (CATI). The survey includes questions on party affiliation, voting choice, and level of trust in the government.
+When working with new survey data, analysts should review the survey documentation (see Chapter \@ref(c03-understanding-survey-data-documentation)) to understand the data collection methods. The original ANES data contains variables starting with `V20` [@debell], so to assist with our analysis throughout the book, we created descriptive variable names. For example, the respondent's age is now in a variable called `Age`, and gender is in a variable called `Gender`. These descriptive variables are included in the {srvyrexploR} package, and Table \@ref(tab:anes-view-tab) displays the list of these renamed variables. A complete overview of all variables can be found in `r if (!knitr:::is_html_output()) 'the online Appendix ('`Appendix \@ref(anes-cb)`r if (!knitr:::is_html_output()) ')'`.
-When first looking at new survey data, data users should read the survey documentation (see Chapter \@ref(c03-understanding-survey-data-documentation) to understand how the data was collected and implemented). The original data from ANES contained variables starting with V20 [@debell], so to assist with our analysis throughout the book, we created descriptive variable names. For example, the respondent's age is now in a variable called `Age`, and gender is in a variable called `Gender`. These descriptive variables are included in the data from the {srvyrexploR} package, and Table \@ref(tab:ANESvars) displays the list of these renamed variables. A complete overview of all variables can be found in `r if (!knitr:::is_html_output()) 'the online Appendix ('`Appendix \@ref(anes-cb)`r if (!knitr:::is_html_output()) ')'`.
+(ref:anes-view-tab) List of Created Variables in the ANES Data
-Table: (\#tab:ANESvars) List of Variables in the ANES Data
```{r}
#| label: setup-anes-variables
-#| echo: false
-anes_2020 %>%
+#| echo: FALSE
+#| warning: FALSE
+
+anes_view <- anes_2020 %>%
select(-matches("^V\\d")) %>%
colnames() %>%
as_tibble() %>%
rename(`Variable Name` = value) %>%
- knitr::kable()
+ gt()
```
-Before starting an analysis, it is good to view the data to understand the types of data and variables that are included. The `dplyr::glimpse()` function produces a list of all variables, the type of the variable (e.g., function, double), and a few example values. Below, we remove the variables with numbers to see a glimpse of the ones with descriptive names:
+```{r}
+#| label: anes-view-tab
+#| echo: FALSE
+anes_view %>%
+ print_gt_book(knitr::opts_current$get()[["label"]])
+```
+
+Before beginning an analysis, it is useful to view the data to understand the available variables. The `dplyr::glimpse()` function produces a list of all variables, their types (e.g., factor, double), and a few example values. Below, we remove the original `V20`-prefixed variables with `select(-matches("^V\\d"))` before using `glimpse()` to get a quick overview of the data with descriptive variable names:
```{r}
#| label: setup-anes-glimpse
@@ -123,27 +151,38 @@ anes_2020 %>%
glimpse()
```
-From this, we can see there are `r nrow(anes_2020 %>% select(-matches("^V\\d"))) %>% formatC(big.mark = ",")` rows and `r ncol(anes_2020 %>% select(-matches("^V\\d"))) %>% formatC(big.mark = ",")` variables in the ANES data. This output also indicates that most of the variables are factors (e.g., `InterviewMode`), and a few variables are in double (numeric) format (e.g., `Age`).
+From the output, we can see there are `r nrow(anes_2020 %>% select(-matches("^V\\d"))) %>% formatC(big.mark = ",")` rows and `r ncol(anes_2020 %>% select(-matches("^V\\d"))) %>% formatC(big.mark = ",")` variables in the ANES data. This output also indicates that most of the variables are factors (e.g., `InterviewMode`), while a few variables are in double (numeric) format (e.g., `Age`).
+
+### Residential Energy Consumption Survey (RECS) Data
-### Residential Energy Consumption Survey (RECS)
+RECS is a study that measures energy consumption and expenditure in American households. The Energy Information Administration funds RECS, and the data are collected through interviews with household members and energy suppliers. These interviews take place in person, over the phone, via mail, and on the web. The survey has been fielded 14 times between 1950 and 2020. It includes questions about appliances, electronics, heating, air conditioning (A/C), temperatures, water heating, lighting, energy bills, respondent demographics, and energy assistance.
-RECS is a study that provides energy consumption and expenditure data in American households. The Energy Information Administration funds RECS, and the data is collected through interviews with energy suppliers. These interviews happen in person, over the phone, and online. It has been fielded 14 times between 1950 and 2020. The survey includes questions about appliances, electronics, heating, air conditioning (A/C), temperatures, water heating, lighting, energy bills, respondent demographics, and energy assistance.
+As mentioned above, analysts should read the survey documentation (see Chapter \@ref(c03-understanding-survey-data-documentation)) to understand how the data was collected and implemented. Table \@ref(tab:recs-view-tab) displays the list of variables in the RECS data (not including the weights, which start with `NWEIGHT` and will be described in more detail in Chapter \@ref(c10-specifying-sample-designs)). An overview of all variables can be found in `r if (!knitr:::is_html_output()) 'the online Appendix ('`Appendix \@ref(recs-cb)`r if (!knitr:::is_html_output()) ')'`.
-When first looking at new survey data, data users should read the survey documentation (see Chapter \@ref(c03-understanding-survey-data-documentation)) to understand how the data was collected and implemented. Table \@ref(tab:RECSvars) displays the list of variables in the RECS data (not including the weights, which start with `NWEIGHT` and will be described in more detail in Chapter \@ref(c10-specifying-sample-designs)). An overview of all variables can be found in `r if (!knitr:::is_html_output()) 'the online Appendix ('`Appendix \@ref(recs-cb)`r if (!knitr:::is_html_output()) ')'`.
+(ref:recs-view-tab) List of Variables in the RECS Data
-Table: (\#tab:RECSvars) List of Variables in the RECS Data
```{r}
#| label: setup-recs-variables
-#| echo: false
-recs_2020 %>%
- select(-matches("^NWEIGHT")) %>%
- colnames() %>%
- as_tibble() %>%
- rename(`Variable Name`=value) %>%
- knitr::kable()
+#| echo: FALSE
+#| warning: FALSE
+
+recs_view <- recs_2020 %>%
+ select(-matches("^NWEIGHT")) %>%
+ colnames() %>%
+ as_tibble() %>%
+ rename(`Variable Name` = value) %>%
+ gt()
+```
+
+
+```{r}
+#| label: recs-view-tab
+#| echo: FALSE
+recs_view %>%
+ print_gt_book("recs-view-tab")
```
-Before starting an analysis, it is good to view the data to understand the types of data and variables that are included. The `dplyr::glimpse()` function produces a list of all variables, the type of the variable (e.g., function, double), and a few example values.
+Before starting an analysis, we recommend viewing the data to understand the types of data and variables that are included. The `dplyr::glimpse()` function produces a list of all variables, their types (e.g., factor, double), and a few example values. Below, we remove the weight variables with `select(-matches("^NWEIGHT"))` before using `glimpse()` to get a quick overview of the data:
```{r}
#| label: setup-recs-glimpse
@@ -152,30 +191,28 @@ recs_2020 %>%
glimpse()
```
-From this, we can see that there are `r nrow(recs_2020 %>% select(-matches("^NWEIGHT"))) %>% formatC(big.mark = ",")` rows and `r ncol(recs_2020 %>% select(-matches("^NWEIGHT"))) %>% formatC(big.mark = ",")` non-weight variables in the RECS data. This output also indicates that most of the variables are in double (numeric) format (e.g., `TOTSQFT_EN`), with some factor (e.g., `Region`), Boolean (e.g., `ACUsed`), character (e.g., `REGIONC`), and ordinal (e.g., `YearMade`) variables.
+From the output, we can see that there are `r nrow(recs_2020 %>% select(-matches("^NWEIGHT"))) %>% formatC(big.mark = ",")` rows and `r ncol(recs_2020 %>% select(-matches("^NWEIGHT"))) %>% formatC(big.mark = ",")` non-weight variables in the RECS data. This output also indicates that most of the variables are in double (numeric) format (e.g., `TOTSQFT_EN`), with some factor (e.g., `Region`), Boolean (e.g., `ACUsed`), character (e.g., `REGIONC`), and ordinal (e.g., `YearMade`) variables.
-## Design Objects
+## Design objects
-The design object is the backbone for survey analysis. This object is where we specify the sampling design, weights, and other necessary information to ensure the error in the data is accounted for. Analysts will need to review the survey documentation to understand the sampling and weighting structure of the data before creating thie design object.
+The design object is the backbone for survey analysis. It is where we specify the sampling design, weights, and other necessary information to ensure we account for errors in the data. Before creating the design object, analysts should carefully review the survey documentation to understand the sampling and weighting structure of the data.
-In this chapter, we will provide details on how to code the design object for the ANES and RECS data used in the book. However, we only provide a high-level overview to get you going. For a more in-depth understanding of creating these design objects for a variety of sampling designs, see Chapter \@ref(c10-specifying-sample-designs).
+In this chapter, we provide details on how to code the design object for the ANES and RECS data used in the book. However, we only provide a high-level overview to get readers started. For a deeper understanding of creating these design objects for a variety of sampling designs, see Chapter \@ref(c10-specifying-sample-designs).
-Once we have the design objects, all analysis uses these objects and not the original survey data. For example, the ANES data is called `anes_2020`. If we create a design object called `anes_des`, all analyses should start with `anes_des` and not `anes_2020`. You can still review the original data and do an exploratory data review on the original data prior to conducting complex survey analysis (and, in fact, is highly recommended -- see Chapter \@ref(c12-pitfalls)).
+While we recommend conducting exploratory data analysis on the original data before diving into complex survey analysis (see Chapter \@ref(c12-pitfalls)), the actual analysis and inference should be performed with the survey design objects instead of the original survey data. This ensures that we appropriately apply the details of the survey design to our calculations. For example, the ANES data is called `anes_2020`. If we create a survey design object called `anes_des`, our analyses should begin with `anes_des` and not `anes_2020`.
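+
+As a sketch of what this looks like in practice (a hypothetical estimate, shown only to illustrate starting from the design object rather than the raw data; the chunk label is our own):
+
+```{r}
+#| label: setup-des-usage-sketch
+#| eval: FALSE
+# Estimate the mean respondent age from the design object,
+# not from the raw anes_2020 data
+anes_des %>%
+  summarize(age_mean = survey_mean(Age, na.rm = TRUE))
+```
+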
-### American National Election Studies (ANES)
+### American National Election Studies (ANES) Design Object
-Creating the ANES design object requires reviewing the documentation [@debell] to understand the sampling and weighting implications for analysis. From this documentation and as noted in Chapter \@ref(c03-understanding-survey-data-documentation), the 2020 ANES data is weighted to the sample, not the population. If we want to get generalizations for the population, we need to weigh the data against the full population count. To do this, we will use the Current Population Survey (CPS) to find a number of the non-institutional U.S. citizens aged 18 or older living in the 50 U.S. states or D.C. in March of 2020, as is recommended in the ANES methodology.
+The ANES documentation [@debell] details the sampling and weighting implications for analyzing the survey data. From this documentation and as noted in Chapter \@ref(c03-understanding-survey-data-documentation), the 2020 ANES data is weighted to the sample, not the population. To make generalizations for the population, we need to weight the data against the full population count. The ANES methodology recommends using the Current Population Survey (CPS) to determine the number of non-institutional U.S. citizens aged 18 or older living in the 50 U.S. states or D.C. in March of 2020.
-The {censusapi} package allows us to run a reproducible analysis of the CPS data. Note that this package requires a Census API key, which you can get for free from the Census website (more information can be found in the package documentation). A best practice is to include the Census API key in our R environment and not directly in the code. We can use the {usethis} package's `edit_r_environ()` function to access the R environment (located in a file called `.Renviron`). Run `edit_r_environ()`, save the Census API key on a new line as `CENSUS_KEY=yourkeyhere`, and restart RStudio. Once the Census API key is saved in the R environment, we access it in our code with `Sys.getenv("CENSUS_KEY")`.
-
-Since the ANES data is from March 2020, we will want to get a population count for that same time period. To do this, we will use the March data from CPS (`cps/basic/mar`) and the year for 2020 (`vintage = 2020`). Additionally, we need to extract several variables from the CPS:
+We can use the {censusapi} package to obtain the information needed for the survey design object. The `getCensus()` function allows us to retrieve the CPS data for March (`cps/basic/mar`) in 2020 (`vintage = 2020`). Additionally, we extract several variables from the CPS:
- month (`HRMONTH`) and year (`HRYEAR4`) of the interview: to confirm the correct time period
- age (`PRTAGE`) of the respondent: to narrow the population to 18 and older (eligible age to vote)
- citizenship status (`PRCITSHP`) of the respondent: to narrow the population to only those eligible to vote
- final person-level weight (`PWSSWGT`)
-Detailed information for these variables can be found in the data dictionary^[https://www2.census.gov/programs-surveys/cps/datasets/2020/basic/2020_Basic_CPS_Public_Use_Record_Layout_plus_IO_Code_list.txt].
+Detailed information for these variables can be found in the CPS data dictionary^[https://www2.census.gov/programs-surveys/cps/datasets/2020/basic/2020_Basic_CPS_Public_Use_Record_Layout_plus_IO_Code_list.txt].
```{r}
#| label: setup-anes-cps-get
@@ -195,7 +232,9 @@ cps_state <- cps_state_in %>%
.fns = as.numeric))
```
-As we narrowed the dataset to March 2020, we expect all interviews to have been conducted during that month and year. As we requested the month and year of interview variables, we can confirm that all the data is from March (`HRMONTH == 3`) of 2020 (`HRYEAR4 == 2020`).
+In the code above, we include `region = "state"`. The state level is the default region type for the CPS data, so including it is not required, but it can be helpful for understanding the geographical context of the data.
+
+In `getCensus()`, we filtered the dataset by specifying the month (`HRMONTH == 3`) and year (`HRYEAR4 == 2020`) of our request. Therefore, we expect that all interviews within our output were conducted during that particular month and year. We can confirm that the data is from March of 2020 by running the code below:
```{r}
#| label: setup-anes-cps-date
@@ -203,28 +242,37 @@ cps_state %>%
distinct(HRMONTH, HRYEAR4)
```
-We then can use the age and citizenship variables to filter the data to only those who are 18 years or older (`PRTAGE >= 18`) and have U.S. citizenship (`PRCITSHIP %in% (1:4)`).
+We can narrow down the dataset using the age and citizenship variables to include only individuals who are 18 years or older (`PRTAGE >= 18`) and have U.S. citizenship (`PRCITSHP %in% c(1:4)`):
```{r}
#| label: setup-anes-cps-narrowresp
cps_narrow_resp <- cps_state %>%
- as_tibble() %>%
filter(PRTAGE >= 18,
- PRCITSHP %in% (1:4))
+ PRCITSHP %in% c(1:4))
```
-To calculate the U.S. population from the narrowed data, we sum the person weights (`PWSSWGT`).
+To calculate the U.S. population from the filtered data, we sum the person weights (`PWSSWGT`):
```{r}
#| label: setup-anes-cps-targetpop
targetpop <- cps_narrow_resp %>%
pull(PWSSWGT) %>%
sum()
+```
+```{r}
+#| label: setup-anes-cps-targetpop-display
+#| eval: false
targetpop
```
-The target population in 2020 is `r scales::comma(targetpop)`. This information gives us what we need to create the survey design object that can be used to estimate data for the population. Using the `anes_2020` data, we will adjust the weighting variable (`V200010b`) using the target population we just calculated (`targetpop`).
+```{r}
+#| label: setup-anes-cps-targetpop-print
+#| echo: false
+scales::comma(targetpop)
+```
+
+The target population in 2020 is `r scales::comma(targetpop)`. This result gives us what we need to create the survey design object for estimating population statistics. Using the `anes_2020` data, we adjust the weighting variable (`V200010b`) using the target population we just calculated (`targetpop`). We determine each individual's proportion of the total weight (`V200010b / sum(V200010b)`) and then multiply that proportion by the calculated target population.
```{r}
#| label: setup-anes-adjust
@@ -232,7 +280,7 @@ anes_adjwgt <- anes_2020 %>%
mutate(Weight = V200010b / sum(V200010b) * targetpop)
```
-Once we have the adjusted weights, we can review the rest of the documentation to determine how to create the survey design. The documentation indicates that the study is conducted using a stratified cluster sampling design This means that we will need to specify variables for `strata` and `ids` (cluster) and fill in the `nest` argument. The document provides information on which strata and cluster variables to use depending on whether you are analyzing pre- or post-election data. Throughout this book, we analyze the post-election data, so we need to use the post-election weight of `V200010b`, strata variable or `V200010d`, and PSU/cluster variable of `V200010c`. Additionally, we specify `nest=TRUE`, which enforces nesting of the clusters within the strata.
+Once we have the adjusted weights, we can refer to the rest of the documentation to create the survey design. The documentation indicates that the study uses a stratified cluster sampling design. This means that we need to specify variables for `strata` and `ids` (cluster) and fill in the `nest` argument. The documentation provides guidance on which strata and cluster variables to use depending on whether we are analyzing pre- or post-election data. In this book, we analyze post-election data, so we need to use the post-election weight `V200010b`, strata variable `V200010d`, and PSU/cluster variable `V200010c`. Additionally, we set `nest = TRUE` to ensure the clusters are nested within the strata.
```{r}
#| label: setup-anes-des
@@ -245,14 +293,13 @@ anes_des <- anes_adjwgt %>%
anes_des
```
-Viewing this new object outputs information about the survey design object and specifies that ANES is a "Stratified 1 - level Cluster Sampling design (with replacement)
-With (101) clusters". Additionally, the output displays the sampling variables and then lists the rest of the variables on the dataset. This design object will be used throughout this book to conduct survey analysis.
+We can examine this new object to learn more about the survey design. The output specifies that the ANES is a "Stratified 1 - level Cluster Sampling design (with replacement) With (101) clusters". Additionally, the output displays the sampling variables and then lists the remaining variables in the dataset. This design object will be used throughout this book to conduct survey analysis.
-### Residential Energy Consumption Survey (RECS)
+### Residential Energy Consumption Survey (RECS) Design Object
-Creating the RECS design object requires reviewing the documentation [@recs-2020-tech] to understand the sampling and weighting implications for analysis. The documentation shows the 2020 RECS uses Jackknife weights where the main analytic weight is `NWEIGHT`, and the Jackknife weights are `NWEIGHT1`-`NWEIGHT60`. In the design object code, we can specify these in the `weights` and `repweights` arguments, respectively.
+The RECS documentation [@recs-2020-tech] provides information on the survey's sampling and weighting implications for analysis. The documentation shows the 2020 RECS uses Jackknife weights, where the main analytic weight is `NWEIGHT`, and the Jackknife weights are `NWEIGHT1`-`NWEIGHT60`. In the survey design object code, we can specify these in the `weights` and `repweights` arguments, respectively.
-With Jackknife weights, a few more pieces of information needed: `type`, `scale`, and `mse`. Chapter \@ref(c10-specifying-sample-designs) goes into more detail about each of these arguments. For the purposes of getting you up and running with analyses, the documentation provides information on each of these values, and we can use the following code to create the design object: `type=JK1`, `scale=59/60` and `mse = TRUE`.
+With Jackknife weights, additional information is required: `type`, `scale`, and `mse`. Chapter \@ref(c10-specifying-sample-designs) goes into depth about each of these arguments, but to quickly get started, the documentation lets us know that `type = "JK1"`, `scale = 59/60`, and `mse = TRUE`. We can use the following code to create the survey design object:
```{r}
#| label: setup-recs-des
@@ -262,11 +309,13 @@ recs_des <- recs_2020 %>%
weights = NWEIGHT,
repweights = NWEIGHT1:NWEIGHT60,
type = "JK1",
- scale = 59/60,
+ scale = 59 / 60,
mse = TRUE
)
recs_des
```
-Viewing this new object outputs information about the survey design object and specifies that RECS is an "unstratified cluster jacknife (JK1) with 60 replicates and MSE variances". Additionally, the output displays the sampling variables (as `NWEIGHT1`-`NWEIGHT50`) and then lists the rest of the variables on the dataset. This design object will be used throughout this book to conduct survey analysis.
+Viewing this new object provides information about the survey design, specifying that the RECS is an "unstratified cluster jacknife (JK1) with 60 replicates and MSE variances". Additionally, the output shows the sampling variables (`NWEIGHT1`-`NWEIGHT60`) and then lists the remaining variables in the dataset. This design object will be used throughout this book to conduct survey analysis.
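+
+For example, a minimal sketch of an analysis starting from this design object (a hypothetical estimate using the `TOTSQFT_EN` variable mentioned above; the chunk label is our own):
+
+```{r}
+#| label: setup-recs-des-usage-sketch
+#| eval: FALSE
+# Estimate average total square footage (with its standard error)
+# from the design object, not from the raw recs_2020 data
+recs_des %>%
+  summarize(sqft_mean = survey_mean(TOTSQFT_EN, na.rm = TRUE))
+```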
+
+This chapter walked through the installation and loading of several packages, introduced the survey data available in the {srvyrexploR} package, and provided context on creating survey design objects for the ANES and RECS datasets. With this foundational knowledge, we can follow the instructions listed in the Prerequisite boxes at the start of each chapter.
\ No newline at end of file
diff --git a/09-reproducible-data.Rmd b/09-reproducible-data.Rmd
index 4264fdad..8221e572 100644
--- a/09-reproducible-data.Rmd
+++ b/09-reproducible-data.Rmd
@@ -1,4 +1,4 @@
-# Reproducible Research {#c09-reprex-data}
+# Reproducible research {#c09-reprex-data}
```{r}
#| label: reprex-styler
@@ -6,24 +6,26 @@
knitr::opts_chunk$set(tidy = 'styler')
```
-Reproducing a data analysis's results is a crucial aspect of any research. First, reproducibility serves as a form of quality assurance. If we pass an analysis project to another person, they should be able to run the entire project from start to finish and obtain the same results. They can critically assess the methodology and code while detecting potential errors. Enabling the verification of our analysis is another goal of reproducibility. When someone else is able to check our results, it ensures the integrity of the analyses by determining that the conclusions are not dependent on a particular person running the code or workflow on a particular day or in a particular environment.
+## Introduction
-Not only is reproducibility a key component in ethical and accurate research, but it is also a requirement for many scientific journals. These journals now require authors to make code, data, and methodology transparent and accessible to other researchers who wish to verify or build on existing work.
+Reproducing a data analysis's results is a crucial aspect of any research. First, reproducibility serves as a form of quality assurance. If we pass an analysis project to another person, they should be able to run the entire project from start to finish and obtain the same results. They can critically assess the methodology and code while detecting potential errors. Another goal of reproducibility is enabling the verification of our analysis. When someone else is able to check our results, it ensures the integrity of the analyses by determining that the conclusions are not dependent on a particular person running the code or workflow on a particular day or in a particular environment.
-Reproducible research requires that the key components of analysis are available, discoverable, documented, and shared with others. The four main components that we should consider are:
+Not only is reproducibility a key component in ethical and accurate research, but it is also a requirement for many scientific journals. For example, the Journal of Survey Statistics and Methodology (JSSAM) and Public Opinion Quarterly (POQ) require authors to make code, data, and methodology transparent and accessible to other researchers who wish to verify or build on existing work.
+
+Reproducible research requires that the key components of analysis are available, discoverable, documented, and shared with others. The four main components that we should consider are:
- **Code**: source code used for data cleaning, analysis, modeling, and reporting
- **Data**: raw data used in the workflow, or if data is sensitive or proprietary, as much data as possible that would allow others to run our workflow (e.g., access to a restricted use file (RUF))
- **Environment**: environment of the project, including the R version, packages, operating system, and other dependencies used in the analysis
- **Methodology**: analysis methodology, including rationale behind decisions, interpretations, and assumptions
-In Chapter \@ref(c08-communicating-results), we briefly mention each of these is important to include in the methodology report and when communicating the findings and results of a study. However, to be transparent and effective researchers, we need to ensure we not only discuss these through text but also provide files and additional information when requested. Often, when starting a project, researchers will dive into the data and make decisions as they go without full documentation, which can be challenging if we need to go back and make changes or understand even what we did a few months ago. Therefore, it would benefit other researchers and potentially our future selves to better document everything from the start. The good news is that many tools, practices, and project management techniques make survey analysis projects easy to reproduce. For best results, researchers should decide which techniques and tools will be used before starting a project (or very early on).
+In Chapter \@ref(c08-communicating-results), we briefly mention how each of these is important to include in the methodology report and when communicating the findings of a study. However, to be transparent and effective researchers, we need to ensure we not only discuss these through text but also provide files and additional information when requested. Often, when starting a project, analysts will dive into the data and make decisions as they go without full documentation, which can be challenging if we need to go back and make changes or even understand what we did a few months ago. It benefits other analysts and potentially our future selves to document everything from the start. The good news is that many tools, practices, and project management techniques make survey analysis projects easy to reproduce. For best results, analysts should decide which techniques and tools will be used before starting a project (or very early on).
-This chapter covers some of our suggestions for tools and techniques we can use in projects. This list is not comprehensive but aims to provide a starting point for teams looking to create a reproducible workflow.
+This chapter covers some of our suggestions for tools and techniques we can use in projects. This list is not comprehensive but aims to provide a starting point for those looking to create a reproducible workflow.
## Project-based workflows
-We recommend a project-based workflow for analysis projects as described in Hadley Wickham Mine Çetinkaya-Rundel, and Garrett Grolemund's book, R for Data Science, found at [r4ds.hadley.nz](https://r4ds.hadley.nz/). A project-based workflow maintains a "source of truth" for our analyses. It helps with file system discipline by putting everything related to a project in a designated folder. Since all associated files are in a single location, they are easy to find and organize. When we reopen the project, we can recreate the environment in which we originally ran the code to reproduce our results.
+We recommend a project-based workflow for analysis projects as described by @wickham2023r4ds. A project-based workflow maintains a "source of truth" for our analyses. It helps with file system discipline by putting everything related to a project in a designated folder. Since all associated files are in a single location, they are easy to find and organize. When we reopen the project, we can recreate the environment in which we originally ran the code to reproduce our results.
The RStudio IDE has built-in support for projects. When we create a project in RStudio, it creates a `.Rproj` file that stores settings specific to that project. Once we have created a project, we can create folders that help us organize our workflow. For example, a project directory could look like this:
@@ -48,7 +50,7 @@ The RStudio IDE has built-in support for projects. When we create a project in R
| anes_report.pdf
```
-The {here} package enables easy file referencing. In a project-based workflow, all paths are relative and, by default, relative to the project’s folder. By using relative paths, others can open and run our files even if their directory configuration differs from ours. Use the `here::here()` function to build the path when we load or save data. Below, we ask R to read the CSV file `anes_2020.csv` in the project directory's `data` folder:
+In a project-based workflow, all paths are relative and, by default, relative to the project’s folder. By using relative paths, others can open and run our files even if their directory configuration differs from ours. The {here} package enables easy file referencing, and we can use the `here::here()` function to build the path for loading or saving data. Below, we ask R to read the CSV file `anes_2020.csv` in the project directory's `data` folder:
```{r}
#| eval: false
@@ -59,49 +61,57 @@ anes <-
The combination of projects and the {here} package keep all associated files in an organized manner. This workflow makes it more likely that our analyses can be reproduced by us or our colleagues.
-## Version Control: Git
+## Functions and packages
-Often, a survey analysis project produces a lot of code. Keeping track of the latest version can become challenging as files evolve throughout a project. If a team of analysts is working on the same script, someone may use an outdated version, resulting in incorrect results or duplicative work.
+We may find ourselves repeating code in our scripts, and the chances of errors increase whenever we copy and paste code. By creating a function, we can write a consistent set of commands that reduces the likelihood of mistakes. Functions also organize our code, improve its readability, and allow others to execute the same commands. Throughout this book, we have created functions, such as in Chapter \@ref(c13-ncvs-vignette), to run sequences of `rename()`, `filter()`, `group_by()`, and `summarize()` statements across different variables. The function helps us avoid overlooking necessary steps.
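As a small illustration of this idea, here is a sketch of wrapping a repeated calculation in a function; the helper `calc_wtd_mean()` is hypothetical, not from the book's packages:

```r
# Hypothetical helper: a weighted mean we would otherwise retype per variable
calc_wtd_mean <- function(x, w) {
  sum(x * w) / sum(w)
}

# One consistent set of commands instead of copied-and-pasted formulas
calc_wtd_mean(c(2, 4), c(1, 3))
#> [1] 3.5
```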
-Version control systems like Git can help alleviate these pains. Git is a system that helps track changes in computer files. Survey analysts can use Git to follow code evaluation and manage asynchronous work. With Git, it is easy to see any changes made in a script, revert changes, and resolve differences between code versions (called conflicts).
+A package is a collection of functions. If we find ourselves sharing functions with others to replicate the same series of commands in a separate project, creating a package can be a useful way to share the code along with data and documentation.
-Services such as GitHub or GitLab provide hosting and sharing of files as well as version control with Git. For example, we can visit the GitHub repository for this book ([https://github.com/tidy-survey-r/tidy-survey-book](https://github.com/tidy-survey-r/tidy-survey-book)) and see the files that build the book, when they were committed to the repository, and the history of modifications over time.
+## Version control with Git
+
+Often, a survey analysis project produces a lot of code. Keeping track of the latest version can become challenging as files evolve throughout a project. If a team of analysts is working on the same script, someone may use an outdated version, resulting in incorrect results or redundant work.
-
+Version control systems like Git can help alleviate these pains. Git is a system that helps track changes in computer files. Analysts can use Git to follow code evaluation and manage asynchronous work. With Git, it is easy to see any changes made in a script, revert changes, and resolve differences between code versions (called conflicts).
+
+Services such as GitHub or GitLab provide hosting and sharing of files as well as version control with Git. For example, we can visit the GitHub repository for this book ([https://github.com/tidy-survey-r/tidy-survey-book](https://github.com/tidy-survey-r/tidy-survey-book)) and see the files that build the book, when they were committed to the repository, and the history of modifications over time.
In addition to code scripts, platforms like GitHub can store data and documentation. They provide a way to maintain a history of data modifications through versioning and timestamps. By saving the data and documentation alongside the code, it becomes easier for others to refer to and access everything they need in one place.
-Using version control in analysis projects makes collaboration and maintenance more manageable. For connecting Git with R, we recommend the [Happy Git and GitHub for the useR by Jenny Bryan and Jim Hester](https://happygitwithr.com/) [@git-w-R].
+Using version control in analysis projects makes collaboration and maintenance more manageable. For connecting Git with R, we recommend the book [Happy Git and GitHub for the useR](https://happygitwithr.com/) [@git-w-R].
-## Package Management: {renv}
+## Package management with {renv}
-Ensuring reproducibility involves not only using version control of code, but also managing the versions of packages. If two people run the same code but use different versions of a package, the results might differ because of changes in those packages. For example, this book currently uses a version of the {srvyr} package from GitHub and not from CRAN. This is because the version of {srvyr} has some bugs (errors) when doing some calculations. The version on GitHub has corrected these errors, so we have asked users to install the GitHub version to obtain the same results.
+Ensuring reproducibility involves not only using version control of code, but also managing the versions of packages. If two people run the same code but use different versions of a package, the results might differ because of changes in those packages. For example, this book currently uses a version of the {srvyr} package from GitHub and not from CRAN. This is because the version of {srvyr} on CRAN has some bugs (errors) that result in incorrect calculations. The version on GitHub has corrected these errors, so we have asked readers to install the GitHub version to obtain the same results.
One way to handle different package versions is with the {renv} package. This package allows researchers to set the versions for each package used and manage package dependencies. Specifically, {renv} creates isolated, project-specific environments that record the packages and their versions used in the code. When initiated by a new user, {renv} checks whether the installed packages are consistent with the recorded version for the project. If not, it installs the appropriate versions so that others can replicate the project's environment to rerun the code and obtain consistent results.
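A minimal sketch of this workflow, using the {renv} functions described above (run interactively rather than inside a script):

```r
# Typical {renv} workflow, run at the R console:
# renv::init()      # create a project-specific library and an renv.lock file
# renv::snapshot()  # record the packages and versions used by the project
# renv::restore()   # on another machine, reinstall the recorded versions
```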
-## Workflow Management: {targets}
+## R environments with Docker
+
+Just as different versions of packages can introduce discrepancies or compatibility issues, the version of R can also prevent reproducibility. Tools such as Docker can help with this potential issue by creating isolated environments that define the version of R being used, along with other dependencies and configurations. The entire environment is bundled in a container. The container, defined by a Dockerfile, can be shared so anybody, regardless of their local setup, can run the R code in the same environment.
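As a sketch, a minimal Dockerfile for an R analysis might look like the following; the rocker image tag and the `analysis.R` script name are illustrative assumptions:

```dockerfile
# Pin a specific R version via the rocker project's versioned images
FROM rocker/r-ver:4.2.1

# Install the packages the analysis depends on
RUN Rscript -e 'install.packages(c("survey", "srvyr"))'

# Copy the project into the container and run the (hypothetical) main script
COPY . /project
WORKDIR /project
CMD ["Rscript", "analysis.R"]
```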
+
+## Workflow management with {targets}
With complex studies involving multiple code files and dependencies, it is important to ensure each step is executed in the intended sequence. We can do this manually, e.g., numbering files to indicate the order or providing detailed documentation on the order. Alternatively, we can automate the process so the code flows sequentially. Making sure that the code runs in the correct order helps ensure that the research is reproducible. Anyone should be able to pick up the set of scripts and get the same results by following the workflow.
-The {targets} package is growing as a popular workflow manager that documents, automates, and executes complex data workflows with multiple steps and dependencies. With this package, we first define the order of execution for our code, and then it will consistently execute the code in that order each time it is run. One nice feature of {targets} is that if you change code later in the workflow, only the affected code and its downstream targets (i.e., the subsequent code files) are re-executed when we change a script. The {targets} package also provides interactive progress monitoring and reporting, allowing us to track the status and progress of our analysis pipeline.
+The {targets} package is growing as a popular workflow manager that documents, automates, and executes complex data workflows with multiple steps and dependencies. With this package, we first define the order of execution for our code, and then it will consistently execute the code in that order each time it is run. One beneficial feature of {targets} is that if you change code later in the workflow, only the affected code and its downstream targets (i.e., the subsequent code files) are re-executed when we change a script. The {targets} package also provides interactive progress monitoring and reporting, allowing us to track the status and progress of our analysis pipeline.
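A minimal `_targets.R` file might look like the following sketch; `read_data()`, `clean_variables()`, and `run_analysis()` are hypothetical functions assumed to be defined in the project's `R/` folder:

```r
# Sketch of a minimal {targets} pipeline definition (_targets.R)
library(targets)
tar_source()  # load the functions saved in the R/ folder

list(
  tar_target(raw_data, read_data("data/anes_2020.csv")),
  tar_target(clean_data, clean_variables(raw_data)),
  tar_target(estimates, run_analysis(clean_data))
)
```

Running `tar_make()` then executes the steps in order, skipping any target whose upstream code and data have not changed.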
-## Documentation: Quarto and R Markdown
+## Documentation with Quarto and R Markdown
-Tools like Quarto and R Markdown aid in reproducibility by creating documents that integrate code, text, and results. We can present analysis results alongside the report's narrative, so there's no need to copy and paste code output into the final documentation. By eliminating manual steps, we can reduce the chances of errors in the final output.
+Tools like Quarto and R Markdown aid in reproducibility by creating documents that weave together code, text, and results. We can present analysis results alongside the report's narrative, so there's no need to copy and paste code output into the final documentation. By eliminating manual steps, we can reduce the chances of errors in the final output.
-Quarto and R Markdown documents also allow users to re-execute the underlying code when needed. Another team member can see the steps we took, follow the scripts, and recreate the report. We can include details about our work in one place thanks to the combination of text and code, making our work transparent and easier to verify.
+Quarto and R Markdown documents also allow users to re-execute the underlying code when needed. Another analyst can see the steps we took, follow the scripts, and recreate the report. We can include details about our work in one place thanks to the combination of text and code, making our work transparent and easier to verify.
### Parameterization
-Another great feature of Quarto and R Markdown is the ability to reduce repetitive code by parameterizing the files. Parameters can control various aspects of the analysis, such as dates, geography, or other analysis variables. We can define and modify these parameters to explore different scenarios or inputs. For example, suppose we start by creating a document that provides survey analysis results for Michigan but then later decide we want to look at multiple states. In that case, we can define a `state` parameter and rerun the same analysis for other states like Wisconsin without having to edit the code throughout the document.
+Another useful feature of Quarto and R Markdown is the ability to reduce repetitive code by parameterizing the files. Parameters can control various aspects of the analysis, such as dates, geography, or other analysis variables. We can define and modify these parameters to explore different scenarios or inputs. For example, suppose we start by creating a document that provides survey analysis results for North Carolina but then later decide we want to look at another state. In that case, we can define a `state` parameter and rerun the same analysis for a state like Washington without having to edit the code throughout the document.
-Parameters can be defined in the header or code chunks of our Quarto or R Markdown documents and easily be modified and documented. Thus, we are reducing errors that may occur by manually editing code throughout the script, and it is a flexible way for others to replicate the analysis and explore variations.
+Parameters can be defined in the header or code chunks of our Quarto or R Markdown documents and easily be modified and documented. This reduces errors that may occur from manually editing code throughout the script and offers a flexible way for others to replicate the analysis and explore variations.
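For example, a parameterized R Markdown header might look like this (the title and field values are illustrative):

```yaml
---
title: "State-level survey results"
output: pdf_document
params:
  state: "North Carolina"
---
```

Code chunks can then reference `params$state`, and we can render the same report for another state with `rmarkdown::render("report.Rmd", params = list(state = "Washington"))`.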
-## Other Tips for Reproducibility
+## Other tips for reproducibility
-### Random Number Seeds
+### Random number seeds
-Some tasks in survey analysis require randomness, such as imputation, model training, or creating random samples. By default, the random numbers generated by R change each time we rerun the code, making it difficult to reproduce the same results. By "setting the seed," we can control the randomness and ensure that the random numbers remain consistent whenever we rerun the code. Others can use the same seed value to reproduce our random numbers and achieve the same results, facilitating reproducibility.
+Some tasks in survey analysis require randomness, such as imputation, model training, or creating random samples. By default, the random numbers generated by R change each time we rerun the code, making it difficult to reproduce the same results. By "setting the seed," we can control the randomness and ensure that the random numbers remain consistent whenever we rerun the code. Others can use the same seed value to reproduce our random numbers and achieve the same results.
In R, we can use the `set.seed()` function to control the randomness in our code. Set a seed value by providing an integer to the function:
@@ -117,21 +127,16 @@ The `runif()` function generates five random numbers from a uniform distribution
[1] 0.38907138 0.58306072 0.09466569 0.85263123 0.78674676
```
-The choice of the seed number is up to the researcher. For example, this could be the date (`20240102`) or time of day (`1056`) when the analysis was first conducted, a phone number (`8675309`), or the first few numbers that come to mind (`369`). As long as the seed is set for a given analysis, the actual number can be up to the researcher to decide. However, it is important to note that `set.seed()` should be used *before* random number generation but is only necessary once per program to make the entire program reproducible. For example, we could set the seed at the top of a program where libraries are loaded.
-
-### Descriptive Names and Labels
+The choice of the seed number is up to the analyst. For example, this could be the date (`20240102`) or time of day (`1056`) when the analysis was first conducted, a phone number (`8675309`), or the first few numbers that come to mind (`369`). As long as the seed is set for a given analysis, the actual number is up to the analyst to decide. It is important to note that `set.seed()` should be used *before* random number generation. Run it once per program, and the seed will be applied to the entire script. We recommend setting the seed at the beginning of a script, where libraries are loaded.
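To see this in action, re-setting the same seed before rerunning the code returns identical draws:

```r
set.seed(999)
first_run <- runif(5)

set.seed(999)   # reset to the same seed before rerunning
second_run <- runif(5)

identical(first_run, second_run)
#> [1] TRUE
```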
-Something else to assist with reproducible research is using descriptive variable names or labeling data. For example, in the ANES data, the variable names in the raw data all start with `V20` and are a string of numbers. To make things easier to reproduce, we opted to change the variable names to be more descriptive of what they contained (e.g., `Age`). This can also be done with the data values themselves. One way to accomplish this is by creating factors for categorical data, which can ensure that we know that a value of `1` really means `Female`, for example. There are other ways of handling this, such as attaching labels to the data instead of recoding variables to be descriptive (see Chapter \@ref(c11-missing-data)). As with random number seeds, the exact method is up to the researcher, but providing this information can help ensure your research is reproducible.
+### Descriptive names and labels
-### Databases
-
-Researchers may consider creating a database for projects with complex or large data structures to manage the data and any changes. Many databases will allow for a history of changes, which can be useful when recoding variables to ensure no inadvertent errors are introduced. Additionally, a database may be more accessible to pass to other researchers if existing relationships between tables and types are complex to map.
+Using descriptive variable names or labeling data can also assist with reproducible research. For example, in the ANES data, the variable names in the raw data all start with `V20` and are a string of numbers. To make things easier to reproduce, we opted to change the variable names to be more descriptive of what they contained (e.g., `Age`). This can also be done with the data values themselves. One way to accomplish this is by creating factors for categorical data, which can ensure that we know that a value of `1` really means `Female`, for example. There are other ways of handling this, such as attaching labels to the data instead of recoding variables to be descriptive (see Chapter \@ref(c11-missing-data)). As with random number seeds, the exact method is up to the analyst, but providing this information can help ensure our research is reproducible.
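A small sketch of this recoding approach follows; the raw variable name `V201600` is merely illustrative of the ANES `V20...` naming style, and the data values are made up:

```r
# Recode a raw numeric variable into a descriptively named, labeled factor
anes_raw <- data.frame(V201600 = c(2, 1, 2))

anes_clean <- anes_raw
anes_clean$Gender <- factor(anes_raw$V201600,
                            levels = c(1, 2),
                            labels = c("Male", "Female"))

levels(anes_clean$Gender)
#> [1] "Male"   "Female"
```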
## Summary
-We can promote accuracy and verification of results by making our analysis reproducible. This chapter discussed different ways to make research reproducible. There are various tools and guides available to help you achieve reproducibility in your work. Here are additional resources to explore:
+We can promote accuracy and verification of results by making our analysis reproducible. There are various tools and guides available to help you achieve reproducibility in your work, a few of which were described in this chapter. Here are additional resources to explore:
* R for Data Science chapter on project-based workflows: [https://r4ds.hadley.nz/workflow-scripts.html#projects](https://r4ds.hadley.nz/workflow-scripts.html#projects)
* Building reproducible analytical pipelines with R by Bruno Rodrigues: [https://raps-with-r.dev/](https://raps-with-r.dev/)
* Posit Solutions Site page on reproducible environments: [https://solutions.posit.co/envs-pkgs/environments/](https://solutions.posit.co/envs-pkgs/environments/)
-
diff --git a/10-specifying-sample-designs.Rmd b/10-specifying-sample-designs.Rmd
index 133bef19..441bedf4 100644
--- a/10-specifying-sample-designs.Rmd
+++ b/10-specifying-sample-designs.Rmd
@@ -25,7 +25,7 @@ library(srvyr)
library(srvyrexploR)
```
-To help explain the different types of sample designs, this chapter will use the `api` and `scd` data that comes in the {survey} package:
+To help explain the different types of sample designs, this chapter will use the `api` and `scd` data that are included in the {survey} package:
```{r}
#| label: samp-setup-surveydata
data(api)
@@ -44,9 +44,9 @@ data(recs_2020)
## Introduction
-The primary reason for using packages like {survey} and {srvyr} is to incorporate the sampling design or replicate weights into estimates. By incorporating the sampling design or replicate weights, precision estimates (e.g., standard errors and confidence intervals) are appropriately calculated.
+The primary reason for using packages like {survey} and {srvyr} is to account for the sampling design or replicate weights when calculating estimates. By incorporating the sampling design or replicate weights, precision estimates (e.g., standard errors and confidence intervals) are appropriately calculated.
-In this chapter, we will introduce common sampling designs and common types of replicate weights, the mathematical methods for calculating estimates and standard errors for a given sampling design, and the R syntax to specify the sampling design or replicate weights. While we will show the math behind the estimates, the functions in these packages will do the calculation. To deeply understand the math and the derivation, refer to @pennstate506, @sarndal2003model, @wolter2007introduction, or @fuller2011sampling.
+In this chapter, we will introduce common sampling designs and common types of replicate weights, the mathematical methods for calculating estimates and standard errors for a given sampling design, and the R syntax to specify the sampling design or replicate weights. While we will show the math behind the estimates, the functions in these packages will do the calculation. To deeply understand the math and the derivation, refer to @pennstate506, @sarndal2003model, @wolter2007introduction, or @fuller2011sampling (listed in order of increasing statistical rigor).
The general process for estimation in the {srvyr} package is to:
@@ -58,25 +58,25 @@ The general process for estimation in the {srvyr} package is to:
4. Within `summarize()`, specify variables to calculate, including means, totals, proportions, quantiles, and more
-This chapter includes details on the first step - creating the survey object. The other steps are detailed in chapters \@ref(c05-descriptive-analysis) through \@ref(c07-modeling).
+This chapter includes details on the first step - creating the survey object. Once this survey object is created, it can be used in the other steps (detailed in chapters \@ref(c05-descriptive-analysis) through \@ref(c07-modeling)) to account for the complex survey design.
## Common sampling designs
-A sampling design is the method used to draw a sample. Both logistical and statistical elements are considered when developing a sampling design. When specifying a sampling design in R, the levels of sampling are specified along with the weights. The weight for each record is constructed so that the particular record represents that many units in the population. For example, in a survey of 6th-grade students in the United States, the weight associated with each responding student reflects how many students that record represents. Generally, the sum of the weights corresponds to the total population size, although some studies may have the sum of the weights equal to the number of respondent records.
+A sampling design is the method used to draw a sample. Both logistical and statistical elements are considered when developing a sampling design. When specifying a sampling design in R, the levels of sampling are specified along with the weights. The weight for each record is constructed so that the particular record represents that many units in the population. For example, in a survey of 6th-grade students in the United States, the weight associated with each responding student reflects how many 6th-grade students across the country that record represents. Generally, the weights represent the inverse of the probability of selection, such that the sum of the weights corresponds to the total population size, although some studies may have the sum of the weights equal to the number of respondent records.
Some common terminology across the designs are:
- **sample size**, generally denoted as $n$, is the number of units selected to be sampled
- **population size**, generally denoted as $N$, is the number of units in the target population
- - **sampling frame**, the list of units from which the sample is drawn
+ - **sampling frame**, the list of units from which the sample is drawn (see Chapter \@ref(c02-overview-surveys) for more information)
### Simple random sample without replacement
-The simple random sample (SRS) without replacement is a sampling design where a fixed sample size is selected from a sampling frame, and every possible subsample has an equal probability of selection.
+The simple random sample (SRS) without replacement is a sampling design where a fixed sample size is selected from a sampling frame, and every possible subsample has an equal probability of selection. "Without replacement" refers to the fact that once a sampling unit has been selected, it is removed from the sampling frame and cannot be selected again.
- **Requirements**: The sampling frame must include the entire population.
- **Advantages**: SRS requires no information about the units apart from contact information.
- - **Disadvantages**: The sampling frame may not be available for the entire population. This design is not generally feasible for in-person data collection.
+ - **Disadvantages**: The sampling frame may not be available for the entire population.
- **Example**: Randomly select students in a university from a roster provided by the registrar's office.
#### The math {-}
@@ -92,7 +92,7 @@ $$se(\bar{y})=\sqrt{\frac{s^2}{n}\left( 1-\frac{n}{N} \right)}$$ where
$$s^2=\frac{1}{n-1}\sum_{i=1}^n\left(y_i-\bar{y}\right)^2.$$
-and $N$ is the population size. This standard error estimate might look very similar to equations in other applications except for the part on the right side of the equation: $1-\frac{n}{N}$. This is called the finite population correction (FPC) factor, and if the size of the frame, $N$, is very large, the FPC is negligible, so it is often ignored.
+and $N$ is the population size. This standard error estimate might look very similar to equations in other applications except for the part on the right side of the equation: $1-\frac{n}{N}$. This is called the finite population correction (FPC) factor. If the size of the frame, $N$, is very large in comparison to the sample, the FPC is negligible, so it is often ignored. A common guideline is that if the sample is less than 10% of the population, the FPC is negligible.
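As a quick check of how small the FPC adjustment is under this guideline, here is a base-R sketch with made-up values (`y` is a toy outcome vector, not real survey data):

```r
# Made-up example: a sample of n = 50 units drawn from a frame of N = 10,000
y <- 1:50                        # toy outcome values for the sampled units
n <- length(y)
N <- 10000

s2 <- var(y)                     # sample variance s^2
se_no_fpc   <- sqrt(s2 / n)
se_with_fpc <- sqrt(s2 / n * (1 - n / N))

# The ratio of the two is sqrt(1 - n/N) = sqrt(0.995), about 0.9975, so with
# n/N = 0.5% the FPC barely changes the standard error
c(se_no_fpc, se_with_fpc)
```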
To estimate proportions, we define $x_i$ as the indicator if the outcome is observed. That is, $x_i=1$ if the outcome is observed, and $x_i=0$ if the outcome is not observed for respondent $i$. Then the estimated proportion from an SRS design is:
@@ -110,7 +110,7 @@ srs1_des <- dat %>%
as_survey_design(fpc = fpcvar)
```
-where `dat` is a tibble or data.frame with the survey data, and `fpcvar` is a variable on the tibble indicating the sampling frame's size. If the frame is very large, sometimes the frame size is not provided. In that case, the FPC is not needed, and specify the design as:
+where `dat` is a tibble or data.frame with the survey data, and `fpcvar` is a variable in the data indicating the sampling frame's size (this variable will have the same value for all cases in an SRS design). If the frame is very large, sometimes the frame size is not provided. In that case, the FPC is not needed, and we specify the design as:
```r
srs2_des <- dat %>%
@@ -125,7 +125,7 @@ srs3_des <- dat %>%
fpc = fpcvar)
```
-where `wtvar` is the variable for the weight on the data. Again, the FPC can be omitted if it is unnecessary because the frame is large.
+where `wtvar` is a variable in the data indicating the weight for each case. Again, the FPC can be omitted if it is unnecessary because the frame is large compared to the sample size.
#### Example {-}
@@ -181,9 +181,13 @@ Similar to the SRS design, the simple random sample with replacement (SRSWR) des
- **Requirements**: The sampling frame must include the entire population.
- **Advantages**: SRSWR requires no information about the units apart from contact information.
- - **Disadvantages**: The sampling frame may not be available for the entire population. This design is not generally feasible for in-person data collection. Units can be selected more than once, resulting in a smaller realized sample size because receiving duplicate information from a single respondent does not provide additional information. For small populations, SRSWR has larger standard errors than SRS designs.
+ - **Disadvantages**:
+ - The sampling frame may not be available for the entire population.
+ - Units can be selected more than once, resulting in a smaller realized sample size because receiving duplicate information from a single respondent does not provide additional information.
+ - For small populations, SRSWR has larger standard errors than SRS designs.
- **Example**: A professor puts all students' names on paper slips and selects them randomly to ask students questions, but the professor replaces the paper after calling on the student so they can be selected again at any time.
+In general for surveys, using an SRS design (without replacement) is preferred as we do not want respondents to answer a survey more than once.
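The difference between the two selection mechanisms can be seen with base R's `sample()` function; the frame and seed below are arbitrary:

```r
set.seed(52)
frame <- 1:10                               # a tiny sampling frame of 10 units

srs   <- sample(frame, 5, replace = FALSE)  # without replacement: no repeats
srswr <- sample(frame, 5, replace = TRUE)   # with replacement: repeats possible

length(unique(srs))    # always 5 distinct units
length(unique(srswr))  # may be fewer than 5 if any unit was drawn twice
```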
#### The math {-}
@@ -212,7 +216,7 @@ srswr1_des <- dat %>%
as_survey_design()
```
-where `dat` is a tibble or data.frame containing our survey data. This syntax is the same as a SRS design without an FPC. Therefore, with large enough samples that do not have an FPC, the underlying formulas for SRS and SRSWR designs are the same.
+where `dat` is a tibble or data.frame containing our survey data. This syntax is the same as a SRS design, except a finite population correction (FPC) is not included. This is because when we sample with replacement, the population pool to select from is effectively no longer finite, so a correction is not needed. Therefore, with large populations where the FPC is negligible, the underlying formulas for SRS and SRSWR designs are the same.
If some post-survey adjustments were implemented and the weights are not all equal, specify the design as:
@@ -225,7 +229,7 @@ where `wtvar` is the variable for the weight on the data.
#### Example {-}
-The {survey} package does not include an example of SRSWR, so to illustrate this design we create an example from the population data provided. We call this new dataset `apisrswr`.
+The {survey} package does not include an example of SRSWR, so to illustrate this design we need to create one. We use the api population data provided by the {survey} package (`apipop`) and select a sample of 200 cases using the `slice_sample()` function from the tidyverse. One of the arguments in the `slice_sample()` function is `replace`; if `replace = TRUE`, we are conducting a SRSWR. We then calculate selection weights as the inverse of the probability of selection and call this new dataset `apisrswr`.
```{r}
#| label: samp-des-apisrs-wr-display
@@ -273,11 +277,14 @@ In the output above, the design object and the object summary are shown. Both no
Stratified sampling occurs when a population is divided into mutually exclusive subpopulations (strata), and then samples are selected independently within each stratum.
- **Requirements**: The sampling frame must include the information to divide the population into groups for every unit.
- - **Advantages**: This design ensures sample representation in all subpopulations. If the strata are correlated with survey outcomes, a stratified sample has smaller standard errors compared to a SRS sample of the same size. Thus is a more efficient design.
+ - **Advantages**:
+ - This design ensures sample representation in all subpopulations.
+ - If the strata are correlated with survey outcomes, a stratified sample has smaller standard errors compared to a SRS sample of the same size.
+ - This results in a more efficient design.
- **Disadvantages**: Auxiliary data may not exist to divide the sampling frame into groups, or the data may be outdated.
- **Examples**:
- - **Example 1**: A population of North Carolina residents could be separated into urban and rural areas, and then a SRS of residents from both rural and urban areas is selected independently. This ensures there are residents from both areas in the sample.
- - **Example 2**: There are three primary general-purpose law enforcement agencies in the US: local police, sheriff's departments, and state police. In a survey of law enforcement agencies, the agency type could be used to form strata.
+ - **Example 1**: A population of North Carolina residents could be separated (stratified) into urban and rural areas, and then a SRS of residents from both rural and urban areas is selected independently. This ensures there are residents from both areas in the sample.
+ - **Example 2**: Law enforcement agencies could be separated (stratified) into the three primary general-purpose categories in the US: local police, sheriff's departments, and state police. A SRS of agencies from each of the three types is then selected independently to ensure all three types of agencies are represented.
#### The math {-}
@@ -301,7 +308,7 @@ $$se(\hat{p}) = \frac{1}{N} \sqrt{ \sum_{h=1}^H N_h^2 \frac{\hat{p}_h(1-\hat{p}_
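A base-R sketch of the stratified proportion estimator and its standard error follows; the stratum population sizes `N_h`, sample sizes `n_h`, and proportions `p_h` are all made up for illustration:

```r
# Hypothetical strata: population sizes N_h, sample sizes n_h, and
# estimated proportions p_h within each of H = 3 strata
N_h <- c(4000, 3000, 3000)
n_h <- c(100, 50, 50)
p_h <- c(0.20, 0.35, 0.50)
N   <- sum(N_h)

p_hat  <- sum(N_h * p_h) / N     # weighted combination of stratum proportions
se_hat <- sqrt(sum(N_h^2 * p_h * (1 - p_h) / n_h * (1 - n_h / N_h))) / N

c(p_hat = p_hat, se = se_hat)    # p_hat is 0.335
```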
#### The syntax {-}
-To specify a stratified SRS design in {srvyr} when using the FPC, that is, where the population sizes of the strata are not too large and are known, specify the design as:
+In addition to the `fpc` and `weights` arguments discussed in the designs above, stratified designs require the addition of the `strata` argument. For example, to specify a stratified SRS design in {srvyr} when using the FPC, that is, where the population sizes of the strata are known and not too large, specify the design as:
```r
stsrs1_des <- dat %>%
@@ -319,7 +326,7 @@ stsrs2_des <- dat %>%
#### Example {-}
-In the example API data, `apistrat` is a stratified random sample, stratified by school type (`stype`). As with the SRS example above, we sort and select specific variables for use in printing. The data are illustrated below, including a count of the number of cases per stratum:
+In the example API data, `apistrat` is a stratified random sample, stratified by school type (`stype`) with three levels: `E` for elementary school, `M` for middle school, and `H` for high school. As with the SRS example above, we sort and select specific variables for use in printing. The data are illustrated below, including a count of the number of cases per stratum:
```{r}
#| label: samp-des-apistrat-dis
@@ -333,7 +340,7 @@ apistrat_slim %>%
count(stype, fpc)
```
-The FPC is the same within each stratum, and 100 elementary schools were sampled, while 50 schools were sampled from both the middle and high school levels. This design should be specified as follows:
+The FPC is the same for each case within each stratum. This output also shows that 100 elementary schools, 50 middle schools, and 50 high schools were sampled. It is common for the number of units sampled from each stratum to differ based on the goals of the project, or to mirror the relative size of each stratum in the population. This design should be specified as follows:
```{r}
#| label: samp-des-apistrat-des
@@ -353,11 +360,13 @@ When printing the object, it is specified as a "Stratified Independent Sampling
Clustered sampling occurs when a population is divided into mutually exclusive subgroups called clusters or primary sampling units (PSUs). A random selection of PSUs is sampled, and then another level of sampling is done within these clusters. There can be multiple levels of this selection. Clustered sampling is often used when a list of the entire population is not available, or data collection involves interviewers needing direct contact with respondents.
- **Requirements**: There must be a way to divide the population into clusters. Clusters are commonly structural such as institutions (e.g., schools, prisons) or geography (e.g., states, counties).
- - **Advantages**: Clustered sampling is advantageous when data collection is done in person, so interviewers are sent to specific sampled areas rather than completely at random across a country. With cluster sampling, a list of the entire population is not necessary. For example, if sampling students, we do not need a list of all students but only a list of all schools. Once the schools are sampled, lists of students can be obtained within the sampled schools.
+ - **Advantages**:
+ - Clustered sampling is advantageous when data collection is done in person, so interviewers are sent to specific sampled areas rather than completely at random across a country.
+ - With clustered sampling, a list of the entire population is not necessary. For example, if sampling students, we do not need a list of all students but only a list of all schools. Once the schools are sampled, lists of students can be obtained within the sampled schools.
- **Disadvantages**: Compared to a simple random sample for the same sample size, clustered samples generally have larger standard errors of estimates.
- **Examples**:
- - **Example 1**: Consider a study needing a sample of 6th-grade students in the United States, no list likely exists of all these students. However, it is more likely to obtain a list of schools that have 6th graders, so a study design could select a random sample of schools that have 6th graders. The selected schools can then provide a list of students to do a second stage of sampling where 6th-grade students are randomly sampled within each of the sampled schools. This is a one-stage sample design and will be the type of design we will discuss in the formulas below.
- - **Example 2**: Consider a study sending interviewers to households for a survey. This is a more complicated example that requires two levels of selection to efficiently use interviewers in geographic clusters. First, in the U.S., counties could be selected as the PSU, then Census block groups within counties could be selected as the secondary sampling unit (SSU). Households could then be randomly sampled within the block groups. This type of design is popular for in-person surveys as it reduces the travel necessary for interviewers.
+ - **Example 1**: Consider a study needing a sample of 6th-grade students in the United States; no list likely exists of all these students. However, it is more likely to obtain a list of schools that have 6th graders, so a study design could select a random sample of schools that have 6th graders. The selected schools can then provide a list of students to do a second stage of sampling where 6th-grade students are randomly sampled within each of the sampled schools. This is a one-stage sample design (the "one" referring to the single level of clustering, schools) and will be the type of design we discuss in the formulas below.
+ - **Example 2**: Consider a study sending interviewers to households for a survey. This is a more complicated example that requires two levels of clustering (two-stage sample design) to efficiently use interviewers in geographic clusters. First, in the U.S., counties could be selected as the PSU, then Census block groups within counties could be selected as the secondary sampling unit (SSU). Households could then be randomly sampled within the block groups. This type of design is popular for in-person surveys as it reduces the travel necessary for interviewers.
#### The math {-}
@@ -367,19 +376,21 @@ $$\bar{y}=\frac{\sum_{i=1}^a B_i \bar{y}_{i}}{ \sum_{i=1}^a B_i}$$
Note this is a consistent but biased estimator. Often the population size is not known, so this is a method to estimate a mean without knowing the population size. The estimated standard error of the mean is:
$$se(\bar{y})= \frac{1}{\hat{N}}\sqrt{\left(1-\frac{a}{A}\right)\frac{s_a^2}{a} + \frac{A}{a} \sum_{i=1}^a \left(1-\frac{b_i}{B_i}\right) \frac{s_i^2}{b_i} }$$
-where $s_a^2$ is the between-cluster variance:
+where $\hat{N}$ is the estimated population size, $s_a^2$ is the between-cluster variance and $s_i^2$ is the within-cluster variance.
+
+The formula for the between-cluster variance ($s_a^2$) is:
$$s_a^2=\frac{1}{a-1}\sum_{i=1}^a \left( \hat{y}_i - \frac{\sum_{i=1}^a \hat{y}_{i} }{a}\right)^2$$
+where $\hat{y}_i =B_i\bar{y_i}$ .
-and $s_i^2$ is the within-cluster variance:
+The formula for the within-cluster variance ($s_i^2$) is:
$$s_i^2=\frac{1}{b_i-1} \sum_{j=1}^{b_i} \left(y_{ij}-\bar{y}_i\right)^2$$
-
-and $\hat{y}_i =B_i\bar{y_i}$ and $\hat{N}$ is the estimated population size.
+where $y_{ij}$ is the outcome for sampled unit $j$ within cluster $i$.
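Putting these pieces together, a base-R sketch of the clustered estimates with made-up data follows; it assumes the clusters were selected via SRS, so the population size is estimated as $\hat{N} = \frac{A}{a}\sum_{i=1}^a B_i$:

```r
# Toy one-stage cluster sample with made-up values: A = 10 clusters in the
# population, a = 3 sampled
A   <- 10
y   <- list(c(2, 4, 6), c(3, 5), c(8, 9, 10, 9))  # outcomes by sampled cluster
B_i <- c(20, 15, 25)     # population size of each sampled cluster
a   <- length(y)
b_i <- lengths(y)        # number of sampled units per cluster

ybar_i <- vapply(y, mean, numeric(1))
yhat_i <- B_i * ybar_i                 # estimated cluster totals
N_hat  <- (A / a) * sum(B_i)           # estimated population size (SRS of clusters)
ybar   <- sum(yhat_i) / sum(B_i)       # estimated mean

s_a2 <- var(yhat_i)                    # between-cluster variance
s_i2 <- vapply(y, var, numeric(1))     # within-cluster variances

se <- sqrt((1 - a / A) * s_a2 / a +
             (A / a) * sum((1 - b_i / B_i) * s_i2 / b_i)) / N_hat
c(mean = ybar, se = se)
```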
#### The syntax {-}
-To specify a two-stage clustered design without replacement, use the following syntax:
+Clustered sampling designs require the addition of the `ids` argument which specifies what variables are the cluster levels. To specify a two-stage clustered design without replacement, use the following syntax:
```r
clus2_des <- dat %>%
@@ -403,6 +414,8 @@ clus2wrb_des <- dat %>%
```
+Note that there is one additional argument that is sometimes necessary: `nest = TRUE`. This option relabels cluster IDs to enforce nesting within strata. For example, there may be a cluster `1` and a cluster `2` within each stratum, but these are actually different clusters. This option indicates that the repeated use of numbering does not mean they are the same cluster. If this option is not used and there are repeated cluster IDs across different strata, an error is generated.
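Conceptually, nesting treats each cluster ID as unique within its stratum, equivalent to relabeling clusters as stratum-cluster pairs. A small base-R illustration with made-up IDs:

```r
# Made-up design where cluster IDs repeat across strata: cluster 1 in stratum
# "A" is a different cluster from cluster 1 in stratum "B"
dat <- data.frame(stratum = c("A", "A", "B", "B"),
                  cluster = c(1, 2, 1, 2))

length(unique(dat$cluster))                      # 2 -- looks like shared clusters
length(unique(paste(dat$stratum, dat$cluster)))  # 4 -- the real number of clusters
```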
+
#### Example {-}
The `survey` package includes a two-stage cluster sample data, `apiclus2`, in which school districts were sampled, and then a random sample of five schools was selected within each district. For districts with fewer than five schools, all schools were sampled. School districts are identified by `dnum`, and schools are identified by `snum`. The variable `fpc1` indicates how many districts there are in California (`A`), and `fpc2` indicates how many schools were in a given district with at least 100 students (`B`). The data has a row for each school. In the data printed below, there are 757 school districts, as indicated by `fpc1`, and there are nine schools in District 731, one school in District 742, two schools in District 768, and so on as indicated by `fpc2`. For illustration purposes, the object `apiclus2_slim` has been created from `apiclus2`, which subsets the data to only the necessary columns and sorts data.
@@ -431,29 +444,55 @@ apiclus2_des
summary(apiclus2_des)
```
-The design objects are described as "2 - level Cluster Sampling design" and include the ids (cluster), FPC, and weight variables. The summary notes that the sample includes 40 first-level clusters (PSUs), which are school districts, and 126 second-level clusters (SSUs), which are schools. Additionally, the summary includes a numeric summary of the probabilities and the population size (number of PSUs) as 757.
+The design objects are described as "2 - level Cluster Sampling design" and include the ids (cluster), FPC, and weight variables. The summary notes that the sample includes 40 first-level clusters (PSUs), which are school districts, and 126 second-level clusters (SSUs), which are schools. Additionally, the summary includes a numeric summary of the probabilities of selection and the population size (number of PSUs) as 757.
+
+
+## Combining sampling methods {#samp-combo}
+
+SRS, stratified, and clustered designs are the backbone of sampling designs, and their features are often combined in one design. Additionally, rather than using SRS for selection, other sampling mechanisms are commonly used, such as probability proportional to size (PPS), systematic sampling, or selection with unequal probabilities, which are briefly described here. In PPS sampling, a size measure is constructed for each unit (e.g., the population of the PSU or the number of occupied housing units), and units with larger size measures are more likely to be sampled. Systematic sampling is commonly used to ensure representation across a population. Units are sorted by a feature, and then every $k^{th}$ unit is selected from a random start point so the sample is spread across the population. In addition to PPS, other unequal probabilities of selection may be used. For example, in a study of establishments (e.g., businesses or public institutions) that conducts a survey every year, an establishment that recently participated (e.g., participated last year) may have a reduced chance of selection in a subsequent round to reduce the burden on the establishment. To learn more about sampling designs, refer to @valliant2013practical, @cox2011business, @cochran1977sampling, and @deming1991sample.
+
+A common method of sampling is to stratify PSUs, select PSUs within each stratum using PPS selection, and then select units within the PSUs either with SRS or PPS. Reading survey documentation is an important first step in survey analysis, as it helps us understand the design of the survey we are using and identify the variables necessary to specify that design. This information is often found in user guides, methodology reports, analysis guides, or technical documentation (see Chapter \@ref(c03-understanding-survey-data-documentation) for more details).
+
+### Example {-}
+
+For example, the 2017-2019 National Survey of Family Growth (NSFG)^[2017-2019 National Survey of Family Growth (NSFG): Sample Design Documentation - https://www.cdc.gov/nchs/data/nsfg/NSFG-2017-2019-Sample-Design-Documentation-508.pdf] had a stratified multi-stage area probability sample:
+ 1. In the first stage, PSUs were counties or collections of counties, stratified by Census region/division, size (population), and MSA status. Within each stratum, PSUs were selected via PPS.
+ 2. In the second stage, neighborhoods were selected within the sampled PSUs using PPS selection.
+ 3. In the third stage, housing units were selected within the sampled neighborhoods.
+ 4. In the fourth stage, a person was randomly chosen within the selected housing units among eligible persons using unequal probabilities based on the person's age and sex.
+
+The public use file does not include all these levels of selection and instead has pseudo-strata and pseudo-clusters, which are the variables used in R to specify the design. As specified on page 4 of the documentation, the stratum variable is `SEST`, the cluster variable is `SECU`, and the weight variable is `WGT2017_2019`. Thus, to specify this design in R, use the following syntax:
+
+```r
+nsfg_des <- nsfgdata %>%
+ as_survey_design(ids = SECU,
+ strata = SEST,
+ weights = WGT2017_2019)
+```
## Replicate weights
-Replicate weights are often included on analysis files instead of, or in addition to, the design variables (strata and PSUs). Replicate weights are used as another method to estimate variability and are often used specifically so that design variables are not published as a measure to limit disclosure risk. There are several types of replicate weights, including balanced repeated replication (BRR), Fay's BRR, jackknife, and bootstrap methods. An overview of the process for using replicate weights is as follows:
+Replicate weights are often included on analysis files instead of, or in addition to, the design variables (strata and PSUs). Replicate weights are used as another method to estimate variability. Often researchers choose to use replicate weights to avoid publishing design variables (strata or clustering variables) as a measure to reduce the risk of disclosure. There are several types of replicate weights, including balanced repeated replication (BRR), Fay's BRR, jackknife, and bootstrap methods. An overview of the process for using replicate weights is as follows:
-1. Divide the sample into subsample **replicates** that mirror the design of the sample
-2. Calculate weights for each **replicate** using the same procedures for the full-sample weight (i.e., nonresponse and post-stratification)
-3. Calculate estimates for each **replicate** using the same method as the full-sample estimate
+1. Divide the sample into subsample replicates that mirror the design of the sample
+2. Calculate weights for each replicate using the same procedures for the full-sample weight (i.e., nonresponse and post-stratification)
+3. Calculate estimates for each replicate using the same method as the full-sample estimate
4. Calculate the estimated variance, which will be proportional to the variance of the replicate estimates
-The different types of replicate weights largely differ in step 1 - how the sample is divided into subsamples, and step 4 - which multiplication factors (scales) are used to multiply the variance. The general format for the standard error is:
+The different types of replicate weights largely differ in step 1 (how the sample is divided into subsamples) and step 4 (which multiplication factors, or scales, are used to multiply the variance). The general format for the standard error is:
$$ \sqrt{\alpha \sum_{r=1}^R \alpha_r (\hat{\theta}_r - \hat{\theta})^2 }$$
where $R$ is the number of replicates, $\alpha$ is a constant that depends on the replication method, $\alpha_r$ is a factor associated with each replicate, $\hat{\theta}$ is the weighted estimate based on the full sample, and $\hat{\theta}_r$ is the weighted estimate of $\theta$ based on the $r^{\text{th}}$ replicate.
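This general formula can be sketched as a small base-R helper; the estimates below are made up for illustration:

```r
# Helper implementing the general replicate standard error above; alpha and
# alpha_r depend on the replication method (for standard BRR, alpha = 1/R and
# every alpha_r = 1)
rep_se <- function(theta_hat, theta_r, alpha, alpha_r = rep(1, length(theta_r))) {
  sqrt(alpha * sum(alpha_r * (theta_r - theta_hat)^2))
}

# Made-up full-sample estimate and four replicate estimates
theta_hat <- 10
theta_r   <- c(9.5, 10.5, 9.8, 10.2)
rep_se(theta_hat, theta_r, alpha = 1 / length(theta_r))  # sqrt(0.58 / 4)
```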
-### Balanced Repeated Replication (BRR) Method
+To create the design object for surveys with replicate weights, we use `as_survey_rep()` instead of `as_survey_design()` that we use for the common sampling designs in the sections above.
+
+### Balanced Repeated Replication (BRR) method
-The BRR method requires a stratified sample design with two PSUs in each stratum. Each replicate is constructed by deleting one PSU per stratum using a Hadamard matrix. For the PSU that is included, the weight is generally multiplied by two but may have other adjustments, such as post-stratification. A Hadamard matrix is a special square matrix with entries of +1 or -1 with mutually orthogonal rows. Hadamard matrices must have one row, two rows, or a multiple of four rows. The size of the Hadamard matrix is determined by the first multiple of 4 greater than or equal to the number of strata. For example, if a survey had 7 strata, the Hadamard matrix would be an $8\times8$ matrix. Additionally, a survey with 8 strata would also have an $8\times8$ Hadamard matrix. An example of a $4\times4$ Hadamard matrix is below:
+The BRR method requires a stratified sample design with two PSUs in each stratum. Each replicate is constructed by deleting one PSU per stratum using a Hadamard matrix. For the PSU that is included, the weight is generally multiplied by two but may have other adjustments, such as post-stratification. A Hadamard matrix is a special square matrix with entries of +1 or -1 with mutually orthogonal rows. Hadamard matrices must have one row, two rows, or a multiple of four rows. The size of the Hadamard matrix is determined by the first multiple of 4 greater than or equal to the number of strata. For example, if a survey had 7 strata, the Hadamard matrix would be an $8\times8$ matrix. Additionally, a survey with 8 strata would also have an $8\times8$ Hadamard matrix. The columns in the matrix specify the strata and the rows specify the replicate. In each replicate (row), a +1 means to use the first PSU and a -1 means to use the second PSU in the estimate. For example, here is a $4\times4$ Hadamard matrix:
$$ \begin{array}{rrrr} +1 &+1 &+1 &+1\\ +1&-1&+1&-1\\ +1&+1&-1&-1\\ +1 &-1&-1&+1 \end{array} $$
-The columns specify the strata and the rows the replicate. In the first replicate, all the values are +1, so in each stratum, the first PSU would be used in the estimate. In the second replicate, the first PSU would be used in stratum 1 and 3, while the second PSU would be used in stratum 2 and 4. In the third replicate, the first PSU would be used in stratum 1 and 2, while the second PSU would be used in strata 3 and 4. Finally, in the fourth replicate, the first PSU would be used in strata 1 and 4, while the second PSU would be used in strata 2 and 3.
+In the first replicate (row), all the values are +1, so in each stratum, the first PSU would be used in the estimate. In the second replicate, the first PSU would be used in strata 1 and 3, while the second PSU would be used in strata 2 and 4. In the third replicate, the first PSU would be used in strata 1 and 2, while the second PSU would be used in strata 3 and 4. Finally, in the fourth replicate, the first PSU would be used in strata 1 and 4, while the second PSU would be used in strata 2 and 3. For more information about Hadamard matrices, see @wolter2007introduction. Note that BRR weights supplied by a data provider will already incorporate this adjustment, and the {survey} package generates the Hadamard matrix if necessary for calculating BRR weights, so an analyst does not need to provide the matrix.
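As an aside, the $4\times4$ matrix above can be built in base R with Sylvester's construction (`kronecker()` of the $2\times2$ Hadamard matrix with itself), which also lets us verify the rows are mutually orthogonal:

```r
# Sylvester construction of the 4x4 Hadamard matrix shown above
H2 <- matrix(c(1, 1, 1, -1), nrow = 2)  # the 2x2 Hadamard matrix
H4 <- kronecker(H2, H2)
H4

# Rows are mutually orthogonal: H4 %*% t(H4) equals 4 times the identity matrix
all(H4 %*% t(H4) == 4 * diag(4))  # TRUE
```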
#### The math {-}
@@ -461,15 +500,17 @@ A weighted estimate for the full sample is calculated as $\hat{\theta}$, and the
$$se(\hat{\theta})=\sqrt{\frac{1}{R} \sum_{r=1}^R \left( \hat{\theta}_r-\hat{\theta}\right)^2}$$
-Specifying replicate weights in R requires specifying the type of replicate weights, the main weight variable, the replicate weight variables, and other options. One of the key options is for `mse`. If `mse=TRUE`, variances are computed around the point estimate $(\hat{\theta})$, whereas if `mse=FALSE`, variances are computed around the mean of the replicates $(\bar{\theta})$ instead which looks like this:
+Specifying replicate weights in R requires specifying the type of replicate weights, the main weight variable, the replicate weight variables, and other options. One of the key options is for the mean squared error (MSE). If `mse=TRUE`, variances are computed around the point estimate $(\hat{\theta})$, whereas if `mse=FALSE`, variances are computed around the mean of the replicates $(\bar{\theta})$ instead which looks like this:
$$se(\hat{\theta})=\sqrt{\frac{1}{R} \sum_{r=1}^R \left( \hat{\theta}_r-\bar{\theta}\right)^2}$$ where $$\bar{\theta}=\frac{1}{R}\sum_{r=1}^R \hat{\theta}_r$$
-The default option for `mse` is to use the global option of "survey.replicates.mse" which is set to `FALSE` initially unless a user changes it. To determine if `mse` should be set to `TRUE` or `FALSE`, read the survey documentation. If there is no indication in the survey documentation, for BRR, set `mse` to `TRUE`.
+The default option for `mse` is to use the global option of "survey.replicates.mse" which is set to `FALSE` initially unless a user changes it. To determine if `mse` should be set to `TRUE` or `FALSE`, read the survey documentation. If there is no indication in the survey documentation, for BRR, we recommend setting `mse` to `TRUE` as this is the default in other software (e.g., SAS, SUDAAN).
#### The syntax {-}
-Replicate weights generally come in groups and are sequentially numbered, such as PWGTP1, PWGTP2, ..., PWGTP80 for the person weights in the American Community Survey (ACS) [@acs-pums-2021] or BRRWT1, BRRWT2, ..., BRRWT96 in the 2015 Residential Energy Consumption Survey (RECS) [@recs-2015-micro, @recs-2020-micro]. This makes it easy to use some of the tidy selection^[dplyr documentation on tidy-select: https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html] functions in R. For example, if a dataset had WT0 for the main weight and had 20 BRR weights indicated WT1, WT2, ..., WT20, we can use the following syntax (both are equivalent):
+Replicate weights generally come in groups and are sequentially numbered, such as PWGTP1, PWGTP2, ..., PWGTP80 for the person weights in the American Community Survey (ACS) [@acs-pums-2021] or BRRWT1, BRRWT2, ..., BRRWT96 in the 2015 Residential Energy Consumption Survey (RECS) [@recs-2015-micro]. This makes it easy to use some of the tidy selection^[dplyr documentation on tidy-select: https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html] functions in R.
+
+To specify a BRR design, we need to specify the weight variable (`weights`), the replicate weight variables (`repweights`), that the type of replicate weights is BRR (`type = "BRR"`), and whether the mean squared error should be used (`mse = TRUE`) or not (`mse = FALSE`). For example, if a dataset had WT0 for the main weight and 20 BRR weights indicated as WT1, WT2, ..., WT20, we can use the following syntax (both are equivalent):
```r
brr_des <- dat %>%
@@ -511,7 +552,7 @@ brr_des <- dat %>%
mse = TRUE)
```
-Typically, the replicate weights sum to a value similar to the main weight, as they are both supposed to provide population estimates. Rarely, an alternative method will be used where the replicate weights have values of 0 or 2 in the case of BRR weights. This would be indicated in the documentation (see Section \@ref(und-surv-doc) and Chapter \@ref(c03-understanding-survey-data-documentation) for more information on how to understand the provided documentation). In this case, the replicate weights are not combined, and the option `combined_weights = FALSE` should be indicated, as the default value for this argument is TRUE. This specific syntax is shown below:
+Typically, each replicate weight sums to a value similar to the main weight, as both the replicate weights and the main weight are supposed to provide population estimates. Rarely, an alternative method will be used where the replicate weights have values of 0 or 2 in the case of BRR weights. This would be indicated in the documentation (see Chapter \@ref(c03-understanding-survey-data-documentation) for more information on how to understand the provided documentation). In this case, the replicate weights are not combined, and the option `combined_weights = FALSE` should be indicated, as the default value for this argument is TRUE. This specific syntax is shown below:
```r
brr_des <- dat %>%
@@ -556,9 +597,9 @@ summary(scdbrr_des)
Note that `combined_weights` was specified as `FALSE` because these weights are simply specified as 0 and 2 and do not incorporate the overall weight. When printing the object, the type of replication is noted as Balanced Repeated Replicates, and the replicate weights and the weight variable are specified. Additionally, the summary lists the variables included.
-### Fay's BRR Method
+### Fay's BRR method
-Fay's BRR method for replicate weights is similar to the BRR method in that it uses a Hadamard matrix to construct replicate weights. However, rather than deleting PSUs for each replicate, with Fay's BRR half of the PSUs have a replicate weight which is the main weight multiplied by $\rho$, and the other half have the main weight multiplied by $(2-\rho)$ where $0 \le \rho < 1$. Note that when $\rho=0$, this is equivalent to the standard BRR weights, and as $\rho$ becomes closer to 1, this method is more similar to jackknife discussed in the next section. To obtain the value of $\rho$, it is necessary to read the documentation (see Section \@ref(und-surv-doc) and Chapter \@ref(c03-understanding-survey-data-documentation)).
+Fay's BRR method for replicate weights is similar to the BRR method in that it uses a Hadamard matrix to construct replicate weights. However, rather than deleting PSUs for each replicate, with Fay's BRR half of the PSUs have a replicate weight which is the main weight multiplied by $\rho$, and the other half have the main weight multiplied by $(2-\rho)$ where $0 \le \rho < 1$. Note that when $\rho=0$, this is equivalent to the standard BRR weights, and as $\rho$ becomes closer to 1, this method is more similar to jackknife discussed in the next section. To obtain the value of $\rho$, it is necessary to read the survey documentation (see Chapter \@ref(c03-understanding-survey-data-documentation)).
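+
+As a toy illustration of this construction (our own sketch, not from any survey's documentation), the weights for one replicate can be formed by multiplying the main weight by $\rho$ for one half of the PSUs and by $(2-\rho)$ for the other half, with the halves determined by a row of the Hadamard matrix:
+
+```r
+# Toy sketch: form one Fay's BRR replicate weight from the main weight.
+# `half` is a hypothetical assignment of PSUs to halves for this replicate
+# (in practice derived from a Hadamard matrix row).
+rho <- 0.3
+wt0 <- c(100, 100, 150, 150)
+half <- c(1, 2, 2, 1)
+wt_rep <- ifelse(half == 1, wt0 * rho, wt0 * (2 - rho))
+wt_rep
+```
+
+With $\rho = 0$, the first half would get weight 0 and the second half double the main weight, recovering standard BRR.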
#### The math {-}
@@ -568,7 +609,7 @@ $$se(\hat{\theta})=\sqrt{\frac{1}{R (1-\rho)^2} \sum_{r=1}^R \left( \hat{\theta}
#### The syntax {-}
-The syntax is very similar for BRR and Fay's BRR. If a dataset had WT0 for the main weight and had 20 BRR weights indicated as WT1, WT2, ..., WT20, and Fay's multiplier is 0.5, use the following syntax:
+The syntax is very similar for BRR and Fay's BRR. To specify a Fay's BRR design, we need to specify the weight variable (`weights`), the replicate weight variables (`repweights`), that the type of replicate weights is Fay's BRR (`type = "Fay"`), whether the mean squared error should be used (`mse = TRUE`) or not (`mse = FALSE`), and Fay's multiplier (`rho`). For example, if a dataset had WT0 for the main weight, 20 BRR weights indicated as WT1, WT2, ..., WT20, and a Fay's multiplier of 0.3, use the following syntax:
```r
fay_des <- dat %>%
@@ -576,14 +617,12 @@ fay_des <- dat %>%
repweights = num_range("WT", 1:20),
type = "Fay",
mse = TRUE,
- rho = 0.5)
+ rho = 0.3)
```
#### Example {-}
-The 2015 RECS [@recs-2015-micro] uses Fay's BRR weights with the final weight as NWEIGHT and replicate weights as BRRWT1 - BRRWT96 with $\rho=0.5$. On the file, DOEID is a unique identifier for each respondent, TOTALDOL is the total cost of energy, TOTSQFT_EN is the total square footage of the residence, and REGOINC is the Census region. We have already read in the RECS data and created a dataset called `recs_2015` above in the prerequisites.
-
-To specify this design, use the following syntax:
+The 2015 RECS [@recs-2015-micro] uses Fay's BRR weights with the final weight as NWEIGHT and replicate weights as BRRWT1 - BRRWT96, and the documentation specifies a Fay's multiplier of 0.5. On the file, DOEID is a unique identifier for each respondent, TOTALDOL is the total cost of energy, TOTSQFT_EN is the total square footage of the residence, and REGIONC is the Census region. We have already pulled in the 2015 RECS data from the {srvyrexploR} package, which provides the data for this book. To specify the design for the `recs_2015` data, use the following syntax:
```{r}
#| label: samp-des-recs-2015-read
@@ -611,11 +650,11 @@ summary(recs_2015_des)
```
-In specifying the design, the `variables` option was also used to include which variables might be used in analyses. This is optional but can make our object smaller. When printing the design object or looking at the summary, the replicate weight type is re-iterated as `Fay's variance method (rho= 0.5) with 96 replicates and MSE variances`, and the variables are included. No weight or probability summary is included in this output as we have seen in some other design objects.
+In specifying the design, the `variables` option was also used to include which variables might be used in analyses. This is optional but can make our object smaller and easier to work with. When printing the design object or looking at the summary, the replicate weight type is re-iterated as `Fay's variance method (rho= 0.5) with 96 replicates and MSE variances`, and the variables are included. No weight or probability summary is included in this output as we have seen in some other design objects.
### Jackknife method
-There are three jackknife estimators implemented in {srvyr} - Jackknife 1 (JK1), Jackknife n (JKn), and Jackknife 2 (JK2). The JK1 method can be used for unstratified designs, and replicates are created by removing one PSU at a time so the number of replicates is the same as the number of PSUs. If there is no clustering, then the PSU is the ultimate sampling unit (e.g., unit).
+There are three jackknife estimators implemented in {srvyr}: jackknife 1 (JK1), jackknife n (JKn), and jackknife 2 (JK2). The JK1 method can be used for unstratified designs, and replicates are created by removing one PSU at a time so the number of replicates is the same as the number of PSUs. If there is no clustering, then the PSU is the ultimate sampling unit (e.g., unit).
The JKn method is used for stratified designs and requires two or more PSUs per stratum. In this case, each replicate is created by deleting one PSU from a single stratum, so the number of replicates is the number of total PSUs across all strata. The JK2 method is a special case of JKn when there are exactly 2 PSUs sampled per stratum. For variance estimation, scaling constants must also be specified.
@@ -630,9 +669,7 @@ $$se(\hat{\theta})=\sqrt{\sum_{r=1}^R \alpha_r \left( \hat{\theta}_r-\hat{\theta
#### The syntax {-}
-To specify the Jackknife method, the type would be `JK1`, `JKn`, or `JK2`. Additionally, the overall multiplier for JK1 is specified with the scale argument, whereas the replicate-specific multiplier ($\alpha_r$) is specified with the scales argument.
-
-Consider a case for the JK1 method where the multiplier, $(R-1)/R=19/20=0.95$ and the dataset had WT0 for the main weight and had 20 JK1 weights indicated WT1, WT2, ..., WT20, then the syntax would be
+To specify the jackknife method, we use the survey documentation to understand the type of jackknife (1, n, or 2) and the multiplier. In the syntax, we need to specify the weight variable (`weights`), the replicate weight variables (`repweights`), the type of replicate weights as jackknife 1 (`type = "JK1"`), n (`type = "JKn"`), or 2 (`type = "JK2"`), whether the mean squared error should be used (`mse = TRUE`) or not (`mse = FALSE`), and the multiplier (`scale`). For example, if the survey uses a jackknife 1 method with a multiplier of $\alpha_r=(R-1)/R=19/20=0.95$, and the dataset has WT0 for the main weight and 20 replicate weights indicated as WT1, WT2, ..., WT20, use the following syntax:
```r
jk1_des <- dat %>%
@@ -643,7 +680,7 @@ jk1_des <- dat %>%
scale=0.95)
```
-Consider a case for the JKn method where $\alpha_r=0.1$ for all replicates and the dataset had WT0 for the main weight and had 20 JK1 weights indicated as WT1, WT2, ..., WT20, then the syntax would be:
+For a jackknife n method, we need to specify the multiplier for all replicates. In this case, we use the `rscales` argument to specify each one. The documentation will provide details on what the multipliers ($\alpha_r$) are, and they may be the same for all replicates. For example, consider a case where $\alpha_r=0.1$ for all replicates and the dataset has WT0 for the main weight and 20 replicate weights indicated as WT1, WT2, ..., WT20. We specify the type as `type = "JKn"` and the multipliers as `rscales = rep(0.1, 20)`:
```r
jkn_des <- dat %>%
@@ -698,7 +735,7 @@ recs_des <- recs_2020 %>%
When printing the design object or looking at the summary, the replicate weight type is re-iterated as `Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances`, and the variables are included. No weight or probability summary is included.
-### Bootstrap Method
+### Bootstrap method
In bootstrap resampling, replicates are created by selecting random samples of the PSUs with replacement (SRSWR). If there are $M$ PSUs in the sample, then each replicate will be created by selecting a random sample of $M$ PSUs with replacement. Each replicate is created independently, and the weights for each replicate are adjusted to reflect the population, generally using the same method as how the analysis weight was adjusted.
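+
+As a toy sketch of the resampling step only (the subsequent weight adjustment is survey-specific and omitted here), one bootstrap replicate draws $M$ PSUs with replacement from the $M$ sampled PSUs:
+
+```r
+# Toy sketch: select PSUs for one bootstrap replicate (SRSWR).
+# `psu_ids` is a hypothetical vector of the M sampled PSU identifiers.
+psu_ids <- c("psu01", "psu02", "psu03", "psu04", "psu05")
+set.seed(52)
+sample(psu_ids, size = length(psu_ids), replace = TRUE)
+```
+
+PSUs drawn more than once appear multiple times in the replicate, and the replicate weights are then adjusted to reflect the population.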
@@ -708,12 +745,12 @@ A weighted estimate for the full sample is calculated as $\hat{\theta}$, and the
$$se(\hat{\theta})=\sqrt{\alpha \sum_{r=1}^R \left( \hat{\theta}_r-\hat{\theta}\right)^2}$$
-where $\alpha$ is the scaling constant. Note that the scaling constant ($\alpha$) is provided in the documentation as there are many types of bootstrap methods which generate custom scaling constants.
+where $\alpha$ is the scaling constant. Note that the scaling constant ($\alpha$) is provided in the survey documentation as there are many types of bootstrap methods which generate custom scaling constants.
#### The syntax {-}
-If a dataset had WT0 for the main weight, 20 bootstrap weights indicated WT1, WT2, ..., WT20, and $\alpha=.02$, use the following syntax:
+To specify a bootstrap method, we need to specify the weight variable (`weights`), the replicate weight variables (`repweights`), the type of replicate weights as bootstrap (`type = "bootstrap"`), whether the mean squared error should be used (`mse = TRUE`) or not (`mse = FALSE`), and the multiplier (`scale`). For example, if a dataset had WT0 for the main weight, 20 bootstrap weights indicated as WT1, WT2, ..., WT20, and a multiplier of $\alpha=0.02$, use the following syntax:
```r
bs_des <- dat %>%
@@ -725,11 +762,9 @@ bs_des <- dat %>%
```
-Since $\alpha$ is a constant, it is not generally a variable on the dataset and is entered into the code as a constant.
-
#### Example {-}
-Returning to the api example, we are going to create a dataset with bootstrap weights to use as an example. In this example, we construct a one-cluster design with fifty replicate weights.
+Returning to the api example, we are going to create a dataset with bootstrap weights to use as an example. In this example, we construct a one-cluster design with fifty replicate weights.^[We provide the code here for you to replicate this example, but are not focusing on the creation of the weights as that is outside the scope of this book. We recommend you reference @wolter2007introduction for more information on creating bootstrap weights.]
```{r}
#| label: samp-des-genbs
@@ -780,32 +815,19 @@ summary(api1_bs_des)
As with other replicate design objects, when printing the object or looking at the summary, the replicate weights are provided along with the data variables.
-## Understanding survey design documentation {#und-surv-doc}
-
-SRS, stratified, and clustered designs are the backbone of sampling designs, and the features are often combined in one design. Additionally, rather than using SRS for selection, other sampling mechanisms are commonly used, such as probability proportional to size (PPS), systematic sampling, or selection with unequal probabilities, which are briefly described here. In PPS sampling, a size measure is constructed for each unit (e.g., the population of the PSU or the number of occupied housing units) and then units with larger size measures are more likely to be sampled. Systematic sampling is commonly used to ensure representation across a population. Units are sorted by a feature and then every $k$ units are selected from a random start point so the sample is spread across the population. In addition to PPS, other unequal probabilities of selection may be used. For example, in a study of establishments (e.g., businesses or public institutions) that conducts a survey every year, an establishment that recently participated (e.g., participated last year) may have a reduced chance of selection in a subsequent round to reduce the burden on the establishment. To learn more about sampling designs, refer to @valliant2013practical, @cox2011business, @cochran1977sampling, and @deming1991sample.
-
-A common method of sampling is to stratify PSUs, select PSUs within the stratum using PPS selection, and then select units within the PSUs either with SRS or PPS. Reading survey documentation is an important first step in survey analysis to understand the design of the survey we are using and variables necessary to specify the design. Good documentation will highlight the variables necessary to specify the design. This is often found in User's Guides, methodology, analysis guides, or technical documentation (see Chapter \@ref(c03-understanding-survey-data-documentation) for more details).
-
-#### Example {-}
-
-For example, the 2017-2019 National Survey of Family Growth (NSFG)^[2017-2019 National Survey of Family Growth (NSFG): Sample Design Documentation - https://www.cdc.gov/nchs/data/nsfg/NSFG-2017-2019-Sample-Design-Documentation-508.pdf] had a stratified multi-stage area probability sample. In the first stage, PSUs are counties or collections of counties and are stratified by Census region/division, size (population), and MSA status. Within each stratum, PSUs were selected via PPS. In the second stage, neighborhoods were selected within the sampled PSUs using PPS selection. In the third stage, housing units were selected within the sampled neighborhoods. In the fourth stage, a person was randomly chosen within the selected housing units among eligible persons using unequal probabilities based on the person's age and sex. The public use file does not include all these levels of selection and instead has pseudo-strata and pseudo-clusters, which are the variables used in R to specify the design. As specified on page 4 of the documentation, the stratum variable is `SEST`, the cluster variable is `SECU`, and the weight variable is `WGT2017_2019`. Thus, to specify this design in R, use the following syntax:
-
-```r
-nsfg_des <- nsfgdata %>%
- as_survey_design(ids = SECU,
- strata = SEST,
- weights = WGT2017_2019)
-```
## Exercises
-1. The American National Election Studies (ANES) collect data before and after elections approximately every four years around the presidential election cycle. Each year with the data release, a user's guide is also released^[ANES 2020 User's Guide: https://electionstudies.org/wp-content/uploads/2022/02/anes_timeseries_2020_userguidecodebook_20220210.pdf]. What is the syntax for specifying the analysis of the full sample post-election data?
+1. The National Health Interview Survey (NHIS) is an annual household survey conducted by the National Center for Health Statistics (NCHS). The NHIS covers a wide variety of health topics for adults, including health status and conditions, functioning and disability, health care access and health service utilization, health-related behaviors, health promotion, mental health, barriers to care, and community engagement. Like many national in-person surveys, the NHIS uses a stratified clustered sampling design, with details included in the Survey Description^[2022 National Health Interview Survey (NHIS) Survey Description: https://www.cdc.gov/nchs/nhis/2022nhis.htm]. The Survey Description provides information on setting up syntax in SUDAAN, Stata, SPSS, SAS, and R ({survey} package implementation). How would you specify the design with {srvyr}, using either `as_survey_design()` or `as_survey_rep()`?
```r
-anes_des <- anes_data %>%
- as_survey_design(weight)
+nhis_adult_des <- nhis_adult_data %>%
+ as_survey_design(ids=PPSU,
+ strata=PSTRAT,
+ nest=TRUE,
+ weights=WTFA_A)
```
2. The General Social Survey is a survey that has been administered since 1972 on social, behavioral, and attitudinal topics. The 2016-2020 GSS Panel codebook^[2016-2020 GSS Panel Codebook Release 1a: https://gss.norc.org/Documents/codebook/2016-2020%20GSS%20Panel%20Codebook%20-%20R1a.pdf] provides examples of setting up syntax in SAS and Stata but not R. How would you specify the design in R?
diff --git a/13-ncvs-vignette.Rmd b/13-ncvs-vignette.Rmd
index ca491799..51a4a70d 100644
--- a/13-ncvs-vignette.Rmd
+++ b/13-ncvs-vignette.Rmd
@@ -26,7 +26,7 @@ library(srvyrexploR)
library(gt)
```
-We will use data from NCVS. Here is the code to read in the three datasets from the {srvyrexploR} package:
+We will use data from the United States National Crime Victimization Survey (NCVS). Here is the code to read in the three datasets from the {srvyrexploR} package:
```{r}
#| label: ncvs-data
#| cache: TRUE
@@ -38,7 +38,7 @@ data(ncvs_2021_person)
## Introduction
-The United States National Crime Victimization Survey (NCVS) is a household survey sponsored by the Bureau of Justice Statistics (BJS), which collects data on criminal victimization, including characteristics of the crimes, offenders, and victims. Crime types include both household and personal crimes, as well as violent and non-violent crimes. The target population of this survey is all people in the United States age 12 and older living in housing units and noninstitutional group quarters.
+The NCVS is a household survey sponsored by the Bureau of Justice Statistics (BJS), which collects data on criminal victimization, including characteristics of the crimes, offenders, and victims. Crime types include both household and personal crimes, as well as violent and non-violent crimes. The target population of this survey is all people in the United States age 12 and older living in housing units and noninstitutional group quarters.
The NCVS has been ongoing since 1992. An earlier survey, the National Crime Survey, was run from 1972 to 1991 [@ncvs_tech_2016]. The survey is administered using a rotating panel. When an address enters the sample, the residents of that address are interviewed every six months for a total of seven interviews. If the initial residents move away from the address during the period, the new residents are included in the survey, as people are not followed when they move.
@@ -54,7 +54,7 @@ The data from ICPSR is distributed with five files, each having its unique ident
- Incident Record - `YEARQ`, `IDHH`, `IDPER`
- 2021 Collection Year Incident - `YEARQ`, `IDHH`, `IDPER`
-We will focus on the household, person, and incident files. From these files, we selected a subset of columns for examples to use in this vignette. We have included data in our OSF repository, but you can download the complete files at ICPSR^[https://www.icpsr.umich.edu/web/NACJD/studies/38429].
+We will focus on the household, person, and incident files. From these files, we selected a subset of columns for the examples in this vignette. We have included these data, with the subset of columns, in the {srvyrexploR} package, but you can download the complete files at ICPSR^[https://www.icpsr.umich.edu/web/NACJD/studies/38429].
## Survey Notation
@@ -102,7 +102,7 @@ For victimization rates, we need to know the victimization status for both victi
Each record on the incident file represents one victimization, which is not the same as one incident. Some victimizations have several instances that make it difficult for the victim to differentiate the details of these incidents, labeled as "series crimes". Appendix A of the User's Guide indicates how to calculate the series weight in other statistical languages.
-Here, we adapt that code for R. Essentially, if a victimization is a series crime, its series weight is top-coded at 10 based on the number of actual victimizations. If an incident is a series crime, but the number of occurrences is unknown, the series weight is set to 6. A description of the variables used to create indicators of series and the associated weights is included in Table \@ref(tab:cb-incident).
+Here, we adapt that code for R. Essentially, if a victimization is a series crime, its series weight is top-coded at 10 based on the number of actual victimizations; that is, even if the crime occurred more than 10 times, it is counted as 10 times to reduce the influence of extreme outliers. If an incident is a series crime, but the number of occurrences is unknown, the series weight is set to 6. A description of the variables used to create indicators of series and the associated weights is included in Table \@ref(tab:cb-incident).
Table: (\#tab:cb-incident) Codebook for incident variables - related to series weight
@@ -121,7 +121,7 @@ Table: (\#tab:cb-incident) Codebook for incident variables - related to series w
| | | 8 | Residue (invalid data) |
| WGTVICCY | Adjusted victimization weight | | Numeric |
-We want to create four variables to indicate if an incident is a series crime. First, we create a variable called series using `V4017`, `V4018`, and `V4019`. Next, we top code the number of incidents (`V4016`). Finally, we create the series weight using our new top-coded variable and the existing weight.
+We want to create four variables to indicate if an incident is a series crime. First, we create a variable called `series` using `V4017`, `V4018`, and `V4019`, where an incident is considered a series crime if there are 6 or more incidents (`V4017`), the incidents are similar in detail (`V4018`), or there is not enough detail to distinguish the incidents (`V4019`). Next, we top-code the number of incidents (`V4016`) by creating a variable `n10v4016`, which is set to 10 if `V4016 > 10`. Finally, we create the series weight using our new top-coded variable and the existing weight.
```{r}
#| label: ncvs-vign-incfile
@@ -132,7 +132,7 @@ inc_series <- ncvs_2021_incident %>%
series = case_when(V4017 %in% c(1, 8) ~ 1,
V4018 %in% c(2, 8) ~ 1,
V4019 %in% c(1, 8) ~ 1,
- TRUE ~ 2 # series
+ TRUE ~ 2
),
n10v4016 = case_when(V4016 %in% c(997, 998) ~ NA_real_,
V4016 > 10 ~ 10,
@@ -635,19 +635,19 @@ vt2a
vt2b
```
-The number of victimizations estimated using the incident file is equivalent to the person and household file method. There are `r vt1$Property_Vzn` property incidents and `r vt1$Violent_Vzn` violent incidents in a six-month period.
+The number of victimizations estimated using the incident file is equivalent to the person and household file method. There are `r prettyNum(vt1$Property_Vzn, big.mark=",")` property incidents and `r prettyNum(vt1$Violent_Vzn, big.mark=",")` violent incidents in a six-month period.
### Estimation 2: Victimization Proportions {#vic-prop}
Victimization proportions are proportions describing features of a victimization. The key here is that these are questions among victimizations, not among the population. These types of estimates can only be calculated using the incident design object (`inc_des`).
-For example, we could be interested in the percentage of property victimizations reported to the police:
+For example, we could be interested in the percentage of property victimizations reported to the police, as shown in the following code, which outputs the estimate, its standard error, and the 95% confidence interval:
```{r}
#| label: ncvs-vign-vic-prop-police
prop1 <- inc_des %>%
filter(Property) %>%
- summarize(Pct = survey_mean(ReportPolice, na.rm = TRUE) * 100)
+ summarize(Pct = survey_mean(ReportPolice, na.rm = TRUE, proportion=TRUE, vartype=c("se", "ci")) * 100)
prop1
```
@@ -706,7 +706,7 @@ pers_des %>%
))
```
-A common desire is to calculate victimization rates by several characteristics. For example, we may want to calculate the violent victimization rate and aggravated assault rate by sex, race/Hispanic origin, age group, marital status, and household income. This requires a `group_by()` statement for each categorization separately. Thus, we make a function to do this and then use `map_df()` from the {purrr} package (part of the tidyverse) to loop through the variables. Finally, the {gt} package is used to make a publishable table shown in Table \@ref(tab:ncvs-vign-rates-demo-tab).
+A common desire is to calculate victimization rates by several characteristics. For example, we may want to calculate the violent victimization rate and aggravated assault rate by sex, race/Hispanic origin, age group, marital status, and household income. This requires a separate `group_by()` statement for each categorization, so we write a function to do this and then use `map()` from the {purrr} package (part of the tidyverse) to loop through the variables. This function takes a demographic variable as its input (`byvar`) and calculates the violent and aggravated assault victimization rates for each level. It then creates columns with the variable name, the level of the variable, and a numeric version of the variable (`LevelNum`) for sorting later. The function is run across multiple variables using `map()`, and the results are stacked into a single output using `bind_rows()`.
```{r}
#| label: ncvs-vign-rates-demo
@@ -729,7 +729,14 @@ pers_est_by <- function(byvar) {
pers_est_df <-
c("Sex", "RaceHispOrigin", "AgeGroup", "MaritalStatus", "Income") %>%
- map_df(pers_est_by)
+ map(pers_est_by) %>%
+ bind_rows()
+```
+
+The output from all the estimates is cleaned to create better labels, such as changing "RaceHispOrigin" to "Race/Hispanic Origin". Finally, the {gt} package is used to make a publishable table (Table \@ref(tab:ncvs-vign-rates-demo-tab)). Using functions from the {gt} package, column labels and footnotes are added, and estimates are presented to the first decimal place.
+
+```{r}
+#| label: ncvs-vgn-rates-demo-gt-create
vr_gt<-pers_est_df %>%
mutate(
@@ -789,6 +796,8 @@ vr_gt<-pers_est_df %>%
by type of crime and demographic characteristics, 2021")
```
+
+
```{r}
#| label: ncvs-vign-rates-demo-noeval
#| eval: false
@@ -835,6 +844,40 @@ pers_prev_ests
In the example above, the indicator is multiplied by 100 to return a percentage rather than a proportion. In 2021, we estimate that `r formatC(pers_prev_ests$Violent_Prev, digits=2, format="f")`% of people aged 12 and older were a victim of violent crime in the United States, and `r formatC(pers_prev_ests$AAST_Prev, digits=2, format="f")`% were victims of aggravated assault.
+## Statistical testing
+
+For any of the types of estimates discussed, we can also perform statistical testing. For example, we could test whether property victimization rates are different between properties that are owned versus rented. First, we calculate the point estimates.
+
+```{r}
+prop_tenure <- hh_des %>%
+ group_by(Tenure) %>%
+ summarize(
+ Property_Rate = survey_mean(Property * ADJINC_WT * 1000,
+ na.rm = TRUE, vartype="ci"),
+ )
+
+prop_tenure
+```
+
+The property victimization rate for rented households is `r prop_tenure %>% filter(Tenure=="Rented") %>% pull(Property_Rate) %>% round(1)` per 1,000 households, while the property victimization rate for owned households is `r prop_tenure %>% filter(Tenure=="Owned") %>% pull(Property_Rate) %>% round(1)` per 1,000 households. These seem very different, especially given the non-overlapping confidence intervals. However, survey estimates are not independent, so statistical testing cannot be done by comparing confidence intervals. To conduct the statistical test, we first need to create a variable to compare that incorporates the adjusted incident weight (`ADJINC_WT`), and then the test can be conducted as discussed in Chapter \@ref(c06-statistical-testing).
+
+```{r}
+prop_tenure_test <- hh_des %>%
+ mutate(
+ Prop_Adj=Property * ADJINC_WT * 1000
+ ) %>%
+ svyttest(
+ formula = Prop_Adj ~ Tenure,
+ design = .,
+ na.rm = TRUE
+ ) %>%
+ broom::tidy()
+
+prop_tenure_test
+```
+
+The output of the statistical test shows the same difference of `r prop_tenure_test$estimate %>% round(1)` between the property victimization rates of renters and owners, and the test is highly significant, with a p-value of `r prettyunits::pretty_p_value(prop_tenure_test$p.value)`.
+
## Exercises
1. What proportion of completed motor vehicle thefts are not reported to the police? Hint: Use the codebook to look at the definition of Type of Crime (V4529).
@@ -865,3 +908,25 @@ hh_des %>%
summarize(Property_Rate = survey_mean(Property * ADJINC_WT * 1000,
na.rm = TRUE))
```
+
+4. What is the difference between the violent victimization rate between males and females? Is it statistically different?
+
+```{r}
+pers_des %>%
+ group_by(Sex) %>%
+ summarize(
+ Violent_rate=survey_mean(Violent * ADJINC_WT * 1000, na.rm=TRUE)
+ )
+
+pers_des %>%
+ mutate(
+ Violent_Adj=Violent * ADJINC_WT * 1000
+ ) %>%
+ svyttest(
+ formula = Violent_Adj ~ Sex,
+ design = .,
+ na.rm = TRUE
+ ) %>%
+ broom::tidy()
+```
+
diff --git a/14-ambarom-vignette.Rmd b/14-ambarom-vignette.Rmd
index 3f099f3e..e7b6fbaf 100644
--- a/14-ambarom-vignette.Rmd
+++ b/14-ambarom-vignette.Rmd
@@ -27,7 +27,7 @@ library(gt)
library(ggpattern)
```
-In this vignette, we will be using data from the 2021 AmericasBarometer survey. Download the raw files yourself from the [LAPOP website](http://datasets.americasbarometer.org/database/index.php). This book uses version 1.2 of the data and each country has its own file for a total of 22 files. To read all files into R and ignore the Stata labels, we recommend running code like this:
+In this vignette, we use a subset of data from the 2021 AmericasBarometer survey. Download the raw files, available on the [LAPOP website](http://datasets.americasbarometer.org/database/index.php). We work with version 1.2 of the data, and there are separate files for each of the 22 countries. To read all files into R while ignoring the Stata labels, we recommend running code like this:
```r
stata_files <- list.files(here("RawData", "LAPOP_2021"), "*.dta")
@@ -48,25 +48,24 @@ ambarom_in <- here("RawData", "LAPOP_2021", stata_files) %>%
r15, r18n, r18)
```
-The code above will read all files of type `.dta` in and stack them into one tibble. We then selected a subset of variables for this vignette.
+The code above reads all `.dta` files and combines them into one tibble.
:::
## Introduction
-The AmericasBarometer surveys are conducted by the LAPOP Lab [@lapop]. These surveys are public opinion surveys of the Americas focused on democracy. The study was launched in 2004/2005 with 11 countries, with the countries growing and fluctuating over time, and creates a study with consistent methodology across many countries. In 2021, the study included 22 countries ranging from the north in Canada to the South in Chile and Argentina [@lapop-about].
+The AmericasBarometer surveys, conducted by the LAPOP Lab [@lapop], are public opinion surveys of the Americas focused on democracy. The study was launched in 2004/2005 with 11 countries. Though the set of countries grows and fluctuates over time, the AmericasBarometer maintains a consistent methodology across many countries. In 2021, the study included 22 countries ranging from Canada in the north to Chile and Argentina in the south [@lapop-about].
-Historically, surveys were administered with face-to-face household interviews, but the COVID-19 pandemic changed the study significantly to the use of random-digit dialing (RDD) of mobile phones in all countries except the United States and Canada [@lapop-tech]. In Canada, LAPOP collaborated with the Environics Institute to collect data from a panel of Canadians using a web survey [@lapop-can]. While in the United States, YouGov conducted the survey on behalf of LAPOP by conducting a web survey among their panelists [@lapop-usa].
+Historically, surveys were administered through in-person household interviews, but the COVID-19 pandemic changed the study significantly. Now, random-digit dialing (RDD) of mobile phones is used in all countries except the United States and Canada [@lapop-tech]. In Canada, LAPOP collaborated with the Environics Institute to collect data from a panel of Canadians using a web survey [@lapop-can]. In the United States, YouGov conducted the survey on behalf of LAPOP by conducting a web survey among its panelists [@lapop-usa].
-The survey has a core set of questions across the countries, but not all questions are asked everywhere. Additionally, some questions are only asked to half of the respondents within a country, presumably to reduce the burden as different sections are randomized to different respondents [@lapop-svy].
+The survey includes a core set of questions for all countries, but not every question is asked in each country. Additionally, some questions are only posed to half of the respondents in a country, with different sections randomized to respondents [@lapop-svy].
-## Data Structure
+## Data structure
-Each country and year has its own file available in Stata format (`.dta`). In this vignette, we downloaded and stacked all the data from all 22 participating countries in 2021. We subset the data to a smaller set of columns as noted in the prerequisites box for usage in the vignette. To understand variables that are used across the several countries, the core questionnaire is useful [@lapop-svy].
+Each country and year has its own file available in Stata format (`.dta`). In this vignette, we download and combine all the data from the 22 participating countries in 2021. We subset the data to a smaller set of columns, as noted in the prerequisites box. Review the core questionnaire to understand the common variables across the countries [@lapop-svy].
## Preparing files
-Many of the variables are coded as numeric and do not have intuitive variable names, so the next step is to create derived variables and analysis-ready data. Using the core questionnaire as a codebook, derived variables are created below with relevant factors with informative names.
-
+Many of the variables are coded as numeric and do not have intuitive variable names, so the next step is to create derived variables and wrangle the data for analysis. Using the core questionnaire as a codebook, we reference the factor descriptions to create derived variables with informative names:
```{r}
#| label: ambarom-read-secret
@@ -82,7 +81,7 @@ lapop_rds_files <- osf_retrieve_node("https://osf.io/z5c3m/") %>%
pattern = ".rds")
filedet <- lapop_rds_files %>%
- osf_download(conflicts = "overwrite", path = here::here("osf_dl"))
+ osf_download(conflicts = "overwrite")
ambarom_in <- filedet %>%
pull(local_path) %>%
@@ -91,7 +90,6 @@ ambarom_in <- filedet %>%
unlink(pull(filedet, "local_path"))
```
-
```{r}
#| label: ambarom-derive
ambarom <- ambarom_in %>%
@@ -138,17 +136,21 @@ ambarom <- ambarom_in %>%
Internet = r18)
```
-At this point, it is helpful to check the cross-tabs between the original variables and the newly derived variables. By outputting these tables, we can check to make sure that we have correctly aligned the numeric data from the original data to the factored data with informative labels in the new data.
+At this point, it is a good time to check the cross-tabs between the original and newly derived variables. These tables help us confirm that we have correctly matched the numeric data from the original dataset to the renamed factor data in the new dataset. For instance, let's check the original variable `pais` and the derived variable `Country`. We can consult the questionnaire or codebook to confirm that Argentina is coded as `17`, Bolivia as `10`, etc. Similarly, for `CovidWorry` and `covid2at`, we can verify that `Very worried` is coded as `1`, and so on for the other variables.
```{r}
#| label: ambarom-derive-check
-ambarom %>% count(Country, pais) %>% print(n = 22)
-ambarom %>% count(CovidWorry, covid2at)
+ambarom %>%
+ count(Country, pais) %>%
+ print(n = 22)
+
+ambarom %>%
+ count(CovidWorry, covid2at)
```
## Survey design objects
-The technical report is the best source to understand how to specify the sampling design in R [@lapop-tech]. The data includes two weights: `wt` and `weight1500`. The first weight variable is country-specific and sums to the sample size but is calibrated to reflect each country's demographics, while the second weight variable sums to 1500 for each country. The second weight is indicated as the weight to use for multi-country analyses. While the documentation does not directly state this, the example Stata syntax (`svyset upm [pw=weight1500], strata(strata)`) indicates the variable `upm` is a clustering variable, and `strata` is the strata variable. Therefore, the design object is setup in R as follows:
+The technical report is the best reference for understanding how to specify the sampling design in R [@lapop-tech]. The data includes two weights: `wt` and `weight1500`. The first weight variable is specific to each country and sums to the sample size, but it is calibrated to reflect each country's demographics. The second weight variable sums to 1500 for each country and is recommended for multi-country analyses. Although not explicitly stated in the documentation, the Stata syntax example (`svyset upm [pw=weight1500], strata(strata)`) indicates the variable `upm` is a clustering variable and `strata` is the strata variable. Therefore, the design object is created in R as follows:
```{r}
#| label: ambarom-design
@@ -158,28 +160,55 @@ ambarom_des <- ambarom %>%
weight = weight1500)
```
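+
+For readers who prefer the {survey} package directly, an equivalent specification is sketched below (a sketch, assuming the same `ambarom` data; `ambarom_des_survey` is an illustrative name):
+
+```r
+# Sketch: the same design specified with survey::svydesign()
+ambarom_des_survey <- survey::svydesign(
+  ids = ~upm,            # clustering variable
+  strata = ~strata,      # strata variable
+  weights = ~weight1500, # multi-country analysis weight
+  data = ambarom
+)
+```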
-One interesting thing to note is that these can only give us estimates to compare countries but not multi-country estimates since the weights do not account for different sizes of countries. For example, Canada has about 10% of the population of the United States, but an estimate that uses records from both countries would weigh them equally.
+One interesting thing to note is that these weight variables can provide estimates for comparing countries but not for multi-country estimates. The reason is that the weights do not account for the different sizes of countries. For example, Canada has about 10% of the population of the United States, but an estimate that uses records from both countries would weigh them equally.
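+
+We can check this property of `weight1500` directly. A quick sketch, using the `ambarom` tibble created above:
+
+```r
+# Each country's weights sum to roughly 1500,
+# regardless of the country's actual population size
+ambarom %>%
+  group_by(Country) %>%
+  summarize(weight_sum = sum(weight1500, na.rm = TRUE))
+```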
+
+## Calculating estimates {#ambarom-estimates}
+
+When calculating estimates from the data, we use the survey design object `ambarom_des` and then apply the `survey_mean()` function. The next sections walk through a few examples.
-## Calculating estimates and making tables {#ambarom-tables}
+### Example: Worried about COVID
-This survey was administered in 2021 between March and August, varying by country^[See Table 2 in @lapop-tech for dates by country]. Given the state of the pandemic at that time, several questions about COVID were included. The first question about COVID asked whether people were worried about the possibility that they or someone in their household will get sick from coronavirus in the next three months. We will calculate the percentage of people in each country who are very worried or somewhat worried.
+This survey was administered between March and August of 2021, with the specific timing varying by country^[See Table 2 in @lapop-tech for dates by country]. Given the state of the pandemic at that time, several questions about COVID were included. The first question about COVID asked:
-In the following code, we calculate estimate for each country and then create a table (see Table \@ref(tab:ambarom-est1-tab)) of the estimates for display using the {gt} package.
+> How worried are you about the possibility that you or someone in your household will get sick from coronavirus in the next 3 months?
+>
+> - Very worried
+> - Somewhat worried
+> - A little worried
+> - Not worried at all
+
+If we are interested in those who are very worried or somewhat worried, we can create a new variable (`CovidWorry_bin`) that groups levels of the original question using the `fct_collapse()` function from the {forcats} package. We then use the `survey_count()` function to understand how responses are distributed across each category of the original variable (`CovidWorry`) and the new variable (`CovidWorry_bin`).
```{r}
-#| label: ambarom-est1
-covid_worry_country_ests <-
- ambarom_des %>%
+#| label: ambarom-worry-est1
+covid_worry_collapse <- ambarom_des %>%
mutate(CovidWorry_bin = fct_collapse(
CovidWorry,
WorriedHi = c("Very worried", "Somewhat worried"),
WorriedLo = c("A little worried", "Not worried at all")
- )) %>%
+ ))
+
+covid_worry_collapse %>%
+ survey_count(CovidWorry_bin, CovidWorry)
+```
+
+With this new variable, we can now use `survey_mean()` to calculate the percentage of people in each country who are either very or somewhat worried about COVID. There are missing data, as indicated in the `survey_count()` output above, so we need to use `na.rm = TRUE` in the `survey_mean()` function to handle the missing values.
+
+```{r}
+#| label: ambarom-worry-est2
+covid_worry_country_ests <- covid_worry_collapse %>%
group_by(Country) %>%
summarize(p = survey_mean(CovidWorry_bin == "WorriedHi",
- na.rm = TRUE) * 100)
+ na.rm = TRUE) * 100)
+
+covid_worry_country_ests
+```
-covid_worry_country_ests_gt<-covid_worry_country_ests %>%
+To view the results for all countries, we can use the {gt} package to create Table \@ref(tab:ambarom-worry-tab).
+
+```{r}
+#| label: ambarom-worry-gt
+covid_worry_country_ests_gt <- covid_worry_country_ests %>%
gt(rowname_col = "Country") %>%
cols_label(p = "Percent",
p_se = "SE") %>%
@@ -188,23 +217,25 @@ covid_worry_country_ests_gt<-covid_worry_country_ests %>%
```
```{r}
-#| label: ambarom-est1-noeval
+#| label: ambarom-worry-noeval
#| eval: false
covid_worry_country_ests_gt
```
-(ref:ambarom-est1-tab) Proportion worried about the possibility that they or someone in their household will get sick from coronavirus in the next 3 months
+(ref:ambarom-worry-tab) Percentage worried about the possibility that they or someone in their household will get sick from coronavirus in the next 3 months
```{r}
-#| label: ambarom-est1-tab
+#| label: ambarom-worry-tab
#| echo: FALSE
#| warning: FALSE
covid_worry_country_ests_gt %>%
- print_gt_book(knitr::opts_current$get()[["label"]])
+ print_gt_book(knitr::opts_current$get()[["label"]])
```
-Another question asked how education was affected by the pandemic. This question was asked among households with children under the age of 13, and respondents could select more than one option as follows:
+### Example: Education affected by COVID
+
+Respondents were also asked a question about how the pandemic affected education. This question was asked of households with children under the age of 13, and respondents could select more than one option, as follows:
> Did any of these children have their school education affected due to the pandemic?
>
@@ -214,36 +245,61 @@ Another question asked how education was affected by the pandemic. This question
> | - Yes, they switched to a combination of virtual and in-person classes
> | - Yes, they cut all ties with the school
-Multiple-choice questions are interesting. If we want to look at how education was impacted only among those in school, we need to filter to the relevant responses, which is anyone that responded **no** to the first part. The variable `Educ_NotInSchool` in the dataset has values of 0 and 1. A value of 1 means that the respondent selected the first option in the question (none of the children are in school) and a value of 0 means that at least one of their children are in school. Using this variable, we can filter the data to only those with a value of 0.
+Working with multiple-choice questions can be both challenging and interesting. Let's walk through how to analyze this question. If we are interested in the impact on education, we should focus on respondents whose children are attending school. This means we need to exclude those who selected the first response option: "No, because they are not yet school age or because they do not attend school for another reason." To do this, we use the `Educ_NotInSchool` variable in the dataset, which has values of `0` and `1`. A value of `1` indicates that the respondent chose the first response option (none of the children are in school), and a value of `0` means that at least one of their children is in school. Filtering the data to those with a value of `0` keeps only respondents with at least one child attending school.
-There are three additional variables that we can look at that correlate to the second option (`Educ_NormalSchool`), third option (`Educ_VirtualSchool`), and fourth option (`Educ_Hybrid`). An unweighted cross-tab for the responses is included below, and we can see there is a wide-range of impacts and that many combinations of effects on education are possible.
+Now, let's review the data for those who selected one of the next three response options:
+
+- No, their classes continued normally: `Educ_NormalSchool`
+- Yes, they went to virtual or remote classes: `Educ_VirtualSchool`
+- Yes, they switched to a combination of virtual and in-person classes: `Educ_Hybrid`
+
+The unweighted cross-tab for these responses is included below. It reveals a wide range of impacts, where many combinations of effects on education are possible.
```{r}
#| label: ambarom-covid-ed-skip
-ambarom %>% filter(Educ_NotInSchool == 0) %>%
- distinct(Educ_NormalSchool,
+ambarom %>%
+ filter(Educ_NotInSchool == 0) %>%
+ count(Educ_NormalSchool,
Educ_VirtualSchool,
- Educ_Hybrid) %>%
- print(n = 50)
+ Educ_Hybrid)
```
-We might create multiple outcomes for a table as follows:
+In reviewing the survey question, we might be interested in knowing the answers to the following:
-- Indicator that school continued as normal with no virtual or hybrid option
-- Indicator that the education medium was changed - either virtual or hybrid
+- What percentage of households indicated that school continued as normal with no virtual or hybrid option?
+- What percentage of households indicated that the education medium was changed to either virtual or hybrid?
+- What percentage of households indicated that they cut ties with their school?
-In this next code chunk, we create these indicators, make national estimates, and display a summary table of the data shown in Table \@ref(tab:ambarom-covid-ed-der-tab).
+To find the answers, we create indicators for the first two questions, make national estimates for all three questions, and then construct a summary table for easy viewing. First, we create and inspect the indicators and their distributions using `survey_count()`.
```{r}
-#| label: ambarom-covid-ed-der
+#| label: ambarom-covid-ed-inds
ambarom_des_educ <- ambarom_des %>%
filter(Educ_NotInSchool == 0) %>%
- mutate(Educ_OnlyNormal = (Educ_NormalSchool == 1 &
- Educ_VirtualSchool == 0 &
- Educ_Hybrid == 0),
- Educ_MediumChange = (Educ_VirtualSchool == 1 |
- Educ_Hybrid == 1))
+ mutate(
+ Educ_OnlyNormal = (Educ_NormalSchool == 1 &
+ Educ_VirtualSchool == 0 &
+ Educ_Hybrid == 0),
+ Educ_MediumChange = (Educ_VirtualSchool == 1 |
+ Educ_Hybrid == 1)
+ )
+
+ambarom_des_educ %>%
+ survey_count(Educ_OnlyNormal,
+ Educ_NormalSchool,
+ Educ_VirtualSchool,
+ Educ_Hybrid)
+ambarom_des_educ %>%
+ survey_count(Educ_MediumChange,
+ Educ_VirtualSchool,
+ Educ_Hybrid)
+```
+
+Next, we group the data by country and calculate the population estimates for our three questions.
+
+```{r}
+#| label: ambarom-covid-ed-ests
covid_educ_ests <-
ambarom_des_educ %>%
group_by(Country) %>%
@@ -253,14 +309,23 @@ covid_educ_ests <-
p_noschool = survey_mean(Educ_NoSchool, na.rm = TRUE) * 100,
)
-covid_educ_ests_gt<-covid_educ_ests %>%
+covid_educ_ests
+```
+
+Finally, to view the results for all countries, we can use the {gt} package to construct Table \@ref(tab:ambarom-covid-ed-der-tab).
+
+```{r}
+#| label: ambarom-covid-ed-gt
+covid_educ_ests_gt <- covid_educ_ests %>%
gt(rowname_col = "Country") %>%
- cols_label(p_onlynormal = "%",
- p_onlynormal_se = "SE",
- p_mediumchange = "%",
- p_mediumchange_se = "SE",
- p_noschool = "%",
- p_noschool_se = "SE") %>%
+ cols_label(
+ p_onlynormal = "%",
+ p_onlynormal_se = "SE",
+ p_mediumchange = "%",
+ p_mediumchange_se = "SE",
+ p_noschool = "%",
+ p_noschool_se = "SE"
+ ) %>%
tab_spanner(label = "Normal school only",
columns = c("p_onlynormal", "p_onlynormal_se")) %>%
tab_spanner(label = "Medium change",
@@ -288,11 +353,11 @@ covid_educ_ests_gt %>%
print_gt_book(knitr::opts_current$get()[["label"]])
```
-Of the countries that used this question, many had households where their children had an education medium change, except Haiti, where only `r covid_educ_ests %>% filter(Country=="Haiti") %>% pull(p_mediumchange) %>% signif(.,2)`% of households with students changed to virtual or hybrid learning.
+In the countries that were asked this question, many households experienced a change in their child's education medium. However, in Haiti, only `r covid_educ_ests %>% filter(Country=="Haiti") %>% pull(p_mediumchange) %>% signif(.,2)`% of households with children switched to virtual or hybrid learning.
-## Mapping survey data
+## Mapping survey data {#ambarom-maps}
-While the table presents the data well, a map could also be used. To obtain maps of the countries, the package {rnaturalearth} is used, subsetting North and South America using the function `ne_countries()`. This returns an sf object with many columns but, most importantly `soverignt` (sovereignty), `geounit` (country or territory), and `geometry` (the shape). As an example of the difference between soverignty and country/territory, the United States, Puerto Rico, and the US Virgin Islands are all separate units with the same sovereignty. This map (without data) is plotted in Figure \@ref(fig:ambarom-americas-map).
+While the table effectively presents the data, a map could also be insightful. To generate maps of the countries, we can use the package {rnaturalearth} and subset North and South America with the `ne_countries()` function. The function returns an sf (simple features) object with many columns, but most importantly, `sovereignt` (sovereignty), `geounit` (country or territory), and `geometry` (the shape). For an example of the difference between sovereignty and country/territory, the United States, Puerto Rico, and the US Virgin Islands are all separate units with the same sovereignty. A map without data is plotted in Figure \@ref(fig:ambarom-americas-map).
```{r}
#| label: ambarom-americas-map
@@ -306,11 +371,11 @@ country_shape <-
)
country_shape %>%
- ggplot() +
+ ggplot() +
geom_sf()
```
-This map in Figure \@ref(fig:ambarom-americas-map) is very wide as the Aleutian islands in Alaska extend into the Eastern Hemisphere. We can crop the shape file to only the Western Hemisphere to remove some of the trailing islands of Alaska.
+The map in Figure \@ref(fig:ambarom-americas-map) appears very wide due to the Aleutian islands in Alaska extending into the Eastern Hemisphere. We can crop the shapefile to include only the Western Hemisphere, which removes some of the trailing islands of Alaska.
```{r}
#| label: ambarom-update-map
@@ -322,27 +387,61 @@ country_shape_crop <- country_shape %>%
ymax = 90))
```
-Now that we have the shape files we need, our next step is to match our survey data to the map. Countries can be called by different names (e.g., "U.S", "U.S.A", "United States"). To make sure we can plot our survey data on the map, we will need to make sure the country in both datasets match. To do this, we can use the `anti_join()` function and check to see what countries are in the survey data but not in the map data. As shown below, the United States is referred to as "United States" in the survey data but "United States of America" in the map data. The code below shows countries in the survey but not the map data.
+Now that we have the necessary shape files, our next step is to match our survey data to the map. Countries can be named differently (e.g., "U.S.", "U.S.A.", "United States"). To make sure we can visualize our survey data on the map, we need to match the country names in both the survey data and the map data. To do this, we can use the `anti_join()` function to identify the countries in the survey data that aren't in the map data. For example, as shown below, the United States is referred to as "United States" in the survey data but "United States of America" in the map data. Table \@ref(tab:ambarom-map-merge-check-1-tab) shows the countries in the survey data but not the map data, and Table \@ref(tab:ambarom-map-merge-check-2-tab) shows the countries in the map data but not the survey data.
```{r}
-#| label: ambarom-map-merge-check
+#| label: ambarom-map-merge-check-1-gt
survey_country_list <- ambarom %>% distinct(Country)
-survey_country_list %>%
- anti_join(country_shape_crop, by = c("Country" = "geounit"))
+
+survey_country_list_gt <- survey_country_list %>%
+ anti_join(country_shape_crop, by = c("Country" = "geounit")) %>%
+ gt()
+```
+
+```{r}
+#| label: ambarom-map-merge-check-1-noeval
+#| eval: false
+survey_country_list_gt
```
-The code below shows countries in the map data but not hte survey data.
+(ref:ambarom-map-merge-check-1-tab) Countries in the survey data but not the map data
```{r}
-#| label: ambarom-map-merge-check-2
-country_shape_crop %>% as_tibble() %>%
+#| label: ambarom-map-merge-check-1-tab
+#| echo: FALSE
+#| warning: FALSE
+
+survey_country_list_gt %>%
+ print_gt_book(knitr::opts_current$get()[["label"]])
+```
+
+```{r}
+#| label: ambarom-map-merge-check-2-gt
+map_country_list_gt <- country_shape_crop %>%
+  as_tibble() %>%
select(geounit, sovereignt) %>%
anti_join(survey_country_list, by = c("geounit" = "Country")) %>%
arrange(geounit) %>%
- print(n = 30)
+ gt()
+```
+
+```{r}
+#| label: ambarom-map-merge-check-2-noeval
+#| eval: false
+map_country_list_gt
+```
+
+(ref:ambarom-map-merge-check-2-tab) Countries in the map data but not the survey data
+
+```{r}
+#| label: ambarom-map-merge-check-2-tab
+#| echo: FALSE
+#| warning: FALSE
+
+map_country_list_gt %>%
+ print_gt_book(knitr::opts_current$get()[["label"]])
```
-With the mismatched names, there are several ways to remedy the data to join later. The most straightforward fix is to rename the shape object's data before merging. We then can plot the survey estimates after merging the data.
+There are several ways to fix the mismatched names for a successful join. The simplest solution is to rename the data in the shape object before merging. Since only one country name in the survey data differs from the map data, we rename the map data accordingly.
```{r}
#| label: ambarom-update-map-usa
@@ -351,89 +450,113 @@ country_shape_upd <- country_shape_crop %>%
"United States", geounit))
```
-To merge the data and make a map, we begin with the map file, merge the estimates data, and then plot. Let's use the outcomes we created in section \@ref(ambarom-tables) for the table output (`covid_worry_country_ests` and `covid_educ_ests`). Figures \@ref(fig:ambarom-make-maps-covid) and \@ref(fig:ambarom-make-maps-covid-ed) display the maps for each measure.
+Now that the country names match, we can merge the survey and map data and then plot the results. We begin with the map file and merge it with the survey estimates generated in Section \@ref(ambarom-estimates) (`covid_worry_country_ests` and `covid_educ_ests`). We use the {dplyr} function `full_join()`, which joins the rows in the map data and the survey estimates based on the columns `geounit` and `Country`. A full join keeps all the rows from both datasets, matching rows when possible. For any rows without matches, the function fills in an `NA` for the missing value.
```{r}
-#| label: ambarom-make-maps-covid
-#| fig.cap: "Percent of people worried someone in their household will get COVID-19 in the next 3 months by country"
-#| error: true
+#| label: ambarom-join-maps-ests
covid_sf <- country_shape_upd %>%
full_join(covid_worry_country_ests,
by = c("geounit" = "Country")) %>%
full_join(covid_educ_ests,
by = c("geounit" = "Country"))
+```
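+
+The `NA`-filling behavior of `full_join()` can be illustrated with a small hypothetical example (toy tibbles, not the survey data):
+
+```r
+# Toy sketch: rows without a match get NA for the other table's columns
+maps_toy <- tibble(geounit = c("Canada", "Mexico"),
+                   shape = c("shape1", "shape2"))
+ests_toy <- tibble(Country = c("Mexico", "Haiti"),
+                   p = c(50, 20))
+
+full_join(maps_toy, ests_toy, by = c("geounit" = "Country"))
+# Canada is kept with p = NA; Haiti is kept with shape = NA
+```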
+
+After the merge, we create two figures that display the population estimates for the percentage of people worried about COVID (Figure \@ref(fig:ambarom-make-maps-covid)) and the percentage of households with at least one child participating in virtual or hybrid learning (Figure \@ref(fig:ambarom-make-maps-covid-ed)).
+
+```{r}
+#| label: ambarom-make-maps-covid
+#| fig.cap: "Percent of households worried someone in their household will get COVID-19 in the next 3 months by country"
+#| error: true
ggplot() +
- geom_sf(data = covid_sf, aes(fill = p, geometry = geometry)) +
+ geom_sf(data = covid_sf,
+ aes(fill = p, geometry = geometry),
+ color = "darkgray") +
scale_fill_gradientn(
guide = "colorbar",
name = "Percent",
labels = scales::comma,
- colors = c("#BFD7EA", "#087E8B", "#0B3954"),
+ colors = c("#BFD7EA", "#087e8b", "#0B3954"),
na.value = NA
) +
geom_sf_pattern(
data = filter(covid_sf, is.na(p)),
pattern = "crosshatch",
- pattern_fill = "black",
- fill = NA
+ pattern_fill = "lightgray",
+ pattern_color = "lightgray",
+ fill = NA,
+ color = "darkgray"
) +
theme_minimal()
```
```{r}
#| label: ambarom-make-maps-covid-ed
-#| fig.cap: "Percent of students who participated in virtual or hybrid learning"
+#| fig.cap: "Percent of households who had at least one child participate in virtual or hybrid learning"
#| error: true
ggplot() +
- geom_sf(data = covid_sf, aes(fill = p_mediumchange, geometry = geometry)) +
+ geom_sf(
+ data = covid_sf,
+ aes(fill = p_mediumchange, geometry = geometry),
+ color = "darkgray"
+ ) +
scale_fill_gradientn(
guide = "colorbar",
name = "Percent",
labels = scales::comma,
- colors = c("#BFD7EA", "#087E8B", "#0B3954"),
+ colors = c("#BFD7EA", "#087e8b", "#0B3954"),
na.value = NA
) +
geom_sf_pattern(
data = filter(covid_sf, is.na(p_mediumchange)),
pattern = "crosshatch",
- pattern_fill = "black",
- fill = NA
+ pattern_fill = "lightgray",
+ pattern_color = "lightgray",
+ fill = NA,
+ color = "darkgray"
) +
theme_minimal()
```
-In Figure \@ref(fig:ambarom-make-maps-covid-ed) we can see that Canada, Mexico, and the United States have missing data (the crosshatch pattern). Reviewing the questionnaires indicate that these three countries did not include the education question in the survey. To better see the differences in the data, it may make sense to remove North America from the map and focus on Central and South America. This is done below by restricting the shape files to Latin America and the Caribbean as seen in Figure \@ref(fig:ambarom-make-maps-covid-ed-c-s)
+In Figure \@ref(fig:ambarom-make-maps-covid-ed), we observe missing data (represented by the crosshatch pattern) for Canada, Mexico, and the United States. The questionnaires indicate that these three countries did not include the education question in the survey. To focus on countries with available data, we can remove North America from the map and show only Central and South America. We do this below by restricting the shape files to Latin America and the Caribbean, as depicted in Figure \@ref(fig:ambarom-make-maps-covid-ed-c-s).
```{r}
#| label: ambarom-make-maps-covid-ed-c-s
-#| fig.cap: "Percent of students who participated in virtual or hybrid learning, Central and South America"
+#| fig.cap: "Percent of households who had at least one child participate in virtual or hybrid learning, Central and South America"
#| error: true
covid_c_s <- covid_sf %>%
filter(region_wb == "Latin America & Caribbean")
ggplot() +
- geom_sf(data = covid_c_s, aes(fill = p_mediumchange, geometry = geometry)) +
+ geom_sf(
+ data = covid_c_s,
+ aes(fill = p_mediumchange, geometry = geometry),
+ color = "darkgray"
+ ) +
scale_fill_gradientn(
guide = "colorbar",
name = "Percent",
labels = scales::comma,
- colors = c("#BFD7EA", "#087E8B", "#0B3954"),
+ colors = c("#BFD7EA", "#087e8b", "#0B3954"),
na.value = NA
) +
geom_sf_pattern(
data = filter(covid_c_s, is.na(p_mediumchange)),
pattern = "crosshatch",
- pattern_fill = "black",
- fill = NA
- ) +
+ pattern_fill = "lightgray",
+ pattern_color = "lightgray",
+ fill = NA,
+ color = "darkgray"
+ ) +
theme_minimal()
```
+In Figure \@ref(fig:ambarom-make-maps-covid-ed-c-s), we can see that most countries with available data have similar percentages (reflected in their similar shades). However, Haiti stands out with a lighter shade, indicating a considerably lower percentage of households with at least one child participating in virtual or hybrid learning.
+
## Exercises
-1. Calculate the percentage of households with broadband internet and those with any internet at home, including from phone or tablet. Hint: if you see countries with 0% Internet usage, you may want to filter by something first.
+1. Calculate the percentage of households with broadband internet and those with any internet at home, including from a phone or tablet. Hint: if you come across countries with 0% internet usage, you may want to filter by something first.
```{r}
#| label: ambarom-int-prev
@@ -450,7 +573,7 @@ int_ests %>%
print(n = 30)
```
-2. Make a faceted map showing both broadband internet and any internet usage.
+2. Create a faceted map showing both broadband internet and any internet usage.
```{r}
#| label: ambarom-facet-map
@@ -467,7 +590,8 @@ b_int_sf <- internet_sf %>%
filter(region_wb == "Latin America & Caribbean")
b_int_sf %>%
- ggplot(aes(fill = p)) +
-  geom_sf() +
+  ggplot(aes(fill = p)) +
+  geom_sf(color = "darkgray") +
facet_wrap( ~ Type) +
scale_fill_gradientn(
@@ -480,8 +604,10 @@ b_int_sf %>%
geom_sf_pattern(
data = filter(b_int_sf, is.na(p)),
pattern = "crosshatch",
- pattern_fill = "black",
- fill = NA
+ pattern_fill = "lightgray",
+ pattern_color = "lightgray",
+ fill = NA,
+ color = "darkgray"
) +
theme_minimal()
```
\ No newline at end of file
diff --git a/DataCleaningScripts/00_Run.R b/DataCleaningScripts/00_Run.R
deleted file mode 100644
index 1ec96c83..00000000
--- a/DataCleaningScripts/00_Run.R
+++ /dev/null
@@ -1,4 +0,0 @@
-rmarkdown::render(
- input=here::here("DataCleaningScripts", "LAPOP_2021_DataPrep.Rmd"),
- envir=new.env()
-)
diff --git a/DataCleaningScripts/ANES Codebook Metadata.xlsx b/DataCleaningScripts/ANES Codebook Metadata.xlsx
deleted file mode 100644
index 30c72710..00000000
--- a/DataCleaningScripts/ANES Codebook Metadata.xlsx
+++ /dev/null
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:b5b72b23c3f647d5e8108be3201f4dcf9528720b83c815245162c09101083743
-size 12737
diff --git a/DataCleaningScripts/ANES_2020_DataPrep.Rmd b/DataCleaningScripts/ANES_2020_DataPrep.Rmd
deleted file mode 100644
index e69de29b..00000000
diff --git a/DataCleaningScripts/ANES_2020_DataPrep.md b/DataCleaningScripts/ANES_2020_DataPrep.md
deleted file mode 100644
index e69de29b..00000000
diff --git a/DataCleaningScripts/LAPOP_2021_DataPrep.Rmd b/DataCleaningScripts/LAPOP_2021_DataPrep.Rmd
deleted file mode 100644
index 8457d6ab..00000000
--- a/DataCleaningScripts/LAPOP_2021_DataPrep.Rmd
+++ /dev/null
@@ -1,87 +0,0 @@
----
-title: "AmericasBarometer 2021"
-output:
- github_document:
- html_preview: false
----
-
-```{r setup, include=FALSE}
-knitr::opts_chunk$set(echo = TRUE)
-```
-
-## Data information
-
-All data and resources were downloaded from http://datasets.americasbarometer.org/database/ on May 7, 2023.
-
-```{r}
-#| label: loadpackages
-
-library(tidyverse) #data manipulation
-library(haven) #data import
-library(tidylog) #informative logging messages
-library(osfr) # be sure to have PAT saved in Renviron as OSF_PAT
-```
-
-## Import data and create derived variables
-
-```{r}
-#| label: derivedata
-
-stata_files <- osf_retrieve_node("https://osf.io/z5c3m/") %>%
- osf_ls_files(path="LAPOP_2021", n_max=40, pattern=".dta")
-
-read_stata_unlabeled <- function(osf_tbl_i){
- filedet <- osf_tbl_i %>%
- osf_download(conflicts="overwrite", path=here::here("osf_dl"))
-
- tibin <- filedet %>%
- pull(local_path) %>%
- read_stata() %>%
- zap_labels() %>%
- zap_label()
-
- unlink(pull(filedet, "local_path"))
-
- return(tibin)
-}
-
-lapop_in <- stata_files %>%
- split(1:nrow(stata_files)) %>%
- map_df(read_stata_unlabeled)
-
-# https://www.vanderbilt.edu/lapop/ab2021/AB2021-Core-Questionnaire-v17.5-Eng-210514-W-v2.pdf
-lapop <- lapop_in %>%
- select(pais, strata, upm, weight1500, strata, core_a_core_b,
- q2, q1tb, covid2at, a4, idio2, idio2cov, it1, jc13,
- m1, mil10a, mil10e, ccch1, ccch3, ccus1, ccus3,
- edr, ocup4a, q14, q11n, q12c, q12bn,
- starts_with("covidedu1"), gi0n,
- r15, r18n, r18
- )
-
-
-```
-
-
-
-## Save data
-
-```{r savedat}
-
-summary(lapop)
-
-dir.create(here::here("osf_dl", "LAPOP_2021"))
-
-lapop_temp_loc <- here::here("osf_dl", "LAPOP_2021", "lapop_2021.rds")
-
-write_rds(lapop, lapop_temp_loc)
-
-# target_dir <- osf_retrieve_node("https://osf.io/gzbkn/?view_only=8ca80573293b4e12b7f934a0f742b957")
-
-target_dir <- osf_retrieve_node("https://osf.io/z5c3m/")
-
-osf_upload(target_dir, path=here::here("osf_dl", "LAPOP_2021"), conflicts="overwrite")
-
-unlink(lapop_temp_loc)
-```
-
diff --git a/DataCleaningScripts/LAPOP_2021_DataPrep.md b/DataCleaningScripts/LAPOP_2021_DataPrep.md
deleted file mode 100644
index 41347bba..00000000
--- a/DataCleaningScripts/LAPOP_2021_DataPrep.md
+++ /dev/null
@@ -1,149 +0,0 @@
-AmericasBarometer 2021
-================
-
-## Data information
-
-All data and resources were downloaded from
- on May 7, 2023.
-
-``` r
-library(tidyverse) #data manipulation
-library(haven) #data import
-library(tidylog) #informative logging messages
-library(osfr) # be sure to have PAT saved in Renviron as OSF_PAT
-```
-
-## Import data and create derived variables
-
-``` r
-stata_files <- osf_retrieve_node("https://osf.io/z5c3m/") %>%
- osf_ls_files(path="LAPOP_2021", n_max=40, pattern=".dta")
-
-read_stata_unlabeled <- function(osf_tbl_i){
- filedet <- osf_tbl_i %>%
- osf_download(conflicts="overwrite", path=here::here("osf_dl"))
-
- tibin <- filedet %>%
- pull(local_path) %>%
- read_stata() %>%
- zap_labels() %>%
- zap_label()
-
- unlink(pull(filedet, "local_path"))
-
- return(tibin)
-}
-
-lapop_in <- stata_files %>%
- split(1:nrow(stata_files)) %>%
- map_df(read_stata_unlabeled)
-
-# https://www.vanderbilt.edu/lapop/ab2021/AB2021-Core-Questionnaire-v17.5-Eng-210514-W-v2.pdf
-lapop <- lapop_in %>%
- select(pais, strata, upm, weight1500, strata, core_a_core_b,
- q2, q1tb, covid2at, a4, idio2, idio2cov, it1, jc13,
- m1, mil10a, mil10e, ccch1, ccch3, ccus1, ccus3,
- edr, ocup4a, q14, q11n, q12c, q12bn,
- starts_with("covidedu1"), gi0n,
- r15, r18n, r18
- )
-```
-
- ## select: dropped 483 variables (idnum, uniq_id, year, wave, nationality, …)
-
-## Save data
-
-``` r
-summary(lapop)
-```
-
- ## pais strata upm weight1500 core_a_core_b
- ## Min. : 1.00 Min. :1.000e+08 Min. :1.001e+07 Min. :0.004136 Length:64352
- ## 1st Qu.: 6.00 1st Qu.:6.000e+08 1st Qu.:6.153e+07 1st Qu.:0.251556 Class :character
- ## Median :11.00 Median :1.100e+09 Median :1.202e+08 Median :0.417251 Mode :character
- ## Mean :13.03 Mean :1.303e+09 Mean :1.666e+08 Mean :0.512805
- ## 3rd Qu.:17.00 3rd Qu.:1.700e+09 3rd Qu.:2.105e+08 3rd Qu.:0.674477
- ## Max. :41.00 Max. :4.100e+09 Max. :1.135e+09 Max. :7.024495
- ##
- ## q2 q1tb covid2at a4 idio2 idio2cov
- ## Min. : 16.00 Min. :1.000 Min. :1.000 Min. : 1.00 Min. :1.000 Min. :1.000
- ## 1st Qu.: 27.00 1st Qu.:1.000 1st Qu.:1.000 1st Qu.: 3.00 1st Qu.:2.000 1st Qu.:1.000
- ## Median : 36.00 Median :2.000 Median :2.000 Median : 22.00 Median :3.000 Median :1.000
- ## Mean : 38.86 Mean :1.521 Mean :2.076 Mean : 36.73 Mean :2.439 Mean :1.242
- ## 3rd Qu.: 49.00 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.: 71.00 3rd Qu.:3.000 3rd Qu.:1.000
- ## Max. :121.00 Max. :3.000 Max. :4.000 Max. :865.00 Max. :3.000 Max. :2.000
- ## NA's :90 NA's :90 NA's :6686 NA's :4965 NA's :2766 NA's :31580
- ## it1 jc13 m1 mil10a mil10e ccch1
- ## Min. :1.000 Min. :1.00 Min. :1.00 Min. :1.00 Min. :1.00 Min. :1.00
- ## 1st Qu.:2.000 1st Qu.:1.00 1st Qu.:2.00 1st Qu.:2.00 1st Qu.:2.00 1st Qu.:1.00
- ## Median :2.000 Median :2.00 Median :3.00 Median :3.00 Median :2.00 Median :1.00
- ## Mean :2.275 Mean :1.62 Mean :2.98 Mean :2.72 Mean :2.39 Mean :1.78
- ## 3rd Qu.:3.000 3rd Qu.:2.00 3rd Qu.:4.00 3rd Qu.:3.00 3rd Qu.:3.00 3rd Qu.:2.00
- ## Max. :4.000 Max. :2.00 Max. :5.00 Max. :4.00 Max. :4.00 Max. :4.00
- ## NA's :3631 NA's :50827 NA's :33238 NA's :49939 NA's :44021 NA's :50535
- ## ccch3 ccus1 ccus3 edr ocup4a q14
- ## Min. :1.00 Min. :1.00 Min. :1.00 Min. :0.000 Min. :1.000 Min. :1.0
- ## 1st Qu.:1.00 1st Qu.:1.00 1st Qu.:1.00 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.0
- ## Median :2.00 Median :1.00 Median :2.00 Median :2.000 Median :1.000 Median :2.0
- ## Mean :1.82 Mean :1.58 Mean :1.76 Mean :2.192 Mean :2.627 Mean :1.6
- ## 3rd Qu.:2.00 3rd Qu.:2.00 3rd Qu.:2.00 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:2.0
- ## Max. :3.00 Max. :4.00 Max. :3.00 Max. :3.000 Max. :7.000 Max. :2.0
- ## NA's :51961 NA's :50028 NA's :51226 NA's :4114 NA's :29505 NA's :44130
- ## q11n q12c q12bn covidedu1_1 covidedu1_2 covidedu1_3
- ## Min. :1.000 Min. : 1.000 Min. : 0.000 Min. :0.00 Min. :0.00 Min. :0.00
- ## 1st Qu.:1.000 1st Qu.: 3.000 1st Qu.: 0.000 1st Qu.:0.00 1st Qu.:0.00 1st Qu.:0.00
- ## Median :2.000 Median : 4.000 Median : 1.000 Median :0.00 Median :0.00 Median :1.00
- ## Mean :2.214 Mean : 4.036 Mean : 1.001 Mean :0.17 Mean :0.07 Mean :0.62
- ## 3rd Qu.:3.000 3rd Qu.: 5.000 3rd Qu.: 2.000 3rd Qu.:0.00 3rd Qu.:0.00 3rd Qu.:1.00
- ## Max. :7.000 Max. :20.000 Max. :16.000 Max. :1.00 Max. :1.00 Max. :1.00
- ## NA's :31198 NA's :29144 NA's :29449 NA's :51297 NA's :51297 NA's :51297
- ## covidedu1_4 covidedu1_5 gi0n r15 r18n r18
- ## Min. :0.00 Min. :0.00 Min. :1.000 Min. :0.000 Min. :0.000 Min. :0.000
- ## 1st Qu.:0.00 1st Qu.:0.00 1st Qu.:1.000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:1.000
- ## Median :0.00 Median :0.00 Median :1.000 Median :1.000 Median :1.000 Median :1.000
- ## Mean :0.12 Mean :0.08 Mean :1.646 Mean :0.513 Mean :0.537 Mean :0.815
- ## 3rd Qu.:0.00 3rd Qu.:0.00 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000
- ## Max. :1.00 Max. :1.00 Max. :5.000 Max. :1.000 Max. :1.000 Max. :1.000
- ## NA's :51297 NA's :51297 NA's :1240 NA's :4118 NA's :4386 NA's :4249
-
-``` r
-dir.create(here::here("osf_dl", "LAPOP_2021"))
-```
-
- ## Warning in dir.create(here::here("osf_dl", "LAPOP_2021")):
- ## 'C:\Users\steph\Documents\GitHub\tidy-survey-book\osf_dl\LAPOP_2021' already exists
-
-``` r
-lapop_temp_loc <- here::here("osf_dl", "LAPOP_2021", "lapop_2021.rds")
-
-write_rds(lapop, lapop_temp_loc)
-
-# target_dir <- osf_retrieve_node("https://osf.io/gzbkn/?view_only=8ca80573293b4e12b7f934a0f742b957")
-
-target_dir <- osf_retrieve_node("https://osf.io/z5c3m/")
-
-osf_upload(target_dir, path=here::here("osf_dl", "LAPOP_2021"), conflicts="overwrite")
-```
-
- ## Searching for conflicting files on OSF
-
- ## Retrieving 24 of 24 available items:
-
- ## ..retrieved 10 items
-
- ## ..retrieved 20 items
-
- ## ..retrieved 24 items
-
- ## ..done
-
- ## Updating 1 existing file(s) on OSF
-
- ## # A tibble: 1 × 3
- ## name id meta
- ##
- ## 1 LAPOP_2021 647ce3443c3a380884a04379
-
-``` r
-unlink(lapop_temp_loc)
-```
diff --git a/DataCleaningScripts/RECS 2020 Codebook Questions.xlsx b/DataCleaningScripts/RECS 2020 Codebook Questions.xlsx
deleted file mode 100644
index 60d9eb84..00000000
--- a/DataCleaningScripts/RECS 2020 Codebook Questions.xlsx
+++ /dev/null
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:3322a8e1e810839b553f99ebb6e1e1bbd069e3480e6cee3f45e937914023d5b3
-size 68588
diff --git a/DataCleaningScripts/RECS_2020_DataPrep.Rmd b/DataCleaningScripts/RECS_2020_DataPrep.Rmd
deleted file mode 100644
index e69de29b..00000000
diff --git a/DataCleaningScripts/RECS_2020_DataPrep.md b/DataCleaningScripts/RECS_2020_DataPrep.md
deleted file mode 100644
index e69de29b..00000000
diff --git a/book.bib b/book.bib
index 784794f1..e75c4972 100644
--- a/book.bib
+++ b/book.bib
@@ -253,6 +253,14 @@ @book{wickham2023ggplot2
publisher = {Springer},
howpublished = {\url{https://ggplot2-book.org/}}
}
+@book{wickham2023r4ds,
+ title = {R for Data Science: Import, Tidy, Transform, Visualize, and Model Data},
+ author = {Wickham, Hadley and Çetinkaya-Rundel, Mine and Grolemund, Garrett},
+ edition = {2nd Edition},
+ year = 2023,
+ publisher = {O'Reilly Media},
+ howpublished = {\url{https://r4ds.hadley.nz/}}
+}
@misc{acs-pums-2021,
title = {{Understanding and Using the American Community Survey Public Use Microdata Sample Files What Data Users Need to Know}},
author = {{U.S. Census Bureau}},
@@ -455,4 +463,4 @@ @Article{naniar
number = {7},
pages = {1--31},
doi = {10.18637/jss.v105.i07},
-}
\ No newline at end of file
+}
diff --git a/data/anes_timeseries_2020_stata_20220210.dta b/data/anes_timeseries_2020_stata_20220210.dta
deleted file mode 100644
index a18596f1..00000000
--- a/data/anes_timeseries_2020_stata_20220210.dta
+++ /dev/null
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:108ebb3a3f0f07a1c9bd5100ea19dd6fad3c6e1b2e3b14854930a5c4f5701786
-size 123173124
diff --git a/osf_dl/README.md b/osf_dl/README.md
deleted file mode 100644
index 95750591..00000000
--- a/osf_dl/README.md
+++ /dev/null
@@ -1 +0,0 @@
-Temp folder for downloading osf data while compiling book
\ No newline at end of file
diff --git a/renv.lock b/renv.lock
index 13f1a37c..e42b3cdd 100644
--- a/renv.lock
+++ b/renv.lock
@@ -392,13 +392,6 @@
],
"Hash": "3f038e5ac7f41d4ac41ce658c85e3042"
},
- "clisymbols": {
- "Package": "clisymbols",
- "Version": "1.2.0",
- "Source": "Repository",
- "Repository": "CRAN",
- "Hash": "96c01552bfd5661b9bbdefbc762f4bcd"
- },
"colorspace": {
"Package": "colorspace",
"Version": "2.1-0",
@@ -945,16 +938,6 @@
],
"Hash": "8b331e659e67d757db0fcc28e689c501"
},
- "here": {
- "Package": "here",
- "Version": "1.0.1",
- "Source": "Repository",
- "Repository": "CRAN",
- "Requirements": [
- "rprojroot"
- ],
- "Hash": "24b224366f9c2e7534d2344d10d59211"
- },
"hexbin": {
"Package": "hexbin",
"Version": "1.28.3",
@@ -1756,16 +1739,6 @@
],
"Hash": "0d34b89b43e900467e60f5449226f3e3"
},
- "rprojroot": {
- "Package": "rprojroot",
- "Version": "2.0.3",
- "Source": "Repository",
- "Repository": "CRAN",
- "Requirements": [
- "R"
- ],
- "Hash": "1de7ab598047a87bba48434ba35d497d"
- },
"rstudioapi": {
"Package": "rstudioapi",
"Version": "0.14",
@@ -2081,19 +2054,6 @@
],
"Hash": "a84e2cc86d07289b3b6f5069df7a004c"
},
- "tidylog": {
- "Package": "tidylog",
- "Version": "1.0.2",
- "Source": "Repository",
- "Repository": "CRAN",
- "Requirements": [
- "clisymbols",
- "dplyr",
- "glue",
- "tidyr"
- ],
- "Hash": "a55d41e241dbe858d1456d952ce3301f"
- },
"tidyr": {
"Package": "tidyr",
"Version": "1.3.0",